Indexing CSV records

Lingo4G can index text records read from CSV files, which is a convenient way to experiment with tabular data that is often available as spreadsheets.

CSV files contain simple, plain-text, comma-separated rows of tabular data. Any spreadsheet software (Microsoft Excel, LibreOffice) can export a CSV file (see the Limitations section for caveats).

To index one or more CSV files, perform the following steps:

  1. Copy your CSV files to an empty directory or to the datasets/dataset-csv/data directory (an example present in Lingo4G's distribution), removing any example files present there.

    Lingo4G will read all files matching the pattern *.csv* present under that folder. Files may be compressed (.csv.gz or .csv.zst extensions).

    CSV files can (unfortunately!) come in different formats. Lingo4G reads plain-text, UTF-8 encoded files with values separated by commas. The first record must contain column names. Here is an example CSV file with three columns and two records ("documents"):

    id,title,abstract
    1,"Title of the first document","Abstract of the first document"
    2,"Title of the second document","Abstract of the second document"

  2. Edit the example's project descriptor (csv.project.json), adding column names and types present in your CSV files to the fields section. Not all columns present in CSV files need to be imported. For our example CSV file above, this could read:

    "fields": {
      "id":       { "id": true, "analyzer": "literal" },
      "title":    { "analyzer": "english" },
      "abstract": { "analyzer": "english" }
    },

    Then review and modify the sections of the project descriptor that reference fields, according to what your CSV fields contain and what you need. Specifically, check the features field, which configures feature extraction strategies, the default query parser fields, and the default analysis fields for API V1 (for the Explorer).

  3. In the command console, run the following command:

    l4g index -p datasets/dataset-csv -Dinput.dir=[absolute folder path]

    replacing [absolute folder path] with the path to the folder containing your documents.

    If Lingo4G cannot parse a file, it will print a warning to the console.

  4. Run the following command to start the Lingo4G server:

    l4g server -p datasets/dataset-csv

  5. Point your browser at localhost:8080 and choose a demo application to explore your data.
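
Before running the index command, it can help to confirm that each CSV file actually contains the columns named in the fields section of the project descriptor. The following Python sketch is not part of Lingo4G; the function names read_header and missing_columns are hypothetical helpers, and the sketch assumes plain .csv or gzip-compressed .csv.gz input only (it does not handle .csv.zst).

```python
import csv
import gzip


def read_header(path):
    """Return the column names from the first record of a CSV file."""
    # Assumption: only plain .csv and gzip-compressed .csv.gz files are handled.
    opener = gzip.open if path.endswith(".gz") else open
    # Strict UTF-8 decoding raises UnicodeDecodeError for other encodings.
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # An empty file yields an empty header.
        return next(csv.reader(handle), [])


def missing_columns(path, field_names):
    """Return the field names that do not appear in the file's header."""
    header = set(read_header(path))
    return [name for name in field_names if name not in header]
```

For the example file shown in step 1, missing_columns(path, ["id", "title", "abstract"]) would return an empty list. A non-empty result points at field names in the descriptor that have no matching column in that file.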

Limitations

CSV formats

CSV files can come in different formats and flavors. Make sure they use UTF-8 encoding and are comma-separated.
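
If a spreadsheet export uses a different encoding or separator, one option is to rewrite it before indexing. The Python sketch below is an illustration only, not a Lingo4G tool: the function name convert_to_utf8_csv is hypothetical, and the defaults (Windows-1252 encoding, semicolon separator) are assumptions about the source file that you would need to adjust to match your data.

```python
import csv


def convert_to_utf8_csv(src, dst, src_encoding="cp1252", src_delimiter=";"):
    """Rewrite a delimited text file as a UTF-8, comma-separated CSV file.

    Returns the number of records written, including the header record.
    """
    count = 0
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin, delimiter=src_delimiter)
        # The default csv.writer dialect separates values with commas and
        # quotes only those values that need it (for example, embedded commas).
        writer = csv.writer(fout)
        for row in reader:
            writer.writerow(row)
            count += 1
    return count
```

Values that contain commas are quoted in the output, so they remain single columns after conversion.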