Document source

The source section of the project descriptor defines the source of documents to be imported into the Lingo4G index.

{
  "classpath": null,
  "feed": null
}

classpath

Type
array of object or object or string
Default
undefined
Required
no

An optional class path to allow loading custom document source implementations. Path elements can be defined in any of the following ways:

direct path

A single file system path, absolute or relative to the project descriptor, for example:

"classpath": "lib/my-source.jar"
file matcher

A recursive file matcher, pointing at the root directory and with an optional glob expression to select a subset of files. For example:

"classpath": {
  "dir": "lib",
  "match": "*.jar"
}
array of other paths

An array (union) of any other path elements. For example:

"classpath": [
  "lib/my-source.jar",
  {
    "dir": "lib/dependencies",
    "match": "*.jar"
  }
]

feed

The feed element declares the source of documents that should be imported into and indexed by Lingo4G.

This object must conform to one of the built-in types or reference a custom document source implementation class.

json-records

The built-in document source for importing documents from JSON files. Supports incremental indexing.

document-source-class

A custom document source implementation, loaded from the provided class path and implementing the IDocumentSource interface.

document-source-class

A custom document source implementation, loaded from the provided class path and implementing the IDocumentSource interface. Replace document-source-class with the fully qualified name of the class to use.

{
  "type": "document-source-class"
}

Several example data sets provided with Lingo4G contain document source implementations. The Lingo4G distribution also contains the full source code of the built-in document source types.

Custom or built-in document source?

Writing a custom document source makes sense when you want to avoid the overhead of converting a large data set to JSON, or when the data exists in an external store and can be queried efficiently. In most cases, however, writing and maintaining a custom document source (Java code) is not worth the effort: we strongly suggest importing your data using one of the built-in document sources.

json-records

This built-in document source implementation can read documents from JSON files or gzip-compressed JSON files. With the help of the input option, the source documents can be automatically downloaded, decompressed and then read.

{
  "type": "json-records",
  "failOnBrokenJson": true,
  "fieldMapping": {},
  "input": {},
  "inputs": {}
}

Both plain JSON files and JSON record files are supported. A JSON record file is a sequence of independent JSON objects lined up contiguously in one physical file. This format is used by, for example, Apache Drill and elasticsearch-dump.

Documents inside the input files are expected to be structured in one of the following ways (field names are arbitrary in all examples below):

  • multiple documents in a single file (as an array)

    If the input file contains an array, that array is expected to contain individual documents:

    [
      { "id": "document 1", "field": "foo" },
      { "id": "document 2", "field": "bar" },
      { "id": "document 3", "field": "baz" }
    ]
  • one document in each file (as an object)

    If the input file contains an object, that object is expected to contain the document's fields.

    { "id": "document 1", "field": "foo" }
  • multiple documents in each file (json records objects)

    If the input file contains multiple, whitespace-separated objects, each object is expected to contain one document's fields. Note that a JSON records file as a whole is not valid JSON: it is a concatenation of valid JSON objects:

    { "id": "document 1", "field": "foo" }
    { "id": "document 2", "field": "bar" }
    { "id": "document 3", "field": "baz" }
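Outside Lingo4G, such a file can be consumed with any streaming JSON parser. A minimal Python sketch (illustrative, not Lingo4G code) using json.JSONDecoder.raw_decode:

```python
import json

def read_json_records(text):
    """Yield consecutive JSON objects from a concatenated JSON records string."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip the whitespace separating consecutive objects.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

records = '{ "id": "document 1" }\n{ "id": "document 2" }\n'
print([doc["id"] for doc in read_json_records(records)])
# → ['document 1', 'document 2']
```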

Each document consists of one or more field name-value pairs. Field names must be unique (JSON objects must contain unique keys). Field values can be:

  • strings or numbers

    A simple type (string or number). For example:

    {
      "id": "document 1",
      "string-field": "foo",
      "numeric-field": 4,
      "another-field": 4.5
    }
  • arrays

    An array of simple types means indexing multiple values in the same field (multi-valued fields). For example:

    {
      "id": "document 1",
      "field": ["foo", "bar"]
    }

By default, all of the top-level object's field-value pairs are imported. See the field mapping section for information on how to restrict the set of imported fields or how to select values from JSON files with a more complex structure.

failOnBrokenJson

Type
boolean
Default
true
Required
no

If true, document indexing terminates when any of the input JSON files is malformed. If false, a console warning is emitted and indexing continues.

fieldMapping

The optional field mapping element provides a way to extract field name-value pairs from JSON files that don't conform to the expected "flat" structure described above. The field mapping contains a set of field names and JSON paths that "point" at the value (or values) that should be imported into that field.

For example, given the following input JSON:

{
  "info" : {
    "id": "67"
  },
  "type" : "question",
  "date": "2022",
  "tags" : [
    "windows",
    "pdf"
  ]
}

the field mapping that imports id, type and a multi-valued tag field should look like this:

"fieldMapping": {
  "id": "$.info.id",
  "type": "$.type",
  "tag": "$.tags.*"
}
Important

If the field mapping block is declared and non-empty, only the provided JSON paths are imported; any other JSON elements are ignored. You may have to provide explicit mappings for top-level elements as well (see the type element in the example above).
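Lingo4G resolves these paths internally. To illustrate the semantics only, here is a hypothetical Python sketch that handles just the two path shapes used above ($.a.b and a trailing .*), not full JSONPath:

```python
def apply_mapping(document, field_mapping):
    """Resolve simplified JSON paths ($.a.b, optional trailing .*) in a dict."""
    result = {}
    for field, path in field_mapping.items():
        steps = path.lstrip("$.").split(".")
        value = document
        for step in steps:
            if step == "*":  # a trailing .* expands an array into a multi-valued field
                break
            value = value[step]
        result[field] = value
    return result

doc = {"info": {"id": "67"}, "type": "question", "tags": ["windows", "pdf"]}
mapping = {"id": "$.info.id", "type": "$.type", "tag": "$.tags.*"}
print(apply_mapping(doc, mapping))
# → {'id': '67', 'type': 'question', 'tag': ['windows', 'pdf']}
```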

input

The input element provides the location of JSON files to be processed, with an added ability to read files from ZIP archives or download input files from the internet (a feature frequently used in the example data sets provided with the Lingo4G distribution). The full set of properties of this object is shown below.

{
  "autodownload": true,
  "dir": null,
  "httpRedirectsLimit": 8,
  "match": [],
  "matchInsideZip": [],
  "onMissing": null,
  "scanZips": true,
  "supportResume": true,
  "unpack": true,
  "zipScanning": "parallel"
}

The following example shows a more typical declaration, loading all JSON files (gzip-compressed or in plain text) from any directory under the data folder:

{
  "dir": "data",
  "match": "**/*.{json,json.gz}"
}

This example reads JSON files directly from ZIP files under the data folder:

{
  "dir": "data",
  "match": "**/*.zip",
  "scanZips": true,
  "matchInsideZip": "**/*.json"
}
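Conceptually, this is equivalent to iterating over the archive entries that match a glob and parsing each one. A rough Python sketch (illustrative only; Lingo4G uses Java's PathMatcher globs, whereas this sketch uses Python's simpler fnmatch):

```python
import fnmatch
import io
import json
import zipfile

def json_docs_in_zip(zip_file, match="*.json"):
    """Yield JSON documents parsed from matching entries of a ZIP archive."""
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if fnmatch.fnmatch(name, match):
                with archive.open(name) as entry:
                    yield json.load(entry)

# A tiny in-memory archive for demonstration (file names are hypothetical).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("doc1.json", '{"id": "document 1"}')
    z.writestr("readme.txt", "not JSON")
print(list(json_docs_in_zip(buf)))
# → [{'id': 'document 1'}]
```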
autodownload
Type
boolean
Default
true
Required
no

If true, and the set of matching input files is empty, an automatic download of the input files listed in the onMissing property starts.

dir
Type
string
Default
null
Required
no

The root directory to scan for input files. By default, all files in this directory and subdirectories are accepted. You can alter the set of input files using the glob pattern specified in the match property.

httpRedirectsLimit
Type
integer
Default
8
Constraints
value >= 0
Required
no

The maximum number of HTTP redirects to follow for the URLs specified in the onMissing property when autodownload is true.

match
Type
string or array of string
Default
[]
Required
no

A glob pattern restricting input files to a subset of files scanned recursively under the dir property. The match pattern must follow Java's PathMatcher syntax.

Typical examples:

  • *.json: all files with the json extension under dir.

  • **/*.json: all files with the json extension under dir or any of its subdirectories.

  • **/*.{json,json.gz}: all files with the json or json.gz extension under dir or any of its subdirectories.

matchInsideZip
Type
string or array of string
Default
[]
Required
no

A glob pattern restricting the set of input files read from inside ZIP archives when ZIP scanning is enabled.

The match pattern must follow Java's PathMatcher syntax (each ZIP file acts as a separate file system).

onMissing
Type
array of array of string
Default
null
Required
no

Provides a list of URLs to be downloaded if autodownload is enabled and the set of matching files is empty.

This property should contain an array of arrays of URLs. The inner arrays are processed sequentially: if the URLs in one array fail to download, the URLs from the next array are used. This makes it possible to provide a list of backup mirror locations. Here is an example declaration of the onMissing property from the clinical trials example, providing three alternative download locations.

"onMissing": [
  ["https://data.carrotsearch.com/clinicaltrials/AACT201509_pipe_delimited_txt.7z"],
  ["http://data.carrotsearch.com/clinicaltrials/AACT201509_pipe_delimited_txt.7z"],
  ["https://library.dcri.duke.edu/dtmi/ctti/2015_Sept_Annual/AACT201509_pipe_delimited_txt.zip"]
]
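The mirror fallback can be pictured as a loop over URL groups with a stubbed download function. A schematic Python sketch (all names are hypothetical; the real downloader also handles redirects and resume):

```python
def download_with_mirrors(url_groups, download):
    """Try each group of URLs in order; fall back to the next group on failure."""
    for group in url_groups:
        try:
            # All URLs of a group must download successfully.
            return [download(url) for url in group]
        except IOError as e:
            print(f"Group failed ({e}), trying next mirror group...")
    raise IOError("all mirror groups failed")

# Stubbed downloader: pretend the primary mirror is down.
def fake_download(url):
    if url.startswith("https://primary"):
        raise IOError("connection refused")
    return f"contents of {url}"

files = download_with_mirrors(
    [["https://primary/archive.7z"], ["https://mirror/archive.7z"]],
    fake_download)
print(files)
# → ['contents of https://mirror/archive.7z']
```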
scanZips
Type
boolean
Default
true
Required
no

If true, any ZIP files accepted as input files will be scanned recursively for input files. See matchInsideZip for a glob pattern restricting the scan to a subset of files within each ZIP file.

supportResume
Type
boolean
Default
true
Required
no

If true and autodownload is enabled, any partially downloaded files will be resumed (as opposed to downloaded from scratch).

unpack
Type
boolean
Default
true
Required
no

If true and autodownload is enabled, any files compressed with the following file formats will be automatically uncompressed to dir after they are downloaded:

  • ZIP files,
  • 7z files.
zipScanning
Type
string
Default
"parallel"
Constraints
one of [parallel, sequential]
Required
no

An expert configuration option that enables parallel scanning and extraction of content from ZIP files. ZIP decompression is quite slow, so in many cases parallel scanning may shorten indexing time.

parallel

Scan and extract files from ZIP files in parallel.

sequential

Scan and extract files from ZIP files sequentially.
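The difference between the two modes can be illustrated with a small Python sketch that scans several in-memory archives either sequentially or with a thread pool (purely illustrative; Lingo4G's implementation is Java):

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor

def scan_zip(zip_file):
    """Return the list of entry names in one ZIP archive."""
    with zipfile.ZipFile(zip_file) as archive:
        return archive.namelist()

def scan_all(zip_files, mode="parallel"):
    if mode == "sequential":
        return [scan_zip(z) for z in zip_files]
    # "parallel": scan archives concurrently; zlib releases the GIL
    # during decompression, so threads can overlap real work.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(scan_zip, zip_files))

# Two tiny in-memory archives for demonstration.
archives = []
for name in ("a.json", "b.json"):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr(name, "{}")
    archives.append(buf)
print(scan_all(archives))
# → [['a.json'], ['b.json']]
```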

inputs

The inputs property is an alternative to the (recommended) input specification. Its only advantage is that several locations can be provided (not just files under one directory).

The syntax of this property is identical to that described in the classpath property, for example:

"inputs": [
  {
    "dir": "data-updates",
    "match": "*.{json,json.gz}"
  },
  {
    "dir": "data",
    "match": "*.{json,json.gz}"
  }
]