Document source
The source section of the project descriptor defines the source of documents to be imported into the Lingo4G index.
{
"classpath": null,
"feed": null
}
classpath
An optional class path used to load custom document source implementations. Path elements can be defined in any of the following ways:
- direct path: a single file system path, absolute or relative to the project descriptor, for example:
"classpath": "lib/my-source.jar"
- file matcher: a recursive file matcher, pointing at a root directory, with an optional glob expression to select a subset of files. For example:
"classpath": { "dir": "lib", "match": "*.jar" }
- array of other paths: an array (union) of any other path elements. For example:
"classpath": [ "lib/my-source.jar", { "dir": "lib/dependencies", "match": "*.jar" } ]
feed
The feed element declares the source of documents to be imported into and indexed by Lingo4G. This object must conform to one of the built-in types or reference a custom document source implementation class:
- json-records: the built-in document source for importing documents from JSON files. Supports incremental indexing.
- document-source-class: a custom document source implementation, loaded from the provided class path and implementing the IDocumentSource interface.
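For example, a minimal feed declaration using the built-in json-records type might look like this (the data directory and glob pattern are illustrative placeholders):
{
  "feed": {
    "type": "json-records",
    "input": {
      "dir": "data",
      "match": "**/*.{json,json.gz}"
    }
  }
}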
document-source-class
A custom document source implementation, loaded from the provided class path and implementing the IDocumentSource interface. The document-source-class placeholder should be replaced with the fully qualified name of the class to be used.
{
"type": "document-source-class"
}
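For example, assuming a hypothetical implementation class com.example.MyDocumentSource packaged in lib/my-source.jar:
{
  "classpath": "lib/my-source.jar",
  "feed": {
    "type": "com.example.MyDocumentSource"
  }
}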
Several example data sets provided with Lingo4G contain document source implementations, and the Lingo4G distribution includes the full source code of the built-in types. Writing a custom document source makes sense when you want to avoid the overhead of converting a large data set to JSON, or when the data lives in an external store that can be queried efficiently. Even so, writing and maintaining a custom document source (Java code) may not be worth the effort: we strongly suggest importing your data using one of the built-in document sources.
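For illustration only, the sketch below shows the general shape of such an implementation. The stand-in interface is hypothetical, not the actual Lingo4G API; consult the IDocumentSource interface and the example sources shipped with the distribution for the real method signatures.
// NOTE: SimplifiedDocumentSource is a stand-in, NOT the real Lingo4G
// IDocumentSource interface; see the distribution sources for the actual API.
import java.util.Iterator;
import java.util.List;
import java.util.Map;

interface SimplifiedDocumentSource {
  // Each document is a map of field name to one or more field values.
  Iterator<Map<String, List<String>>> documents();
}

public class MyDocumentSource implements SimplifiedDocumentSource {
  @Override
  public Iterator<Map<String, List<String>>> documents() {
    // Emit two hard-coded documents; a real implementation would stream
    // them from a database or another external store.
    return List.of(
        Map.of("id", List.of("document 1"), "field", List.of("foo")),
        Map.of("id", List.of("document 2"), "field", List.of("bar", "baz"))
    ).iterator();
  }
}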
json-records
This built-in document source implementation can read documents from JSON files or gzip-compressed JSON files. With the help of the input option, the source documents can be automatically downloaded, decompressed and then read.
{
"type": "json-records",
"failOnBrokenJson": true,
"fieldMapping": {},
"input": {},
"inputs": {}
}
Both plain JSON files and JSON record files are supported. A JSON record file is a sequence of independent JSON objects lined up contiguously in one physical file. This format is used by, for example, Apache Drill and elasticsearch-dump.
Documents inside the input files are expected to be structured in one of the following ways (field names are arbitrary in all examples below):
- multiple documents in a single file (as an array)
If the input file contains an array, that array is expected to contain individual documents:
[
  { "id": "document 1", "field": "foo" },
  { "id": "document 2", "field": "bar" },
  { "id": "document 3", "field": "baz" }
]
- one document in each file (as an object)
If the input file contains an object, that object is expected to contain the document's fields:
{ "id": "document 1", "field": "foo" }
- multiple documents in each file (JSON records)
If the input file contains multiple whitespace-separated objects, each object is expected to contain one document's fields. Note that an entire JSON records file is not itself valid JSON: it is a concatenation of valid JSON objects:
{ "id": "document 1", "field": "foo" }
{ "id": "document 2", "field": "bar" }
{ "id": "document 3", "field": "baz" }
Each document consists of one or more field name-value pairs. Field names must be unique (JSON objects must contain unique keys). Field values can be:
- strings or numbers
A simple type (string or number). For example:
{ "id": "document 1", "string-field": "foo", "numeric-field": 4, "another-field": 4.5 }
- arrays
An array of simple types indexes multiple values into the same field (a multi-valued field). For example:
{ "id": "document 1", "field": ["foo", "bar"] }
By default, all of the top-level object's field-value pairs are imported. See the field mapping section for additional information on how to restrict the set of imported fields or how to select values from JSON files with a more complex structure.
failOnBrokenJson
If true, terminates document indexing if any of the input JSON files are malformed. If set to false, a console warning is emitted but indexing continues.
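For example, to emit warnings and keep indexing when malformed files are encountered:
{
  "type": "json-records",
  "failOnBrokenJson": false
}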
fieldMapping
The optional field mapping element provides a way to extract field name-value pairs from JSON files that don't conform to the expected "flat" structure described above. The field mapping contains a set of field names and JSON paths that "point" at the value (or values) that should be imported into that field.
For example, given the following input JSON:
{
"info" : {
"id": "67"
},
"type" : "question",
"date": "2022",
"tags" : [
"windows",
"pdf"
]
}
the field mapping that imports id, type, and a multi-valued tag field should look like this:
"fieldMapping": {
"id": "$.info.id",
"type": "$.type",
"tag": "$.tags.*"
}
If the field mapping block is declared and non-empty, only the provided JSON paths are imported; any other JSON elements are ignored. You may have to provide explicit mappings for top-level elements as well (see the type element in the example above).
input
The input element provides the location of JSON files to be processed, with an added ability to read files from ZIP archives or download input files from the internet (a feature frequently used in the example data sets provided with the Lingo4G distribution). The full set of properties of this object is shown below.
{
"autodownload": true,
"dir": null,
"httpRedirectsLimit": 8,
"match": [],
"matchInsideZip": [],
"onMissing": null,
"scanZips": true,
"supportResume": true,
"unpack": true,
"zipScanning": "parallel"
}
The following example shows a more typical declaration, loading all JSON files (gzip-compressed or plain text) from any directory under the data folder:
{
"dir": "data",
"match": "**/*.{json,json.gz}"
}
This example reads JSON files directly from ZIP files under the data folder:
{
"dir": "data",
"match": "**/*.zip",
"scanZips": true,
"matchInsideZip": "**/*.json"
}
autodownload
If true and the set of matching input files is empty, starts an automatic download of the input files listed in the onMissing property.
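For example, the following sketch (with placeholder URLs) downloads the archives when no matching files are found under data; with the default unpack setting, they are then extracted into that directory:
{
  "dir": "data",
  "match": "**/*.{json,json.gz}",
  "autodownload": true,
  "onMissing": [
    [ "https://example.com/mirror1/data.zip" ],
    [ "https://example.com/mirror2/data.zip" ]
  ]
}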
dir
The root directory to scan for input files. By default, all files in this directory and its subdirectories are accepted. You can alter the set of input files using the glob pattern specified in the match property.
httpRedirectsLimit
The maximum number of HTTP redirects to follow for URLs specified in the onMissing property when autodownload is true.
match
A glob pattern restricting input files to a subset of the files scanned recursively under the dir property. The match pattern must follow Java's PathMatcher syntax. Typical examples used throughout this section include **/*.{json,json.gz} (plain or gzip-compressed JSON files) and **/*.zip (ZIP archives).
matchInsideZip
A glob pattern restricting input files inside any ZIP files, if ZIP scanning is enabled. The pattern must follow Java's PathMatcher syntax (a ZIP file acts as a separate file system).
onMissing
Provides a list of URLs to download if autodownload is enabled and the set of matching files is empty.
This property should contain an array of arrays of URLs. Each inner array is processed sequentially; if the URLs inside one array fail to download, URLs from the subsequent array are used. This makes it possible to provide a list of backup mirror locations. Here is an example declaration of the onMissing property from the clinical trials example, providing three alternative download locations:
"onMissing": [
["https://data.carrotsearch.com/clinicaltrials/AACT201509_pipe_delimited_txt.7z"],
["http://data.carrotsearch.com/clinicaltrials/AACT201509_pipe_delimited_txt.7z"],
["https://library.dcri.duke.edu/dtmi/ctti/2015_Sept_Annual/AACT201509_pipe_delimited_txt.zip"]
]
scanZips
If true, any ZIP files accepted as input files are scanned recursively for input files. See matchInsideZip for a glob pattern restricting scanning to a subset of files within each ZIP file.
supportResume
If true and autodownload is enabled, any partially downloaded files are resumed rather than downloaded from scratch.
unpack
If true and autodownload is enabled, downloaded files in the following formats are automatically uncompressed to dir:
- ZIP files,
- 7z files.
zipScanning
An expert configuration option controlling whether content is scanned and extracted from ZIP files in parallel. ZIP decompression is quite slow, so in many cases parallel scanning may shorten indexing time.
- parallel: scan and extract files from ZIP files in parallel.
- sequential: scan and extract files from ZIP files sequentially.
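For example, to force sequential ZIP processing in a declaration like the earlier ZIP-scanning example:
{
  "dir": "data",
  "match": "**/*.zip",
  "matchInsideZip": "**/*.json",
  "zipScanning": "sequential"
}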
inputs
The inputs property is an alternative to the (recommended) input specification. Its only advantage is that several locations can be provided (not just files under one directory). The syntax of this property is identical to that described in the classpath property, for example:
"inputs": [
{
"dir": "data-updates",
"match": "*.{json,json.gz}"
},
{
"dir": "data",
"match": "*.{json,json.gz}"
}
]