Fields

The fields section of the project descriptor defines how Lingo4G processes and stores values of fields returned by the document source.

The fields object consists of keys that denote field names and values containing the definition of how that field will be processed and indexed.

The set of fields you declare in the project descriptor may be a subset of the fields returned by the document source. During indexing, Lingo4G will process only the fields you declare in the fields section, ignoring any other fields returned by the document source.

Field values (definitions) must be objects of the following types:

text

The default value type denoting free text. Text fields can have associated search and feature analyzers.

date

A date-and-time type.

integer

An integer type (integer values between -2147483648 and 2147483647).

long

An integer type (integer values between -2^63 and 2^63-1).

double

A 64-bit floating point numeric value.

float

A 32-bit floating point numeric value.

float-vector

A vector of 32-bit float values, used to store vector embeddings computed from external sources.

A typical fields section may look like the following:

"fields":  {
  // Document identifier field (for updates).
  "id":      { "id": true, "type": "text", "analyzer": "literal" },

  // Simple values, will be lower-cased for query matching
  "author":  { "analyzer": "keyword" },
  "type":    { "analyzer": "keyword" },

  // Plain text in English.
  "title":   { "analyzer": "english" },
  "summary": { "analyzer": "english" },

  // Date, converted from incomplete information to a full ISO timestamp.
  "created": { "type": "date", "inputFormat": "yyyy-MM-dd HH:mm",
                               "indexFormat": "yyyy[-MM][-dd]['T'][HH][:mm][:ss][X]" },

  // A numeric score as an integer.
  "score":   { "type": "integer" }
}

Each type of field is described in more detail in the remaining sections of this chapter.

date

A date-and-time type. Lingo4G stores timestamps as integers denoting the number of milliseconds since the epoch (1970-01-01T00:00:00 UTC). The minimum representable date (in the UTC time zone) is -292275055-05-16T16:47:04.192Z and the maximum representable date is 292278994-08-17T07:12:55.807Z. Time is expressed with a granularity of 1 millisecond (nanoseconds are not supported).

{
  "type": "date",
  "indexFormat": "yyyy-MM-dd'T'HH:mm:ss[.SSS][X]",
  "inputFormat": "yyyy-MM-dd'T'HH:mm:ss[.SSS][X]",
  "queryAnalyzer": null
}

Input date strings must parse correctly according to the pattern provided in inputFormat. The parsed value is converted to a Java Instant, which is then converted to epoch milliseconds and stored in the index.

Each date field has an associated analyzer, which converts input strings into milliseconds since the epoch. The indexFormat property holds the date and time format used for this conversion. The date query parser may also support Apache Solr-style date-math expressions.

indexFormat

Type
string
Default
"yyyy-MM-dd'T'HH:mm:ss[.SSS][X]"
Required
no

The format of the timestamp stored in the index, as per the specification in DateTimeFormatter.

Date fields validate input strings. If prefix queries are allowed, the index format should declare the trailing date components as optional. For example, the index format yyyy-MM-dd will fail for the prefix query 2021-02, whereas an index format that allows prefixes, yyyy[-MM][-dd], can parse partial queries such as 2021 or 2021-02.
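
As a minimal sketch, a date field that permits prefix queries could be declared as shown below. The field name published is hypothetical; the indexFormat pattern is the prefix-friendly one quoted above, and inputFormat is omitted so that its default applies.

```json
"fields": {
  // Hypothetical field name. Month and day are optional in the index format,
  // so partial queries such as 2021 or 2021-02 can be parsed.
  "published": { "type": "date", "indexFormat": "yyyy[-MM][-dd]" }
}
```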

inputFormat

Type
string
Default
"yyyy-MM-dd'T'HH:mm:ss[.SSS][X]"
Required
no

The date format with which the input field's string value is parsed into a timestamp to be stored in Lingo4G, as per the specification in DateTimeFormatter.

queryAnalyzer

Type
string
Default
null
Required
no

An analyzer chain used for converting a query string into the index storage format.

This property should contain a reference to a date analyzer type. The default date analyzer supports date math expressions and validation.

double

A 64-bit floating point numeric value.

{
  "type": "double"
}

float

A 32-bit floating point numeric value.

{
  "type": "float"
}

float-vector

A vector of 32-bit floating point values.

{
  "type": "float-vector",
  "length": null,
  "normalize": true
}

Fields of this type can store vector embeddings computed from external sources (for example, sentence embeddings obtained using large language models). The document source should provide a text value containing floating point numbers separated by commas. Lingo4G splits this value into individual numbers and stores them internally as a vector.

length

Type
integer
Default
undefined
Required
yes

The maximum number of components in each document's vector.

normalize

Type
boolean
Default
true
Required
no

If true, Lingo4G normalizes all vectors in this field to have a Euclidean length of 1.0.
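
For illustration, a float-vector field for four-component embeddings might be declared as follows. The field name embedding and the length of 4 are assumptions chosen to keep the example short; real embeddings typically have many more components.

```json
"fields": {
  // Hypothetical field: the document source would supply a text value
  // such as "3,4,0,0" for each document.
  "embedding": { "type": "float-vector", "length": 4, "normalize": true }
}
```

Under the usual definition of normalization, the input "3,4,0,0" (Euclidean length 5) would end up stored as 0.6, 0.8, 0, 0, which has a Euclidean length of 1.0.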

integer

An integer type (integer values between -2147483648 and 2147483647).

{
  "type": "integer"
}

long

An integer type (integer values between -2^63 and 2^63-1).

{
  "type": "long"
}

text

A text field. Text fields are the most common type of fields in Lingo4G.

{
  "type": "text",
  "analyzer": null,
  "featureAnalyzer": null,
  "id": false,
  "indexPositions": false,
  "queryAnalyzer": null
}

The value of a text field can be analyzed, that is, processed and split into a stream of tokens. These tokens can be used to:

  • build an inverted index that allows fast searching for documents containing a given token or an expression involving multiple tokens,

  • detect frequent, non-trivial sequences of tokens that can potentially convey some meaning. These sequences become document features.

analyzer

Type
string
Default
null
Required
no

An analyzer chain used for indexing and querying, unless a different analyzer is provided in queryAnalyzer. Defines how input text is split (tokenized) into separate tokens eventually stored in the inverted index used for searches.

This property should contain one of the analyzer types. Fields without an analyzer cannot be used in search queries (but their whole field values are still available for retrieval).

featureAnalyzer

Type
string
Default
null
Required
no

An analyzer chain used for automatic feature discovery. Defines how input values are tokenized into separate tokens. Automatic feature discovery then looks for longer, frequent sequences of tokens that may eventually become document labels.

This property should contain one of the analyzer types. Fields without a feature analyzer cannot be used in many types of analysis requests.

Feature and search analyzers

A feature analyzer is separate from the search analyzer because their configurations will often be different. For example, a feature analyzer for a field may include a much larger stop word dictionary to ignore tokens and phrases that are irrelevant for analytical needs (but may occasionally be useful for document retrieval in search queries).
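
The sketch below shows where each analyzer is declared. It reuses the analyzer names from the example earlier in this chapter, and the choice of fields is illustrative only. Here both properties of title point to the same analyzer; in practice, featureAnalyzer would often reference a differently configured one.

```json
"fields": {
  // Searchable and used for automatic feature discovery.
  "title":  { "analyzer": "english", "featureAnalyzer": "english" },

  // Searchable only: without a feature analyzer, this field
  // contributes no features.
  "author": { "analyzer": "keyword" }
}
```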

id

Type
boolean
Default
false
Required
no

If true, the field is designated as containing a non-empty, unique document identifier. Only one field can be marked as an identifier. Document identifiers are required for incremental indexing and for document deletions.

indexPositions

Type
boolean
Default
false
Required
no

If true, inverted search indexes will store token positions in the stream of tokens. Positions are required for certain types of proximity queries (and for proper search scope highlighting).
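
A field intended for proximity queries might therefore be declared along these lines. This is a sketch: the field name is hypothetical and the analyzer name is borrowed from the example earlier in this chapter.

```json
"fields": {
  // Hypothetical field: token positions are stored in the inverted index
  // so that proximity queries and search scope highlighting can use them.
  "summary": { "analyzer": "english", "indexPositions": true }
}
```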

queryAnalyzer

Type
string
Default
null
Required
no

An analyzer chain used for querying the field only. Defaults to the same value as the analyzer. This property can be used if different analyzer pipelines are used for the inverted index and for querying.

This property should contain one of the analyzer types. Fields without an analyzer cannot be used in search queries (but their whole field values are still available for retrieval).