Fields

The fields section of the project descriptor defines how Lingo4G processes and stores values of fields returned by the document source.

The fields object consists of keys that denote field names and values containing the definition of how that field will be processed and indexed.

The set of fields you declare in the project descriptor may be a subset of the fields returned by the document source. During indexing, Lingo4G will process only the fields you declare in the fields section, ignoring any other fields returned by the document source.

Field values (definitions) must be objects of the following types:

text

The default value type denoting free text. Text fields can have associated search and feature analyzers.

date

A date-and-time type.

integer

An integer type (integer values between -2147483648 and 2147483647).

long

A 64-bit integer type (integer values between -2^63 and 2^63-1).

double

A 64-bit floating point numeric value.

float

A 32-bit floating point numeric value.

A typical fields section may look like the following:

"fields":  {
  // Document identifier field (for updates).
  "id":      { "id": true, "type": "text", "analyzer": "literal" },

  // Simple values, will be lower-cased for query matching
  "author":  { "analyzer": "keyword" },
  "type":    { "analyzer": "keyword" },

  // Plain text in English.
  "title":   { "analyzer": "english" },
  "summary": { "analyzer": "english" },

  // Date, converted from incomplete information to a full ISO timestamp.
  "created": { "type": "date", "inputFormat": "yyyy-MM-dd HH:mm",
                               "indexFormat": "yyyy-MM-dd'T'HH:mm:ss[X]" },

  // A numeric score as an integer.
  "score":   { "type": "integer" }
}

Each type of field is described in more detail in the remaining sections of this chapter.

date

A date-and-time type. Lingo4G stores timestamps as strings. The input date string must parse correctly according to the pattern provided in inputFormat. The timestamp is stored in the index using the date pattern specified in indexFormat.

{
  "type": "date",
  "indexFormat": "yyyy-MM-dd'T'HH:mm:ss[.SSS][X]",
  "inputFormat": "yyyy-MM-dd'T'HH:mm:ss[.SSS][X]"
}

indexFormat

Type
string
Default
"yyyy-MM-dd'T'HH:mm:ss[.SSS][X]"
Required
no

The format of the timestamp stored in the index, as per the specification in DateTimeFormatter.

inputFormat

Type
string
Default
"yyyy-MM-dd'T'HH:mm:ss[.SSS][X]"
Required
no

The date format used to parse the input field's string value into a timestamp to be stored in Lingo4G, as per the specification in DateTimeFormatter.
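
As a hedged illustration of how the two patterns work together, the fragment below declares a date field whose source values use a day-first layout. Square brackets in DateTimeFormatter patterns mark optional sections, which is why the default pattern accepts values with or without fractional seconds and a zone offset. The field name published and the sample input layout are assumptions made for illustration only.

```json
"fields": {
  // Hypothetical field; source values are assumed to look like "31/12/2019 17:45".
  "published": {
    "type": "date",
    // Pattern used to parse the incoming string value.
    "inputFormat": "dd/MM/yyyy HH:mm",
    // Pattern used for the stored timestamp; [X] is an optional zone offset section.
    "indexFormat": "yyyy-MM-dd'T'HH:mm:ss[X]"
  }
}
```

A value that does not match inputFormat will not parse, so the pattern should describe the source data exactly.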

double

A 64-bit floating point numeric value.

{
  "type": "double"
}

float

A 32-bit floating point numeric value.

{
  "type": "float"
}

integer

An integer type (integer values between -2147483648 and 2147483647).

{
  "type": "integer"
}

long

A 64-bit integer type (integer values between -2^63 and 2^63-1).

{
  "type": "long"
}

text

A text field. Text fields are the most common type of fields in Lingo4G.

{
  "type": "text",
  "analyzer": null,
  "featureAnalyzer": null,
  "id": false,
  "indexPositions": false
}

The value of a text field can be analyzed, that is, processed and split into a stream of tokens. These tokens can be used to:

  • build an inverted index that allows fast searching for documents containing a given token or an expression involving multiple tokens,

  • detect frequent, non-trivial sequences of tokens that can potentially convey some meaning. These sequences become document features.

analyzer

Type
string
Default
null
Required
no

Search index analyzer chain. Defines how input values are tokenized into separate tokens eventually stored in the inverted index used for searches.

This property should contain one of the analyzer types. Fields without an analyzer cannot be used in search queries (but their whole field values are still available for retrieval).

featureAnalyzer

Type
string
Default
null
Required
no

An analyzer chain used for automatic feature discovery. Defines how input values are tokenized into separate tokens. Automatic feature discovery then looks for longer, frequent sequences of tokens that may eventually become document labels.

This property should contain one of the analyzer types. Fields without a feature analyzer cannot be used in many types of analysis requests.

Feature and search analyzers

A feature analyzer is separate from the search analyzer because their configurations will often be different. For example, a feature analyzer for a field may include a much larger stop word dictionary to ignore tokens and phrases that are irrelevant for analytical needs (but may occasionally be useful for document retrieval in search queries).
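
A minimal sketch of a field that uses different analyzers for search and for feature discovery. The english analyzer name is taken from the fields example earlier in this chapter. The name english_features is hypothetical: it stands for a custom analyzer with a larger stop word dictionary, assumed to be defined elsewhere in the project descriptor.

```json
"fields": {
  "summary": {
    // Search analyzer: keeps more tokens so that queries can still match them.
    "analyzer": "english",
    // Feature analyzer: hypothetical custom analyzer with a larger stop word dictionary.
    "featureAnalyzer": "english_features"
  }
}
```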

id

Type
boolean
Default
false
Required
no

If true, the field is designated as containing a non-empty, unique document identifier. Only one field can be marked as an identifier. Document identifiers are required for incremental indexing and for document deletions.

indexPositions

Type
boolean
Default
false
Required
no

If true, inverted search indexes will store the position of each token in the token stream. Positions are required for certain types of proximity queries (and for proper search scope highlighting).
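
For instance, a field intended for proximity queries could enable positions as sketched below. The field name abstract is a placeholder chosen for illustration, and the english analyzer is the one used in the fields example earlier in this chapter.

```json
"fields": {
  "abstract": {
    "type": "text",
    "analyzer": "english",
    // Store token positions so that proximity queries and highlighting can use them.
    "indexPositions": true
  }
}
```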