Fields
The fields section of the project descriptor defines how Lingo4G processes and stores values of fields returned by the document source.
The fields object consists of keys that denote field names and values containing the definition of how that field will be processed and indexed. Lingo4G only processes the fields declared in the fields section, ignoring any other fields returned by the document source.
Field values (definitions) must be objects of the following types:
- text: The default value type denoting free text. Text fields can have associated search and feature analyzers.
- date: A date-and-time type.
- integer: An integer type (integer values between -2147483648 and 2147483647).
- long: An integer type (integer values between -2^63 and 2^63-1).
- double: A 64-bit floating point numeric value.
- float: A 32-bit floating point numeric value.
A typical fields section may look like the following:
"fields": {
  // Document identifier field (for updates).
  "id":      { "id": true, "type": "text", "analyzer": "literal" },
  // Simple values, will be lower-cased for query matching.
  "author":  { "analyzer": "keyword" },
  "type":    { "analyzer": "keyword" },
  // Plain text in English.
  "title":   { "analyzer": "english" },
  "summary": { "analyzer": "english" },
  // Date, converted from incomplete information to full ISO timestamp.
  "created": { "type": "date", "inputFormat": "yyyy-MM-dd HH:mm",
               "indexFormat": "yyyy-MM-dd'T'HH:mm:ss[X]" },
  // A numeric score as an integer.
  "score":   { "type": "integer" }
}
Each type of field is described in more detail in the remaining sections of this chapter.
date
A date-and-time type. Lingo4G stores timestamps as strings. The input date string must parse correctly according to the pattern provided in inputFormat. The timestamp is stored in the index using the date pattern specified in indexFormat.
{
  "type": "date",
  "indexFormat": "yyyy-MM-dd'T'HH:mm:ss[.SSS][X]",
  "inputFormat": "yyyy-MM-dd'T'HH:mm:ss[.SSS][X]"
}
indexFormat
The format of the timestamp stored in the index, as per the specification in DateTimeFormatter.
inputFormat
The date format with which the input field's string value is parsed into a timestamp to be stored in Lingo4G, as per the specification in DateTimeFormatter.
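As an illustration of how the two patterns can differ, the sketch below declares a hypothetical published field whose source values use a day-first layout. The field name and both patterns are assumptions made for this example; the pattern letters follow DateTimeFormatter conventions.

```json
"fields": {
  // Hypothetical field; input values are assumed to look like "05/03/2021 14:30:00".
  "published": {
    "type": "date",
    // Pattern the source value must match when it is parsed.
    "inputFormat": "dd/MM/yyyy HH:mm:ss",
    // Pattern used when the timestamp is written to the index.
    "indexFormat": "yyyy-MM-dd'T'HH:mm:ss"
  }
}
```

Under this definition, a source value that does not match inputFormat (for example, one lacking the seconds component) would be expected to fail parsing, since the input string must conform to the declared pattern.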
double
A 64-bit floating point numeric value.
{
  "type": "double"
}
float
A 32-bit floating point numeric value.
{
  "type": "float"
}
integer
An integer type (integer values between -2147483648 and 2147483647).
{
  "type": "integer"
}
long
An integer type (integer values between -2^63 and 2^63-1).
{
  "type": "long"
}
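A sketch of how the four numeric types might be declared side by side. All field names here are hypothetical and chosen only to illustrate which range each type covers.

```json
"fields": {
  // Hypothetical field names, for illustration only.
  "pageCount": { "type": "integer" },  // 32-bit range: -2147483648 to 2147483647
  "byteSize":  { "type": "long" },     // 64-bit range, for values that may exceed 2147483647
  "rating":    { "type": "float" },    // 32-bit floating point
  "price":     { "type": "double" }    // 64-bit floating point
}
```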
text
A text field. Text fields are the most common type of fields in Lingo4G.
{
  "type": "text",
  "analyzer": null,
  "featureAnalyzer": null,
  "id": false,
  "indexPositions": false
}
The value of a text field can be analyzed, that is, processed and split into a stream of tokens. These tokens can be used to:
- build an inverted index that allows fast searching for documents containing a given token or an expression involving multiple tokens,
- detect frequent, non-trivial sequences of tokens that can potentially convey some meaning. These sequences become document features.
analyzer
Search index analyzer chain. Defines how input values are tokenized into separate tokens eventually stored in the inverted index used for searches.
This property should contain one of the analyzer types. Fields without an analyzer cannot be used in search queries (but their whole field values are still available for retrieval).
featureAnalyzer
An analyzer chain used for automatic feature discovery. Defines how input values are tokenized into separate tokens. Automatic feature discovery then looks for longer, frequent sequences of tokens that may eventually become document labels.
This property should contain one of the analyzer types. Fields without a feature analyzer cannot be used in many types of analysis requests.
A feature analyzer is separate from the search analyzer because their configurations will often be different. For example, a feature analyzer for a field may include a much larger stop word dictionary to ignore tokens and phrases that are irrelevant for analytical needs (but may occasionally be useful for document retrieval in search queries).
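A minimal sketch of a text field that declares both analyzers. It reuses the english analyzer name from the example at the start of this chapter and assumes that name is also acceptable for featureAnalyzer; a project with a dedicated feature analyzer (for instance, one with a larger stop word dictionary) would reference that analyzer in featureAnalyzer instead.

```json
"fields": {
  "summary": {
    "type": "text",
    // Produces tokens for the inverted index used by search queries.
    "analyzer": "english",
    // Produces tokens for automatic feature discovery;
    // assumed here to reuse the same analyzer as search.
    "featureAnalyzer": "english"
  }
}
```

Omitting analyzer would leave the field retrievable but not searchable, and omitting featureAnalyzer would exclude it from many types of analysis requests.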
id
If true, the field is designated as containing a non-empty, unique document identifier. Only one field can be marked as an identifier. Document identifiers are required for incremental indexing and for document deletions.
indexPositions
If true, inverted search indexes will store token positions in the stream of tokens. Positions are required for certain types of proximity queries (and for proper search scope highlighting).
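The following sketch combines the id and indexPositions properties in one fields section. The analyzer names are taken from the example at the start of this chapter; enabling positions on title is an assumption made for illustration.

```json
"fields": {
  // Unique, non-empty document identifier; only one field may set "id": true.
  "id":    { "id": true, "type": "text", "analyzer": "literal" },
  // Token positions are stored so that proximity queries and
  // search scope highlighting can work on this field.
  "title": { "type": "text", "analyzer": "english", "indexPositions": true }
}
```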