Analyzers
Lingo4G uses analyzers to split text into smaller units, such as words or punctuation, which then undergo further analysis or indexing, including phrase detection or dictionary matching.
Properties in this object must be objects of the following types:
- english: A flexible analyzer for processing text in the English language, with options to control stemming and other aspects of token formation.
- whitespace: This analyzer can be used to process whitespace-separated tokens and other character sequences.
- keyword: This analyzer considers each input value to be a single token.
Analyzers are most commonly referenced from the fields section of the project descriptor, where document fields are described. There are also REST API requests that reference analyzers explicitly (for example, fromText label retrieval from input text not stored in the index).
Lingo4G comes with a set of default analyzers, where each analyzer's key corresponds to its type (english, whitespace, keyword). An additional literal analyzer is of type keyword but has letter case folding set to false. These default settings can be changed by declaring the corresponding key explicitly. Alternatively, a new analyzer key can be added and referenced where appropriate.
An example definition of the analyzers section in the project descriptor can look like this:
"analyzers": {
  // Tweak the default analyzer's configuration.
  "english": {
    "stemmerDictionary": [
      "data/english.txt",
      "data/common.txt"
    ]
  },
  // Configure a separate analyzer key: English with no stemming.
  "english-nostemming": {
    "type": "english",
    "useHeuristicStemming": false,
    "stemmerDictionary": []
  }
}
english
The default English analyzer (key: english) is best suited to processing text written in English. It normalizes word forms and applies heuristic stemming to unify various spelling variants of the same word (lemma).
The default definition has the following properties:
{
  "type": "english",
  "positionGap": 1000,
  "requireResources": true,
  "stemmerDictionary": [
    "${l4g.home}/resources/indexing/words.en.dict"
  ],
  "stopwords": [
    "${l4g.home}/resources/indexing/stopwords.utf8.txt"
  ],
  "useHeuristicStemming": true
}
positionGap
An expert setting that adds position gap spacing between tokens from multiple values of the same field. For example, a position gap of 10 means that 10 "empty" token positions are inserted between the last token of one value and the first token of the next value stored in the same field.
The position gap is needed for queries where token positions are taken into account: phrase queries, proximity queries, interval queries. A non-zero position gap prevents false positives where the query would match field positions from separate values. For example, without a position gap, a phrase query "foo bar" could match a document with two separate values foo and bar indexed in the same text field.
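The effect of the position gap can be illustrated with a small, self-contained simulation. This is only a conceptual sketch and not Lingo4G's implementation: the helper names index_positions and phrase_matches are hypothetical, and tokens are split on whitespace purely for simplicity.

```python
def index_positions(values, position_gap):
    """Assign a token position to each whitespace-separated token of a multi-valued field."""
    positions = []
    next_position = 0
    for value in values:
        for token in value.split():
            positions.append((token, next_position))
            next_position += 1
        # Leave `position_gap` empty positions before the next value starts.
        next_position += position_gap
    return positions


def phrase_matches(positions, phrase):
    """Return True if the phrase tokens occur at consecutive positions."""
    words = phrase.split()
    lookup = {pos: tok for tok, pos in positions}
    return any(
        all(lookup.get(start + i) == word for i, word in enumerate(words))
        for start in lookup
    )


# Two separate values stored in the same field.
field_values = ["foo", "bar"]
no_gap = index_positions(field_values, position_gap=0)
with_gap = index_positions(field_values, position_gap=1000)

print(phrase_matches(no_gap, "foo bar"))    # True: a false positive across values
print(phrase_matches(with_gap, "foo bar"))  # False: the gap separates the values
```

With a gap of 0, foo and bar end up at adjacent positions even though they come from different values, so the phrase query matches. Any gap larger than the longest phrase or proximity window expected in queries avoids this.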
requireResources
If true, all external resources for this analyzer are required and will cause an error if not found. If false, all resources are optional. The predefined english analyzer sets this property to true (see the default definition above), so its resources are required; they point at files shipped with the distribution under the l4g.home directory.
stemmerDictionary
Zero or more locations of precompiled Morfologik FSA files (automata) with inflected-to-base form mappings and part-of-speech tags. Lingo4G comes with a reasonably sized default dictionary. This dictionary can be decompiled (or recompiled) using the morfologik-stemming library.
stopwords
Zero or more locations of stopword files. A stopword file is a plain-text, UTF-8 encoded file with each word on a single line.
Analyzer stopwords decrease the amount of data to be indexed and mark phrase boundaries: stopwords by definition cannot occur at the beginning or end of a phrase in automatic feature discovery.
The primary difference between analyzer stopwords and request-time exclusion dictionaries is that stopwords provided to the analyzer are skipped entirely while indexing documents (they are not stored in inverted indexes or features). As a result, they cannot be used in queries and cannot be dynamically excluded or included in analyses (using ad-hoc dictionaries).
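The stopword file format and the index-time filtering described above can be sketched as follows. This is an illustration only, not Lingo4G code: load_stopwords and indexed_tokens are hypothetical helpers, the stopword file is a throwaway created for the example, skipping blank lines is an assumption, and phrase boundary handling is ignored.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read a plain-text, UTF-8 stopword file with one word per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Assumption for this sketch: blank lines are ignored.
    return {line.strip() for line in lines if line.strip()}


def indexed_tokens(text, stopwords):
    """Drop stopwords, keeping only the tokens that would reach the index."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


# A throwaway stopword file, created only for this illustration.
with tempfile.TemporaryDirectory() as tmp:
    stopword_file = Path(tmp) / "stopwords.txt"
    stopword_file.write_text("the\nof\nand\n", encoding="utf-8")
    stopwords = load_stopwords(stopword_file)

tokens = indexed_tokens("The analysis of text and data", stopwords)
print(tokens)  # ['analysis', 'text', 'data']
```

Because the filtered words never reach the index in this model, a later query for "of" has nothing to match, which mirrors why analyzer stopwords cannot be re-enabled at request time.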
useHeuristicStemming
If true, the analyzer will apply a heuristic stemming algorithm (Porter stemmer) to each token. This typically brings different surface forms of the same lemma to the same token image. For example, flowers and flower would both be transformed to flower in the index, making the document retrievable with a query for either of these words.
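The retrieval effect of stemming can be sketched with a toy inverted index. The toy_stem function below is a deliberately naive stand-in that only strips a plural "s"; it is not the Porter stemmer and not what Lingo4G runs. All names here are hypothetical.

```python
def toy_stem(token):
    """A naive stand-in for a real stemmer: strips a single plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_index(docs):
    """Map each stemmed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Stem the query term the same way as indexed tokens, then look it up."""
    return sorted(index.get(toy_stem(query.lower()), set()))


docs = {1: "red flowers", 2: "a single flower", 3: "green grass"}
index = build_index(docs)

print(search(index, "flower"))   # [1, 2]
print(search(index, "flowers"))  # [1, 2]
print(search(index, "grass"))    # [3]
```

The key point is that the same transformation is applied at indexing time and at query time, so both surface forms resolve to the same index entry.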
keyword
The keyword analyzer does not perform any token splitting at all, returning the full content of a field for indexing (or feature detection). This analyzer is useful to index identifiers or other non-textual information that shouldn't be split into smaller units.
{
  "type": "keyword",
  "lowercase": true,
  "positionGap": 1000
}
Lingo4G declares two default analyzers of this type: keyword and literal. The only difference between them is the letter case handling flag (lowercase):
"analyzers": {
  "keyword": {
    "type": "keyword",
    "lowercase": true,
    "positionGap": 1000
  },
  "literal": {
    "type": "keyword",
    "lowercase": false,
    "positionGap": 1000
  }
}
lowercase
If true, each token will be converted to lower case (according to the Locale.ROOT locale of the JDK and Unicode rules; no localized rules apply).
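The difference between the keyword and literal defaults can be pictured with a short sketch. The keyword_tokens helper is hypothetical and only models the behavior described here; Python's str.lower() is used as a rough analogue of locale-independent Unicode lowercasing.

```python
def keyword_tokens(value, lowercase=True):
    """Treat the whole field value as a single token, optionally lowercased."""
    # str.lower() applies locale-independent Unicode case mapping.
    return [value.lower() if lowercase else value]


print(keyword_tokens("Doc-ID 42/A"))                   # ['doc-id 42/a']
print(keyword_tokens("Doc-ID 42/A", lowercase=False))  # ['Doc-ID 42/A']
print(keyword_tokens("TITLE"))                         # ['title']
```

The first call corresponds to the keyword key and the second to the literal key. The last call shows what "no localized rules" means in practice: the capital I maps to a plain i, whereas Turkish locale rules would map it to a dotless ı.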
positionGap
The position gap value has the same meaning as in the english analyzer.
whitespace
The whitespace analyzer can be useful to break up a field that consists of whitespace-separated tokens or terms. Any punctuation will remain together with the tokens (or will be returned as individual tokens). The default definition of this analyzer is as follows:
{
  "type": "whitespace",
  "lowercase": true,
  "positionGap": 1000
}
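How punctuation behaves under whitespace splitting can be seen in a minimal sketch. The whitespace_tokens helper is hypothetical and only approximates the described behavior with Python's str.split().

```python
def whitespace_tokens(value, lowercase=True):
    """Split a field value on runs of whitespace, optionally lowercasing each token."""
    tokens = value.split()
    return [tok.lower() for tok in tokens] if lowercase else tokens


print(whitespace_tokens("Machine-Learning, NLP (2024)"))
# ['machine-learning,', 'nlp', '(2024)']
print(whitespace_tokens("a , b"))
# ['a', ',', 'b']
```

In the first call the comma and parentheses stay attached to their neighbouring tokens. In the second, the comma is surrounded by whitespace and so becomes a token of its own.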
lowercase
If true, each token will be converted to lower case (according to the Locale.ROOT locale of the JDK and Unicode rules; no localized rules apply).
positionGap
The position gap value has the same meaning as in the english analyzer.