Analyzers
Lingo4G uses analyzers to split text into smaller units, such as words or punctuation, which then undergo further analysis or indexing, including phrase detection or dictionary matching.
Properties in this object must be objects of the following types:
- english: a flexible analyzer for processing text in the English language, with options to control stemming and other aspects of token formation.
- whitespace: processes whitespace-separated tokens and other character sequences.
- keyword: considers each input value to be a single token.
- custom: enables providing an analyzer written in Java code.
- date: configures query parsing for date fields.
Analyzers are most commonly referenced from the fields section of the project descriptor, where document fields are described. There are also REST API requests that reference analyzers explicitly (for example, fromText, which retrieves labels from input text not stored in the index).
Lingo4G comes with a set of default analyzers, where each analyzer's key corresponds to its type (english, whitespace, keyword). An additional literal analyzer is of type keyword but has letter case folding set to false. These default settings can be changed by declaring the corresponding key explicitly. Alternatively, a new analyzer key can be added and referenced where appropriate.
An example definition of the analyzers section in the project descriptor can look like this:
"analyzers": {
// tweak the default analyzer's configuration.
"english": {
"stemmerDictionary": [
"data/english.txt",
"data/common.txt"
]
},
// configure a separate analyzer key: English with no stemming.
"english-nostemming": {
"type": "english"
"useHeuristicStemming": false,
"stemmerDictionary": []
}
}
custom
With this configuration type, you can use a custom Java class implementing java.util.function.Supplier<org.apache.lucene.analysis.Analyzer>.
{
"type": "custom",
"analyzerSupplierClass": null
}
Custom analyzers can be useful for tokenizing atypical or structured text fields. Different analyzers can be provided for indexing and for querying the index.
analyzerSupplierClass
Fully qualified name of a public Java class implementing java.util.function.Supplier<org.apache.lucene.analysis.Analyzer>. The document source's classpath is used to look up and load the custom class.
You should have full control over the classes and project configuration. Loading arbitrary Java classes can be a security risk, so review your project configuration and classpath entries carefully.
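For illustration, a minimal supplier could look like the sketch below. The class name and token stream setup are hypothetical (shown without a package declaration), and the exact Lucene package locations depend on the Lucene version on the classpath:

import java.util.function.Supplier;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// Hypothetical example: splits text on whitespace and lowercases each token.
public class LowercaseWhitespaceAnalyzerSupplier implements Supplier<Analyzer> {
  @Override
  public Analyzer get() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        TokenStream tokens = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, tokens);
      }
    };
  }
}

The fully qualified name of such a class is what goes into the analyzerSupplierClass property.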
date
Date field analyzer for converting query strings into date strings stored in date fields.
{
"type": "date",
"dateMath": true,
"timeZone": "UTC"
}
A date field analyzer can validate input strings to verify they match the index format specified in the field definition.
A date field analyzer can support date math expressions. These expressions are borrowed from Apache Solr. A date math expression can add a unit of time to, or subtract it from, a predefined ISO timestamp or the magic constant NOW. It can also round the time value to a specified unit. Expressions can be chained and are evaluated left to right.
Example date expressions:
- received:[2010 TO NOW/YEAR] — matches any values in the received field between 2010 and the start of the current year.
- received:[NOW-5YEARS TO *] — matches any values after a timestamp 5 years before the current moment.
- received:[2024-10-28T17:52:48.025Z-5YEARS/YEAR TO NOW] — matches any values between the start of year 2019 (note the truncation operator /YEAR) and now.
dateMath
If true, date math expressions are enabled in queries.
timeZone
The default time zone name or offset for converting input date values into epoch milliseconds and for date math expressions (where the input expression is time-zone relative, for example NOW). The time zone name must follow Java's TimeZone naming convention.
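For reference, the snippet below uses plain Java (not a Lingo4G API) to illustrate identifiers accepted by Java's TimeZone lookup; note that unrecognized names silently fall back to GMT rather than failing:

import java.time.ZoneId;
import java.util.TimeZone;

public class TimeZoneNames {
  public static void main(String[] args) {
    // A region-based identifier and a fixed offset are both valid names.
    System.out.println(TimeZone.getTimeZone("Europe/Warsaw").getID()); // Europe/Warsaw
    System.out.println(TimeZone.getTimeZone("GMT+02:00").getID());     // GMT+02:00

    // An unrecognized name falls back to GMT instead of raising an error.
    System.out.println(TimeZone.getTimeZone("Not/AZone").getID());     // GMT

    // All region-based identifiers known to the JDK.
    System.out.println(ZoneId.getAvailableZoneIds().size());
  }
}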
english
The default English analyzer (key: english) is best suited to processing text written in English. It normalizes word forms and applies heuristic stemming to unify various spelling variants of the same word (lemma).
The default definition has the following properties:
{
"type": "english",
"positionGap": 1000,
"requireResources": true,
"stemmerDictionary": [
"${l4g.home}/resources/indexing/words.en.dict"
],
"stopwords": [
"${l4g.home}/resources/indexing/stopwords.utf8.txt"
],
"useHeuristicStemming": true
}
positionGap
An expert setting that adds position gap spacing between tokens from multiple values of the same field. For example, a position gap of 10 means that 10 "empty" token positions are inserted between the last token of one value and the first token of the next value stored in the same field.
The position gap is needed for queries where token positions are taken into account: phrase queries, proximity queries, interval queries. A non-zero position gap prevents false positives in which the query would match token positions from separate values. For example, without a gap, a phrase query "foo bar" could match a document with two separate values foo and bar indexed in the same text field.
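The sketch below demonstrates this effect with plain Lucene APIs (not Lingo4G; class locations such as ByteBuffersDirectory assume a recent Lucene version). With a gap of 0 the phrase query matches across the two values, while a large gap such as the 1000 configured above prevents the match:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class PositionGapDemo {
  public static void main(String[] args) throws IOException {
    System.out.println(countPhraseHits(0));    // 1: "foo bar" matches across values
    System.out.println(countPhraseHits(1000)); // 0: the gap prevents the match
  }

  static int countPhraseHits(int gap) throws IOException {
    // A whitespace analyzer with a configurable gap between values of a field.
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new WhitespaceTokenizer());
      }

      @Override
      public int getPositionIncrementGap(String fieldName) {
        return gap;
      }
    };

    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
        Document doc = new Document();
        // Two separate values of the same field.
        doc.add(new TextField("body", "foo", Field.Store.NO));
        doc.add(new TextField("body", "bar", Field.Store.NO));
        writer.addDocument(doc);
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        return new IndexSearcher(reader).count(new PhraseQuery("body", "foo", "bar"));
      }
    }
  }
}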
requireResources
If true, all external resources for this analyzer are required and cause an error if not found. If false, all resources are optional. The predefined english analyzer makes all resources optional and points at resources shipped with the distribution under the l4g.home directory.
stemmerDictionary
The location of a precompiled Morfologik FSA (automaton file) with mappings from inflected forms to base forms and part-of-speech tags. Lingo4G comes with a reasonably sized default dictionary. This dictionary can be decompiled (or recompiled) using the morfologik-stemming library.
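For a quick look at what such an automaton contains, the morfologik-stemming library can also be used directly, as in the sketch below (not a Lingo4G API; it assumes morfologik-stemming 2.x and an illustrative dictionary path, with the metadata .info file sitting next to the .dict file):

import java.nio.file.Path;

import morfologik.stemming.Dictionary;
import morfologik.stemming.DictionaryLookup;
import morfologik.stemming.WordData;

public class StemLookup {
  public static void main(String[] args) throws Exception {
    // Point this at the dictionary from the analyzer definition, for example
    // ${l4g.home}/resources/indexing/words.en.dict (the path below is illustrative).
    Dictionary dictionary = Dictionary.read(Path.of("resources/indexing/words.en.dict"));
    DictionaryLookup lookup = new DictionaryLookup(dictionary);

    // Each entry maps an inflected form to its base form and part-of-speech tag.
    for (WordData entry : lookup.lookup("flowers")) {
      System.out.println(entry.getStem() + " / " + entry.getTag());
    }
  }
}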
stopwords
Zero or more locations of stopword files. A stopword file is a plain-text, UTF-8 encoded file with each word on a single line.
Analyzer stopwords decrease the amount of data to be indexed and mark phrase boundaries: stopwords by definition cannot occur at the beginning or end of a phrase in automatic feature discovery.
The primary difference between analyzer stop words and request-time exclusion dictionaries is that stop words provided to the analyzer will be skipped entirely while indexing documents (will not be stored in inverted indexes or features). They cannot be used in queries and cannot be dynamically excluded or included in analyses (using ad-hoc dictionaries).
useHeuristicStemming
If true, the analyzer will apply a heuristic stemming algorithm (Porter stemmer) to each token. This typically brings different surface forms of the same lemma to the same token image. For example, flowers and flower would both be transformed to flower in the index, making the document retrievable with a query for either of these words.
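The sketch below shows the effect of Porter stemming using plain Lucene filters (not Lingo4G's english analyzer itself; exact Lucene package locations depend on the Lucene version):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PorterStemDemo {
  public static void main(String[] args) throws IOException {
    // Letter tokenization, lowercasing, then Porter stemming.
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new LetterTokenizer();
        TokenStream stream = new PorterStemFilter(new LowerCaseFilter(tokenizer));
        return new TokenStreamComponents(tokenizer, stream);
      }
    };

    try (TokenStream ts = analyzer.tokenStream("body", "Flowers flowering near a flower bed")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.print(term + " "); // flower flower near a flower bed
      }
      ts.end();
    }
  }
}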
keyword
The keyword analyzer does not perform any token splitting at all, returning the full content of a field for indexing (or feature detection). This analyzer is useful for indexing identifiers or other non-textual information that shouldn't be split into smaller units.
{
"type": "keyword",
"lowercase": true,
"positionGap": 1000
}
Lingo4G declares two default analyzers of this type: keyword and literal. The only difference between them is the letter case handling flag:
"analyzers": {
"keyword": {
"type": "keyword",
"lowercase": true,
"positionGap": 1000
},
"literal": {
"type": "keyword",
"lowercase": false,
"positionGap": 1000
}
}
lowercase
If true, each token will be converted to lower case (according to the Locale.ROOT locale of the JDK and Unicode rules; no localized rules apply).
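For reference, the snippet below uses plain Java to show the difference between Locale.ROOT lowercasing and a locale-sensitive variant (the classic case is the Turkish dotted/dotless i):

import java.util.Locale;

public class LowercaseDemo {
  public static void main(String[] args) {
    String token = "IDENTIFIER-I";

    // Locale.ROOT applies plain Unicode rules: 'I' becomes 'i'.
    System.out.println(token.toLowerCase(Locale.ROOT)); // identifier-i

    // A localized rule set differs: Turkish maps 'I' to dotless 'ı'.
    System.out.println(token.toLowerCase(Locale.forLanguageTag("tr"))); // ıdentıfıer-ı
  }
}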
positionGap
The position gap between field values has the same meaning as in the english analyzer.
whitespace
The whitespace analyzer can be useful for breaking up a field that consists of whitespace-separated tokens or terms. Punctuation remains attached to the tokens (or is returned as individual tokens when separated by whitespace). The default definition of this analyzer is as follows:
{
"type": "whitespace",
"lowercase": true,
"positionGap": 1000
}
lowercase
If true, each token will be converted to lower case (according to the Locale.ROOT locale of the JDK and Unicode rules; no localized rules apply).
positionGap
The position gap between field values has the same meaning as in the english analyzer.