Analyzers

Lingo4G uses analyzers to split text into smaller units, such as words or punctuation, which then undergo further analysis or indexing, including phrase detection or dictionary matching.

Properties in this object must be objects of the following types:

english

A flexible analyzer for processing text in the English language, with options to control stemming and other aspects of token formation.

whitespace

This analyzer can be used to process whitespace-separated tokens and other character sequences.

keyword

This analyzer considers each input value to be a single token.

Analyzers are most commonly referenced from the fields section of the project description, where document fields are described. There are also REST API requests that reference analyzers explicitly (for example, fromText label retrieval from input text not stored in the index).

Lingo4G comes with a set of default analyzers, where each analyzer's key corresponds to its type (english, whitespace, keyword). An additional literal analyzer is of type keyword but has lowercase set to false, so letter case is preserved. These default settings can be changed by declaring the corresponding key explicitly. Alternatively, a new analyzer key can be added and referenced where appropriate.

An example definition of the analyzers section in the project descriptor can look like this:

"analyzers": {
  // tweak the default analyzer's configuration.
  "english": {
    "stemmerDictionary": [
      "data/english.txt",
      "data/common.txt"
    ]
  },
  // configure a separate analyzer key: English with no stemming.
  "english-nostemming": {
    "type": "english",
    "useHeuristicStemming": false,
    "stemmerDictionary": []
  }
}

english

The default English analyzer (key: english) is best suited to processing text written in English. It normalizes word forms and applies heuristic stemming to unify various spelling variants of the same word (lemma). The default definition has the following properties:

{
  "type": "english",
  "positionGap": 1000,
  "requireResources": true,
  "stemmerDictionary": [
    "${l4g.home}/resources/indexing/words.en.dict"
  ],
  "stopwords": [
    "${l4g.home}/resources/indexing/stopwords.utf8.txt"
  ],
  "useHeuristicStemming": true
}

positionGap

Type
integer
Default
1000
Constraints
value >= 0
Required
no

An expert setting that adds a gap in token positions between multiple values of the same field. For example, a position gap of 10 means that 10 "empty" token positions are inserted between the last token of one value and the first token of the next value stored in the same field.

The position gap matters for queries that take token positions into account: phrase queries, proximity queries and interval queries. A non-zero position gap prevents false positives in which the query matches positions that belong to separate values. For example, with a position gap of zero, a phrase query "foo bar" could match a document with two separate values, foo and bar, indexed in the same text field.
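The effect of the gap can be illustrated with a small Python sketch. This is not Lingo4G code: assign_positions and phrase_matches are hypothetical helpers that number tokens the way the description above suggests (splitting on whitespace for simplicity) and then check whether a phrase occurs at consecutive positions.

```python
def assign_positions(values, position_gap):
    """Return (token, position) pairs for several values of one field.

    Illustrative only: tokens are split on whitespace and numbered
    consecutively; position_gap empty positions separate adjacent values.
    """
    positions = []
    next_pos = 0
    for value in values:
        for token in value.split():
            positions.append((token, next_pos))
            next_pos += 1
        # Leave "empty" positions before the next value starts.
        next_pos += position_gap
    return positions


def phrase_matches(positions, phrase):
    """Return True if the phrase tokens occur at consecutive positions."""
    words = phrase.split()
    index = {}
    for token, pos in positions:
        index.setdefault(token, set()).add(pos)
    return any(
        all(start + offset in index.get(word, ()) for offset, word in enumerate(words))
        for start in index.get(words[0], ())
    )
```

Under these assumptions, the values foo and bar indexed with a gap of 0 receive positions 0 and 1, so the phrase "foo bar" matches across the value boundary. With a gap of 1000, bar lands at position 1001 and the phrase no longer matches.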

requireResources

Type
boolean
Default
true
Required
no

If true, all external resources for this analyzer are required and a missing resource causes an error. If false, all resources are optional. The predefined english analyzer sets this property to true and points at resources shipped with the distribution under the l4g.home directory.
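The required versus optional distinction can be sketched in Python as follows. The resolve_resources helper is hypothetical and only mirrors the behavior described above; how Lingo4G actually locates and reports missing resources is not shown here.

```python
from pathlib import Path


def resolve_resources(paths, require_resources=True):
    """Return the resource paths that exist as files.

    Hypothetical helper: a missing file raises an error when
    require_resources is true and is silently skipped otherwise.
    """
    found = []
    for path in paths:
        if Path(path).is_file():
            found.append(path)
        elif require_resources:
            raise FileNotFoundError(f"Required analyzer resource not found: {path}")
    return found
```

With require_resources set to False, a list containing only missing files resolves to an empty list instead of failing.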

stemmerDictionary

Type
string or array of string
Default
[
  "${l4g.home}/resources/indexing/words.en.dict"
]
Required
no

The location of a precompiled Morfologik FSA (automaton file) that maps inflected forms to base forms and carries part-of-speech tags. Lingo4G comes with a reasonably sized default dictionary. This dictionary can be decompiled (or recompiled) using the morfologik-stemming library.

stopwords

Type
string or array of string
Default
[
  "${l4g.home}/resources/indexing/stopwords.utf8.txt"
]
Required
no

Zero or more locations of stopword files. A stopword file is a plain-text, UTF-8 encoded file with each word on a single line.

Analyzer stopwords decrease the amount of data to be indexed and mark phrase boundaries: stopwords by definition cannot occur at the beginning or end of a phrase in automatic feature discovery.

The primary difference between analyzer stopwords and request-time exclusion dictionaries is that stopwords provided to the analyzer are skipped entirely while indexing documents (they are not stored in inverted indexes or features). As a result, they cannot be used in queries and cannot be dynamically excluded or included in analyses (using ad-hoc dictionaries).
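As an illustration of the file format and of index-time removal, the Python sketch below parses stopword file content and filters a token list. The helper names are hypothetical, and the handling of blank lines, surrounding whitespace and letter case is an assumption rather than documented Lingo4G behavior.

```python
def parse_stopwords(text):
    """Parse stopword file content: one word per line.

    Assumption: blank lines are ignored and surrounding whitespace is trimmed.
    """
    return {line.strip() for line in text.splitlines() if line.strip()}


def load_stopwords(path):
    """Read a plain-text, UTF-8 encoded stopword file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_stopwords(handle.read())


def drop_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords (exact, case-sensitive match)."""
    return [token for token in tokens if token not in stopwords]
```

Tokens removed this way never reach the index, which is why they cannot later be queried or re-included at request time.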

useHeuristicStemming

Type
boolean
Default
true
Required
no

If true, the analyzer applies a heuristic stemming algorithm (the Porter stemmer) to each token. This typically maps different surface forms of the same lemma to the same token image. For example, flowers and flower would both be indexed as flower, so the document can be retrieved with a query for either word.

Heuristic stemming can occasionally lead to non-intuitive results when two unrelated words are reduced to the same stem. This is a tradeoff between precision and recall that should be resolved depending on the application.

keyword

The keyword analyzer does not perform any token splitting at all, returning the full content of a field for indexing (or feature detection). This analyzer is useful to index identifiers or other non-textual information that shouldn't be split into smaller units.

{
  "type": "keyword",
  "lowercase": true,
  "positionGap": 1000
}

Lingo4G declares two default analyzers of this type: keyword and literal. The only difference between them is the letter case handling flag:

"analyzers": {
  "keyword": {
    "type": "keyword",
    "lowercase": true,
    "positionGap": 1000
  },
  "literal": {
    "type": "keyword",
    "lowercase": false,
    "positionGap": 1000
  }
}
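The practical difference between the two defaults can be sketched in Python. The keyword_analyze function is a hypothetical stand-in for the analyzer, not Lingo4G code; Python's str.lower() is used because it applies locale-independent Unicode case mapping.

```python
def keyword_analyze(value, lowercase=True):
    """Return the whole field value as a single token, optionally lowercased."""
    return [value.lower() if lowercase else value]


# Behaves like the default "keyword" analyzer (lowercase: true).
keyword_tokens = keyword_analyze("ABC-123 Rev B")

# Behaves like the default "literal" analyzer (lowercase: false).
literal_tokens = keyword_analyze("ABC-123 Rev B", lowercase=False)
```

In this sketch the keyword variant maps ABC-123 Rev B and abc-123 rev b to the same token, while the literal variant keeps them distinct.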

lowercase

Type
boolean
Default
true
Required
no

If true, each token will be converted to lower case (according to the Locale.ROOT locale of the JDK and Unicode rules; no localized rules apply).

positionGap

Type
integer
Default
1000
Constraints
value >= 0
Required
no

The positionGap property has the same meaning as in the english analyzer.

whitespace

The whitespace analyzer can be useful to break up a field that consists of whitespace-separated tokens or terms. Any punctuation will remain together with the tokens (or will be returned as individual tokens). The default definition of this analyzer is as follows:

{
  "type": "whitespace",
  "lowercase": true,
  "positionGap": 1000
}
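A rough Python approximation of this behavior is shown below. The whitespace_analyze function is hypothetical and assumes that tokens are delimited by any run of Unicode whitespace, which may differ in detail from the rules Lingo4G applies.

```python
def whitespace_analyze(value, lowercase=True):
    """Split a field value on runs of whitespace, optionally lowercasing tokens.

    Punctuation is not treated specially: it stays attached to the
    neighbouring token, or forms a token of its own when it is
    surrounded by whitespace.
    """
    tokens = value.split()
    return [token.lower() for token in tokens] if lowercase else tokens
```

For example, under these assumptions the value "Java C++ node.js, Rust" yields the tokens java, c++, node.js, (with the trailing comma attached) and rust, whereas "a , b" yields the comma as a separate token.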

lowercase

Type
boolean
Default
true
Required
no

If true, each token will be converted to lower case (according to the Locale.ROOT locale of the JDK and Unicode rules; no localized rules apply).

positionGap

Type
integer
Default
1000
Constraints
value >= 0
Required
no

The positionGap property has the same meaning as in the english analyzer.