Dictionaries
Lingo3G uses dictionaries to improve the quality of clustering for a specific language. This section introduces the available dictionaries and the common syntax of dictionary entries.
Types of dictionaries
Lingo3G supports the following dictionaries:
- Label dictionary: Controls the labels Lingo3G chooses to describe clusters. You can use the label dictionary to remove certain labels, such as domain-common phrases or abusive language. You can also use the label dictionary to promote other labels, such as product or brand names.
- Synonym dictionary: Defines words and phrases that represent the same concept, such as photo and picture, so that Lingo3G puts documents containing synonymous words or phrases in the same cluster.
- Tag dictionary: Categorizes individual words based on their grammatical or semantic function, such as preposition, verb or proper name. You can reference the categories in the label dictionary to, for example, remove labels ending in a preposition, such as information about.
Scopes of dictionaries
Lingo3G applies dictionaries of all types in two complementary scopes:
- Global: Global dictionaries apply to all clustering requests. Lingo3G ships with basic global dictionaries suitable for most types of documents. In typical use cases, the global dictionaries are static: Lingo3G loads them once on startup and reuses them for all clustering requests.
- Per-request: Per-request dictionaries apply to a specific clustering request in addition to the global dictionaries. You can use per-request dictionaries to extend global dictionaries with temporary entries provided by end users at runtime. Lingo3G supports per-request dictionaries both in the REST API and in the Java API.
Location of dictionary files
Document Clustering Server
The Document Clustering Server (REST API) reads dictionary files from
the web/service/resources directory in the server's
distribution directory. After you edit any of the files, restart the
Document Clustering Server for the changes to take effect.
Java API
By default, the Lingo3G Java API tries to read dictionaries from the JAR
file that supplies the corresponding LanguageComponents
implementation. The following table lists the locations of dictionary
files for specific languages, including the name of the JAR file and
the path within the JAR.
| Language | JAR file | JAR path |
|---|---|---|
| English | lingo3g-2.3.2.jar | / |
| Dutch | lingo3g-lang-dutch-2.3.2.jar | / |
| Polish | lingo3g-lang-polish-2.3.2.jar | / |
| Danish, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish | lingo3g-lang-carrot2-2.3.2.jar | / |
| Arabic, Armenian, Bulgarian, Croatian, Czech, Estonian, Galician, Greek, Hindi, Indonesian, Irish, Latvian, Lithuanian, Thai | lingo3g-lang-lucene-2.3.2.jar | / |
| Chinese simplified, Chinese traditional, Japanese, Korean | lingo3g-lang-lucene-cjk-2.3.2.jar | / |
Location of dictionary files in the standard Lingo3G distribution.
If you use the Lingo3G Java API, don't modify the default dictionaries in place. Instead, copy the relevant dictionaries to your application-specific location and load them from that location.
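One way to make such a copy is to extract the files from the JAR, which is an ordinary ZIP archive. The sketch below is an illustration in Python, not part of Lingo3G. The `copy_dictionaries` helper is hypothetical, and the `.json` suffix filter is an assumption: adjust it to the actual dictionary file names in your distribution.

```python
import zipfile
from pathlib import Path


def copy_dictionaries(jar_path, target_dir, suffix=".json"):
    """Extract files with the given suffix from a JAR into target_dir.

    The suffix is an assumption about how dictionary files are named.
    Files are written flat, without their directory path inside the JAR.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    with zipfile.ZipFile(jar_path) as jar:  # a JAR is a ZIP archive
        for info in jar.infolist():
            name = Path(info.filename).name
            if info.is_dir() or not name.endswith(suffix):
                continue
            (target / name).write_bytes(jar.read(info))
            copied.append(name)
    return sorted(copied)
```

After copying, edit the files in the target directory and point your application at that directory instead of the JAR.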
Label matchers
Label and synonym dictionaries use a common syntax based on label matcher entries. A matcher entry describes how Lingo3G decides whether to apply the dictionary-specific action to a specific word or phrase. For label dictionaries, the action is removing or promoting the matching label. For synonym dictionaries, the action is including the matching label in the synonym set.
The following excerpt shows the available label matchers.
{
  "exact": [
    "Literal case-sensitive match"
  ],
  "glob": [
    "starts with phrase *",
    "* contains words *"
  ],
  "regexp": [
    "(?).+BadLabel.+",
    "(?)^[0-9]\\s*.*"
  ]
}
An example of three different label matcher types.
The exact, glob and
regexp properties are optional and can contain an array of
string entries for the specific matcher types, described in the following
sections. If a word or a label matches any matcher of any type,
Lingo3G applies the dictionary-specific action to the matching label.
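The structure described above can be checked before a dictionary is handed to Lingo3G. The following is a minimal sketch in Python, not Lingo3G code: `load_matchers` is a hypothetical helper, and rejecting unknown property names is an assumption of the sketch rather than documented Lingo3G behavior.

```python
import json

MATCHER_TYPES = ("exact", "glob", "regexp")


def load_matchers(text):
    """Parse a matcher object and return a dict with all three matcher types."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unknown = set(data) - set(MATCHER_TYPES)
    if unknown:
        # Assumption: treat unexpected properties as a mistake.
        raise ValueError("unknown matcher types: " + ", ".join(sorted(unknown)))
    matchers = {}
    for kind in MATCHER_TYPES:
        entries = data.get(kind, [])  # every property is optional
        if not isinstance(entries, list) or not all(
            isinstance(entry, str) for entry in entries
        ):
            raise ValueError(kind + " must be an array of strings")
        matchers[kind] = entries
    return matchers
```

For example, `load_matchers('{"exact": ["DevOps"]}')` returns the exact entry plus empty lists for the two omitted matcher types.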
Exact matcher
Exact matchers require exact, case-sensitive equality between the word or phrase and the dictionary entry. Exact matcher entries are fast to parse and very fast to apply during clustering.
{
  "exact": [
    "DevOps",
    "Windows 2000"
  ]
}
An example label dictionary with two
exact matcher entries.
The above label dictionary matches the labels DevOps and Windows 2000, but does not match Devops or Windows 2000 machine.
For case-insensitive matching, use glob matchers (preferably) or case-insensitive regular expression matchers.
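Conceptually, exact matching is a plain string-equality lookup. The short Python sketch below illustrates that behavior only; `exact_matches` is a hypothetical name and the code says nothing about how Lingo3G implements the matcher internally.

```python
def exact_matches(entries, label):
    """Return True if the label equals one of the entries, character for character."""
    return label in frozenset(entries)


entries = ["DevOps", "Windows 2000"]
for label in ["DevOps", "Devops", "Windows 2000", "Windows 2000 machine"]:
    print(label, "->", exact_matches(entries, label))
```

Only DevOps and Windows 2000 pass the check, because a different letter case or any extra word breaks the equality.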
Glob matcher
Glob matcher allows simple word-based wildcard matching. The primary use case of the glob matcher is case-insensitive matching of literal phrases, as well as "begins with…", "ends with…" or "contains…" types of expressions. Glob matcher entries are fast to parse and very fast to apply.
{
  "glob": [
    "more information",
    "more information *",
    "* about *",
    "big ?",
    "+ apple"
  ]
}
An example label dictionary with glob matcher entries.
Matching rules
- Each entry must consist of one or more space-separated tokens.
- A token is a sequence of arbitrary characters, such as words, numbers, identifiers.
- Matching of unquoted tokens is case-, accent- and grammatical-form-insensitive.
- Matching of quoted tokens is literal: case- and grammatical-form-sensitive. For example, `"Rating***"` matches only the `Rating***` string, exactly comparing the case, grammatical form and occurrences of special characters. Glob matcher allows the `*` character in quoted tokens and matches it literally, not as a wildcard.
- To include quote characters in the token, escape them with the `\` character, for example: `\"information\"`.
- Glob matcher recognizes the following wildcard-matching tokens:
  - `?` matches exactly one (any) word.
  - `*` matches zero or more words.
  - `+` matches one or more words. This token is functionally equivalent to: `? *`.

  The `*` and `+` wildcards are possessive in the regular expression matching sense: they match the maximum sequence of tokens until the next token in the pattern. These wildcards are suitable for most label matching scenarios. In rare cases, you may need to use the reluctant wildcards.

- Glob matcher recognizes the following reluctant wildcard-matching tokens:
  - `*?` matches zero or more words (reluctant).
  - `+?` matches one or more words (reluctant). This token is functionally equivalent to: `? *?`.

  The reluctant wildcards match the minimal sequence of tokens until the next token in the pattern.

- Glob matcher recognizes the following word category tokens:
  - `{fnc}`: matches a function word (about, have)
  - `{verb}`: matches a verb (have, allows)
  - `{noun}`: matches a noun (website, test)
  - `{adj}`: matches an adjective (cool)
  - `{adv}`: matches an adverb (fully)
  - `{geo}`: matches a geographical reference (London)
  - `{name}`: matches a proper noun (John)
  - numeric: matches numeric values (10.4, 2021, 14:30, 3/4)

  The specific set of words each category token will match depends on the language in which you perform clustering and on the contents of the tag dictionaries. Note that category-based matching may be incorrect for some inputs due to the simplicity of Lingo3G's word tagger.

- Glob matcher imposes the following restrictions on wildcard operators:
  - You cannot use wildcards (`*`, `+`) to express string prefixes or suffixes. For example, `programm*` is not supported.
  - Glob matcher does not support greedy operators.
- Glob matcher normalizes the case, accents and grammatical form of unquoted tokens before trying to match them against labels. For example, under default settings and the English language, the token naïve matches, among others, the following labels: naïve, naive, naïvely or NAIVELY.

  Glob matcher normalizes character case based on the ROOT Java locale, which performs language-neutral case conflation according to Unicode rules.

  Grammatical form and accent normalization depend on the related Lingo3G parameters: useHeuristicStemming, useBuiltInWordDatabaseForStemming and accentFolding. By default, both grammatical form and accent normalization are enabled.
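The rules above can be made concrete with a small word-level matcher. The Python sketch below is one reading of those rules, not Lingo3G's implementation, and all function names are hypothetical. It makes several loud assumptions: case and accent folding use Python's `casefold` and Unicode decomposition rather than the Java ROOT locale, there is no stemming, category tokens are not handled, `?` counts as a wildcard when checking for a non-wildcard token, and a possessive wildcard runs to the last place where the next pattern token matches while a reluctant one stops at the first, with no backtracking in either case.

```python
import unicodedata


def normalize(token):
    """Fold case and strip accents (this sketch does no stemming)."""
    decomposed = unicodedata.normalize("NFD", token.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse_entry(entry):
    """Turn a glob entry into a list of (kind, value) pattern tokens."""
    parsed = []
    for raw in entry.split():
        tok = raw.replace('\\"', "\x00")  # hide escaped quotes for now
        if tok in ("+", "+?"):
            # "+" is equivalent to "? *", and "+?" to "? *?".
            parsed += [("any", None), ("star", tok == "+")]
        elif tok in ("*", "*?"):
            parsed.append(("star", tok == "*"))  # True means possessive
        elif tok == "?":
            parsed.append(("any", None))
        elif tok.startswith("{") and tok.endswith("}"):
            raise NotImplementedError("category tokens are not sketched here")
        elif len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"':
            parsed.append(("literal", tok[1:-1].replace("\x00", '"')))
        elif '"' in tok:
            raise ValueError("unbalanced double quotes: " + raw)
        elif "*" in tok:
            raise ValueError("wildcard mixed with other characters: " + raw)
        else:
            parsed.append(("word", normalize(tok.replace("\x00", '"'))))
    if not any(kind in ("literal", "word") for kind, _ in parsed):
        raise ValueError("entry needs at least one non-wildcard token")
    return parsed


def token_matches(pattern_token, word):
    kind, value = pattern_token
    if kind == "literal":
        return word == value  # quoted tokens compare literally
    if kind == "word":
        return normalize(word) == value
    return True  # "any" matches every word


def glob_matches(entry, label):
    """Return True if the whole label matches the glob entry."""
    pattern, words = parse_entry(entry), label.split()

    def match(i, j):
        if i == len(pattern):
            return j == len(words)
        kind, possessive = pattern[i]
        if kind != "star":
            return (j < len(words) and token_matches(pattern[i], words[j])
                    and match(i + 1, j + 1))
        if i + 1 == len(pattern):
            return True  # a trailing wildcard swallows whatever remains
        if pattern[i + 1][0] == "star":
            stops = list(range(j, len(words) + 1))
        else:
            stops = [q for q in range(j, len(words))
                     if token_matches(pattern[i + 1], words[q])]
        if not stops:
            return False
        # Possessive: run to the last place the next token matches.
        # Reluctant: stop at the first one. Neither backtracks.
        return match(i + 1, stops[-1] if possessive else stops[0])

    return match(0, 0)
```

Under these assumptions, `glob_matches("* protein protein *", "whey protein protein shake")` is false, while the reluctant entry `*? protein protein *` matches the same label. Entries such as `programm*`, an unbalanced `"information`, or a lone `*` raise `ValueError`.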
Example entries
The following table shows a number of example glob entries. The "Non-matching strings" column also explains why there is no match.
All examples assume that the processing language is English and that normalization of grammatical forms and accents is enabled, which is the Lingo3G default.
| Entry | Matching strings | Non-matching strings |
|---|---|---|
| `more information` | | |
| `more information *` | | |
| `* information *` | | |
| `naïve *` | | |
| `{fnc}` | | |
| `{adj} {noun}` | | |
| `to {verb} a {noun}` | | |
| `+ information` | | |
| `"Information" *` | | |
| `data ?` | | |
| `"Programm*"` | | |
| `\"information\"` | | |
| `* protein protein *` | This pattern will never match any input. The reason for this is that the possessive `*` wildcard matches the maximum sequence of tokens, so it consumes every protein token except the last one, leaving the second protein token in the pattern without a match. To match labels with a doubled occurrence of some word, use the reluctant variant of the wildcard. | |
| `*? protein protein *` | | |
| `programm*` | Illegal pattern, combinations of the `*` wildcard and other characters are not supported. | |
| `"information` | Illegal pattern, unbalanced double quotes. | |
| `*` | Illegal pattern, there must be at least one non-wildcard token. | |
Regular expression matcher
The regular expression matcher checks words or labels against a list of regular expressions you provide.
{
  "regexp": [
    "Windows 9[58]",
    "(?)^[0-9]\\s*.*"
  ]
}
An example label dictionary with two
regexp matcher entries.
The regular expressions must follow Java syntax. If any fragment of a label matches any regular expression in the dictionary, Lingo3G will apply the dictionary-specific action to the label.
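The "any fragment" rule corresponds to a search rather than a full-string match. The sketch below illustrates it with Python's `re.search`. Python is used for illustration only: Lingo3G expects Java regular expression syntax, and the two dialects differ in places, so the sketch sticks to two simple patterns that are valid in both. The `regexp_matches` helper is hypothetical. Note that `\\s` in the JSON file is a JSON escape for the regular expression `\s`.

```python
import re


def regexp_matches(patterns, label):
    """Return True if any pattern matches any fragment of the label."""
    return any(re.search(pattern, label) for pattern in patterns)


patterns = ["Windows 9[58]", r"^[0-9]\s*.*"]
for label in ["Microsoft Windows 98 drivers", "3 reasons to upgrade", "Windows 2000"]:
    print(label, "->", regexp_matches(patterns, label))
```

The first label matches because the fragment Windows 98 appears inside it. The second matches the digit-anchored pattern, and the third matches neither. To require a match of the whole label, anchor the expression with `^` and `$`.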