Dictionaries

Lingo3G uses dictionaries to improve the quality of clustering for a specific language. This section introduces the available dictionaries and the common syntax of dictionary entries.

Types of dictionaries

Lingo3G supports the following dictionaries:

Label dictionary

Controls the labels Lingo3G chooses to describe clusters. You can use the label dictionary to remove certain labels, such as domain-common phrases or abusive language. You can also use the label dictionary to promote other labels, such as product or brand names.

Synonym dictionary

Defines words and phrases that represent the same concept, such as photo and picture. Lingo3G uses these definitions to put documents containing synonymous words or phrases in the same cluster.

Tag dictionary

Categorizes individual words based on their grammatical or semantic function, such as preposition, verb or proper name. You can reference the categories in the label dictionary to, for example, remove labels ending in a preposition, such as information about.

Scopes of dictionaries

Lingo3G applies dictionaries of all types in two complementary scopes:

Global

Global dictionaries apply to all clustering requests. Lingo3G ships with basic global dictionaries suitable for most types of documents.

In typical use cases, the global dictionaries are static: Lingo3G loads them once on startup and reuses them for all clustering requests.

Per-request

Per-request dictionaries apply to a specific clustering request in addition to the global dictionaries. You can use per-request dictionaries to extend global dictionaries with temporary entries provided by end users at runtime.

Lingo3G supports per-request dictionaries both in the REST API and in the Java API.

Location of dictionary files

Document Clustering Server

The Document Clustering Server (REST API) reads dictionary files from the web/service/resources directory in the server's distribution directory. After you edit any of the files, restart the Document Clustering Server for the changes to take effect.

Java API

By default, the Lingo3G Java API tries to read dictionaries from the JAR file that supplies the corresponding LanguageComponents implementation. The following table lists the locations of dictionary files for specific languages, including the name of the JAR file and the path within the JAR.

Language JAR file JAR path
English lingo3g-2.0.0.jar /
Dutch lingo3g-lang-dutch-2.0.0.jar /
Polish lingo3g-lang-polish-2.0.0.jar /
Danish, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish lingo3g-lang-carrot2-2.0.0.jar /
Arabic, Armenian, Bulgarian, Croatian, Czech, Estonian, Galician, Greek, Hindi, Indonesian, Irish, Latvian, Lithuanian, Thai lingo3g-lang-lucene-2.0.0.jar /
Chinese simplified, Chinese traditional, Japanese, Korean lingo3g-lang-lucene-cjk-2.0.0.jar /

Location of dictionary files in the standard Lingo3G distribution.

Heads up, don't edit dictionaries inside JARs!

If you use Lingo3G Java API, don't modify the default dictionaries in-place. Instead, copy the relevant dictionaries to your application-specific location and load them from that location.

Label matchers

Label and synonym dictionaries use a common syntax based on label matcher entries. A matcher entry describes how Lingo3G decides whether to apply the dictionary-specific action to a given word or phrase. For label dictionaries, the action is removing or promoting the matching label. For synonym dictionaries, the action is including the matching label in the synonym set.

The following excerpt shows the available label matchers.

{
  "exact": [
    "Literal case-sensitive match"
  ],
  "glob": [
    "starts with phrase *",
    "* contains words *"
  ],
  "regexp": [
    "(?).+BadLabel.+",
    "(?)^[0-9]\\s*.*"
  ]
}

An example of three different label matcher types.

The exact, glob and regexp properties are optional and can contain an array of string entries for the specific matcher types, described in the following sections. If a word or a label matches any matcher of any type, Lingo3G applies the dictionary-specific action to the matching label.

Exact matcher

Exact matchers require exact, case-sensitive equality between the word or phrase and the dictionary entry. Exact matcher entries are fast to parse and very fast to apply during clustering.

{
  "exact": [
    "DevOps",
    "Windows 2000"
  ]
}

An example label dictionary with two exact matcher entries.

The above label dictionary definition matches the labels DevOps and Windows 2000, but does not match Devops or Windows 2000 machine.

For case-insensitive matching, use glob matchers (preferably) or case-insensitive regular expression matchers.
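The difference between exact and case-insensitive comparison can be illustrated in plain Java. The sketch below is not Lingo3G code: the class and method names are hypothetical, and it only models the comparison semantics described above, using case-sensitive equality for exact entries and ROOT-locale lowercasing for the case-insensitive variant.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustration only: hypothetical helpers, not part of the Lingo3G API.
final class ExactMatchSketch {
  // Case-sensitive equality against each entry, as described for exact matchers.
  static boolean exactMatch(List<String> entries, String label) {
    return entries.contains(label);
  }

  // Case-insensitive comparison using language-neutral (ROOT locale) lowercasing.
  static boolean caseInsensitiveMatch(List<String> entries, String label) {
    String folded = label.toLowerCase(Locale.ROOT);
    for (String entry : entries) {
      if (entry.toLowerCase(Locale.ROOT).equals(folded)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> entries = Arrays.asList("DevOps", "Windows 2000");
    System.out.println(exactMatch(entries, "DevOps"));               // true
    System.out.println(exactMatch(entries, "Devops"));               // false
    System.out.println(exactMatch(entries, "Windows 2000 machine")); // false
    System.out.println(caseInsensitiveMatch(entries, "Devops"));     // true
  }
}
```

In an actual dictionary, the case-insensitive behavior would come from a glob or regexp entry rather than from application code.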

Glob matcher

Glob matcher allows simple word-based wildcard matching. The primary use case of the glob matcher is case-insensitive matching of literal phrases, as well as "begins with…", "ends with…" or "contains…" types of expressions. Glob matcher entries are fast to parse and very fast to apply.

{
  "glob": [
    "more information",
    "more information *",
    "* about *",
    "big ?",
    "+ apple"
  ]
}

An example label dictionary with glob matcher entries.

Matching rules

  • Each entry must consist of one or more space-separated tokens.

  • A token is a sequence of arbitrary characters, such as words, numbers, identifiers.

  • Matching of unquoted tokens is case-, accent- and grammatical-form-insensitive.

  • Matching of quoted tokens is literal: case- and grammatical-form-sensitive.

    For example, "Rating***" matches only the Rating*** string, exactly comparing the case, grammatical form and occurrences of special characters. Glob matcher allows the * character in quoted tokens and matches them literally, not as a wildcard.

  • To include quote characters in the token, escape them with the \ character, for example: \"information\".

  • Glob matcher recognizes the following wildcard-matching tokens:

    • ? matches exactly one (any) word.

    • * matches zero or more words.

    • + matches one or more words. This token is functionally equivalent to: ? *.

    The * and + wildcards are possessive in the regular expression matching sense: they match the maximum sequence of tokens until the next token in the pattern. These wildcards will be suitable in most label matching scenarios. In rare cases, you may need to use the reluctant wildcards.

  • Glob matcher recognizes the following reluctant wildcard-matching tokens:

    • *? matches zero or more words (reluctant).

    • +? matches one or more words (reluctant). This token is functionally equivalent to: ? *?.

    The reluctant wildcards match the minimal sequence of tokens until the next token in the pattern.

  • Glob matcher recognizes the following word category tokens:

    • {fnc}: matches a function word (about, have)
    • {verb}: matches a verb (have, allows)
    • {noun}: matches a noun (website, test)
    • {adj}: matches an adjective (cool)
    • {adv}: matches an adverb (fully)
    • {geo}: matches a geographical reference (London)
    • {name}: matches a proper noun (John)
    • numeric: matches numeric values (10.4, 2021, 14:30, 3/4)

    The specific set of words each category token will match depends on the language in which you perform clustering and on the contents of the tag dictionaries. Note that category-based matching may be incorrect for some inputs due to the simplicity of Lingo3G's word tagger.

  • Glob matcher imposes the following restrictions on wildcard operators:

    • You cannot use wildcards (*, +) to express string prefixes or suffixes. For example, programm* is not supported.

    • Glob matcher does not support greedy operators.
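The wildcard rules above can be modeled with a small word-level matcher. The following Java sketch is an assumption-laden illustration, not Lingo3G's implementation: the GlobSketch class is hypothetical, it compares words case-insensitively only, and it omits quoted tokens, category tokens, and accent and grammatical form normalization. It interprets "possessive" as described above: the wildcard extends to the last position where the next pattern token can start and never gives words back.

```java
import java.util.Locale;

// Illustration only: a hypothetical model of the wildcard rules, not Lingo3G code.
final class GlobSketch {
  static boolean matches(String pattern, String label) {
    return match(pattern.trim().split("\\s+"), 0, label.trim().split("\\s+"), 0);
  }

  private static boolean isWildcard(String t) {
    return t.equals("*") || t.equals("+") || t.equals("*?") || t.equals("+?");
  }

  // True if pattern token p[i] could begin matching at word position k.
  private static boolean canStart(String[] p, int i, String[] w, int k) {
    if (i == p.length) return k == w.length; // end of pattern requires end of label
    if (isWildcard(p[i])) return true;
    if (k >= w.length) return false;
    return p[i].equals("?")
        || p[i].toLowerCase(Locale.ROOT).equals(w[k].toLowerCase(Locale.ROOT));
  }

  private static boolean match(String[] p, int i, String[] w, int j) {
    if (i == p.length) return j == w.length;
    String t = p[i];
    if (!isWildcard(t)) {
      // "?" or a plain word: consume exactly one word.
      return canStart(p, i, w, j) && match(p, i + 1, w, j + 1);
    }
    int min = t.startsWith("+") ? 1 : 0;
    if (t.endsWith("?")) {
      // Reluctant: try the shortest run first and extend it only when needed.
      for (int k = j + min; k <= w.length; k++) {
        if (match(p, i + 1, w, k)) return true;
      }
      return false;
    }
    // Possessive: take the longest run up to the next token and never back off.
    for (int k = w.length; k >= j + min; k--) {
      if (canStart(p, i + 1, w, k)) return match(p, i + 1, w, k);
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(matches("+ information", "too much information"));        // true
    System.out.println(matches("+ information", "information"));                 // false
    System.out.println(matches("* protein protein *", "protein protein"));       // false
    System.out.println(matches("*? protein protein *", "protein protein"));      // true
  }
}
```

The last two calls show why a doubled word needs the reluctant variant: the possessive * runs up to the last occurrence of protein, which leaves nothing for the second protein token in the pattern.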

Heads up, normalization of unquoted tokens!

Glob matcher normalizes the case, accents and grammatical form of unquoted tokens before trying to match them against labels. For example, under default settings and the English language, the token naïve matches, among others, the following labels: naïve, naive, naïvely or NAIVELY.

Glob matcher normalizes character case based on the ROOT Java locale, which performs language-neutral case conflation according to Unicode rules.

Grammatical form and accent normalization depend on the related Lingo3G parameters: useHeuristicStemming, useBuiltInWordDatabaseForStemming and accentFolding. By default, both grammatical form and accent normalization are enabled.
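To make the case and accent normalization more concrete, the following Java sketch folds case with the ROOT locale and strips combining accent marks using the standard java.text.Normalizer class. This is only an approximation under stated assumptions: the NormalizationSketch class is hypothetical, Lingo3G's actual accent folding may work differently, and grammatical form normalization (stemming) is not modeled at all.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustration only: hypothetical helpers, not part of the Lingo3G API.
final class NormalizationSketch {
  // Language-neutral case folding based on the ROOT locale.
  static String foldCase(String token) {
    return token.toLowerCase(Locale.ROOT);
  }

  // Decomposes accented characters and removes the combining marks.
  static String foldAccents(String token) {
    String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{M}+", "");
  }

  static String normalize(String token) {
    return foldAccents(foldCase(token));
  }

  public static void main(String[] args) {
    // \u00CF is the uppercase I with diaeresis, as in NAIVE written with an accent.
    System.out.println(normalize("NA\u00CFVE")); // naive
    System.out.println(normalize("Naive"));      // naive

    // A language-specific locale can give a different result: Turkish lowercasing
    // maps I to a dotless i, so the result is not "title".
    System.out.println("TITLE".toLowerCase(Locale.forLanguageTag("tr")).equals("title")); // false
    System.out.println("TITLE".toLowerCase(Locale.ROOT).equals("title"));                 // true
  }
}
```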

Example entries

The following examples show a number of glob entries. For each entry, the examples list strings that match and strings that do not match, with an explanation why there is no match.

All examples assume that the processing language is English and that normalization of grammatical forms and accents is enabled, which is the Lingo3G default.

Entry: more information

  Matching strings:
    • More information
    • MORE INFORMATION
    • more informations: informations is a different grammatical form of information.

  Non-matching strings:
    • more information about: Pattern does not contain wildcards, only 2-word strings can match.
    • some more information: Pattern does not contain wildcards, only 2-word strings can match.

Entry: more information *

  Matching strings:
    • more information
    • More information about
    • More information about a
    • more informations
    • more informations about

  Non-matching strings:
    • some more information: Pattern does not have wildcards at the beginning, matching strings must start with more information.

Entry: * information *

  Matching strings:
    • information
    • more information
    • information about
    • a lot more information on
    • informations
    • more informations about
    • some more informations

  Non-matching strings:
    • more info: info is not a different grammatical form of information, so it does not match.

Entry: naïve *

  Matching strings:
    • naive: naive is an accent-normalized variant of naïve.
    • naïvely: naïvely is a different grammatical form of naïve.
    • naive approach
    • Naively

  Non-matching strings:
    • naiive: Accent normalization does not correct spelling errors.

Entry: {fnc}

  Matching strings:
    • of
    • for

  Non-matching strings:
    • clustering: clustering is not a function word.

Entry: {adj} {noun}

  Matching strings:
    • fast algorithm

  Non-matching strings:
    • factorization algorithm: factorization is not an adjective.

Entry: to {verb} a {noun}

  Matching strings:
    • to visit a hotel

  Non-matching strings:
    • to plan a vacation: This string should also match the rule. It doesn't due to a limitation of the Lingo3G word categorization algorithm. Lingo3G categorizes all occurrences of plan as a noun, regardless of the context of the word. In this case, the context implies plan is a verb, but Lingo3G does not take this into account.

Entry: + information

  Matching strings:
    • too much information
    • more information

  Non-matching strings:
    • information: + requires at least one word before information.
    • more information about: about is an extra word not covered by the pattern.

Entry: "Information" *

  Matching strings:
    • Information
    • Information about
    • Information ABOUT

  Non-matching strings:
    • information: The Information token is case-sensitive, it does not match information.
    • information about: The Information token is case-sensitive, it does not match information.
    • Informations about: Normalization of grammatical forms does not apply to quoted tokens, so Informations does not match pattern token "Information".

Entry: data ?

  Matching strings:
    • data mining

  Non-matching strings:
    • data: ? requires a word after data.
    • data mining research: ? matches one word, so research does not match.

Entry: "Programm*"

  Matching strings:
    • Programm*

  Non-matching strings:
    • Programmer: The Programm* token is taken literally, it matches only Programm*.
    • Programming: The Programm* token is taken literally, it matches only Programm*.

Entry: \"information\"

  Matching strings:
    • "information"
    • "INFOrmation": Escaped quotes are taken literally, so the match is case-insensitive.

  Non-matching strings:
    • information: Escaped quotes not found in the string being matched.
    • "information: Escaped quotes not found in the string being matched.

Entry: * protein protein *

  This pattern will never match any input.

  The reason for this is that * makes a possessive match, that is, it matches the maximum number of words until the next token in the pattern. Therefore, the first occurrence of the protein token in the pattern will correspond to the last occurrence of that word in the input label, leaving no content to match the second occurrence of protein in the pattern. As a result, there is no sequence that can ever match this pattern.

  To match labels with a doubled occurrence of some word, use the reluctant variant of the wildcard.

Entry: *? protein protein *

  Matching strings:
    • protein protein
    • selective protein protein interaction
    • protein protein protein

  Non-matching strings:
    • protein: Two occurrences of protein on input are required.
    • selective protein-protein interaction: The "protein-protein" string counts as one token and therefore does not match the two-token protein protein part of the pattern.

Entry: programm*

  Illegal pattern, combinations of the * wildcard and other characters are not supported.

Entry: "information

  Illegal pattern, unbalanced double quotes.

Entry: *

  Illegal pattern, there must be at least one non-wildcard token.

Regular expression matcher

The regular expression matcher checks words or labels against a list of regular expressions you provide.

{
  "regexp": [
    "Windows 9[58]",
    "(?)^[0-9]\\s*.*"
  ]
}

An example label dictionary with two regexp matcher entries.

The regular expressions must follow Java syntax. If any fragment of a label matches any regular expression in the dictionary, Lingo3G will apply the dictionary-specific action to the label.
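The "any fragment" rule corresponds to the difference between find() and matches() in java.util.regex. The sketch below is an illustration under assumptions, not Lingo3G code: the RegexpMatchSketch class is hypothetical, and the second expression is the dictionary example above without its leading (?) group.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: a hypothetical helper, not part of the Lingo3G API.
final class RegexpMatchSketch {
  // True if any fragment of the label matches any of the expressions.
  static boolean anyFragmentMatches(List<Pattern> patterns, String label) {
    for (Pattern pattern : patterns) {
      if (pattern.matcher(label).find()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Pattern> patterns = Arrays.asList(
        Pattern.compile("Windows 9[58]"),
        Pattern.compile("^[0-9]\\s*.*"));

    System.out.println(anyFragmentMatches(patterns, "Installing Windows 98 drivers")); // true
    System.out.println(anyFragmentMatches(patterns, "7 wonders"));                     // true
    System.out.println(anyFragmentMatches(patterns, "Windows 2000"));                  // false

    // matches() requires the whole label to match, so it rejects the first label.
    System.out.println(
        Pattern.compile("Windows 9[58]").matcher("Installing Windows 98 drivers").matches()); // false
  }
}
```

Because a fragment match is enough, an expression that must cover the whole label needs explicit ^ and $ anchors.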

Heads up, performance impact!

Regular expression-based matching is a powerful mechanism, but it can also result in a dramatic decrease of clustering performance. Therefore, use it only when a similar effect cannot be achieved by a reasonable number of exact and glob matching entries.