Types of resources

Dictionaries and other language resources help Lingo3G to improve the quality of clustering for a specific language.

Lingo3G supports the following language resources:

Stemmers

Both exact and heuristic word conflation algorithms are provided.

Tag (word category) dictionaries

Lingo3G uses tags to determine the category of each word, then uses this information to filter label candidates, which improves clustering.

Tags bear some resemblance to part-of-speech categories.

Label filters

This dictionary defines rules for applying weights to a set of candidate cluster labels (words or phrases). You can use label filtering to suppress offensive or non-informative labels, or to boost certain phrases so that they are more likely to become clusters.

Synonyms

Synonyms provide a hint to Lingo3G that a set of words or phrases represents the same thing, so any occurrence of a member of that set should be treated as synonymous with the others during clustering.
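The effect of label filters and synonyms described above can be illustrated with a small conceptual sketch. It does not use the Lingo3G API or its dictionary file formats. The class name, the method names, the example entries and the weight values are all hypothetical, chosen only to show the idea of weighting candidate labels and mapping synonym group members to one shared form.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Conceptual sketch only: none of these names come from the Lingo3G API.
class ResourceSketch {
    // Hypothetical label filter rules: lowercase label text -> weight multiplier.
    // A weight of 0.0 suppresses a label; a weight above 1.0 boosts it.
    static final Map<String, Double> LABEL_WEIGHTS = Map.of(
        "click here", 0.0,
        "data mining", 2.0);

    // Hypothetical synonym groups: every member maps to the group's first entry.
    static final Map<String, String> SYNONYMS = buildSynonyms(List.of(
        List.of("notebook", "laptop"),
        List.of("tv", "television")));

    static Map<String, String> buildSynonyms(List<List<String>> groups) {
        Map<String, String> canonical = new LinkedHashMap<>();
        for (List<String> group : groups) {
            for (String member : group) {
                canonical.put(member, group.get(0));
            }
        }
        return canonical;
    }

    // Returns the shared form for a term, or the lowercased term if it has no group.
    static String canonicalForm(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        return SYNONYMS.getOrDefault(key, key);
    }

    // Multiplies a candidate label's score by its rule weight (1.0 if no rule matches).
    static double adjustedScore(String label, double score) {
        return score * LABEL_WEIGHTS.getOrDefault(label.toLowerCase(Locale.ROOT), 1.0);
    }

    public static void main(String[] args) {
        System.out.println(canonicalForm("Laptop"));           // notebook
        System.out.println(adjustedScore("Click here", 3.5));  // 0.0
        System.out.println(adjustedScore("Data Mining", 1.5)); // 3.0
    }
}
```

In this sketch a suppressed label ends up with a score of zero and can never win, while a boosted label competes with a higher score. How Lingo3G actually matches rules and combines weights is defined by its own resource formats, not by this example.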

The default resources shipped with the Lingo3G Java API are embedded inside JAR files. For production use, extend these defaults with content suited to your document domain to improve clustering quality. The Java API supports loading language resources from custom locations. In the DCS, resources are shipped as top-level assets and can be modified in place.

In the following sections we will explain the structure of each type of customizable language resource.