Dictionaries
Lingo3G uses dictionaries to improve the quality of clustering for a specific language. This article shows how to customize dictionaries in the Java API.
Lingo3G delegates the management of dictionaries and other language resources to the LanguageComponent instance. You can use its methods to list the available languages, customize the location of dictionaries or even replace Lingo3G's built-in language resources, such as stemmers. This article shows the basic use cases related to dictionary customization.
Customizing dictionaries
The simplest way to customize Lingo3G dictionaries is to copy the default dictionaries to your application-specific location, make the necessary changes and provide a custom ResourceLookup implementation when loading language resources. The following example loads English resources from a class-relative classpath location.
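To make the term "class-relative" concrete, here is a minimal sketch that uses only the JDK. It is not the Lingo3G ResourceLookup interface: the class ClassRelativeLookupSketch and its exists and open methods are hypothetical stand-ins that only illustrate how a resource name resolves against the package of an anchor class.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for illustration; not Lingo3G's ResourceLookup. */
class ClassRelativeLookupSketch {
  private final Class<?> anchor;

  ClassRelativeLookupSketch(Class<?> anchor) {
    this.anchor = anchor;
  }

  /** Returns true if the named resource is visible next to the anchor class. */
  boolean exists(String resource) {
    return anchor.getResource(resource) != null;
  }

  /** Opens the resource relative to the anchor class, failing loudly if it is missing. */
  InputStream open(String resource) throws IOException {
    InputStream in = anchor.getResourceAsStream(resource);
    if (in == null) {
      throw new IOException(
          "Resource not found relative to " + anchor.getName() + ": " + resource);
    }
    return in;
  }

  public static void main(String[] args) throws IOException {
    // Object.class is used as the anchor only because it is always available.
    ClassRelativeLookupSketch lookup = new ClassRelativeLookupSketch(Object.class);
    System.out.println("Object.class found: " + lookup.exists("Object.class"));
    try (InputStream in = lookup.open("Object.class")) {
      System.out.println("first byte readable: " + (in.read() >= 0));
    }
  }
}
```

In an application, the anchor would presumably be one of your own classes whose package holds the copied and modified dictionary files.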
Using ephemeral dictionaries
You can provide extra ephemeral dictionary entries for a specific clustering request. Lingo3G applies these extra entries in addition to the default dictionaries. For example, if the end user wants to remove specific labels from the clustering result they are currently viewing, your software can add such labels to the ephemeral label dictionary and rerun the clustering.
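The application-side part of the label-removal workflow can be sketched as follows. The HiddenLabels class is a hypothetical helper, not part of the Lingo3G API. It only collects the labels a user asked to remove and produces the entries your code would then copy into the ephemeral label dictionary before rerunning the clustering.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical application-side helper; not part of the Lingo3G API. */
class HiddenLabels {
  // Insertion order is kept so that entries are stable between requests.
  private final Set<String> hidden = new LinkedHashSet<>();

  /** Records a label to remove; returns false for blank or already hidden labels. */
  boolean hide(String label) {
    String normalized = label.trim();
    return !normalized.isEmpty() && hidden.add(normalized);
  }

  /** Makes a previously hidden label eligible again; returns false if it was not hidden. */
  boolean restore(String label) {
    return hidden.remove(label.trim());
  }

  /** Snapshot of the entries to pass along with the next clustering request. */
  String[] toEntries() {
    return hidden.toArray(new String[0]);
  }
}
```

Keeping one such object per user session means the hidden labels last only as long as the session and never touch the default dictionaries on disk.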
The dictionaries field of Lingo3GClusteringAlgorithm groups all ephemeral dictionaries. You can add entries to the label, synonym and tag dictionaries.
Label dictionaries
To add an ephemeral label dictionary, create a new instance of the LabelMatcher class and add entries to its glob, regex or exact fields. Then set the matcher instance on your Lingo3GClusteringAlgorithm instance.
The following example adds two entries to the glob matcher of the label dictionary.
By default, LabelMatcher instances assign a zero weight to the labels they match, removing them from the result. You can set the weight field of the matcher to a value larger than 1 to promote the labels.
Synonym dictionaries
To add an ephemeral synonym dictionary, do the following:
- For each set of synonymous labels, create a SynonymSet instance, providing the list of labels Lingo3G should treat as synonymous. Note that Lingo3G allows only glob-style label matchers in synonym definitions. Additionally, you can provide one label to represent all the synonymous labels in cluster labels.
- Set a list of synonym sets on the dictionaries.synonyms field of your Lingo3GClusteringAlgorithm instance.
The following example adds one entry that makes the software and tools labels synonymous and represents them as Tools & Software for cluster labeling purposes.
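To illustrate the intended effect of a synonym set, here is a deliberately simplified model in plain Java. It does not use or mirror the Lingo3G SynonymSet class. The SynonymModel class, its method names and the exact, case-sensitive matching are all assumptions made for illustration, whereas Lingo3G itself matches synonym entries with glob-style matchers inside the algorithm.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical conceptual model; not the Lingo3G SynonymSet class. */
class SynonymModel {
  private final Map<String, String> representativeByLabel = new HashMap<>();

  /** Registers labels that should all be shown under one representative label. */
  void addSet(String representative, List<String> labels) {
    for (String label : labels) {
      representativeByLabel.put(label, representative);
    }
  }

  /** Returns the representative of a registered label, or the label itself otherwise. */
  String displayLabel(String label) {
    return representativeByLabel.getOrDefault(label, label);
  }

  public static void main(String[] args) {
    SynonymModel model = new SynonymModel();
    model.addSet("Tools & Software", Arrays.asList("software", "tools"));
    // Both synonymous labels map to the same representative label.
    System.out.println(model.displayLabel("software"));
    System.out.println(model.displayLabel("tools"));
  }
}
```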
Tag dictionaries
To add an ephemeral tag dictionary, create one or more instances of the Tag class, each defining a list of words and the part-of-speech tag for those words. Then set the list of tags on your Lingo3GClusteringAlgorithm instance.
The following example adds the word whereas with the fnc tag.
Listing supported languages
The following code lists all languages supported by Lingo3G:
It prints:
Lingo3G supports:
Arabic, Armenian, Bulgarian, Chinese-Simplified, Chinese-Traditional, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish
Note that the code uses the limitToAlgorithms method to limit the list to the languages Lingo3G supports. The unfiltered list contains all languages defined in the Carrot2 framework; Lingo3G does not support some of those languages.