Dictionaries

Lingo3G uses dictionaries to improve the quality of clustering for a specific language. This article shows how to customize dictionaries in the Java API.

Lingo3G delegates the management of dictionaries and other language resources to the LanguageComponents class. You can use its methods to list the available languages, customize the location of dictionaries or even replace Lingo3G's built-in language resources, such as stemmers. This article shows the basic use cases related to dictionary customization.

Customizing dictionaries

The simplest way to customize Lingo3G dictionaries is to copy the default dictionaries to your application-specific location, make the necessary changes and provide a custom ResourceLookup implementation when loading language resources. The following example loads English resources from a class-relative classpath location.

ResourceLookup resourceLookup = new ClassRelativeResourceLookup(getClass());
LanguageComponents english =
    LanguageComponents.loader()
        // We restrict the loaded resources to just English.
        .limitToLanguages("English")
        // Note how we chain the overridden resource lookup location with the
        // provider's default resources. Otherwise we'd have to copy over
        // all resources from their default locations.
        .withResourceLookup(
            prov ->
                new ChainedResourceLookup(
                    List.of(resourceLookup, prov.defaultResourceLookup())))
        .load()
        .language("English");
Loading English language components from a class-relative classpath location.
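
The chaining idea used above can be illustrated without any Lingo3G classes. The following is a minimal, self-contained sketch, assuming a simplified lookup that maps resource names to text. The Lookup interface, the fromMap and chain helpers and the resource names are hypothetical and are not the Carrot2 ResourceLookup API. The sketch only shows why the order in the chain matters: the first lookup that has the resource wins, and later lookups serve as a fallback.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ChainedLookupSketch {
  /** Hypothetical stand-in for a resource lookup; NOT the Carrot2 ResourceLookup interface. */
  public interface Lookup {
    Optional<String> find(String resourceName);
  }

  /** Builds a lookup backed by an in-memory map (resource name -> content). */
  public static Lookup fromMap(Map<String, String> resources) {
    return name -> Optional.ofNullable(resources.get(name));
  }

  /** Chains lookups: delegates are asked in order and the first hit wins. */
  public static Lookup chain(List<Lookup> delegates) {
    return name -> {
      for (Lookup delegate : delegates) {
        Optional<String> hit = delegate.find(name);
        if (hit.isPresent()) {
          return hit;
        }
      }
      return Optional.empty();
    };
  }

  public static void main(String[] args) {
    // Hypothetical resource names and contents, for illustration only.
    Lookup custom = fromMap(Map.of("labels.json", "customized label dictionary"));
    Lookup defaults =
        fromMap(
            Map.of(
                "labels.json", "default label dictionary",
                "stemmer.dat", "default stemmer data"));

    // The customized location comes first, the defaults act as a fallback.
    Lookup chained = chain(List.of(custom, defaults));
    System.out.println(chained.find("labels.json").orElse("<missing>"));
    System.out.println(chained.find("stemmer.dat").orElse("<missing>"));
  }
}
```

With this ordering, only the resources you actually changed need to exist in the custom location. Everything else resolves from the defaults.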

Using ephemeral dictionaries

You can provide extra ephemeral dictionary entries for a specific clustering request. Lingo3G applies these extra entries as an addition to the default dictionaries. For example, if the end-user wants to remove specific labels from the clustering result they are currently viewing, your software can add such labels to the ephemeral label dictionary and rerun the clustering.

The dictionaries field of Lingo3GClusteringAlgorithm groups all ephemeral dictionaries. You can add entries to the label, synonym and tag dictionaries.

Label dictionaries

To add an ephemeral label dictionary, create a new instance of the LabelMatcher class and add entries to its glob, regex or exact fields. Then set a list containing the matcher on the dictionaries.labels field of your Lingo3GClusteringAlgorithm instance.

The following example adds two entries to the glob matcher of the label dictionary.

Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();

// Create an ephemeral label filter that excludes two glob patterns.
LabelMatcher dictionary = new LabelMatcher();
dictionary.glob.set("* mining *", "* data-mining *");

// Configure Lingo3G to use the ephemeral filters.
algorithm.dictionaries.labels.set(List.of(dictionary));

Adding entries to the ephemeral label dictionary in the Java API.

By default, LabelMatcher instances assign a zero weight to the labels they match, removing them from the result. You can set the weight field of the matcher to a value larger than 1 to promote the labels.
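
To make the effect of the weight concrete, here is a minimal, self-contained sketch. It does not use Lingo3G classes and does not reproduce Lingo3G's scoring. It assumes, purely for illustration, that a matcher's weight multiplies a label's score and that labels with a zero score are dropped. The LabelWeightSketch class, the applyWeight method and the single-word matching rule are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class LabelWeightSketch {
  /**
   * Multiplies the score of every label containing the given lower-case word by the
   * weight and drops labels whose resulting score is zero. This multiplicative model
   * is an assumption made for illustration, not Lingo3G's actual algorithm.
   */
  public static Map<String, Double> applyWeight(
      Map<String, Double> scores, String word, double weight) {
    Map<String, Double> result = new LinkedHashMap<>();
    for (Map.Entry<String, Double> entry : scores.entrySet()) {
      String[] words = entry.getKey().toLowerCase(Locale.ROOT).split("\\s+");
      boolean matches = Arrays.asList(words).contains(word);
      double score = matches ? entry.getValue() * weight : entry.getValue();
      if (score > 0) {
        result.put(entry.getKey(), score);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical labels and scores.
    Map<String, Double> scores = new LinkedHashMap<>();
    scores.put("Data Mining", 2.0);
    scores.put("Web Search", 1.0);

    // Zero weight: the matching label disappears from the result.
    System.out.println(applyWeight(scores, "mining", 0.0));
    // Weight above 1: the matching label is promoted.
    System.out.println(applyWeight(scores, "search", 3.0));
  }
}
```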

Synonym dictionaries

To add an ephemeral synonym dictionary, do the following:

  1. For each set of synonymous labels, create a SynonymSet instance, providing the list of labels Lingo3G should treat as synonymous. Note that Lingo3G allows only glob-style label matchers in synonym definitions.

    Additionally, you can provide one label to represent all the synonymous labels in cluster labels.

  2. Set a list of synonym sets on the dictionaries.synonyms field of your Lingo3GClusteringAlgorithm instance.

The following example adds one entry that makes the software and tools labels synonymous and represents them as Tools & Software for cluster labeling purposes.

Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();

// Create an ephemeral synonym dictionary.
List<String> synonyms = List.of("software", "tools");
SynonymSet synonymSet = new SynonymSet("Tools & Software", synonyms);

// Configure Lingo3G to use the extra synonyms.
algorithm.dictionaries.synonyms.set(List.of(synonymSet));

Adding entries to the ephemeral synonym dictionary in the Java API.

Tag dictionaries

To add an ephemeral tag dictionary, create one or more instances of the Tag class, each defining a part-of-speech tag and the list of words that tag applies to. Then set the list of tags on the dictionaries.tags field of your Lingo3GClusteringAlgorithm instance.

The following example adds the word whereas with the fnc tag.

Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();

// Create an ephemeral tag dictionary.
Tag fncTag = new Tag("fnc", List.of("whereas"));

// Configure Lingo3G to use the extra tags.
algorithm.dictionaries.tags.set(List.of(fncTag));

Adding entries to the ephemeral word tag dictionary in the Java API.

Listing supported languages

The following code lists all languages supported by Lingo3G:

Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();
Set<String> supported =
    LanguageComponents.loader()
        // We can load language resources only for the given algorithm(s)
        // which also limits the returned set of supported languages.
        .limitToAlgorithms(algorithm)
        .load()
        .languages();
System.out.println("Lingo3G supports:\n  " + String.join(", ", supported));

Listing all supported languages in the Java API.

It prints:

Lingo3G supports:
  Arabic, Armenian, Bulgarian, Chinese-Simplified, Chinese-Traditional, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish

Note that the code uses the limitToAlgorithms method to limit the list to the languages Lingo3G supports. The unfiltered list contains all languages defined in the Carrot2 framework; Lingo3G does not support some of those languages.