Languages

This section describes language support in the Lingo3G Java API.

Lingo3G language support is centered around the Carrot2 LanguageComponents class. Instances of this class provide assistance and hints that improve the quality of clustering for a specific language. The resources behind them are typically costly to load and parse, so LanguageComponents instances should be created early and reused for all subsequent clustering calls.
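The "create early, reuse afterwards" advice can be sketched with plain Java, independent of Carrot2. The LoadOnce helper below is hypothetical (it is not part of Carrot2 or Lingo3G) and the string it returns is only a stand-in for the real, expensive-to-load language resources; it assumes the loader never returns null.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: runs an expensive loader at most once and caches the result. */
final class LoadOnce<T> implements Supplier<T> {
  private final Supplier<T> loader;
  private volatile T value; // null until the first successful load

  LoadOnce(Supplier<T> loader) {
    this.loader = loader;
  }

  @Override
  public T get() {
    T local = value;
    if (local == null) {
      synchronized (this) {
        local = value;
        if (local == null) {
          // Assumption: the loader returns a non-null value.
          local = loader.get();
          value = local;
        }
      }
    }
    return local;
  }
}

final class LoadOnceDemo {
  public static void main(String[] args) {
    AtomicInteger loads = new AtomicInteger();
    // Stand-in for the costly step of loading and parsing language resources.
    LoadOnce<String> resources =
        new LoadOnce<>(() -> "resources #" + loads.incrementAndGet());

    // Several "clustering calls" share the same loaded instance.
    for (int request = 0; request < 3; request++) {
      System.out.println("request " + request + " uses " + resources.get());
    }
    System.out.println("loader invoked " + loads.get() + " time(s)");
  }
}
```

In an application, the supplier passed to such a helper would wrap the actual loading call, and the cached instance would be handed to every clustering request.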

A single LanguageComponents instance provides a number of resources that help the algorithm improve clustering quality:

  • lemmatisation (stemming) routines,
  • tokenisation (word decomposition) and decompounding routines,
  • tag dictionaries,
  • ignore lists (stop word lists).

Lingo3G comes with implementations and a default set of lexical resources for several languages. The syntax and function of lexical resources are described later on.

List supported languages

To list all languages available via the Carrot2 infrastructure, we can use a utility method on the LanguageComponents class:

System.out.println(
    "Language components for the following languages are available:\n  "
        + String.join(", ", LanguageComponents.loader().load().languages()));

which prints:

Language components for the following languages are available:
  Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish

Not every Carrot2 algorithm is required to support every language, so Lingo3G may support only a subset of the languages listed above. To load only the languages Lingo3G supports, add a restriction to the component loader:

Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();
Set<String> supported =
    LanguageComponents.loader()
        // We can load language resources only for the given algorithm(s)
        // which also limits the returned set of supported languages.
        .limitToAlgorithms(algorithm)
        .load()
        .languages();
System.out.println("Lingo3G supports:\n  " + String.join(", ", supported));

which prints:

Lingo3G supports:
  Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish

Customizing resource locations

By default, the language component loader looks up each algorithm's language resources in that algorithm's JAR. In most practical applications these resources need to be fine-tuned for a particular document domain to improve clustering quality. You can copy the default resources out of the Lingo3G JAR (the DCS service in the distribution package also contains an unpacked, explicit set of the required resources), modify them, and instruct the loader to use the new resources for the algorithm.

In this example (named E03_Languages in the distribution), we point at resources relative to the code's class:

ResourceLookup resourceLookup = new ClassRelativeResourceLookup(getClass());
LanguageComponents english =
    LanguageComponents.loader()
        // We restrict the loaded resources to just English.
        .limitToLanguages("English")
        // Note how we chain the overridden resource lookup location with the
        // provider's default resources. Otherwise we'd have to copy over
        // all resources from their default locations.
        .withResourceLookup(
            prov ->
                new ChainedResourceLookup(
                    List.of(resourceLookup, prov.defaultResourceLookup())))
        .load()
        .language("English");
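The chaining idea in the snippet above can be illustrated in isolation. The sketch below is not Carrot2's ChainedResourceLookup; it is a hypothetical, standard-library-only model that assumes earlier entries in the chain take precedence and later ones act as a fallback. The resource names and contents are made up for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical first-match-wins chain of lookups (an illustration, not Carrot2 code). */
final class FirstMatchLookup implements Function<String, Optional<String>> {
  private final List<Function<String, Optional<String>>> chain;

  FirstMatchLookup(List<Function<String, Optional<String>>> chain) {
    this.chain = List.copyOf(chain);
  }

  /** Returns the resource from the first lookup in the chain that has it. */
  @Override
  public Optional<String> apply(String name) {
    for (Function<String, Optional<String>> lookup : chain) {
      Optional<String> hit = lookup.apply(name);
      if (hit.isPresent()) {
        return hit;
      }
    }
    return Optional.empty();
  }

  /** Adapts an in-memory map to a lookup; stands in for a real resource location. */
  static Function<String, Optional<String>> fromMap(Map<String, String> resources) {
    return name -> Optional.ofNullable(resources.get(name));
  }
}

final class FirstMatchLookupDemo {
  public static void main(String[] args) {
    // Made-up resource names and contents, for illustration only.
    Map<String, String> overrides = Map.of("stopwords", "custom stop words");
    Map<String, String> defaults =
        Map.of("stopwords", "default stop words", "stemmer", "default stemmer data");

    FirstMatchLookup lookup =
        new FirstMatchLookup(
            List.of(FirstMatchLookup.fromMap(overrides), FirstMatchLookup.fromMap(defaults)));

    System.out.println(lookup.apply("stopwords").orElse("<missing>")); // overridden
    System.out.println(lookup.apply("stemmer").orElse("<missing>"));   // falls back
    System.out.println(lookup.apply("tags").orElse("<missing>"));      // in neither
  }
}
```

Under this model, only the resources that actually need tuning have to be copied to the override location; everything else is still served from the defaults.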

Because the overridden resources are mostly empty, the algorithm won't know which words constitute stop words or ignorable phrases, resulting in very noisy output:

Data Mining Is [docs: 0, score: 1.00]
  Process [docs: 7, score: 1.00]
  Can [docs: 6, score: 0.97]
  Data Mining Software, Data-mining Software [docs: 4, score: 0.91]
  For Relationships [docs: 3, score: 0.85]
  Developing [docs: 3, score: 0.84]
  Field [docs: 3, score: 0.84]
  The Current [docs: 2, score: 0.79]
  Analysis [docs: 2, score: 0.78]
  Data Mining Applications [docs: 2, score: 0.77]
  And Business [docs: 2, score: 0.77]
  Home [docs: 2, score: 0.76]
  Success [docs: 2, score: 0.76]
Knowledge Discovery [docs: 0, score: 0.92]
  Page [docs: 4, score: 1.00]
  Information [docs: 4, score: 0.99]
  Software [docs: 4, score: 0.99]
  Application Areas [docs: 2, score: 0.89]
  Process [docs: 2, score: 0.87]
  Data Mining By [docs: 2, score: 0.86]
  What Is Data Mining [docs: 4, score: 0.73]
Techniques [docs: 10, score: 0.85]
Text Mining [docs: 8, score: 0.85]
Data Mining Solutions [docs: 7, score: 0.81]
Data Mining Group [docs: 7, score: 0.78]
Data Mining Technology [docs: 5, score: 0.76]
Data Management [docs: 5, score: 0.76]
Web [docs: 5, score: 0.75]
Oracle Data Mining [docs: 4, score: 0.74]
Predictive Modeling [docs: 4, score: 0.74]
Data Warehousing [docs: 3, score: 0.69]
Data Mining Institute [docs: 3, score: 0.69]
Analytic [docs: 3, score: 0.67]
Statistical [docs: 3, score: 0.67]
Including [docs: 3, score: 0.66]
Microsoft SQL Server [docs: 2, score: 0.64]
Data Mining Project [docs: 2, score: 0.64]
Predictive Analytics [docs: 2, score: 0.64]
Be Found [docs: 2, score: 0.62]
Data Miners [docs: 2, score: 0.62]
Data Mining Resources [docs: 2, score: 0.62]
Information from Large [docs: 2, score: 0.62]
The Free Encyclopedia [docs: 2, score: 0.62]
Call for [docs: 2, score: 0.62]
Association [docs: 2, score: 0.62]
Case [docs: 2, score: 0.61]
SPSS [docs: 2, score: 0.60]
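To see why an ignore list matters for label quality, consider the toy filter below. It is a deliberately naive sketch and not how Lingo3G processes stop words internally: the StopWordLabelFilter class, its rule (reject a label whose first or last word is a stop word), and the short word list are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical toy filter: rejects labels that start or end with a stop word. */
final class StopWordLabelFilter {
  private final Set<String> stopWords;

  StopWordLabelFilter(Set<String> stopWords) {
    // Normalize to lower case so matching is case-insensitive.
    this.stopWords =
        stopWords.stream().map(w -> w.toLowerCase(Locale.ROOT)).collect(Collectors.toSet());
  }

  boolean isAcceptable(String label) {
    String[] words = label.trim().toLowerCase(Locale.ROOT).split("\\s+");
    return !stopWords.contains(words[0]) && !stopWords.contains(words[words.length - 1]);
  }

  List<String> filter(List<String> labels) {
    return labels.stream().filter(this::isAcceptable).collect(Collectors.toList());
  }
}

final class StopWordLabelFilterDemo {
  public static void main(String[] args) {
    // A few labels taken from the noisy output above.
    List<String> labels =
        List.of("Can", "For Relationships", "Text Mining", "Data Mining Is", "Predictive Modeling");

    // An empty ignore list lets every label through.
    System.out.println(new StopWordLabelFilter(Set.of()).filter(labels));

    // A small, made-up ignore list removes the labels built around function words.
    System.out.println(new StopWordLabelFilter(Set.of("can", "for", "is")).filter(labels));
  }
}
```

With the empty list every candidate survives, which mirrors the situation described above; with even a three-word list, only "Text Mining" and "Predictive Modeling" remain.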