Java API Basics

You can use the Lingo3G Java API to call clustering from your Java software. This article explains the basics of the API.

Dependency setup

To set up Lingo3G as a dependency in your project, perform the following steps:

  1. Install the Lingo3G JAR and its POM file in your local Maven repository by running:

    mvn install:install-file -Dfile=lib/lingo3g-*.jar -DpomFile=lib/lingo3g-*.pom

    Alternatively, install the dependency on your company's shared-artifact servers, such as Sonatype Nexus.

  2. Add a Lingo3G dependency to your project.

    • For Maven projects, add the following dependency:

      <dependency>
        <groupId>com.carrotsearch.lingo3g</groupId>
        <artifactId>lingo3g</artifactId>
        <version>2.0.0</version>
      </dependency>

      Lingo3G Maven dependency.

    • For Gradle projects, add the following dependency:

      dependencies {
        implementation("com.carrotsearch.lingo3g:lingo3g:2.0.0")
      }

      Lingo3G Gradle dependency

      (Here we use implementation as the configuration to attach the dependency to.)

      Alternatively, a Gradle dependency can just include the set of JAR files from under the distribution's lib/ folder:

      dependencies {
        implementation fileTree("lingo3g-distribution/lib").include("**/*.jar")
      }

API overview

The Java code invoking Lingo3G clustering needs to perform the following steps:

  1. Load and cache the reusable component – the LanguageComponents instance.

    Load LanguageComponents only once.

    LanguageComponents contains various linguistic resources, such as dictionaries and stemmers, and may take significant time to load. Load LanguageComponents once and reuse it for all clustering calls. The class is thread-safe, so share one instance between all processing threads.

  2. For each clustering call:

    1. Create the disposable component – the Lingo3GClusteringAlgorithm instance.

    2. Prepare input for clustering – a stream of Document instances.

    3. Perform clustering, reusing the LanguageComponents instance loaded previously.

The Basic clustering section shows code examples that implement the above steps.
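
The load-once advice in step 1 can be sketched with the initialization-on-demand holder idiom. This is only an illustration: ExpensiveResources is a hypothetical stand-in for LanguageComponents (so the sketch runs without Lingo3G on the classpath), and LOAD_COUNT exists purely to show that loading happens once. In real code, the holder field would store the loaded LanguageComponents instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: load an expensive, thread-safe resource once and share it. */
class SharedComponents {
  /** Counts loads so the example can show that loading happens only once. */
  public static final AtomicInteger LOAD_COUNT = new AtomicInteger();

  /** Hypothetical stand-in for LanguageComponents; not a Lingo3G class. */
  static final class ExpensiveResources {
    ExpensiveResources() {
      // Real code would load dictionaries, stemmers and similar resources here.
      LOAD_COUNT.incrementAndGet();
    }
  }

  /** The JVM initializes this holder lazily, exactly once, in a thread-safe way. */
  private static final class Holder {
    static final ExpensiveResources INSTANCE = new ExpensiveResources();
  }

  /** Returns the single shared instance, loading it on first use. */
  public static ExpensiveResources get() {
    return Holder.INSTANCE;
  }

  public static void main(String[] args) throws InterruptedException {
    // Several processing threads ask for the resources concurrently.
    Thread[] workers = new Thread[4];
    for (int i = 0; i < workers.length; i++) {
      workers[i] = new Thread(SharedComponents::get);
      workers[i].start();
    }
    for (Thread worker : workers) {
      worker.join();
    }
    System.out.println("Loads performed: " + LOAD_COUNT.get()); // prints "Loads performed: 1"
  }
}
```

The holder idiom avoids explicit locking. Any other approach that guarantees a single shared instance, such as a singleton managed by a dependency injection container, serves the same purpose.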

Basic clustering

This section shows how to invoke Lingo3G clustering, illustrating each step with code taken from the E01_ClusteringBasics.java example class. It shows just the key elements required to process a stream of documents in English, without any parameter or language resource tuning.

Make sure you install the license file before running any Java API code examples. For quick tests, you can copy the license file to your home directory.

Load reusable components

First, load the reusable components – the default language resources for the English language. The LanguageComponents instance is thread-safe, so you should reuse it for all subsequent clustering calls.

// Our documents are in English so we load appropriate language resources.
// This call can be heavy and an instance of LanguageComponents should be
// created once and reused across different clustering calls.
LanguageComponents languageComponents = LanguageComponents.loader().load().language("English");

Create disposable components

Next, create the disposable component – a Lingo3GClusteringAlgorithm instance:

Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();

Prepare documents

Next, assemble the input for clustering – a stream of Document instances. Each document must implement a single method that presents its clusterable text fields to the algorithm:

void visitFields(BiConsumer<String, String> fieldConsumer);

This example uses hardcoded values from an array defined in the ExamplesData class:

static final String[][] DOCUMENTS_DATA_MINING = {
  {
    "http://en.wikipedia.org/wiki/Data_mining",
    "Data mining -" + " Wikipedia, the free " + "encyclopedia",
    "Article about knowledge-discovery in databases (KDD), the practice "
        + "of automatically "
        + "searching large stores of data for patterns."
  },

Use Java streams for an easy way to convert the data into Document instances. Note that the code submits only the title and content fields for clustering. It omits the URL because it is not clusterable text.

// Create a stream of "documents" for clustering.
// Each such document provides text content fields to a visitor.
Stream<Document> documentStream =
    Arrays.stream(ExamplesData.DOCUMENTS_DATA_MINING)
        .map(
            fields ->
                (fieldVisitor) -> {
                  fieldVisitor.accept("title", fields[1]);
                  fieldVisitor.accept("content", fields[2]);
                });

Because the Document interface has a single method, you can implement it with a lambda expression.
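
To make the visitor contract more concrete, the sketch below shows what a consumer of visitFields() receives. It declares a local stand-in interface with the same method signature as shown earlier, so it runs without Lingo3G. The collectFields() helper and the sample values are made up for this illustration and are not part of the API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

class FieldVisitorSketch {
  /** Local stand-in with the same single method as the Document interface. */
  interface Document {
    void visitFields(BiConsumer<String, String> fieldConsumer);
  }

  /** Hypothetical helper: collects the fields a document exposes, in visiting order. */
  public static Map<String, String> collectFields(Document document) {
    Map<String, String> fields = new LinkedHashMap<>();
    document.visitFields(fields::put);
    return fields;
  }

  public static void main(String[] args) {
    // URL, title, content: the same column layout as the example data array.
    String[] row = {"http://example.com/", "A title", "Some content."};

    // The lambda exposes only the text fields; the URL at index 0 is never visited.
    Document document =
        visitor -> {
          visitor.accept("title", row[1]);
          visitor.accept("content", row[2]);
        };

    System.out.println(collectFields(document)); // prints "{title=A title, content=Some content.}"
  }
}
```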

Perform clustering

Finally, invoke clustering, passing the document stream and the language components instance. This example just prints each cluster's labels, document count, and score to the console:

List<Cluster<Document>> clusters;
clusters = algorithm.cluster(documentStream, languageComponents);

ExamplesCommon.printClusters(clusters);

When executed, this example should result in this output:

Knowledge Discovery [docs: 0, score: 1.00]
  Software [docs: 4, score: 1.00]
  Databases [docs: 3, score: 0.94]
  Application Areas [docs: 2, score: 0.89]
  Process [docs: 2, score: 0.87]
  Market [docs: 2, score: 0.86]
Data-mining Software [docs: 0, score: 0.98]
  Model [docs: 3, score: 1.00]
  Data Mining and Knowledge Discovery [docs: 4, score: 0.99]
  Data Mining Applications [docs: 2, score: 0.95]
  Development [docs: 2, score: 0.90]
  Field [docs: 2, score: 0.90]
Applications [docs: 0, score: 0.94]
  Application Areas [docs: 3, score: 1.00]
  Algorithms [docs: 2, score: 0.92]
  Data Mining Software [docs: 2, score: 0.91]
  SIAM International Conference on Data Mining [docs: 2, score: 0.90]
  Customers [docs: 2, score: 0.90]
  Standard [docs: 2, score: 0.90]
  Trends [docs: 2, score: 0.90]
Text Mining [docs: 8, score: 0.92]
Data Mining Tools [docs: 9, score: 0.92]
Techniques [docs: 10, score: 0.92]
Data Mining Technology [docs: 8, score: 0.89]
Data Mining Solutions [docs: 7, score: 0.88]
Conference [docs: 8, score: 0.88]
Data Mining Group [docs: 7, score: 0.86]
Book [docs: 6, score: 0.84]
Data Analysis [docs: 6, score: 0.83]
Practice [docs: 6, score: 0.83]
Data Management [docs: 5, score: 0.82]
Oracle Data Mining [docs: 4, score: 0.80]
Data Warehousing [docs: 3, score: 0.75]
Association [docs: 3, score: 0.73]
Microsoft SQL Server [docs: 2, score: 0.69]
Data Mining Project [docs: 2, score: 0.69]
Central Connecticut State University [docs: 2, score: 0.69]
Process of Extracting [docs: 2, score: 0.68]
CCSU [docs: 2, score: 0.65]
SPSS [docs: 2, score: 0.65]
Dan [docs: 2, score: 0.65]
Science [docs: 2, score: 0.65]
Success [docs: 2, score: 0.65]
Business Intelligence [docs: 3, score: 0.65]
Predictive Analytics [docs: 3, score: 0.65]

Iterating through results

The Cluster class comes with utility methods to retrieve a cluster's labels, score, documents, and subclusters. For example:

/** Returns all documents that belong directly to this cluster. */
public List<T> getDocuments() {
  return documents;
}

A method printing a cluster summary, along with subclusters, can look like this:

static <T> void printClusters(List<Cluster<T>> clusters, String indent) {
  for (Cluster<T> c : clusters) {
    System.out.printf(
        Locale.ROOT,
        indent + "%s [docs: %,d, score: %.2f]\n",
        String.join(", ", c.getLabels()),
        c.getDocuments().size(),
        c.getScore());
    printClusters(c.getClusters(), indent + "  ");
  }
}
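
Because getDocuments() returns only the documents that belong directly to a cluster, a cluster whose documents all sit in subclusters reports a count of zero. The sketch below gathers the distinct documents of a whole subtree. It uses a minimal stand-in Cluster class that exposes only getDocuments() and getClusters(), so it runs without Lingo3G. The allDocuments() helper is hypothetical, and the sample data assumes a document may appear in more than one subcluster.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class ClusterTreeSketch {
  /** Minimal stand-in exposing only the two accessors used in this sketch. */
  static final class Cluster<T> {
    private final List<T> documents = new ArrayList<>();
    private final List<Cluster<T>> clusters = new ArrayList<>();

    public List<T> getDocuments() {
      return documents;
    }

    public List<Cluster<T>> getClusters() {
      return clusters;
    }
  }

  /** Collects the distinct documents of a cluster and all of its subclusters. */
  public static <T> Set<T> allDocuments(Cluster<T> cluster) {
    // Identity-based set: documents are compared by reference, not by equals().
    Set<T> seen = Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());
    collect(cluster, seen);
    return seen;
  }

  private static <T> void collect(Cluster<T> cluster, Set<T> seen) {
    seen.addAll(cluster.getDocuments());
    for (Cluster<T> subcluster : cluster.getClusters()) {
      collect(subcluster, seen);
    }
  }

  public static void main(String[] args) {
    Cluster<String> parent = new Cluster<>();
    Cluster<String> first = new Cluster<>();
    Cluster<String> second = new Cluster<>();
    String shared = "doc-2";

    first.getDocuments().add("doc-1");
    first.getDocuments().add(shared);
    second.getDocuments().add(shared); // The same document in a second subcluster.
    second.getDocuments().add("doc-3");
    parent.getClusters().add(first);
    parent.getClusters().add(second);

    System.out.println("direct: " + parent.getDocuments().size()); // prints "direct: 0"
    System.out.println("in subtree: " + allDocuments(parent).size()); // prints "in subtree: 3"
  }
}
```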

Note that the documents in a cluster have the same generic type as the documents you pass for clustering. Therefore, you can add domain-specific fields and methods to your Document implementations and use them when consuming the clusters.
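
As an illustration of this point, the sketch below defines a hypothetical WebPage class that keeps a URL alongside its clusterable text. The Document interface here is a local stand-in with the same single method, and the list in main() merely plays the role of a getDocuments() result. None of these names come from Lingo3G.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.stream.Collectors;

class DomainDocumentSketch {
  /** Local stand-in mirroring the single method of the Document interface. */
  interface Document {
    void visitFields(BiConsumer<String, String> fieldConsumer);
  }

  /** Hypothetical domain class: keeps the URL but submits only text fields. */
  static final class WebPage implements Document {
    private final String url;
    private final String title;
    private final String content;

    WebPage(String url, String title, String content) {
      this.url = url;
      this.title = title;
      this.content = content;
    }

    public String getUrl() {
      return url;
    }

    @Override
    public void visitFields(BiConsumer<String, String> fieldConsumer) {
      fieldConsumer.accept("title", title);
      fieldConsumer.accept("content", content);
    }
  }

  /** Reads the domain-specific field from typed documents; no casts are needed. */
  public static List<String> urls(List<WebPage> documents) {
    return documents.stream().map(WebPage::getUrl).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Plays the role of the list a cluster of WebPage documents would return.
    List<WebPage> clusterDocuments =
        Arrays.asList(
            new WebPage(
                "http://en.wikipedia.org/wiki/Data_mining",
                "Data mining - Wikipedia, the free encyclopedia",
                "Article about knowledge-discovery in databases (KDD)."));

    System.out.println(urls(clusterDocuments)); // prints "[http://en.wikipedia.org/wiki/Data_mining]"
  }
}
```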

Changing parameters

Lingo3G comes with a number of parameters to adjust its behavior. The Lingo3GClusteringAlgorithm class exposes the parameters as public fields. To modify a parameter, use the set() method of the corresponding field.

The following example sets a query hint, restricts the cluster hierarchy depth to one level, and filters out one-word labels:

Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();
algorithm.labels.queryHint.set("data mining");
algorithm.labels.minLabelWords.set(2);
algorithm.hierarchy.maxHierarchyDepth.set(1);

and produces this output:

Knowledge Discovery [docs: 13, score: 1.00]
Data-mining Software [docs: 13, score: 0.98]
Data Mining Applications [docs: 11, score: 0.93]
Text Mining [docs: 8, score: 0.92]
Data Mining Tools [docs: 9, score: 0.92]
Data Mining Techniques [docs: 10, score: 0.91]
Data Mining Technology [docs: 8, score: 0.88]
Data Mining Solutions [docs: 7, score: 0.88]
Data Mining Conference [docs: 8, score: 0.86]
Data Mining Group [docs: 7, score: 0.85]
Data Analysis [docs: 6, score: 0.83]
Book on Data Mining [docs: 6, score: 0.82]
Data Management [docs: 5, score: 0.81]
Oracle Data Mining [docs: 4, score: 0.80]
Introduction to Data Mining [docs: 4, score: 0.77]
Data Warehousing [docs: 3, score: 0.75]
Predictive Analytics [docs: 3, score: 0.74]
Microsoft SQL Server [docs: 2, score: 0.69]
Central Connecticut State University [docs: 2, score: 0.69]
Data Mining Project [docs: 2, score: 0.69]
Process of Extracting [docs: 2, score: 0.68]
Business Intelligence [docs: 3, score: 0.65]

Note that Lingo3G validates parameter values when you call the set() method. The method throws an exception when you submit an invalid value, such as one that is out of range.
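
The sketch below illustrates this fail-fast behavior with a hypothetical stand-in class, BoundedInt, which is not part of Lingo3G. The exception type (IllegalArgumentException) and the bounds are assumptions made for the illustration. Check the Javadoc for the exception type the real set() methods throw.

```java
class ValidatingParameterSketch {
  /** Hypothetical stand-in for a validated parameter field; not a Lingo3G class. */
  static final class BoundedInt {
    private final int min;
    private final int max;
    private int value;

    BoundedInt(int min, int max, int initial) {
      this.min = min;
      this.max = max;
      set(initial);
    }

    /** Validates eagerly, so a bad value fails here rather than during clustering. */
    public void set(int newValue) {
      if (newValue < min || newValue > max) {
        // Assumed exception type; the real API may throw a different one.
        throw new IllegalArgumentException(
            "Value " + newValue + " is outside [" + min + ", " + max + "]");
      }
      this.value = newValue;
    }

    public int get() {
      return value;
    }
  }

  /** Applies a candidate value; on rejection the previous value stays in place. */
  public static boolean trySet(BoundedInt parameter, int candidate) {
    try {
      parameter.set(candidate);
      return true;
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    BoundedInt minLabelWords = new BoundedInt(1, 8, 1); // Bounds are made up for the sketch.
    trySet(minLabelWords, 2);
    trySet(minLabelWords, -5); // prints "Rejected: Value -5 is outside [1, 8]"
    System.out.println("Current value: " + minLabelWords.get()); // prints "Current value: 2"
  }
}
```

Handling the exception at the point where the parameter is set is useful when values come from user input or configuration files, because the error can be reported before any clustering work starts.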

See the Javadoc documentation and the parameters reference page for the full list of parameters and their descriptions.