Java API Basics

This article will walk you through the basics of the Lingo3G Java API.

API workflow

The Lingo3G API uses the Carrot2 project's infrastructure and integrates with other algorithms available in that open-source project. All of the Carrot2 documentation is applicable to Lingo3G.

Using the Lingo3G Java API consists of the following steps:

  1. setting up heavy, thread-safe, reusable components (language dictionaries and auxiliary resources combined in an instance of the LanguageComponents class),

  2. setting up lightweight, per-thread components (Lingo3G clustering algorithm instance),

  3. preparing input documents and performing the actual clustering.

Heavy and lightweight components

Initialization of heavy components (LanguageComponents) may take significant time. Load them once and then reuse them for all subsequent clustering calls. Heavy components are thread-safe and can be shared between threads.

Lightweight components are cheap to instantiate, so you can create a throw-away instance on demand for each clustering call.
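
The sharing pattern described above can be sketched without any Lingo3G classes. In the sketch below, HeavyResources and LightweightAlgorithm are hypothetical stand-ins (they are not part of the Lingo3G or Carrot2 API) for LanguageComponents and the clustering algorithm. The heavy object is created once and shared by all worker threads, while each task creates its own throw-away lightweight instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a heavy, thread-safe, reusable component. */
final class HeavyResources {
  /** Counts how many times the "expensive" constructor has run. */
  static final AtomicInteger LOADS = new AtomicInteger();

  private final String language;

  HeavyResources(String language) {
    LOADS.incrementAndGet(); // Imagine expensive dictionary loading here.
    this.language = language;
  }

  String language() {
    return language;
  }
}

/** Hypothetical stand-in for a cheap component created for each call. */
final class LightweightAlgorithm {
  String process(String input, HeavyResources resources) {
    return resources.language() + ":" + input.toUpperCase(Locale.ROOT);
  }
}

final class ComponentSharingSketch {
  /** Processes all inputs concurrently, sharing a single heavy instance. */
  static List<String> processAll(List<String> inputs) {
    HeavyResources shared = new HeavyResources("English"); // Loaded once.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<CompletableFuture<String>> futures = new ArrayList<>();
      for (String input : inputs) {
        // A fresh lightweight instance per task; the heavy one is shared.
        futures.add(
            CompletableFuture.supplyAsync(
                () -> new LightweightAlgorithm().process(input, shared), pool));
      }
      List<String> results = new ArrayList<>();
      for (CompletableFuture<String> future : futures) {
        results.add(future.join()); // Results keep the input order.
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(processAll(List.of("a", "b", "c")));
    System.out.println("heavy loads: " + HeavyResources.LOADS.get());
  }
}
```

In a real application the shared instance would typically live in a long-lived field or an application-scoped service rather than being created inside the method that submits the work.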

Clustering

This section discusses code from the E01_ClusteringBasics.java example. It shows just the key elements required to process a stream of documents in English, without any parameter or language resource tuning.

First, let's load the heavy components: the default resources for the English language. The loaded LanguageComponents instance is thread-safe and should be reused for any subsequent calls to clustering algorithms.

// Our documents are in English so we load appropriate language resources.
// This call can be heavy and an instance of LanguageComponents should be
// created once and reused across different clustering calls.
LanguageComponents languageComponents = LanguageComponents.loader().load().language("English");

Now it's time to create the lightweight component, an instance of the Lingo3G clustering algorithm:

Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();

Once we have the heavy and lightweight components initialized, we can assemble the input for clustering: a stream of Document instances. Each document must implement a single method that presents its clusterable text fields to the algorithm:

void visitFields(BiConsumer<String, String> fieldConsumer);
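
To illustrate this contract, here is a minimal sketch of a document class that carries a title, a body and a URL that is not exposed for clustering. FieldSource is a local stand-in that only mirrors the visitFields signature shown above so that the snippet compiles on its own; in real code the class would implement the Document interface instead. The Article class and the exposedFields helper are hypothetical names used for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Local stand-in mirroring the single-method contract shown above. */
interface FieldSource {
  void visitFields(BiConsumer<String, String> fieldConsumer);
}

/** Hypothetical document type with clusterable and display-only data. */
final class Article implements FieldSource {
  final String url; // Display-only: never passed to the visitor.
  final String title;
  final String body;

  Article(String url, String title, String body) {
    this.url = url;
    this.title = title;
    this.body = body;
  }

  @Override
  public void visitFields(BiConsumer<String, String> fieldConsumer) {
    // Only text that should influence clustering is presented here.
    fieldConsumer.accept("title", title);
    fieldConsumer.accept("content", body);
  }
}

final class FieldVisitorSketch {
  /** Collects the fields a document exposes, in visiting order. */
  static Map<String, String> exposedFields(FieldSource doc) {
    Map<String, String> fields = new LinkedHashMap<>();
    doc.visitFields(fields::put);
    return fields;
  }

  public static void main(String[] args) {
    Article article =
        new Article(
            "http://en.wikipedia.org/wiki/Data_mining",
            "Data mining",
            "Article about knowledge-discovery in databases (KDD).");
    System.out.println(exposedFields(article).keySet());
  }
}
```

A helper like exposedFields can be handy for checking which text a document actually presents to the algorithm.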

E01_ClusteringBasics uses hardcoded values from an array available in the ExamplesData class:

static final String[][] DOCUMENTS_DATA_MINING = {
  {
    "http://en.wikipedia.org/wiki/Data_mining",
    "Data mining -" + " Wikipedia, the free " + "encyclopedia",
    "Article about knowledge-discovery in databases (KDD), the practice "
        + "of automatically "
        + "searching large stores of data for patterns."
  },

Because the field visitor interface has a single method, it can be implemented with a lambda expression. We convert the above data array into document instances dynamically using Java streams. Note how we expose only the title and the snippet fields; the URL is omitted because it is not really clusterable text content.

// Create a stream of "documents" for clustering.
// Each such document provides text content fields to a visitor.
Stream<Document> documentStream =
    Arrays.stream(ExamplesData.DOCUMENTS_DATA_MINING)
        .map(
            fields ->
                (fieldVisitor) -> {
                  fieldVisitor.accept("title", fields[1]);
                  fieldVisitor.accept("content", fields[2]);
                });

Everything is now ready to call the clustering algorithm and consume the result. Here we just print the cluster labels, document counts and scores to the console:

List<Cluster<Document>> clusters;
clusters = algorithm.cluster(documentStream, languageComponents);

ExamplesCommon.printClusters(clusters);

When executed, this example should result in this output:

Knowledge Discovery [docs: 0, score: 1.00]
  Software [docs: 4, score: 1.00]
  Databases [docs: 3, score: 0.94]
  Application Areas [docs: 2, score: 0.89]
  Process [docs: 2, score: 0.87]
  Market [docs: 2, score: 0.86]
Data-mining Software [docs: 0, score: 0.98]
  Model [docs: 3, score: 1.00]
  Data Mining and Knowledge Discovery [docs: 4, score: 0.99]
  Data Mining Applications [docs: 2, score: 0.95]
  Development [docs: 2, score: 0.90]
  Field [docs: 2, score: 0.90]
Applications [docs: 0, score: 0.94]
  Application Areas [docs: 3, score: 1.00]
  Algorithms [docs: 2, score: 0.92]
  Data Mining Software [docs: 2, score: 0.91]
  SIAM International Conference on Data Mining [docs: 2, score: 0.90]
  Customers [docs: 2, score: 0.90]
  Standard [docs: 2, score: 0.90]
  Trends [docs: 2, score: 0.90]
Text Mining [docs: 8, score: 0.92]
Data Mining Tools [docs: 9, score: 0.92]
Techniques [docs: 10, score: 0.92]
Data Mining Technology [docs: 8, score: 0.89]
Data Mining Solutions [docs: 7, score: 0.88]
Conference [docs: 8, score: 0.88]
Data Mining Group [docs: 7, score: 0.86]
Book [docs: 6, score: 0.84]
Data Analysis [docs: 6, score: 0.83]
Practice [docs: 6, score: 0.83]
Data Management [docs: 5, score: 0.82]
Oracle Data Mining [docs: 4, score: 0.80]
Data Warehousing [docs: 3, score: 0.75]
Association [docs: 3, score: 0.73]
Microsoft SQL Server [docs: 2, score: 0.69]
Data Mining Project [docs: 2, score: 0.69]
Central Connecticut State University [docs: 2, score: 0.69]
Process of Extracting [docs: 2, score: 0.68]
CCSU [docs: 2, score: 0.65]
SPSS [docs: 2, score: 0.65]
Dan [docs: 2, score: 0.65]
Science [docs: 2, score: 0.65]
Success [docs: 2, score: 0.65]
Business Intelligence [docs: 3, score: 0.65]
Predictive Analytics [docs: 3, score: 0.65]

Iterating through results

In the example above we silently omitted how the resulting structure of Cluster objects is enumerated and how documents in each cluster can be retrieved.

The Cluster class comes with utility methods to retrieve the label, score, list of contained documents and sub-clusters of a cluster. For example:

/** Returns all documents that belong directly to this cluster. */
public List<T> getDocuments() {
  return documents;
}

Our final routine dumping a cluster's label (and subclusters) can look like this:

static <T> void printClusters(List<Cluster<T>> clusters, String indent) {
  for (Cluster<T> c : clusters) {
    System.out.printf(
        Locale.ROOT,
        indent + "%s [docs: %,d, score: %.2f]\n",
        String.join(", ", c.getLabels()),
        c.getDocuments().size(),
        c.getScore());
    printClusters(c.getClusters(), indent + "  ");
  }
}

Note that the type of the "document" within the cluster is a generic type identical to that of the documents passed for clustering. Your concrete implementations of the Document interface may have additional fields or methods to retrieve any document data for display purposes.
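
As a sketch of how this can be used, the code below walks a cluster and its subclusters and collects a display-only field from each document. ClusterNode is a local stand-in that exposes the same accessor names used in printClusters (getLabels, getDocuments, getClusters) so that the snippet compiles on its own; WebPage and allUrls are hypothetical names. The recursion matters because, as in the output shown earlier, a parent cluster may report zero direct documents while its subclusters contain them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical document type with a display-only field. */
final class WebPage {
  final String url;

  WebPage(String url) {
    this.url = url;
  }
}

/** Local stand-in with the accessor names used in printClusters. */
final class ClusterNode<T> {
  private final String label;
  private final List<T> documents = new ArrayList<>();
  private final List<ClusterNode<T>> clusters = new ArrayList<>();

  ClusterNode(String label) {
    this.label = label;
  }

  List<String> getLabels() {
    return List.of(label);
  }

  List<T> getDocuments() {
    return documents;
  }

  List<ClusterNode<T>> getClusters() {
    return clusters;
  }
}

final class ClusterWalkSketch {
  /** Collects unique URLs from a cluster and all of its subclusters. */
  static Set<String> allUrls(ClusterNode<WebPage> cluster) {
    Set<String> urls = new LinkedHashSet<>();
    collect(cluster, urls);
    return urls;
  }

  private static void collect(ClusterNode<WebPage> cluster, Set<String> urls) {
    for (WebPage page : cluster.getDocuments()) {
      urls.add(page.url); // The cluster hands back our own document type.
    }
    for (ClusterNode<WebPage> sub : cluster.getClusters()) {
      collect(sub, urls);
    }
  }

  public static void main(String[] args) {
    ClusterNode<WebPage> parent = new ClusterNode<>("Parent");
    ClusterNode<WebPage> child = new ClusterNode<>("Child");
    child.getDocuments().add(new WebPage("http://example.com/a"));
    parent.getClusters().add(child);
    System.out.println(
        parent.getDocuments().size() + " direct, " + allUrls(parent).size() + " in total");
  }
}
```

A set is used because the same document may appear in more than one subcluster.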

Tweaking parameters

The Lingo3G clustering algorithm comes with a number of parameters that adjust its behavior. These parameters are exposed as public fields on the algorithm class instance and are documented in the generated JavaDoc documentation as well as on the parameters reference page.

Here is an example that sets a query hint and a minimum label word count, and restricts the hierarchy depth to one level:

Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();
algorithm.labels.queryHint.set("data mining");
algorithm.labels.minLabelWords.set(2);
algorithm.hierarchy.maxHierarchyDepth.set(1);

which produces this output:

Knowledge Discovery [docs: 13, score: 1.00]
Data-mining Software [docs: 13, score: 0.98]
Data Mining Applications [docs: 11, score: 0.93]
Text Mining [docs: 8, score: 0.92]
Data Mining Tools [docs: 9, score: 0.92]
Data Mining Techniques [docs: 10, score: 0.91]
Data Mining Technology [docs: 8, score: 0.88]
Data Mining Solutions [docs: 7, score: 0.88]
Data Mining Conference [docs: 8, score: 0.86]
Data Mining Group [docs: 7, score: 0.85]
Data Analysis [docs: 6, score: 0.83]
Book on Data Mining [docs: 6, score: 0.82]
Data Management [docs: 5, score: 0.81]
Oracle Data Mining [docs: 4, score: 0.80]
Introduction to Data Mining [docs: 4, score: 0.77]
Data Warehousing [docs: 3, score: 0.75]
Predictive Analytics [docs: 3, score: 0.74]
Microsoft SQL Server [docs: 2, score: 0.69]
Central Connecticut State University [docs: 2, score: 0.69]
Data Mining Project [docs: 2, score: 0.69]
Process of Extracting [docs: 2, score: 0.68]
Business Intelligence [docs: 3, score: 0.65]

Note that parameters like maxHierarchyDepth above are modified through their set method rather than by direct assignment. This is because arguments are validated early: out-of-range or otherwise incorrect values trigger an exception at the exact moment they are set in the code.
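
The idea of validating a value at the moment it is set can be sketched as follows. BoundedIntAttr is a hypothetical class written for this illustration, not the actual parameter type used by Lingo3G; the permitted range and the exception type (IllegalArgumentException) are assumptions as well.

```java
/** Hypothetical attribute holder that validates values when they are set. */
final class BoundedIntAttr {
  private final int min;
  private final int max;
  private int value;

  BoundedIntAttr(int min, int max, int initial) {
    this.min = min;
    this.max = max;
    set(initial);
  }

  /** Rejects out-of-range values immediately, at the call site. */
  void set(int newValue) {
    if (newValue < min || newValue > max) {
      throw new IllegalArgumentException(
          "Value " + newValue + " is outside of [" + min + ", " + max + "]");
    }
    this.value = newValue;
  }

  int get() {
    return value;
  }
}

final class ValidatedSetterSketch {
  /** Returns true if the value was accepted, false if it was rejected. */
  static boolean trySet(BoundedIntAttr attr, int value) {
    try {
      attr.set(value);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    BoundedIntAttr depth = new BoundedIntAttr(1, 10, 2);
    System.out.println("set(1) accepted: " + trySet(depth, 1));
    System.out.println("set(0) accepted: " + trySet(depth, 0));
    System.out.println("current value: " + depth.get());
  }
}
```

With a plain public int field the invalid value would be stored silently and the problem would surface only later, far from the line that caused it.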