Java API Basics
You can use the Lingo3G Java API to invoke clustering from your Java software. This article explains the basics of the API.
Dependency setup
To set up Lingo3G as a dependency in your project, perform the following steps:
-
Install the Lingo3G JAR and its POM file in your local Maven repository by running:
mvn install:install-file -Dfile=lib/lingo3g-*.jar -DpomFile=lib/lingo3g-*.pom
Alternatively, install the dependency on your company's shared artifact server, such as Sonatype Nexus.
-
Add a Lingo3G dependency to your project.
-
For Maven projects, add the following dependency:
<dependency>
  <groupId>com.carrotsearch.lingo3g</groupId>
  <artifactId>lingo3g</artifactId>
  <version>2.3.2</version>
</dependency>
Lingo3G Maven dependency.
-
For Gradle projects, add the following dependency:
dependencies {
  implementation("com.carrotsearch.lingo3g:lingo3g:2.3.2")
}
Lingo3G Gradle dependency.
(Here we use implementation as a particular configuration to attach the dependency to.)
Alternatively, a Gradle dependency can just include the set of JAR files from the distribution's lib/ folder:
dependencies {
  implementation fileTree("lingo3g-distribution/lib").include("**/*.jar")
}
API overview
The Java code invoking Lingo3G clustering needs to perform the following steps:
-
Load and cache the reusable component: the LanguageComponents instance. LanguageComponents contains various linguistic resources, such as dictionaries and stemmers, and may take significant time to load. Load LanguageComponents once and reuse it for all clustering calls. The class is thread-safe, so share one instance between all processing threads.
-
For each clustering call:
-
Create the disposable component: the Lingo3GClusteringAlgorithm instance.
-
Prepare input for clustering: a stream of Document instances.
-
Perform clustering, reusing the LanguageComponents instance loaded previously.
The Basic clustering section shows code examples that implement the above steps.
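The load-once rule from the first step can be sketched with a small thread-safe holder. This is an illustration only: SharedResource and SharedResourceDemo are hypothetical names, and a plain string stands in for the LanguageComponents instance so that the sketch runs without the Lingo3G library. In real code, the loader would wrap the LanguageComponents loading call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Loads an expensive, thread-safe resource at most once and shares it afterwards. */
final class SharedResource<T> implements Supplier<T> {
  private final Supplier<T> loader;
  private volatile T instance;

  SharedResource(Supplier<T> loader) {
    this.loader = loader;
  }

  @Override
  public T get() {
    T result = instance;
    if (result == null) {
      synchronized (this) {
        result = instance;
        if (result == null) {
          // The first caller pays the loading cost; all later callers reuse the result.
          result = loader.get();
          instance = result;
        }
      }
    }
    return result;
  }
}

final class SharedResourceDemo {
  /** Calls get() from several threads and returns how many times the loader ran. */
  static int countLoads(int threads) {
    AtomicInteger loads = new AtomicInteger();
    // Hypothetical stand-in for the heavy language resource loading call.
    SharedResource<String> shared =
        new SharedResource<>(() -> "resources #" + loads.incrementAndGet());

    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> shared.get());
      workers[i].start();
    }
    try {
      for (Thread worker : workers) {
        worker.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for workers", e);
    }
    return loads.get();
  }
}
```

When eager loading at startup is acceptable, a static final field initialized once gives the same sharing behavior with less code.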
Basic clustering
This section shows how to invoke Lingo3G clustering, illustrating each
step with code taken from the
E01_ClusteringBasics.java example class. It shows just the
key elements required to process a stream of documents in English, without
any parameter or language resource tuning.
Make sure you install the license file before running any Java API code examples. For quick tests, you can copy the license file to your home directory.
Load reusable components
First, load the reusable components – the default language
resources for the English language. The
LanguageComponents instance is thread-safe, so you should
reuse it for all subsequent clustering calls.
// Our documents are in English so we load appropriate language resources.
// This call can be heavy and an instance of LanguageComponents should be
// created once and reused across different clustering calls.
LanguageComponents languageComponents = LanguageComponents.loader().load().language("English");
Create disposable components
Next, create the disposable component – a
Lingo3GClusteringAlgorithm
instance:
Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();
Prepare documents
Next, assemble the input for clustering – a stream of
Document instances. Each document must implement a single
method that presents its clusterable text fields to the algorithm:
void visitFields(BiConsumer<String, String> fieldConsumer);
This example uses hardcoded values from an array defined in the
ExamplesData class:
static final String[][] DOCUMENTS_DATA_MINING = {
  {
    "http://en.wikipedia.org/wiki/Data_mining",
    "Data mining -" + " Wikipedia, the free " + "encyclopedia",
    "Article about knowledge-discovery in databases (KDD), the practice "
        + "of automatically "
        + "searching large stores of data for patterns."
  },
Use Java streams for an easy way to convert data into
Document instances. Note that the code submits only the
title and snippet for clustering (as the title and content fields); it
omits the URL because it is not clusterable text.
// Create a stream of "documents" for clustering.
// Each such document provides text content fields to a visitor.
Stream<Document> documentStream =
    Arrays.stream(ExamplesData.DOCUMENTS_DATA_MINING)
        .map(
            fields ->
                (fieldVisitor) -> {
                  fieldVisitor.accept("title", fields[1]);
                  fieldVisitor.accept("content", fields[2]);
                });
Because the Document interface has a single method, you can
implement it with a lambda expression.
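When documents come from your own domain model, a named class can be easier to work with than a lambda. The following is a minimal sketch under stated assumptions: the Document interface here is a local stand-in that only mirrors the visitFields signature shown earlier (real code would implement the Lingo3G interface instead), and SearchResult with its fieldsOf helper is a hypothetical class made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Local stand-in mirroring the single-method Document contract (an assumption). */
interface Document {
  void visitFields(BiConsumer<String, String> fieldConsumer);
}

/** Hypothetical domain class: a search result that exposes only its text fields. */
final class SearchResult implements Document {
  final String url;
  final String title;
  final String snippet;

  SearchResult(String url, String title, String snippet) {
    this.url = url;
    this.title = title;
    this.snippet = snippet;
  }

  @Override
  public void visitFields(BiConsumer<String, String> fieldConsumer) {
    // The URL is deliberately not presented: it is not clusterable text.
    fieldConsumer.accept("title", title);
    fieldConsumer.accept("content", snippet);
  }

  /** Collects the fields a document presents to a visitor, in visiting order. */
  static Map<String, String> fieldsOf(Document document) {
    Map<String, String> fields = new LinkedHashMap<>();
    document.visitFields(fields::put);
    return fields;
  }
}
```

The fieldsOf helper is only a debugging aid for checking which text a document would submit; it is not part of any clustering call.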
Perform clustering
Finally, invoke clustering, passing the document stream and the language components instance. This example just prints the cluster labels, document counts and scores to the console:
List<Cluster<Document>> clusters;
clusters = algorithm.cluster(documentStream, languageComponents);
ExamplesCommon.printClusters(clusters);
When executed, this example should result in this output:
Knowledge Discovery [docs: 0, score: 1.00]
  Software [docs: 4, score: 1.00]
  Databases [docs: 3, score: 0.94]
  Application Areas [docs: 2, score: 0.89]
  Process [docs: 2, score: 0.87]
  Market [docs: 2, score: 0.86]
Data-mining Software [docs: 0, score: 0.98]
  Model [docs: 3, score: 1.00]
  Data Mining and Knowledge Discovery [docs: 4, score: 0.99]
  Data Mining Applications [docs: 2, score: 0.95]
  Development [docs: 2, score: 0.90]
  Field [docs: 2, score: 0.90]
Applications [docs: 0, score: 0.94]
  Application Areas [docs: 3, score: 1.00]
  Algorithms [docs: 2, score: 0.92]
  Data Mining Software [docs: 2, score: 0.91]
  SIAM International Conference on Data Mining [docs: 2, score: 0.90]
  Customers [docs: 2, score: 0.90]
  Standard [docs: 2, score: 0.90]
  Trends [docs: 2, score: 0.90]
Text Mining [docs: 8, score: 0.92]
Data Mining Tools [docs: 9, score: 0.92]
Techniques [docs: 10, score: 0.92]
Data Mining Technology [docs: 8, score: 0.89]
Data Mining Solutions [docs: 7, score: 0.88]
Conference [docs: 8, score: 0.88]
Data Mining Group [docs: 7, score: 0.86]
Book [docs: 6, score: 0.84]
Data Analysis [docs: 6, score: 0.83]
Practice [docs: 6, score: 0.83]
Data Management [docs: 5, score: 0.82]
Oracle Data Mining [docs: 4, score: 0.80]
Data Warehousing [docs: 3, score: 0.75]
Association [docs: 3, score: 0.73]
Microsoft SQL Server [docs: 2, score: 0.69]
Data Mining Project [docs: 2, score: 0.69]
Central Connecticut State University [docs: 2, score: 0.69]
Process of Extracting [docs: 2, score: 0.68]
CCSU [docs: 2, score: 0.65]
SPSS [docs: 2, score: 0.65]
Dan [docs: 2, score: 0.65]
Science [docs: 2, score: 0.65]
Success [docs: 2, score: 0.65]
Business Intelligence [docs: 3, score: 0.65]
Predictive Analytics [docs: 3, score: 0.65]
Iterating through results
The Cluster class comes with utility methods to retrieve the
label, score, list of contained documents and sub-clusters of a cluster.
For example:
/** Returns all documents that belong directly to this cluster. */
public List<T> getDocuments() {
  return documents;
}
A method printing a cluster summary, along with subclusters, can look like this:
static <T> void printClusters(List<Cluster<T>> clusters, String indent) {
  for (Cluster<T> c : clusters) {
    System.out.printf(
        Locale.ROOT,
        indent + "%s [docs: %,d, score: %.2f]\n",
        String.join(", ", c.getLabels()),
        c.getDocuments().size(),
        c.getScore());
    printClusters(c.getClusters(), indent + " ");
  }
}
Note that the documents within a cluster have the same generic type as
the documents you pass for clustering. Therefore, you can add
domain-specific fields and methods to your
Document implementations and use them when consuming the
clusters.
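The following sketch illustrates consuming domain-specific fields from clustered documents. It is not the library code: SimpleCluster is a hypothetical stand-in exposing accessors with the same names as those used in the printing method above (getLabels, getDocuments, getClusters), and Page is a made-up document type with a url field. With the real API, the element type would be your own Document implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in with the same accessor names as the Cluster class. */
final class SimpleCluster<T> {
  private final String label;
  private final List<T> documents;
  private final List<SimpleCluster<T>> clusters;

  SimpleCluster(String label, List<T> documents, List<SimpleCluster<T>> clusters) {
    this.label = label;
    this.documents = documents;
    this.clusters = clusters;
  }

  List<String> getLabels() {
    return Collections.singletonList(label);
  }

  List<T> getDocuments() {
    return documents;
  }

  List<SimpleCluster<T>> getClusters() {
    return clusters;
  }
}

/** Hypothetical document type carrying a field that is not used for clustering. */
final class Page {
  final String url;
  final String title;

  Page(String url, String title) {
    this.url = url;
    this.title = title;
  }
}

final class ClusterUrls {
  /** Collects URLs of documents in the given clusters and all of their subclusters. */
  static List<String> collectUrls(List<SimpleCluster<Page>> clusters) {
    List<String> urls = new ArrayList<>();
    for (SimpleCluster<Page> cluster : clusters) {
      for (Page page : cluster.getDocuments()) {
        // No cast needed: the cluster returns the same type that was submitted.
        urls.add(page.url);
      }
      urls.addAll(collectUrls(cluster.getClusters()));
    }
    return urls;
  }

  /** Builds a tiny two-level hierarchy with made-up data. */
  static List<SimpleCluster<Page>> sample() {
    SimpleCluster<Page> child =
        new SimpleCluster<>(
            "Software",
            Arrays.asList(new Page("http://example.com/b", "Tools")),
            Collections.<SimpleCluster<Page>>emptyList());
    SimpleCluster<Page> parent =
        new SimpleCluster<>(
            "Knowledge Discovery",
            Arrays.asList(new Page("http://example.com/a", "KDD")),
            Arrays.asList(child));
    return Arrays.asList(parent);
  }
}
```

Because a document can belong to more than one cluster, a traversal like this may return the same URL several times; collect into a set if you need unique values.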
Changing parameters
Lingo3G comes with a number of parameters to adjust its behavior. The
Lingo3GClusteringAlgorithm class exposes the parameters as
public fields. To modify a parameter, use the set() method of
the corresponding field.
The following example sets the query hint to "data mining", filters out one-word labels and restricts the cluster hierarchy depth to one level:
Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();
algorithm.labels.queryHint.set("data mining");
algorithm.labels.minLabelWords.set(2);
algorithm.hierarchy.maxHierarchyDepth.set(1);
and produces this output:
Knowledge Discovery [docs: 13, score: 1.00]
Data-mining Software [docs: 13, score: 0.98]
Data Mining Applications [docs: 11, score: 0.93]
Text Mining [docs: 8, score: 0.92]
Data Mining Tools [docs: 9, score: 0.92]
Data Mining Techniques [docs: 10, score: 0.91]
Data Mining Technology [docs: 8, score: 0.88]
Data Mining Solutions [docs: 7, score: 0.88]
Data Mining Conference [docs: 8, score: 0.86]
Data Mining Group [docs: 7, score: 0.85]
Data Analysis [docs: 6, score: 0.83]
Book on Data Mining [docs: 6, score: 0.82]
Data Management [docs: 5, score: 0.81]
Oracle Data Mining [docs: 4, score: 0.80]
Introduction to Data Mining [docs: 4, score: 0.77]
Data Warehousing [docs: 3, score: 0.75]
Predictive Analytics [docs: 3, score: 0.74]
Microsoft SQL Server [docs: 2, score: 0.69]
Central Connecticut State University [docs: 2, score: 0.69]
Data Mining Project [docs: 2, score: 0.69]
Process of Extracting [docs: 2, score: 0.68]
Business Intelligence [docs: 3, score: 0.65]
Note that Lingo3G validates parameter values when you call the
set() method. The method throws an exception when you pass an
invalid value, for example one that is out of range.
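If parameter values come from user input or configuration files, it can help to handle rejected values in one place. The sketch below shows the general fail-fast pattern only. BoundedInt and ParameterInput are hypothetical stand-ins, not Lingo3G classes, and the use of IllegalArgumentException is an assumption: this article does not state which exception type the real set() method throws, so check the JavaDoc before catching a specific type.

```java
/** Hypothetical stand-in for a range-validated integer parameter. */
final class BoundedInt {
  private final int min;
  private final int max;
  private int value;

  BoundedInt(int min, int max, int initial) {
    this.min = min;
    this.max = max;
    set(initial);
  }

  /** Rejects out-of-range values immediately, leaving the previous value intact. */
  void set(int newValue) {
    if (newValue < min || newValue > max) {
      // Assumption: the exception type is chosen for this sketch only.
      throw new IllegalArgumentException(
          "Value " + newValue + " is outside [" + min + ", " + max + "]");
    }
    this.value = newValue;
  }

  int get() {
    return value;
  }
}

final class ParameterInput {
  /** Applies an untrusted value; returns false and keeps the old value if rejected. */
  static boolean trySet(BoundedInt parameter, int candidate) {
    try {
      parameter.set(candidate);
      return true;
    } catch (RuntimeException e) {
      // Assumption: validation failures surface as unchecked exceptions.
      return false;
    }
  }
}
```

Validating at set() time means a bad value is reported where it is supplied, rather than surfacing later in the middle of a clustering call.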
See the JavaDoc documentation and the parameters reference page for the list of parameters and their descriptions.