Concurrency
If you intend to run Lingo3G clustering in parallel threads, make sure your code follows the concurrency guidelines outlined in this article.
Thread-safety
The Lingo3G Java API makes the following thread-safety guarantees:
- Lingo3GClusteringAlgorithm instances are not thread-safe: your code must not share them among parallel threads.
- LanguageComponents instances are thread-safe: your code should share and reuse them among parallel threads.
In other words, if your code needs to cluster data in parallel threads, each thread should own its own Lingo3GClusteringAlgorithm instance. All threads should reuse the same LanguageComponents instance, loaded beforehand.
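One way to express this ownership rule in plain Java is a ThreadLocal that gives every pool thread its own worker while all threads read the same shared resource. The sketch below does not use Lingo3G: SharedResources and Worker are hypothetical stand-ins for the roles of LanguageComponents and Lingo3GClusteringAlgorithm, and it assumes that reusing one worker for successive calls on the same thread is acceptable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class OwnershipSketch {
  // Hypothetical stand-in for a thread-safe, read-only resource
  // (the role LanguageComponents plays in the text above).
  static final class SharedResources {
    final Set<String> stopWords;

    SharedResources(Set<String> stopWords) {
      this.stopWords = Set.copyOf(stopWords); // immutable copy, safe to share
    }
  }

  // Hypothetical stand-in for a non-thread-safe worker
  // (the role of the clustering algorithm instance).
  static final class Worker {
    private final List<String> scratch = new ArrayList<>(); // mutable state, unsafe to share

    int countContentWords(List<String> words, SharedResources shared) {
      scratch.clear();
      for (String word : words) {
        if (!shared.stopWords.contains(word)) {
          scratch.add(word);
        }
      }
      return scratch.size();
    }
  }

  static List<Integer> countAll(List<List<String>> inputs, int threads) throws Exception {
    // Loaded once, shared by every thread.
    SharedResources shared = new SharedResources(Set.of("the", "a", "of"));
    // Created lazily, once per thread; never shared between threads.
    ThreadLocal<Worker> perThread = ThreadLocal.withInitial(Worker::new);

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (List<String> input : inputs) {
        Callable<Integer> task = () -> perThread.get().countContentWords(input, shared);
        futures.add(pool.submit(task));
      }
      List<Integer> results = new ArrayList<>();
      for (Future<Integer> future : futures) {
        results.add(future.get()); // results come back in submission order
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}
```

The worker's mutable scratch list is the kind of internal state that makes sharing an instance across threads unsafe, while the immutable stop-word set can be read concurrently without synchronization.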
The following sections show two approaches to configuring a Lingo3G algorithm instance once and then reusing that configuration in subsequent, possibly concurrent, clustering calls.
Ephemeral instances
The simplest way to ensure thread-safety is to create and configure a Lingo3GClusteringAlgorithm instance on the fly and discard it after the clustering completes.
The following example defines a function that transforms a stream of documents into a list of clusters:
// Load one language components instance for sharing among all parallel threads.
LanguageComponents english = LanguageComponents.loader().load().language("English");

// Document stream -> cluster list function.
Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      // Algorithm instances are created per-call (per-thread)
      Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();

      // ...configured in place
      algorithm.documents.phraseDfThresholdScalingFactor.set(0.1);
      algorithm.hierarchy.maxHierarchyDepth.set(1);

      // and discarded once clustering call completes.
      return algorithm.cluster(documentStream, english);
    };

runConcurrentClustering(processor);
Calling clustering in multiple threads using disposable clustering algorithm instances in Java API.
Note that the code loads a LanguageComponents instance once and then shares it among all parallel threads for reuse.
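The runConcurrentClustering helper called in the example above is not shown in this article. As a rough idea of what such a driver could look like, here is a generic sketch that applies a processor function to several inputs on a fixed-size thread pool. The class name, method name, and signature are assumptions made for illustration and are not part of the Lingo3G API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ConcurrentRunnerSketch {
  // Hypothetical driver: runs processor.apply(input) for every input on a
  // fixed-size pool and returns the results in input order.
  static <I, O> List<O> runConcurrently(Function<I, O> processor, List<I> inputs, int threads)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<O>> tasks = new ArrayList<>();
      for (I input : inputs) {
        tasks.add(() -> processor.apply(input));
      }
      List<O> results = new ArrayList<>();
      // invokeAll blocks until all tasks finish and preserves task order.
      for (Future<O> future : pool.invokeAll(tasks)) {
        results.add(future.get()); // rethrows a task failure as ExecutionException
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}
```

With a processor like the one above, each input would be a separate document stream. A Java Stream can be consumed only once, so every call needs a fresh stream.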
Cloning preconfigured instances
If the configuration of your Lingo3GClusteringAlgorithm instance is complex or you would like to decouple it from the actual clustering, your code can do the following:
- Create and configure a "blueprint" Lingo3GClusteringAlgorithm instance.
- Use the Attrs.toMap method to convert the "blueprint" instance into a Map for sharing among concurrent threads.
- In each thread, use the Attrs.fromMap method to create a clone of the "blueprint" instance.
The following example demonstrates this approach:
// Apply any configuration tweaks once.
Lingo3GClusteringAlgorithm preconfigured = new Lingo3GClusteringAlgorithm();
preconfigured.documents.phraseDfThresholdScalingFactor.set(0.1);
preconfigured.hierarchy.maxHierarchyDepth.set(1);

// Populate the map with algorithm and its attributes.
Map<String, Object> attrs = Attrs.toMap(preconfigured);

// Reuse the previously populated map to create a new cloned instance.
// "english" is the shared LanguageComponents instance loaded in the previous example.
Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      Lingo3GClusteringAlgorithm cloned;
      cloned = Attrs.fromMap(Lingo3GClusteringAlgorithm.class, attrs);
      return cloned.cluster(documentStream, english);
    };

runConcurrentClustering(processor);
Calling clustering in multiple threads by cloning a preconfigured clustering algorithm instance in Java API.
Note that the parallel threads do not use the "blueprint" instance directly as it is not thread-safe. Instead, they create a disposable clone for each clustering call.
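The blueprint-and-clone idea can also be illustrated without Lingo3G. In the sketch below, Config is a hypothetical mutable object, and the hand-written toMap and fromMap methods are simplified analogues written for this example. They are not the Attrs API, and the choice to make the snapshot map immutable is an assumption of the sketch, not a documented property of Attrs.toMap.

```java
import java.util.Map;

final class BlueprintSketch {
  // Hypothetical stand-in for a mutable, non-thread-safe configurable object.
  static final class Config {
    double scalingFactor = 1.0;
    int maxDepth = 2;
  }

  // Capture the blueprint's current settings as an immutable snapshot.
  // An immutable map can be read from any number of threads.
  static Map<String, Object> toMap(Config config) {
    return Map.of(
        "scalingFactor", config.scalingFactor,
        "maxDepth", config.maxDepth);
  }

  // Build a fresh, independent instance from the snapshot.
  // Each caller (each thread or each call) gets its own object.
  static Config fromMap(Map<String, Object> snapshot) {
    Config clone = new Config();
    clone.scalingFactor = (Double) snapshot.get("scalingFactor");
    clone.maxDepth = (Integer) snapshot.get("maxDepth");
    return clone;
  }
}
```

Because every call to fromMap returns a new object, a change made to one clone affects neither the other clones nor the blueprint.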