Concurrency

This article discusses approaches and pitfalls of running Lingo3G in concurrent environments.

Thread-safety

Lingo3G follows Carrot2 contracts on clustering algorithm components:

  • Lingo3G algorithm instances are not thread-safe and cannot be used by multiple threads in parallel,
  • language component instances are thread-safe and should be used and reused by parallel threads.

In other words, if the use-case scenario requires threads to cluster data in parallel, each thread should "own" its own Lingo3G algorithm instance. All threads can reuse the same LanguageComponent instance loaded beforehand.

The following sections of this article show how various approaches to configuring the algorithm once and then reusing it in subsequent, possibly concurrent, clustering calls.

Ephemeral instances

In this case we achieve thread-safety by creating Lingo3G instances on the fly, configure it appropriately and then discard it after the clustering completes. A simple pattern here would be to create a function that transforms a stream of documents into a list of clusters:

Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      // Algorithm instances are created per-call (per-thread)
      Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();
      // ...configured in place
      algorithm.documents.phraseDfThresholdScalingFactor.set(0.1);
      algorithm.hierarchy.maxHierarchyDepth.set(1);
      // and discarded once clustering call completes.
      return algorithm.cluster(documentStream, english);
    };

runConcurrentClustering(processor);

An important assumption here is that the LanguageComponents object (english in the example above) has been loaded beforehand (once) and is reused. See the Language components page for more information on initialization and customization of language resources.

Cloning preconfigured instances

Sometimes the configuration can become fairly complex. Clustering algorithm instances can be converted into and recreated from a map, so we can extract all the details from the preconfigured instance and then deep-clone it on demand, as this example shows:

// Apply any configuration tweaks once.
Lingo3GClusteringAlgorithm preconfigured = new Lingo3GClusteringAlgorithm();
preconfigured.documents.phraseDfThresholdScalingFactor.set(0.1);
preconfigured.hierarchy.maxHierarchyDepth.set(1);

// Populate the map with algorithm and its attributes.
Map<String, Object> attrs = Attrs.toMap(preconfigured);

// Reuse the previously populated map to create a new cloned instance.
Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      ClusteringAlgorithm cloned;
      cloned = Attrs.fromMap(ClusteringAlgorithm.class, attrs);
      return cloned.cluster(documentStream, english);
    };

runConcurrentClustering(processor);