Concurrency

If you intend to run Lingo3G clustering in parallel threads, make sure your code follows the concurrency guidelines outlined in this article.

Thread-safety

The Lingo3G Java API makes the following thread-safety guarantees:

  • Lingo3GClusteringAlgorithm instances are not thread-safe – your code must not share them among parallel threads.

  • LanguageComponents instances are thread-safe – your code should share and reuse them among parallel threads.

In other words, if your code needs to cluster data in parallel threads, each thread should "own" its own Lingo3GClusteringAlgorithm instance. All threads should reuse the same LanguageComponents instance loaded beforehand.

The following sections show two approaches to configuring a Lingo3G algorithm instance once and then reusing that configuration in subsequent, possibly concurrent, clustering calls.

Ephemeral instances

The simplest way to ensure thread-safety is to create and configure a Lingo3GClusteringAlgorithm instance on the fly and discard it after the clustering completes.

The following example defines a function that transforms a stream of documents into a list of clusters:

// Load one language components instance for sharing among all parallel threads.
LanguageComponents english = LanguageComponents.loader().load().language("English");

// Document stream -> cluster list function.
Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      // Algorithm instances are created per-call (per-thread)
      Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();

      // ...configured in place
      algorithm.documents.phraseDfThresholdScalingFactor.set(0.1);
      algorithm.hierarchy.maxHierarchyDepth.set(1);

      // and discarded once clustering call completes.
      return algorithm.cluster(documentStream, english);
    };

runConcurrentClustering(processor);

Calling clustering in multiple threads using disposable clustering algorithm instances in Java API.

Note that the code loads a LanguageComponents instance once and then shares it among all parallel threads for reuse.
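
The example above calls runConcurrentClustering without showing its body. As a rough sketch only, a helper of that kind might submit one task per input to a thread pool and apply the processor function inside each task. The class name ConcurrentRunner, the method name runConcurrently, and its parameters are hypothetical and not part of the Lingo3G API. The sketch is generic over input and output types so it compiles without Lingo3G on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of Lingo3G: runs a processor function
// over several inputs on a fixed-size thread pool.
public class ConcurrentRunner {
  /** Applies the processor to each input concurrently; results keep input order. */
  public static <I, O> List<O> runConcurrently(
      Function<I, O> processor, List<I> inputs, int threads) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      // Submit one task per input. Any thread-unsafe object the processor
      // needs should be created inside the processor, so each call owns it.
      List<Future<O>> futures = new ArrayList<>();
      for (I input : inputs) {
        Callable<O> task = () -> processor.apply(input);
        futures.add(executor.submit(task));
      }

      // Wait for all tasks and collect results in submission order.
      List<O> results = new ArrayList<>();
      for (Future<O> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for results", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Processing task failed", e.getCause());
    } finally {
      executor.shutdown();
    }
  }
}
```

With a processor like the one defined above, each input would be a document stream and each result a cluster list. Because the processor creates its algorithm instance inside the function body, no instance is ever visible to more than one pool thread.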

Cloning preconfigured instances

If the configuration of your Lingo3GClusteringAlgorithm instance is complex or you would like to decouple it from the actual clustering, your code can do the following:

  1. Create and configure a "blueprint" Lingo3GClusteringAlgorithm instance.

  2. Use the Attrs.toMap method to convert the "blueprint" instance into a Map for sharing among concurrent threads.

  3. In each thread, use the Attrs.fromMap method to create a clone of the "blueprint" instance.

The following example demonstrates this approach:

// Apply any configuration tweaks once.
Lingo3GClusteringAlgorithm preconfigured = new Lingo3GClusteringAlgorithm();
preconfigured.documents.phraseDfThresholdScalingFactor.set(0.1);
preconfigured.hierarchy.maxHierarchyDepth.set(1);

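// Convert the blueprint instance and its attributes into a map.
Map<String, Object> attrs = Attrs.toMap(preconfigured);

// Reuse the previously populated map to create a new cloned instance per call.
// Here, english is the shared LanguageComponents instance loaded once beforehand.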
Function<Stream<Document>, List<Cluster<Document>>> processor =
    (documentStream) -> {
      Lingo3GClusteringAlgorithm cloned;
      cloned = Attrs.fromMap(Lingo3GClusteringAlgorithm.class, attrs);
      return cloned.cluster(documentStream, english);
    };

runConcurrentClustering(processor);

Calling clustering in multiple threads by cloning a preconfigured clustering algorithm instance in Java API.

Note that the parallel threads do not use the "blueprint" instance directly as it is not thread-safe. Instead, they create a disposable clone for each clustering call.
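
To show why clones are safe where the "blueprint" is not, the following sketch reproduces the blueprint, map, clone pattern with a plain stand-in class. StandInAlgorithm, its fields, and its toMap and fromMap methods are hypothetical and only mimic the roles of Lingo3GClusteringAlgorithm, Attrs.toMap and Attrs.fromMap; they do not reflect how those are implemented. Wrapping the map as unmodifiable is an extra precaution assumed here, not something the Lingo3G API requires.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class BlueprintCloning {
  /** Stand-in for a mutable, non-thread-safe algorithm; NOT a Lingo3G class. */
  public static final class StandInAlgorithm {
    public double scalingFactor = 1.0;
    public int maxDepth = 2;

    /** Copies the current settings into a new map. */
    public Map<String, Object> toMap() {
      Map<String, Object> map = new HashMap<>();
      map.put("scalingFactor", scalingFactor);
      map.put("maxDepth", maxDepth);
      return map;
    }

    /** Builds a fresh, independent instance from a settings map. */
    public static StandInAlgorithm fromMap(Map<String, Object> map) {
      StandInAlgorithm instance = new StandInAlgorithm();
      instance.scalingFactor = (Double) map.get("scalingFactor");
      instance.maxDepth = (Integer) map.get("maxDepth");
      return instance;
    }
  }

  /** Configures a blueprint once and returns a read-only snapshot of its settings. */
  public static Map<String, Object> snapshot() {
    StandInAlgorithm blueprint = new StandInAlgorithm();
    blueprint.scalingFactor = 0.1;
    blueprint.maxDepth = 1;
    // Threads only read this map; the blueprint itself is never shared.
    return Collections.unmodifiableMap(blueprint.toMap());
  }
}
```

Each call to fromMap yields a separate object, so changing one clone leaves the other clones and the shared map untouched.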