2.x migration guide

Version 2.0.0 modernizes the whole Lingo3G ecosystem, including end-user applications and the programming APIs. This article discusses the rationale behind the changes and provides migration procedures for the affected elements of Lingo3G.

Overview of changes

Lingo3G 2.0.0 comes with the first major API refactoring in several years. At first, we tried to avoid incompatible API changes to eliminate the need for manual code updates and data format conversions. Eventually, it became obvious that the accumulated technical debt and technological progress made sustaining the old architecture impractical and it was time to change it.

The primary motivation for changes was to simplify the programming interfaces to make your Lingo3G integration code cleaner and easier to maintain. We also wanted to modernize the underlying technology stack, so that Lingo3G's technical debt would not carry over to your systems.

The following list summarizes the changes and their driving factors.

  • Modernized REST API

    Lingo3G 2.x REST API replaces XML with the JSON data format for documents, parameters and dictionaries to make integration with your JavaScript code a breeze.

  • Browser-based Workbench

    Lingo3G 2.x removes the Workbench desktop application. The underlying technology has become obsolete and made it hard for us to add new Workbench features.

    The new Workbench is a browser-based app you can try on-line or install on your machine. Lingo3G Workbench has new features the old version did not have, such as clustering data from Excel spreadsheets.

  • Simpler Java API focused on clustering

    Lingo3G 1.x Java API exposed the complete pipeline of document sources and clustering algorithm components, along with controllers that managed the data flow. This helped to cluster data from predefined document sources, but at the same time required learning a lot of interfaces, methods and classes to make a simple clustering call.

    Lingo3G 2.x Java API removes document sources and data flow controllers entirely. To perform clustering, you just need to instantiate the clustering algorithm class, provide it with language resources and a stream of documents to cluster.

    If your application requires caching of clustering results or needs to perform clustering in concurrent threads, Lingo3G 2.x Java API does not get in the way – you can choose and implement the caching, reuse and concurrency pattern that best suits your code.

  • Harnessing the power of the compiler and IDE

    The 2.x Java API exposes clustering parameters as public, typed fields. When you write the integration code, the IDE autocomplete will help you to discover the available parameters and their value types. If you provide an invalid parameter value, the 2.x Java API throws an exception right away to aid debugging. Finally, if a new version removes or renames a parameter you used, the code won't compile, letting you detect the problem early.

  • Minimal, up-to-date dependencies and technology

    The new API minimizes dependencies on third-party libraries and technologies to lower the exposure to the security vulnerabilities in external components and to make integration easier.

  • C# API removed

    We decided to remove the native C# API in Lingo3G 2.x. The reason for this is that the technology we used to build the C# API has been discontinued for several years and the interest in Lingo3G C# API has been steadily declining.

  • Multilingual clustering removed

    Lingo3G 1.x offered built-in language detection and multilingual clustering. To keep focus on clustering and to limit dependencies, we removed this feature from Lingo3G 2.x – the new API allows clustering documents in a single language. If your application relies on multilingual clustering, see this migration step for possible solutions.

  • Features scheduled for removal

    The incremental clustering ("persistent clusters") feature and clustering based on nominal and numeric fields are scheduled for removal in Lingo3G 2.x versions released after December 2021.

    The incremental clustering feature was aimed at scaling Lingo3G clustering beyond several thousand documents. If you use Lingo3G to cluster collections of this size, consider switching to Lingo4G, which was specifically designed for the large-scale clustering case.

    Clustering based on nominal and numeric fields was an experimental feature that never attracted significant attention. For this reason, we decided to schedule it for removal.

Migration guide

REST API (DCS)

Lingo3G 2.x REST API is not backward compatible with the 1.x API – you will need to update your Lingo3G integration code to make it work with version 2.x. This section lists the key elements to look for in your existing code.

  1. First, have a look at the new REST API.

    Before you start converting any existing code, have a look at the basics of the new API.

  2. Identify suite-dcs.xml modifications.

    Previously, Lingo3G used the suite-dcs.xml component suite definition to declare document sources, algorithms and parameter changes. If you edited that file to change parameters or add custom document sources, identify these changes first.

    The new REST API does not support document sources, so if your code used built-in or custom document sources, you need to modify the code to fetch documents directly from the source and pass them to Lingo3G REST API for clustering.

    If you modified suite-dcs.xml to change Lingo3G parameter values, pass the modified parameters directly in the clustering request or in a request template.

  3. Switch from XML to JSON.

    The new REST API accepts and returns data in JSON format, so you will need to modify your code to use the new format. If you write your client code in Java, you can use data model classes to produce the required JSON structure.

  4. Convert dictionaries to JSON.

    If you edited any dictionary files in your existing DCS setup, convert the dictionaries from XML to JSON and copy them to the DCS dictionary directory.

Java API

Lingo3G 2.x Java API is not backward compatible with the 1.x API – you will need to update your Lingo3G integration code to make it work with version 2.x. This section lists the key elements to look for in your existing code.

  1. First, have a look at the new API.

    Before you start converting any existing code, have a look at the basics of the new API, code examples included in the distribution package and JavaDoc. This will give you an idea of what the updated code should look like.

  2. Identify the controller.

    Look for references to the Controller class in your existing code. You should see a fragment similar to:

    Controller controller = ControllerFactory.createCachingPooling(...)

    Lingo3G 1.x used the controller as a cache and threading coordinator. The controller does not have a direct replacement in the 2.x API, but the place where you initialized it is very likely the right place to load and initialize the language components. Your code should load language components once and reuse them for all subsequent clustering calls, including calls made from concurrent threads.
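The load-once, reuse-everywhere pattern described above can be sketched in plain Java. This is an illustration only: LanguageResources and clusterWith are hypothetical stand-ins, not Lingo3G classes. Replace them with the language component loading and clustering calls from the 2.x API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedResourcesSketch {
  /** Hypothetical stand-in for immutable, expensive-to-load language components. */
  static final class LanguageResources {
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();
    final String language;

    private LanguageResources(String language) {
      this.language = language;
    }

    static LanguageResources load(String language) {
      LOAD_COUNT.incrementAndGet(); // Counts loads to show that loading happens once.
      return new LanguageResources(language);
    }
  }

  // Loaded once, in the place where 1.x code used to create the controller.
  static final LanguageResources ENGLISH = LanguageResources.load("English");

  /** Hypothetical stand-in for a clustering call; it only reports the input size. */
  static String clusterWith(LanguageResources resources, List<String> documents) {
    return resources.language + ":" + documents.size();
  }

  /** Runs one task per batch; all tasks share the same resources instance. */
  static List<String> clusterAll(List<List<String>> batches) {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (List<String> batch : batches) {
        futures.add(executor.submit(() -> clusterWith(ENGLISH, batch)));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> future : futures) {
        results.add(future.get()); // Results are collected in submission order.
      }
      return results;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("Clustering task failed", e);
    } finally {
      executor.shutdown();
    }
  }
}
```

The thread pool size and the way results are collected are arbitrary choices made for the sketch. The point is only that the resources object is created once and passed to every call.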

  3. Identify document source(s) and the process method call.

    Lingo3G 1.x API collected the input for clustering using classes implementing the IDocumentSource interface. Look for an implementation of this interface that is part of the processing chain invoked on the controller, or search for an explicit list of documents set on the parameters passed to the process method. For example:

    CommonAttributesDescriptor.attributeBuilder(processingAttributes)
      .documents(...)
    ...
    ProcessingResult result = controller.process(processingAttributes,
      Lingo3GClusteringAlgorithm.class);

    Replace the controller.process(...) call with the 2.x Java API calls: Lingo3G algorithm instantiation, document stream preparation and clustering call.

    Note that with the new API, your code needs to provide an explicit stream of documents for clustering. For each document, provide only the fields that Lingo3G should use for clustering.
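As a rough illustration of preparing such a stream, the sketch below maps application objects to documents that carry only the fields meant for clustering. Article is a hypothetical application class, and Map<String, String> is a stand-in for the 2.x document type, which this guide does not show. Consult the API basics and JavaDoc for the actual document interface.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

final class DocumentStreamSketch {
  /** Hypothetical application class; only title and body are meant for clustering. */
  static final class Article {
    final String id;
    final String title;
    final String body;
    final String author;

    Article(String id, String title, String body, String author) {
      this.id = id;
      this.title = title;
      this.body = body;
      this.author = author;
    }
  }

  /** Keeps only the fields that should influence clustering. */
  static Map<String, String> clusteringFields(Article article) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("title", article.title);
    fields.put("body", article.body);
    return fields; // The id and author fields are deliberately left out.
  }

  /** Builds the explicit document stream that takes the place of a 1.x document source. */
  static Stream<Map<String, String>> toDocumentStream(List<Article> articles) {
    return articles.stream().map(DocumentStreamSketch::clusteringFields);
  }
}
```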

  4. Identify parameter customizations.

    With the 1.x API, your code could set custom parameter values during controller initialization or at processing-time. Identify any parameters that your code customizes. Their names in the old and new API are identical or very similar.

    With the 2.x API, your code should set parameters as part of Lingo3G algorithm instantiation. To decouple parameter customization from the clustering call, consider cloning a "blueprint" instance.

    If your code uses any of the following parameters, you need to make manual adjustments:

    • reloadResources is not available in the new API. A functional equivalent is loading a new language components instance explicitly. However, we don't recommend reloading language components in production code due to the computational overhead. To modify dictionaries at runtime, use ephemeral dictionaries instead.

    • languageRecognition and minLanguageRecognitionConfidence have been removed from the new API as part of language detection and multilingual clustering removal.

      To perform multilingual clustering, use an external language detection library, such as Lingua, to split documents into separate languages and cluster each language separately.

    • titleFields is now called boostFields and the corresponding titleWordLabelScorerWeight is now called boostedFieldScorerWeight.

      Furthermore, Lingo3G 2.x does not boost the title field by default any more. If you'd like to boost the title field, set the boostFields parameter to title.

    • generateLabelHighlights is removed without replacement. If your application relies heavily on highlighting occurrences of cluster labels in the text, consider switching to Lingo4G, which offers comprehensive support for label highlighting.
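The split-by-language approach suggested above for the removed languageRecognition parameter can be sketched as follows. The detector argument is a placeholder for a call into an external library such as Lingua, and toyDetector is a deliberately naive, hypothetical stand-in used only to make the example runnable.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class SplitByLanguageSketch {
  /**
   * Groups documents by detected language. The detector is a placeholder for
   * a call into an external language detection library.
   */
  static Map<String, List<String>> groupByLanguage(
      List<String> documents, Function<String, String> detector) {
    return documents.stream()
        .collect(Collectors.groupingBy(detector, TreeMap::new, Collectors.toList()));
  }

  /** Toy detector for demonstration only; real code would call a detection library. */
  static String toyDetector(String text) {
    return text.startsWith("The ") || text.contains(" the ") ? "English" : "German";
  }
}
```

Each resulting group can then be clustered in a separate call, using the language components loaded for that group's language.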

  5. Identify resource location customizations.

    If your existing code customizes resource locations, you may see a code fragment like this:

    ResourceLookup resourceLookup = new ResourceLookup(new DirLocator(resourcesDir));
    ...
    LexicalDataLoaderDescriptor.attributeBuilder(controllerInitAttrs)
      .resourceLookup(resourceLookup);

    In Lingo3G 2.x API, you can customize resource locations when loading language components.

    Lingo3G 2.x also changes the dictionary file naming conventions. The following table illustrates the changes for English; other languages follow the same convention.

    Dictionary type     Old name                   New name
    Label dictionary    label-dictionary.en.xml    english.labels.json
    Word dictionary     word-dictionary.en.xml     english.tags.json
    Synonyms            synonyms.en.xml            english.synonyms.json

    Finally, Lingo3G 2.x switches the format of dictionary files from XML to JSON. You can use a command line tool to convert your custom dictionaries.
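If you have many dictionary files to rename, the naming change from the table above can be expressed as a small helper. This is a sketch, not part of Lingo3G: DictionaryNamesSketch is a hypothetical class, and only the English mapping ("en" to "english") is confirmed by the table, so add further language entries yourself after checking the names your distribution uses. The helper computes the new file name only; it does not convert file contents.

```java
import java.util.Map;

final class DictionaryNamesSketch {
  // Old file-name prefixes mapped to the new dictionary type names (from the table above).
  private static final Map<String, String> TYPES = Map.of(
      "label-dictionary", "labels",
      "word-dictionary", "tags",
      "synonyms", "synonyms");

  // Only the English mapping is confirmed by the table; extend as needed.
  private static final Map<String, String> LANGUAGES = Map.of("en", "english");

  /** Converts, for example, "label-dictionary.en.xml" to "english.labels.json". */
  static String newName(String oldName) {
    String[] parts = oldName.split("\\.");
    if (parts.length != 3 || !parts[2].equals("xml")) {
      throw new IllegalArgumentException("Unexpected dictionary file name: " + oldName);
    }
    String type = TYPES.get(parts[0]);
    String language = LANGUAGES.get(parts[1]);
    if (type == null || language == null) {
      throw new IllegalArgumentException("Unknown dictionary type or language: " + oldName);
    }
    return language + "." + type + ".json";
  }
}
```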

  6. Convert dictionaries to JSON.

    If your existing code loads custom XML dictionary files, you need to rename and convert them to the JSON format. Version 2.x replaces the XML-based token-matching rules with simpler and more intuitive glob expressions.

    Lingo3G 2.x comes with a migration command line tool you can use to convert your custom dictionaries from XML to JSON. The following examples assume you are converting English dictionaries; change the input file name if needed.

    To perform the conversion, follow these steps:

    1. Open a command prompt and change the current directory to the top-level Lingo3G distribution folder.

    2. To convert a label dictionary, run:

      java -jar lib\lingo3g-migration-2.1.3.jar convert-labels label-dictionary.en.xml

      This command produces english.labels.json.

    3. To convert a synonym dictionary, run:

      java -jar lib\lingo3g-migration-2.1.3.jar convert-synonyms synonyms.en.xml

      This command produces english.synonyms.json.

    4. To convert a word dictionary to a tag dictionary, run:

      java -jar lib\lingo3g-migration-2.1.3.jar convert-words word-dictionary.en.xml

      This command produces english.tags.json.

    Note that the conversion tools "flatten" all internal includes present in XML files. To keep the dictionaries separate, remove the include statements before conversion and run the converter on each file separately.

  7. Update license file lookup locations.

    Lingo3G 2.x Java API slightly changes the license file lookup locations. Most notably, Lingo3G now uses the class loader of Lingo3G API JAR to look up the license resource instead of the thread's context class loader.

C# API

Lingo3G 2.x discontinued the native C# API due to the declining developer interest. If your application relies on the C# API, consider converting your C# code to call Lingo3G REST API instead.

Clustering Workbench

Lingo3G 2.x discontinued the desktop Clustering Workbench application in favor of the browser-based Clustering Workbench app.

The following elements of the old desktop-based Workbench are not available in the new browser-based version:

Multiple results panels

The old Workbench could open multiple clustering results tabs for simultaneous viewing. To achieve the same with the new Workbench, use browser tabs to work on multiple clustering results at the same time.

Aduna Cluster Map

The old Workbench had an option to present document clusters as an interactive graph of nodes. This kind of visualization is not currently available in the new Workbench, but we're hoping to add an equivalent at some point.

Benchmark view

You could use the old Workbench to run performance benchmarks of Lingo3G clustering. This function cannot be implemented reliably in a browser-based application, so it is not available in the new Workbench. The best way to assess Lingo3G performance is to benchmark it as part of the application you are developing. If you need any performance or benchmarking advice, please contact us at info@carrotsearch.com.

The new Workbench comes with features the old version did not have, such as clustering data from Excel spreadsheets and a configurable document view.