Lingo3G 2.0.x Migration

This section discusses major API changes introduced in Lingo3G 2.0 and shows tentative steps to upgrade from Lingo3G 1.x.

Reasons for the new API

Lingo3G 2.0.0 comes with the first major API refactoring and cleanup in several years. We tried to avoid incompatible API changes, knowing they would inevitably mean more integration effort, manual code updates, data format conversions and potential service disruptions. Eventually it became obvious that accumulated technical debt and technological progress made sustaining the old architecture impractical, and that it was time to change it.

We hope that the new programming interface in Lingo3G (and Carrot2) will make programming easier and your code cleaner.

We also hope you'll just plain like it as much as we do.

Overview of changes

Here is a very high-level overview of the new design and its driving factors.

  • Minimize public Java API and boilerplate code.

    The old Carrot2 and Lingo3G architecture was centered around the full pipeline of document sources, clustering algorithms and controllers that managed these components. This took some work off the shoulders of users but at the same time resulted in a lot of interfaces, methods and classes that the user had to know in order to use the algorithm.

    The new API is much, much simpler. We removed document sources entirely and decided that programmers (API users) know best how to organize the code to ensure thread-safety, component reuse and other aspects that the framework has no means of knowing in advance. This strips the public API to a bare minimum: concrete clustering algorithm classes that need to be instantiated, provided with language resources and invoked with a document stream to cluster.

  • Make the compiler do the hard work.

    The new Java API is designed so that clustering implementations have public, typed attributes and throw exceptions at runtime if parameter values don't meet validation criteria. This allows the compiler (and IDE) to utilize code structure for autocompletion, argument type verification and early detection of failures (when upgrading, for example).

  • Modern and efficient REST API.

    JavaScript dominates frontend development these days. A new, efficient HTTP/REST API based on JSON data streams replaces the old XML-based data formats, which makes development in JavaScript-based technologies simpler.

  • Minimal, up-to-date dependencies and technology.

    The new API tries to minimize dependencies on third party libraries and technologies. Such dependencies cause notorious integration problems and security-related concerns.

  • Clustering Workbench moves to the Web.

    We had to let go of the desktop version of the Workbench. The underlying technology has become obsolete and changes in the API (no interface for document sources) didn't fit with the Workbench's philosophy.

    This said, the DCS now comes with a built-in web-based Workbench that can be used for interactive fiddling with the algorithm and its settings.

  • Multilingual clustering is gone.

    Previously, multilingual clustering in Lingo3G was implemented using automatic document language identification followed by separate clustering calls on each language subset. The new API clusters documents in a single language only. Language identification and other pre- and post-processing of multilingual content can be done prior to clustering (with full control, outside of Lingo3G).

  • Incremental clustering is deprecated.

    The incremental clustering ("persistent clusters") feature is scheduled for removal in Lingo3G versions released after December 2021.

  • Nominal and numeric fields.

    Support for nominal and numeric fields is scheduled for removal in Lingo3G versions released after December 2021.
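
The per-language split suggested in the multilingual item above can be sketched with plain Java collections. This is a minimal sketch, not Lingo3G API: the Doc class, its language label and the splitByLanguage method are all hypothetical, and the label is assumed to come from whatever language identifier you already use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Lingo3G API.
final class LanguageSplitSketch {
  /** A document that already carries a language label assigned by an external identifier. */
  static final class Doc {
    final String language;
    final String title;
    final String snippet;

    Doc(String language, String title, String snippet) {
      this.language = language;
      this.title = title;
      this.snippet = snippet;
    }
  }

  /** Groups documents by language label, keeping languages in first-seen order. */
  static Map<String, List<Doc>> splitByLanguage(List<Doc> docs) {
    Map<String, List<Doc>> byLanguage = new LinkedHashMap<>();
    for (Doc doc : docs) {
      byLanguage.computeIfAbsent(doc.language, k -> new ArrayList<>()).add(doc);
    }
    return byLanguage;
  }

  public static void main(String[] args) {
    List<Doc> docs = Arrays.asList(
        new Doc("English", "Data mining", "Finding patterns in data."),
        new Doc("Dutch", "Datamining", "Patronen zoeken in gegevens."),
        new Doc("English", "Text clustering", "Grouping similar documents."));

    // Each subset would be passed to its own single-language clustering call.
    splitByLanguage(docs).forEach((language, subset) ->
        System.out.println(language + ": " + subset.size() + " document(s)"));
  }
}
```

Each resulting subset would then go to its own clustering call, using the language components loaded for that language.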

Migration guide

Java API

This section presents a checklist of key elements to look at when porting existing code that uses Lingo3G Java API to version 2.0.x.

Please read about the new API prior to converting any existing code (this section will help). Also, check out the Java examples distributed with Lingo3G as they'll give you an idea what the updated code should look like.

Once you're done with the above, look at your existing code utilizing Lingo3G and try to follow these steps.

  1. Identify the controller.

    The Controller class is perhaps the first thing to look at in your existing code to be ported. You will probably have something like the following snippet, somewhere:

    Controller controller = ControllerFactory.createCachingPooling(...);
    

    The controller was used as a cache and threading coordinator in the old API. It does not have a direct replacement: algorithm instances are now single-threaded and document sources are gone, so no caching is needed. However, the place where you initialized the controller is very likely the right place to load and initialize language components for the languages that will be used for clustering.

    Once language components are loaded here, they can (and should) be reused for all subsequent clustering calls.

  2. Identify document source(s) and process method.

    The new API needs an explicit stream of documents (with text fields to cluster) so you must identify the source of documents in the old API.

    Previously, the Lingo3G API collected the input for clustering using classes implementing an IDocumentSource interface. Look for an implementation of this interface that is part of the processing chain invoked on the controller, or search for an explicit list of documents set on the parameters passed to the process method. For example:

    CommonAttributesDescriptor.attributeBuilder(processingAttributes)
      .documents(...)
    ...
    ProcessingResult result = controller.process(processingAttributes,
      Lingo3GClusteringAlgorithm.class);
    

    The controller.process(...) call is the snippet which should be replaced with instantiation of the Lingo3G algorithm and a clustering call in the new API (see this section for an example).

    Note that the new API needs only fields that were previously used for clustering — typically title and snippet fields. Any other document fields can be omitted in the visitFields method of the document visitor API.

  3. Identify parameter tweaks.

    The tuning parameters in the old API could be provided during controller initialization or at processing-time. Identify any parameters that are modified from their default values. Their names in the old and new API will be identical or very similar.

    The new code should set parameters directly where the algorithm is instantiated and used (this is cheap and explicit). If you'd like to extract this "setup" step from the actual clustering call, see this section for some hints on how to do it.

    The following parameters, if used, may require manual intervention:

    • reloadResources — not recommended for production at all (costly), but reloading LanguageComponents entirely is an equivalent.
    • languageRecognition, minLanguageRecognitionConfidence — these parameters have been removed without replacement (no multilingual clustering). Try to split documents into separate languages (with any language identifier software) and cluster independently.
    • generateLabelHighlights — removed without replacement.
    • titleFields parameter is now called boostFields and its corresponding score boost parameter is renamed from titleWordLabelScorerWeight to boostedFieldScorerWeight. Note that the title field is no longer boosted by default. If you relied on title field score boosts, set the boostFields parameter to "title" (this section shows how to tweak parameters).

  4. Custom resources and resource lookups.

    If your code used custom resources (resource locations), you may see a snippet like this:

    ResourceLookup resourceLookup = new ResourceLookup(new DirLocator(resourcesDir));
    ...
    LexicalDataLoaderDescriptor.attributeBuilder(controllerInitAttrs)
      .resourceLookup(resourceLookup);
    

    Custom resource locations are still supported but they need to be provided when loading language components. See this section for an example.

    In addition to the resource loading API, the format and the naming convention of resources have changed as well. The table below summarizes these changes and maps old resource names to new ones for English and Dutch (the same pattern applies to other languages):

    Resource type     Language  Old name                 New name
    Label dictionary  English   label-dictionary.en.xml  english.labels.json
    Label dictionary  Dutch     label-dictionary.nl.xml  dutch.labels.json
    Word dictionary   English   word-dictionary.en.xml   english.tags.json
    Word dictionary   Dutch     word-dictionary.nl.xml   dutch.tags.json
    Synonyms          English   synonyms.en.xml          english.synonyms.json
    Synonyms          Dutch     synonyms.nl.xml          dutch.synonyms.json

    Heads up!

    Language resources are now in JSON format! Command-line tools are provided inside Lingo3G to convert legacy XML resources into JSON.

  5. License lookup locations.

    Lingo3G license lookup locations have changed slightly. Most notably, the class loader of Lingo3G API JAR is used to look up the license resource instead of the thread's context class loader. Please consult this section of the documentation for more information.

  6. Convert legacy resources.

    If you have any customized XML language resources from previous versions of Lingo3G, these resources will have to be renamed (see the table above) and converted to JSON.

    The token-matching rules have been replaced with simple, more intuitive glob expressions.

    To help out and speed up the conversion, we provide command-line tools that read previous resource files, and emit an equivalent in the new JSON format. Note that conversion tools will "flatten" all internal includes present in XML files. If you wish to keep the dictionaries separate, remove include statements prior to running each converter (and run the conversion on each file separately).

    The following command examples assume you're at the top of the distribution package and convert English (en) language resource files. Change the input file name accordingly for other languages. All output files follow the Lingo3G 2.0.0 naming pattern for the corresponding resource, including the renamed language code where applicable (see the examples below).

    To convert a word dictionary to a tag dictionary, run:

    java -jar lib\lingo3g-migration-2.0.0-beta2.jar convert-words word-dictionary.en.xml
    

    This command produces english.tags.json.

    To convert a label dictionary, run:

    java -jar lib\lingo3g-migration-2.0.0-beta2.jar convert-labels label-dictionary.en.xml
    

    This command produces english.labels.json.

    To convert a synonym dictionary, run:

    java -jar lib\lingo3g-migration-2.0.0-beta2.jar convert-synonyms synonyms.en.xml
    

    This command produces english.synonyms.json.
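
The renaming convention from the resource table in step 4 is regular enough to express in a few lines, which may help when checking that every legacy file has a converted counterpart. The helper below is a hypothetical sketch and not part of Lingo3G; it covers only the English and Dutch codes listed in the table, and the mapping for any other language is deliberately left out rather than guessed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: not part of Lingo3G. It only encodes the renaming table from step 4.
final class ResourceNameSketch {
  private static final Map<String, String> TYPES = new HashMap<>();
  private static final Map<String, String> LANGUAGES = new HashMap<>();

  static {
    TYPES.put("label-dictionary", "labels");
    TYPES.put("word-dictionary", "tags");
    TYPES.put("synonyms", "synonyms");
    // Only the two language codes listed in the table; extend as needed.
    LANGUAGES.put("en", "english");
    LANGUAGES.put("nl", "dutch");
  }

  /** Maps an old name such as "word-dictionary.en.xml" to "english.tags.json". */
  static String newName(String oldName) {
    String[] parts = oldName.split("\\.");
    if (parts.length != 3 || !parts[2].equals("xml")) {
      throw new IllegalArgumentException("Expected <type>.<lang>.xml, got: " + oldName);
    }
    String type = TYPES.get(parts[0]);
    String language = LANGUAGES.get(parts[1]);
    if (type == null || language == null) {
      throw new IllegalArgumentException("No mapping known for: " + oldName);
    }
    return language + "." + type + ".json";
  }

  public static void main(String[] args) {
    String[] oldNames = {"label-dictionary.en.xml", "word-dictionary.nl.xml", "synonyms.en.xml"};
    for (String oldName : oldNames) {
      System.out.println(oldName + " -> " + newName(oldName));
    }
  }
}
```

The helper only computes file names. The content conversion itself is still done by the command-line tools shown above.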

REST API (DCS)

This section presents a checklist of key elements to look at when porting existing code that uses Lingo3G REST API (DCS) to version 2.0.0.

Please read about the new REST API prior to converting any existing code (this section will help). Also, check out the DCS API examples distributed with Lingo3G as they'll give you an idea what the updated code should look like.

Once you're done with the above, look at your existing code utilizing Lingo3G REST API and try to follow these steps.

  1. Identify the algorithm and parameters.

    Previously, Lingo3G used a component suite definition to declare document sources, algorithms and their parameter tweaks (a file called suite-dcs.xml). If you had any tweaks to the defaults (custom parameters, custom document sources) then identify these components first.

    The new REST API does not support document sources (much like the Java API) and the content of documents to be clustered needs to be provided as part of the request. If you used external document sources (or custom document sources) you'll need to modify your application to fetch documents from those sources first, then build a request with these documents and send it to Lingo3G.

    Any non-default parameter values will have to be set as part of each request (or be set in a request template inside the DCS).

  2. Switch from XML to JSON.

    The new DCS accepts (and returns) data in JSON format. You will have to switch your infrastructure to send requests to the DCS in this format. If your client code is in Java, you may utilize data model classes that can be serialized to JSON directly.

  3. Custom language resources.

    The new DCS uses the same (new) naming convention for language resources as the Java API. Please refer to this section for the location and layout of these resources and migrate your custom resources accordingly.
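
Steps 1 and 2 above mean that the client now fetches documents itself and sends them to the DCS as JSON. The sketch below shows that assembly using only the Java standard library. It rests on assumptions: the "documents" key is a placeholder and not the documented DCS request schema, the title and snippet fields are simply the typical clustering fields mentioned earlier, and the class and method names are hypothetical. Consult the DCS API examples for the actual request layout.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: the JSON key names below are placeholders, not the documented DCS schema.
final class DcsRequestSketch {
  /** Wraps a string in quotes and escapes it for use as a JSON string literal. */
  static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.append('"').toString();
  }

  /** Builds a request body from {title, snippet} pairs fetched by the application. */
  static String requestBody(List<String[]> titleSnippetPairs) {
    StringBuilder sb = new StringBuilder("{\"documents\":[");
    for (int i = 0; i < titleSnippetPairs.size(); i++) {
      String[] doc = titleSnippetPairs.get(i);
      if (i > 0) {
        sb.append(',');
      }
      sb.append("{\"title\":").append(quote(doc[0]))
          .append(",\"snippet\":").append(quote(doc[1])).append('}');
    }
    return sb.append("]}").toString();
  }

  public static void main(String[] args) {
    List<String[]> docs = Arrays.asList(
        new String[] {"Data mining", "Finding \"patterns\" in data."},
        new String[] {"Text clustering", "Grouping similar documents."});
    System.out.println(requestBody(docs));
  }
}
```

In real client code a JSON library, or the data model classes mentioned in step 2, would replace the hand-written escaping. Any non-default parameter values would travel in the same request body.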