2.1.x release notes

Release notes for Lingo4G 2.1.x.

Version 2.1.1

Lingo4G 2.1.1 updates the dotAtlas visualization to fix label rendering artifacts and to correctly handle browser viewport zooming in the 2D map visualizations.

Compatibility

Lingo4G 2.1.1 is fully backward-compatible with the 2.1.0 release and works with indices created by the 2.1.0 release.

Version 2.1.0

Lingo4G 2.1.0 significantly improves document indexing by suppressing truncated phrases (such as Association for Computing as opposed to Association for Computing Machinery) and by learning high-quality embeddings for low-frequency labels.

Compatibility

Project descriptor

Updates required. Lingo4G 2.1.0 comes with significant improvements in phrase extraction and embedding learning. As a result, it removes support for the project descriptor properties listed below. If your project uses any of those properties, remove them to make the descriptor compatible with Lingo4G 2.1.0.

Properties to remove, in the embeddings.labels.input block:

maxLabels
minDf
minTopDf
minLabelsPercentPerDocument

Explanation: In previous versions, those properties configured label embedding learning. Lingo4G 2.1.0 comes with an updated learning algorithm in which those properties are not required.
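The cleanup described above can be scripted. The following is a minimal Python sketch, assuming the embeddings.labels.input path maps to nested JSON objects named embeddings, labels and input. The helper name strip_removed_properties and the example descriptor fragment (including its values) are hypothetical and serve only as an illustration.

```python
import json

# Property names listed in these release notes as no longer supported.
REMOVED_PROPERTIES = ("maxLabels", "minDf", "minTopDf", "minLabelsPercentPerDocument")


def strip_removed_properties(descriptor: dict) -> list:
    """Remove the unsupported properties in place and return the names removed."""
    # Assumption: the block is nested as embeddings -> labels -> input.
    block = descriptor
    for key in ("embeddings", "labels", "input"):
        block = block.get(key)
        if not isinstance(block, dict):
            return []  # The descriptor does not define this block; nothing to do.
    removed = [name for name in REMOVED_PROPERTIES if name in block]
    for name in removed:
        del block[name]
    return removed


# Hypothetical descriptor fragment, used only for illustration.
example = json.loads("""
{
  "embeddings": {
    "labels": {
      "input": { "maxLabels": 100000, "minDf": 2 }
    }
  }
}
""")
removed = strip_removed_properties(example)
print("Removed:", removed)
print(json.dumps(example, indent=2))
```

Because the script reads the descriptor with a strict JSON parser, it only works on descriptors that already contain no comments, unquoted properties or single-quoted strings.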

Strict JSON parsing

Updates required. Starting with version 2.1.0, Lingo4G requires strictly valid JSON. Lingo4G no longer accepts unquoted properties, comments and single-quoted strings.

Strict JSON parsing applies across all Lingo4G resources, including the project descriptor, analysis API v1 and API v2 requests, and external resources. If any of those resources in your project contain non-standard JSON syntax, remove that syntax for Lingo4G 2.1.0 to accept them.
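One way to find non-standard syntax before upgrading is to run each file through a strict JSON parser. The sketch below uses Python's standard json module, which rejects the three constructs named above. It is an approximation rather than Lingo4G's own parser: for instance, Python accepts the NaN and Infinity constants, which strict JSON does not allow. The helper name strict_json_error and the sample strings are hypothetical.

```python
import json


def strict_json_error(text: str):
    """Return None if text parses as JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        return f"line {e.lineno}, column {e.colno}: {e.msg}"
    return None


# Hypothetical samples: one valid document and one per rejected construct.
samples = {
    "strict": '{"name": "value"}',
    "unquoted property": '{name: "value"}',
    "single-quoted string": "{'name': 'value'}",
    "comment": '{"name": "value" /* note */}',
}
for label, text in samples.items():
    print(label, "->", strict_json_error(text) or "ok")
```

To check a file, read its content as text and pass it to the same function; the reported line and column point at the first offending character.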

Reindexing

Required. Lingo4G 2.1.0 changes the internal format of the index files and will not work with indices created by Lingo4G 2.0.x.

Default heap size

Lingo4G 2.1.0 changes the default value of the L4G_OPTS variable from -Xmx4g (setting an explicit heap limit of 4 gigabytes) to an empty string. This causes Java to determine the default and maximum heap size according to garbage collector ergonomics, as a dynamically computed fraction of the memory available to the process.

This change does not require any action. If you experience memory-related problems, set the L4G_OPTS environment variable explicitly, for example to the previous default of -Xmx4g.
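If Lingo4G is launched from a script, the previous heap limit can be restored by setting the variable in the child process environment. The following Python sketch only builds such an environment. The helper name with_heap_limit is hypothetical, and the launcher command is deliberately left out because it depends on the installation.

```python
import os


def with_heap_limit(env, max_heap: str = "4g") -> dict:
    """Return a copy of env with L4G_OPTS set to an explicit -Xmx limit.

    Note: this overwrites any existing L4G_OPTS value in the copy.
    """
    updated = dict(env)
    updated["L4G_OPTS"] = f"-Xmx{max_heap}"
    return updated


# Restore the pre-2.1.0 default of 4 gigabytes for child processes.
env = with_heap_limit(os.environ)
print(env["L4G_OPTS"])
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting Lingo4G from Python.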

Improvements

Suppression of truncated phrases

As of version 2.1.0, Lingo4G improves the quality of labels by suppressing incomplete phrases at indexing time. Previous versions were likely to extract sub-phrases of longer phrases, such as Association for Computing, in addition to the more meaningful longer phrases, such as Association for Computing Machinery.

Version 2.1.0 by default extracts only the longer and more meaningful phrases. Setting the skipSubphrases property to false reverts indexing to the previous behaviour. However, we recommend leaving the property at its default value of true for higher-quality labels.
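To make the idea concrete, the toy Python sketch below drops every phrase that occurs as a contiguous word sequence inside a longer phrase. It only illustrates the concept and is not Lingo4G's actual algorithm, which these notes do not describe. The function name suppress_subphrases and the sample phrase list are hypothetical.

```python
def suppress_subphrases(phrases):
    """Keep only phrases that are not a contiguous word run of a longer phrase."""
    tokenized = [tuple(p.split()) for p in phrases]

    def is_subphrase(short, long):
        # True if `short` appears as consecutive words inside the longer `long`.
        n = len(short)
        return n < len(long) and any(
            long[i:i + n] == short for i in range(len(long) - n + 1)
        )

    return [
        phrase
        for phrase, tokens in zip(phrases, tokenized)
        if not any(is_subphrase(tokens, other) for other in tokenized)
    ]


phrases = [
    "Association for Computing",
    "Association for Computing Machinery",
    "data mining",
]
print(suppress_subphrases(phrases))
```

In this toy version the truncated phrase is removed purely because a longer phrase contains it; a production implementation would likely also take phrase frequencies into account.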

Label embedding improvements

Version 2.1.0 comes with significant improvements to learning label embedding vectors. Lingo4G now splits learning into two phases: direct learning of vectors for high-frequency labels and estimation of vectors for low-frequency labels.

As a result, Lingo4G 2.1.0 can compute high-quality embeddings for the long tail of low-frequency labels that previous versions would ignore.

The following properties configure the new label embedding learning algorithm: maxLabelsForDirectLearning, minLabelTfForDirectLearning and minLabelTfForEstimatedLearning.

Faster query highlighting

Version 2.1.0 improves the performance of label and search term highlighting in text.

Eager document content retrieval

Version 2.1.0 adds an option to choose between the streaming and eager document content retrieval modes.

Dependency updates

Lingo4G 2.1.0 updates Lucene to version 9.9.2.

Bug fixes

--output option in run-request works incorrectly

In previous versions, the --output option incorrectly saved the request's result to the parent directory of the provided location. Version 2.1.0 fixes this.