2.3.x release notes

Release notes for Lingo4G 2.3.x.

Version 2.3.0

Release 2.3.0 comes with the following new features and improvements:

  • External document vectors. Initial support for externally provided document vectors, such as large language model (LLM) document embeddings.

  • Better built-in embeddings. Improved quality and performance of the built-in label and document embedding vectors.

  • New clustering algorithm based on community detection in graphs.

  • More meaningful, longer labels extracted from documents by default.

  • Document similarity computation based on arbitrary search fields.

Compatibility

Project descriptor

Updates required. Lingo4G 2.3.0 improves the label embedding learning algorithm. As a result, it introduces the following changes to the label embedding learning section of the project descriptor. If your project uses any of those properties, make the recommended changes before re-indexing your collection.

Affected properties and the required updates:

minIterations, maxIterations

The new iterations property replaces the minIterations and maxIterations properties.

If your descriptor contains the minIterations or maxIterations properties, remove them and add the iterations property instead.

See the documentation of the new iterations property for guidelines about the value appropriate for your collection.

minUsableVectorsPercent, contextSizeSampling

Properties removed.

Remove the properties from your project descriptor.

model

The model property now accepts only one value: SKIP_GRAM.

If your project descriptor uses a different model (CBOW or COMPOSITE), remove the model property to use the default value.

negativeSamples

The default value of negativeSamples is now 4, which is suitable for the new default model (SKIP_GRAM).

If your project descriptor overrides the negativeSamples property, set its value in the 2.5 to 5.0 range for the best ratio of embedding quality to learning time.
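The updates listed above can be sketched as a small migration helper. This is a minimal sketch, not a Lingo4G tool: it assumes you have already loaded the label embedding learning section of your descriptor into a Python dict, and the function name, the warning text and the iterations value you pass in are all hypothetical. Choose the actual iterations value based on the documentation of that property.

```python
def migrate_label_embedding_section(section, iterations):
    """Return (migrated_copy, warnings) for a label embedding settings dict.

    `section` is assumed to hold the label embedding learning properties;
    where that section lives in the descriptor is up to the caller.
    """
    migrated = dict(section)  # leave the caller's dict untouched
    warnings = []

    # iterations replaces minIterations and maxIterations.
    if "minIterations" in migrated or "maxIterations" in migrated:
        migrated.pop("minIterations", None)
        migrated.pop("maxIterations", None)
        migrated.setdefault("iterations", iterations)

    # These properties were removed in 2.3.0.
    for name in ("minUsableVectorsPercent", "contextSizeSampling"):
        migrated.pop(name, None)

    # Only SKIP_GRAM is accepted now; dropping the property selects the default.
    if migrated.get("model") in ("CBOW", "COMPOSITE"):
        del migrated["model"]

    # Flag negativeSamples overrides outside the recommended 2.5 to 5.0 range.
    negative_samples = migrated.get("negativeSamples")
    if negative_samples is not None and not 2.5 <= negative_samples <= 5.0:
        warnings.append("negativeSamples is outside the recommended 2.5 to 5.0 range")

    return migrated, warnings
```

The helper only reports an out-of-range negativeSamples value instead of changing it, because the right value depends on your collection.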

Renamed properties. Version 2.3.0 introduces embedding similarity-weighted label collection. For consistency with the new implementation, the following properties of labelCollector:topFromFeatureFields have been renamed:

Old name      New name
minTf         minWeight
minTfMass     minWeightMass

Reindexing

Recommended. Re-index your project to take advantage of the improved quality of label embeddings.

New features

Initial support for custom vector fields

Lingo4G can now index vector fields, allowing the use of document embedding vectors from external sources like large language models (LLMs). All existing Lingo4G algorithms, such as clustering and 2d mapping, can take advantage of the external document vectors.

The Indexing LLM embeddings tutorial shows how to add sentence embeddings to the dataset-json-records toy project using Ollama.
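As a rough illustration of preparing external vectors, the sketch below adds an embedding to each JSON record by calling a locally running Ollama server. It assumes Ollama's /api/embeddings endpoint on its default port and the nomic-embed-text model. The add_embeddings helper and the field names passed to it are hypothetical, and the sketch does not cover how the vector field is declared in the project descriptor; the tutorial is the reference for that.

```python
import json
import urllib.request

# Default address of a local Ollama server (an assumption about your setup).
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def ollama_embedding(text, model="nomic-embed-text", url=OLLAMA_URL):
    """Request one embedding vector for `text` from Ollama."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embedding"]


def add_embeddings(records, text_field, vector_field, embed=ollama_embedding):
    """Yield copies of `records` with an embedding stored under `vector_field`.

    Both field names are hypothetical; use the names your project expects.
    """
    for record in records:
        enriched = dict(record)
        enriched[vector_field] = embed(record[text_field])
        yield enriched
```

Passing the embedding function as a parameter keeps the record handling independent of Ollama, so the same loop works with any other embedding source.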

Community Detection clustering

Lingo4G 2.3.0 introduces a new clustering algorithm based on community detection in graphs. The new algorithm is faster and easier to tune than the Affinity Propagation clustering offered in the previous release of Lingo4G.

See the documentation of the clusters:cd stage for more information and example requests.

Embedding-based label collection

Lingo4G 2.3.0 introduces the labelCollector:topEmbeddingNearestNeighbors label collector, which generates longer and more descriptive labels.

Additionally, the labelCollector:topFromFeatureFields collector can now weigh labels based on the label's and document's embedding vectors. Embedding-based weighting favors longer, more descriptive labels, while ensuring the collected labels do appear in the input documents. If label and document embeddings are available in your index, Lingo4G will perform embedding-weighted label collection by default.

labels:filtered stage added

Version 2.3.0 adds the labels:filtered stage, which you can use to compare lists of labels.

Content field-based similarities

Version 2.3.0 introduces the matrixRows:byQuery similarity, which computes document similarities based on the number of content field values the documents share.

Paired with the matrixRows:composite component, which fuses different similarity matrices, the new similarity lets you perform clustering and 2d mapping of documents based on a combination of textual and content field similarity criteria.

See the Similarity matrices tutorial and its Content field similarity and Composite similarity sections for in-depth explanations and examples.
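To make the idea concrete, here is a conceptual sketch of a shared-value similarity and a fused similarity matrix. It is not Lingo4G's implementation or request syntax: the function names are hypothetical, similarities are stored in a plain dict keyed by document index pairs, and the weighted sum used for fusion is an assumption about one reasonable way to combine matrices.

```python
def shared_value_similarity(docs, field):
    """Map each document pair (i, j), i < j, to the number of `field` values they share."""
    values = [set(doc.get(field, [])) for doc in docs]
    similarities = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            shared = len(values[i] & values[j])
            if shared:  # keep the matrix sparse: skip pairs with nothing in common
                similarities[(i, j)] = float(shared)
    return similarities


def composite(matrices, weights):
    """Fuse sparse similarity matrices with a weighted sum (assumed fusion rule)."""
    fused = {}
    for matrix, weight in zip(matrices, weights):
        for pair, value in matrix.items():
            fused[pair] = fused.get(pair, 0.0) + weight * value
    return fused
```

For example, a text-based matrix and the output of shared_value_similarity for a hypothetical tags field can be passed to composite together, with weights that control how much each criterion contributes.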

Custom Analyzer classes and separate analyzers for indexing and querying

You can now declare custom Analyzer implementations inside the analyzers block of the project descriptor and use different analyzers for indexing and for querying the index.

This setup is rather advanced, but can be useful for providing highly specific parsing of non-standard or structured text fields.

Improvements

Label embedding learning improvements

Lingo4G 2.3.0 significantly improves learning of label embeddings. Depending on your hardware, learning of label embeddings should now be up to 2x faster compared to the 2.2.x releases. The quality of label and document embeddings is also improved.

Version 2.3.0 removes the CBOW and COMPOSITE models in favour of SKIP_GRAM, which provides higher-quality labels and is now much faster to learn. See the compatibility section for the detailed list of updates to the label embedding learning configuration.