2.3.x release notes
Release notes for Lingo4G 2.3.x.
Version 2.3.0
Release 2.3.0 comes with the following new features and improvements:
- External document vectors. Initial support for externally provided document vectors, such as large language model (LLM) document embeddings.
- Better built-in embeddings. Improved quality and performance of the built-in label and document embedding vectors.
- New clustering algorithm based on community detection in graphs.
- More meaningful, longer labels extracted from documents by default.
- Document similarity computation based on arbitrary search fields.
Compatibility
- Project descriptor
  Updates required. Lingo4G 2.3.0 improves the label embedding learning algorithm and, as a result, changes the label embedding learning section of the project descriptor. If your project uses any of the properties listed below, make the recommended changes before re-indexing your collection.

  Affected properties and required updates:

  - minIterations, maxIterations: The new iterations property replaces the minIterations and maxIterations properties. If your descriptor contains the minIterations or maxIterations properties, remove them and add the iterations property instead. See the documentation of the new iterations property for guidelines about the value appropriate for your collection.
  - minUsableVectorsPercent, contextSizeSampling: Properties removed. Remove the properties from your project descriptor.
  - model: The model property now accepts only one value: SKIP_GRAM. If your project descriptor uses a different model (CBOW or COMPOSITE), remove the model property to use the default value.
  - negativeSamples: The default value of negativeSamples is now 4, which is suitable for the new default model (SKIP_GRAM). If your project descriptor overrides the negativeSamples property, set its value within the 2.5 to 5.0 range for the best ratio of embedding quality to learning time.

  Renamed properties. Version 2.3.0 introduces embedding similarity-weighted label collection. For consistency with the new implementation, the following properties of labelCollector:topFromFeatureFields have been renamed:

  - minTf is now minWeight
  - minTfMass is now minWeightMass
- Reindexing
  Recommended. Re-index your project to take advantage of the improved quality of label embeddings.
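The descriptor updates listed in this section can be sketched as a small Python helper. This is only an illustration under stated assumptions, not a tool shipped with Lingo4G: the function names are hypothetical, and the helper expects the caller to pass in the label embedding learning section (or the labelCollector:topFromFeatureFields object) as a plain dictionary, because where those objects sit inside your project descriptor or request is not covered by these notes. Only the property names and the 2.5 to 5.0 range come from the text above.

```python
from copy import deepcopy

# Property names taken from the 2.3.0 compatibility notes.
REPLACED_BY_ITERATIONS = ("minIterations", "maxIterations")
REMOVED = ("minUsableVectorsPercent", "contextSizeSampling")
RENAMED = {"minTf": "minWeight", "minTfMass": "minWeightMass"}


def migrate_label_embedding_section(section):
    """Return (migrated_copy, notes) for a 2.2.x label embedding section.

    Hypothetical helper: `section` is the label embedding learning
    section of the descriptor, already extracted as a dict.
    """
    migrated = deepcopy(section)
    notes = []

    # minIterations / maxIterations are replaced by the iterations property.
    # No value is invented here; the notes point to the property's docs.
    dropped = [name for name in REPLACED_BY_ITERATIONS if name in migrated]
    for name in dropped:
        del migrated[name]
    if dropped and "iterations" not in migrated:
        notes.append(
            "removed " + ", ".join(dropped)
            + "; add 'iterations' (see its documentation for a suitable value)"
        )

    # Properties removed outright in 2.3.0.
    for name in REMOVED:
        if name in migrated:
            del migrated[name]
            notes.append("removed " + name)

    # Only SKIP_GRAM is accepted; dropping the property uses the default.
    if migrated.get("model") in ("CBOW", "COMPOSITE"):
        del migrated["model"]
        notes.append("removed 'model' to fall back to the default")

    # Flag, but do not change, an override outside the recommended range.
    samples = migrated.get("negativeSamples")
    if samples is not None and not 2.5 <= samples <= 5.0:
        notes.append(
            f"negativeSamples={samples} is outside the recommended 2.5 to 5.0 range"
        )

    return migrated, notes


def rename_collector_properties(collector):
    """Rename minTf / minTfMass in a topFromFeatureFields object.

    Values are carried over unchanged; whether they need retuning for
    embedding-weighted collection is not covered by the release notes.
    """
    return {RENAMED.get(key, key): value for key, value in collector.items()}
```

Both functions return new dictionaries and leave their inputs untouched, so you can compare the original and the migrated configuration before writing anything back to disk.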
New features
- Initial support for custom vector fields
  Lingo4G can now index vector fields, allowing the use of document embedding vectors from external sources such as large language models (LLMs). All existing Lingo4G algorithms, such as clustering and 2d mapping, can take advantage of the external document vectors.

  The Indexing LLM embeddings tutorial shows how to add sentence embeddings to the dataset-json-records toy project using Ollama.
- Community Detection clustering
  Lingo4G 2.3.0 introduces a new clustering algorithm based on community detection in graphs. This new algorithm is faster and easier to tune than the Affinity Propagation clustering offered in the previous release of Lingo4G.

  See the documentation of the clusters:cd stage for more information and example requests.
- Embedding-based label collection
  Lingo4G 2.3.0 introduces the labelCollector:topEmbeddingNearestNeighbors label collector, which generates longer and more descriptive labels.

  Additionally, the labelCollector:topFromFeatureFields collector can now weigh labels based on the label's and document's embedding vectors. Embedding-based weighting favors longer, more descriptive labels, while ensuring the collected labels do appear in the input documents. If label and document embeddings are available in your index, Lingo4G will perform embedding-weighted label collection by default.
- labels:filtered stage added
  Version 2.3.0 adds the labels:filtered stage, which you can use to compare lists of labels.
- Content field-based similarities
  Version 2.3.0 introduces the matrixRows:byQuery similarity, which computes document similarities based on the number of content field values the documents share.

  Paired with the matrixRows:composite component, which fuses different similarity matrices, you can now perform clustering and 2d mapping of documents based on a combination of textual and content field similarity criteria.

  See the Similarity matrices tutorial and its Content field similarity and Composite similarity sections for in-depth explanations and examples.
- Custom Analyzer classes and separate analyzers for indexing and querying
  You can now declare custom Analyzer implementations inside the analyzers block of the project descriptor and use different analyzers for indexing and for querying the index.

  This setup is rather advanced, but can be useful for providing highly specific parsing of non-standard or structured text fields.
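To make the content field similarity and composite similarity ideas from this section more concrete, here is a rough conceptual sketch in Python. It is not Lingo4G's implementation or API. The function names are hypothetical, the "number of shared field values" measure follows the description above, and the weighted sum used for fusion is an assumption chosen for simplicity; the matrixRows:composite component may combine matrices differently.

```python
def shared_value_similarity(values_a, values_b):
    """Number of distinct field values two documents have in common."""
    return len(set(values_a) & set(values_b))


def pairwise(items, similarity):
    """Build a symmetric similarity matrix with a zero diagonal."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = float(similarity(items[i], items[j]))
    return matrix


def fuse(matrices, weights):
    """Weighted sum of same-sized matrices (an assumed fusion rule)."""
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical data: tag values of three documents and a made-up
# textual similarity matrix for the same three documents.
tags = [["ml", "nlp"], ["nlp", "search"], ["db"]]
text_similarity = [
    [0.0, 0.2, 0.6],
    [0.2, 0.0, 0.1],
    [0.6, 0.1, 0.0],
]

field_similarity = pairwise(tags, shared_value_similarity)
combined = fuse([text_similarity, field_similarity], weights=[1.0, 0.5])
```

In this toy example the first two documents share one tag, so their combined similarity rises above the purely textual value, while pairs with no tags in common keep their textual similarity.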
Improvements
- Label embedding learning improvements
  Lingo4G 2.3.0 significantly improves learning of label embeddings. Depending on your hardware, learning of label embeddings should now be up to 2x faster than in the 2.2.x releases. The quality of label and document embeddings is also improved.

  Version 2.3.0 removes the CBOW and COMPOSITE models in favor of SKIP_GRAM, which provides higher-quality labels and is now much faster to learn. See the compatibility section for the detailed list of updates to the label embedding learning configuration.