vectors
The vectors:* stages return sets of multidimensional embedding vectors, which you can then use to compute vector-based similarity matrices.
The matrix:knnVectorsSimilarity stage is the most likely consumer of the results computed by the vectors:* stages. To compute vector-based similarities between, for example, a set of documents matching a query, your request first needs to use vectors:precomputedDocumentEmbeddings to prepare a subset of embedding vectors corresponding to your documents and then submit that subset of vectors to the matrix computation stage.
You can use the following vectors source stages in your analysis requests:
- vectors:fromEmbeddingService
  Returns embedding vectors for one or more fields, computed using an external service providing text embeddings.
- vectors:fromVectorField
  Returns vectors from an explicit field (requires that an external vector field is added to documents during indexing).
- vectors:precomputedDocumentEmbeddings
  Returns the precomputed document embedding vectors, narrowed down to the list of documents you provide.
- vectors:precomputedLabelEmbeddings
  Returns the precomputed label embedding vectors, narrowed down to the list of labels you provide.
- vectors:reference
  References the results of another vectors:* stage defined in the request.
To avoid large responses, the JSON output of the vectors stage does not include the actual vectors, but only the total number of vectors and a list of undefined vectors.
vectors:​from​Embedding​Service
Returns embedding vectors for all text fields provided by the documentContent property. All fields are concatenated into a single text block, which is then sent to an external embeddingService:* component. This stage is useful for retrieving dynamically computed embeddings for a relatively small number of documents, because computing embeddings using large language models is typically costly.
{
  "type": "vectors:fromEmbeddingService",
  "documentContent": null,
  "embeddingService": {
    "type": "embeddingService:reference",
    "auto": true
  },
  "valueSeparator": "\n"
}
document​Content
In this property, you should provide the documents and text fields to compute embeddings for. The documentContent property is an instance of the documentContent stage: it can include one or more field specifications, including maximum length limits.
embedding​Service
A reference to an embedding service component. Embedding service components may only be declared in the project descriptor's shared components section.
value​Separator
A text separator inserted between multiple field values, if more than one field (or value) is present in a document.
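As a rough illustration of how valueSeparator shapes the text sent to the embedding service, the Python sketch below joins the field values of one document into a single block. The function name and the sample values are hypothetical; only the joining behavior comes from the description above, and the actual concatenation happens inside Lingo4G.

```python
def build_embedding_text(field_values, value_separator="\n"):
    """Join all field values of one document into a single text block.

    Hypothetical helper that mirrors the documented behavior; it is not
    Lingo4G code.
    """
    # The separator is placed only between values, never before the first
    # value or after the last one.
    return value_separator.join(field_values)


# Hypothetical document with a title value and an abstract value.
text = build_embedding_text(["Vector search", "An overview of kNN indices."])
print(repr(text))
```

With a single value, the separator does not appear in the output at all.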
vectors:​from​Vector​Field
Retrieves vector data from an explicit document field. The data must be provided during document indexing, most likely computed from an external source (like an external vector model).
{
  "type": "vectors:fromVectorField",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "fieldName": null,
  "maxInMemoryKnnSubIndexSize": 50000000
}
documents
The set of documents for which vector field data should be returned.
field​Name
Document field containing external vector data added during indexing.
max​In​Memory​Knn​Sub​Index​Size
Maximum size, in bytes, of the in-memory temporary kNN index for vectors produced by this stage.
Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of documents by creating and querying a temporary kNN index containing just the vectors corresponding to those documents. Lingo4G creates the temporary index only when its estimated size in bytes is smaller than the maxInMemoryKnnSubIndexSize threshold.
With the default value of the threshold, Lingo4G will create temporary kNN indices for up to ~400k document vectors.
To disable the creation of the temporary kNN indices, set maxInMemoryKnnSubIndexSize to 0. We don't recommend doing this in production settings as it can slow down the computation of vector-based similarities by an order of magnitude.
vectors:​precomputed​Document​Embeddings
Returns the precomputed document embedding vectors, narrowed down to the list of documents you provide.
{
  "type": "vectors:precomputedDocumentEmbeddings",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "maxInMemoryKnnSubIndexSize": 50000000
}
You can use this stage to compute vector-based similarities between a set of documents. To do so, a request combines the vectors:precomputedDocumentEmbeddings stage with documents:byQuery to build a subset of document embeddings. Then, the matrix:knnVectorsSimilarity stage uses that subset of vectors to compute the similarity matrix.
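A request of that kind might be assembled as in the Python sketch below. Only the stage types and the documents and vectors property names come from this page. The top-level stages wrapper, the stage name, the shape of the query specification and the query text are assumptions made for illustration and should be checked against the request and documents:byQuery reference.

```python
import json

# Sketch of an analysis request; see the assumptions listed above.
request = {
    "stages": {  # assumption: top-level container for named stages
        "similarities": {  # hypothetical stage name
            "type": "matrix:knnVectorsSimilarity",
            "vectors": {
                "type": "vectors:precomputedDocumentEmbeddings",
                "documents": {
                    "type": "documents:byQuery",
                    # assumption: shape of the query specification
                    "query": {"type": "query:string", "query": "clustering"},
                },
            },
        }
    }
}

# Serialize the request body as JSON text.
print(json.dumps(request, indent=2))
```

Nesting the stages inline keeps the data flow visible: the query selects documents, the vectors stage narrows the embeddings down to those documents, and the matrix stage consumes the resulting vectors.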
documents
The set of documents to which the document embedding vectors are narrowed down.
max​In​Memory​Knn​Sub​Index​Size
Maximum size, in bytes, of the in-memory temporary kNN index for vectors produced by this stage.
Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of documents by creating and querying a temporary kNN index containing just the vectors corresponding to those documents. Lingo4G creates the temporary index only when its estimated size in bytes is smaller than the maxInMemoryKnnSubIndexSize threshold.
With the default value of the threshold, Lingo4G will create temporary kNN indices for up to ~400k document vectors.
To disable the creation of the temporary kNN indices, set maxInMemoryKnnSubIndexSize to 0. We don't recommend doing this in production settings as it can slow down the computation of vector-based similarities by an order of magnitude.
vectors:​precomputed​Label​Embeddings
Returns the precomputed label embedding vectors, narrowed down to the list of labels you provide.
{
  "type": "vectors:precomputedLabelEmbeddings",
  "labels": {
    "type": "labels:reference",
    "auto": true
  },
  "maxLabelsForSubIndex": 0.05
}
You can use this stage to compute vector-based similarities between a set of labels. To do so, a request combines the vectors:precomputedLabelEmbeddings stage with labels:fromDocuments to build a subset of label embeddings for labels related to a query such as clustering. Then, the matrix:knnVectorsSimilarity stage uses that subset of vectors to compute the similarity matrix.
labels
The list of labels to which the label embedding vectors are narrowed down.
max​Labels​For​Sub​Index
Determines the threshold for creating a temporary kNN index.
Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of labels by creating and querying a temporary kNN index containing just the vectors corresponding to the input labels. Lingo4G creates the temporary index only when the number of input labels divided by the total number of labels in the index is smaller than or equal to the value of this property.
For example, if maxLabelsForSubIndex is 0.3 and the input labels list contains at most 30% of all labels in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
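The ratio rule can be expressed as a short Python sketch. The function name and the label counts are hypothetical; the inclusive comparison and the 0.05 default come from the description and the stage definition above.

```python
def uses_temporary_label_index(input_labels, total_labels,
                               max_labels_for_sub_index=0.05):
    """Return True when the input labels are a small enough share of the index.

    Hypothetical helper illustrating the documented rule; not Lingo4G code.
    """
    if total_labels <= 0:
        raise ValueError("total_labels must be positive")
    # Inclusive comparison, as stated in the property description.
    return input_labels / total_labels <= max_labels_for_sub_index


# Hypothetical index with 100 labels in total and a threshold of 0.3.
print(uses_temporary_label_index(30, 100, 0.3))  # exactly 30% of all labels
print(uses_temporary_label_index(31, 100, 0.3))  # just above the threshold
```

Under this rule, a value of 1.0 makes every label list qualify for a temporary index, while 0.0 excludes every non-empty list. Those extremes remove the size-based choice entirely, which may be why they are discouraged in production.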
Consumers of vectors:*
The following stages and components take vectors:* as input:
Stage or component | Property |
---|---|
matrix:​knn​Vectors​Similarity | vectors |
matrixRows:knnVectorsSimilarity | rows, columns |