vectors
The vectors:* stages return sets of multidimensional embedding vectors, which you can then use to compute vector-based similarity matrices.

The matrix:knnVectorsSimilarity stage is the most likely consumer of the results computed by the vectors:* stages. To compute vector-based similarities between, for example, a set of documents matching a query, your request first needs to use vectors:precomputedDocumentEmbeddings to prepare a subset of embedding vectors corresponding to your documents, and then submit that subset of vectors to the matrix computation stage.
You can use the following vectors source stages in your analysis requests:
- vectors:precomputedDocumentEmbeddings: Returns the precomputed document embedding vectors, narrowed down to the list of documents you provide.
- vectors:precomputedLabelEmbeddings: Returns the precomputed label embedding vectors, narrowed down to the list of labels you provide.
- vectors:reference: References the results of another vectors:* stage defined in the request.
To avoid large responses, the JSON output of the vectors stage does not include the actual vectors, but only the total number of vectors and a list of undefined vectors.
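As a rough illustration of how vectors:reference could let a consumer stage point at a vectors:* stage defined elsewhere in the request, consider the sketch below. The stage type names come from this page. Everything else is an assumption made for illustration: the top-level stages object, the stage names documentVectors and similarities, the use property of the reference, and the query:string query with its example text.

```json
{
  "stages": {
    "documentVectors": {
      "type": "vectors:precomputedDocumentEmbeddings",
      "documents": {
        "type": "documents:byQuery",
        "query": {
          "type": "query:string",
          "query": "clustering"
        }
      }
    },
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:reference",
        "use": "documentVectors"
      }
    }
  }
}
```

In a layout like this, the documentVectors entry in the response would carry only the vector count and the list of undefined vectors, not the vectors themselves.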
vectors:​precomputed​Document​Embeddings
Returns the precomputed document embedding vectors, narrowed down to the list of documents you provide.
{
  "type": "vectors:precomputedDocumentEmbeddings",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "maxInMemoryKnnSubIndexSize": 50000000
}
You can use this stage to compute vector-based similarities between a set of documents. Such a request combines the vectors:precomputedDocumentEmbeddings stage with documents:byQuery to build a subset of document embeddings. Then, the matrix:knnVectorsSimilarity stage uses the subset of vectors to compute the similarity matrix.
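A request of the shape described above might look like the following minimal sketch. The three stage types and the documents and vectors properties are taken from this page. The top-level stages object, the stage name similarityMatrix, and the query:string query with its example text are assumptions, so check them against the request reference before use.

```json
{
  "stages": {
    "similarityMatrix": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:precomputedDocumentEmbeddings",
        "documents": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "clustering"
          }
        }
      }
    }
  }
}
```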
documents
The set of documents to which to narrow down the set of document embedding vectors.
max​In​Memory​Knn​Sub​Index​Size
Maximum size, in bytes, of the in-memory temporary kNN index for vectors produced by this stage.
Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of documents by creating and querying a temporary kNN index containing just the vectors corresponding to those documents. Lingo4G creates the temporary index only when its estimated size in bytes is smaller than the maxInMemoryKnnSubIndexSize threshold.

With the default value of the threshold, Lingo4G will create temporary kNN indices for up to 400k document vectors.

To disable the creation of the temporary kNN indices, set maxInMemoryKnnSubIndexSize to 0. We don't recommend doing this in production settings as it can slow down the computation of vector-based similarities by an order of magnitude.
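For example, to measure the effect of the temporary index in a test environment, a stage definition along the following lines would turn it off. This is only a sketch: it reuses the stage definition shown earlier in this section and changes the threshold to 0.

```json
{
  "type": "vectors:precomputedDocumentEmbeddings",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "maxInMemoryKnnSubIndexSize": 0
}
```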
vectors:​precomputed​Label​Embeddings
Returns the precomputed label embedding vectors, narrowed down to the list of labels you provide.
{
  "type": "vectors:precomputedLabelEmbeddings",
  "labels": {
    "type": "labels:reference",
    "auto": true
  },
  "maxLabelsForSubIndex": 0.05
}
You can use this stage to compute vector-based similarities between a set of labels. Such a request combines the vectors:precomputedLabelEmbeddings stage with labels:fromDocuments to build a subset of label embeddings for labels related to the query "clustering". Then, the matrix:knnVectorsSimilarity stage uses the subset of vectors to compute the similarity matrix.
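The combination described above might be written as in the sketch below. The stage types and the labels and vectors properties come from this page. The top-level stages object, the stage name labelSimilarityMatrix, the documents property of labels:fromDocuments, and the query:string query are assumptions made for illustration.

```json
{
  "stages": {
    "labelSimilarityMatrix": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:precomputedLabelEmbeddings",
        "labels": {
          "type": "labels:fromDocuments",
          "documents": {
            "type": "documents:byQuery",
            "query": {
              "type": "query:string",
              "query": "clustering"
            }
          }
        }
      }
    }
  }
}
```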
labels
The list of labels to which to narrow down the set of embedding vectors.
max​Labels​For​Sub​Index
Determines the threshold for creating a temporary kNN index.
Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of labels by creating and querying a temporary kNN index containing just the vectors corresponding to the input labels. Lingo4G creates the temporary index only when the number of input labels divided by the total number of labels in the index is smaller than or equal to the value of this property.

For example, if maxLabelsForSubIndex is 0.3 and the input labels list contains no more than 30% of all labels in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
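To make the threshold concrete, take a hypothetical index that contains 10,000 labels in total. With the setting sketched below, an input list of 2,500 labels gives a ratio of 2,500 / 10,000 = 0.25, which does not exceed 0.3, so the temporary index would be created. An input list of 4,000 labels gives 0.4, which exceeds 0.3, so it would not. The label counts are invented for illustration; the stage definition repeats the one shown earlier with only the threshold changed.

```json
{
  "type": "vectors:precomputedLabelEmbeddings",
  "labels": {
    "type": "labels:reference",
    "auto": true
  },
  "maxLabelsForSubIndex": 0.3
}
```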
Consumers of vectors:*
The following stages and components take vectors:* as input:
Stage or component | Property |
---|---|
matrix:​knn​Vectors​Similarity | vectors |
matrixRows:knnVectorsSimilarity | rows, columns |