vectors

The vectors:​* stages return sets of multidimensional embedding vectors which you can then use to compute vector-based similarity matrices.

The matrix:​knn​Vectors​Similarity stage is the most likely consumer of the results computed by the vectors:​* stages. To compute vector-based similarities between, for example, a set of documents matching a query, your request first needs to use vectors:​precomputed​Document​Embeddings to prepare a subset of embedding vectors corresponding to your documents and then submit the subset of vectors to the matrix computation stage.


You can use the following vectors source stages in your analysis requests:

vectors:​precomputed​Document​Embeddings

Returns the precomputed document embedding vectors, narrowed down to the list of documents you provide.

vectors:​precomputed​Label​Embeddings

Returns the precomputed label embedding vectors, narrowed down to the list of labels you provide.


vectors:​reference

References the results of another vectors:​* stage defined in the request.
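A request can compute a set of vectors once and reuse it through vectors:reference. The following is a minimal sketch rather than a verified request. The stage name documentVectors is hypothetical, and the use property that points the reference at that stage is an assumption: the examples in this section show only "auto": true on other reference stages.

```json
{
  "name": "Sketch: reusing document vectors through vectors:reference (the 'use' property is an assumption)",
  "stages": {
    "documentVectors": {
      "type": "vectors:precomputedDocumentEmbeddings",
      "documents": {
        "type": "documents:byQuery",
        "query": {
          "type": "query:string",
          "query": "clustering"
        }
      }
    },
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:reference",
        "use": "documentVectors"
      }
    }
  }
}
```

Declaring the vectors as a separate named stage keeps the definition of the document subset in one place if several stages in the request need the same vectors.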


To avoid large responses, the JSON output of a vectors:* stage does not include the actual vectors. It contains only the total number of vectors and a list of undefined vectors.

vectors:​precomputed​Document​Embeddings

Returns the precomputed document embedding vectors, narrowed down to the list of documents you provide.

{
  "type": "vectors:precomputedDocumentEmbeddings",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "maxInMemoryKnnSubIndexSize": 50000000
}

You can use this stage to compute vector-based similarities between a set of documents:

{
  "name": "Computing vector-based similarities between documents matching a query",
  "stages": {
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:precomputedDocumentEmbeddings",
        "documents": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "clustering"
          }
        }
      }
    }
  }
}

Using vectors:precomputedDocumentEmbeddings to build a similarity matrix between documents matching the clustering query.

The request combines the vectors:​precomputed​Document​Embeddings stage with documents:​by​Query to build a subset of document embeddings. Then, the matrix:​knn​Vectors​Similarity stage uses the subset of vectors to compute the similarity matrix.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The set of documents to which to narrow down the set of document embedding vectors.

max​In​Memory​Knn​Sub​Index​Size

Type
number
Default
50000000
Required
no

Maximum size, in bytes, of the in-memory temporary kNN index for vectors produced by this stage.

Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of documents by creating and querying a temporary kNN index containing just the vectors corresponding to those documents. Lingo4G creates the temporary index only when its estimated size in bytes is smaller than the max​In​Memory​Knn​Sub​Index​Size threshold.

With the default value of the threshold, Lingo4G will create temporary kNN indices for up to 400k document vectors.

To disable the creation of the temporary kNN indices, set max​In​Memory​Knn​Sub​Index​Size to 0. We don't recommend doing this in production settings as it can slow down the computation of vector-based similarities by an order of magnitude.
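As a sketch of the setting described above, the stage definition below turns the temporary kNN index off. Apart from the threshold, it repeats the default stage definition shown earlier. This configuration may be useful for comparing timings in a test environment, not for production use.

```json
{
  "type": "vectors:precomputedDocumentEmbeddings",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "maxInMemoryKnnSubIndexSize": 0
}
```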

vectors:​precomputed​Label​Embeddings

Returns the precomputed label embedding vectors, narrowed down to the list of labels you provide.

{
  "type": "vectors:precomputedLabelEmbeddings",
  "labels": {
    "type": "labels:reference",
    "auto": true
  },
  "maxLabelsForSubIndex": 0.05
}

You can use this stage to compute vector-based similarities between a set of labels:

{
  "name": "Computing vector-based similarities between labels",
  "stages": {
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:precomputedLabelEmbeddings",
        "labels": {
          "type": "labels:fromDocuments",
          "documents": {
            "type": "documents:byQuery",
            "query": {
              "type": "query:string",
              "query": "clustering"
            }
          },
          "maxLabels": {
            "type": "labelCount:fixed",
            "value": 200
          }
        }
      }
    }
  }
}

Using vectors:​precomputed​Label​Embeddings to build a similarity matrix between labels related to clustering.

The request combines the vectors:​precomputed​Label​Embeddings stage with labels:​from​Documents to build a subset of label embeddings for labels related to the query clustering. Then, the matrix:​knn​Vectors​Similarity stage uses the subset of vectors to compute the similarity matrix.

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

The list of labels to which to narrow down the set of embedding vectors.

max​Labels​For​Sub​Index

Type
number
Default
0.05
Constraints
value >= 0 and value <= 1
Required
no

Determines the threshold for creating a temporary kNN index.

Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of labels by creating and querying a temporary kNN index containing just the vectors corresponding to the input labels. Lingo4G creates the temporary index only when the number of input labels divided by the total number of labels in the index is smaller than or equal to the value of this property.

For example, if maxLabelsForSubIndex is 0.3 and the input list contains no more than 30% of all labels in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
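The stage definition below sets the threshold to the value used in the example. The index size here is hypothetical: in an index containing 10,000 labels, a threshold of 0.3 would let Lingo4G create the temporary index for input lists of up to 3,000 labels.

```json
{
  "type": "vectors:precomputedLabelEmbeddings",
  "labels": {
    "type": "labels:reference",
    "auto": true
  },
  "maxLabelsForSubIndex": 0.3
}
```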

Consumers of vectors:​*

The following stages and components take vectors:​* as input:

Stage or component Property
matrix:knnVectorsSimilarity
  • vectors
matrixRows:knnVectorsSimilarity
  • rows
  • columns
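The earlier examples show only the vectors property of matrix:knnVectorsSimilarity. The sketch below illustrates how vectors:* stages might be supplied to the rows and columns properties of matrixRows:knnVectorsSimilarity. It is an unverified illustration: the stage name similarRows is hypothetical, the second query is arbitrary, and the sketch assumes that the stage needs no other properties beyond rows and columns.

```json
{
  "name": "Sketch: passing vectors:* stages to rows and columns (other properties assumed to have usable defaults)",
  "stages": {
    "similarRows": {
      "type": "matrixRows:knnVectorsSimilarity",
      "rows": {
        "type": "vectors:precomputedLabelEmbeddings",
        "labels": {
          "type": "labels:fromDocuments",
          "documents": {
            "type": "documents:byQuery",
            "query": {
              "type": "query:string",
              "query": "clustering"
            }
          }
        }
      },
      "columns": {
        "type": "vectors:precomputedLabelEmbeddings",
        "labels": {
          "type": "labels:fromDocuments",
          "documents": {
            "type": "documents:byQuery",
            "query": {
              "type": "query:string",
              "query": "classification"
            }
          }
        }
      }
    }
  }
}
```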