vectors

The vectors:* stages return sets of multidimensional embedding vectors, which you can then use to compute vector-based similarity matrices.

The matrix:​knn​Vectors​Similarity stage is the most likely consumer of the results computed by the vectors:​* stages. To compute vector-based similarities between, for example, a set of documents matching a query, your request first needs to use vectors:​precomputed​Document​Embeddings to prepare a subset of embedding vectors corresponding to your documents and then submit the subset of vectors to the matrix computation stage.


You can use the following vectors source stages in your analysis requests:

vectors:​from​Embedding​Service

Returns embedding vectors for one or more fields, computed using an external service providing text embeddings.

vectors:​from​Vector​Field

Returns vectors from an explicit field (requires that an external vector field is added to documents during indexing).

vectors:​precomputed​Document​Embeddings

Returns the precomputed document embedding vectors, narrowed down to the list of documents you provide.

vectors:​precomputed​Label​Embeddings

Returns the precomputed label embedding vectors, narrowed down to the list of labels you provide.


vectors:​reference

References the results of another vectors:​* stage defined in the request.


To avoid large responses, the JSON output of the vectors stage does not include the actual vectors, but only the total number of vectors and a list of undefined vectors.

vectors:​from​Embedding​Service

Returns embedding vectors for all text fields provided by the documentContent property. All fields are concatenated into a single text block, which is then sent to an external embedding​Service:​* component. This stage is useful to retrieve dynamically computed embeddings for a relatively small number of documents (because computing embeddings using large language models is typically costly).

{
  "type": "vectors:fromEmbeddingService",
  "documentContent": null,
  "embeddingService": {
    "type": "embeddingService:reference",
    "auto": true
  },
  "valueSeparator": "\n"
}

document​Content

Type
documentContent
Default
null
Required
yes

In this property, you should provide documents and text fields to compute embeddings for. The document​Content property is an instance of the documentContent stage: it can include one or more field specifications, including maximum length limits.

embedding​Service

Type
embeddingService
Default
{
  "type": "embeddingService:reference",
  "auto": true
}
Required
no

A reference to an embedding service component. Embedding service components may only be declared in the project descriptor's shared components section.

value​Separator

Type
string
Default
"\n"
Required
no

A text separator inserted between multiple field values, if more than one field (or value) is present in a document.

vectors:​from​Vector​Field

Retrieves vector data from an explicit document field. The data must be provided during document indexing, most likely computed from an external source (like an external vector model).

{
  "type": "vectors:fromVectorField",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "fieldName": null,
  "maxInMemoryKnnSubIndexSize": 50000000
}

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The set of documents for which vector field data should be returned.

field​Name

Type
project:vectorFields
Default
null
Required
yes

Document field containing external vector data added during indexing.
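A minimal sketch of a complete request that uses this stage, modeled on the request structure used elsewhere in this section. The field name my_vector_field is hypothetical: replace it with a vector field declared in your own project. The query is only an example.

```json
{
  "name": "Similarities from an external vector field (my_vector_field is a hypothetical field name)",
  "stages": {
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:fromVectorField",
        "fieldName": "my_vector_field",
        "documents": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "clustering"
          }
        }
      }
    }
  }
}
```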

max​In​Memory​Knn​Sub​Index​Size

Type
number
Default
50000000
Required
no

Maximum size, in bytes, of the in-memory temporary kNN index for vectors produced by this stage.

Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of documents by creating and querying a temporary kNN index containing just the vectors corresponding to those documents. Lingo4G creates the temporary index only when its estimated size in bytes is smaller than the max​In​Memory​Knn​Sub​Index​Size threshold.

With the default value of the threshold, Lingo4G will create temporary kNN indices for up to ~400k document vectors.

To disable the creation of the temporary kNN indices, set max​In​Memory​Knn​Sub​Index​Size to 0. We don't recommend doing this in production settings as it can slow down the computation of vector-based similarities by an order of magnitude.

vectors:​precomputed​Document​Embeddings

Returns the precomputed document embedding vectors, narrowed down to the list of documents you provide.

{
  "type": "vectors:precomputedDocumentEmbeddings",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "maxInMemoryKnnSubIndexSize": 50000000
}

You can use this stage to compute vector-based similarities between a set of documents:

{
  "name": "Computing vector-based similarities between documents matching a query",
  "stages": {
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:precomputedDocumentEmbeddings",
        "documents": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "clustering"
          }
        }
      }
    }
  }
}

Using vectors:precomputedDocumentEmbeddings to build a similarity matrix between documents matching the clustering query.

The request combines the vectors:​precomputed​Document​Embeddings stage with documents:​by​Query to build a subset of document embeddings. Then, the matrix:​knn​Vectors​Similarity stage uses the subset of vectors to compute the similarity matrix.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The set of documents to which to narrow down the set of document embedding vectors.
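The default value of this property is an automatic reference, so the document selection can presumably also live in its own named stage. The sketch below assumes that a documents:reference with auto set to true resolves to the only documents:* stage declared in the request. This section does not describe that behavior, so treat it as an assumption to verify. The stage name documents is arbitrary.

```json
{
  "name": "Document embeddings narrowed down through an automatic reference (assumed resolution behavior)",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:precomputedDocumentEmbeddings",
        "documents": {
          "type": "documents:reference",
          "auto": true
        }
      }
    }
  }
}
```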

max​In​Memory​Knn​Sub​Index​Size

Type
number
Default
50000000
Required
no

Maximum size, in bytes, of the in-memory temporary kNN index for vectors produced by this stage.

Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of documents by creating and querying a temporary kNN index containing just the vectors corresponding to those documents. Lingo4G creates the temporary index only when its estimated size in bytes is smaller than the max​In​Memory​Knn​Sub​Index​Size threshold.

With the default value of the threshold, Lingo4G will create temporary kNN indices for up to approximately 400k document vectors.

To disable the creation of the temporary kNN indices, set max​In​Memory​Knn​Sub​Index​Size to 0. We don't recommend doing this in production settings as it can slow down the computation of vector-based similarities by an order of magnitude.
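As a rough illustration of the figures above: if 50,000,000 bytes accommodates about 400k vectors, the estimate works out to roughly 125 bytes per vector, so a threshold of 100,000,000 bytes would be expected to cover about 800k vectors. This assumes that the size estimate scales linearly with the number of vectors and that your embeddings are the same size as those behind the 400k figure. This section confirms neither, and the value below is purely illustrative.

```json
{
  "type": "vectors:precomputedDocumentEmbeddings",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "maxInMemoryKnnSubIndexSize": 100000000
}
```

A larger threshold trades memory for speed, so it is probably worth raising only when your document subsets regularly exceed the default limit.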

vectors:​precomputed​Label​Embeddings

Returns the precomputed label embedding vectors, narrowed down to the list of labels you provide.

{
  "type": "vectors:precomputedLabelEmbeddings",
  "labels": {
    "type": "labels:reference",
    "auto": true
  },
  "maxLabelsForSubIndex": 0.05
}

You can use this stage to compute vector-based similarities between a set of labels:

{
  "name": "Computing vector-based similarities between labels",
  "stages": {
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:precomputedLabelEmbeddings",
        "labels": {
          "type": "labels:fromDocuments",
          "documents": {
            "type": "documents:byQuery",
            "query": {
              "type": "query:string",
              "query": "clustering"
            }
          },
          "maxLabels": {
            "type": "labelCount:fixed",
            "value": 200
          }
        }
      }
    }
  }
}

Using vectors:​precomputed​Label​Embeddings to build a similarity matrix between labels related to clustering.

The request combines the vectors:​precomputed​Label​Embeddings stage with labels:​from​Documents to build a subset of label embeddings for labels related to the query clustering. Then, the matrix:​knn​Vectors​Similarity stage uses the subset of vectors to compute the similarity matrix.

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

The list of labels to which to narrow down the set of embedding vectors.

max​Labels​For​Sub​Index

Type
number
Default
0.05
Constraints
value >= 0 and value <= 1
Required
no

Determines the threshold for creating a temporary kNN index.

Lingo4G can significantly speed up the computation of vector-based similarities for a small subset of labels by creating and querying a temporary kNN index containing just the vectors corresponding to the input labels. Lingo4G creates the temporary index only when the number of input labels divided by the total number of labels in the index is less than or equal to the value of this property.

For example, if maxLabelsForSubIndex is 0.3 and the input labels list contains at most 30% of all labels in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
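The 0.3 setting from the example above could be applied to the label similarity request shown earlier roughly as follows. Only maxLabelsForSubIndex differs from that request, and 0.3 is an illustrative value rather than a recommendation.

```json
{
  "name": "Label similarities with a custom temporary kNN index threshold (0.3 is illustrative)",
  "stages": {
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:precomputedLabelEmbeddings",
        "labels": {
          "type": "labels:fromDocuments",
          "documents": {
            "type": "documents:byQuery",
            "query": {
              "type": "query:string",
              "query": "clustering"
            }
          },
          "maxLabels": {
            "type": "labelCount:fixed",
            "value": 200
          }
        },
        "maxLabelsForSubIndex": 0.3
      }
    }
  }
}
```

With maxLabels fixed at 200, Lingo4G would create the temporary index only if those 200 labels make up at most 30% of all labels in the index.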

Consumers of vectors:​*

The following stages and components take vectors:​* as input:

Stage or component                 Property
matrix:knnVectorsSimilarity        • vectors
matrixRows:knnVectorsSimilarity    • rows
                                   • columns