matrixRows

Computes individual rows of a similarity matrix. Matrix row sources allow certain stages, such as the document contrast scorer, to perform calculations without materializing a large similarity matrix in the main memory.

Lingo4G offers two matrix row source implementations that compute similarities between the two sets of documents you provide. You can use the matrix row sources as inputs to two kinds of stages:

Both kinds of stages can process the input matrix row by row, which means there is no need to keep the whole, potentially very large, matrix in main memory. Some stages, however, such as embedding2d:lv, access matrix elements in random order. This makes them incompatible with matrixRows:*, so they only accept in-memory matrix:* inputs.
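The difference between row-by-row and random access can be illustrated with a minimal Python sketch. This is not Lingo4G code: `iter_rows`, `row_sums` and the dictionary-based sparse row are hypothetical names chosen for illustration only.

```python
from typing import Callable, Dict, Iterable, Iterator, List

SparseRow = Dict[int, float]  # column index -> similarity value


def iter_rows(n_rows: int, compute_row: Callable[[int], SparseRow]) -> Iterator[SparseRow]:
    """Hypothetical row source: computes and yields one row at a time."""
    for i in range(n_rows):
        yield compute_row(i)


def row_sums(source: Iterable[SparseRow]) -> List[float]:
    """A row-by-row consumer: it never needs more than one row at once."""
    return [sum(row.values()) for row in source]


# Each row is created on demand and can be discarded after it is consumed.
sums = row_sums(iter_rows(3, lambda i: {i: 0.5, i + 1: 0.25}))
```

A consumer that needs to read element (i, j) at arbitrary positions cannot work with such a source, because rows that have already been yielded are no longer available.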


You can use the following matrix​Rows:​* stage types in your analysis request JSONs:

matrix​Rows:​from​Matrix

Converts a full matrix into a source of rows. Mostly useful for debugging and testing.

matrix​Rows:​keyword​Document​Similarity

Computes keyword-based (More-Like-This) similarities between documents.

matrix​Rows:​knn​Vectors​Similarity

Computes similarities between documents based on multidimensional embeddings.


matrix​Rows:​reference

References a matrix​Rows:​* component defined in the request or in the project's default components.


matrix​Rows:​from​Matrix

Converts a full matrix into a source of rows. Mostly useful for debugging and testing.

{
  "type": "matrixRows:fromMatrix",
  "matrix": {
    "type": "matrix:reference",
    "auto": true
  }
}

matrix

Type
matrix
Default
{
  "type": "matrix:reference",
  "auto": true
}
Required
no

The matrix to convert into matrix​Rows:​*.

matrix​Rows:​keyword​Document​Similarity

Computes keyword-based (More-Like-This) similarities between documents.

{
  "type": "matrixRows:keywordDocumentSimilarity",
  "index": {
    "columns": {
      "documents": null
    },
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "maxColumnDocumentsForSubIndex": 0.3,
    "maxInMemorySubIndexSize": 8000000,
    "rows": {
      "documents": null,
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "fields": {
          "type": "featureFields:reference",
          "auto": true
        },
        "labelFilter": {
          "type": "labelFilter:reference",
          "auto": true
        },
        "labelListFilter": {
          "type": "labelListFilter:truncatedPhrases"
        },
        "minTf": 0,
        "minTfMass": 1,
        "tieResolution": "AUTO"
      },
      "maxQueryLabelsPerRowDocument": 10,
      "minQueryLabelsPerRowDocument": 0,
      "threads": "auto"
    },
    "threads": "auto"
  },
  "maxNeighbors": 10,
  "minQueryLabelsRequiredInColumnDocument": 1,
  "normalized": false,
  "threads": "auto"
}

To compute the keyword-based document similarity matrix rows, Lingo4G performs the following steps:

  1. For each row document, Lingo4G uses the label​Collector you provide to extract up to max​Query​Labels​Per​Row​Document labels that characterize the document.

    If the number of labels extracted from the row document is smaller than min​Query​Labels​Per​Row​Document, Lingo4G excludes the document from processing. The corresponding row in the similarity matrix will be empty.

  2. If the number of column documents divided by the total number of documents in the index is smaller than or equal to maxColumnDocumentsForSubIndex, Lingo4G creates a temporary inverted index for the column documents to speed up the building of the similarity matrix.

  3. For each row document, Lingo4G builds a search query consisting of labels extracted in step 1. Lingo4G restricts the query to find matches only among the column documents. Additionally, the query matches only documents that contain at least min​Query​Labels​Required​In​Column​Document of the document's labels obtained in step 1.

  4. For each row document, Lingo4G runs the corresponding query it built in step 3 to retrieve up to max​Neighbors matching column documents. Lingo4G uses up to threads to execute the queries in parallel.

  5. For each row document, Lingo4G builds the corresponding row of the similarity matrix using the matching documents retrieved in step 4.

    For example, assuming that the query corresponding to document at index 2 in the row documents array matched column documents at index 0 and 3, Lingo4G puts the following values into the similarity matrix:

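    $$
    M = \begin{bmatrix}
    \cdot & \cdot & \cdot & \cdot \\
    \cdot & \cdot & \cdot & \cdot \\
    s_0 & \cdot & \cdot & s_3 \\
    \cdot & \cdot & \cdot & \cdot
    \end{bmatrix}
    $$

    where $s_0$ and $s_3$ are the search scores obtained in step 4 for the document at index 2.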

    Note that the output matrix is rectangular: the number of rows is equal to the number of row documents and the number of columns is equal to the number of column documents on input.

  6. If normalized is true, Lingo4G normalizes the values in each row of matrix M to fall in the 0...1 range.
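Steps 1 and 3 to 5 can be sketched in Python as follows. This is a toy model under several assumptions, not the Lingo4G implementation: each document is a plain list of label strings, the number of shared labels stands in for the real search score, no temporary index is built, and everything runs on a single thread. The function and parameter names are hypothetical and only mirror the property names described above.

```python
from collections import Counter
from typing import Dict, List


def keyword_similarity_rows(
    row_docs: List[List[str]],
    column_docs: List[List[str]],
    max_query_labels: int = 10,
    min_query_labels: int = 0,
    min_labels_in_column: int = 1,
    max_neighbors: int = 10,
) -> List[Dict[int, float]]:
    rows = []
    for doc in row_docs:
        # Step 1: take the most frequent labels of the row document.
        query = [label for label, _ in Counter(doc).most_common(max_query_labels)]
        if len(query) < min_query_labels:
            rows.append({})  # excluded document: empty row
            continue
        # Step 3: match only column documents sharing enough query labels.
        hits = {}
        for col, col_doc in enumerate(column_docs):
            shared = len(set(query) & set(col_doc))
            if shared >= min_labels_in_column:
                hits[col] = float(shared)  # assumed stand-in for the search score
        # Step 4: keep the best-scoring matches, up to max_neighbors.
        best = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:max_neighbors]
        # Step 5: the matches form one sparse row of the matrix.
        rows.append(dict(best))
    return rows


rows = keyword_similarity_rows(
    [["a", "a", "b"], [], ["c"]],
    [["a", "b"], ["b"], ["x"], ["c", "a"]],
    min_query_labels=1,
)
```

In this sketch, the second row document has no labels, so its row stays empty, while the output still has one row per row document and column indices that refer to positions in the column documents list.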

index

Type
object
Default
{
  "rows": {
    "documents": null,
    "labelCollector": {
      "type": "labelCollector:topFromFeatureFields",
      "labelFilter": {
        "type": "labelFilter:reference",
        "auto": true
      },
      "labelListFilter": {
        "type": "labelListFilter:truncatedPhrases"
      },
      "fields": {
        "type": "featureFields:reference",
        "auto": true
      },
      "minTf": 0,
      "minTfMass": 1,
      "tieResolution": "AUTO"
    },
    "maxQueryLabelsPerRowDocument": 10,
    "minQueryLabelsPerRowDocument": 0,
    "threads": "auto"
  },
  "columns": {
    "documents": null
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "maxColumnDocumentsForSubIndex": 0.3,
  "maxInMemorySubIndexSize": 8000000,
  "threads": "auto"
}
Required
no

Configures the rows and columns of this similarity matrix rows source.

Additionally, this section configures the temporary inverted index Lingo4G may create to speed up the computation of the similarity matrix.

columns

Type
object
Default
{
  "documents": null
}
Required
no

Describes the columns of this similarity matrix rows source.

documents
Type
documents
Default
null
Required
yes

The documents to serve as columns for the similarity matrix rows computation.

Each document gives rise to one column in the output similarity matrix.

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

The feature fields to use when looking for similar column documents.

Lingo4G uses the feature fields you provide in this property to run the similar document search queries in step 4 of the matrix building algorithm.

max​Column​Documents​For​Sub​Index

Type
number
Default
0.3
Constraints
value >= 0 and value <= 1
Required
no

Determines the threshold for creating a temporary inverted index.

Lingo4G can significantly speed up the computation of keyword similarities for a small set of documents by creating and querying a temporary disposable inverted index containing just the column documents. Lingo4G creates the temporary index only when the number of column documents divided by the total number of documents in the index is smaller than or equal to the value of this property.

For example, if maxColumnDocumentsForSubIndex is 0.3 and the column documents set contains at most 30% of all documents in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
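The threshold check described above amounts to a single comparison. The following Python sketch shows it under the assumption that only the two document counts matter; `should_build_sub_index` is a hypothetical helper, not part of Lingo4G.

```python
def should_build_sub_index(column_docs: int, total_docs: int,
                           max_fraction: float = 0.3) -> bool:
    """Return True when the column documents are a small enough share of the index."""
    if total_docs <= 0:
        return False  # assumption: nothing to index, so no temporary index
    return column_docs / total_docs <= max_fraction


# 2,500 of 10,000 documents is 25%, which is below the 30% threshold.
small_set = should_build_sub_index(2_500, 10_000)
# 5,000 of 10,000 documents is 50%, which is above the threshold.
large_set = should_build_sub_index(5_000, 10_000)
```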

max​In​Memory​Sub​Index​Size

Type
integer
Default
8000000
Constraints
value >= 0
Required
no

Maximum size of the in-memory temporary index, in bytes.

If the size of the temporary index exceeds max​In​Memory​Sub​Index​Size, Lingo4G materializes the index on disk in a temporary directory.

rows

Type
object
Default
{
  "documents": null,
  "labelCollector": {
    "type": "labelCollector:topFromFeatureFields",
    "labelFilter": {
      "type": "labelFilter:reference",
      "auto": true
    },
    "labelListFilter": {
      "type": "labelListFilter:truncatedPhrases"
    },
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "minTf": 0,
    "minTfMass": 1,
    "tieResolution": "AUTO"
  },
  "maxQueryLabelsPerRowDocument": 10,
  "minQueryLabelsPerRowDocument": 0,
  "threads": "auto"
}
Required
no

Describes the rows of this similarity matrix rows source.

documents
Type
documents
Default
null
Required
yes

The documents to serve as rows for the similarity matrix rows computation.

Each document gives rise to one row of the output similarity matrix.

label​Collector
Type
labelCollector
Default
{
  "type": "labelCollector:topFromFeatureFields",
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "minTf": 0,
  "minTfMass": 1,
  "tieResolution": "AUTO"
}
Required
no

Determines which labels to use to build similar documents search queries.

Lingo4G uses the label collector you provide in step 1 of the matrix building algorithm to extract labels describing each row document.

The default label collector retrieves up to maxQueryLabelsPerRowDocument of the most frequent labels in each row document. Provide a custom label collector to modify this behavior.

max​Query​Labels​Per​Row​Document
Type
integer
Default
10
Constraints
value >= 0
Required
no

The maximum number of labels to use to build the similar documents search query.

Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each row document.

Increasing maxQueryLabelsPerRowDocument, coupled with a larger maxNeighbors value, produces broader, more general similarity.

Lingo4G ignores the maxQueryLabelsPerRowDocument property if you set a custom labelCollector.

min​Query​Labels​Per​Row​Document
Type
integer
Default
0
Constraints
value >= 0
Required
no

The minimum number of labels the row document must contain to be included in the similarity matrix computation.

Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each row document. If a row document contains fewer than min​Query​Labels​Per​Row​Document, Lingo4G excludes it from processing, leaving the corresponding row in the similarity matrix empty.

If you want to exclude documents containing just one label from further processing, increase minQueryLabelsPerRowDocument to at least 2 (the default value is 0).

If you increase minQueryLabelsPerRowDocument, make sure to set maxQueryLabelsPerRowDocument to a value greater than or equal to minQueryLabelsPerRowDocument.

Lingo4G ignores the min​Query​Labels​Per​Row​Document property if you set a custom label​Collector.

threads
Type
threads
Default
auto
Required
no

The number of concurrent threads to use to collect labels from row documents.

threads

Type
threads
Default
auto
Required
no

The number of concurrent threads to use to build the temporary inverted index.

max​Neighbors

Type
integer
Default
10
Constraints
value >= 0
Required
no

The maximum number of similar column documents to retrieve for each row document.

Lingo4G uses the max​Neighbors property in step 4 of the similarity matrix building algorithm to determine the maximum number of similar column documents to retrieve for each row of the similarity matrix. Therefore, this stage produces matrices whose rows contain at most max​Neighbors values.

min​Query​Labels​Required​In​Column​Document

Type
integer
Default
1
Constraints
value > 0
Required
no

The minimum number of common labels required for two documents to be treated as similar.

Lingo4G uses this property in step 3 of the similarity matrix building algorithm. If you increase min​Query​Labels​Required​In​Column​Document beyond the default value of 1, Lingo4G removes from the similarity matrix those document pairs that have fewer than min​Query​Labels​Required​In​Column​Document labels in common.

If you don't want to base document similarities on a single label shared between documents, increase min​Query​Labels​Required​In​Column​Document beyond the default value of 1.

normalized

Type
boolean
Default
false
Required
no

If true, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.

If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.
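The per-row normalization described above can be sketched in Python as follows. The sparse row is modelled as a dictionary from column index to score, and `normalize_row` is a hypothetical name. How empty rows or rows without positive scores are handled is an assumption of this sketch, not documented behavior.

```python
from typing import Dict


def normalize_row(row: Dict[int, float]) -> Dict[int, float]:
    """Divide every entry of a sparse row by the row's maximum value."""
    if not row:
        return {}  # assumption: empty rows stay empty
    peak = max(row.values())
    if peak <= 0:
        return dict(row)  # assumption: nothing to scale, avoid division by zero
    return {col: value / peak for col, value in row.items()}


normalized = normalize_row({0: 4.0, 3: 1.0})
```

After normalization, the largest entry in each non-empty row is 1, so scores become comparable across rows even when the raw search scores differ in magnitude.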

threads

Type
threads
Default
auto
Required
no

The number of concurrent threads to use to execute document similarity search queries.

matrix​Rows:​knn​Vectors​Similarity

Computes a label or document similarity matrix based on multidimensional vector distance.

{
  "type": "matrixRows:knnVectorsSimilarity",
  "maxNeighbors": 10,
  "threads": "auto",
  "vectors": {
    "columns": {
      "type": "vectors:reference",
      "auto": true
    },
    "rows": {
      "type": "vectors:reference",
      "auto": true
    }
  }
}

For each vector in the rows vector set, Lingo4G finds up to maxNeighbors closest vectors in the columns vector set and transfers the cosine similarities between the nearest vectors to the output matrix. Therefore, the output matrix is rectangular: the number of rows is equal to the number of vectors in the rows vector set and the number of columns is equal to the number of vectors in the columns vector set.

If you use vectors:​precomputed​Document​Embeddings as the input vector set, this stage computes similarities between documents. Similarly, if you provide vectors:​precomputed​Label​Embeddings on input, this stage computes similarities between labels. You can also mix label and document embeddings in computation of the same matrix to compute label-to-documents or documents-to-labels similarities.

In many cases, embedding-based similarities offer better results compared to their keyword-based counterparts.
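The computation this stage performs can be sketched with a brute-force nearest-neighbor search in Python. The exhaustive search is an assumption made for brevity; the source does not say which search method Lingo4G uses internally. The names `cosine` and `knn_similarity_rows` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_similarity_rows(rows: List[Sequence[float]],
                        columns: List[Sequence[float]],
                        max_neighbors: int = 10) -> List[Dict[int, float]]:
    """For each row vector, keep similarities to its closest column vectors."""
    result = []
    for row_vector in rows:
        scored = [(cosine(row_vector, col), j) for j, col in enumerate(columns)]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # most similar first
        result.append({j: score for score, j in scored[:max_neighbors]})
    return result


matrix_rows = knn_similarity_rows(
    rows=[[1.0, 0.0], [0.0, 1.0]],
    columns=[[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
    max_neighbors=2,
)
```

The output has one sparse row per vector in `rows`, and the keys of each row are indices into `columns`, which matches the rectangular shape described above.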

max​Neighbors

Type
integer
Default
10
Constraints
value >= 0
Required
no

The maximum number of nearest column vectors to find for each row vector.

The larger max​Neighbors, the denser the matrix.
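As a rough illustration of how maxNeighbors bounds the density of the matrix, the following Python sketch computes the largest possible fraction of filled entries in a row. `max_row_density` is a hypothetical helper used only for this arithmetic.

```python
def max_row_density(n_columns: int, max_neighbors: int) -> float:
    """Upper bound on the fraction of non-empty entries in any matrix row."""
    if n_columns <= 0:
        return 0.0
    return min(max_neighbors, n_columns) / n_columns


# With 1,000 column vectors and maxNeighbors = 10, at most 1% of a row is filled.
density = max_row_density(1_000, 10)
```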

threads

Type
threads
Default
auto
Required
no

The number of threads to use for the computation.

vectors

Type
object
Default
{
  "rows": {
    "type": "vectors:reference",
    "auto": true
  },
  "columns": {
    "type": "vectors:reference",
    "auto": true
  }
}
Required
no

Configures the multidimensional vector sets to use to compute the rows of this similarity matrix.

columns

Type
vectors
Default
{
  "type": "vectors:reference",
  "auto": true
}
Required
no

The multidimensional vector sets to use for the columns of this similarity matrix.

The number of columns of the output matrix is equal to the number of vectors in the vector set you provide in this property.

rows

Type
vectors
Default
{
  "type": "vectors:reference",
  "auto": true
}
Required
no

The multidimensional vector sets to use for the rows of this similarity matrix.

The number of rows of the output matrix is equal to the number of vectors in the vector set you provide in this property.

Consumers of matrix​Rows:​*

The following stages and components take matrix​Rows:​* as input:

Stage or component             Property
clusters:fromMatrixColumns     matrixRows
documents:contrastScore        matrixRows
documents:fromMatrixColumns    matrixRows
matrix:fromMatrixRows          matrixRows