matrixRows

Computes individual rows of a similarity matrix. Matrix row sources allow certain stages, such as the document contrast scorer, to perform calculations without materializing a large similarity matrix in the main memory.

Lingo4G offers two matrix row source implementations that compute similarities between the two sets of documents you provide. You can use the matrix row sources as inputs to two kinds of stages:

Both kinds of stages can process the input matrix row by row, which means there is no need to keep the whole, potentially very large, matrix in main memory. Some stages, such as embedding2d:lv, access matrix elements in random order, which makes them incompatible with matrixRows:*. Such stages only accept in-memory matrix:* inputs.


You can use the following matrix​Rows:​* stage types in your analysis request JSONs:

matrix​Rows:​by​Query

Builds and executes row-specific queries and takes the query results as the row's column values.

matrix​Rows:​composite

Aggregates multiple matrix row sources.

matrix​Rows:​from​Matrix

Converts a full matrix into a source of rows. Mostly useful for debugging and testing.

matrix​Rows:​keyword​Document​Similarity

Computes keyword-based (More-Like-This) similarities between documents.

matrix​Rows:​knn​Vectors​Similarity

Computes similarities between documents based on multidimensional embeddings.

matrix​Rows:​weighted

Applies additional value weighting to the provided matrix rows.


matrix​Rows:​reference

References a matrix​Rows:​* component defined in the request or in the project's default components.


matrix​Rows:​by​Query

For each row, builds a query specific to the row's document and takes query results as the row's column values.

{
  "type": "matrixRows:byQuery",
  "columns": {
    "type": "documents:reference",
    "auto": true
  },
  "maxNeighbors": 10,
  "normalized": true,
  "queryBuilder": {
    "type": "queryBuilder:reference",
    "auto": true
  },
  "rows": {
    "type": "documents:reference",
    "auto": true
  },
  "threads": "auto"
}

You can use this component to build similarity matrices based on the values of content fields of documents.

Note that depending on the number of rows on input and the complexity of the query, computing all rows of the matrix may take significant time.

The following request clusters and 2d maps arXiv papers based on their category (the set field).

{
  "name": "Clusters and 2d maps papers based on their category",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "similarityBySet": {
      "type": "matrix:fromMatrixRows",
      "matrixRows": {
        "type": "matrixRows:byQuery",
        "maxNeighbors": 2,
        "queryBuilder": {
          "type": "queryBuilder:string",
          "variables": [
            {
              "input": "set",
              "variable": "SET",
              "maxValues": 1,
              "quote": true
            }
          ],
          "query": "set:(<SET>)",
          "maxQueriesForDebugLogging": 2
        }
      }
    },
    "2dMapBySet": {
      "type": "embedding2d:lv"
    },
    "clustersBySet": {
      "type": "clusters:byValues",
      "values": {
        "type": "values:fromDocumentField",
        "fieldName": "set",
        "multipleValues": "COLLECT_FIRST"
      }
    }
  },
  "output": {
    "stages": [
      "documents",
      "2dMapBySet",
      "clustersBySet"
    ]
  }
}

Clusters and 2d-maps arXiv papers based on their category stored in the set content field.

The request combines matrixRows:byQuery with the queryBuilder:string component, which extracts the set field value from each document and runs the set:(<SET>) query to find other papers belonging to the same category.
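As a rough illustration of the per-document substitution described above, the following Python sketch builds one query string per document. It is not Lingo4G code: the build_query helper, the dictionary-based document representation, and the quoting rule are all assumptions made for this example.

```python
def build_query(document, template="set:(<SET>)", field="set",
                variable="SET", max_values=1, quote=True):
    """Fill a query template with values taken from one document field.

    Hypothetical helper: it only mimics the idea of a string query builder.
    """
    values = document.get(field, [])
    if isinstance(values, str):
        values = [values]
    values = values[:max_values]  # keep at most max_values field values
    if quote:
        # Assumed quoting rule: wrap each value in double quotes.
        values = ['"{}"'.format(v.replace('"', '\\"')) for v in values]
    return template.replace("<{}>".format(variable), " ".join(values))


papers = [
    {"id": 1, "set": ["cs"]},
    {"id": 2, "set": ["math", "stat"]},
]
queries = [build_query(paper) for paper in papers]
```

Each resulting string corresponds to one row of the matrix: the documents matched by the query become that row's non-empty columns.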

In practice, you may want to use matrixRows:composite to combine matrixRows:byQuery with keyword- or vector-based similarities, and then perform clustering or 2d mapping based on the composite similarity function.

columns

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents to serve as columns for the similarity matrix rows computation.

Each document gives rise to one column in the output similarity matrix.

max​Neighbors

Type
integer
Default
10
Constraints
value >= 0
Required
no

The maximum number of results to retrieve for each document-row-specific query.

Each row of the resulting similarity matrix will have at most max​Neighbors values.

normalized

Type
boolean
Default
true
Required
no

If true, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.

If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.
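The normalization rule can be sketched in a few lines of Python. The sparse-row representation (a dictionary mapping column index to value), the normalize_row name, and the handling of rows without a positive maximum are assumptions of this example, not part of Lingo4G.

```python
def normalize_row(row):
    """Divide every value in a sparse row by the row's maximum value."""
    if not row:
        return {}
    max_value = max(row.values())
    if max_value <= 0:
        # Nothing sensible to scale by; leaving the row as-is is an assumption.
        return dict(row)
    return {column: value / max_value for column, value in row.items()}


row = {0: 2.0, 3: 8.0, 7: 4.0}
normalized = normalize_row(row)
```

After normalization, the largest entry of each non-empty row equals 1, so rows computed from different queries become comparable in scale.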

query​Builder

Type
queryBuilder
Default
{
  "type": "queryBuilder:reference",
  "auto": true
}
Required
no

The query builder to use to build the search query for each document corresponding to one row of the similarity matrix.

rows

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents to serve as rows for the similarity matrix rows computation.

Each document gives rise to one row of the output similarity matrix.

threads

Type
threads
Default
auto
Required
no

The number of concurrent threads to use to perform document-row-specific search queries.

matrix​Rows:​composite

Aggregates multiple matrix row sources into one matrix row source.

{
  "type": "matrixRows:composite",
  "matrixRows": [],
  "normalized": true,
  "weightAggregation": "SUM"
}

All input matrices must have the same dimensions - they must have the same number of rows and columns.

The output matrix will have the same number of rows as the input matrices. Columns of the output matrix will be a union of the values provided by the input matrix rows, aggregated using the weight​Aggregation function.
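To make the union-and-aggregate behavior concrete, here is a minimal Python sketch of composing one row from two sources. The dictionary-based sparse rows and the compose_rows function are assumptions of this example. Only summation is shown, mirroring the default "SUM" setting, and the optional normalization step is left out.

```python
from collections import defaultdict


def compose_rows(rows_per_source, aggregate=sum):
    """Merge the same row from several sources into one sparse row.

    Each input row maps column index -> value. A column present in any input
    appears in the output; values sharing a column are combined with
    `aggregate`.
    """
    collected = defaultdict(list)
    for row in rows_per_source:
        for column, value in row.items():
            collected[column].append(value)
    return {column: aggregate(values)
            for column, values in sorted(collected.items())}


keyword_row = {0: 0.5, 2: 0.25}
vector_row = {2: 0.5, 5: 1.0}
combined = compose_rows([keyword_row, vector_row])
```

Column 2 appears in both inputs, so its values are added. Columns 0 and 5 appear in only one input and pass through unchanged.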

matrix​Rows

Type
array of matrixRows
Default
[]
Required
no

The matrix row sources to compose.

All row sources must produce matrices with equal dimensions.

normalized

Type
boolean
Default
true
Required
no

If true, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.

If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.

weight​Aggregation

Type
weightAggregation
Default
"SUM"
Required
no

The aggregation function to use to combine values present in more than one input matrix.

See weight​Aggregation for more information.

matrix​Rows:​from​Matrix

Converts a full matrix into a source of rows. Mostly useful for debugging and testing.

{
  "type": "matrixRows:fromMatrix",
  "matrix": {
    "type": "matrix:reference",
    "auto": true
  }
}

matrix

Type
matrix
Default
{
  "type": "matrix:reference",
  "auto": true
}
Required
no

The matrix to convert into matrix​Rows:​*.

matrix​Rows:​keyword​Document​Similarity

Computes keyword-based (More-Like-This) similarities between documents.

{
  "type": "matrixRows:keywordDocumentSimilarity",
  "index": {
    "columns": {
      "documents": null
    },
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "maxColumnDocumentsForSubIndex": 0.3,
    "maxInMemorySubIndexSize": 8000000,
    "rows": {
      "documents": null,
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "failIfEmbeddingsNotAvailable": true,
        "fields": {
          "type": "featureFields:reference",
          "auto": true
        },
        "labelFilter": {
          "type": "labelFilter:reference",
          "auto": true
        },
        "labelListFilter": {
          "type": "labelListFilter:truncatedPhrases"
        },
        "labelWeighting": "EMBEDDING",
        "minTf": 0,
        "minWeight": 0,
        "minWeightMass": 1,
        "tieResolution": "AUTO"
      },
      "maxQueryLabelsPerRowDocument": 10,
      "minQueryLabelsPerRowDocument": 0,
      "threads": "auto"
    },
    "threads": "auto"
  },
  "maxNeighbors": 10,
  "minQueryLabelsRequiredInColumnDocument": 1,
  "normalized": false,
  "threads": "auto"
}

To compute the keyword-based document similarity matrix rows, Lingo4G performs the following steps:

  1. For each row document, Lingo4G uses the label​Collector you provide to extract up to max​Query​Labels​Per​Row​Document labels that characterize the document.

    If the number of labels extracted from the row document is smaller than min​Query​Labels​Per​Row​Document, Lingo4G excludes the document from processing. The corresponding row in the similarity matrix will be empty.

  2. If the number of column documents in relation to the total number of documents in the index is less than or equal to maxColumnDocumentsForSubIndex, Lingo4G creates a temporary inverted index for the column documents to improve the performance of the similarity matrix building.

  3. For each row document, Lingo4G builds a search query consisting of labels extracted in step 1. Lingo4G restricts the query to find matches only among the column documents. Additionally, the query matches only documents that contain at least min​Query​Labels​Required​In​Column​Document of the document's labels obtained in step 1.

  4. For each row document, Lingo4G runs the corresponding query it built in step 3 to retrieve up to maxNeighbors matching column documents. Lingo4G executes the queries in parallel, using at most the number of threads set in the threads property.

  5. For each row document, Lingo4G builds the corresponding row of the similarity matrix using the matching documents retrieved in step 4.

    For example, assuming that the query corresponding to document at index 2 in the row documents array matched column documents at index 0 and 3, Lingo4G puts the following values into the similarity matrix:

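    M = \begin{bmatrix}
    \cdot & \cdot & \cdot & \cdot \\
    \cdot & \cdot & \cdot & \cdot \\
    s_0 & \cdot & \cdot & s_3 \\
    \cdot & \cdot & \cdot & \cdot
    \end{bmatrix}

    where s_0 and s_3 are the search scores obtained in step 4 for the document at index 2.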

    Note that the output matrix is rectangular: the number of rows is equal to the number of row documents and the number of columns is equal to the number of column documents on input.

  6. If normalized is true, Lingo4G normalizes the values in each row of matrix M to fall in the 0...1 range.
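Steps 3 to 6 can be sketched as a toy Python function operating on documents given as plain label lists. This is an illustration only: the function name and parameters are hypothetical, the label extraction of step 1 and the temporary index of step 2 are not modeled, and the score is simply the number of shared labels, whereas Lingo4G takes the scores from its search queries.

```python
def keyword_similarity_rows(row_labels, column_labels, max_neighbors=10,
                            min_labels_in_column=1, min_labels_per_row=0,
                            normalized=False):
    """Toy version of steps 3-6 for documents given as label lists.

    The score is the number of shared labels, which is an assumption of this
    sketch; a real search engine would return relevance scores instead.
    """
    matrix = []
    for labels in row_labels:
        row = {}
        if len(labels) >= min_labels_per_row:
            scores = {}
            for column, candidate in enumerate(column_labels):
                shared = len(set(labels) & set(candidate))
                if shared >= min_labels_in_column:
                    scores[column] = float(shared)
            # Keep the max_neighbors best-scoring column documents.
            best = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
            row = dict(best[:max_neighbors])
            if normalized and row:
                top = max(row.values())
                row = {column: value / top for column, value in row.items()}
        matrix.append(row)
    return matrix


rows = [["graph", "cluster"], ["neural", "network", "graph"], []]
columns = [["graph", "cluster", "index"], ["network"], ["graph", "neural"]]
result = keyword_similarity_rows(rows, columns, max_neighbors=2,
                                 min_labels_per_row=1, normalized=True)
```

The third row document has no labels, so its row stays empty, which matches the behavior described for documents with too few labels.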

index

Type
object
Default
{
  "rows": {
    "documents": null,
    "labelCollector": {
      "type": "labelCollector:topFromFeatureFields",
      "labelFilter": {
        "type": "labelFilter:reference",
        "auto": true
      },
      "labelListFilter": {
        "type": "labelListFilter:truncatedPhrases"
      },
      "fields": {
        "type": "featureFields:reference",
        "auto": true
      },
      "minTf": 0,
      "minWeight": 0,
      "minWeightMass": 1,
      "tieResolution": "AUTO",
      "labelWeighting": "EMBEDDING",
      "failIfEmbeddingsNotAvailable": true
    },
    "maxQueryLabelsPerRowDocument": 10,
    "minQueryLabelsPerRowDocument": 0,
    "threads": "auto"
  },
  "columns": {
    "documents": null
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "maxColumnDocumentsForSubIndex": 0.3,
  "maxInMemorySubIndexSize": 8000000,
  "threads": "auto"
}
Required
no

Configures the rows and columns of this similarity matrix rows source.

Additionally, this section configures the temporary inverted index Lingo4G may create to speed up the computation of the similarity matrix.

columns

Type
object
Default
{
  "documents": null
}
Required
no

Describes the columns of this similarity matrix rows source.

documents
Type
documents
Default
null
Required
yes

The documents to serve as columns for the similarity matrix rows computation.

Each document gives rise to one column in the output similarity matrix.

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

The feature fields to use when looking for similar column documents.

Lingo4G uses the feature fields you provide in this property to run the similar document search queries in step 4 of the matrix building algorithm.

max​Column​Documents​For​Sub​Index

Type
number
Default
0.3
Constraints
value >= 0 and value <= 1
Required
no

Determines the threshold for creating a temporary inverted index.

Lingo4G can significantly speed up the computation of keyword similarities for a small set of documents by creating and querying a temporary disposable inverted index containing just the column documents. Lingo4G creates the temporary index only when the number of column documents divided by the total number of documents in the index is less than or equal to the value of this property.

For example, if maxColumnDocumentsForSubIndex is 0.3 and the column document set contains no more than 30% of all documents in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
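The threshold check amounts to a single comparison. The following Python sketch shows it under the assumption that only document counts matter; the use_temporary_index function is hypothetical and not part of Lingo4G.

```python
def use_temporary_index(column_count, total_count, max_fraction=0.3):
    """Return True when the column documents make up a small enough share of
    the index (fraction <= threshold) for a temporary index to be created."""
    if total_count <= 0:
        return False
    return column_count / total_count <= max_fraction


small_subset = use_temporary_index(2_000, 10_000)   # 20% of the index
large_subset = use_temporary_index(5_000, 10_000)   # 50% of the index
```

With the default threshold of 0.3, the first call selects the temporary index and the second one does not.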

max​In​Memory​Sub​Index​Size

Type
integer
Default
8000000
Constraints
value >= 0
Required
no

Maximum size of the in-memory temporary index, in bytes.

If the size of the temporary index exceeds max​In​Memory​Sub​Index​Size, Lingo4G materializes the index on disk in a temporary directory.

rows

Type
object
Default
{
  "documents": null,
  "labelCollector": {
    "type": "labelCollector:topFromFeatureFields",
    "labelFilter": {
      "type": "labelFilter:reference",
      "auto": true
    },
    "labelListFilter": {
      "type": "labelListFilter:truncatedPhrases"
    },
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "minTf": 0,
    "minWeight": 0,
    "minWeightMass": 1,
    "tieResolution": "AUTO",
    "labelWeighting": "EMBEDDING",
    "failIfEmbeddingsNotAvailable": true
  },
  "maxQueryLabelsPerRowDocument": 10,
  "minQueryLabelsPerRowDocument": 0,
  "threads": "auto"
}
Required
no

Describes the rows of this similarity matrix rows source.

documents
Type
documents
Default
null
Required
yes

The documents to serve as rows for the similarity matrix rows computation.

Each document gives rise to one row of the output similarity matrix.

label​Collector
Type
labelCollector
Default
{
  "type": "labelCollector:topFromFeatureFields",
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "minTf": 0,
  "minWeight": 0,
  "minWeightMass": 1,
  "tieResolution": "AUTO",
  "labelWeighting": "EMBEDDING",
  "failIfEmbeddingsNotAvailable": true
}
Required
no

Determines which labels to use to build similar documents search queries.

Lingo4G uses the label collector you provide in step 1 of the matrix building algorithm to extract labels describing each row document.

The default label collector retrieves up to maxQueryLabelsPerRowDocument of the most frequent labels in each row document. Provide a custom label collector to modify this behavior.

max​Query​Labels​Per​Row​Document
Type
integer
Default
10
Constraints
value >= 0
Required
no

The maximum number of labels to use to build the similar documents search query.

Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each row document.

Increasing maxQueryLabelsPerRowDocument, coupled with a larger maxNeighbors value, produces broader, more general similarity.

Lingo4G ignores the maxQueryLabelsPerRowDocument property if you set a custom labelCollector.

min​Query​Labels​Per​Row​Document
Type
integer
Default
0
Constraints
value >= 0
Required
no

The minimum number of labels the row document must contain to be included in the similarity matrix computation.

Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each row document. If a row document contains fewer than minQueryLabelsPerRowDocument labels, Lingo4G excludes it from processing, leaving the corresponding row in the similarity matrix empty.

If you want to exclude from further processing documents containing just one label, increase minQueryLabelsPerRowDocument from the default value of 0 to at least 2.

If you increase minQueryLabelsPerRowDocument, make sure to set maxQueryLabelsPerRowDocument to a value equal to or greater than minQueryLabelsPerRowDocument.

Lingo4G ignores the min​Query​Labels​Per​Row​Document property if you set a custom label​Collector.

threads
Type
threads
Default
auto
Required
no

The number of concurrent threads to use to collect labels from row documents.

threads

Type
threads
Default
auto
Required
no

The number of concurrent threads to use to build the temporary inverted index.

max​Neighbors

Type
integer
Default
10
Constraints
value >= 0
Required
no

The maximum number of similar column documents to retrieve for each row document.

Lingo4G uses the max​Neighbors property in step 4 of the similarity matrix building algorithm to determine the maximum number of similar column documents to retrieve for each row of the similarity matrix. Therefore, this stage produces matrices whose rows contain at most max​Neighbors values.

min​Query​Labels​Required​In​Column​Document

Type
integer
Default
1
Constraints
value > 0
Required
no

The minimum number of common labels required for two documents to be treated as similar.

Lingo4G uses this property in step 3 of the similarity matrix building algorithm. If you increase min​Query​Labels​Required​In​Column​Document beyond the default value of 1, Lingo4G removes from the similarity matrix those document pairs that have fewer than min​Query​Labels​Required​In​Column​Document labels in common.

If you don't want to base document similarities on a single label shared between documents, increase min​Query​Labels​Required​In​Column​Document beyond the default value of 1.

normalized

Type
boolean
Default
false
Required
no

If true, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.

If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.

threads

Type
threads
Default
auto
Required
no

The number of concurrent threads to use to execute document similarity search queries.

matrix​Rows:​knn​Vectors​Similarity

Computes a label or document similarity matrix based on multidimensional vector distance.

{
  "type": "matrixRows:knnVectorsSimilarity",
  "maxNeighbors": 10,
  "minSimilarity": 0,
  "threads": "auto",
  "vectors": {
    "columns": {
      "type": "vectors:reference",
      "auto": true
    },
    "rows": {
      "type": "vectors:reference",
      "auto": true
    }
  }
}

For each vector in the rows vector set, Lingo4G finds up to max​Neighbors closest vectors in the columns vector set and transfers the cosine similarities between the nearest vectors to the output matrix. Therefore, the output matrix is rectangular: the number of rows is equal to the number of vectors in the rows set, the number of columns is equal to the number of vectors in the columns vector set.

If you use vectors:​precomputed​Document​Embeddings as the input vector set, this stage computes similarities between documents. Similarly, if you provide vectors:​precomputed​Label​Embeddings on input, this stage computes similarities between labels. You can also mix label and document embeddings in computation of the same matrix to compute label-to-documents or documents-to-labels similarities.

In many cases, embedding-based similarities offer better results compared to their keyword-based counterparts.
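The following Python sketch illustrates the idea with a brute-force nearest-neighbor search over cosine similarities. It is a conceptual example under stated assumptions: the cosine and knn_rows functions are hypothetical, vectors are plain lists of floats, and an exhaustive scan stands in for whatever search structure Lingo4G uses internally. The max_neighbors and min_similarity parameters mirror the properties of this stage.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_rows(row_vectors, column_vectors, max_neighbors=10, min_similarity=0.0):
    """Brute-force k-nearest-neighbors; returns one sparse row per row vector."""
    matrix = []
    for vector in row_vectors:
        scored = [(cosine(vector, other), column)
                  for column, other in enumerate(column_vectors)]
        # Most similar first; ties are broken by column index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        matrix.append({column: similarity
                       for similarity, column in scored[:max_neighbors]
                       if similarity >= min_similarity})
    return matrix


rows = [[1.0, 0.0], [0.0, 2.0]]
columns = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
result = knn_rows(rows, columns, max_neighbors=2, min_similarity=0.5)
```

The output has one row per row vector and refers to columns by their index in the column vector set, so the matrix is rectangular whenever the two sets differ in size.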

max​Neighbors

Type
integer
Default
10
Constraints
value >= 0
Required
no

The maximum number of nearest column vectors to find for each row vector.

The larger max​Neighbors, the denser the matrix.

min​Similarity

Type
number
Default
0
Required
no

Specifies the minimum allowed similarity value.

The similarity matrix rows will exclude values lower than min​Similarity.

During a k-nearest-neighbors search on embedding vectors, the result usually includes exactly maxNeighbors matches, even if some similarity scores are very low. You can use the minSimilarity threshold to exclude low-scoring matches.

Setting min​Similarity to a value higher than 0 may result in some empty rows in the similarity matrix. Consequently, when using such a matrix for clustering or 2d mapping, some documents may remain unclustered or without 2d embedding coordinates.

threads

Type
threads
Default
auto
Required
no

The number of threads to use for the computation.

vectors

Type
object
Default
{
  "rows": {
    "type": "vectors:reference",
    "auto": true
  },
  "columns": {
    "type": "vectors:reference",
    "auto": true
  }
}
Required
no

Configures the multidimensional vector sets to use when computing the rows of this similarity matrix.

columns

Type
vectors
Default
{
  "type": "vectors:reference",
  "auto": true
}
Required
no

The multidimensional vector set to use for the columns of this similarity matrix.

The number of columns of the output matrix is equal to the number of vectors in the vector set you provide in this property.

rows

Type
vectors
Default
{
  "type": "vectors:reference",
  "auto": true
}
Required
no

The multidimensional vector set to use for the rows of this similarity matrix.

The number of rows of the output matrix is equal to the number of vectors in the vector set you provide in this property.

matrix​Rows:​weighted

Applies additional value weighting to the provided matrix rows.

{
  "type": "matrixRows:weighted",
  "matrixRows": {
    "type": "matrixRows:reference",
    "auto": true
  },
  "offset": 0,
  "weight": 1
}

Lingo4G computes each value of the output matrix using the following formula:

\text{value}_{out} = \text{value}_{in} \cdot \text{weight} + \text{offset}

where:

weight is the value of the weight property of this stage,
offset is the value of the offset property of this stage.

This stage is most useful in combination with matrix​Rows:​composite, where you can use it to balance the weights of the matrices being composed.
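A minimal Python sketch of the weighting formula, followed by a summed composition, may help to see how the two stages interact. The weight_row function and the dictionary-based sparse rows are assumptions of this example. The sketch applies the formula only to values stored in the row, which is also an assumption.

```python
def weight_row(row, weight=1.0, offset=0.0):
    """Apply value_out = value_in * weight + offset to each stored value."""
    return {column: value * weight + offset for column, value in row.items()}


keyword_row = {0: 0.5, 2: 1.0}
vector_row = {2: 0.5, 5: 1.0}

# Give the vector-based similarities twice the influence before summing.
boosted = weight_row(vector_row, weight=2.0)
combined = {column: keyword_row.get(column, 0.0) + boosted.get(column, 0.0)
            for column in sorted(set(keyword_row) | set(boosted))}
```

Without the weighting step, both sources would contribute equally to the sum. Doubling one of them shifts the composite similarity toward that source.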

matrix​Rows

Type
matrixRows
Default
{
  "type": "matrixRows:reference",
  "auto": true
}
Required
no

The matrix rows to which to apply additional weighting.

offset

Type
number
Default
0
Required
no

The offset to add to each value of the input matrix.

weight

Type
number
Default
1
Required
no

The weight by which to multiply each value of the input matrix.

Consumers of matrix​Rows:​*

The following stages and components take matrix​Rows:​* as input:

Stage or component              Property
clusters:fromMatrixColumns      matrixRows
documents:contrastScore         matrixRows
documents:fromMatrixColumns     matrixRows
matrix:fromMatrixRows           matrixRows
matrixRows:composite            matrixRows
matrixRows:weighted             matrixRows