matrixRows
Computes individual rows of a similarity matrix. Matrix row sources allow certain stages, such as the document contrast scorer, to perform calculations without materializing a large similarity matrix in the main memory.
Lingo4G offers two matrix row source implementations that compute similarities between the two sets of documents you provide. You can use the matrix row sources as inputs to two kinds of stages:
- documents:contrastScore, which determines how novel a document is with respect to other documents that precede and succeed the document in time.
- documents:fromMatrixColumns and clusters:fromMatrixColumns, which collect and aggregate values from the columns of the input matrix. These stages are useful to select top-scoring documents where the score is an aggregation of a number of values.
Both kinds of stages can process the input matrix row by row, which means there is no need to keep the whole, potentially very large, matrix in main memory. Some stages, however, such as embedding2d:lv, access matrix elements in random order, which makes them incompatible with matrixRows:*. Such stages accept only in-memory matrix:* inputs.
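The difference between row-by-row and random access can be illustrated with a short Python sketch. This is not Lingo4G code: row_source and column_sums are hypothetical names, and the sparse-row representation is an assumption made for illustration. The consumer aggregates column values while holding only one row at a time.

```python
from typing import Dict, Iterator, List

SparseRow = Dict[int, float]  # column index -> similarity value


def row_source(matrix: List[List[float]]) -> Iterator[SparseRow]:
    """Yield one sparse row at a time (hypothetical stand-in for a matrix row source)."""
    for row in matrix:
        yield {col: value for col, value in enumerate(row) if value != 0.0}


def column_sums(rows: Iterator[SparseRow], column_count: int) -> List[float]:
    """Aggregate values per column, visiting each row exactly once."""
    sums = [0.0] * column_count
    for row in rows:
        for col, value in row.items():
            sums[col] += value
    return sums


full = [
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 0.25],
    [0.0, 0.5, 0.75],
]
totals = column_sums(row_source(full), column_count=3)
```

A consumer that needs to look up arbitrary elements, for example row 2 followed by row 0, cannot work with such a one-pass iterator, which is why it needs the whole matrix in memory.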
You can use the following matrixRows:* stage types in your analysis request JSONs:
- matrixRows:fromMatrix
  Converts a full matrix into a source of rows. Mostly useful for debugging and testing.
- matrixRows:keywordDocumentSimilarity
  Computes keyword-based (More-Like-This) similarities between documents.
- matrixRows:knnVectorsSimilarity
  Computes similarities between documents based on multidimensional embeddings.
- matrixRows:reference
  References a matrixRows:* component defined in the request or in the project's default components.
matrix​Rows:​from​Matrix
Converts a full matrix into a source of rows. Mostly useful for debugging and testing.
{
  "type": "matrixRows:fromMatrix",
  "matrix": {
    "type": "matrix:reference",
    "auto": true
  }
}
matrix
The matrix to convert into matrixRows:*.
matrix​Rows:​keyword​Document​Similarity
Computes keyword-based (More-Like-This) similarities between documents.
{
  "type": "matrixRows:keywordDocumentSimilarity",
  "index": {
    "columns": {
      "documents": null
    },
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "maxColumnDocumentsForSubIndex": 0.3,
    "maxInMemorySubIndexSize": 8000000,
    "rows": {
      "documents": null,
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "fields": {
          "type": "featureFields:reference",
          "auto": true
        },
        "labelFilter": {
          "type": "labelFilter:reference",
          "auto": true
        },
        "labelListFilter": {
          "type": "labelListFilter:truncatedPhrases"
        },
        "minTf": 0,
        "minTfMass": 1,
        "tieResolution": "AUTO"
      },
      "maxQueryLabelsPerRowDocument": 10,
      "minQueryLabelsPerRowDocument": 0,
      "threads": "auto"
    },
    "threads": "auto"
  },
  "maxNeighbors": 10,
  "minQueryLabelsRequiredInColumnDocument": 1,
  "normalized": false,
  "threads": "auto"
}
To compute the keyword-based document similarity matrix rows, Lingo4G performs the following steps:
1. For each row document, Lingo4G uses the labelCollector you provide to extract up to maxQueryLabelsPerRowDocument labels that characterize the document. If the number of labels extracted from the row document is smaller than minQueryLabelsPerRowDocument, Lingo4G excludes the document from processing. The corresponding row in the similarity matrix will be empty.
2. If the number of column documents divided by the total number of documents in the index is smaller than or equal to maxColumnDocumentsForSubIndex, Lingo4G creates a temporary inverted index for the column documents to speed up the building of the similarity matrix.
3. For each row document, Lingo4G builds a search query consisting of the labels extracted in step 1. Lingo4G restricts the query to find matches only among the column documents. Additionally, the query matches only documents that contain at least minQueryLabelsRequiredInColumnDocument of the document's labels obtained in step 1.
4. For each row document, Lingo4G runs the query it built in step 3 to retrieve up to maxNeighbors matching column documents. Lingo4G uses up to threads concurrent threads to execute the queries in parallel.
5. For each row document, Lingo4G builds the corresponding row of the similarity matrix using the matching documents retrieved in step 4.
   For example, assuming that the query corresponding to the document at index 2 in the row documents array matched column documents at index 0 and 3, Lingo4G puts the following values into the similarity matrix: $m_{2,0} = s_0$ and $m_{2,3} = s_3$, where $s_0$ and $s_3$ are the search scores obtained in step 4 for the document at index 2.
   Note that the output matrix is rectangular: the number of rows is equal to the number of row documents and the number of columns is equal to the number of column documents on input.
6. If normalized is true, Lingo4G normalizes the values in each row of the matrix to fall in the 0...1 range.
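The label collection, querying and row building steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Lingo4G's implementation: documents are plain lists of labels, top_labels stands in for the label collector, and the score is simply the number of shared query labels, whereas Lingo4G obtains scores from its search queries. All function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def top_labels(doc: List[str], max_labels: int) -> List[str]:
    """Step 1 (toy label collector): the most frequent labels of a document."""
    return [label for label, _ in Counter(doc).most_common(max_labels)]


def similarity_rows(
    row_docs: List[List[str]],
    column_docs: List[List[str]],
    max_query_labels: int = 10,
    min_query_labels: int = 0,
    max_neighbors: int = 10,
    min_labels_in_column: int = 1,
) -> List[Dict[int, float]]:
    rows: List[Dict[int, float]] = []
    for doc in row_docs:
        labels = top_labels(doc, max_query_labels)
        if len(labels) < min_query_labels:
            rows.append({})  # excluded document: its row stays empty
            continue
        # Steps 3 and 4: match only column documents sharing enough labels.
        hits = []
        for col, column_doc in enumerate(column_docs):
            present = set(column_doc)
            common = sum(1 for label in labels if label in present)
            if common >= min_labels_in_column:
                hits.append((col, float(common)))  # toy score, an assumption
        hits.sort(key=lambda hit: (-hit[1], hit[0]))
        # Step 5: keep at most max_neighbors values in the sparse row.
        rows.append(dict(hits[:max_neighbors]))
    return rows


column_docs = [["solar", "energy"], ["wind", "turbine"], ["coal"], ["solar", "panel"]]
row_docs = [["wind"], ["coal", "mine"], ["solar", "panel", "solar"]]
rows = similarity_rows(row_docs, column_docs)
```

In this example, the row document at index 2 matches the column documents at index 0 and 3, so rows[2] holds values only for columns 0 and 3. The result has one row per row document and column indices that range over the column documents.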
index
Configures the rows and columns of this similarity matrix rows source.
Additionally, this section configures the temporary inverted index Lingo4G may create to speed up the computation of the similarity matrix.
columns
Describes the columns of this similarity matrix rows source.
documents
The documents to serve as columns for the similarity matrix rows computation.
Each document gives rise to one column in the output similarity matrix.
fields
The feature fields to use when looking for similar column documents.
Lingo4G uses the feature fields you provide in this property to run the similar document search queries in step 4 of the matrix building algorithm.
max​Column​Documents​For​Sub​Index
Determines the threshold for creating a temporary inverted index.
Lingo4G can significantly speed up the computation of keyword similarities for a small set of documents by creating and querying a temporary, disposable inverted index containing just the column documents. Lingo4G creates the temporary index only when the number of column documents divided by the total number of documents in the index is smaller than or equal to the value of this property.
For example, if maxColumnDocumentsForSubIndex is 0.3 and the column documents set contains no more than 30% of all documents in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
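The threshold rule described above amounts to a single comparison. The following Python sketch only restates that rule; the function name and its parameters are hypothetical and not part of the Lingo4G API.

```python
def use_temporary_index(
    column_document_count: int,
    index_document_count: int,
    max_column_documents_for_sub_index: float = 0.3,
) -> bool:
    """Return True when the column documents are a small enough share of the index."""
    if index_document_count <= 0:
        raise ValueError("index_document_count must be positive")
    ratio = column_document_count / index_document_count
    return ratio <= max_column_documents_for_sub_index


small_subset = use_temporary_index(250, 1000)  # 25% of the index
large_subset = use_temporary_index(500, 1000)  # 50% of the index
```

With the threshold of 0.3, the 25% subset qualifies for a temporary index and the 50% subset does not.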
max​In​Memory​Sub​Index​Size
Maximum size of the in-memory temporary index, in bytes.
If the size of the temporary index exceeds maxInMemorySubIndexSize, Lingo4G materializes the index on disk in a temporary directory.
rows
Describes the rows of this similarity matrix rows source.
documents
The documents to serve as rows for the similarity matrix rows computation.
Each document gives rise to one row of the output similarity matrix.
label​Collector
Determines which labels to use to build similar documents search queries.
Lingo4G uses the label collector you provide in step 1 of the matrix building algorithm to extract labels describing each row document.
The default label collector retrieves up to maxQueryLabelsPerRowDocument of the most frequent labels in each row document. Provide a custom label collector to modify this behavior.
max​Query​Labels​Per​Row​Document
The maximum number of labels to use to build the similar documents search query.
Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each row document.
Increasing maxQueryLabelsPerRowDocument, coupled with a larger maxNeighbors value, produces broader, more general similarity.
Lingo4G ignores the maxQueryLabelsPerRowDocument property if you set a custom labelCollector.
min​Query​Labels​Per​Row​Document
The minimum number of labels the row document must contain to be included in the similarity matrix computation.
Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each row document. If a row document contains fewer than minQueryLabelsPerRowDocument labels, Lingo4G excludes it from processing, leaving the corresponding row in the similarity matrix empty.
If you want to exclude from further processing documents containing just one label, set minQueryLabelsPerRowDocument to 2 or more.
If you increase minQueryLabelsPerRowDocument, make sure to set maxQueryLabelsPerRowDocument to a value equal to or greater than minQueryLabelsPerRowDocument.
Lingo4G ignores the minQueryLabelsPerRowDocument property if you set a custom labelCollector.
threads
The number of concurrent threads to use to collect labels from row documents.
threads
The number of concurrent threads to use to build the temporary inverted index.
max​Neighbors
The maximum number of similar column documents to retrieve for each row document.
Lingo4G uses the maxNeighbors property in step 4 of the similarity matrix building algorithm to determine the maximum number of similar column documents to retrieve for each row of the similarity matrix. Therefore, this stage produces matrices whose rows contain at most maxNeighbors values.
min​Query​Labels​Required​In​Column​Document
The minimum number of common labels required for two documents to be treated as similar.
Lingo4G uses this property in step 3 of the similarity matrix building algorithm. If you increase minQueryLabelsRequiredInColumnDocument beyond the default value of 1, Lingo4G removes from the similarity matrix those document pairs that have fewer than minQueryLabelsRequiredInColumnDocument labels in common.
If you don't want to base document similarities on a single label shared between documents, increase minQueryLabelsRequiredInColumnDocument beyond the default value of 1.
normalized
If true, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.
If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.
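The normalization rule can be written down directly. The sketch below is illustrative Python rather than Lingo4G code; normalize_row is a hypothetical name, and leaving empty rows and rows without a positive maximum unchanged is an assumption the description does not cover.

```python
from typing import Dict


def normalize_row(row: Dict[int, float]) -> Dict[int, float]:
    """Divide every entry of a sparse row by the row's maximum value."""
    if not row:
        return {}  # assumption: an empty row stays empty
    largest = max(row.values())
    if largest <= 0.0:
        return dict(row)  # assumption: nothing to scale without a positive maximum
    return {col: value / largest for col, value in row.items()}


normalized = normalize_row({0: 2.0, 3: 8.0, 5: 4.0})
```

After normalization, the largest entry of each non-empty row equals 1, and the remaining entries keep their proportions relative to it.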
threads
The number of concurrent threads to use to execute document similarity search queries.
matrix​Rows:​knn​Vectors​Similarity
Computes a label or document similarity matrix based on multidimensional vector distance.
{
  "type": "matrixRows:knnVectorsSimilarity",
  "maxNeighbors": 10,
  "threads": "auto",
  "vectors": {
    "columns": {
      "type": "vectors:reference",
      "auto": true
    },
    "rows": {
      "type": "vectors:reference",
      "auto": true
    }
  }
}
For each vector in the rows vector set, Lingo4G finds up to maxNeighbors closest vectors in the columns vector set and transfers the cosine similarities between the nearest vectors to the output matrix. Therefore, the output matrix is rectangular: the number of rows is equal to the number of vectors in the rows vector set and the number of columns is equal to the number of vectors in the columns vector set.
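The nearest-neighbor lookup described above can be sketched with a brute-force search in plain Python. This only illustrates the shape of the output: the exhaustive scan, the function names and the tie-breaking by column index are assumptions, and the search method Lingo4G uses internally is not described here.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_rows(
    rows: List[Sequence[float]],
    columns: List[Sequence[float]],
    max_neighbors: int,
) -> List[Dict[int, float]]:
    """For each row vector, keep the max_neighbors most similar column vectors."""
    result = []
    for row_vector in rows:
        scored = [(col, cosine(row_vector, vec)) for col, vec in enumerate(columns)]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result.append(dict(scored[:max_neighbors]))
    return result


row_vectors = [[1.0, 0.0], [0.0, 2.0]]
column_vectors = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similarities = knn_rows(row_vectors, column_vectors, max_neighbors=2)
```

Each output row holds at most max_neighbors values, so raising max_neighbors makes the matrix denser.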
If you use vectors:precomputedDocumentEmbeddings as the input vector set, this stage computes similarities between documents. Similarly, if you provide vectors:precomputedLabelEmbeddings on input, this stage computes similarities between labels. You can also mix label and document embeddings in the computation of the same matrix to compute label-to-documents or documents-to-labels similarities.
In many cases, embedding-based similarities offer better results compared to their keyword-based counterparts.
max​Neighbors
The maximum number of nearest column vectors to find for each row vector.
The larger maxNeighbors, the denser the matrix.
threads
The number of threads to use for the computation.
vectors
Configures the multidimensional vector sets to use for the computation of the similarity matrix rows.
columns
The multidimensional vector set to use for the columns of this similarity matrix.
The number of columns of the output matrix is equal to the number of vectors in the vector set you provide in this property.
rows
The multidimensional vector set to use for the rows of this similarity matrix.
The number of rows of the output matrix is equal to the number of vectors in the vector set you provide in this property.
Consumers of matrixRows:*
The following stages and components take matrixRows:* as input: