matrixRows
Computes individual rows of a similarity matrix. Matrix row sources allow certain stages, such as the document contrast scorer, to perform calculations without materializing a large similarity matrix in the main memory.
Lingo4G offers two matrix row source implementations that compute similarities between the two sets of documents you provide. You can use the matrix row sources as inputs to two kinds of stages:
-
documents:contrastScore, which determines how novel a document is with respect to other documents that precede and succeed the document in time.
-
documents:fromMatrixColumns and clusters:fromMatrixColumns, which collect and aggregate values from the columns of the input matrix. These stages are useful to select top-scoring documents where the score is an aggregation of a number of values.
Both kinds of stages can process the input matrix row by row, which means there is no need to keep the whole, potentially very large, matrix in the main memory. Some stages, such as embedding2d:lv, access matrix elements in random order, which makes them incompatible with matrixRows:*. Such stages accept only in-memory matrix:* inputs.
You can use the following matrix​Rows:​*
stage types in your analysis request JSONs:
-
matrix​Rows:​by​Query
-
Builds and executes row-specific queries and takes the query results as the row's column values.
-
matrix​Rows:​composite
-
Aggregates multiple matrix row sources.
-
matrix​Rows:​from​Matrix
-
Converts a full matrix into a source of rows. Mostly useful for debugging and testing.
-
matrix​Rows:​keyword​Document​Similarity
-
Computes keyword-based (More-Like-This) similarities between documents.
-
matrix​Rows:​knn​Vectors​Similarity
-
Computes similarities between documents based on multidimensional embeddings.
-
matrix​Rows:​weighted
-
Applies additional value weighting to the provided matrix rows.
-
matrixRows:reference
-
References a
matrix​Rows:​*
component defined in the request or in the project's default components.
matrix​Rows:​by​Query
For each row, builds a query specific to the row's document and takes query results as the row's column values.
{
"type": "matrixRows:byQuery",
"columns": {
"type": "documents:reference",
"auto": true
},
"maxNeighbors": 10,
"normalized": true,
"queryBuilder": {
"type": "queryBuilder:reference",
"auto": true
},
"rows": {
"type": "documents:reference",
"auto": true
},
"threads": "auto"
}
You can use this component to build similarity matrices based on the values of content fields of documents.
Note that depending on the number of rows on input and the complexity of the query, computing all rows of the matrix may take significant time.
The following request clusters and 2d maps arXiv papers based on their category (the
set
field).
The request combines matrix​Rows:​by​Query
with the
query​Builder:​string
component, which from each document extracts its set
field value and runs the
set:​<​S​E​T>
query to get other papers belonging to the same category.
In practice, you may want to use matrixRows:composite to combine matrixRows:byQuery with keyword- or vector-based similarities to perform clustering or 2d mapping based on the composite similarity function.
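As a rough illustration of such a combination, the sketch below assembles a matrixRows:composite fragment in Python and prints it as JSON. It uses only the component types and properties shown on this page. The choice to omit rows, columns, queryBuilder and vectors and rely on their defaults is an assumption, and the fragment has not been validated against a running Lingo4G instance.

```python
import json

# Query-based row source; omitted properties are assumed to fall back to defaults.
by_query = {
    "type": "matrixRows:byQuery",
    "maxNeighbors": 10,
    "normalized": True,
}

# Vector-based row source, again relying on default vector references.
by_vectors = {
    "type": "matrixRows:knnVectorsSimilarity",
    "maxNeighbors": 10,
    "minSimilarity": 0,
}

# Composite source that aggregates both inputs with the SUM function.
composite = {
    "type": "matrixRows:composite",
    "matrixRows": [by_query, by_vectors],
    "normalized": True,
    "weightAggregation": "SUM",
}

request_fragment = json.dumps(composite, indent=2)
print(request_fragment)
```

You would place a fragment like this wherever a stage expects a matrixRows:* input.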
columns
The documents to serve as columns for the similarity matrix rows computation.
Each document gives rise to one column in the output similarity matrix.
max​Neighbors
The maximum number of results to retrieve for each document-row-specific query.
Each row of the resulting similarity matrix will have at most
max​Neighbors
values.
normalized
If true
, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.
If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.
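The normalization rule described above can be sketched in a few lines of Python. The sparse row representation (a dictionary mapping column index to value) and the function name are hypothetical and chosen only for illustration. How Lingo4G handles a row whose maximum is zero is not stated here, so the sketch simply leaves such rows unchanged.

```python
def normalize_row(row):
    """Divide every entry of a sparse row by the row's maximum value.

    `row` maps column index -> similarity value (illustrative format only).
    """
    if not row:
        return {}
    max_value = max(row.values())
    if max_value <= 0:
        # Assumption: nothing sensible to scale by, so keep the row as is.
        return dict(row)
    return {column: value / max_value for column, value in row.items()}


row = {0: 2.0, 3: 8.0, 7: 4.0}
normalized = normalize_row(row)
print(normalized)  # {0: 0.25, 3: 1.0, 7: 0.5}
```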
query​Builder
The query builder to use to build the search query for each document corresponding to one row of the similarity matrix.
rows
The documents to serve as rows for the similarity matrix rows computation.
Each document gives rise to one row of the output similarity matrix.
threads
The number of concurrent threads to use to perform document-row-specific search queries.
matrix​Rows:​composite
Aggregates multiple matrix row sources into one matrix row source.
{
"type": "matrixRows:composite",
"matrixRows": [],
"normalized": true,
"weightAggregation": "SUM"
}
All input matrices must have the same dimensions: the same number of rows and the same number of columns.
The output matrix has the same number of rows as the input matrices. Each output row contains the union of the values provided by the corresponding input rows. Lingo4G combines values present in more than one input using the weightAggregation function.
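The following Python sketch shows one way to picture the composition of a single row. The dictionary-based row format and the function name are hypothetical. Summation corresponds to the SUM setting shown above, and passing a different Python function (for example max) is only an illustration, not a list of the aggregation values Lingo4G supports.

```python
def compose_rows(rows, aggregate=sum):
    """Combine sparse rows (column -> value) coming from several sources.

    The output row contains the union of the input columns. Values present
    in more than one input are combined with `aggregate`.
    """
    collected = {}
    for row in rows:
        for column, value in row.items():
            collected.setdefault(column, []).append(value)
    return {column: aggregate(values) for column, values in sorted(collected.items())}


keyword_row = {0: 0.5, 3: 1.0}
vector_row = {3: 0.25, 5: 0.75}
combined = compose_rows([keyword_row, vector_row])
print(combined)  # {0: 0.5, 3: 1.25, 5: 0.75}
```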
matrix​Rows
The matrix row sources to compose.
All row sources must produce matrices with equal dimensions.
normalized
If true
, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.
If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.
weight​Aggregation
The aggregation function to use to combine values present in more than one input matrix.
See weight​Aggregation
for more information.
matrix​Rows:​from​Matrix
Converts a full matrix into a source of rows. Mostly useful for debugging and testing.
{
"type": "matrixRows:fromMatrix",
"matrix": {
"type": "matrix:reference",
"auto": true
}
}
matrix
The matrix to convert into matrix​Rows:​*
.
matrix​Rows:​keyword​Document​Similarity
Computes keyword-based (More-Like-This) similarities between documents.
{
"type": "matrixRows:keywordDocumentSimilarity",
"index": {
"columns": {
"documents": null
},
"fields": {
"type": "featureFields:reference",
"auto": true
},
"maxColumnDocumentsForSubIndex": 0.3,
"maxInMemorySubIndexSize": 8000000,
"rows": {
"documents": null,
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"failIfEmbeddingsNotAvailable": true,
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labelFilter": {
"type": "labelFilter:reference",
"auto": true
},
"labelListFilter": {
"type": "labelListFilter:truncatedPhrases"
},
"labelWeighting": "EMBEDDING",
"minTf": 0,
"minWeight": 0,
"minWeightMass": 1,
"tieResolution": "AUTO"
},
"maxQueryLabelsPerRowDocument": 10,
"minQueryLabelsPerRowDocument": 0,
"threads": "auto"
},
"threads": "auto"
},
"maxNeighbors": 10,
"minQueryLabelsRequiredInColumnDocument": 1,
"normalized": false,
"threads": "auto"
}
To compute the keyword-based document similarity matrix rows, Lingo4G performs the following steps:
1. For each row document, Lingo4G uses the labelCollector you provide to extract up to maxQueryLabelsPerRowDocument labels that characterize the document. If the number of labels extracted from the row document is smaller than minQueryLabelsPerRowDocument, Lingo4G excludes the document from processing. The corresponding row in the similarity matrix will be empty.
2. If the number of column documents in relation to the total number of documents in the index is smaller than or equal to maxColumnDocumentsForSubIndex, Lingo4G creates a temporary inverted index for the column documents to improve the performance of the similarity matrix building.
3. For each row document, Lingo4G builds a search query consisting of the labels extracted in step 1. Lingo4G restricts the query to find matches only among the column documents. Additionally, the query matches only documents that contain at least minQueryLabelsRequiredInColumnDocument of the document's labels obtained in step 1.
4. For each row document, Lingo4G runs the query it built in step 3 to retrieve up to maxNeighbors matching column documents. Lingo4G uses up to threads threads to execute the queries in parallel.
5. For each row document, Lingo4G builds the corresponding row of the similarity matrix using the matching documents retrieved in step 4.
For example, assuming that the query corresponding to the document at index 2 in the row documents array matched column documents at index 0 and 3, Lingo4G puts the following values into the similarity matrix: $M_{2,0} = s_{2,0}$ and $M_{2,3} = s_{2,3}$, where $s_{2,0}$ and $s_{2,3}$ are the search scores obtained in step 4 for the document at index 2.
Note that the output matrix is rectangular: the number of rows is equal to the number of row documents and the number of columns is equal to the number of column documents on input.
6. If normalized is true, Lingo4G normalizes the values in each row of the matrix to fall in the 0...1 range.
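The query, retrieval and normalization steps above can be mimicked with a toy Python sketch. It is not Lingo4G's implementation: documents are reduced to plain label sets, and the number of shared labels stands in for a real search score. The function name and the sample data are hypothetical.

```python
def keyword_similarity_rows(row_labels, column_labels, max_neighbors=10,
                            min_labels_required=1, normalized=False):
    """Build one sparse row (column index -> score) per row document.

    row_labels / column_labels: lists of label sets, one per document.
    """
    rows = []
    for query_labels in row_labels:
        # Query building and matching: keep column documents that share
        # at least `min_labels_required` labels with the row document.
        scored = []
        for column_index, labels in enumerate(column_labels):
            shared = len(query_labels & labels)
            if shared >= min_labels_required:
                scored.append((shared, column_index))
        # Retrieval: keep the `max_neighbors` best matches, best first.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        row = {column: float(score) for score, column in scored[:max_neighbors]}
        # Optional normalization by the row maximum.
        if normalized and row:
            top = max(row.values())
            row = {column: value / top for column, value in row.items()}
        rows.append(row)
    return rows


row_docs = [{"neural", "network"}, {"galaxy"}, {"graph", "network", "neural"}]
column_docs = [{"neural", "network", "graph"}, {"galaxy", "star"},
               {"protein"}, {"network"}]
rows = keyword_similarity_rows(row_docs, column_docs, max_neighbors=2)
print(rows[2])  # {0: 3.0, 3: 1.0}
```

As in the example from step 5, the row document at index 2 ends up with values only in columns 0 and 3.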
index
Configures the rows and columns of this similarity matrix rows source.
Additionally, this section configures the temporary inverted index Lingo4G may create to speed up the computation of the similarity matrix.
columns
Describes the columns of this similarity matrix rows source.
documents
The documents to serve as columns for the similarity matrix rows computation.
Each document gives rise to one column in the output similarity matrix.
fields
The feature fields to use when looking for similar column documents.
Lingo4G uses the feature fields you provide in this property to run the similar document search queries in step 4 of the matrix building algorithm.
max​Column​Documents​For​Sub​Index
Determines the threshold for creating a temporary inverted index.
Lingo4G can significantly speed up the computation of keyword similarities for a small set of documents by creating and querying a temporary disposable inverted index containing just the column documents. Lingo4G creates the temporary index only when the number of column documents divided by the total number of documents in the index is smaller than or equal to the value of this property.
For example, if maxColumnDocumentsForSubIndex is 0.3 and the column documents set contains no more than 30% of all documents in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
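The threshold rule can be written as a short check. This is a sketch under the assumption that the comparison is a plain ratio test, as described in this property's text. The function and parameter names are hypothetical.

```python
def should_create_sub_index(column_document_count, total_document_count,
                            max_column_documents_for_sub_index=0.3):
    """Return True when the column documents are a small enough share of the index."""
    if total_document_count <= 0:
        return False
    ratio = column_document_count / total_document_count
    return ratio <= max_column_documents_for_sub_index


print(should_create_sub_index(20_000, 100_000))  # True: 20% of the index
print(should_create_sub_index(50_000, 100_000))  # False: 50% of the index
```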
max​In​Memory​Sub​Index​Size
Maximum size of the in-memory temporary index, in bytes.
If the size of the temporary index exceeds
max​In​Memory​Sub​Index​Size
, Lingo4G materializes the index on disk in a temporary directory.
rows
Describes the rows of this similarity matrix rows source.
documents
The documents to serve as rows for the similarity matrix rows computation.
Each document gives rise to one row of the output similarity matrix.
label​Collector
Determines which labels to use to build similar documents search queries.
Lingo4G uses the label collector you provide in step 1 of the matrix building algorithm to extract labels describing each row document.
The default label collector retrieves up to
max​Query​Labels​Per​Row​Document
of the most frequent labels in each row document. Provide a custom label collector to modify this behavior.
max​Query​Labels​Per​Row​Document
The maximum number of labels to use to build the similar documents search query.
Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each row document.
Increasing maxQueryLabelsPerRowDocument, coupled with a larger maxNeighbors value, produces broader, more general similarity.
Lingo4G ignores the maxQueryLabelsPerRowDocument property if you set a custom labelCollector.
min​Query​Labels​Per​Row​Document
The minimum number of labels the row document must contain to be included in the similarity matrix computation.
Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each row document. If a row document contains fewer than minQueryLabelsPerRowDocument labels, Lingo4G excludes it from processing, leaving the corresponding row in the similarity matrix empty.
If you want to exclude from further processing documents containing just one label, set minQueryLabelsPerRowDocument to 2 or more.
If you increase minQueryLabelsPerRowDocument, make sure to set maxQueryLabelsPerRowDocument to a value equal to or greater than minQueryLabelsPerRowDocument.
Lingo4G ignores the min​Query​Labels​Per​Row​Document
property if you set a custom
label​Collector
.
threads
The number of concurrent threads to use to collect labels from row documents.
threads
The number of concurrent threads to use to build the temporary inverted index.
max​Neighbors
The maximum number of similar column documents to retrieve for each row document.
Lingo4G uses the max​Neighbors
property in
step 4
of the similarity matrix building algorithm to determine the maximum number of similar column documents to
retrieve for each row of the similarity matrix. Therefore, this stage produces matrices whose rows contain at
most
max​Neighbors
values.
min​Query​Labels​Required​In​Column​Document
The minimum number of common labels required for two documents to be treated as similar.
Lingo4G uses this property in step 3 of the
similarity matrix building algorithm. If you increase min​Query​Labels​Required​In​Column​Document
beyond the default value of 1, Lingo4G removes from the similarity matrix those document pairs that have fewer
than
min​Query​Labels​Required​In​Column​Document
labels in common.
If you don't want to base document similarities on a single label shared between documents, increase
min​Query​Labels​Required​In​Column​Document
beyond the default value of 1.
normalized
If true
, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.
If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.
threads
The number of concurrent threads to use to execute document similarity search queries.
matrix​Rows:​knn​Vectors​Similarity
Computes a label or document similarity matrix based on multidimensional vector distance.
{
"type": "matrixRows:knnVectorsSimilarity",
"maxNeighbors": 10,
"minSimilarity": 0,
"threads": "auto",
"vectors": {
"columns": {
"type": "vectors:reference",
"auto": true
},
"rows": {
"type": "vectors:reference",
"auto": true
}
}
}
For each vector in the rows
vector set, Lingo4G finds up to
max​Neighbors
closest vectors in the columns
vector set and transfers the cosine similarities between the nearest vectors to the output matrix. Therefore, the
output matrix is rectangular: the number of rows is equal to the number of vectors in the
rows
set, the number of columns is equal to the number of vectors in the
columns
vector set.
If you use
vectors:​precomputed​Document​Embeddings
as the input vector set, this stage computes similarities between documents. Similarly, if you provide
vectors:​precomputed​Label​Embeddings
on input, this stage computes similarities between labels. You can also mix label and document embeddings in
computation of the same matrix to compute label-to-documents or documents-to-labels similarities.
In many cases, embedding-based similarities offer better results compared to their keyword-based counterparts.
max​Neighbors
The maximum number of nearest column vectors to find for each row vector.
The larger max​Neighbors
, the denser the matrix.
min​Similarity
Specifies the minimum allowed similarity value.
The similarity matrix rows will exclude values lower than min​Similarity
.
During a k-nearest-neighbors search on embedding vectors, the result usually includes exactly maxNeighbors matches, even if some similarity scores are very low. You can use the minSimilarity threshold to exclude low-scoring matches.
Setting min​Similarity
to a value higher than 0 may result in some empty rows in the similarity
matrix. Consequently, when using such a matrix for clustering or 2d mapping, some documents may remain
unclustered or without 2d embedding coordinates.
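The interplay of maxNeighbors and minSimilarity can be illustrated with a brute-force Python sketch. It computes exact cosine similarities on tiny hand-made vectors. The search method Lingo4G actually uses is not described here, and all names and data in the sketch are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_rows(row_vectors, column_vectors, max_neighbors=10, min_similarity=0.0):
    """For each row vector, keep up to max_neighbors closest column vectors,
    then drop matches whose similarity is lower than min_similarity."""
    rows = []
    for row_vector in row_vectors:
        scored = [(cosine(row_vector, column_vector), index)
                  for index, column_vector in enumerate(column_vectors)]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        rows.append({index: similarity
                     for similarity, index in scored[:max_neighbors]
                     if similarity >= min_similarity})
    return rows


row_vectors = [[1.0, 0.0], [0.0, -1.0]]
column_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
rows = knn_rows(row_vectors, column_vectors, max_neighbors=3, min_similarity=0.5)
print(rows)
```

The second row vector has no column vector with a similarity of at least 0.5, so its row is empty, which is the situation the paragraph above warns about.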
threads
The number of threads to use for the computation.
vectors
Configures the multidimensional vector sets to use for the computation of these similarity matrix rows.
columns
The multidimensional vector sets to use for the columns of this similarity matrix.
The number of columns of the output matrix is equal to the number of vectors in the vector set you provide in this property.
rows
The multidimensional vector sets to use for the rows of this similarity matrix.
The number of rows of the output matrix is equal to the number of vectors in the vector set you provide in this property.
matrix​Rows:​weighted
Applies additional value weighting to the provided matrix rows.
{
"type": "matrixRows:weighted",
"matrixRows": {
"type": "matrixRows:reference",
"auto": true
},
"offset": 0,
"weight": 1
}
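Lingo4G computes each value of the output matrix using the following formula:
$v_{out} = w \cdot v_{in} + o$
where:
$v_{in}$ is the corresponding value of the input matrix,
$w$ is the value of the weight property of this stage,
$o$ is the value of the offset property of this stage.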
This stage is most useful in combination with matrix​Rows:​composite
, where you can use it to balance the weights of the matrices being composed.
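A minimal Python sketch of the weighting follows. It assumes the linear form weight * value + offset, which is inferred from the descriptions of the weight and offset properties rather than confirmed by this page. The row format and function name are hypothetical.

```python
def apply_weighting(row, weight=1.0, offset=0.0):
    """Apply the assumed linear transform to every value of a sparse row."""
    return {column: weight * value + offset for column, value in row.items()}


row = {0: 0.5, 3: 1.0}
weighted = apply_weighting(row, weight=2.0, offset=0.25)
print(weighted)  # {0: 1.25, 3: 2.25}
```

With the default weight of 1 and offset of 0, the row passes through unchanged.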
matrix​Rows
The matrix rows to which to apply additional weighting.
offset
The offset to add to each value of the input matrix.
weight
The weight by which to multiply each value of the input matrix.
Consumers of matrixRows:*
The following stages and components take matrixRows:* as input: