matrix
The matrix:* stages group various ways of producing matrices. The semantics of matrix rows and columns depends on the specific stage. You can use matrices as input to the clustering and 2d embedding stages.
Matrices are the bridge between individual entities, such as labels or documents, and their aggregations, such as clusters or 2d maps. Clustering and 2d mapping stages do not accept documents or labels directly on input. Instead, they accept similarity matrices, which define the semantics and interpretation of clusters or 2d maps.
Matrix building, clustering and 2d mapping stages rely on a very important concept universal to the whole Lingo4G analysis API: index alignment. When you build a similarity matrix, indices of rows and columns of the matrix correspond to the indices of documents or labels you provided on input. When you perform clustering on such a matrix, you receive clusters of related indices of the input matrix. Since indices across input entities, matrices and clusters are aligned, to find out which specific labels or documents got clustered, you need to look up the indices in the list of documents or labels you provided when building the similarity matrix.
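The index alignment described above can be illustrated with a minimal Python sketch. The label list and the cluster structure below are hypothetical and simplified to plain lists of indices; the actual cluster output of Lingo4G is richer than this.

```python
# Hypothetical input list; in a real request it would come from a labels:* stage.
labels = ["neural network", "deep learning", "k-means", "clustering"]

# Hypothetical, simplified clustering result: each cluster lists row indices
# of the similarity matrix, which are also indices into `labels`.
clusters = [[0, 1], [2, 3]]


def resolve(clusters, entities):
    """Map each cluster of indices back to the entities at those indices."""
    return [[entities[i] for i in cluster] for cluster in clusters]


resolved = resolve(clusters, labels)
print(resolved)
```

The same lookup applies to documents: keep the list you passed to the matrix stage and index into it with the values returned by the clustering stage.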
You can use the following matrix stages in your analysis requests:
- matrix:cooccurrenceLabelSimilarity
  Computes a label similarity matrix based on how the labels co-occur in the documents you provide.
- matrix:direct
  Returns a matrix whose contents you provide directly.
- matrix:elementWiseProduct
  Computes an element-wise product of two matrices.
- matrix:fromMatrixRows
  Collects matrix rows into an in-memory matrix.
- matrix:keywordDocumentSimilarity
  Computes a document similarity matrix based on the labels the documents share.
- matrix:keywordLabelDocumentSimilarity
  Computes similarities between a list of labels and a set of documents.
- matrix:knn2dDistanceSimilarity
  Computes a matrix of similarities between 2d embeddings based on the 2d Euclidean distance. You can use this matrix to identify clusters of nearby points in a 2d map.
- matrix:knnVectorsSimilarity
  Computes a label or document similarity matrix based on multidimensional vector distance.
- matrix:reference
  References the results of another matrix:* stage defined in the request.
The JSON output of the matrix stage has the following structure:
- columns
  The number of columns in this matrix. The number of rows is equal to the length of the indices, values and diagonals arrays.
- indices
  Indices of the non-zero elements of the matrix. Each element in the array corresponds to one matrix row. For each row, the nested array lists zero-based indices of the non-empty matrix elements in that row.
- values
  Values of the non-zero elements of the matrix. Each element in the array corresponds to one matrix row. For each row, the nested array contains values of elements at the corresponding matrix indices indicated in the indices array.
Notes:
-
Most matrices in Lingo4G are sparse, hence the specific JSON output format.
-
Some matrix rows may be all-zeros. In such cases, the corresponding indices and values arrays are empty.
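As an illustration of the sparse output format, the following Python sketch expands the columns, indices and values properties into dense rows. The example matrix is made up, and the helper name to_dense is hypothetical.

```python
def to_dense(matrix):
    """Expand the sparse JSON structure into a list of dense rows."""
    dense = []
    for row_indices, row_values in zip(matrix["indices"], matrix["values"]):
        row = [0.0] * matrix["columns"]
        for column, value in zip(row_indices, row_values):
            row[column] = value
        dense.append(row)
    return dense


# Made-up 3x3 matrix; the second row is all zeros, so its arrays are empty.
sparse = {
    "columns": 3,
    "indices": [[0, 2], [], [1]],
    "values": [[1.0, 0.5], [], [0.25]],
}
dense = to_dense(sparse)
print(dense)
```

Dense expansion is practical only for small matrices; for larger ones it is usually better to keep working with the per-row index and value arrays.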
matrix:​cooccurrence​Label​Similarity
Computes a label similarity matrix based on how the labels co-occur in the documents you provide.
{
"type": "matrix:cooccurrenceLabelSimilarity",
"cooccurrenceWindowSize": 32,
"documents": {
"type": "documents:reference",
"auto": true
},
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labels": {
"type": "labels:reference",
"auto": true
},
"normalized": true,
"similarityWeighting": "INCLUSION",
"threads": "auto"
}
You can use the co-occurrence-based similarity matrix to cluster or 2d-map a set of labels.
Lingo4G computes the co-occurrence-based label similarity matrix in the following way:
- For each pair (labelA, labelB) of input labels, Lingo4G scans the input documents to compute:
  - how many times labelA occurred in the input documents,
  - how many times labelB occurred in the input documents,
  - how many times both labelA and labelB occurred in the document, at most cooccurrenceWindowSize words apart.
- Based on these three frequencies and the similarityWeighting method you choose, Lingo4G computes the similarity of labelA to labelB and of labelB to labelA, and puts them in the output matrix at rows and columns corresponding to the indices of labelA and labelB in the input labels list.
  For example, if on the labels list labelA is at index 0, and labelB is at index 2, Lingo4G puts the corresponding similarities at row 0, column 2 and at row 2, column 0 of the output matrix.
- If normalized is true, Lingo4G globally normalizes all values in the similarity matrix to the 0...1 range.
Note that depending on the similarityWeighting you choose, the matrix may or may not be symmetrical.
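The counting part of the procedure above can be sketched in Python as follows. This is a simplified illustration, not the Lingo4G implementation: it assumes single-word labels, pre-tokenized documents, and that "at most N words apart" means the token positions differ by at most N. The function name count_cooccurrences is hypothetical.

```python
from itertools import combinations


def count_cooccurrences(documents, labels, window):
    """Count label occurrences and windowed co-occurrences.

    documents: list of token lists; labels: single-token labels (a simplification).
    Returns (freq, cooc): freq[i] is the number of occurrences of labels[i],
    cooc[i][j] counts occurrence pairs of labels i and j within `window` tokens.
    """
    n = len(labels)
    index = {label: i for i, label in enumerate(labels)}
    freq = [0] * n
    cooc = [[0] * n for _ in range(n)]
    for tokens in documents:
        positions = [(p, index[t]) for p, t in enumerate(tokens) if t in index]
        for _, i in positions:
            freq[i] += 1
        for (pa, a), (pb, b) in combinations(positions, 2):
            if a != b and abs(pa - pb) <= window:
                cooc[a][b] += 1
                cooc[b][a] += 1
    return freq, cooc


docs = [
    "apple pie with banana and apple".split(),
    "banana bread".split(),
]
freq, cooc = count_cooccurrences(docs, ["apple", "banana"], window=3)
print(freq, cooc)
```

The raw counts are symmetric; any asymmetry in the final matrix would come from the weighting applied to them afterwards.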
The following request computes co-occurrence counts for a list of labels across all documents containing at least one of the labels.
Note how the request uses the query:forLabels query component to select documents containing at least one of the input labels. We also use the COOCCURRENCES similarity weighting and disable matrix normalization to get the actual number of label co-occurrences.
A typical use case for matrix:cooccurrenceLabelSimilarity is clustering or 2d mapping of labels. The following request briefly demonstrates this use case.
The above request collects 1000 labels from documents containing the word clustering, computes a co-occurrence similarity matrix for those labels and then arranges the labels into clusters and a 2d-map. The request uses the explicit output.stages array to prevent the output of the similarity matrix. Also note that the request uses auto-references to pass results between all the stages.
If you run the request in the JSON Sandbox app, you should see an interactive visualization of the clusters and the 2d map. Also have a look at the diagram tab, for a graphical representation of the connections between various stages.
Finally, if your index contains label embeddings, the embedding-based matrix:knnVectorsSimilarity stage may provide better clusters and 2d-maps.
cooccurrence​Window​Size
Determines the maximum number of words that can separate co-occurring labels.
For example, if you set the co-occurrence window size to 10, Lingo4G treats two labels as co-occurring if they occur in the document at most 10 words apart.
Use smaller co-occurrence windows for sparser, more focused similarity matrices.
documents
The documents in which to count label co-occurrences.
fields
The documents' feature fields in which to count label co-occurrences.
labels
The labels whose co-occurrences to count.
normalized
If true, Lingo4G globally normalizes the similarity matrix to contain values in the 0...1 range.
The embedding2d:lv stage requires normalized matrix values.
One use case for non-normalized matrix values is computing the actual label co-occurrence frequencies with similarityWeighting set to COOCCURRENCES.
similarity​Weighting
Determines the binary similarity weighting Lingo4G applies to raw label co-occurrence when computing label similarity values.
In most cases, the RR, INCLUSION and BB weightings provide the best clustering and 2d mapping results.
The similarityWeighting property supports the following values:
- RR
  Russell-Rao similarity. Similarity values will be proportional to the raw co-occurrence counts. The RR weighting creates rather large clusters and selects frequent labels as cluster label exemplars.
- INCLUSION
  Inclusion coefficient similarity. Emphasizes connections between labels sharing the same words, for example Mac OS and Mac OS X 10.6.
- LOEVINGER
  The inclusion coefficient corrected for chance.
- BB
  Braun-Blanquet similarity. Maximizes similarity between labels having similar numbers of occurrences. Promotes lower-frequency labels as cluster exemplars.
- DICE
  Dice coefficient.
- YULE
  Yule coefficient.
- OCHIAI
  Ochiai coefficient, binary cosine.
- INNER_PRODUCT
  Inner product of the rows of the co-occurrence matrix.
- COSINE
  Cosine distance between the rows of the co-occurrence matrix.
- PEARSON
  Pearson correlation between the rows of the co-occurrence matrix.
- COOCCURRENCES
  Number of co-occurrences of labels. Set the normalized property to false to get the actual numbers in the output matrix.
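To give a feel for how some of these weightings behave, the Python sketch below evaluates the common textbook forms of a few coefficients from three counts: occurrences of labelA, occurrences of labelB and their co-occurrences. These formulas are an assumption for illustration only. The exact definitions Lingo4G uses, including which weightings are asymmetric, are not stated in this section and may differ.

```python
import math


def weightings(f_a, f_b, f_ab):
    """Textbook binary similarity coefficients (assumed forms, not Lingo4G's code).

    f_a, f_b: occurrences of labelA and labelB; f_ab: their co-occurrences.
    """
    return {
        "INCLUSION": f_ab / min(f_a, f_b),      # overlap (inclusion) coefficient
        "BB": f_ab / max(f_a, f_b),             # Braun-Blanquet
        "DICE": 2 * f_ab / (f_a + f_b),         # Dice
        "OCHIAI": f_ab / math.sqrt(f_a * f_b),  # binary cosine
        "COOCCURRENCES": f_ab,                  # raw count
    }


# Made-up counts: a rare label that mostly appears together with a frequent one.
w = weightings(f_a=10, f_b=40, f_ab=8)
print(w)
```

With these counts the inclusion coefficient is high while Braun-Blanquet is low, which matches the descriptions above: inclusion rewards a label being contained in the contexts of another, and Braun-Blanquet favors labels with similar numbers of occurrences.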
threads
The number of threads to use to count label co-occurrences.
matrix:​direct
A matrix where you directly provide all values in rows and columns.
{
"type": "matrix:direct",
"matrix": {
"columns": 0,
"indices": [],
"values": []
}
}
Direct matrices are useful mostly for debugging purposes or when you want to cluster or 2d-map a set of similarities coming from an external source.
matrix
Definition of the matrix.
The definition must follow the sparse matrix JSON structure.
columns
The number of columns of the matrix.
indices
Indices of non-zero elements of the matrix.
values
Values corresponding to indices.
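If your similarities come from an external source as a dense array, a small script can produce the stage definition. The Python sketch below builds a matrix:direct JSON object in the sparse structure described earlier; the helper name direct_matrix_stage and the sample values are hypothetical.

```python
import json


def direct_matrix_stage(dense):
    """Build a matrix:direct stage definition from a dense list of rows."""
    columns = len(dense[0]) if dense else 0
    indices, values = [], []
    for row in dense:
        nonzero = [(c, v) for c, v in enumerate(row) if v != 0]
        indices.append([c for c, _ in nonzero])
        values.append([v for _, v in nonzero])
    return {
        "type": "matrix:direct",
        "matrix": {"columns": columns, "indices": indices, "values": values},
    }


# Made-up 3x3 similarity matrix.
stage = direct_matrix_stage([[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]])
print(json.dumps(stage, indent=2))
```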
matrix:​element​Wise​Product
Computes the element-by-element product of two matrices.
{
"type": "matrix:elementWiseProduct",
"factorA": null,
"factorB": null
}
factor​A
Input matrix.
factor​B
Input matrix.
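The element-wise product can be sketched in Python on the sparse JSON structure as shown below. This is an illustration under the assumption that both factors have the same shape; how Lingo4G handles mismatched shapes is not described here, so the error raised in the sketch is hypothetical.

```python
def element_wise_product(a, b):
    """Multiply two sparse matrices of equal shape element by element."""
    if a["columns"] != b["columns"] or len(a["indices"]) != len(b["indices"]):
        raise ValueError("matrices must have the same shape")
    indices, values = [], []
    for ia, va, ib, vb in zip(a["indices"], a["values"], b["indices"], b["values"]):
        row_b = dict(zip(ib, vb))
        # An element is non-zero only if it is non-zero in both factors.
        pairs = [(c, v * row_b[c]) for c, v in zip(ia, va) if c in row_b]
        indices.append([c for c, _ in pairs])
        values.append([v for _, v in pairs])
    return {"columns": a["columns"], "indices": indices, "values": values}


factor_a = {"columns": 3, "indices": [[0, 1], [2]], "values": [[2.0, 3.0], [4.0]]}
factor_b = {"columns": 3, "indices": [[1, 2], [2]], "values": [[0.5, 9.0], [0.25]]}
product = element_wise_product(factor_a, factor_b)
print(product)
```

Because zero times anything is zero, the product is at least as sparse as either factor, which makes it usable as a mask: one matrix can filter the entries of another.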
matrix:​from​Matrix​Rows
Materializes matrix rows into an in-memory matrix.
{
"type": "matrix:fromMatrixRows",
"matrixRows": {
"type": "matrixRows:reference",
"auto": true
}
}
This stage is useful mostly for debugging requests involving matrix rows.
matrix​Rows
The matrix rows to materialize.
matrix:​keyword​Document​Similarity
Computes a document similarity matrix based on the labels the documents share.
{
"type": "matrix:keywordDocumentSimilarity",
"documents": {
"type": "documents:reference",
"auto": true
},
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labelFilter": {
"type": "labelFilter:reference",
"auto": true
},
"labelListFilter": {
"type": "labelListFilter:truncatedPhrases"
},
"minTf": 0,
"minTfMass": 1,
"tieResolution": "AUTO"
},
"maxDocumentsForSubIndex": 0.3,
"maxInMemorySubIndexSize": 8000000,
"maxNeighbors": 8,
"maxQueryLabelsPerDocument": 4,
"minQueryLabelsPerDocument": 1,
"minQueryLabelsRequiredInSimilarDocument": 1,
"normalized": true,
"threads": "auto"
}
You can use the keyword-based document similarity matrix to cluster or 2d-map a set of documents.
To compute the keyword-based document similarity, Lingo4G performs the following steps:
1. For each document in the input documents, Lingo4G uses the labelCollector you provide to extract up to maxQueryLabelsPerDocument labels that characterize the document.
   If the number of labels extracted from the document is smaller than minQueryLabelsPerDocument, Lingo4G excludes the document from processing. The corresponding row in the similarity matrix will be empty.
2. If the number of input documents in relation to the total number of documents in the index does not exceed maxDocumentsForSubIndex, Lingo4G creates a temporary inverted index that improves the performance of the similarity matrix building.
3. For each input document, Lingo4G builds a search query consisting of labels extracted in step 1. Lingo4G restricts the query to find matches only among the input documents. Additionally, the query matches only documents that contain at least minQueryLabelsRequiredInSimilarDocument of the document's labels obtained in step 1.
4. For each input document, Lingo4G runs the corresponding query it built in step 3 to retrieve up to maxNeighbors matching documents. Lingo4G uses up to threads threads to execute the queries in parallel.
5. For each input document, Lingo4G builds the corresponding row of the similarity matrix using the matching documents retrieved in step 4.
   For example, assuming that the query corresponding to the document at index 2 in the input documents array matched documents at indices 0, 2 and 3, Lingo4G fills row 2 of the similarity matrix with non-zero values at columns 0, 2 and 3. These values are the search scores obtained in step 4 for the document at index 2.
   Note that the matrix is square and asymmetrical: the search query for the document at index 0 or 3 may not return document 2.
6. If normalized is true, Lingo4G normalizes the values in each row of the matrix to fall in the 0...1 range.
In the Apache Lucene, Solr and Elasticsearch world, this kind of similarity is also called More-Like-This similarity.
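The overall flow of the algorithm can be approximated with the toy Python sketch below. It is not how Lingo4G scores documents: a real search engine score is replaced with the number of shared labels, label collection is replaced with picking the most frequent labels from a ready-made list, and the temporary index and threading are left out. All function and parameter names are hypothetical.

```python
from collections import Counter


def keyword_document_similarity(docs, max_query_labels=4, min_query_labels=1,
                                min_required=1, max_neighbors=8):
    """docs: one list of labels per document (stand-in for label collection)."""
    # Step 1: pick query labels; documents with too few labels are skipped.
    queries = []
    for labels in docs:
        top = [label for label, _ in Counter(labels).most_common(max_query_labels)]
        queries.append(top if len(top) >= min_query_labels else None)

    indices, values = [], []
    for query in queries:
        if query is None:
            indices.append([])
            values.append([])
            continue
        # Steps 3-4: "search" among the input documents, keep the best matches.
        hits = []
        for j, labels in enumerate(docs):
            shared = len(set(query) & set(labels))
            if shared >= min_required:
                hits.append((j, float(shared)))  # toy score: shared label count
        hits = sorted(hits, key=lambda h: -h[1])[:max_neighbors]
        hits.sort()  # step 5: store the row ordered by column index
        indices.append([j for j, _ in hits])
        values.append([s for _, s in hits])
    return {"columns": len(docs), "indices": indices, "values": values}


docs = [["a", "b"], ["b", "c"], [], ["a", "b", "c"]]
result = keyword_document_similarity(docs)
print(result)
```

In this example the third document has no labels, so its row stays empty, while every other row contains at most max_neighbors entries.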
The following request uses keyword document similarity to cluster and 2d map documents.
If you run the above request in the JSON Sandbox app, you should see the documents represented as a 2d map with point color corresponding to the top-level cluster the document belongs to.
documents
The documents among which to compute keyword-based similarities.
If you provide a set of N documents, this stage produces a square N × N similarity matrix.
fields
The feature fields to use when looking for similar documents.
Lingo4G uses the feature fields you provide in this property to run the similar document search queries in step 4 of the matrix building algorithm.
label​Collector
Determines which labels to use to build similar documents search queries.
Lingo4G uses the label collector you provide in step 1 of the matrix building algorithm to extract labels describing each input document.
The default label collector retrieves up to maxQueryLabelsPerDocument of the most frequent labels in each document. Provide a custom label collector to modify this behavior.
max​Documents​For​Sub​Index
Determines the threshold for creating a temporary inverted index.
Lingo4G can significantly speed up the computation of keyword similarities for a small set of documents by creating and querying a temporary disposable inverted index containing just the input documents. Lingo4G creates the temporary index only when the number of input documents divided by the total number of documents in the index is smaller than or equal to the value of this property.
For example, if maxDocumentsForSubIndex is 0.3 and the documents set contains at most 30% of all documents in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
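The threshold rule can be expressed as a one-line check. The Python sketch below is only an illustration of the ratio comparison described above; the function name and the document counts are made up.

```python
def uses_temporary_index(input_documents, index_documents,
                         max_documents_for_sub_index=0.3):
    """True when the input set is small enough to warrant a temporary index."""
    return input_documents / index_documents <= max_documents_for_sub_index


small_set = uses_temporary_index(2_000, 10_000)   # 20% of the index
large_set = uses_temporary_index(5_000, 10_000)   # 50% of the index
print(small_set, large_set)
```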
max​In​Memory​Sub​Index​Size
Maximum size of the in-memory temporary index, in bytes.
If the size of the temporary index exceeds maxInMemorySubIndexSize, Lingo4G materializes the index on disk in a temporary directory.
max​Neighbors
The maximum number of similar documents to retrieve for each input document.
Lingo4G uses the maxNeighbors property in step 4 of the similarity matrix building algorithm to determine the maximum number of similar documents to retrieve for each row of the similarity matrix. Therefore, this stage produces matrices whose rows contain at most maxNeighbors values.
Increasing the number of similar documents produces similarity matrices that give rise to larger clusters and tighter 2d maps. Conversely, lowering the number of similar documents gives rise to smaller clusters and more sparse 2d maps.
max​Query​Labels​Per​Document
The maximum number of labels to use to build the similar documents search query.
Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each input document.
Increasing maxQueryLabelsPerDocument, coupled with a larger maxNeighbors value, produces broader, more general similarity matrices that usually give rise to larger clusters and denser 2d maps.
Lingo4G ignores the maxQueryLabelsPerDocument property if you set a custom labelCollector.
min​Query​Labels​Per​Document
The minimum number of labels the document must contain to be included in the similarity matrix computation.
Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each input document.
If a document contains fewer than minQueryLabelsPerDocument labels, Lingo4G excludes it from processing, leaving the corresponding row in the similarity matrix empty.
If you want to exclude from further processing documents containing just one label, increase minQueryLabelsPerDocument beyond the default value of 1. Clusters arising from such similarity matrices are usually smaller and 2d maps are more sparse.
If you increase minQueryLabelsPerDocument, make sure to set maxQueryLabelsPerDocument to a value equal to or greater than minQueryLabelsPerDocument.
Lingo4G ignores the minQueryLabelsPerDocument property if you set a custom labelCollector.
min​Query​Labels​Required​In​Similar​Document
The minimum number of common labels required for two documents to be treated as similar.
Lingo4G uses this property in step 3 of the similarity matrix building algorithm. If you increase minQueryLabelsRequiredInSimilarDocument beyond the default value of 1, Lingo4G removes from the similarity matrix those document pairs that have fewer than minQueryLabelsRequiredInSimilarDocument labels in common.
If you don't want to base document similarities on a single label shared between documents, increase minQueryLabelsRequiredInSimilarDocument beyond the default value of 1. Clusters arising from such similarity matrices are usually smaller and 2d maps are more sparse.
normalized
If true, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.
If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.
If you plan to feed the similarity matrix to the embedding2d:lv 2d embedding algorithm, make sure the normalized property is true. Otherwise, Lingo4G throws an error to prevent incorrect 2d embedding results.
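The per-row normalization described above (divide each row by its maximum value) can be sketched in Python on the sparse JSON structure. The helper name normalize_rows is hypothetical, and leaving empty rows untouched is an assumption of this sketch.

```python
def normalize_rows(matrix):
    """Divide every row by its maximum value; empty rows stay empty."""
    values = []
    for row in matrix["values"]:
        peak = max(row, default=0.0)
        values.append([v / peak for v in row] if peak > 0 else list(row))
    return {**matrix, "values": values}


# Made-up matrix with raw search scores; the second row is empty.
raw = {"columns": 3, "indices": [[0, 2], [], [1]], "values": [[2.0, 4.0], [], [5.0]]}
normalized = normalize_rows(raw)
print(normalized["values"])
```

After this step the largest value in every non-empty row is exactly 1, so values are comparable within a row but not necessarily across rows.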
threads
The number of concurrent threads to use to execute document similarity search queries.
matrix:​keyword​Label​Document​Similarity
Computes a rectangular similarity matrix between a list of labels and a set of documents you provide.
{
"type": "matrix:keywordLabelDocumentSimilarity",
"documents": {
"type": "documents:reference",
"auto": true
},
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labels": {
"type": "labels:reference",
"auto": true
},
"maxSimilarDocumentsPerLabel": 5,
"threads": "auto"
}
You can use this stage to overlay labels on a pre-existing 2d map of documents in such a way that labels summarize the documents lying in various areas of the map.
Lingo4G computes the label-document similarity matrix in the following way:
1. For each label from the input labels list, Lingo4G builds a search query consisting of that label, covering the fields you provide and limited to the set of target documents.
2. For each input label, Lingo4G executes the search query it built in step 1 and collects up to maxSimilarDocumentsPerLabel documents. Then, Lingo4G sets the value in the output similarity matrix at the row corresponding to the label index and the column corresponding to the index of the matching document to the search score of the matching document.
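The shape of the resulting matrix (labels in rows, documents in columns) can be illustrated with the toy Python sketch below. The search score is replaced with a plain substring count, which is an assumption made purely for illustration and not the scoring Lingo4G uses; the function name is hypothetical.

```python
def label_document_similarity(labels, documents, max_similar_documents_per_label=5):
    """Rows correspond to labels, columns to documents."""
    indices, values = [], []
    for label in labels:
        # Toy score: number of occurrences of the label in the document text.
        hits = [(d, float(text.count(label))) for d, text in enumerate(documents)]
        hits = [h for h in hits if h[1] > 0]
        hits = sorted(hits, key=lambda h: -h[1])[:max_similar_documents_per_label]
        hits.sort()  # store each row ordered by document index
        indices.append([d for d, _ in hits])
        values.append([s for _, s in hits])
    return {"columns": len(documents), "indices": indices, "values": values}


documents = ["solar panels and solar cells", "wind turbines", "solar and wind power"]
matrix = label_document_similarity(["solar", "wind", "hydro"], documents, 1)
print(matrix)
```

The matrix is rectangular: three label rows over three document columns here, but the two dimensions are independent in general.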
The following request uses the matrix:keywordLabelDocumentSimilarity stage to describe a 2d map of documents by placing labels near groups of documents related to those labels.
See the reference documentation for the embedding2d:lvOverlay stage for an in-depth explanation of this kind of request.
documents
The documents for which to compute the similarity matrix.
Column indices in the output matrix correspond to indices on the document list you provide.
fields
The document fields to search for labels when building the similarity matrix.
labels
The labels for which to build the similarity matrix.
Row indices in the output matrix correspond to indices on the label list you provide.
max​Similar​Documents​Per​Label
The maximum number of similar documents to retrieve for each label.
Each row of the output matrix has at most maxSimilarDocumentsPerLabel non-zero values.
threads
The number of parallel threads to use when building the similarity matrix.
matrix:​knn2d​Distance​Similarity
Computes a matrix of similarities based on Euclidean distance between 2d points.
{
"type": "matrix:knn2dDistanceSimilarity",
"embedding2d": {
"type": "embedding2d:reference",
"auto": true
},
"maxNearestPoints": 8
}
You can use this stage to identify clusters of nearby points in a 2d map.
Lingo4G computes the matrix in the following way:
1. For each 2d point in the input embedding2d, find up to maxNearestPoints nearest points with respect to the Euclidean distance.
2. For each nearest point found in step 1, compute the similarity from the Euclidean distance between the two points.
   The similarity formula converts 2d distances to the 0...1 range in such a way that the similarity for points with zero distance is 1 and the similarity for points that are infinitely apart is 0.
3. Put the similarity computed in step 2 into the similarity matrix at the row and column location corresponding to the two points.
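The steps above can be sketched in Python as follows. The distance-to-similarity conversion used here, 1 / (1 + d), is an assumption: it has the stated properties (1 at zero distance, approaching 0 at infinite distance) but is not confirmed to be the formula Lingo4G applies. Excluding the point itself from its neighbors and the brute-force search are also assumptions of this sketch.

```python
import math


def knn2d_similarity(points, max_nearest_points=8):
    """Brute-force nearest neighbors in 2d with an assumed 1 / (1 + d) similarity."""
    indices, values = [], []
    for i, (xi, yi) in enumerate(points):
        nearest = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )[:max_nearest_points]
        row = sorted((j, 1.0 / (1.0 + d)) for d, j in nearest)
        indices.append([j for j, _ in row])
        values.append([s for _, s in row])
    return {"columns": len(points), "indices": indices, "values": values}


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
m = knn2d_similarity(points, max_nearest_points=1)
print(m)
```

Note that the neighbor relation is not symmetric: the third point's nearest neighbor is the second point, but not the other way round.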
The following request uses the matrix:knn2dDistanceSimilarity stage to identify clusters of nearby points in a 2d map:
The above request collects 2000 labels from documents containing the word clustering and arranges them on a 2d map using the matrix:knnVectorsSimilarity similarity. Then, it clusters the label points on the 2d map using the clusters:ap clustering algorithm and the matrix:knn2dDistanceSimilarity similarity.
If you run the above request in the JSON Sandbox app, you should see the map of labels with proximity-based clusters represented as different colors. Switch to the diagram for a visualization of the data flow in the request.
embedding2d
The input 2d points for which to find the nearest neighbors.
max​Nearest​Points
The maximum number of the nearest points to find for each point.
The larger the maxNearestPoints value, the denser the matrix and the larger the clusters you obtain from clustering that matrix.
matrix:​knn​Vectors​Similarity
Computes a label or document similarity matrix based on multidimensional vector distance.
{
"type": "matrix:knnVectorsSimilarity",
"maxNeighbors": 10,
"threads": "auto",
"vectors": {
"type": "vectors:reference",
"auto": true
}
}
You can use this stage to cluster or 2d-map a set of documents or labels based on the similarity of their corresponding embedding vectors.
For each vector in the input set of multidimensional vectors, Lingo4G finds up to maxNeighbors closest vectors in the same vector set and transfers the cosine similarities between the nearest vectors to the output matrix. Therefore, the output matrix is square, its size is equal to the number of vectors in the input vectors set and its values fall in the 0...1 range.
If you use vectors:precomputedDocumentEmbeddings as the input vectors set, this stage computes similarities between documents. Similarly, if you provide vectors:precomputedLabelEmbeddings on input, this stage computes similarities between labels.
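A brute-force version of this computation can be sketched in Python as shown below. It is an illustration only: the way Lingo4G searches for nearest vectors is not described here, and excluding each vector from its own neighbor list is an assumption of the sketch. The sample vectors are made up and non-negative, so all similarities fall in the 0...1 range.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_vectors_similarity(vectors, max_neighbors=10):
    """For each vector keep the cosine similarities of its closest neighbors."""
    indices, values = [], []
    for i, u in enumerate(vectors):
        sims = sorted(
            ((cosine(u, v), j) for j, v in enumerate(vectors) if j != i),
            reverse=True,
        )[:max_neighbors]
        row = sorted((j, s) for s, j in sims)
        indices.append([j for j, _ in row])
        values.append([s for _, s in row])
    return {"columns": len(vectors), "indices": indices, "values": values}


vectors = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
knn = knn_vectors_similarity(vectors, max_neighbors=1)
print(knn)
```

The output is square, with one row and one column per input vector, and each row holds at most max_neighbors values.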
In many cases, embedding-based similarities offer better clustering and 2d mapping results compared to their co-occurrence and keyword-based counterparts. The following request arranges 2000 labels into a 2d map where similar labels are close to each other.
max​Neighbors
The maximum number of nearest vectors to find for each vector.
The larger maxNeighbors, the denser the matrix and the larger the clusters resulting from clustering the matrix.
threads
The number of threads to use for the computation.
vectors
The set of multidimensional vectors among which to find similarities.
The size of the output square matrix is equal to the number of vectors in the input vector set.
Consumers of matrix:*
The following stages and components take matrix:* as input:
Stage or component | Property |
---|---|
clusters:ap | matrix |
embedding2d:lv | matrix |
embedding2d:lvOverlay | matrix |
matrix:elementWiseProduct | factorA, factorB |
matrixRows:fromMatrix | matrix |