matrix
The matrix:* stages group various ways of producing matrices. The semantics of matrix rows and columns depends on the specific stage. You can use matrices as input to the clustering and 2d embedding stages.
Matrices are the bridge between individual entities, such as labels or documents, and their aggregations, such as clusters or 2d maps. Clustering and 2d mapping stages do not accept documents or labels directly on input. Instead, they accept similarity matrices, which define the semantics and interpretation of clusters or 2d maps.
Matrix building, clustering and 2d mapping stages rely on a very important concept universal to the whole Lingo4G analysis API: index alignment. When you build a similarity matrix, indices of rows and columns of the matrix correspond to the indices of documents or labels you provided on input. When you perform clustering on such a matrix, you receive clusters of related indices of the input matrix. Since indices across input entities, matrices and clusters are aligned, to find out which specific labels or documents got clustered, you need to look up the indices in the list of documents or labels you provided when building the similarity matrix.
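The index alignment described above can be illustrated with a minimal Python sketch. The `clusters` value below is a hypothetical, simplified stand-in for a clustering result (a plain list of index lists); the actual Lingo4G cluster output is not shown in this section and may be structured differently.

```python
# Labels passed to the matrix-building stage; row/column index i refers to labels[i].
labels = ["clustering", "k-means", "neural network", "deep learning"]

# Hypothetical, simplified clustering result: each cluster lists matrix row indices.
clusters = [[0, 1], [2, 3]]


def resolve(clusters, entities):
    """Replace aligned indices with the entities they refer to."""
    return [[entities[i] for i in cluster] for cluster in clusters]


named_clusters = resolve(clusters, labels)
print(named_clusters)
```

The same lookup applies to documents: the only requirement is that you keep the list you provided when building the matrix.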
You can use the following matrix stages in your analysis requests:

matrix:cooccurrenceLabelSimilarity

Computes a label similarity matrix based on how the labels cooccur in the documents you provide.

matrix:direct

Returns a matrix whose contents you provide directly.

matrix:elementWiseProduct

Computes an element-wise product of two matrices.

matrix:fromMatrixRows

Collects matrix rows into an in-memory matrix.

matrix:keywordDocumentSimilarity

Computes a document similarity matrix based on the labels the documents share.

matrix:keywordLabelDocumentSimilarity

Computes similarities between a list of labels and a set of documents.

matrix:knn2dDistanceSimilarity

Computes a matrix of similarities between 2d embeddings based on the 2d Euclidean distance. You can use this matrix to identify clusters of nearby points in a 2d map.

matrix:knnVectorsSimilarity

Computes a label or document similarity matrix based on multidimensional vector distance.

matrix:reference

References the results of another matrix:* stage defined in the request.
The JSON output of the matrix stage has the following structure:

columns

The number of columns in this matrix. The number of rows is equal to the length of the indices, values and diagonals arrays.

indices

Indices of the non-zero elements of the matrix. Each element in the array corresponds to one matrix row. For each row, the nested array lists zero-based indices of the non-empty matrix elements in that row.

values

Values of the non-zero elements of the matrix. Each element in the array corresponds to one matrix row. For each row, the nested array contains values of elements at the corresponding matrix indices indicated in the indices array.

Notes:

Most matrices in Lingo4G are sparse, hence the specific JSON output format.

Some matrix rows may be all-zeros. In such cases, the corresponding indices and values arrays are empty.
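As an illustration of the sparse structure, the following Python sketch expands the columns, indices and values properties into dense rows. The sample matrix is made up for this example, and the `to_dense` helper is not part of Lingo4G.

```python
def to_dense(matrix):
    """Expand the sparse JSON structure (columns, indices, values) into dense rows."""
    dense = []
    for row_indices, row_values in zip(matrix["indices"], matrix["values"]):
        row = [0.0] * matrix["columns"]
        for col, value in zip(row_indices, row_values):
            row[col] = value
        dense.append(row)
    return dense


# Made-up 3 x 3 matrix; the second row is all-zeros, so its arrays are empty.
sparse = {
    "columns": 3,
    "indices": [[0, 2], [], [1]],
    "values": [[1.0, 0.5], [], [0.25]],
}

dense = to_dense(sparse)
for row in dense:
    print(row)
```

For large matrices, a dense expansion defeats the purpose of the sparse format; the sketch is meant only to show how the arrays relate to rows and columns.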
matrix:cooccurrenceLabelSimilarity
Computes a label similarity matrix based on how the labels cooccur in the documents you provide.
{
"type": "matrix:cooccurrenceLabelSimilarity",
"cooccurrenceWindowSize": 32,
"documents": {
"type": "documents:reference",
"auto": true
},
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labels": {
"type": "labels:reference",
"auto": true
},
"normalized": true,
"similarityWeighting": "INCLUSION",
"threads": "auto"
}
You can use the cooccurrence-based similarity matrix to cluster or 2d-map a set of labels.
Lingo4G computes the cooccurrence-based label similarity matrix in the following way:

For each pair (labelA, labelB) of input labels, Lingo4G scans the input documents to compute:

${f}_{\mathrm{A}}$: how many times labelA occurred in the input documents,

${f}_{\mathrm{B}}$: how many times labelB occurred in the input documents,

${f}_{\mathrm{AB}}$: how many times both labelA and labelB occurred in the document, at most cooccurrenceWindowSize words apart.

Based on the ${f}_{\mathrm{A}}$, ${f}_{\mathrm{B}}$ and ${f}_{\mathrm{AB}}$ frequencies and the similarityWeighting method you choose, Lingo4G computes the ${s}_{\mathrm{AB}}$ and ${s}_{\mathrm{BA}}$ similarity values and puts them in the output matrix at rows and columns corresponding to the indices of labelA and labelB in the input labels list.

For example, if on the labels list labelA is at index 0, and labelB is at index 2, Lingo4G puts the corresponding similarities at the following locations in the output matrix:

$$M=\left[\begin{array}{cccc}\cdot & \cdot & {s}_{\mathrm{AB}} & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ {s}_{\mathrm{BA}} & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \end{array}\right]$$

If normalized is true, Lingo4G globally normalizes all values in the similarity matrix to the 0...1 range.

Note that depending on the similarityWeighting you choose, the matrix may or may not be symmetrical.
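The counting and weighting steps can be sketched in Python as follows. This is a simplified illustration under several assumptions, not Lingo4G's implementation: documents are pre-tokenized lists of words, each label is a single word, "words apart" is the difference between token positions, the weighting is the textbook Dice coefficient 2·f_AB / (f_A + f_B), and global normalization divides by the largest value in the matrix.

```python
from itertools import combinations


def count_frequencies(docs, labels, window):
    """Return per-label counts f and pairwise within-window counts f_ab."""
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    f = [0] * n
    f_ab = [[0] * n for _ in range(n)]
    for tokens in docs:
        # (position, label index) for every label occurrence in this document
        hits = [(pos, index[tok]) for pos, tok in enumerate(tokens) if tok in index]
        for _, a in hits:
            f[a] += 1
        for (pos_a, a), (pos_b, b) in combinations(hits, 2):
            if a != b and abs(pos_a - pos_b) <= window:
                f_ab[a][b] += 1
                f_ab[b][a] += 1
    return f, f_ab


def dice_matrix(f, f_ab, normalized=True):
    """Dice-weighted similarity matrix aligned with the label indices."""
    n = len(f)
    m = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b and f_ab[a][b] > 0:
                m[a][b] = 2.0 * f_ab[a][b] / (f[a] + f[b])
    top = max((v for row in m for v in row), default=0.0)
    if normalized and top > 0:
        m = [[v / top for v in row] for row in m]  # global normalization
    return m


labels = ["apple", "pie", "banana"]
docs = [["apple", "pie", "banana"], ["apple", "x", "x", "x", "banana"]]
f, f_ab = count_frequencies(docs, labels, window=2)
m = dice_matrix(f, f_ab)
print(f, m)
```

In the second document, apple and banana are four positions apart, so with a window of 2 they do not count as cooccurring there.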
The following request computes cooccurrence counts for a list of labels across all documents containing at least one of the labels.
Note how the request uses the query:forLabels query component to select documents containing at least one of the input labels. We also use the COOCCURRENCES similarity weighting and disable matrix normalization to get the actual number of label cooccurrences.
A typical use case for matrix:cooccurrenceLabelSimilarity is clustering or 2d mapping of labels. The following request briefly demonstrates this use case.
The above request collects 1000 labels from documents containing the word clustering, computes a cooccurrence similarity matrix for those labels and then arranges the labels into clusters and a 2d map. The request uses the explicit output.stages array to prevent the output of the similarity matrix. Also note that the request uses auto-references to pass results between all the stages.
If you run the request in the JSON Sandbox app, you should see an interactive visualization of the clusters and the 2d map. Also have a look at the diagram tab for a graphical representation of the connections between various stages.
Finally, if your index contains label embeddings, the embedding-based matrix:knnVectorsSimilarity stage may provide better clusters and 2d maps.
cooccurrenceWindowSize
Determines the maximum number of words that can separate cooccurring labels.
For example, if you set the cooccurrence window size to 10, Lingo4G treats two labels as cooccurring if they occur in the document at most 10 words apart.
Use smaller cooccurrence windows for sparser, more focused similarity matrices.
documents
The documents in which to count label cooccurrences.
fields
The documents' feature fields in which to count label cooccurrences.
labels
The labels whose cooccurrences to count.
normalized
If true, Lingo4G globally normalizes the similarity matrix to contain values in the 0...1 range.
The embedding2d:lv stage requires normalized matrix values.
One use case for non-normalized matrix values is computing the actual label cooccurrence frequencies with similarityWeighting set to COOCCURRENCES.
similarityWeighting
Determines the binary similarity weighting Lingo4G applies to raw label cooccurrence when computing label similarity values.
In most cases, the RR, INCLUSION and BB weightings provide the best clustering and 2d mapping results.
The similarityWeighting property supports the following values:

RR

Russell-Rao similarity. Similarity values will be proportional to the raw cooccurrence counts. The RR weighting creates rather large clusters and selects frequent labels as cluster label exemplars.

INCLUSION

Inclusion coefficient similarity, emphasizes connections between labels sharing the same words, for example Mac OS and Mac OS X 10.6.

LOEVINGER

The inclusion coefficient corrected for chance.

BB

Braun-Blanquet similarity. Maximizes similarity between labels having similar numbers of occurrences. Promotes lower-frequency labels as cluster exemplars.

DICE

Dice coefficient.

YULE

Yule coefficient.

OCHIAI

Ochiai coefficient, binary cosine.

INNER_PRODUCT

Inner product of the rows of the cooccurrence matrix.
COSINE

Cosine similarity between the rows of the cooccurrence matrix.
PEARSON

Pearson correlation between the rows of the cooccurrence matrix.

COOCCURRENCES

Number of cooccurrences of labels. Set the normalized property to false to get the actual numbers in the output matrix.
threads
The number of threads to use to count label cooccurrences.
matrix:direct
A matrix where you directly provide all values in rows and columns.
{
"type": "matrix:direct",
"matrix": {
"columns": 0,
"indices": [],
"values": []
}
}
Direct matrices are useful mostly for debugging purposes or when you want to cluster or 2d-map a set of similarities coming from an external source.
matrix
Definition of the matrix.
The definition must follow the sparse matrix JSON structure.
columns
The number of columns of the matrix.
indices
Indices of non-zero elements of the matrix.
values
Values corresponding to indices.
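When the similarities come from an external source as a dense array, you need to convert them to the sparse structure first. The Python sketch below shows one way to do that; the `direct_matrix_stage` helper and the sample values are illustrative only, while the property names follow the stage definition shown above.

```python
import json


def direct_matrix_stage(dense):
    """Build a matrix:direct stage definition from a dense list of rows."""
    columns = len(dense[0]) if dense else 0
    indices, values = [], []
    for row in dense:
        # Keep only the non-zero entries of each row.
        nonzero = [(col, v) for col, v in enumerate(row) if v != 0]
        indices.append([col for col, _ in nonzero])
        values.append([v for _, v in nonzero])
    return {
        "type": "matrix:direct",
        "matrix": {"columns": columns, "indices": indices, "values": values},
    }


stage = direct_matrix_stage([[0, 0.8, 0], [0.8, 0, 0.3], [0, 0.3, 0]])
print(json.dumps(stage, indent=2))
```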
matrix:elementWiseProduct
Computes the element-by-element product of two matrices.
{
"type": "matrix:elementWiseProduct",
"factorA": null,
"factorB": null
}
factorA
Input matrix.
factorB
Input matrix.
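To make the operation concrete, here is a small Python sketch of an element-by-element product computed directly on the sparse columns, indices and values structure. It assumes both factors have the same shape, which this section does not state explicitly, and the `element_wise_product` helper is not part of Lingo4G.

```python
def element_wise_product(a, b):
    """Multiply two sparse matrices of equal shape element by element."""
    if a["columns"] != b["columns"] or len(a["indices"]) != len(b["indices"]):
        raise ValueError("matrices must have the same shape")
    indices, values = [], []
    for ia, va, ib, vb in zip(a["indices"], a["values"], b["indices"], b["values"]):
        row_b = dict(zip(ib, vb))
        # An element is non-zero in the product only if it is non-zero in both factors.
        row = [(col, v * row_b[col]) for col, v in zip(ia, va) if col in row_b]
        indices.append([col for col, _ in row])
        values.append([v for _, v in row])
    return {"columns": a["columns"], "indices": indices, "values": values}


factor_a = {"columns": 3, "indices": [[0, 2], [1]], "values": [[2.0, 3.0], [4.0]]}
factor_b = {"columns": 3, "indices": [[2], [0, 1]], "values": [[0.5], [1.0, 0.25]]}
product = element_wise_product(factor_a, factor_b)
print(product)
```

Because zero times anything is zero, the product is never denser than its sparsest factor, which makes it a convenient way to mask one similarity matrix with another.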
matrix:fromMatrixRows
Materializes matrix rows into an in-memory matrix.
{
"type": "matrix:fromMatrixRows",
"matrixRows": {
"type": "matrixRows:reference",
"auto": true
}
}
This stage is useful mostly for debugging requests involving matrix rows.
matrixRows
The matrix rows to materialize.
matrix:keywordDocumentSimilarity
Computes a document similarity matrix based on the labels the documents share.
{
"type": "matrix:keywordDocumentSimilarity",
"documents": {
"type": "documents:reference",
"auto": true
},
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labelFilter": {
"type": "labelFilter:reference",
"auto": true
},
"labelListFilter": {
"type": "labelListFilter:truncatedPhrases"
},
"minTf": 0,
"minTfMass": 1,
"tieResolution": "AUTO"
},
"maxDocumentsForSubIndex": 0.3,
"maxInMemorySubIndexSize": 8000000,
"maxNeighbors": 8,
"maxQueryLabelsPerDocument": 4,
"minQueryLabelsPerDocument": 1,
"minQueryLabelsRequiredInSimilarDocument": 1,
"normalized": true,
"threads": "auto"
}
You can use the keyword-based document similarity matrix to cluster or 2d-map a set of documents.
To compute the keyword-based document similarity, Lingo4G performs the following steps:

1. For each document in the input documents, Lingo4G uses the labelCollector you provide to extract up to maxQueryLabelsPerDocument labels that characterize the document. If the number of labels extracted from the document is smaller than minQueryLabelsPerDocument, Lingo4G excludes the document from processing. The corresponding row in the similarity matrix will be empty.

2. If the number of input documents in relation to the total number of documents in the index does not exceed maxDocumentsForSubIndex, Lingo4G creates a temporary inverted index that improves the performance of the similarity matrix building.

3. For each input document, Lingo4G builds a search query consisting of the labels extracted in step 1. Lingo4G restricts the query to find matches only among the input documents. Additionally, the query matches only documents that contain at least minQueryLabelsRequiredInSimilarDocument of the document's labels obtained in step 1.

4. For each input document, Lingo4G runs the corresponding query it built in step 3 to retrieve up to maxNeighbors matching documents. Lingo4G uses up to threads threads to execute the queries in parallel.

5. For each input document, Lingo4G builds the corresponding row of the similarity matrix using the matching documents retrieved in step 4.

For example, assuming that the query corresponding to the document at index 2 in the input documents array matched documents at index 0, 2, and 3, Lingo4G puts the following values into the similarity matrix:

$$M=\left[\begin{array}{cccc}\cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ {s}_{0} & \cdot & {s}_{2} & {s}_{3} \\ \cdot & \cdot & \cdot & \cdot \end{array}\right]$$

where ${s}_{0}$, ${s}_{2}$ and ${s}_{3}$ are the search scores obtained in step 4 for the document at index 2.

Note that matrix $M$ is square and asymmetrical: the search query for the document at index 0 or 3 may not return document 2.

6. If normalized is true, Lingo4G normalizes the values in each row of matrix $M$ to fall in the 0...1 range.

In the Apache Lucene, Solr and Elasticsearch world, this kind of similarity is also called MoreLikeThis similarity.
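The overall flow can be mimicked with a toy Python sketch. It is not how Lingo4G scores documents: each document is assumed to be a plain list of label occurrences, the "search score" is simply the number of query labels a candidate document contains, and no inverted index is involved. Parameter names mirror the stage properties for readability only.

```python
from collections import Counter


def keyword_similarity(docs, max_query_labels=4, min_query_labels=1,
                       min_required=1, max_neighbors=8, normalized=True):
    """Return one {document index: similarity} row per input document."""
    label_sets = [set(doc) for doc in docs]
    rows = []
    for doc in docs:
        # Step 1: the most frequent labels of the document form its query.
        query = [label for label, _ in Counter(doc).most_common(max_query_labels)]
        if len(query) < min_query_labels:
            rows.append({})  # excluded document, empty matrix row
            continue
        # Steps 3 and 4: toy score = number of query labels found in a candidate.
        scores = {}
        for j, candidate in enumerate(label_sets):
            shared = sum(1 for label in query if label in candidate)
            if shared > 0 and shared >= min_required:
                scores[j] = float(shared)
        best = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        row = dict(best[:max_neighbors])
        # Optional normalization of the row by its maximum value.
        if normalized and row:
            peak = max(row.values())
            row = {j: s / peak for j, s in row.items()}
        rows.append(row)
    return rows


docs = [["ml", "ai", "ml"], ["ai", "robots"], ["cooking"], []]
rows = keyword_similarity(docs)
print(rows)
```

In this toy setup, every non-empty document matches itself with the highest score, similar to the example above where the query for document 2 also returns document 2.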
The following request uses keyword documents similarity to cluster and 2d map documents.
If you run the above request in the JSON Sandbox app, you should see the documents represented as a 2d map with point color corresponding to the top-level cluster the document belongs to.
documents
The documents among which to compute keyword-based similarities.
If you provide a set of N documents, this stage produces a square N × N similarity matrix.
fields
The feature fields to use when looking for similar documents.
Lingo4G uses the feature fields you provide in this property to run the similar document search queries in step 4 of the matrix building algorithm.
labelCollector
Determines which labels to use to build similar documents search queries.
Lingo4G uses the label collector you provide in step 1 of the matrix building algorithm to extract labels describing each input document.
The default label extractor retrieves up to maxQueryLabelsPerDocument of the most frequent labels in each document. Provide a custom label collector to modify this behavior.
maxDocumentsForSubIndex
Determines the threshold for creating a temporary inverted index.
Lingo4G can significantly speed up the computation of keyword similarities for a small set of documents by creating and querying a temporary disposable inverted index containing just the input documents. Lingo4G creates the temporary index only when the number of input documents divided by the total number of documents in the index is smaller than or equal to the value of this property.
For example, if maxDocumentsForSubIndex is 0.3 and the documents set contains at most 30% of all documents in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.
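The threshold rule amounts to a single ratio comparison. The Python sketch below restates it; the `uses_temporary_index` function is a hypothetical helper written for this illustration and assumes a non-empty index.

```python
def uses_temporary_index(input_docs, total_docs, max_documents_for_sub_index=0.3):
    """Return True when the input set is small enough to warrant a temporary index."""
    return input_docs / total_docs <= max_documents_for_sub_index


# 2,000 of 10,000 documents is 20% of the index: below the 30% threshold.
small_selection = uses_temporary_index(2_000, 10_000)
# 5,000 of 10,000 documents is 50% of the index: above the threshold.
large_selection = uses_temporary_index(5_000, 10_000)
print(small_selection, large_selection)
```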
maxInMemorySubIndexSize
Maximum size of the in-memory temporary index, in bytes.
If the size of the temporary index exceeds maxInMemorySubIndexSize, Lingo4G materializes the index on disk in a temporary directory.
maxNeighbors
The maximum number of similar documents to retrieve for each input document.
Lingo4G uses the maxNeighbors property in step 4 of the similarity matrix building algorithm to determine the maximum number of similar documents to retrieve for each row of the similarity matrix. Therefore, this stage produces matrices whose rows contain at most maxNeighbors values.
Increasing the number of similar documents produces similarity matrices that give rise to larger clusters and tighter 2d maps. Conversely, lowering the number of similar documents gives rise to smaller clusters and more sparse 2d maps.
maxQueryLabelsPerDocument
The maximum number of labels to use to build the similar documents search query.
Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each input document.
Increasing maxQueryLabelsPerDocument, coupled with a larger maxNeighbors value, produces broader, more general similarity matrices that usually give rise to larger clusters and more dense 2d maps.
Lingo4G ignores the maxQueryLabelsPerDocument property if you set a custom labelCollector.
minQueryLabelsPerDocument
The minimum number of labels the document must contain to be included in the similarity matrix computation.
Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each input document. If a document contains fewer than minQueryLabelsPerDocument labels, Lingo4G excludes it from processing, leaving the corresponding row in the similarity matrix empty.
If you want to exclude from further processing documents containing just one label, increase minQueryLabelsPerDocument beyond the default value of 1. Clusters arising from such similarity matrices are usually smaller and 2d maps are more sparse.
If you increase minQueryLabelsPerDocument, make sure to set maxQueryLabelsPerDocument to a value equal to or greater than minQueryLabelsPerDocument.
Lingo4G ignores the minQueryLabelsPerDocument property if you set a custom labelCollector.
minQueryLabelsRequiredInSimilarDocument
The minimum number of common labels required for two documents to be treated as similar.
Lingo4G uses this property in step 3 of the similarity matrix building algorithm. If you increase minQueryLabelsRequiredInSimilarDocument beyond the default value of 1, Lingo4G removes from the similarity matrix those document pairs that have fewer than minQueryLabelsRequiredInSimilarDocument labels in common.
If you don't want to base document similarities on a single label shared between documents, increase minQueryLabelsRequiredInSimilarDocument beyond the default value of 1. Clusters arising from such similarity matrices are usually smaller and 2d maps are more sparse.
normalized
If true, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.
If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.
If you plan to feed the similarity matrix to the embedding2d:lv 2d embedding algorithm, make sure the normalized property is true. Otherwise, Lingo4G throws an error to prevent incorrect 2d embedding results.
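The row-wise normalization described above can be written in a few lines of Python. The sketch operates on the values array of the sparse matrix structure; leaving empty rows untouched is an assumption made for this example.

```python
def normalize_rows(values):
    """Divide every entry of each sparse row by that row's maximum value."""
    normalized = []
    for row in values:
        peak = max(row, default=0.0)
        normalized.append([v / peak for v in row] if peak > 0 else list(row))
    return normalized


scores = [[4.0, 2.0, 1.0], [], [0.5]]
result = normalize_rows(scores)
print(result)
```

Note that after this kind of normalization the largest value in every non-empty row is exactly 1, regardless of how the raw scores compare between rows.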
threads
The number of concurrent threads to use to execute document similarity search queries.
matrix:keywordLabelDocumentSimilarity
Computes a rectangular similarity matrix between a list of labels and a set of documents you provide.
{
"type": "matrix:keywordLabelDocumentSimilarity",
"documents": {
"type": "documents:reference",
"auto": true
},
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labels": {
"type": "labels:reference",
"auto": true
},
"maxSimilarDocumentsPerLabel": 5,
"threads": "auto"
}
You can use this stage to overlay labels on a pre-existing 2d map of documents in such a way that labels summarize the documents lying in various areas of the map.
Lingo4G computes the label-document similarity matrix in the following way:

1. For each label from the input labels list, Lingo4G builds a search query consisting of that label, covering the fields you provide and limited to the set of target documents.

2. For each input label, Lingo4G executes the search query it built in step 1 and collects up to maxSimilarDocumentsPerLabel documents. Then, Lingo4G sets the value in the output similarity matrix at the row corresponding to the label index and the column corresponding to the index of the matching document to equal the search score of the matching document.

The following request uses the matrix:keywordLabelDocumentSimilarity stage to describe a 2d map of documents by placing labels near groups of documents related to those labels.
See the reference documentation for the embedding2d:lvOverlay stage for an in-depth explanation of this kind of request.
documents
The documents for which to compute the similarity matrix.
Column indices in the output matrix correspond to indices on the document list you provide.
fields
The document fields to search for labels when building the similarity matrix.
labels
The labels for which to build the similarity matrix.
Row indices in the output matrix correspond to indices on the label list you provide.
maxSimilarDocumentsPerLabel
The maximum number of similar documents to retrieve for each label.
Each row of the output matrix has at most maxSimilarDocumentsPerLabel non-zero values.
threads
The number of parallel threads to use when building the similarity matrix.
matrix:knn2dDistanceSimilarity
Computes a matrix of similarities based on Euclidean distance between 2d points.
{
"type": "matrix:knn2dDistanceSimilarity",
"embedding2d": {
"type": "embedding2d:reference",
"auto": true
},
"maxNearestPoints": 8
}
You can use this stage to identify clusters of nearby points in a 2d map.
Lingo4G computes the matrix in the following way:

1. For each 2d point $p$ in the input embedding2d, find up to maxNearestPoints nearest points with respect to the Euclidean distance.

2. For each nearest point ${p}_{n}$ found in step 1, compute the similarity using the following formula:

$$s=\frac{1}{1+{e}^{d}}$$

where $d$ is the Euclidean distance between points $p$ and ${p}_{n}$.

The above similarity formula converts 2d distances to the 0...1 range in such a way that the similarity for points with zero distance is 1 and similarity for points that are infinitely apart is 0.

3. Put similarity $s$ computed in step 2 into the similarity matrix at the row and column location corresponding to points $p$ and ${p}_{n}$.

The following request uses the matrix:knn2dDistanceSimilarity stage to identify clusters of nearby points in a 2d map:
The above request collects 2000 labels from documents containing the word clustering and arranges them on a 2d map using the matrix:knnVectorsSimilarity similarity. Then, it clusters the label points on the 2d map using the clusters:ap clustering algorithm and the matrix:knn2dDistanceSimilarity similarity.
If you run the above request in the JSON Sandbox app, you should see the map of labels with proximity-based clusters represented as different colors. Switch to the diagram for a visualization of the data flow in the request.
embedding2d
The input 2d points for which to find the nearest neighbors.
maxNearestPoints
The maximum number of the nearest points to find for each point.
The larger the maxNearestPoints value, the denser the matrix and the larger the clusters you obtain from clustering that matrix.
matrix:knnVectorsSimilarity
Computes a label or document similarity matrix based on multidimensional vector distance.
{
"type": "matrix:knnVectorsSimilarity",
"maxNeighbors": 10,
"threads": "auto",
"vectors": {
"type": "vectors:reference",
"auto": true
}
}
You can use this stage to cluster or 2d-map a set of documents or labels based on the similarity of their corresponding embedding vectors.
For each vector in the input set of multidimensional vectors, Lingo4G finds up to maxNeighbors closest vectors in the same vector set and transfers the cosine similarities between the nearest vectors to the output matrix. Therefore, the output matrix is square, its size is equal to the number of vectors in the input vectors set and values fall in the 0...1 range.
If you use vectors:precomputedDocumentEmbeddings as the input vectors set, this stage computes similarities between documents. Similarly, if you provide vectors:precomputedLabelEmbeddings on input, this stage computes similarities between labels.
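A brute-force Python sketch of the nearest-vectors idea is shown below. It makes several assumptions that this section does not confirm: a vector is not its own neighbor, the exhaustive search stands in for whatever nearest-neighbor index Lingo4G uses internally, and the raw cosine similarity is stored without any rescaling. The result uses the sparse columns, indices and values structure described earlier.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def knn_similarity(vectors, max_neighbors):
    """Keep, for each vector, the cosine similarities to its closest vectors."""
    indices, values = [], []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        nearest = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_neighbors]
        nearest.sort(key=lambda item: item[1])  # store columns in ascending order
        indices.append([j for _, j in nearest])
        values.append([s for s, _ in nearest])
    return {"columns": len(vectors), "indices": indices, "values": values}


vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
matrix = knn_similarity(vectors, max_neighbors=1)
print(matrix)
```

With one neighbor per vector, the first vector points to the second one, while the second vector points to the third: nearest-neighbor matrices of this kind are not necessarily symmetrical.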
In many cases, embedding-based similarities offer better clustering and 2d mapping results compared to their cooccurrence and keyword-based counterparts. The following request arranges 2000 labels into a 2d map where similar labels are close to each other.
maxNeighbors
The maximum number of nearest vectors to find for each vector.
The larger maxNeighbors, the denser the matrix and the larger the clusters resulting from clustering the matrix.
threads
The number of threads to use for the computation.
vectors
The set of multidimensional vectors among which to find similarities.
The size of the output square matrix is equal to the number of vectors in the input vector set.
Consumers of matrix:*
The following stages and components take matrix:* as input:

Stage or component          Property

clusters:ap                 matrix
embedding2d:lv              matrix
embedding2d:lvOverlay       matrix
matrix:elementWiseProduct   factorA, factorB
matrixRows:fromMatrix       matrix