Similarity matrices
Similarity matrices are crucial in clustering or 2D mapping of labels and documents. This tutorial explores different types of similarity matrices and how to use them in your Lingo4G analysis requests.
- Lingo4G analysis request JSON: How to construct analysis request specifications.
- Document selection: How to select documents for processing.
Similarity matrices define relationships between entities, such as labels or documents. For example, the following square matrix shows mutual similarities among 5 labels:
| | clustering algorithm | k-means | DBSCAN | nanomaterials | MWCNTs |
|---|---|---|---|---|---|
| clustering algorithm | - | 0.84 | 0.79 | - | - |
| k-means | 0.84 | - | 0.78 | - | - |
| DBSCAN | 0.79 | 0.78 | - | - | - |
| nanomaterials | 0.52 | - | - | - | 0.83 |
| MWCNTs | - | 0.54 | - | 0.83 | - |
Inspecting the first row of the matrix reveals that the labels most similar to clustering algorithm are k-means and DBSCAN, which are names of popular clustering algorithms. Notice that in the row for MWCNTs, which stands for "Multi-Walled Carbon Nanotubes", nanomaterials has a much higher value (0.83) than k-means (0.54). Lingo4G included k-means in the MWCNTs row only because we forced it to output two similar labels for each row, while the input contained only two labels related to nanomaterials.
A few more things to note about similarity matrices:
- Range and semantics of values. The range and semantics of values in a similarity matrix depend on the specific matrix:* stage that produced the matrix. The example label similarities shown above come from the matrix:knnVectorsSimilarity stage, which uses multidimensional vectors to compute similarity. This specific stage produces values in the 0...1 range, but other stages may produce different ranges.
- Sparsity. Most matrix stages in Lingo4G produce sparse matrices. This means that the matrices don't define the similarities between all pairs of entities. Instead, for each row, the matrix contains a certain number of entities most similar to the row's entity (k nearest neighbors). The maximum number of neighbors for each row is determined by a stage property usually called maxNeighbors.
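To make the sparsity idea concrete, here is a minimal Python sketch of a k-nearest-neighbors similarity matrix. It is an illustration only and not Lingo4G's implementation: the cosine measure, the toy vectors, and the function names are assumptions, and the max_neighbors parameter merely mirrors the role of the maxNeighbors property.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_similarity_rows(vectors, max_neighbors):
    """Keep, for each vector, only its max_neighbors most similar other vectors.

    Returns one dict per row mapping column index -> similarity. Pairs that
    are not among the nearest neighbors are absent, which makes the matrix sparse.
    """
    rows = []
    for i, v in enumerate(vectors):
        scored = [(cosine(v, w), j) for j, w in enumerate(vectors) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))  # most similar first
        rows.append({j: sim for sim, j in scored[:max_neighbors]})
    return rows


# Toy 2-dimensional vectors, made up for illustration.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]]
rows = knn_similarity_rows(vectors, max_neighbors=2)
for i, row in enumerate(rows):
    print(i, row)
```

Each row holds exactly two entries, even for vectors whose second-best neighbor is only weakly related. This is the same effect that places a low-valued label next to a high-valued one in the table above.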
You can compute the label similarity matrix shown above using Lingo4G by following these steps:
- If you haven't followed the initial Quick start tutorial, complete these steps:
- Open the Lingo4G JSON Sandbox app in a modern browser.
- Paste the following request and press the Execute button:
- The similarities section of the result will contain the similarity matrix. See the matrix output reference for the description of the JSON encoding of matrices in Lingo4G.
Document similarities
Most of your Lingo4G analysis requests will use matrices to cluster and 2d-map documents. In the following sections, we'll explore different kinds of document-to-document similarities available in Lingo4G.
Keyword similarity
Document similarity based on shared keywords and phrases is the most straightforward and easy to understand.
Let's see how it works by building a request that performs the following:
- select documents containing the clustering word,
- compute the similarity matrix based on common phrases,
- create a 2d map of the documents from the similarity matrix.
Let's start with a document selector that selects documents matching the clustering query:
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    }
  }
}
These results alone are not very useful, so let's add the similarity stage to compute the similarities among the selected documents:
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "similarities": {
      "type": "matrix:keywordDocumentSimilarity"
    }
  },
  "output": {
    "stages": [
      "documents"
    ]
  }
}
This request uses the matrix:keywordDocumentSimilarity stage to compute similarity based on shared labels. It produces a square similarity matrix with rows and columns corresponding to the document set you provide in the documents property. Our request relies on Lingo4G's auto reference resolution to resolve this property automatically.
The request also introduces an output section to prevent outputting the raw similarity matrix, which is rarely useful alone and increases response size.
Finally, let's add the 2dMap stage to create a 2d map from the similarity matrix:
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "similarities": {
      "type": "matrix:keywordDocumentSimilarity"
    },
    "2dMap": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:reference",
        "use": "similarities"
      }
    }
  },
  "output": {
    "stages": [
      "documents",
      "2dMap"
    ]
  }
}
The 2d embedding is computed by the embedding2d:lv stage. Note that this stage accepts a similarity matrix rather than documents. This means that embedding2d:lv is agnostic to the specific type of entities it processes. It is the input similarity matrix that defines the semantics and interpretation of the 2d embeddings. Let us state this once again for clarity:
When you build a similarity matrix, indices of rows and columns correspond to indices of the input documents or labels. Performing 2d mapping on this matrix provides 2d coordinates corresponding to the matrix rows. Indices across input entities, matrices, and 2d coordinates are aligned, so to determine the 2d coordinates of a document, look up its index in the document list and then find the same index in the 2d coordinates array.
Let us illustrate this further by looking at the results of the request we built. Copy and paste the request into the JSON Sandbox app and press the Execute button.
In the JSON results tab, you should see output similar to the following (only the first 5 elements of each array shown for brevity):
{
  "result" : {
    "documents" : {
      "matches" : {
        "value" : 20057,
        "relation" : "GREATER_OR_EQUAL"
      },
      "documents" : [
        {
          "id" : 14539,
          "weight" : 5.828844
        },
        {
          "id" : 277469,
          "weight" : 5.8251066
        },
        {
          "id" : 218609,
          "weight" : 5.8239636
        },
        {
          "id" : 216337,
          "weight" : 5.8107133
        },
        {
          "id" : 192190,
          "weight" : 5.785284
        }
      ]
    },
    "2dMap" : {
      "points" : [
        {
          "x" : -7.236797,
          "y" : -5.8753614
        },
        {
          "x" : 1.736527,
          "y" : -7.3485107
        },
        {
          "x" : -6.2543592,
          "y" : -4.847388
        },
        {
          "x" : -1.7186593,
          "y" : -11.761698
        },
        {
          "x" : -5.1737633,
          "y" : -6.1466236
        }
      ]
    }
  }
}
Because of index alignment, the 2d coordinates of document 14539, located at index 0 of the documents array, are available at index 0 of the points array.
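The index-based lookup can be sketched in a few lines of Python. The response dictionary below is a trimmed copy of the result shown above (first two entries only), and the helper function name is hypothetical.

```python
# Trimmed copy of the analysis response shown above (first two entries only).
response = {
    "result": {
        "documents": {
            "documents": [
                {"id": 14539, "weight": 5.828844},
                {"id": 277469, "weight": 5.8251066},
            ]
        },
        "2dMap": {
            "points": [
                {"x": -7.236797, "y": -5.8753614},
                {"x": 1.736527, "y": -7.3485107},
            ]
        },
    }
}


def coordinates_by_document_id(response):
    """Pair each document with the 2d point found at the same array index."""
    result = response["result"]
    documents = result["documents"]["documents"]
    points = result["2dMap"]["points"]
    if len(documents) != len(points):
        raise ValueError("documents and points are expected to be index-aligned")
    return {doc["id"]: (pt["x"], pt["y"]) for doc, pt in zip(documents, points)}


coords = coordinates_by_document_id(response)
print(coords[14539])
```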
If you switch to the documents map tab of the results area, you should see all the document points visualized:
Although the 2d map is currently unlabeled, you can clearly see areas where documents cluster globally and locally.
Refer to the documentation for the matrix:keywordDocumentSimilarity stage for a detailed description of its similarity algorithm. Feel free to experiment with the stage properties, such as maxNeighbors or minQueryLabelsRequiredInSimilarDocument, to see their impact on the 2d embedding.
Embedding similarity
Let's now modify the previous request by swapping the matrix:keywordDocumentSimilarity stage for the matrix:knnVectorsSimilarity stage, which uses embedding vectors to compute similarities.
The matrix:knnVectorsSimilarity stage computes a square similarity matrix for the list of vectors you provide. In our example, we use the vectors:precomputedDocumentEmbeddings stage, which returns document embedding vectors corresponding to the list of documents you provide in its documents property. We rely on Lingo4G's auto reference resolution mechanism to resolve that property automatically.
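The swap can be sketched as a small Python helper that edits the keyword similarity request from the previous section. The stage type names and the vectors property come from this tutorial. The helper name is hypothetical, and leaving all other stage properties at their defaults is an assumption.

```python
import copy
import json

# The keyword similarity request built in the previous section.
keyword_request = {
    "stages": {
        "documents": {
            "type": "documents:byQuery",
            "query": {"type": "query:string", "query": "clustering"},
        },
        "similarities": {"type": "matrix:keywordDocumentSimilarity"},
        "2dMap": {
            "type": "embedding2d:lv",
            "matrix": {"type": "matrix:reference", "use": "similarities"},
        },
    },
    "output": {"stages": ["documents", "2dMap"]},
}


def use_embedding_similarity(request):
    """Return a copy of the request with the similarities stage swapped.

    Only the similarities stage changes; the documents and 2dMap stages
    stay as they were, because 2dMap consumes whatever matrix it is given.
    """
    modified = copy.deepcopy(request)
    modified["stages"]["similarities"] = {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {"type": "vectors:precomputedDocumentEmbeddings"},
    }
    return modified


embedding_request = use_embedding_similarity(keyword_request)
print(json.dumps(embedding_request, indent=2))
```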
If you run the modified request in the JSON Sandbox app, you'll notice that the 2d maps resulting from embedding-based similarities are more tightly clustered compared to the keyword similarity based maps.
If your index contains externally-computed embedding vectors (most likely from a Large Language Model), you can use those embeddings instead of Lingo4G's built-in embeddings. Simply swap the vectors:precomputedDocumentEmbeddings stage for the vectors:fromVectorField stage, providing the name of the document field containing the embedding vectors.
{
  "similarities": {
    "type": "matrix:knnVectorsSimilarity",
    "vectors": {
      "type": "vectors:fromVectorField",
      "fieldName": "embedding"
    }
  }
}
Content field similarity
Let's explore another similarity type — this time based on arbitrary per-document search queries. This type of similarity connects documents based on the equal values of one or more content fields.
The arXiv example project, on which we run our example requests, contains the set field. For each document, this field defines the top-level part of the paper's arXiv category, such as cs for Computer Science or astro-ph for Astrophysics.
Let's analyze a request that 2d maps documents by the search query-based similarity:
Let's break down the highlighted fragment into pieces:
- The matrixRows:byQuery component computes the rows of the query-based similarity matrix. For each row, corresponding to one document from the documents list, the component builds and executes a document-specific search query (see below). The results of the query are the row document's neighbors, which give rise to the values of that matrix row.
- The queryBuilder:string component builds the document-specific search query.
  The query property defines the query template. The template can contain variable references (denoted by angle brackets) that the query builder fills in with values of the indexed fields of the document. For multivalued fields, Lingo4G joins the values with the operator you define, OR by default.
  In our example, the query builder declares one variable, SET, and binds that variable to the values of the document's set field. As a result, for each document, the query returns other documents that share the largest number of set field values with the document being processed.
- The matrix:fromMatrixRows stage materializes the rows from the matrixRows:byQuery component into a complete matrix required by the 2d mapping stage.

For the curious: matrix:* vs matrixRows:*
Until now, all stages generating matrices were of type matrix:*. This request, however, uses a component of the matrixRows:* type. The distinction comes from the fact that certain algorithms, such as clustering or 2d mapping, require a complete matrix on input, while others, like contrast score computation, can process matrix rows one-by-one. To cater to the latter group of algorithms, Lingo4G implements the same similarity computation methods both as matrix:* stages and as matrixRows:* components. They compute the same results, but the latter do not materialize the whole similarity matrix, significantly reducing memory usage in algorithms that can consume similarities row-by-row.
Since the by-query similarity is currently only available in the matrixRows:* form, and we're performing 2d embedding, which requires a materialized matrix, we wrap our matrixRows:byQuery component with the matrix:fromMatrixRows stage, which materializes the rows into the full matrix suitable to pass to the embedding2d:lv stage for 2d embedding.
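To illustrate how a query template with angle-bracket variables might expand into a document-specific query, here is a minimal Python sketch. It is not Lingo4G code: the build_query function, the set:(<SET>) template, the quoting of values, and the sample document are all assumptions made for illustration.

```python
import re


def build_query(template, bindings, document, operator="OR"):
    """Fill <VARIABLE> references in the template with document field values.

    bindings maps a variable name to a document field name. Values of
    multivalued fields are joined with the given operator.
    """
    def substitute(match):
        field = bindings[match.group(1)]
        values = document.get(field, [])
        if not isinstance(values, (list, tuple)):
            values = [values]  # treat a single value as a one-element list
        return f" {operator} ".join(f'"{value}"' for value in values)

    return re.sub(r"<([A-Za-z_][A-Za-z0-9_]*)>", substitute, template)


# A made-up document with two values in its set field.
document = {"set": ["cs", "stat"]}
query = build_query("set:(<SET>)", {"SET": "set"}, document)
print(query)
```

Under these assumptions, the expanded query matches documents that have any of the listed set values. A search engine would typically rank documents matching more of the values higher, which fits the "share the largest number of set field values" behavior described above.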
If you run the request in the JSON Sandbox app, you should see a result similar to the following:
As expected, since the similarity function is restricted to a nominal field with a limited number of distinct values, the 2d map contains groups corresponding to the individual values of the set field.
The content field-based similarity has limited value on its own but can be part of a composite similarity function together with some content-based similarity method, such as keyword or embedding similarity.
Composite similarity
Let's try the matrixRows:composite component to fuse different similarity matrices, allowing us to 2d map or cluster documents based on multiple criteria.
This request computes a composite similarity matrix that fuses two similarity functions: embedding and keyword-based similarity.
Let's break down the similarity computation part:
- The matrixRows:composite component fuses two or more similarity matrix row sources. In our example, we use two similarity matrix components:
  - matrixRows:byQuery produces similarities based on the number of shared content field values.
    We wrap the matrixRows:byQuery component in the matrixRows:weighted component to apply some weighting to the content field-based similarity. This allows us to increase or decrease the impact of this specific similarity on the final similarity matrix. In the example, we weigh the content field similarity by 0.1, leaving the embedding similarity unweighted (which is equivalent to a weight of 1.0).
  - matrixRows:knnVectorsSimilarity produces similarities based on the document's multidimensional vectors.
- Since the 2d mapping stage requires a materialized similarity matrix, we wrap the matrixRows:composite in the matrix:fromMatrixRows stage to collect the matrix rows produced by the composite component into a full matrix.
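The effect of weighting and fusing can be sketched in Python for a single matrix row. The tutorial does not state how matrixRows:composite combines its sources, so the summation below is an assumption, and the function names and similarity values are made up for illustration.

```python
def weighted(row, weight):
    """Scale every similarity in a sparse row (column index -> similarity)."""
    return {col: sim * weight for col, sim in row.items()}


def composite(*rows):
    """Fuse sparse rows by adding up similarities per column (assumed rule)."""
    fused = {}
    for row in rows:
        for col, sim in row.items():
            fused[col] = fused.get(col, 0.0) + sim
    return fused


# Hypothetical row for one document: neighbors found by each similarity source.
field_row = {1: 1.0, 2: 1.0}      # documents sharing a content field value
embedding_row = {2: 0.8, 3: 0.7}  # documents with similar embedding vectors

fused = composite(weighted(field_row, 0.1), embedding_row)
print(fused)
```

In this sketch, document 2 benefits from both sources, while document 1 enters the row only through the down-weighted field similarity. Setting the weight to 0.0 removes the field-based contribution entirely.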
Additionally, the request adds the clustersBySet stage, which groups the documents by the first value of their set field. This helps us track the impact of the content field similarity on the final result.
If you run the request in the JSON Sandbox app, you should see a result like this:
To see the impact of the set field similarity on the result, temporarily lower the weight property from 0.1 to 0.0 and re-run the analysis.
By comparing the screenshots, you can see that the similarity based on the shared set field values helps bring the separate brown areas together while still maintaining the local groupings.
Label similarities
Let's switch our attention to processing labels. Similar to documents, if we create a matrix that represents similarities between labels, we can use the same 2d mapping and clustering algorithms to 2d map and cluster labels.
The following request demonstrates the two label similarity computation methods available in Lingo4G:
Let's break the request down into individual stages:
- The labels stage uses the labels:fromDocuments stage to extract 500 labels that best characterize the input documents (documents matching the clustering query). Subsequent stages compute similarities between the labels and arrange them on 2d maps.
- The 2dMapByCooccurrences stage computes the 2d coordinates for the labels based on co-occurrence similarity. The matrix:cooccurrenceLabelSimilarity stage counts the number of times each pair of labels co-occurs in the set of documents you provide and uses that number to compute the similarity value. In our request, we count the co-occurrences across all documents matching the clustering query.
  You can adjust the cooccurrenceWindowSize property to determine how far apart the labels can be to be counted as co-occurring. The similarityWeighting property offers various binary similarity weightings to apply to the raw co-occurrence count to arrive at the final similarity value.
- The 2dMapByEmbeddings stage computes the 2d map of the labels based on the embedding vector similarity. Notice that we use the same stage, matrix:knnVectorsSimilarity, to compute similarities between both documents and labels. The stage accepts a list of vectors to use for computation and this time we provide it with vectors:precomputedLabelEmbeddings, the embedding vectors corresponding to the list of labels produced by the labels stage.
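The windowed co-occurrence counting described above can be sketched in Python as follows. This is a simplified illustration, not Lingo4G's algorithm: it assumes single-token labels, measures the window in tokens, and stops at raw counts without applying any similarity weighting. The function name and sample documents are made up.

```python
from collections import Counter


def cooccurrence_counts(documents, labels, window_size):
    """Count label pairs that occur at most window_size tokens apart.

    documents is a list of token lists. Labels are single tokens here.
    """
    label_set = set(labels)
    counts = Counter()
    for tokens in documents:
        hits = [(pos, tok) for pos, tok in enumerate(tokens) if tok in label_set]
        for a, (pos_a, label_a) in enumerate(hits):
            for pos_b, label_b in hits[a + 1:]:
                if pos_b - pos_a > window_size:
                    break  # hits are in position order, later ones are farther away
                if label_a != label_b:
                    counts[tuple(sorted((label_a, label_b)))] += 1
    return counts


documents = [
    ["k-means", "and", "dbscan", "are", "clustering", "algorithms"],
    ["dbscan", "handles", "noise", "unlike", "k-means"],
]
labels = ["k-means", "dbscan", "clustering"]

counts = cooccurrence_counts(documents, labels, window_size=3)
wide_counts = cooccurrence_counts(documents, labels, window_size=4)
print(counts)
print(wide_counts)
```

Widening the window from 3 to 4 tokens lets more distant label pairs count as co-occurring, which is the kind of effect the cooccurrenceWindowSize property controls.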
If you run the above request in the JSON Sandbox app, you should see results similar to the following. Use the combo box at the top of the map to switch between the two 2d maps produced by the request.
Exactly like with documents, the results of the labels, 2dMapByCooccurrences, and 2dMapByEmbeddings stages are index-aligned. This means that the 2d coordinates found at index 0 of the result array correspond to the label at index 0 in the labels stage results array.
Compared to co-occurrence similarity, embedding vector similarity usually creates smaller, tighter clusters of labels.
Label-document similarities
All the 2d maps produced in the document similarities section contained 2d points corresponding only to documents. To make the maps more useful, let's add some labels to describe various areas of the maps. To do that, we'll need to create a labels-to-documents similarity matrix.
The following request extends the document embeddings similarity request by adding a label overlay on top of the document 2d map.
Compared to the original request, this request introduces the following changes:
- The labels stage extracts 500 labels that best describe the input documents.
- The 2dMapLabels stage uses embedding2d:lvOverlay to overlay labels on an existing 2d map of documents. In the embedding2d property, we provide a reference to the 2d document map we want to put labels on.
  To position labels on top of an existing 2d embedding, Lingo4G needs to know which documents are similar to each label. These similarities form a rectangular similarity matrix with rows corresponding to the labels we want to put on the map and columns corresponding to the documents that are already present on the 2d map. The matrix:keywordLabelDocumentSimilarity stage produces exactly this kind of rectangular similarity matrix.
- The document similarity matrix computation is now inlined into the 2dMap stage specification.
Following the index alignment principle, the 2d point coordinates in the 2dMap stage are index-aligned with the results of the documents stage. Similarly, the 2d points array returned by the 2dMapLabels stage is index-aligned with the labels stage results.
If you execute the above request in the JSON Sandbox app, you should see the 2d documents annotated with labels.
The matrix:keywordLabelDocumentSimilarity stage uses the keyword matching method to produce the labels-to-documents similarity matrix. Let's use the matrixRows:knnVectorsSimilarity component to generate a similar matrix using embedding vector similarity.
The only change we made to the previous request is the highlighted part, which computes the labels-to-documents similarity. The matrixRows:knnVectorsSimilarity component allows us to specify the row and column vectors separately to build the rectangular similarity matrix. The 2d map overlay stage requires labels as rows and documents as columns, so we pass the appropriate vector sets in the rows and columns properties. Feel free to run the modified request in the JSON Sandbox app to compare the two approaches to label-to-document similarity.
2d distance similarities
In the previous sections, we used various similarities to create 2d maps of documents and labels. In this final section, we'll explore another type of similarity, matrix:knn2dDistanceSimilarity, to identify separate areas on the 2d maps.
The following request adds an extra stage to cluster the points on the 2d document map.
The extra stage uses the clusters:cd clustering paired with the matrix:knn2dDistanceSimilarity stage, which computes similarities based on the Euclidean distances in the 2d space. This ensures that the clusters group points that are close in the 2d space, rather than the original multidimensional document space.
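As a rough illustration of distance-based similarity in the 2d space, the Python sketch below keeps the nearest points for each 2d point and turns distances into similarities. The 1 / (1 + distance) conversion, the function name, and the sample points are assumptions. The tutorial does not describe how Lingo4G maps distances to similarity values.

```python
import math


def knn_2d_similarity_rows(points, max_nearest_points):
    """For each 2d point, keep its nearest points and convert distance to similarity.

    Similarity is computed as 1 / (1 + Euclidean distance), an assumed
    conversion that yields 1.0 for identical points and decreases with distance.
    """
    rows = []
    for i, (x1, y1) in enumerate(points):
        distances = sorted(
            (math.hypot(x1 - x2, y1 - y2), j)
            for j, (x2, y2) in enumerate(points)
            if j != i
        )
        rows.append({j: 1.0 / (1.0 + d) for d, j in distances[:max_nearest_points]})
    return rows


# Two visually separate groups of points on a made-up 2d map.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
rows = knn_2d_similarity_rows(points, max_nearest_points=2)
for i, row in enumerate(rows):
    print(i, row)
```

Points within the same group end up with high mutual similarities and only weak links to the other group, which is the structure a clustering algorithm can use to separate areas of the map.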
If you run the above request in JSON Sandbox, you should see a result similar to the following:
With clusters present in the analysis result, JSON Sandbox automatically assigns different colors to each top-level document cluster.
Feel free to experiment with the linkDensityThreshold and maxNearestPoints properties to see the impact on the number and structure of clusters.
Further reading
This wraps up our exploration of the similarity matrix computation in Lingo4G. For further information, see the API reference documentation of the following stages and components:
- matrix:*: computing materialized similarity matrices,
- matrixRows:*: computing rows of similarity matrices,
- clusters:*: clustering algorithms that consume similarity matrices,
- embedding2d:*: 2d mapping algorithms that consume similarity matrices,
- documents:contrastScore, clusters:fromMatrixColumns: algorithms that consume similarity matrix rows (not covered in this tutorial).