Similarity matrices

Similarity matrices are crucial in clustering or 2D mapping of labels and documents. This tutorial explores different types of similarity matrices and how to use them in your Lingo4G analysis requests.

Prerequisites

Similarity matrices define relationships between entities, such as labels or documents. For example, the following square matrix shows mutual similarities among 5 labels:

                      clustering algorithm  k-means  DBSCAN  nanomaterials  MWCNTs
clustering algorithm  -                     0.84     0.79    -              -
k-means               0.84                  -        0.78    -              -
DBSCAN                0.79                  0.78     -       -              -
nanomaterials         0.52                  -        -       -              0.83
MWCNTs                -                     0.54     -       0.83           -

Inspecting the first row of the matrix reveals that the labels most similar to clustering algorithm are k-means and DBSCAN, which are names of popular clustering algorithms. Notice that in the nanomaterials row, MWCNTs, which stands for "Multi-Walled Carbon Nanotubes", has a much higher value (0.83) than clustering algorithm (0.52). Lingo4G included clustering algorithm in the nanomaterials row only because we forced it to output two similar labels for each row, while the input contained only two labels related to nanomaterials.

A few more things to note about similarity matrices:

  • Range and semantics of values. The range and semantics of values in a similarity matrix depend on the specific matrix:* stage that produced the matrix.

    The example label similarities shown above come from the matrix:knnVectorsSimilarity stage, which uses multidimensional vectors to compute similarity. This specific stage produces values in the 0...1 range, but other stages may produce different ranges.

  • Sparsity. Most matrix stages in Lingo4G produce sparse matrices. This means that the matrices don't define the similarities between all pairs of entities. Instead, for each row, the matrix contains a certain number of entities most similar to the row's entity (k nearest neighbors). The maximum number of neighbors for each row is determined by a stage property usually called maxNeighbors.
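To make the k-nearest-neighbors idea concrete, here is a minimal Python sketch of how a sparse similarity matrix could be built from vectors. It is not Lingo4G's implementation: the cosine measure, the knn_similarity_rows function and the sample vectors are assumptions made for illustration. Only the idea of limiting each row to a maximum number of neighbors comes from the description above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_similarity_rows(vectors, max_neighbors):
    """For each row, keep only the max_neighbors most similar other entities.

    Returns a list of rows; each row is a list of (column_index, similarity)
    pairs sorted by decreasing similarity. Pairs that are not listed are
    left undefined, which is what makes the matrix sparse.
    """
    rows = []
    for i, v in enumerate(vectors):
        scored = [(j, cosine(v, w)) for j, w in enumerate(vectors) if j != i]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        rows.append(scored[:max_neighbors])
    return rows


# Hypothetical 2-dimensional vectors for four entities (illustration only).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
rows = knn_similarity_rows(vectors, max_neighbors=1)
for i, row in enumerate(rows):
    print(i, row)
```

Rows built this way need not be symmetric: entity A can appear among the neighbors of B without B appearing among the neighbors of A, which is also visible in the example matrix above.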

For the curious

You can compute the label similarity matrix shown above using Lingo4G by following these steps:

  1. If you haven't followed the initial Quick start tutorial, complete these steps:

    1. Prerequisites
    2. Installation
    3. Data download and indexing
    4. Learning embeddings
    5. Starting the Lingo4G server
  2. Open the Lingo4G JSON Sandbox app in a modern browser.

  3. Paste the following request and press the Execute button:

    {
      "stages": {
        "labels": {
          "type": "labels:direct",
          "labels": [
            {
              "label": "clustering algorithm"
            },
            {
              "label": "k-means"
            },
            {
              "label": "DBSCAN"
            },
            {
              "label": "nanomaterials"
            },
            {
              "label": "MWCNTs"
            }
          ]
        },
        "similarities":{
          "type": "matrix:knnVectorsSimilarity",
          "vectors": {
            "type": "vectors:precomputedLabelEmbeddings"
          },
          "maxNeighbors": 2
        }
      }
    }

    Computing similarities between a predefined list of labels.

  4. The similarities section of the result will contain the similarity matrix. See the matrix output reference for the description of the JSON encoding of matrices in Lingo4G.

Document similarities

Most of your Lingo4G analysis requests will use matrices to cluster and 2d-map documents. In the following sections, we'll explore different kinds of document-to-document similarities available in Lingo4G.

Keyword similarity

Document similarity based on shared keywords and phrases is the most straightforward and easy to understand.

Let's see how it works by building a request that performs the following:

  1. select documents containing the word clustering,
  2. compute the similarity matrix based on common phrases,
  3. create a 2d map of the documents from the similarity matrix.

Let's start with a document selector that selects documents matching the clustering query:

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    }
  }
}

These results alone are not very useful, so let's add the similarity stage to compute the similarities among the selected documents:

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "similarities": {
      "type": "matrix:keywordDocumentSimilarity"
    }
  },
  "output": {
    "stages": [
      "documents"
    ]
  }
}

This request uses the matrix:keywordDocumentSimilarity stage to compute similarity based on shared labels. It produces a square similarity matrix with rows and columns corresponding to the document set you provide in the documents property. Our request relies on Lingo4G's auto reference resolution to resolve this property automatically.

The request also introduces an output section to prevent outputting the raw similarity matrix, which is rarely useful alone and increases response size.

Finally, let's add the 2dMap stage to create a 2d map from the similarity matrix:

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "similarities": {
      "type": "matrix:keywordDocumentSimilarity"
    },
    "2dMap": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:reference",
        "use": "similarities"
      }
    }
  },
  "output": {
    "stages": [
      "documents",
      "2dMap"
    ]
  }
}

The 2d embedding is computed by the embedding2d:lv stage. Note that this stage accepts a similarity matrix rather than documents. This means that embedding2d:lv is agnostic to the specific type of entities it processes. It is the input similarity matrix that defines the semantics and interpretation of the 2d embeddings. Let's state this explicitly for clarity:

Clustering and 2d mapping in Lingo4G rely on index alignment of result matrices

When you build a similarity matrix, indices of rows and columns correspond to indices of the input documents or labels. Performing 2d mapping on this matrix provides 2d coordinates corresponding to the matrix rows. Indices across input entities, matrices, and 2d coordinates are aligned, so to determine the 2d coordinates of a document, look up its index in the document list and then find the same index in the 2d coordinates array.

Let us illustrate this further by looking at the results of the request we built. Copy and paste the request into the JSON Sandbox app and press the Execute button.

In the JSON results tab, you should see output similar to the following (only the first 5 elements of each array shown for brevity):

{
  "result" : {
    "documents" : {
      "matches" : {
        "value" : 20057,
        "relation" : "GREATER_OR_EQUAL"
      },
      "documents" : [
        {
          "id" : 14539,
          "weight" : 5.828844
        },
        {
          "id" : 277469,
          "weight" : 5.8251066
        },
        {
          "id" : 218609,
          "weight" : 5.8239636
        },
        {
          "id" : 216337,
          "weight" : 5.8107133
        },
        {
          "id" : 192190,
          "weight" : 5.785284
        }
      ]
    },
    "2dMap" : {
      "points" : [
        {
          "x" : -7.236797,
          "y" : -5.8753614
        },
        {
          "x" : 1.736527,
          "y" : -7.3485107
        },
        {
          "x" : -6.2543592,
          "y" : -4.847388
        },
        {
          "x" : -1.7186593,
          "y" : -11.761698
        },
        {
          "x" : -5.1737633,
          "y" : -6.1466236
        }
      ]
    }
  }
}

Because of index alignment, the 2d coordinates of document 14539, located at index 0 of the documents array, are available at index 0 of the points array.
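If you post-process the response in your own code, index alignment makes the lookup a matter of pairing the two arrays. The following Python sketch, which is not part of Lingo4G, parses an abbreviated copy of the response shown above and builds a document id to coordinates map. The variable names are ours; the JSON structure is taken from the output above.

```python
import json

# Abbreviated copy of the response shown above (first two entries only).
response_text = """
{
  "result": {
    "documents": {
      "documents": [
        {"id": 14539, "weight": 5.828844},
        {"id": 277469, "weight": 5.8251066}
      ]
    },
    "2dMap": {
      "points": [
        {"x": -7.236797, "y": -5.8753614},
        {"x": 1.736527, "y": -7.3485107}
      ]
    }
  }
}
"""

result = json.loads(response_text)["result"]
documents = result["documents"]["documents"]
points = result["2dMap"]["points"]

# Index alignment: the i-th point belongs to the i-th document.
coordinates_by_id = {
    doc["id"]: (point["x"], point["y"])
    for doc, point in zip(documents, points)
}

print(coordinates_by_id[14539])
```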

If you switch to the documents map tab of the results area, you should see all the document points visualized:

[Screenshot: Lingo4G JSON Sandbox app, 2d map of documents based on keyword similarity.]

2d map of documents based on the keyword similarity matrix.

Although the 2d map is currently unlabeled, you can clearly see areas where documents cluster globally and locally.

Refer to the documentation for the matrix:keywordDocumentSimilarity stage for a detailed description of its similarity algorithm. Feel free to experiment with the stage properties, such as maxNeighbors or minQueryLabelsRequiredInSimilarDocument, to see their impact on the 2d embedding.

Embedding similarity

Let's now modify the previous request by swapping the matrix:keywordDocumentSimilarity stage for the matrix:knnVectorsSimilarity stage, which uses embedding vectors to compute similarities.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "similarities": {
      "type": "matrix:knnVectorsSimilarity",
      "vectors": {
        "type": "vectors:precomputedDocumentEmbeddings"
      }
    },
    "2dMap": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:reference",
        "use": "similarities"
      }
    }
  },
  "output": {
    "stages": [
      "documents",
      "2dMap"
    ]
  }
}

Computing similarities between documents using embedding vector similarity.

The matrix:knnVectorsSimilarity stage computes a square similarity matrix for the list of vectors you provide. In our example, we use the vectors:precomputedDocumentEmbeddings stage, which returns document embedding vectors corresponding to the list of documents you provide in its documents property. We rely on Lingo4G's auto reference resolution mechanism to resolve that property automatically.

If you run the modified request in the JSON Sandbox app, you'll notice that the 2d maps resulting from embedding-based similarities are more tightly clustered compared to the keyword similarity based maps.

[Screenshot: Lingo4G JSON Sandbox app, 2d map of documents based on embedding vector similarity.]

2d map of documents based on the embedding vector similarity matrix.

Using external, LLM-based embedding vectors

If your index contains externally-computed embedding vectors (most likely from a Large Language Model), you can use those embeddings instead of Lingo4G's built-in embeddings. Simply swap the vectors:precomputedDocumentEmbeddings stage for the vectors:fromVectorField stage, providing the name of the document field containing the embedding vectors.

{
  "similarities": {
    "type": "matrix:knnVectorsSimilarity",
    "vectors": {
      "type": "vectors:fromVectorField",
      "fieldName": "embedding"
    }
  }
}

Content field similarity

Let's explore another similarity type — this time based on arbitrary per-document search queries. This type of similarity connects documents based on the equal values of one or more content fields.

The arXiv example project, on which we run our example requests, contains the set field. For each document, this field defines the top-level part of the paper's arXiv category, such as cs for Computer Science or astro-ph for Astrophysics.

Let's analyze a request that 2d maps documents by the search query-based similarity:

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "similarities": {
      "type": "matrix:fromMatrixRows",
      "matrixRows": {
        "type": "matrixRows:byQuery",
        "rows": {
          "type": "documents:reference",
          "use": "documents"
        },
        "queryBuilder": {
          "type": "queryBuilder:string",
          "variables": [
            {
              "input": "set",
              "variable": "SET",
              "quote": true
            }
          ],
          "query": "set:(<SET>)"
        }
      }
    },
    "2dMap": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:reference",
        "use": "similarities"
      }
    }
  },
  "output": {
    "stages": [
      "documents",
      "2dMap"
    ]
  }
}

Computing similarities between documents using arbitrary search queries.

Let's break down the similarities stage of the request into pieces:

  • The matrixRows:byQuery component computes the rows of the query-based similarity matrix. For each row, corresponding to one document from the documents list, the component builds and executes a document-specific search query (see below). The results of the query are the row document's neighbors; they give rise to the values of that matrix row.

  • The queryBuilder:string component builds the document-specific search query.

    The query property defines the query template. The template can contain variable references (denoted by angle brackets) that the query builder fills in with values of the indexed fields of the document. For multivalued fields, Lingo4G joins the values with the operator you define, OR by default.

    In our example, the query builder declares one variable, SET, and binds that variable to the values of the document's set field. As a result, for each document, the query returns other documents that share the largest number of set field values with the document being processed.

  • The matrix:fromMatrixRows stage materializes the rows from the matrixRows:byQuery component into a complete matrix required by the 2d mapping stage.

    For the curious: matrix:* vs matrixRows:*

    Until now, all stages generating matrices were of type matrix:*. This request, however, uses a component of the matrixRows:* type. The distinction comes from the fact that certain algorithms, such as clustering or 2d mapping, require a complete matrix on input, while others, like contrast score computation, can process matrix rows one-by-one. To cater to the latter group of algorithms, Lingo4G implements the same similarity computation methods both as matrix:* stages and as matrixRows:* components. They compute the same results, but the latter do not materialize the whole similarity matrix, significantly reducing memory usage in algorithms that can consume similarities row-by-row.

    Since the by-query similarity is currently only available in the matrixRows:* form, and we're performing 2d embedding, which requires a materialized matrix, we wrap our matrixRows:byQuery component with the matrix:fromMatrixRows stage, which materializes the rows into the full matrix suitable to pass to the embedding2d:lv stage for 2d embedding.
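To illustrate how the query template from our request could turn into a document-specific query, here is a rough Python sketch. It is not Lingo4G code: the build_query function, the exact quoting and joining rules, and the sample document are assumptions modeled on the description above, so the query Lingo4G actually executes may be formatted differently.

```python
def build_query(template, variables, document, operator="OR"):
    """Fill <VARIABLE> references in the template with document field values.

    Illustrative only: the quoting and joining rules are assumptions
    modeled on the tutorial text, not Lingo4G's actual implementation.
    """
    query = template
    for spec in variables:
        values = document.get(spec["input"], [])
        if spec.get("quote"):
            values = ['"' + value + '"' for value in values]
        joined = (" " + operator + " ").join(values)
        query = query.replace("<" + spec["variable"] + ">", joined)
    return query


# Variable declaration mirroring the request above.
variables = [{"input": "set", "variable": "SET", "quote": True}]
# Hypothetical document with two values in its set field.
document = {"set": ["cs", "astro-ph"]}

print(build_query("set:(<SET>)", variables, document))
```

Under these assumptions, the sample document produces the query set:("cs" OR "astro-ph"), which matches any document sharing at least one set value with it.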

If you run the request in the JSON Sandbox app, you should see a result similar to the following:

[Screenshot: Lingo4G JSON Sandbox app, 2d map of documents based on content field similarity.]

2d map of documents based on the content field similarity matrix.

As expected, since the similarity function is restricted to a nominal field with a limited number of distinct values, the 2d map contains groups corresponding to the individual values of the set field.

The content field-based similarity has limited value on its own but can be part of a composite similarity function together with some content-based similarity method, such as keyword or embedding similarity.

Composite similarity

Let's try the matrixRows:composite component to fuse different similarity matrices, allowing us to 2d map or cluster documents based on multiple criteria.

This request computes a composite similarity matrix that fuses two similarity functions: embedding similarity and content field-based similarity.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "similarities": {
      "type": "matrix:fromMatrixRows",
      "matrixRows": {
        "type": "matrixRows:composite",
        "matrixRows": [
          {
            "type": "matrixRows:weighted",
            "weight": 0.1,
            "matrixRows": {
              "type": "matrixRows:byQuery",
              "rows": {
                "type": "documents:reference",
                "use": "documents"
              },
              "queryBuilder": {
                "type": "queryBuilder:string",
                "variables": [
                  {
                    "input": "set",
                    "variable": "SET",
                    "quote": true
                  }
                ],
                "query": "set:(<SET>)"
              }
            }
          },
          {
            "type": "matrixRows:knnVectorsSimilarity",
            "vectors": {
              "rows": {
                "type": "vectors:precomputedDocumentEmbeddings"
              },
              "columns": {
                "type": "vectors:precomputedDocumentEmbeddings"
              }
            }
          }
        ]
      }
    },
    "2dMap": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:reference",
        "use": "similarities"
      }
    },
    "clustersBySet": {
      "type": "clusters:byValues",
      "values": {
        "type": "values:fromDocumentField",
        "fieldName": "set",
        "multipleValues": "COLLECT_FIRST"
      }
    }
  },
  "output": {
    "stages": [
      "documents",
      "2dMap",
      "clustersBySet"
    ]
  }
}

Computing similarities between documents using the composite similarity function.

Let's break down the similarity computation part:

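  • The matrixRows:composite component fuses two or more similarity matrix row sources. In our example, we use two similarity matrix components: the matrixRows:byQuery component from the previous section, wrapped in matrixRows:weighted with the weight property set to 0.1, and the matrixRows:knnVectorsSimilarity component, which uses vectors:precomputedDocumentEmbeddings for both its rows and its columns.

  • Since the 2d mapping stage requires a materialized similarity matrix, we wrap the matrixRows:composite in the matrix:fromMatrixRows stage to collect the matrix rows produced by the composite component into a full matrix.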
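The structure of the similarities stage can be mimicked with a few Python generators. This is only a conceptual sketch: the function names are ours, the sample similarity values are made up, and adding up the weighted values is an assumption made for illustration. Refer to the matrixRows:composite reference for the fusion rule Lingo4G actually applies.

```python
def weighted(rows, weight):
    """Scale every similarity in a stream of rows by a constant weight."""
    for row in rows:
        yield {column: weight * value for column, value in row.items()}


def composite(*row_sources):
    """Fuse several row streams by adding similarities column by column.

    Summation is an assumption made for this sketch, not a statement
    about how Lingo4G combines the values.
    """
    for rows in zip(*row_sources):
        fused = {}
        for row in rows:
            for column, value in row.items():
                fused[column] = fused.get(column, 0.0) + value
        yield fused


def from_matrix_rows(rows):
    """Materialize a lazy stream of rows into a complete (sparse) matrix."""
    return list(rows)


# Hypothetical similarities for three documents; each row maps a
# neighbor's index to a similarity value.
by_query_rows = [{1: 1.0}, {0: 1.0}, {}]
embedding_rows = [{1: 0.5, 2: 0.25}, {0: 0.5}, {0: 0.25}]

# A weight of 0.5 is used here only to keep the arithmetic exact.
matrix = from_matrix_rows(
    composite(weighted(by_query_rows, 0.5), embedding_rows)
)
print(matrix)
```

In this sketch, rows are produced lazily and only from_matrix_rows holds the whole matrix in memory. Passing a weight of 0.0 makes the fused rows equal to the embedding rows, so the weighted source no longer influences the result.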

Additionally, the request adds the clustersBySet stage, which groups the documents by the first value of their set field. This helps us track the impact of the content field similarity on the final result.

If you run the request in the JSON Sandbox app, you should see a result like this:

[Screenshot: Lingo4G JSON Sandbox app, 2d map of documents based on composite similarity.]

2d map of documents based on the composite of embedding vector and content field similarities, documents colored by the first value of the set field.

To see the impact of the set field similarity on the result, temporarily lower the weight property from 0.1 to 0.0 and re-run the analysis.

[Screenshot: Lingo4G JSON Sandbox app, 2d map of documents based on embedding similarity.]

2d map of documents based on the embedding vector similarity, documents colored by the first value of the set field.

By comparing the screenshots, you can see that the similarity based on the shared set field values helps bring the separate brown areas together while still maintaining the local groupings.

Label similarities

Let's switch our attention to processing labels. Similar to documents, if we create a matrix that represents similarities between labels, we can use the same 2d mapping and clustering algorithms to 2d map and cluster labels.

The following request demonstrates the two label similarity computation methods available in Lingo4G:

{
  "stages": {
    "labels": {
      "documents": {
        "type": "documents:byQuery",
        "query": {
          "type": "query:string",
          "query": "clustering"
        }
      },
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 500
      }
    },
    "2dMapByCooccurrences": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:cooccurrenceLabelSimilarity",
        "documents": {
          "type": "documents:byQuery",
          "limit": "unlimited",
          "query": {
            "type": "query:string",
            "query": "clustering"
          }
        }
      }
    },
    "2dMapByEmbeddings": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:precomputedLabelEmbeddings"
        }
      }
    }
  }
}

Computing similarities between labels using two methods: co-occurrence and embedding vector similarity.

Let's break the request down into individual stages:

  • The labels stage uses the labels:fromDocuments stage to extract 500 labels that best characterize the input documents (documents matching the clustering query). Subsequent stages compute similarities between the labels and arrange them on 2d maps.

  • The 2dMapByCooccurrences stage computes the 2d coordinates for the labels based on co-occurrence similarity. The matrix:cooccurrenceLabelSimilarity stage counts the number of times each pair of labels co-occurs in the set of documents you provide and uses that number to compute the similarity value. In our request, we count the co-occurrences across all documents matching the clustering query.

    You can adjust the cooccurrenceWindowSize property to determine how far apart the labels can be to be counted as co-occurring. The similarityWeighting property offers various binary similarity weightings to apply to the raw co-occurrence count to arrive at the final similarity value.

  • The 2dMapByEmbeddings stage computes the 2d map of the labels based on the embedding vector similarity. Notice that we use the same stage, matrix:knnVectorsSimilarity, to compute similarities between both documents and labels. The stage accepts a list of vectors to use for computation, and this time we provide it with vectors:precomputedLabelEmbeddings, the embedding vectors corresponding to the list of labels produced by the labels stage.
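To give a feel for what counting co-occurrences within a window means, here is a simplified Python sketch. It is an illustration rather than Lingo4G's algorithm: it treats labels as single tokens, measures the window in tokens, and counts each pair at most once per document. All of these choices, as well as the cooccurrence_counts function and the sample documents, are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, labels, window_size):
    """Count documents in which a pair of labels occurs within window_size tokens.

    Each document is a list of tokens; labels are single tokens here to
    keep the sketch short. Each pair is counted at most once per document.
    """
    counts = Counter()
    label_set = set(labels)
    for tokens in documents:
        positions = [(i, t) for i, t in enumerate(tokens) if t in label_set]
        pairs_in_doc = set()
        for (i, a), (j, b) in combinations(positions, 2):
            if a != b and j - i <= window_size:
                pairs_in_doc.add(tuple(sorted((a, b))))
        counts.update(pairs_in_doc)
    return counts


# Two hypothetical tokenized documents.
documents = [
    ["k-means", "and", "DBSCAN", "are", "clustering", "algorithms"],
    ["DBSCAN", "handles", "noise", "better", "than", "k-means"],
]
labels = ["k-means", "DBSCAN"]

narrow = cooccurrence_counts(documents, labels, window_size=3)
wide = cooccurrence_counts(documents, labels, window_size=8)
print(narrow[("DBSCAN", "k-means")], wide[("DBSCAN", "k-means")])
```

With the narrow window only the first document contributes, because in the second one the two labels are five tokens apart. A similarity weighting would then turn such raw counts into the final similarity values.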

If you run the above request in the JSON Sandbox app, you should see results similar to the following. Use the combo box at the top of the map to switch between the two 2d maps produced by the request.

[Screenshot: Lingo4G JSON Sandbox app, 2d map of labels using co-occurrence and embedding similarity.]

2d maps of labels based on co-occurrence and embedding vector similarities.

Exactly like with documents, the results of the labels, 2dMapByCooccurrences and 2dMapByEmbeddings stages are index-aligned. This means that the 2d coordinates found at index 0 of the result array correspond to the label at index 0 in the labels stage results array.

Compared to co-occurrence similarity, embedding vector similarity usually creates smaller, tighter clusters of labels.

Label-document similarities

All the 2d maps produced in the document similarities section contained 2d points corresponding only to documents. To make the maps more useful, let's add some labels to describe various areas of the maps. To do that, we'll need to create a labels-to-documents similarity matrix.

The following request extends the document embeddings similarity request by adding a label overlay on top of the document 2d map.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 500
      }
    },
    "2dMap": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:precomputedDocumentEmbeddings"
        }
      }
    },
    "2dMapLabels": {
      "type": "embedding2d:lvOverlay",
      "embedding2d": {
        "type": "embedding2d:reference",
        "use": "2dMap"
      },
      "matrix": {
        "type": "matrix:keywordLabelDocumentSimilarity"
      }
    }
  }
}

Using label-to-documents similarities to overlay labels on top of a 2d map of documents.

Compared to the original request, this request introduces the following changes:

  • The labels stage extracts 500 labels that best describe the input documents.

  • The 2dMapLabels stage uses embedding2d:lvOverlay to overlay labels on an existing 2d map of documents. In the embedding2d property, we provide a reference to the 2d document map we want to put labels on.

    To position labels on top of an existing 2d embedding, Lingo4G needs to know which documents are similar to each label. These similarities form a rectangular similarity matrix with rows corresponding to the labels we want to put on the map and columns corresponding to the documents that are already present on the 2d map. The matrix:keywordLabelDocumentSimilarity stage produces exactly this kind of rectangular similarity matrix.

  • The document similarity matrix computation is now inlined into the 2dMap stage specification.

Following the index alignment principle, the 2d point coordinates in the 2dMap stage are index-aligned with the results of the documents stage. Similarly, the 2d points array returned by the 2dMapLabels stage is index-aligned with the labels stage results.

If you execute the above request in the JSON Sandbox app, you should see the 2d documents annotated with labels.

[Screenshot: Lingo4G JSON Sandbox app, 2d map of documents with a labels overlay.]

2d maps of documents with labels overlay.

The matrix:keywordLabelDocumentSimilarity stage uses the keyword matching method to produce the labels-to-documents similarity matrix. Let's use the matrixRows:knnVectorsSimilarity component to generate a similar matrix using embedding vector similarity.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 500
      }
    },
    "2dMap": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:precomputedDocumentEmbeddings"
        }
      }
    },
    "2dMapLabels": {
      "type": "embedding2d:lvOverlay",
      "embedding2d": {
        "type": "embedding2d:reference",
        "use": "2dMap"
      },
      "matrix": {
        "type": "matrix:fromMatrixRows",
        "matrixRows": {
          "type": "matrixRows:knnVectorsSimilarity",
          "vectors": {
            "rows": {
              "type": "vectors:precomputedLabelEmbeddings"
            },
            "columns": {
              "type": "vectors:precomputedDocumentEmbeddings"
            }
          }
        }
      }
    }
  }
}

Computing labels-to-documents similarities using embedding vector similarity.

The only change we made to the previous request is the matrix property of the 2dMapLabels stage, which computes the labels-to-documents similarity. The matrixRows:knnVectorsSimilarity component allows us to specify the row and column vectors separately to build the rectangular similarity matrix. The 2d map overlay stage requires labels as rows and documents as columns, so we pass the appropriate vector sets in the rows and columns properties. Feel free to run the modified request in the JSON Sandbox app to compare the two approaches to label-to-document similarity.
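The shape of such a rectangular matrix is easy to see in a small Python sketch. The code below is not Lingo4G's implementation: it assumes cosine similarity, computes a dense matrix instead of keeping only the nearest neighbors, and uses made-up vectors. What it does show is the roles of rows (labels) and columns (documents).

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rectangular_similarity(row_vectors, column_vectors):
    """Return a len(row_vectors) x len(column_vectors) similarity matrix."""
    return [[cosine(r, c) for c in column_vectors] for r in row_vectors]


# Hypothetical embedding vectors: 2 labels (rows) and 3 documents (columns).
label_vectors = [[1.0, 0.0], [0.0, 1.0]]
document_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

matrix = rectangular_similarity(label_vectors, document_vectors)
for row in matrix:
    print([round(value, 3) for value in row])
```

Row i of the result describes label i and column j describes document j, so the matrix stays index-aligned with both the labels and the documents lists.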

2d distance similarities

In the previous sections, we used various similarities to create 2d maps of documents and labels. In this final section, we'll explore another type of similarity, matrix:knn2dDistanceSimilarity, to identify separate areas on the 2d maps.

The following request adds an extra stage to cluster the points on the 2d document map.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 500
      }
    },
    "2dMap": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:precomputedDocumentEmbeddings"
        }
      }
    },
    "2dMapLabels": {
      "type": "embedding2d:lvOverlay",
      "embedding2d": {
        "type": "embedding2d:reference",
        "use": "2dMap"
      },
      "matrix": {
        "type": "matrix:keywordLabelDocumentSimilarity"
      }
    },
    "2dMapClusters": {
      "type": "clusters:cd",
      "matrix": {
        "type": "matrix:knn2dDistanceSimilarity",
        "embedding2d": {
          "type": "embedding2d:reference",
          "use": "2dMap"
        },
        "maxNearestPoints": 20
      },
      "linkDensityThreshold": 0.01
    }
  }
}

Using 2d Euclidean similarity between points on the 2d document map to identify clusters of nearby points.

The extra stage uses the clusters:cd clustering stage paired with the matrix:knn2dDistanceSimilarity stage, which computes similarities based on the Euclidean distances in the 2d space. This ensures that the clusters group points that are close in the 2d space, rather than in the original multidimensional document space.
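As a rough illustration of distance-based similarity, the Python sketch below finds the nearest points for each 2d point and converts Euclidean distances to similarity values. It is not Lingo4G's algorithm: the knn_2d_distance_rows function, the 1 / (1 + distance) conversion and the sample points are assumptions chosen for simplicity.

```python
import math


def knn_2d_distance_rows(points, max_nearest_points):
    """For each 2d point, keep its nearest points with a similarity value.

    Returns a list of rows; each row is a list of (point_index, similarity)
    pairs ordered from the closest point to the farthest one.
    """
    rows = []
    for i, (x1, y1) in enumerate(points):
        distances = [
            (j, math.hypot(x1 - x2, y1 - y2))
            for j, (x2, y2) in enumerate(points)
            if j != i
        ]
        distances.sort(key=lambda pair: pair[1])
        nearest = distances[:max_nearest_points]
        # Turn distance into similarity: closer points get higher values.
        # The 1 / (1 + d) mapping is an assumption made for this sketch.
        rows.append([(j, 1.0 / (1.0 + d)) for j, d in nearest])
    return rows


# Hypothetical 2d map with two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
rows = knn_2d_distance_rows(points, max_nearest_points=1)
for i, row in enumerate(rows):
    print(i, row)
```

Each point ends up linked only to its close neighbor on the map, which is the kind of input a clustering algorithm needs to find groups of nearby points.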

If you run the above request in JSON Sandbox, you should see a result similar to the following:

With clusters present in the analysis result, JSON Sandbox automatically assigns different colors to each top-level document cluster.

[Screenshot: Lingo4G JSON Sandbox app, 2d map of documents with labels and clusters.]

2d maps of documents with a labels overlay and clusters based on the 2d Euclidean distance between document points.

Feel free to experiment with the linkDensityThreshold and maxNearestPoints properties to see the impact on the number and structure of clusters.

Further reading

This wraps up our exploration of the similarity matrix computation in Lingo4G. For further information, see the API reference documentation of the following stages and components: