matrix

The matrix:​* stages group various ways of producing matrices. The semantics of matrix rows and columns depends on the specific stage. You can use matrices as input to the clustering and 2d embedding stages.

Matrices are the bridge between individual entities, such as labels or documents, and their aggregations, such as clusters or 2d maps. Clustering and 2d mapping stages do not accept documents or labels directly on input. Instead, they accept similarity matrices, which define the semantics and interpretation of clusters or 2d maps.

Matrix building, clustering and 2d mapping stages rely on a very important concept universal to the whole Lingo4G analysis API: index alignment. When you build a similarity matrix, indices of rows and columns of the matrix correspond to the indices of documents or labels you provided on input. When you perform clustering on such a matrix, you receive clusters of related indices of the input matrix. Since indices across input entities, matrices and clusters are aligned, to find out which specific labels or documents got clustered, you need to look up the indices in the list of documents or labels you provided when building the similarity matrix.
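The index lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual response format: the `cluster_indices` structure and the `resolve` helper are hypothetical, and only the idea of aligned indices comes from the text.

```python
# Hypothetical input: the labels passed to a matrix-building stage, in order.
labels = ["cluster", "galaxies", "factorization", "photon", "algebra"]

# Hypothetical clustering result: each cluster is a list of row indices
# of the similarity matrix (the real response structure may differ).
cluster_indices = [[0, 2, 4], [1, 3]]


def resolve(indices, entities):
    """Map matrix or cluster indices back to the entities they refer to."""
    return [entities[i] for i in indices]


clusters_as_labels = [resolve(cluster, labels) for cluster in cluster_indices]
print(clusters_as_labels)
```

The same lookup applies to documents: replace the `labels` list with the list of documents you provided when building the matrix.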


You can use the following matrix stages in your analysis requests:

matrix:​cooccurrence​Label​Similarity

Computes a label similarity matrix based on how the labels co-occur in the documents you provide.

matrix:​direct

Returns a matrix whose contents you provide directly.

matrix:​element​Wise​Product

Computes an element-wise product of two matrices.

matrix:​from​Matrix​Rows

Collects matrix rows into an in-memory matrix.

matrix:​keyword​Document​Similarity

Computes a document similarity matrix based on the labels the documents share.

matrix:​keyword​Label​Document​Similarity

Computes similarities between a list of labels and a set of documents.

matrix:​knn2d​Distance​Similarity

Computes a matrix of similarities between 2d embeddings based on the 2d Euclidean distance. You can use this matrix to identify clusters of nearby points in a 2d map.

matrix:​knn​Vectors​Similarity

Computes a label or document similarity matrix based on multidimensional vector distance.


matrix:​reference

References the results of another matrix:​* stage defined in the request.


The JSON output of the matrix stage has the following structure:

{
  "result" : {
    "matrix" : {
      "columns" : 4,
      "indices" : [
        [ 3, 1 ],
        [ ],
        [ 0 ],
        [ 0, 1 ]
      ],
      "values" : [
        [ 0.21158148, 0.16382119 ],
        [ ],
        [ 0.13711902 ],
        [ 0.52923995, 0.45256963 ]
      ]
    }
  }
}
columns

The number of columns in this matrix. The number of rows is equal to the length of the indices and values arrays.

indices

Indices of the non-zero elements of the matrix. Each element in the array corresponds to one matrix row. For each row, the nested array lists zero-based indices of the non-empty matrix elements in that row.

values

Values of the non-zero elements of the matrix. Each element in the array corresponds to one matrix row. For each row, the nested array contains values of elements at the corresponding matrix indices indicated in the indices array.

Notes:

  • Most matrices in Lingo4G are sparse, hence the specific JSON output format.

  • Some matrix rows may be all-zeros. In such cases, the corresponding indices and values arrays are empty.
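To make the sparse format more concrete, the following Python sketch expands the example output shown above into a dense list of rows. The `to_dense` helper is an illustrative name, not part of any Lingo4G client library.

```python
import json

# The example matrix output from this section.
response = json.loads("""
{
  "result": {
    "matrix": {
      "columns": 4,
      "indices": [[3, 1], [], [0], [0, 1]],
      "values": [[0.21158148, 0.16382119], [], [0.13711902], [0.52923995, 0.45256963]]
    }
  }
}
""")


def to_dense(matrix):
    """Expand the sparse JSON structure into a dense list of rows."""
    rows = []
    for index_row, value_row in zip(matrix["indices"], matrix["values"]):
        dense_row = [0.0] * matrix["columns"]
        for column, value in zip(index_row, value_row):
            dense_row[column] = value
        rows.append(dense_row)
    return rows


dense = to_dense(response["result"]["matrix"])
for row in dense:
    print(row)
```

Note that the second row is all zeros because its indices and values arrays are empty.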

matrix:​cooccurrence​Label​Similarity

Computes a label similarity matrix based on how the labels co-occur in the documents you provide.

{
  "type": "matrix:cooccurrenceLabelSimilarity",
  "cooccurrenceWindowSize": 32,
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labels": {
    "type": "labels:reference",
    "auto": true
  },
  "normalized": true,
  "similarityWeighting": "INCLUSION",
  "threads": "auto"
}

You can use the co-occurrence-based similarity matrix to cluster or 2d-map a set of labels.

Lingo4G computes the co-occurrence-based label similarity matrix in the following way:

  1. For each pair (labelA, labelB) of input labels, Lingo4G scans the input documents to compute:

    • $f_A$: how many times labelA occurred in the input documents,
    • $f_B$: how many times labelB occurred in the input documents,
    • $f_{AB}$: how many times both labelA and labelB occurred in the document, at most cooccurrenceWindowSize words apart.
  2. Based on the $f_A$, $f_B$ and $f_{AB}$ frequencies and the similarityWeighting method you choose, Lingo4G computes the $s_{AB}$ and $s_{BA}$ similarity values and puts them in the output matrix at rows and columns corresponding to the indices of labelA and labelB in the input labels list.

    For example, if on the labels list labelA is at index 0, and labelB is at index 2, Lingo4G puts the corresponding similarities at the following locations in the output matrix:

    $$M = \begin{bmatrix} \cdot & \cdot & s_{AB} & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ s_{BA} & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \end{bmatrix}$$
  3. If normalized is true, Lingo4G globally normalizes all values in the similarity matrix to the 0...1 range.

Note that depending on the similarity​Weighting you choose, the matrix may or may not be symmetrical.
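The following Python sketch illustrates the counting and normalization steps under simplifying assumptions: documents are plain lists of words, labels are single words, "words apart" is taken as the difference of word positions, and global normalization divides by the largest value in the matrix. None of these details is confirmed by this documentation; the function names are illustrative only. The sketch uses raw co-occurrence counts as similarities, in the spirit of the COOCCURRENCES weighting.

```python
from itertools import combinations


def cooccurrence_counts(documents, labels, window):
    """Count pairs of label occurrences at most `window` positions apart."""
    n = len(labels)
    counts = [[0] * n for _ in range(n)]
    for words in documents:
        # Positions of each label in the current document.
        positions = {
            label: [i for i, word in enumerate(words) if word == label]
            for label in labels
        }
        for a, b in combinations(range(n), 2):
            near = sum(
                1
                for i in positions[labels[a]]
                for j in positions[labels[b]]
                if abs(i - j) <= window
            )
            # Raw counts are symmetric; other weightings may not be.
            counts[a][b] += near
            counts[b][a] += near
    return counts


def normalize_globally(matrix):
    """Scale all values by the global maximum (an assumed normalization)."""
    peak = max((v for row in matrix for v in row), default=0)
    if peak == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[v / peak for v in row] for row in matrix]


labels = ["cluster", "galaxies", "photon"]
documents = [
    "the cluster of galaxies emits a photon".split(),
    "a galaxies cluster and another cluster".split(),
]
counts = cooccurrence_counts(documents, labels, window=2)
normalized = normalize_globally(counts)
print(counts)
print(normalized)
```

Rows and columns of `counts` follow the order of the `labels` list, which is the index alignment the stage relies on.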

The following request computes co-occurrence counts for a list of labels across all documents containing at least one of the labels.

{
  "name": "Computing co-occurrence counts for a list of labels.",
  "stages": {
    "labels": {
      "type": "labels:direct",
      "labels": [
        {
          "label": "cluster"
        },
        {
          "label": "galaxies"
        },
        {
          "label": "factorization"
        },
        {
          "label": "photon"
        },
        {
          "label": "algebra"
        }
      ]
    },
    "matrix": {
      "type": "matrix:cooccurrenceLabelSimilarity",
      "labels": {
        "type": "labels:reference",
        "use": "labels"
      },
      "documents": {
        "type": "documents:byQuery",
        "query": {
          "type": "query:forLabels",
          "labels": {
            "type": "labels:reference",
            "use": "labels"
          }
        },
        "limit": "unlimited"
      },
      "similarityWeighting": "COOCCURRENCES",
      "normalized": false
    }
  }
}

Computing co-occurrence counts for a list of labels.

Note how the request uses the query:forLabels query component to select documents containing at least one of the input labels. We also use the COOCCURRENCES similarity weighting and disable matrix normalization to get the actual number of label co-occurrences.

A typical use-case for matrix:​cooccurrence​Label​Similarity is clustering or 2d mapping of labels. The following request briefly demonstrates this use case.

{
  "name": "Clustering and 2d-mapping labels based on co-occurrence similarity.",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      },
      "limit": "unlimited"
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 1000
      }
    },
    "similarities": {
      "type": "matrix:cooccurrenceLabelSimilarity"
    },
    "clusters": {
      "type": "clusters:ap"
    },
    "map2d": {
      "type": "embedding2d:lv"
    }
  },
  "output": {
    "stages": [
      "documents",
      "labels",
      "clusters",
      "map2d"
    ]
  }
}

Clustering and 2d-mapping labels based on co-occurrence similarity.

The above request collects 1000 labels from documents containing the word clustering, computes a co-occurrence similarity matrix for those labels and then arranges the labels into clusters and a 2d-map. The request uses the explicit output.stages array to prevent the output of the similarity matrix. Also note that the request uses auto-references to pass results between all the stages.

If you run the request in the JSON Sandbox app, you should see an interactive visualization of the clusters and the 2d map. Also have a look at the diagram tab, for a graphical representation of the connections between various stages.

Finally, if your index contains label embeddings, the embedding-based matrix:​knn​Vectors​Similarity stage may provide better clusters and 2d-maps.

cooccurrence​Window​Size

Type
integer
Default
32
Constraints
value >= 0
Required
no

Determines the maximum number of words that can separate co-occurring labels.

For example, if you set the co-occurrence window size to 10, Lingo4G treats two labels as co-occurring if they occur in the document at most 10 words apart.

Use smaller co-occurrence windows for sparser, more focused similarity matrices.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents in which to count label co-occurrences.

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

The documents' feature fields in which to count label co-occurrences.

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

The labels whose co-occurrences to count.

normalized

Type
boolean
Default
true
Required
no

If true, Lingo4G globally normalizes the similarity matrix to contain values in the 0...1 range.

The embedding2d:lv stage requires normalized matrix values.

One use case for non-normalized matrix values is computing the actual label co-occurrence frequencies with similarityWeighting set to COOCCURRENCES.

similarity​Weighting

Type
string
Default
"INCLUSION"
Constraints
one of [COOCCURRENCES, RR, INCLUSION, LOEVINGER, BB, DICE, YULE, OCHIAI, INNER_PRODUCT, COSINE, PEARSON]
Required
no

Determines the binary similarity weighting Lingo4G applies to raw label co-occurrence when computing label similarity values.

In most cases, the RR, INCLUSION and BB weightings provide the best clustering and 2d mapping results.

The similarity​Weighting property supports the following values:

RR

Russell-Rao similarity. Similarity values are proportional to the raw co-occurrence counts. The RR weighting creates rather large clusters and selects frequent labels as cluster label exemplars.

INCLUSION

Inclusion coefficient similarity. Emphasizes connections between labels sharing the same words, for example Mac OS and Mac OS X 10.6.

L​O​E​V​I​N​G​E​R

The inclusion coefficient corrected for chance.

B​B

Braun-Blanquet similarity. Maximizes similarity between labels having similar numbers of occurrences. Promotes lower-frequency labels as cluster exemplars.

D​I​C​E

Dice coefficient.

Y​U​L​E

Yule coefficient.

O​C​H​I​A​I

Ochiai coefficient, binary cosine.

I​N​N​E​R_​P​R​O​D​U​C​T

Inner product of the rows of the co-occurrence matrix.

C​O​S​I​N​E

Cosine similarity between the rows of the co-occurrence matrix.

P​E​A​R​S​O​N

Pearson correlation between the rows of the co-occurrence matrix.

C​O​O​C​C​U​R​R​E​N​C​E​S

Number of co-occurrences of labels. Set the normalized property to false to get the actual numbers in the output matrix.

threads

Type
threads
Default
auto
Required
no

The number of threads to use to count label co-occurrences.

matrix:​direct

A matrix where you directly provide all values in rows and columns.

{
  "type": "matrix:direct",
  "matrix": {
    "columns": 0,
    "indices": [],
    "values": []
  }
}

Direct matrices are useful mostly for debugging purposes or when you want to cluster or 2d-map a set of similarities coming from an external source.
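As an illustration of feeding external similarities into a request, the following Python sketch converts a small dense matrix into the sparse JSON structure and wraps it in a matrix:direct stage definition. The `to_sparse` helper and the example values are hypothetical.

```python
import json


def to_sparse(dense):
    """Convert a dense list of rows into the sparse JSON matrix structure."""
    indices, values = [], []
    for row in dense:
        nonzero = [(column, value) for column, value in enumerate(row) if value != 0]
        indices.append([column for column, _ in nonzero])
        values.append([value for _, value in nonzero])
    columns = len(dense[0]) if dense else 0
    return {"columns": columns, "indices": indices, "values": values}


# Hypothetical similarities computed outside of Lingo4G.
external = [
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.3],
    [0.0, 0.3, 0.0],
]

stage = {"type": "matrix:direct", "matrix": to_sparse(external)}
print(json.dumps(stage, indent=2))
```

You can then place the resulting stage definition in the stages section of a request and reference it from a clustering or 2d embedding stage.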

matrix

Type
object
Default
{
  "columns": 0,
  "indices": [],
  "values": []
}
Required
no

Definition of the matrix.

The definition must follow the sparse matrix JSON structure.

columns

Type
integer
Default
undefined
Constraints
value >= 0
Required
yes

The number of columns of the matrix.

indices

Type
array of array of integer
Default
undefined
Required
yes

Indices of non-zero elements of the matrix.

values

Type
array of array of number
Default
undefined
Required
yes

Values corresponding to indices.

matrix:​element​Wise​Product

Computes the element-by-element product of two matrices.

{
  "type": "matrix:elementWiseProduct",
  "factorA": null,
  "factorB": null
}

factor​A

Type
matrix
Default
null
Required
yes

Input matrix.

factor​B

Type
matrix
Default
null
Required
yes

Input matrix.
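The following Python sketch shows what an element-wise product means for two matrices in the sparse JSON format: an output element is non-zero only where both inputs have a non-zero element. The shape check is an assumption of this sketch; how Lingo4G handles matrices of different shapes is not stated here.

```python
def element_wise_product(a, b):
    """Multiply two sparse JSON matrices element by element."""
    if a["columns"] != b["columns"] or len(a["indices"]) != len(b["indices"]):
        # Assumption: both factors must have the same shape.
        raise ValueError("matrices must have the same shape")
    indices, values = [], []
    rows = zip(a["indices"], a["values"], b["indices"], b["values"])
    for a_idx, a_val, b_idx, b_val in rows:
        b_row = dict(zip(b_idx, b_val))
        row_idx, row_val = [], []
        for column, value in zip(a_idx, a_val):
            if column in b_row:
                row_idx.append(column)
                row_val.append(value * b_row[column])
        indices.append(row_idx)
        values.append(row_val)
    return {"columns": a["columns"], "indices": indices, "values": values}


factor_a = {"columns": 3, "indices": [[0, 2], [1]], "values": [[0.5, 0.4], [1.0]]}
factor_b = {"columns": 3, "indices": [[2], [0, 1]], "values": [[0.5], [0.9, 0.25]]}
product = element_wise_product(factor_a, factor_b)
print(product)
```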

matrix:​from​Matrix​Rows

Materializes matrix rows into an in-memory matrix.

{
  "type": "matrix:fromMatrixRows",
  "matrixRows": {
    "type": "matrixRows:reference",
    "auto": true
  }
}

This stage is useful mostly for debugging requests involving matrix rows.

matrix​Rows

Type
matrixRows
Default
{
  "type": "matrixRows:reference",
  "auto": true
}
Required
no

The matrix rows to materialize.

matrix:​keyword​Document​Similarity

Computes a document similarity matrix based on the labels the documents share.

{
  "type": "matrix:keywordDocumentSimilarity",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelCollector": {
    "type": "labelCollector:topFromFeatureFields",
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "labelFilter": {
      "type": "labelFilter:reference",
      "auto": true
    },
    "labelListFilter": {
      "type": "labelListFilter:truncatedPhrases"
    },
    "minTf": 0,
    "minTfMass": 1,
    "tieResolution": "AUTO"
  },
  "maxDocumentsForSubIndex": 0.3,
  "maxInMemorySubIndexSize": 8000000,
  "maxNeighbors": 8,
  "maxQueryLabelsPerDocument": 4,
  "minQueryLabelsPerDocument": 1,
  "minQueryLabelsRequiredInSimilarDocument": 1,
  "normalized": true,
  "threads": "auto"
}

You can use the keyword-based document similarity matrix to cluster or 2d-map a set of documents.

To compute the keyword-based document similarity, Lingo4G performs the following steps:

  1. For each document in the input documents, Lingo4G uses the label​Collector you provide to extract up to max​Query​Labels​Per​Document labels that characterize the document.

    If the number of labels extracted from the document is smaller than min​Query​Labels​Per​Document, Lingo4G excludes the document from processing. The corresponding row in the similarity matrix will be empty.

  2. If the number of input documents in relation to the total number of documents in the index is smaller than or equal to maxDocumentsForSubIndex, Lingo4G creates a temporary inverted index that improves the performance of the similarity matrix building.

  3. For each input document, Lingo4G builds a search query consisting of labels extracted in step 1. Lingo4G restricts the query to find matches only among the input documents. Additionally, the query matches only documents that contain at least min​Query​Labels​Required​In​Similar​Document of the document's labels obtained in step 1.

  4. For each input document, Lingo4G runs the corresponding query it built in step 3 to retrieve up to max​Neighbors matching documents. Lingo4G uses up to threads to execute the queries in parallel.

  5. For each input document, Lingo4G builds the corresponding row of the similarity matrix using the matching documents retrieved in step 4.

    For example, assuming that the query corresponding to document at index 2 in the input documents array matched documents at index 0, 2, and 3, Lingo4G puts the following values into the similarity matrix:

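    $$M = \begin{bmatrix} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ s_0 & \cdot & s_2 & s_3 \\ \cdot & \cdot & \cdot & \cdot \end{bmatrix}$$

    where $s_0$, $s_2$ and $s_3$ are the search scores obtained in step 4 for the document at index 2.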

    Note that matrix M is square and asymmetrical – the search query for document at index 0 or 3 may not return document 2.

  6. If normalized is true, Lingo4G normalizes the values in each row of matrix M to fall in the 0...1 range.

In the Apache Lucene, Solr and Elasticsearch world, this kind of similarity is also called More-Like-This similarity.
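Steps 5 and 6 of the algorithm above can be sketched in Python as follows. The search results are hypothetical hard-coded (document index, score) pairs standing in for the queries of step 4, and `build_matrix` is an illustrative name. Row normalization divides each row by its maximum value, as described for the normalized property of this stage.

```python
def build_matrix(hits_per_document, normalized=True):
    """Build sparse matrix rows from per-document search results.

    hits_per_document[i] lists (document_index, score) pairs returned
    by the similar-documents query for the input document at index i.
    """
    indices, values = [], []
    for hits in hits_per_document:
        hits = sorted(hits)
        scores = [score for _, score in hits]
        if normalized and scores and max(scores) > 0:
            top = max(scores)
            scores = [score / top for score in scores]
        indices.append([index for index, _ in hits])
        values.append(scores)
    return {"columns": len(hits_per_document), "indices": indices, "values": values}


# Hypothetical search results for four input documents.
hits = [
    [(0, 4.0), (1, 1.0)],
    [],  # document excluded in step 1, so its row stays empty
    [(0, 2.0), (2, 8.0), (3, 4.0)],
    [(3, 5.0)],
]
matrix = build_matrix(hits)
print(matrix)
```

In this example the row for document 2 contains document 0, but the row for document 0 does not contain document 2, which is why the matrix is not symmetrical.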

The following request uses keyword document similarity to cluster and 2d-map documents.

{
  "name": "Clustering and 2d-mapping documents using keyword document similarity.",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      },
      "limit": 10000
    },
    "matrix": {
      "type": "matrix:keywordDocumentSimilarity",
      "documents": {
        "type": "documents:reference",
        "use": "documents"
      }
    },
    "documents2dEmbedding": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:reference",
        "use": "matrix"
      }
    },
    "cluster": {
      "type": "clusters:ap",
      "matrix": {
        "type": "matrix:reference",
        "use": "matrix"
      },
      "inputPreference": -10000,
      "softening": 0.2
    }
  },
  "output": {
    "stages": [
      "documents",
      "documents2dEmbedding",
      "cluster"
    ]
  }
}

Clustering and 2d-mapping documents using keyword document similarity.

If you run the above request in the JSON Sandbox app, you should see the documents represented as a 2d map with point color corresponding to the top-level cluster the document belongs to.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents among which to compute keyword-based similarities.

If you provide a set of N documents, this stage produces a square N × N similarity matrix.

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

The feature fields to use when looking for similar documents.

Lingo4G uses the feature fields you provide in this property to run the similar document search queries in step 4 of the matrix building algorithm.

label​Collector

Type
labelCollector
Default
{
  "type": "labelCollector:topFromFeatureFields",
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "minTf": 0,
  "minTfMass": 1,
  "tieResolution": "AUTO"
}
Required
no

Determines which labels to use to build similar documents search queries.

Lingo4G uses the label collector you provide in step 1 of the matrix building algorithm to extract labels describing each input document.

The default label extractor retrieves up to max​Query​Labels​Per​Document of the most frequent labels in each document. Provide a custom label collector to modify this behavior.

max​Documents​For​Sub​Index

Type
number
Default
0.3
Constraints
value >= 0 and value <= 1
Required
no

Determines the threshold for creating a temporary inverted index.

Lingo4G can significantly speed up the computation of keyword similarities for a small set of documents by creating and querying a temporary disposable inverted index containing just the input documents. Lingo4G creates the temporary index only when the number of input documents divided by the total number of documents in the index is smaller than or equal to the value of this property.

For example, if maxDocumentsForSubIndex is 0.3 and the input documents make up at most 30% of all documents in the index, Lingo4G creates a temporary index to speed up the computation of similarities. We don't recommend setting this property to 0.0 or 1.0 in production.

max​In​Memory​Sub​Index​Size

Type
integer
Default
8000000
Constraints
value >= 0
Required
no

Maximum size of the in-memory temporary index, in bytes.

If the size of the temporary index exceeds max​In​Memory​Sub​Index​Size, Lingo4G materializes the index on disk in a temporary directory.

max​Neighbors

Type
integer
Default
8
Constraints
value >= 0
Required
no

The maximum number of similar documents to retrieve for each input document.

Lingo4G uses the max​Neighbors property in step 4 of the similarity matrix building algorithm to determine the maximum number of similar documents to retrieve for each row of the similarity matrix. Therefore, this stage produces matrices whose rows contain at most max​Neighbors values.

Increasing the number of similar documents produces similarity matrices that give rise to larger clusters and tighter 2d maps. Conversely, lowering the number of similar documents gives rise to smaller clusters and more sparse 2d maps.

max​Query​Labels​Per​Document

Type
integer
Default
4
Constraints
value >= 0
Required
no

The maximum number of labels to use to build the similar documents search query.

Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each input document.

Increasing maxQueryLabelsPerDocument, coupled with a larger maxNeighbors value, produces broader, more general similarity matrices that usually give rise to larger clusters and denser 2d maps.

Lingo4G ignores the max​Query​Labels​Per​Document property if you set a custom label​Collector.

min​Query​Labels​Per​Document

Type
integer
Default
1
Constraints
value >= 0
Required
no

The minimum number of labels the document must contain to be included in the similarity matrix computation.

Lingo4G uses this property in step 1 of the similarity matrix building algorithm where it collects a set of labels that best describe each input document. If a document contains fewer than minQueryLabelsPerDocument labels, Lingo4G excludes it from processing, leaving the corresponding row in the similarity matrix empty.

If you want to exclude from further processing documents containing just one label, increase min​Query​Labels​Per​Document beyond the default value of 1. Clusters arising from such similarity matrices are usually smaller and 2d maps are more sparse.

If you increase minQueryLabelsPerDocument, make sure to set maxQueryLabelsPerDocument to a value equal to or greater than minQueryLabelsPerDocument.

Lingo4G ignores the min​Query​Labels​Per​Document property if you set a custom label​Collector.

min​Query​Labels​Required​In​Similar​Document

Type
integer
Default
1
Constraints
value >= 0
Required
no

The minimum number of common labels required for two documents to be treated as similar.

Lingo4G uses this property in step 3 of the similarity matrix building algorithm. If you increase minQueryLabelsRequiredInSimilarDocument beyond the default value of 1, Lingo4G removes from the similarity matrix those document pairs that have fewer than minQueryLabelsRequiredInSimilarDocument labels in common.

If you don't want to base document similarities on a single label shared between documents, increase minQueryLabelsRequiredInSimilarDocument beyond the default value of 1. Clusters arising from such similarity matrices are usually smaller and 2d maps are more sparse.

normalized

Type
boolean
Default
true
Required
no

If true, Lingo4G normalizes values in rows of the similarity matrix to fall in the 0...1 range.

If you enable normalization, in each similarity matrix row Lingo4G finds the maximum value and divides all entries in that row by that value.

If you plan to feed the similarity matrix to the embedding2d:​lv 2d embedding algorithm, make sure the normalized property is true. Otherwise, Lingo4G throws an error to prevent incorrect 2d embedding results.

threads

Type
threads
Default
auto
Required
no

The number of concurrent threads to use to execute document similarity search queries.

matrix:​keyword​Label​Document​Similarity

Computes a rectangular similarity matrix between a list of labels and a set of documents you provide.

{
  "type": "matrix:keywordLabelDocumentSimilarity",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labels": {
    "type": "labels:reference",
    "auto": true
  },
  "maxSimilarDocumentsPerLabel": 5,
  "threads": "auto"
}

You can use this stage to overlay labels on a pre-existing 2d map of documents in such a way that labels summarize the documents lying in various areas of the map.

Lingo4G computes the label-document similarity matrix in the following way:

  1. For each label from the input labels list, Lingo4G builds a search query consisting of that label, covering the fields you provide and limited to the set of target documents.

  2. For each input label, Lingo4G executes the search query it built in step 1 and collects up to max​Similar​Documents​Per​Label documents. Then, Lingo4G sets the value in the output similarity matrix at the row corresponding to label index and column corresponding to the index of the matching document to equal the search score of the matching document.
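Because the matrix is rectangular, rows and columns refer to different lists: rows to labels, columns to documents. The Python sketch below resolves each row of such a matrix into the documents matching a label. The labels, document identifiers and scores are made up for illustration, and `documents_for_labels` is a hypothetical helper.

```python
# Hypothetical inputs, in the order passed to the stage.
labels = ["clustering", "galaxy"]
document_ids = ["doc-a", "doc-b", "doc-c", "doc-d"]

# Hypothetical stage output: 2 rows (labels) by 4 columns (documents).
matrix = {
    "columns": 4,
    "indices": [[1, 3], [0]],
    "values": [[2.5, 1.2], [3.1]],
}


def documents_for_labels(matrix, labels, document_ids):
    """Return, for each label, the matching documents and their search scores."""
    result = {}
    rows = zip(labels, matrix["indices"], matrix["values"])
    for label, index_row, value_row in rows:
        result[label] = [
            (document_ids[column], score)
            for column, score in zip(index_row, value_row)
        ]
    return result


by_label = documents_for_labels(matrix, labels, document_ids)
print(by_label)
```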

The following request uses the matrix:​keyword​Label​Document​Similarity stage to describe a 2d map of documents by placing labels near groups of documents related to those labels.

{
  "name": "Labeling a 2d map of documents using label-document similarity matrix.",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      },
      "limit": 10000
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 500
      }
    },
    "documents2dEmbedding": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:keywordDocumentSimilarity"
      }
    },
    "label2dOverlay": {
      "type": "embedding2d:lvOverlay",
      "matrix": {
        "type": "matrix:keywordLabelDocumentSimilarity",
        "labels": {
          "type": "labels:reference",
          "use": "labels"
        },
        "documents": {
          "type": "documents:reference",
          "use": "documents"
        }
      },
      "embedding2d": {
        "type": "embedding2d:reference",
        "use": "documents2dEmbedding"
      }
    }
  }
}

Labeling a 2d map of documents using label-document similarity matrix.

See the reference documentation for the embedding2d:lvOverlay stage for an in-depth explanation of this kind of request.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents for which to compute the similarity matrix.

Column indices in the output matrix correspond to indices on the document list you provide.

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

The document fields to search for labels when building the similarity matrix.

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

The labels for which to build the similarity matrix.

Row indices in the output matrix correspond to indices on the label list you provide.

max​Similar​Documents​Per​Label

Type
integer
Default
5
Constraints
value >= 0
Required
no

The maximum number of similar documents to retrieve for each label.

Each row of the output matrix has at most maxSimilarDocumentsPerLabel non-zero values.

threads

Type
threads
Default
auto
Required
no

The number of parallel threads to use when building the similarity matrix.

matrix:​knn2d​Distance​Similarity

Computes a matrix of similarities based on Euclidean distance between 2d points.

{
  "type": "matrix:knn2dDistanceSimilarity",
  "embedding2d": {
    "type": "embedding2d:reference",
    "auto": true
  },
  "maxNearestPoints": 8
}

You can use this stage to identify clusters of nearby points in a 2d map.

Lingo4G computes the matrix in the following way:

  1. For each 2d point p in the input embedding2d, find up to maxNearestPoints nearest points with respect to the Euclidean distance.

  2. For each nearest point $p_n$ found in step 1, compute the similarity using the following formula:

    $$s = \frac{1}{1 + e \cdot d}$$

    where d is the Euclidean distance between points $p$ and $p_n$.

    The above similarity formula converts 2d distances to the 0...1 range in such a way that the similarity for points with zero distance is 1 and the similarity for points that are infinitely far apart is 0.

  3. Put the similarity s computed in step 2 into the similarity matrix at the row and column location corresponding to points $p$ and $p_n$.

The following request uses the matrix:​knn2d​Distance​Similarity stage to identify clusters of nearby points in a 2d map:

{
  "name": "Clustering of points on a 2d map of labels",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      },
      "limit": "unlimited"
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 2000
      }
    },
    "map2d": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:precomputedLabelEmbeddings"
        }
      }
    },
    "clustersOnMap2d": {
      "type": "clusters:ap",
      "matrix": {
        "type": "matrix:knn2dDistanceSimilarity",
        "embedding2d": {
          "type": "embedding2d:reference",
          "use": "map2d"
        },
        "maxNearestPoints": 16
      },
      "inputPreference": -1,
      "softening": 0.5
    }
  }
}

Clustering of points on a 2d map of labels.

The above request collects 2000 labels from documents containing the word clustering and arranges them on a 2d map using the matrix:​knn​Vectors​Similarity similarity. Then, it clusters the label points on the 2d map using the clusters:​ap clustering algorithm and the matrix:​knn2d​Distance​Similarity similarity.

If you run the above request in the JSON Sandbox app, you should see the map of labels with proximity-based clusters represented as different colors. Switch to the diagram for a visualization of the data flow in the request.

embedding2d

Type
embedding2d
Default
{
  "type": "embedding2d:reference",
  "auto": true
}
Required
no

The input 2d points for which to find the nearest neighbors.

max​Nearest​Points

Type
integer
Default
8
Constraints
value >= 0
Required
no

The maximum number of the nearest points to find for each point.

The larger the max​Nearest​Points value, the denser the matrix and the larger the clusters you obtain from clustering that matrix.

matrix:​knn​Vectors​Similarity

Computes a label or document similarity matrix based on multidimensional vector distance.

{
  "type": "matrix:knnVectorsSimilarity",
  "maxNeighbors": 10,
  "threads": "auto",
  "vectors": {
    "type": "vectors:reference",
    "auto": true
  }
}

You can use this stage to cluster or 2d-map a set of documents or labels based on the similarity of their corresponding embedding vectors.

For each vector in the input set of multidimensional vectors, Lingo4G finds up to max​Neighbors closest vectors in the same vector set and transfers the cosine similarities between the nearest vectors to the output matrix. Therefore, the output matrix is square, its size is equal to the number of vectors in the input vectors set and values fall in the 0...1 range.
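The nearest-neighbor procedure described above can be sketched in Python with a brute-force search. This is only an illustration: the actual search method Lingo4G uses is not described here, and excluding a vector from its own list of neighbors is an assumption of this sketch. The `knn_similarity` and `cosine` names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_similarity(vectors, max_neighbors):
    """Build a square sparse matrix of cosine similarities to nearest vectors."""
    indices, values = [], []
    for i, u in enumerate(vectors):
        # Assumption: a vector is not its own neighbor.
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        nearest = sorted(scored, reverse=True)[:max_neighbors]
        nearest.sort(key=lambda pair: pair[1])  # order each row by column index
        indices.append([j for _, j in nearest])
        values.append([score for score, _ in nearest])
    return {"columns": len(vectors), "indices": indices, "values": values}


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
matrix = knn_similarity(vectors, max_neighbors=1)
print(matrix)
```

Each row holds at most `max_neighbors` values, so larger values of this parameter produce denser matrices.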

If you use vectors:​precomputed​Document​Embeddings as the input vectors set, this stage computes similarities between documents. Similarly, if you provide vectors:​precomputed​Label​Embeddings on input, this stage computes similarities between labels.

In many cases, embedding-based similarities offer better clustering and 2d mapping results compared to their co-occurrence and keyword-based counterparts. The following request arranges 2000 labels into a 2d map where similar labels are close to each other.

{
  "name": "Computing a 2d map of labels based on multidimensional embedding similarity",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      },
      "limit": "unlimited"
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 2000
      },
      "documents": {
        "type": "documents:reference",
        "use": "documents"
      }
    },
    "map2d": {
      "type": "embedding2d:lv",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:precomputedLabelEmbeddings",
          "labels": {
            "type": "labels:reference",
            "use": "labels"
          }
        }
      }
    }
  }
}

Computing a 2d map of labels based on multidimensional embedding similarity.

max​Neighbors

Type
integer
Default
10
Constraints
value >= 0
Required
no

The maximum number of nearest vectors to find for each vector.

The larger max​Neighbors, the denser the matrix and the larger the clusters resulting from clustering the matrix.

threads

Type
threads
Default
auto
Required
no

The number of threads to use for the computation.

vectors

Type
vectors
Default
{
  "type": "vectors:reference",
  "auto": true
}
Required
no

The set of multidimensional vectors among which to find similarities.

The size of the output square matrix is equal to the number of vectors in the input vector set.

Consumers of matrix:​*

The following stages and components take matrix:​* as input:

Stage or component, followed by its properties:

clusters:ap
  • matrix
embedding2d:lv
  • matrix
embedding2d:lvOverlay
  • matrix
matrix:elementWiseProduct
  • factorA
  • factorB
matrixRows:fromMatrix
  • matrix