clusters

The clusters:​* stages organize labels or documents into larger clusters based on the criteria of your choice.

Lingo4G clustering stages take matrix:* or matrixRows:* as input rather than lists of documents or labels. Therefore, you can use the same clustering algorithm, such as clusters:ap, to cluster either documents or labels. The input similarity matrix defines both the entities to cluster and the similarity function.


You can use the following clustering stages in your analysis requests:

clusters:​ap

Clusters entities based on the similarity matrix you provide. Uses an optimized version of the Soft-Constraint Affinity Propagation algorithm.

clusters:​by​Values

Creates clusters based on the list of values, converting each distinct value into a cluster.

clusters:​from​Matrix​Columns

Creates clusters based on the matrix rows you provide. Each column gives rise to one cluster with cluster members corresponding to the values of the column.

clusters:​with​Remapped​Documents

Translates clusters from one document space to another, filtering out non-matching documents.


clusters:​reference

References the results of another clusters:​* stage defined in the request.


The JSON output of the clusters:​* stages has the following structure:

{
  "result" : {
    "clusters" : {
      "clusters" : [
        {
          "exemplar" : {
            "index" : 18,
            "weight" : 0.34286517
          },
          "clusters" : [
            {
              "exemplar" : {
                "index" : 27,
                "weight" : 0.3567484
              },
              "clusters" : [ ],
              "members" : [
                {
                  "index" : 1,
                  "weight" : 0.4809605
                },
                {
                  "index" : 26,
                  "weight" : 0.30889755
                }
              ]
            },
            {
              "exemplar" : {
                "index" : 39,
                "weight" : 0.3273351
              },
              "clusters" : [ ],
              "members" : [
                {
                  "index" : 11,
                  "weight" : 1.0
                },
                {
                  "index" : 54,
                  "weight" : 0.68905324
                }
              ]
            }
          ],
          "members" : [
            {
              "index" : 35,
              "weight" : 0.5968796
            }
          ]
        },
        {
          "exemplar" : {
            "index" : 93,
            "weight" : 0.8734215
          },
          "clusters" : [
            {
              "exemplar" : {
                "index" : 2,
                "weight" : 0.25992832
              },
              "clusters" : [
                {
                  "exemplar" : {
                    "index" : 16,
                    "weight" : 0.6403978
                  },
                  "clusters" : [ ],
                  "members" : [
                    {
                      "index" : 84,
                      "weight" : 0.5679599
                    },
                    {
                      "index" : 28,
                      "weight" : 0.11486327
                    }
                  ]
                }
              ],
              "members" : [ ]
            }
          ],
          "members" : [
            {
              "index" : 24,
              "weight" : 0.985477
            },
            {
              "index" : 59,
              "weight" : 0.3255045
            },
            {
              "index" : 7,
              "weight" : 0.17952214
            }
          ]
        }
      ],
      "unclustered" : [
        12,
        34,
        36,
        79,
        80,
        89,
        91,
        92
      ]
    }
  }
}

As a general rule, Lingo4G represents cluster members as indices of rows or columns of the input matrix or matrix rows. See the reference for a specific clustering algorithm, such as clusters:ap, for example requests and ways to resolve cluster member indices.

The clusters array consists of objects representing the top-level clusters the clustering stage produced. Each cluster object can have the following properties:

exemplar

The cluster member that characterizes the entire cluster. The exact semantics of the exemplar member depends on the specific clustering algorithm. Some clustering algorithms do not produce cluster exemplars.

index

The index of the input matrix row or column that corresponds to this exemplar. You can use this index to identify the actual document or label the exemplar represents.

weight

The importance of the exemplar cluster member. The semantics of cluster member weights depends on the specific clustering algorithm.

clusters

The list of this cluster's child clusters.

members

The list of members of this cluster.

index

The index of the input matrix row or column that corresponds to this member. You can use this index to identify the actual document or label the cluster member represents. See the specific clustering algorithm reference for example requests and ways to resolve cluster member indices.

weight

The importance of the cluster member. The semantics of cluster member weights depends on the specific clustering algorithm.

The unclustered array contains indices of matrix rows or columns that the clustering algorithm was not able to organize into clusters.
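Because clusters can nest, client code usually walks the structure recursively. The following Python sketch is not part of Lingo4G. It assumes the stage result has been parsed into a dictionary shaped like the JSON above; the helper names walk_clusters and member_indices are hypothetical. It tolerates a missing exemplar, since some algorithms do not produce one.

```python
from typing import Any, Dict, Iterator, List, Tuple

Cluster = Dict[str, Any]


def walk_clusters(clusters: List[Cluster], depth: int = 0) -> Iterator[Tuple[int, Cluster]]:
    """Yield (depth, cluster) pairs in depth-first, parent-before-child order."""
    for cluster in clusters:
        yield depth, cluster
        # Child clusters use the same structure as top-level clusters.
        yield from walk_clusters(cluster.get("clusters", []), depth + 1)


def member_indices(cluster: Cluster) -> List[int]:
    """Return the indices of the cluster's direct members (children excluded)."""
    return [member["index"] for member in cluster.get("members", [])]


# A trimmed-down stage result in the shape shown above (values are illustrative).
stage_result = {
    "clusters": [
        {
            "exemplar": {"index": 18, "weight": 0.34},
            "clusters": [
                {
                    "exemplar": {"index": 27, "weight": 0.36},
                    "clusters": [],
                    "members": [
                        {"index": 1, "weight": 0.48},
                        {"index": 26, "weight": 0.31},
                    ],
                }
            ],
            "members": [{"index": 35, "weight": 0.6}],
        }
    ],
    "unclustered": [12, 34],
}

for depth, cluster in walk_clusters(stage_result["clusters"]):
    exemplar = cluster.get("exemplar")  # may be absent for some algorithms
    exemplar_index = exemplar["index"] if exemplar else None
    print("  " * depth + f"exemplar={exemplar_index} members={member_indices(cluster)}")
```

The indices printed here are still positions in the input matrix; turning them into documents or labels depends on the stage that produced the matrix.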

clusters:​ap

Clusters the matrix you provide using an optimized version of the Soft-Constraint Affinity Propagation algorithm.

{
  "type": "clusters:ap",
  "damping": 0.9,
  "inputPreference": -1000,
  "matrix": {
    "type": "matrix:reference",
    "auto": true
  },
  "maxIterations": 2000,
  "minPruningGain": 0.3,
  "minSteadyIterations": 100,
  "softening": 0.2,
  "threads": "auto"
}

Characteristics

Lingo4G's implementation of Affinity Propagation clustering produces clusters with the following characteristics:

  • Non-overlapping. Each member can belong to only one cluster or remain unclustered.

  • Described by an exemplar. Each cluster has one designated member, the exemplar, that serves as the most characteristic "description" of the other members in the cluster.

Example requests

The following request uses the clusters:​ap stage to cluster the top labels occurring in documents matching the clustering query.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 200
      }
    },
    "clusters": {
      "type": "clusters:ap",
      "matrix": {
        "type": "matrix:cooccurrenceLabelSimilarity",
        "labels": {
          "type": "labels:reference",
          "use": "labels"
        },
        "documents": {
          "type": "documents:reference",
          "use": "documents"
        }
      }
    }
  },
  "output": {
    "stages": [
      "labels",
      "clusters"
    ]
  }
}

Using clusters:​ap to cluster labels based on how they co-occur in documents.

In the response to the above request, member indices in the clusters stage result point to the list of labels. That is, a cluster member with index 0 is the label at position 0 in the result of the labels stage.
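Resolving indices on the client side can look like the Python sketch below. It assumes you have already extracted the label texts from the labels stage result into a plain list, in their original order; the JSON layout of that result is not shown in this section, so the extraction step is left out. The function name and the sample labels are hypothetical.

```python
from typing import Any, Dict, List


def describe_cluster(cluster: Dict[str, Any], labels: List[str]) -> Dict[str, Any]:
    """Replace exemplar and member indices with the label texts they point to."""
    exemplar = cluster.get("exemplar")
    return {
        "exemplar": labels[exemplar["index"]] if exemplar else None,
        "members": [labels[m["index"]] for m in cluster.get("members", [])],
        # Child clusters index into the same list of labels.
        "clusters": [describe_cluster(c, labels) for c in cluster.get("clusters", [])],
    }


# Hypothetical label texts, in the order the labels stage returned them.
labels = ["clustering", "k-means", "spectral clustering", "graph partitioning"]

cluster = {
    "exemplar": {"index": 0, "weight": 0.9},
    "clusters": [],
    "members": [{"index": 1, "weight": 0.7}, {"index": 2, "weight": 0.5}],
}

resolved = describe_cluster(cluster, labels)
print(resolved["exemplar"], "->", resolved["members"])
```

The same approach applies when the matrix was built from documents: pass the list of documents from the documents stage instead of the list of labels.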

The example request uses the matrix:​cooccurrence​Label​Similarity matrix, which computes similarities between labels based on how they co-occur with other labels in a set of documents. Alternatively, if your index contains label embeddings, you could use the matrix:​knn​Vectors​Similarity matrix to compute similarities based on multidimensional embedding vectors.

The following request uses clusters:​ap to cluster the top 10k documents matching the clustering query.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "clusters": {
      "type": "clusters:ap",
      "matrix": {
        "type": "matrix:keywordDocumentSimilarity",
        "documents": {
          "type": "documents:reference",
          "use": "documents"
        }
      },
      "inputPreference": -1000,
      "softening": 0.05
    },
    "clusterLabels":{
      "type": "labelClusters:documentClusterLabels",
      "clusters": {
        "type": "clusters:reference",
        "use": "clusters"
      }
    }
  },
  "output": {
    "stages": [
      "documents",
      "clusters",
      "clusterLabels"
    ]
  }
}

Using clusters:​ap to cluster documents based on the common keywords they share.

As in the previous example, member indices in the clusters stage result are positions in the clustered list, which this time is the list of documents. That is, a cluster member with index 0 is the document at position 0 in the result of the documents stage.

The request uses the matrix:keywordDocumentSimilarity stage, which computes similarities between documents based on the number of common words and phrases they share. If your index contains document embeddings, you could also use the matrix:knnVectorsSimilarity stage, which computes the similarities based on documents' multidimensional embedding vectors.

Finally, the example request uses the label​Clusters:​document​Cluster​Labels stage to identify the most frequent labels in each document cluster. Such labels may serve as a summary of the contents of each cluster.

Tuning

The output of the clusters:​ap stage depends not only on the clustering parameters you choose, but also on the density of the similarity matrix you provide on input. The following subsections offer some advice for a number of typical clustering tuning scenarios.

Number and size of clusters

The number of clusters you can get from the clusters:​ap stage depends on the following factors, in the order of impact:

  1. Density of the input similarity matrix. The denser the similarity matrix, or, in other words, the more neighbors each label or document has in the similarity graph, the larger the clusters you get.

    To make the similarity matrix denser, increase the number of per-row neighbors in the input matrix stage.

    Lowering the number of neighbors lowers the density of the similarity matrix and therefore leads to a larger number of smaller clusters.

  2. Input preference. Within a limited range, lowering the input​Preference property results in a smaller number of larger clusters. Input preference values below -1000 usually don't lead to further cluster size increases. For even larger clusters, increase the density of the similarity matrix.

  3. Softening. Increasing the softening property above 0 to introduce cluster links also increases the total number of clusters.

Clustering time

The time required to cluster a specific input depends on two factors:

  • Number of elements in the input matrix. The larger or the more dense the matrix, the longer it takes to cluster the matrix. Reducing the density of the input matrix speeds up processing but also creates a larger number of smaller clusters.

    Alternatively, when you intend to cluster a large number of documents only to get an overview of the topics covered by the documents, you can speed up processing by taking a random sample of the input collection.

  • Number of clustering iterations. Clustering time depends on the number of processing iterations Lingo4G performs when processing the input matrix. Lowering max​Iterations decreases the clustering time at the cost of lower clustering quality.

damping

Type
number
Default
0.9
Constraints
value >= 0 and value <= 1
Required
no

Determines the speed of the updates to the clustering solution.

We recommend leaving damping at the default value of 0.9. If you notice that clustering does not converge for a specific data set (Lingo4G uses up all maxIterations of clustering), first try increasing maxIterations. If increasing the number of iterations does not lead to convergence, try increasing damping to a value in the 0.95 to 0.98 range.
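For intuition about what damping controls: in the textbook formulation of Affinity Propagation, each iteration blends the newly computed message value with the value from the previous iteration. The Python sketch below shows only that blending rule. It is an assumption that Lingo4G's optimized implementation applies damping in the same way, and both function names are hypothetical.

```python
def damped_update(previous: float, proposed: float, damping: float) -> float:
    """Blend the previous message value with the newly proposed one.

    damping = 0 accepts the proposed value outright; values close to 1 keep
    most of the previous value, so the solution changes more slowly.
    """
    if not 0.0 <= damping <= 1.0:
        raise ValueError("damping must be between 0 and 1")
    return damping * previous + (1.0 - damping) * proposed


def updates_to_settle(damping: float, tolerance: float = 0.01) -> int:
    """Count the updates needed to move a value from 0.0 to within tolerance of 1.0."""
    if damping >= 1.0:
        raise ValueError("with damping = 1 the value never changes")
    value, updates = 0.0, 0
    while abs(1.0 - value) > tolerance:
        value = damped_update(value, 1.0, damping)
        updates += 1
    return updates


for damping in (0.5, 0.9, 0.98):
    print(damping, updates_to_settle(damping))
```

Under this model, higher damping means each iteration moves the solution less, which is one reason to raise maxIterations before, or together with, raising damping.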

input​Preference

Type
number
Default
-1000
Required
no

Influences the number of clusters the algorithm produces.

The lower the input preference value, the lower the number of clusters. When input preference is 0, the number of clusters is usually higher than practical. Use an input preference of -5 or lower to get a smaller set of clusters.

See the Tuning section for more information on how the input preference affects the characteristics of clusters.

matrix

Type
matrix
Default
{
  "type": "matrix:reference",
  "auto": true
}
Required
no

The matrix of similarities for clustering.

Affinity Propagation clustering requires a square similarity matrix. The example requests in this section use the matrix:cooccurrenceLabelSimilarity and matrix:keywordDocumentSimilarity stages as input for clustering.

max​Iterations

Type
integer
Default
2000
Constraints
value >= 0
Required
no

The maximum number of clustering iterations to perform.

When clustering more than about 10k documents or labels, consider increasing the number of allowed iterations to 5000 or even 10000 for better clustering results (at the cost of longer processing time).

min​Pruning​Gain

Type
number
Default
0.3
Constraints
value >= 0 and value <= 1
Required
no

The minimum estimated relationship pruning gain required to apply the pruning during clustering.

Pruning may reduce clustering time for dense relationship matrices at the cost of increasing memory usage by about 60%.

min​Steady​Iterations

Type
number
Default
100
Constraints
value >= 0
Required
no

The minimum number of Affinity Propagation iterations during which the clusters must remain unchanged before Lingo4G assumes that the clustering process is complete.

If you notice clustering does not converge for a specific data set, try increasing max​Iterations first, and then possibly increasing damping, if still required.
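The interplay of maxIterations and minSteadyIterations can be pictured as the stopping rule below. This is a sketch under assumptions, not Lingo4G's actual implementation: it assumes the steady iterations must be consecutive, and it replaces the real algorithm with a pre-recorded sequence of cluster assignments. All names are hypothetical.

```python
from itertools import islice
from typing import Any, Iterable, Tuple


def iterations_until_steady(
    snapshots: Iterable[Any],
    max_iterations: int = 2000,
    min_steady_iterations: int = 100,
) -> Tuple[int, bool]:
    """Return (iterations used, converged?) for a sequence of cluster assignments."""
    previous, steady, used = None, 0, 0
    # Never look at more than max_iterations snapshots.
    for current in islice(snapshots, max_iterations):
        used += 1
        # Count consecutive iterations in which the assignment did not change.
        steady = steady + 1 if current == previous else 0
        previous = current
        if steady >= min_steady_iterations:
            return used, True
    # The iteration budget ran out before the assignment stayed steady long enough.
    return used, False


# Hypothetical run: the assignment changes once and then stays the same.
history = [(0, 0, 1), (0, 1, 1)] + [(0, 1, 1)] * 10

print(iterations_until_steady(history, max_iterations=50, min_steady_iterations=5))
print(iterations_until_steady(history, max_iterations=4, min_steady_iterations=5))
```

In the second call the iteration budget is exhausted first, which corresponds to the "does not converge" situation described above.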

softening

Type
number
Default
0.2
Constraints
value >= 0 and value <= 1
Required
no

Determines the amount of internal structure to generate for large label clusters.

A value of 0 keeps the internal structure to a minimum, producing a flat cluster structure for most inputs. As you increase softening, Lingo4G splits larger clusters into smaller, connected subclusters. Values close to 1.0 produce the richest internal structure of clusters.

threads

Type
threads
Default
auto
Required
no

The number of parallel threads to use to perform clustering.
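If you assemble requests programmatically, a small helper can catch out-of-range values before the request is sent. The Python sketch below builds a clusters:ap stage from the defaults and constraints listed above. The helper ap_stage is hypothetical, not part of any Lingo4G client library, and it checks only the constraints documented in this section.

```python
import json
from typing import Any, Dict

# Documented defaults of the clusters:ap stage.
AP_DEFAULTS: Dict[str, Any] = {
    "damping": 0.9,
    "inputPreference": -1000,
    "maxIterations": 2000,
    "minPruningGain": 0.3,
    "minSteadyIterations": 100,
    "softening": 0.2,
    "threads": "auto",
}

# Properties documented as "value >= 0 and value <= 1".
UNIT_RANGE = ("damping", "minPruningGain", "softening")
# Properties documented as "value >= 0".
NON_NEGATIVE = ("maxIterations", "minSteadyIterations")


def ap_stage(matrix: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Build a clusters:ap stage, validating the documented value constraints."""
    unknown = set(overrides) - set(AP_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")
    stage = {"type": "clusters:ap", "matrix": matrix, **AP_DEFAULTS, **overrides}
    for name in UNIT_RANGE:
        if not 0 <= stage[name] <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    for name in NON_NEGATIVE:
        if stage[name] < 0:
            raise ValueError(f"{name} must not be negative")
    return stage


stage = ap_stage(
    {"type": "matrix:reference", "auto": True},
    softening=0.05,
    maxIterations=5000,
)
print(json.dumps(stage, indent=2))
```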

clusters:​by​Values

Creates one cluster for each unique value from the list of values you provide.

{
  "type": "clusters:byValues",
  "values": {
    "type": "values:reference",
    "auto": true
  }
}

The clusters:byValues stage outputs a flat list of clusters. Each cluster has the following properties:

name

The value that gave rise to this cluster.

members

Each member corresponds to one occurrence of the cluster's value on the list of values. The member id property is the index of the value's occurrence on the input list; the member weight is always 1.
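The grouping this stage performs can be pictured with a few lines of Python. The sketch below is a local illustration of the behavior described above, not Lingo4G code: it maps each distinct value to the positions at which the value occurs in the input list. The function name and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def group_by_value(values: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Map each distinct value to the positions where it occurs in the input."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for position, value in enumerate(values):
        # Each occurrence becomes one cluster member (its weight would be 1).
        groups[value].append(position)
    return dict(groups)


# Hypothetical field values, one per document in the input list.
values = ["cs.CL", "cs.LG", "cs.CL", "stat.ML", "cs.CL"]
groups = group_by_value(values)

for name, positions in groups.items():
    print(name, positions)
```

Each dictionary key plays the role of the cluster name, and each list of positions plays the role of the cluster's members.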

If you combine this stage with values:​from​Document​Field, you can count how many times a specific field value occurred in a specific list of documents. A typical use case for such counting is a k-nearest-neighbors (kNN) classifier.

The following request is a simple kNN classifier that suggests an arXiv category for the piece of text you provide.

{
  "name": "arXiv category suggestions (kNN)",
  "variables": {
    "textToClassify": {
      "name": "Text to classify",
      "comment": "Paper title and abstract for which to generate category suggestions. Provide at least one paragraph of text for best results.",
      "value": "Word Mover's Embedding: From Word2Vec to Document Embedding. While the celebrated Word2Vec technique yields semantically rich representations for individual words, there has been relatively less success in extending to generate unsupervised sentences or documents embeddings. Recent work has demonstrated that a distance measure between documents called Word Mover's Distance (WMD) that aligns semantically similar words, yields unprecedented KNN classification accuracy. However, WMD is expensive to compute, and it is hard to extend its use beyond a KNN classifier. In this paper, we propose the Word Mover's Embedding (WME), a novel approach to building an unsupervised document (sentence) embedding from pre-trained word embeddings. In our experiments on 9 benchmark text classification datasets and 22 textual similarity tasks, the proposed technique consistently matches or outperforms state-of-the-art techniques, with significantly higher accuracy on problems of short length."
    },
    "classFieldName": {
      "name": "Document field to suggest",
      "comment": "The document field whose value to suggest for the provided text.",
      "value": "category"
    },
    "maxKeywordSimilarDocuments": {
      "name": "Max keyword-similar documents",
      "comment": "Maximum number of similar documents to find using the keyword method.",
      "value": 20
    }
  },
  "stages": {
    "seedLabels": {
      "type": "labels:fromText",
      "text": {
        "@var": "textToClassify"
      }
    },
    "keywordMlt": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:forLabels",
        "labels": {
          "type": "labels:reference",
          "use": "seedLabels"
        }
      },
      "limit": {
        "@var": "maxKeywordSimilarDocuments"
      }
    },
    "keywordMltContent": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "keywordMlt"
      },
      "queries": {
        "k": {
          "type": "query:fromDocuments",
          "documents": {
            "type": "documents:reference",
            "use": "keywordMlt"
          }
        }
      },
      "limit": {
        "@var": "maxKeywordSimilarDocuments"
      }
    },
    "classes": {
      "type": "clusters:byValues",
      "values": {
        "type": "values:fromDocumentField",
        "documents": {
          "type": "documents:reference",
          "use": "keywordMlt"
        },
        "multipleValues": "COLLECT_ALL",
        "fieldName": {
          "@var": "classFieldName"
        }
      }
    }
  },
  "output": {
    "stages": [
      "classes",
      "keywordMlt",
      "keywordMltContent",
      "seedLabels"
    ]
  }
}

Using clusters:byValues to compute arXiv category suggestions for a piece of text using a k-nearest neighbors classifier.

The classifier request performs the following steps:

  1. The seed​Labels stage extracts top-frequency labels from the input text using labels:​from​Text.

  2. The keyword​Mlt stage uses documents:​by​Query and query:​for​Labels to find more documents containing seed​Labels.

  3. The classes stage uses values:fromDocumentField to fetch the arXiv category field value for each document returned by the keywordMlt stage. Then, it uses clusters:byValues to compute the most frequent categories. These are likely to be a good category choice for the input passage.
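On the client side, turning the classes result into ranked suggestions amounts to counting members per cluster. The Python sketch below assumes you pass it the flat list of cluster objects from the classes stage result; the sample data, including the member property names, is hypothetical, and the function only relies on the name property and the number of members.

```python
from typing import Any, Dict, List, Tuple


def rank_suggestions(clusters: List[Dict[str, Any]], top: int = 3) -> List[Tuple[str, int, float]]:
    """Return (value, votes, share of all votes) for the most frequent values."""
    total = sum(len(cluster["members"]) for cluster in clusters)
    if total == 0:
        return []
    # One member is one neighbor document "voting" for the value.
    ranked = sorted(clusters, key=lambda cluster: len(cluster["members"]), reverse=True)
    return [
        (cluster["name"], len(cluster["members"]), len(cluster["members"]) / total)
        for cluster in ranked[:top]
    ]


# Hypothetical classes result for five neighbor documents.
classes = [
    {"name": "cs.LG", "members": [{"id": 1, "weight": 1.0}]},
    {"name": "cs.CL", "members": [{"id": 0, "weight": 1.0}, {"id": 2, "weight": 1.0}, {"id": 4, "weight": 1.0}]},
    {"name": "stat.ML", "members": [{"id": 3, "weight": 1.0}]},
]

for name, votes, share in rank_suggestions(classes, top=2):
    print(f"{name}: {votes} votes ({share:.0%})")
```

The share of votes gives a rough indication of how strongly the neighbors agree; a low share for the top value suggests the input text sits between categories.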

values

Type
values
Default
{
  "type": "values:reference",
  "auto": true
}
Required
no

The list of values for which to create clusters.

clusters:​from​Matrix​Columns

Creates clusters based on the matrix rows you provide. Each column gives rise to one cluster with cluster members corresponding to the values of the column.

{
  "type": "clusters:fromMatrixColumns",
  "limit": 100,
  "matrixRows": {
    "type": "matrixRows:reference",
    "auto": true
  },
  "sortOrder": "DESCENDING",
  "weightAggregation": "SUM"
}

This stage performs the following steps:

  1. For each column of the input matrix​Rows, aggregate the column's values using the weight​Aggregation function.

  2. Sort columns by their aggregated value computed in step 1, according to the sort​Order.

  3. Return a flat list of clusters corresponding to up to limit first columns on the sorted list. Each cluster has the following properties:

    exemplar

    Describes the matrix column that gave rise to this cluster.

    index

    Index of the column that gave rise to this cluster.

    weight

    The aggregate of column values computed in step 1.

    members

    Describes the individual values (rows) of the column that gave rise to this cluster.

    index

    Index of the row.

    weight

    Matrix value at this member's row and this cluster's column coordinate.
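The three steps above can be emulated locally to check your understanding of the output. The Python sketch below is an illustration, not Lingo4G code. It assumes sparse rows represented as dictionaries that map a column index to a value, implements only the SUM aggregation, treats DESCENDING as "largest aggregated values first", and lists only the rows that have a value in the column as members. All names are hypothetical.

```python
from typing import Any, Dict, List


def clusters_from_matrix_columns(
    rows: List[Dict[int, float]],
    limit: int = 100,
    sort_order: str = "DESCENDING",
) -> List[Dict[str, Any]]:
    # Step 1: sum each column's values and remember the contributing rows.
    totals: Dict[int, float] = {}
    contributors: Dict[int, List[Dict[str, Any]]] = {}
    for row_index, row in enumerate(rows):
        for column, value in row.items():
            totals[column] = totals.get(column, 0.0) + value
            contributors.setdefault(column, []).append({"index": row_index, "weight": value})

    # Step 2: order the columns by their aggregated value.
    columns = sorted(totals)  # UNSPECIFIED: keep the column order of the matrix
    if sort_order == "DESCENDING":
        columns.sort(key=lambda column: totals[column], reverse=True)
    elif sort_order == "ASCENDING":
        columns.sort(key=lambda column: totals[column])
    elif sort_order != "UNSPECIFIED":
        raise ValueError(f"unsupported sort order: {sort_order}")

    # Step 3: convert up to `limit` first columns into flat clusters.
    return [
        {"exemplar": {"index": column, "weight": totals[column]}, "members": contributors[column]}
        for column in columns[:limit]
    ]


# Three rows, three columns; missing entries mean "no value".
rows = [{0: 0.5, 2: 0.75}, {2: 0.5}, {0: 0.25, 1: 0.5}]

for cluster in clusters_from_matrix_columns(rows, limit=2):
    print(cluster["exemplar"], cluster["members"])
```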

The clusters:fromMatrixColumns stage is an extension of the documents:fromMatrixColumns stage. While the latter outputs only documents corresponding to the top-valued columns, this stage also outputs the indices and values in rows that contributed to the specific column's value.

As with documents:fromMatrixColumns, you can use this stage to select top-scoring documents where the score is an aggregation of a number of values. For example, if you build matrixRows of cross-similarities between a set of cs.* and physics.* arXiv papers, clusters:fromMatrixColumns can reveal the top physics.* papers that are most similar to cs.* papers, showing where the two areas overlap. Unlike documents:fromMatrixColumns, this stage also outputs the cs.* papers that contribute to the aggregated similarity value of each physics.* paper.

limit

Type
limit
Default
100
Required
no

The maximum number of top-scoring matrix columns to convert into clusters.

matrix​Rows

Type
matrixRows
Default
{
  "type": "matrixRows:reference",
  "auto": true
}
Required
no

The matrix rows whose columns to aggregate.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Determines the sorting order for the aggregated column values.

A​S​C​E​N​D​I​N​G

Creates up to limit clusters corresponding to columns with the smallest aggregated values.

DESCENDING

Creates up to limit clusters corresponding to columns with the largest aggregated values.

U​N​S​P​E​C​I​F​I​E​D

Creates up to limit clusters in the order their corresponding columns appear in the input matrix​Rows.

weight​Aggregation

Type
weightAggregation
Default
"SUM"
Required
no

The column value aggregation function.

clusters:​with​Remapped​Documents

Translates clusters from one document space to another, filtering out non-matching documents.

{
  "type": "clusters:withRemappedDocuments",
  "clusters": {
    "type": "clusters:reference",
    "auto": true
  },
  "exemplarsFrom": null,
  "exemplarsTo": null,
  "membersFrom": null,
  "membersTo": null
}

This stage is specialized and rarely needed. You may need it when you have a list of clusters created for a certain set of documents, but want to remap cluster member indices to a different but related set of documents, such as a subset or a superset of the one that gave rise to the clusters.
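Conceptually, remapping translates a position in one list of documents into the position of the same document in another list and drops documents that have no counterpart. The Python sketch below illustrates that idea only. It assumes documents can be matched by some stable identifier; the identifiers and the function name are hypothetical, and the sketch does not describe how Lingo4G matches documents internally.

```python
from typing import Hashable, List, Sequence


def remap_indices(
    indices: Sequence[int],
    source_ids: Sequence[Hashable],
    target_ids: Sequence[Hashable],
) -> List[int]:
    """Translate positions in source_ids into positions in target_ids."""
    position_in_target = {doc_id: position for position, doc_id in enumerate(target_ids)}
    remapped = []
    for index in indices:
        doc_id = source_ids[index]
        # Documents absent from the target list are filtered out.
        if doc_id in position_in_target:
            remapped.append(position_in_target[doc_id])
    return remapped


# Hypothetical document identifiers in the "from" and "to" lists.
members_from = ["a", "b", "c", "d"]
members_to = ["d", "b"]

print(remap_indices([0, 1, 3], members_from, members_to))
```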

clusters

Type
clusters
Default
{
  "type": "clusters:reference",
  "auto": true
}
Required
no

The clusters whose exemplar and member indices to remap.

exemplars​From

Type
documents
Default
null
Required
yes

The list of documents which gave rise to cluster exemplars.

exemplars​To

Type
documents
Default
null
Required
yes

The list of documents to which to translate the cluster exemplars.

members​From

Type
documents
Default
null
Required
yes

The list of documents which gave rise to cluster members.

members​To

Type
documents
Default
null
Required
yes

The list of documents to which to translate the cluster members.

Consumers of clusters:​*

The following stages and components take clusters:​* as input:

Each stage or component is followed by the property that takes the clusters:

clusters:withRemappedDocuments

  • clusters

documents:fromClusterExemplars

  • clusters

documents:fromClusterMembers

  • clusters

labelClusters:documentClusterLabels

  • clusters