clusters
The clusters:​*
stages organize labels or documents into larger clusters based on the criteria of your
choice.
Lingo4G clustering stages take matrix:​*
or
matrix​Rows:​*
on input rather than lists of documents or
labels. Therefore, you can use the same clustering algorithm, such as
clusters:​ap
, to cluster either documents or labels. It is the input similarity matrix that defines the entities to cluster and
the similarity function.
The Similarity matrices tutorial contains step-by-step guides to different similarity functions available in Lingo4G and their applications to 2d mapping and clustering of labels and documents.
You can use the following clustering stages in your analysis requests:
-
clusters:​ap
-
Clusters entities based on the similarity matrix you provide. Uses an optimized version of the Soft-Constraint Affinity Propagation algorithm.
-
clusters:​by​Values
-
Creates clusters based on the list of values, converting each distinct value into a cluster.
-
clusters:​cd
-
Clusters items based on the similarity matrix you provide. Uses the Community Detection graph clustering algorithm.
-
clusters:​from​Matrix​Columns
-
Creates clusters based on the matrix rows you provide. Each column gives rise to one cluster with cluster members corresponding to the values of the column.
-
clusters:​transformed
-
Flattens and truncates the clustering you provide to the specified hierarchy depth and maximum number of cluster members.
-
clusters:​with​Remapped​Documents
-
Translates clusters from one document space to another, filtering out non-matching documents.
-
clusters:​reference
-
References the results of another
clusters:​*
stage defined in the request.
The JSON output of the clusters:​*
stages has the following structure:
{
"result" : {
"clusters" : {
"clusters" : [
{
"exemplar" : {
"index" : 18,
"weight" : 0.34286517
},
"clusters" : [
{
"exemplar" : {
"index" : 27,
"weight" : 0.3567484
},
"clusters" : [ ],
"members" : [
{
"index" : 1,
"weight" : 0.4809605
},
{
"index" : 26,
"weight" : 0.30889755
}
]
},
{
"exemplar" : {
"index" : 39,
"weight" : 0.3273351
},
"clusters" : [ ],
"members" : [
{
"index" : 11,
"weight" : 1.0
},
{
"index" : 54,
"weight" : 0.68905324
}
]
}
],
"members" : [
{
"index" : 35,
"weight" : 0.5968796
}
]
},
{
"exemplar" : {
"index" : 93,
"weight" : 0.8734215
},
"clusters" : [
{
"exemplar" : {
"index" : 2,
"weight" : 0.25992832
},
"clusters" : [
{
"exemplar" : {
"index" : 16,
"weight" : 0.6403978
},
"clusters" : [ ],
"members" : [
{
"index" : 84,
"weight" : 0.5679599
},
{
"index" : 28,
"weight" : 0.11486327
}
]
}
],
"members" : [ ]
}
],
"members" : [
{
"index" : 24,
"weight" : 0.985477
},
{
"index" : 59,
"weight" : 0.3255045
},
{
"index" : 7,
"weight" : 0.17952214
}
]
}
],
"unclustered" : [
12,
34,
36,
79,
80,
89,
91,
92
]
}
}
}
As a general rule, Lingo4G represents cluster members as indices of rows or columns of the input
matrix or matrix rows. See
the reference for a specific clustering algorithm, such as clusters:​ap
, for example requests and ways to resolve cluster member indices.
The clusters
array consists of objects representing the top-level clusters the clustering stage
produced. Each cluster object can have the following properties:
-
exemplar
-
The cluster member that characterizes the entire cluster. The exact semantics of the exemplar member depends on the specific clustering algorithm. Some clustering algorithms do not produce cluster exemplars.
index
-
The index of the input matrix row or column that corresponds to this exemplar. You can use this index to identify the actual document or label the exemplar represents.
weight
-
The importance of the exemplar cluster member. The semantics of cluster member weights depends on the specific clustering algorithm.
clusters
-
The list of this cluster's child clusters.
members
-
The list of members of this cluster.
index
-
The index of the input matrix row or column that corresponds to this member. You can use this index to identify the actual document or label the cluster member represents. See the specific clustering algorithm reference for example requests and ways to resolve cluster member indices.
weight
-
The importance of the cluster member. The semantics of cluster member weights depends on the specific clustering algorithm.
The unclustered
array contains indices of matrix rows or columns that the clustering algorithm was not
able to organize into clusters.
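Because clusters can nest to any depth, code that consumes this output usually needs a recursive walk. The following Python sketch shows one way to do it. It assumes the response is already parsed into a dictionary shaped like the example above and that the stage is named clusters; the helper names iter_clusters and member_indices are hypothetical and not part of any Lingo4G client library.

```python
from typing import Iterator


def iter_clusters(clusters: list, depth: int = 0) -> Iterator[tuple]:
    """Yield (depth, cluster) pairs for every cluster, parents before children."""
    for cluster in clusters:
        yield depth, cluster
        yield from iter_clusters(cluster.get("clusters", []), depth + 1)


def member_indices(cluster: dict) -> list:
    """Return the indices of the direct members of one cluster."""
    return [member["index"] for member in cluster.get("members", [])]


# A trimmed-down copy of the example output shown above.
response = {
    "result": {
        "clusters": {
            "clusters": [
                {
                    "exemplar": {"index": 18, "weight": 0.34286517},
                    "clusters": [
                        {
                            "exemplar": {"index": 27, "weight": 0.3567484},
                            "clusters": [],
                            "members": [
                                {"index": 1, "weight": 0.4809605},
                                {"index": 26, "weight": 0.30889755},
                            ],
                        }
                    ],
                    "members": [{"index": 35, "weight": 0.5968796}],
                }
            ],
            "unclustered": [12, 34],
        }
    }
}

stage_result = response["result"]["clusters"]
for depth, cluster in iter_clusters(stage_result["clusters"]):
    exemplar = cluster.get("exemplar")  # some algorithms produce no exemplar
    exemplar_index = exemplar["index"] if exemplar else None
    print("  " * depth, exemplar_index, member_indices(cluster))
print("unclustered:", stage_result["unclustered"])
```

The walk visits parents before children, so the depth value can drive indentation or any other hierarchy-aware rendering.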
clusters:​ap
Clusters the matrix you provide using an optimized version of the Soft-Constraint Affinity Propagation algorithm.
{
"type": "clusters:ap",
"damping": 0.9,
"inputPreference": -1000,
"matrix": {
"type": "matrix:reference",
"auto": true
},
"maxIterations": 2000,
"minPruningGain": 0.3,
"minSteadyIterations": 100,
"softening": 0.2,
"threads": "auto"
}
Characteristics
Lingo4G's implementation of Affinity Propagation clustering produces clusters with the following characteristics:
-
Non-overlapping. Each member can belong to only one cluster or remain unclustered.
-
Described by an exemplar. Each cluster has one designated member - the exemplar - that serves as the most characteristic "description" of the other members in the cluster.
-
Connected to other clusters. The exemplar member can itself be a member of another cluster. This creates links between clusters which are similar in nature to the member–exemplar member relation.
The following figure illustrates the idea of cluster links applied to label clustering. If you applied Affinity Propagation Clustering to a set of labels related to web browsers, Lingo4G might create the following clusters of labels:
Example clusters of labels related to the topic of web browsers.
The graph shows five label clusters defined by the following exemplars: Browser, Firefox, Malware, Google Chrome and Html. The Firefox, Malware, Google Chrome and Html labels are members of the Browser cluster, but at the same time serve as exemplars to other labels, forming clusters of their own. This creates links between the Browser cluster and the other four clusters. Notice, however, that the link is not of the parent / child kind, but rather of the is related to type.
Example requests
The following request uses the clusters:​ap
stage to cluster the top
labels occurring in documents matching the clustering query.
{
"stages": {
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "clustering"
}
},
"labels": {
"type": "labels:fromDocuments",
"maxLabels": {
"type": "labelCount:fixed",
"value": 200
}
},
"clusters": {
"type": "clusters:ap",
"matrix": {
"type": "matrix:cooccurrenceLabelSimilarity",
"labels": {
"type": "labels:reference",
"use": "labels"
},
"documents": {
"type": "documents:reference",
"use": "documents"
}
}
}
},
"output": {
"stages": [
"labels",
"clusters"
]
}
}
Using clusters:​ap
to cluster labels based on how they co-occur in
documents.
In the response to the above request, member indices in the
clusters
stage result point to the list of labels. That is, member 0 in a cluster is the label at
the 0-th index in the result of the labels
stage.
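The index-to-label mapping described above amounts to a list lookup. The Python sketch below illustrates it under two assumptions: the label texts are already extracted from the labels stage result into a plain list (the exact shape of that result is not covered here), and the helper name resolve_members is hypothetical.

```python
def resolve_members(cluster: dict, items: list) -> dict:
    """Replace exemplar and member indices with the items they point to."""
    exemplar = cluster.get("exemplar")
    return {
        "exemplar": items[exemplar["index"]] if exemplar else None,
        "members": [items[m["index"]] for m in cluster.get("members", [])],
        "clusters": [resolve_members(c, items) for c in cluster.get("clusters", [])],
    }


# Hypothetical label texts, in the order the labels stage returned them.
labels = ["browser", "firefox", "malware", "google chrome", "html"]

# Hypothetical cluster, shaped like the clusters stage output.
cluster = {
    "exemplar": {"index": 0, "weight": 1.0},
    "clusters": [],
    "members": [{"index": 1, "weight": 0.8}, {"index": 4, "weight": 0.5}],
}

resolved = resolve_members(cluster, labels)
print(resolved)
```

The same function works for document clusters if you pass a list of documents instead of labels.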
The example request uses the
matrix:​cooccurrence​Label​Similarity
matrix, which computes similarities between labels based on how they co-occur with other labels in a set of
documents. Alternatively, if your index contains label embeddings, you could use the
matrix:​knn​Vectors​Similarity
matrix to compute similarities based on multidimensional embedding vectors.
The following request uses clusters:​ap
to cluster the top 10k documents
matching the clustering query.
{
"stages": {
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "clustering"
}
},
"clusters": {
"type": "clusters:ap",
"matrix": {
"type": "matrix:keywordDocumentSimilarity",
"documents": {
"type": "documents:reference",
"use": "documents"
}
},
"inputPreference": -1000,
"softening": 0.05
},
"clusterLabels":{
"type": "labelClusters:documentClusterLabels",
"clusters": {
"type": "clusters:reference",
"use": "clusters"
}
}
},
"output": {
"stages": [
"documents",
"clusters",
"clusterLabels"
]
}
}
Using clusters:​ap
to cluster documents based on the common keywords
they share.
Again, member indices in the clusters
stage result point to the list of documents. That is, member
0 in a cluster is the document at the 0-th index in the result of the
documents
stage.
The request uses the
matrix:​keyword​Document​Similarity
stage, which computes similarities between documents based on the number of common words and phrases they share.
If your index contains document embeddings, you could also use the
matrix:​knn​Vectors​Similarity
stage, which computes the similarities based on documents' multidimensional embedding vectors.
Finally, the example request uses the
label​Clusters:​document​Cluster​Labels
stage to identify the most frequent labels in each document cluster. Such labels may serve as a summary of the
contents of each cluster.
Tuning
The output of the clusters:​ap
stage depends not only on the clustering parameters you choose, but
also on the density of the similarity matrix you provide on input. The following subsections offer some advice
for a number of typical clustering tuning scenarios.
Number and size of clusters
The number of clusters you can get from the clusters:​ap
stage depends on the following factors, in the order of impact:
-
Density of the input similarity matrix. The more dense the similarity matrix, or, in other words, the more neighbors each label or document has in the similarity graph, the larger clusters you get.
To make the similarity matrix more dense, increase the number of per-row neighbors in the input matrix stage:
-
matrix:​keyword​Document​Similarity.max​Neighbors
. Beyond a certain number of neighbors, to get even larger clusters you may also need to increase keyword​Document​Similarity.max​Query​Labels​Per​Document
.
Lowering the number of neighbors lowers the density of the similarity matrix and therefore leads to a larger number of smaller clusters.
-
-
Input preference. Within a limited range, lowering the
input​Preference
property results in a smaller number of larger clusters. Input preference values below -1000 usually don't lead to further cluster size increases. For even larger clusters, increase the density of the similarity matrix.
-
Softening. Increasing the
softening
property above 0 to introduce cluster links also increases the total number of clusters.
Clustering time
The time required to cluster a specific input depends on two factors:
-
Number of elements in the input matrix. The larger or the more dense the matrix, the longer it takes to cluster the matrix. Reducing the density of the input matrix speeds up processing but also creates a larger number of smaller clusters.
Alternatively, when you intend to cluster a large number of documents only to get an overview of the topics covered by the documents, you can speed up processing by taking a random sample of the input collection.
-
Number of clustering iterations. Clustering time depends on the number of processing iterations Lingo4G performs when processing the input matrix. Lowering
max​Iterations
decreases the clustering time at the cost of lower clustering quality.
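When tuning, it can help to generate several request variants and compare their results side by side. The Python sketch below builds such variants from the property names this section mentions. The helper name ap_document_request is hypothetical, the parameter values are arbitrary illustrations rather than recommendations, and sending the request to Lingo4G is left out on purpose.

```python
import json


def ap_document_request(query: str, max_neighbors: int,
                        input_preference: float, softening: float) -> dict:
    """Build a clusters:ap request for documents matching `query`."""
    return {
        "stages": {
            "documents": {
                "type": "documents:byQuery",
                "query": {"type": "query:string", "query": query},
            },
            "clusters": {
                "type": "clusters:ap",
                "matrix": {
                    "type": "matrix:keywordDocumentSimilarity",
                    "documents": {"type": "documents:reference", "use": "documents"},
                    # More neighbors per row means a denser similarity matrix.
                    "maxNeighbors": max_neighbors,
                },
                "inputPreference": input_preference,
                "softening": softening,
            },
        },
        "output": {"stages": ["clusters"]},
    }


# Illustrative values only: denser matrix and lower preference on one side,
# sparser matrix and higher preference on the other.
variants = {
    "fewer, larger clusters": ap_document_request("clustering", 32, -1000, 0.0),
    "more, smaller clusters": ap_document_request("clustering", 8, -5, 0.0),
}

for name, request in variants.items():
    print(name)
    print(json.dumps(request["stages"]["clusters"], indent=2))
```

Changing one property at a time between variants makes it easier to attribute a difference in cluster count or size to a specific setting.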
damping
Determines the speed of the updates to the clustering solution.
We recommend leaving damping at the default value of 0.9. If you notice that clustering does not converge for a
specific data set (Lingo4G uses up all
max​Iterations
of clustering), first try increasing
max​Iterations
. If increasing the number of iterations does not lead to convergence, try increasing damping to a value in the
0.95 to 0.98 range.
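To see why higher damping trades speed for stability, consider how textbook Affinity Propagation damps its message updates: each new message value is a blend of the previous value and the newly computed one. The Python sketch below illustrates that blend only. Lingo4G's optimized implementation may differ in detail, and the function names are hypothetical.

```python
def damped_update(old: float, proposed: float, damping: float) -> float:
    """Blend the previous message value with the newly proposed one."""
    return damping * old + (1.0 - damping) * proposed


def iterations_to_settle(damping: float, tolerance: float = 0.01) -> int:
    """Count the updates a message starting at 0 needs to get within
    `tolerance` of a fixed proposed value of 1."""
    value, steps = 0.0, 0
    while abs(1.0 - value) > tolerance:
        value = damped_update(value, 1.0, damping)
        steps += 1
    return steps


for damping in (0.5, 0.9, 0.98):
    print(damping, iterations_to_settle(damping))
```

The higher the damping, the more updates a message needs to reach a new value. Slower updates reduce oscillation, but they also mean that raising damping may call for a higher iteration limit.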
input​Preference
Influences the number of clusters the algorithm produces.
The lower the input preference value, the lower the number of clusters. When input preference is 0, the number of clusters is usually higher than practical. Use the input preference value of -5 or lower to get a smaller set of clusters.
See the Tuning section for more information on how the input preference affects the characteristics of clusters.
matrix
The matrix of similarities for clustering.
Affinity Propagation clustering requires a square similarity matrix. You can use the following stages as input for clustering:
-
matrix:​cooccurrence​Label​Similarity
for co-occurrence-based clustering of labels.
-
matrix:​keyword​Document​Similarity
for keyword-based clustering of documents.
-
matrix:​knn​Vectors​Similarity
for embedding-based clustering of labels or documents.
-
matrix:​knn2d​Distance​Similarity
for spatial clustering of points on a 2d map.
max​Iterations
The maximum number of clustering iterations to perform.
When clustering more than about 10k documents or labels, consider increasing the number of allowed iterations to 5000 or even 10000 for better clustering results (at the cost of longer processing time).
min​Pruning​Gain
The minimum estimated relationship pruning gain required to apply the pruning during clustering.
Pruning may reduce the time of clustering for dense relationship matrices at the cost of memory usage increase by about 60%.
min​Steady​Iterations
The number of Affinity Propagation iterations during which the clusters must remain unchanged before Lingo4G assumes that the clustering process is complete.
If you notice clustering does not converge for a specific data set, try increasing
max​Iterations
first, and then possibly increasing
damping
, if still required.
softening
Determines the amount of internal structure to generate for large label clusters.
A value of 0 keeps the internal structure to a minimum, producing a flat cluster structure for most inputs. As you increase softening, Lingo4G splits larger clusters into smaller, connected subclusters. Values close to 1.0 produce the richest internal structure of clusters.
threads
The number of parallel threads to use to perform clustering.
clusters:​by​Values
Creates one cluster for each unique value from the list of values you provide.
{
"type": "clusters:byValues",
"values": {
"type": "values:reference",
"auto": true
}
}
The clusters:​by​Values
stage outputs a flat list of clusters. Each cluster has the following
properties:
name
-
The value that gave rise to this cluster.
members
-
Each member corresponds to one occurrence of the cluster's value on the list of values. Member
id
property is the index of the value's occurrence on the input list, member weight
is always 1.
If you combine this stage with
values:​from​Document​Field
, you can count how many times a specific field value occurred in a specific list of documents. A typical use
case for such counting is a k-nearest-neighbors (kNN) classifier.
The following request is a simple kNN classifier that suggests an arXiv category for the piece of text you provide.
{
"name": "arXiv category suggestions (kNN)",
"variables": {
"textToClassify": {
"name": "Text to classify",
"comment": "Paper title and abstract for which to generate category suggestions. Provide at least one paragraph of text for best results.",
"value": "Word Mover's Embedding: From Word2Vec to Document Embedding. While the celebrated Word2Vec technique yields semantically rich representations for individual words, there has been relatively less success in extending to generate unsupervised sentences or documents embeddings. Recent work has demonstrated that a distance measure between documents called Word Mover's Distance (WMD) that aligns semantically similar words, yields unprecedented KNN classification accuracy. However, WMD is expensive to compute, and it is hard to extend its use beyond a KNN classifier. In this paper, we propose the Word Mover's Embedding (WME), a novel approach to building an unsupervised document (sentence) embedding from pre-trained word embeddings. In our experiments on 9 benchmark text classification datasets and 22 textual similarity tasks, the proposed technique consistently matches or outperforms state-of-the-art techniques, with significantly higher accuracy on problems of short length."
},
"classFieldName": {
"name": "Document field to suggest",
"comment": "The document field whose value to suggest for the provided text.",
"value": "category"
},
"maxKeywordSimilarDocuments": {
"name": "Max keyword-similar documents",
"comment": "Maximum number of similar documents to find using the keyword method.",
"value": 20
}
},
"stages": {
"seedLabels": {
"type": "labels:fromText",
"text": {
"@var": "textToClassify"
}
},
"keywordMlt": {
"type": "documents:byQuery",
"query": {
"type": "query:forLabels",
"labels": {
"type": "labels:reference",
"use": "seedLabels"
}
},
"limit": {
"@var": "maxKeywordSimilarDocuments"
}
},
"keywordMltContent": {
"type": "documentContent",
"documents": {
"type": "documents:reference",
"use": "keywordMlt"
},
"queries": {
"k": {
"type": "query:fromDocuments",
"documents": {
"type": "documents:reference",
"use": "keywordMlt"
}
}
},
"limit": {
"@var": "maxKeywordSimilarDocuments"
}
},
"classes": {
"type": "clusters:byValues",
"values": {
"type": "values:fromDocumentField",
"documents": {
"type": "documents:reference",
"use": "keywordMlt"
},
"multipleValues": "COLLECT_ALL",
"fieldName": {
"@var": "classFieldName"
}
}
}
},
"output": {
"stages": [
"classes",
"keywordMlt",
"keywordMltContent",
"seedLabels"
]
}
}
Using clusters:​by​Values
to compute arXiv category suggestions for a
piece of text using a k-nearest neighbors classifier.
The classifier request performs the following steps:
-
The
seed​Labels
stage extracts top-frequency labels from the input text using
labels:​from​Text
.
-
The
keyword​Mlt
stage uses
documents:​by​Query
and
query:​for​Labels
to find more documents containing
seed​Labels
.
-
The
classes
stage uses
values:​from​Document​Field
to fetch the arXiv category field value for each document returned by the
keyword​Mlt
stage. Then, it uses
clusters:​by​Values
to compute the most frequent categories. These are likely to be a good category choice for the input passage.
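On the client side, turning the output of the classes stage into a ranked list of suggestions is a matter of counting members per cluster. The Python sketch below assumes you have already extracted the flat list of clusters from the stage result. The helper name rank_classes and the sample data are hypothetical.

```python
def rank_classes(clusters: list) -> list:
    """Return (value, votes) pairs, most frequent value first."""
    votes = [(cluster["name"], len(cluster["members"])) for cluster in clusters]
    return sorted(votes, key=lambda pair: pair[1], reverse=True)


# Made-up stage output: one cluster per distinct category value, one member
# per similar document carrying that value. Member index properties are
# omitted for brevity; only the member count matters here.
classes = [
    {"name": "cs.LG", "members": [{"weight": 1}, {"weight": 1}]},
    {"name": "cs.CL", "members": [{"weight": 1}, {"weight": 1}, {"weight": 1}]},
    {"name": "stat.ML", "members": [{"weight": 1}]},
]

for value, count in rank_classes(classes):
    print(value, count)
```

Each member counts as one vote from one of the nearest neighbors, so the first pair on the list is the classifier's top suggestion.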
values
The list of values for which to create clusters.
clusters:​cd
Clusters items based on the similarity matrix you provide. Uses the Community Detection graph clustering algorithm.
{
"type": "clusters:cd",
"linkDensityThreshold": 0.1,
"matrix": {
"type": "matrix:reference",
"auto": true
},
"maxIterations": 20,
"randomSeed": 0
}
Characteristics
Community Detection clustering produces clusterings with the following characteristics:
-
Flat. The cluster hierarchy has one level; clusters don't have subclusters.
-
Non-overlapping. Each item belongs to only one cluster or remains unclustered.
-
Described by an exemplar. Each cluster has one designated member - the exemplar - that serves as the most characteristic "description" of the other members in the cluster.
The Community Detection algorithm transforms the input similarity matrix into a graph, taking items (for example
documents or labels) as graph nodes and linking nodes with edges if two items are similar according to the input
matrix. Then, the algorithm tries to find densely-connected groups of nodes in the graph. The algorithm tries to
ensure that the density of in-cluster edges is larger than
link​Density​Threshold
while the density of the out-of-cluster edges is lower than the threshold.
Compared to Affinity Propagation clustering, this algorithm has only one parameter affecting the number and size of clusters, so it may be easier to tune. Additionally, despite the single-threaded implementation, this algorithm is usually faster than Affinity Propagation clustering.
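The density criterion described above can be made concrete with a small example. The Python sketch below builds an undirected graph from per-row neighbor lists and measures link density within and between two candidate groups. It is an illustration, not Lingo4G's algorithm: the density definition (present links divided by possible links), the function names, the sample data and the threshold value are all assumptions.

```python
from itertools import combinations


def build_edges(neighbors: dict) -> set:
    """Turn row -> neighbor-list entries into a set of undirected edges."""
    return {frozenset((row, col))
            for row, cols in neighbors.items()
            for col in cols if col != row}


def in_density(group: list, edges: set) -> float:
    """Fraction of possible links inside a group that are present."""
    pairs = [frozenset(pair) for pair in combinations(group, 2)]
    return sum(pair in edges for pair in pairs) / len(pairs) if pairs else 0.0


def between_density(a: list, b: list, edges: set) -> float:
    """Fraction of possible links between two groups that are present."""
    pairs = [frozenset((x, y)) for x in a for y in b]
    return sum(pair in edges for pair in pairs) / len(pairs) if pairs else 0.0


# Hypothetical similarity matrix: for each item, the items it is similar to.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
edges = build_edges(neighbors)

group_a, group_b = [0, 1, 2], [3, 4, 5]
threshold = 0.5  # arbitrary value for illustration

print(in_density(group_a, edges), in_density(group_b, edges))
print(between_density(group_a, group_b, edges))
print(in_density(group_a, edges) > threshold > between_density(group_a, group_b, edges))
```

In this toy graph both groups are fully connected internally, while only one of the nine possible links crosses between them, so the split satisfies the criterion for the chosen threshold.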
Example requests
The following request uses the clusters:​cd
stage to cluster the top
labels occurring in documents matching the clustering query.
{
"stages": {
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "clustering"
}
},
"labels": {
"type": "labels:fromDocuments",
"maxLabels": {
"type": "labelCount:fixed",
"value": 200
}
},
"clusters": {
"type": "clusters:cd",
"matrix": {
"type": "matrix:knnVectorsSimilarity",
"vectors": {
"type": "vectors:precomputedLabelEmbeddings"
}
}
}
},
"output": {
"stages": [
"labels",
"clusters"
]
}
}
Using clusters:​cd
to cluster labels based on the similarity of their
embedding vectors.
In the response to the above request, member indices in the
clusters
stage result point to the list of labels. That is, member 0 in a cluster is the label at
the 0-th index in the result of the labels
stage.
The following request uses clusters:​cd
to cluster the top 10k documents
matching the clustering query.
{
"stages": {
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "clustering"
}
},
"clusters": {
"type": "clusters:cd",
"matrix": {
"type": "matrix:keywordDocumentSimilarity",
"documents": {
"type": "documents:reference",
"use": "documents"
}
}
},
"clusterLabels":{
"type": "labelClusters:documentClusterLabels",
"clusters": {
"type": "clusters:reference",
"use": "clusters"
}
}
},
"output": {
"stages": [
"documents",
"clusters",
"clusterLabels"
]
}
}
Using clusters:​cd
to cluster documents based on the common keywords
they share.
Again, member indices in the clusters
stage result point to the list of documents. That is, member
0 in a cluster is the document at the 0-th index in the result of the documents
stage.
The example request uses the
label​Clusters:​document​Cluster​Labels
stage to identify the most frequent labels in each document cluster. Such labels may serve as a summary of the
contents of each cluster.
Tuning
With only one parameter affecting the number and size of clusters, Community Detection clustering is very easy
to tune. To increase the number of clusters, increase the value of the
link​Density​Threshold
parameter. To decrease the number of clusters, lower the link density threshold.
link​Density​Threshold
Determines the number and size of clusters the algorithm creates.
The number of clusters the algorithm creates grows with the link density threshold. For smaller values of the threshold, the algorithm produces a smaller number of large clusters. Increasing the threshold value causes the algorithm to produce more clusters of smaller size.
matrix
Community Detection clustering requires a square similarity matrix. You can use the following stages as input for clustering:
-
matrix:​cooccurrence​Label​Similarity
for co-occurrence-based clustering of labels.
-
matrix:​keyword​Document​Similarity
for keyword-based clustering of documents.
-
matrix:​knn​Vectors​Similarity
for embedding-based clustering of labels or documents.
-
matrix:​knn2d​Distance​Similarity
for spatial clustering of points on a 2d map.
max​Iterations
The maximum number of clustering improvement iterations to perform. In most cases, the default maximum number of iterations should ensure high-quality clusters.
random​Seed
The seed value to use to initialize the random number generator.
Different seed values may lead to slightly different clustering results for the same input similarity matrix.
clusters:​from​Matrix​Columns
Creates clusters based on the matrix rows you provide. Each column gives rise to one cluster with cluster members corresponding to the values of the column.
{
"type": "clusters:fromMatrixColumns",
"limit": 100,
"matrixRows": {
"type": "matrixRows:reference",
"auto": true
},
"sortOrder": "DESCENDING",
"weightAggregation": "SUM"
}
This stage performs the following steps:
-
For each column of the input
matrix​Rows
, aggregate the column's values using the
weight​Aggregation
function.
-
Sort columns by their aggregated value computed in step 1, according to the
sort​Order
.
-
Return a flat list of clusters corresponding to up to
limit
first columns on the sorted list. Each cluster has the following properties:
exemplar
-
Describes the matrix column that gave rise to this cluster.
index
-
Index of the column that gave rise to this cluster.
weight
- The aggregate of column values computed in step 1.
members
-
Describes the individual values (rows) of the column that gave rise to this cluster.
index
-
Index of the row.
weight
- Matrix value at this member's row and this cluster's column coordinate.
The clusters:​from​Matrix​Columns
stage is an extension of the
documents:​from​Matrix​Columns
. While the latter outputs only documents corresponding to the top-valued columns, this stage also outputs the
indices and values in rows that contributed to the specific column's value.
Like with
documents:​from​Matrix​Columns
, you can use this stage to select top-scoring documents where the score is an aggregation of a number of values.
For example, if you build matrix​Rows
of
cross-similarities between a set of cs.* and physics.* arXiv papers,
clusters:​from​Matrix​Columns
can reveal the top
physics.* papers that are most similar to cs.* papers, showing where the two areas overlap.
Unlike
documents:​from​Matrix​Columns
, this stage also outputs the cs.* papers that contribute to the aggregated similarity value of each
physics.* paper.
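The three steps this stage performs can be emulated in a few lines of Python, which may help when reasoning about its output. The sketch below is not Lingo4G's implementation. It assumes the matrix rows are available as a list of dictionaries mapping column index to value, uses summation as the aggregation function and puts the columns with the largest aggregates first. The function name and sample data are hypothetical.

```python
def clusters_from_columns(rows: list, limit: int) -> list:
    """Emulate column aggregation; rows[i] maps column index -> value."""
    # Collect (row index, value) pairs per column.
    columns = {}
    for row_index, row in enumerate(rows):
        for col_index, value in row.items():
            columns.setdefault(col_index, []).append((row_index, value))

    # Step 1: aggregate each column's values (summation assumed here).
    aggregated = {col: sum(value for _, value in entries)
                  for col, entries in columns.items()}

    # Step 2: sort columns by aggregate, largest first.
    order = sorted(aggregated, key=aggregated.get, reverse=True)

    # Step 3: convert the first `limit` columns into clusters.
    return [
        {
            "exemplar": {"index": col, "weight": aggregated[col]},
            "members": [{"index": row, "weight": value}
                        for row, value in columns[col]],
        }
        for col in order[:limit]
    ]


# Hypothetical cross-similarities: rows are cs.* papers, columns physics.* papers.
rows = [
    {0: 0.5, 2: 0.25},
    {0: 0.75, 1: 0.25},
    {2: 0.25},
]

result = clusters_from_columns(rows, limit=2)
for cluster in result:
    print(cluster)
```

In this example, the first cluster corresponds to the physics.* paper in column 0, and its members are the two cs.* papers (rows 0 and 1) that contribute to its aggregated similarity.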
limit
The maximum number of top-scoring matrix columns to convert into clusters.
matrix​Rows
The matrix rows whose columns to aggregate.
sort​Order
Determines the sorting order for the aggregated column values.
-
ASCENDING
-
Creates up to
limit
clusters corresponding to columns with the smallest aggregated values.
-
DESCENDING
-
Creates up to
limit
clusters corresponding to columns with the largest aggregated values.
-
UNSPECIFIED
-
Creates up to
limit
clusters in the order their corresponding columns appear in the input
matrix​Rows
.
weight​Aggregation
The column value aggregation function.
clusters:​transformed
Transforms the clustering you provide by flattening the cluster hierarchy and truncating cluster member lists.
{
"type": "clusters:transformed",
"clusters": {
"type": "clusters:reference",
"auto": true
},
"maxLevel": 1,
"maxMembersPerCluster": 5
}
You can use this stage to create a quick overview of a multi-level clustering of a large number of items.
clusters
The input clustering to transform.
max​Level
The maximum hierarchy level.
The maximum number of hierarchy levels to preserve in the transformed clustering. Set to
1
to obtain flat clustering.
The transformation moves all members in lower-level clusters to the closest parent cluster preserved in the transformed clustering.
max​Members​Per​Cluster
The maximum number of members to retain in each cluster.
The transformation moves all cluster members past the maximum to the unclustered members list.
The transformer applies member list truncation after hierarchy flattening.
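The order of operations matters: flattening first, truncation second. The Python sketch below emulates both steps on parsed clustering output to make that order concrete. It is a simplified illustration, not Lingo4G's implementation: it keeps members in their existing list order when truncating, drops the exemplars of collapsed subclusters, and treats top-level clusters as level 1. The function names are hypothetical.

```python
def collect_members(cluster: dict) -> list:
    """Gather the members of a cluster and of all its descendants."""
    members = list(cluster.get("members", []))
    for child in cluster.get("clusters", []):
        members.extend(collect_members(child))
    return members


def transform(clusters: list, unclustered: list,
              max_level: int, max_members: int) -> tuple:
    """Flatten below `max_level`, then truncate member lists."""
    overflow = []

    def walk(cluster: dict, level: int) -> dict:
        if level < max_level:
            members = list(cluster.get("members", []))
            children = [walk(c, level + 1) for c in cluster.get("clusters", [])]
        else:
            # Flattening: pull members of deeper clusters into this one.
            members = collect_members(cluster)
            children = []
        # Truncation: members past the maximum become unclustered.
        overflow.extend(m["index"] for m in members[max_members:])
        result = {"clusters": children, "members": members[:max_members]}
        if "exemplar" in cluster:
            result["exemplar"] = cluster["exemplar"]
        return result

    transformed = [walk(c, 1) for c in clusters]
    return transformed, list(unclustered) + overflow


# Hypothetical two-level clustering.
clusters = [{
    "exemplar": {"index": 18, "weight": 0.34},
    "clusters": [
        {"exemplar": {"index": 27, "weight": 0.36}, "clusters": [],
         "members": [{"index": 1, "weight": 0.48}, {"index": 26, "weight": 0.31}]},
        {"exemplar": {"index": 39, "weight": 0.33}, "clusters": [],
         "members": [{"index": 11, "weight": 1.0}, {"index": 54, "weight": 0.69}]},
    ],
    "members": [{"index": 35, "weight": 0.6}],
}]

flat, unclustered = transform(clusters, [12], max_level=1, max_members=3)
print([m["index"] for m in flat[0]["members"]], unclustered)
```

With a maximum level of 1, the five members from both levels first merge into the single top-level cluster; truncation to three members then moves the remaining two to the unclustered list.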
clusters:​with​Remapped​Documents
Translates clusters from one document space to another, filtering out non-matching documents.
{
"type": "clusters:withRemappedDocuments",
"clusters": {
"type": "clusters:reference",
"auto": true
},
"exemplarsFrom": null,
"exemplarsTo": null,
"membersFrom": null,
"membersTo": null
}
This stage is fairly specialized and has very rare use cases. You may need it when you have a list of clusters created for a certain set of documents, but want to re-map cluster member indices to a different but related set of documents, such as a subset or a superset of the one that gave rise to clusters.
clusters
The clusters whose exemplar and member indices to remap.
exemplars​From
The list of documents which gave rise to cluster exemplars.
exemplars​To
The list of documents to which to translate the cluster exemplars.
members​From
The list of documents which gave rise to cluster members.
members​To
The list of documents to which to translate the cluster members.
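Conceptually, remapping translates a position in one document list into the position of the same document in another list, and drops documents the target list does not contain. The Python sketch below illustrates that idea for a list of member indices. It assumes documents can be compared by some identifier. The function name and the identifiers are hypothetical, and the sketch does not reflect how Lingo4G implements the stage.

```python
def remap_indices(indices: list, docs_from: list, docs_to: list) -> list:
    """Translate positions in docs_from into positions in docs_to,
    dropping documents that docs_to does not contain."""
    position_in_target = {doc_id: i for i, doc_id in enumerate(docs_to)}
    remapped = []
    for index in indices:
        doc_id = docs_from[index]
        if doc_id in position_in_target:
            remapped.append(position_in_target[doc_id])
    return remapped


# Hypothetical document identifiers.
docs_from = ["a", "b", "c", "d"]  # documents that gave rise to the clusters
docs_to = ["d", "b", "x"]         # document list to translate to

member_indices = [0, 1, 3]        # points at "a", "b" and "d" in docs_from
remapped = remap_indices(member_indices, docs_from, docs_to)
print(remapped)
```

Here document "a" has no counterpart in the target list and is filtered out, while "b" and "d" receive their positions in the target list.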
Consumers of clusters:​*
The following stages and components take clusters:​*
as
input: