clusters
The clusters:* stages organize labels or documents into larger clusters based on the criteria of your choice.
Lingo4G clustering stages take matrix:* or matrixRows:* on input rather than lists of documents or labels. Therefore, you can use the same clustering algorithm, such as clusters:ap, to cluster either documents or labels. It is the input similarity matrix that defines the entities to cluster and the similarity function.
You can use the following clustering stages in your analysis requests:

clusters:ap

Clusters entities based on the similarity matrix you provide. Uses an optimized version of the Soft-Constraint Affinity Propagation algorithm.

clusters:byValues

Creates clusters based on the list of values, converting each distinct value into a cluster.

clusters:fromMatrixColumns

Creates clusters based on the matrix rows you provide. Each column gives rise to one cluster with cluster members corresponding to the values of the column.

clusters:withRemappedDocuments

Translates clusters from one document space to another, filtering out non-matching documents.

clusters:reference

References the results of another clusters:* stage defined in the request.
The JSON output of the clusters:* stages consists of a clusters array and an unclustered array, both described below.
As a general rule, Lingo4G represents cluster members as indices of rows or columns of the input matrix or matrix rows. See the reference for a specific clustering algorithm, such as clusters:ap, for example requests and ways to resolve cluster member indices.
The clusters array consists of objects representing the top-level clusters the clustering stage produced. Each cluster object can have the following properties:

exemplar

The cluster member that characterizes the entire cluster. The exact semantics of the exemplar member depends on the specific clustering algorithm. Some clustering algorithms do not produce cluster exemplars.

  index

  The index of the input matrix row or column that corresponds to this exemplar. You can use this index to identify the actual document or label the exemplar represents.

  weight

  The importance of the exemplar cluster member. The semantics of cluster member weights depends on the specific clustering algorithm.

clusters

The list of this cluster's child clusters.

members

The list of members of this cluster.

  index

  The index of the input matrix row or column that corresponds to this member. You can use this index to identify the actual document or label the cluster member represents. See the specific clustering algorithm reference for example requests and ways to resolve cluster member indices.

  weight

  The importance of the cluster member. The semantics of cluster member weights depends on the specific clustering algorithm.
The unclustered array contains indices of matrix rows or columns that the clustering algorithm was not able to organize into clusters.
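To make the property descriptions above more concrete, the following Python sketch walks a result object shaped the way this section describes: a clusters array with optional exemplar, members and child clusters, plus an unclustered array. The sample values and the helper names collect_member_indices and summarize are invented for illustration and are not part of Lingo4G.

```python
from typing import Any

# Hypothetical result, shaped after the properties described above.
# All indices and weights are invented for illustration only.
result: dict[str, Any] = {
    "clusters": [
        {
            "exemplar": {"index": 0, "weight": 1.0},
            "members": [
                {"index": 0, "weight": 1.0},
                {"index": 3, "weight": 0.7},
            ],
            "clusters": [
                {
                    "exemplar": {"index": 5, "weight": 0.9},
                    "members": [
                        {"index": 5, "weight": 0.9},
                        {"index": 6, "weight": 0.4},
                    ],
                }
            ],
        },
        {
            # Some algorithms do not produce exemplars, so the key may be absent.
            "members": [{"index": 1, "weight": 1.0}, {"index": 4, "weight": 1.0}],
        },
    ],
    "unclustered": [2],
}


def collect_member_indices(cluster: dict[str, Any]) -> list[int]:
    """Return member indices of a cluster and, recursively, of its child clusters."""
    indices = [member["index"] for member in cluster.get("members", [])]
    for child in cluster.get("clusters", []):
        indices.extend(collect_member_indices(child))
    return indices


def summarize(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Build one summary row per top-level cluster."""
    rows = []
    for cluster in result.get("clusters", []):
        exemplar = cluster.get("exemplar")
        rows.append(
            {
                "exemplar_index": exemplar["index"] if exemplar else None,
                "member_indices": collect_member_indices(cluster),
            }
        )
    return rows


summary = summarize(result)
```

The indices in each row still need to be resolved against the documents or labels that produced the input matrix, as the reference for each clustering algorithm explains.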
clusters:ap
Clusters the matrix you provide using an optimized version of the Soft-Constraint Affinity Propagation algorithm.
{
  "type": "clusters:ap",
  "damping": 0.9,
  "inputPreference": -1000,
  "matrix": {
    "type": "matrix:reference",
    "auto": true
  },
  "maxIterations": 2000,
  "minPruningGain": 0.3,
  "minSteadyIterations": 100,
  "softening": 0.2,
  "threads": "auto"
}
Characteristics
Lingo4G's implementation of Affinity Propagation clustering produces clusters with the following characteristics:

Non-overlapping. Each member can belong to only one cluster or remain unclustered.

Described by an exemplar. Each cluster has one designated member, the exemplar, that serves as the most characteristic "description" of the other members in the cluster.

Connected to other clusters. The exemplar member can itself be a member of another cluster. This creates links between clusters which are similar in nature to the relation between a member and its exemplar.
The following figure illustrates the idea of cluster links applied to label clustering. If you applied Affinity Propagation clustering to a set of labels related to web browsers, Lingo4G might create the following clusters of labels:
The graph shows five label clusters defined by the following exemplars: Browser, Firefox, Malware, Google Chrome and Html. The Firefox, Malware, Google Chrome and Html labels are members of the Browser cluster, but at the same time serve as exemplars to other labels, forming clusters of their own. This creates links between the Browser cluster and the other four clusters. Notice, however, that the link is not of the parent-child kind, but rather of the "is related to" type.
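The idea of cluster links can be sketched in a few lines of Python. The example below models clusters as a plain mapping from exemplar to members and reports a link whenever a member of one cluster is the exemplar of another. The Browser cluster follows the example in the text. The members of the other four clusters are invented placeholders, and the mapping is a simplification for illustration rather than the format Lingo4G returns.

```python
# Exemplar -> members. Only the Browser cluster comes from the example above;
# the placeholder members of the other clusters are made up.
clusters = {
    "Browser": ["Firefox", "Malware", "Google Chrome", "Html"],
    "Firefox": ["placeholder label A"],
    "Malware": ["placeholder label B"],
    "Google Chrome": ["placeholder label C"],
    "Html": ["placeholder label D"],
}


def find_links(clusters: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (cluster, linked cluster) pairs: a member that is itself an exemplar."""
    links = []
    for exemplar, members in clusters.items():
        for member in members:
            if member in clusters and member != exemplar:
                links.append((exemplar, member))
    return links


links = find_links(clusters)
```

Each pair reads as "is related to": the second cluster is not nested inside the first, it only shares its exemplar with it.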
Example requests
The following request uses the clusters:ap stage to cluster the top labels occurring in documents matching the clustering query.
In the response to the above request, member indices in the clusters stage result point to the list of labels. That is, member 0 in a cluster is the label at the 0th index in the result of the labels stage.
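Resolving member indices amounts to a list lookup. The Python sketch below assumes you have already extracted the label texts from the labels stage result into a plain list, in the order the stage returned them. The exact shape of that result is not shown in this section, so the list, the sample clusters and the helper name resolve_members are hypothetical.

```python
# Hypothetical label texts, in the order the labels stage returned them.
labels = ["browser", "firefox", "malware", "html"]

# Hypothetical clusters stage result fragment using the documented properties.
clusters = [
    {
        "exemplar": {"index": 0, "weight": 1.0},
        "members": [
            {"index": 0, "weight": 1.0},
            {"index": 1, "weight": 0.8},
            {"index": 3, "weight": 0.5},
        ],
    }
]


def resolve_members(clusters: list[dict], labels: list[str]) -> list[dict]:
    """Replace exemplar and member indices with the labels they point to."""
    resolved = []
    for cluster in clusters:
        exemplar = cluster.get("exemplar")
        resolved.append(
            {
                "exemplar": labels[exemplar["index"]] if exemplar else None,
                "members": [labels[m["index"]] for m in cluster.get("members", [])],
            }
        )
    return resolved


resolved = resolve_members(clusters, labels)
```

The same lookup applies to document clustering, with the list of documents taking the place of the list of labels.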
The example request uses the matrix:cooccurrenceLabelSimilarity matrix, which computes similarities between labels based on how they co-occur with other labels in a set of documents. Alternatively, if your index contains label embeddings, you could use the matrix:knnVectorsSimilarity matrix to compute similarities based on multidimensional embedding vectors.
The following request uses clusters:ap to cluster the top 10k documents matching the clustering query.
Again, member indices in the clusters stage result point to the list of documents. That is, member 0 in a cluster is the document at the 0th index in the result of the documents stage.
The request uses the matrix:keywordDocumentSimilarity stage, which computes similarities between documents based on the number of common words and phrases they share. If your index contains document embeddings, you could also use the matrix:knnVectorsSimilarity stage, which computes the similarities based on documents' multidimensional embedding vectors.
Finally, the example request uses the labelClusters:documentClusterLabels stage to identify the most frequent labels in each document cluster. Such labels may serve as a summary of the contents of each cluster.
Tuning
The output of the clusters:ap stage depends not only on the clustering parameters you choose, but also on the density of the similarity matrix you provide on input. The following subsections offer some advice for a number of typical clustering tuning scenarios.
Number and size of clusters
The number of clusters you can get from the clusters:ap stage depends on the following factors, in the order of impact:

Density of the input similarity matrix. The denser the similarity matrix, or, in other words, the more neighbors each label or document has in the similarity graph, the larger the clusters you get.
To make the similarity matrix denser, increase the number of per-row neighbors in the input matrix stage: matrix:keywordDocumentSimilarity.maxNeighbors. Beyond a certain number of neighbors, to get even larger clusters you may also need to increase keywordDocumentSimilarity.maxQueryLabelsPerDocument.
Lowering the number of neighbors lowers the density of the similarity matrix and therefore leads to a larger number of smaller clusters.


Input preference. Within a limited range, lowering the inputPreference property results in a smaller number of larger clusters. Input preference values below -1000 usually don't lead to further cluster size increases. For even larger clusters, increase the density of the similarity matrix.

Softening. Increasing the softening property above 0 to introduce cluster links also increases the total number of clusters.
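To illustrate what matrix density means here, the Python sketch below represents a similarity matrix as one dictionary of neighbor similarities per row and measures the average number of neighbors per row before and after capping it. The data, the row representation and the helper names are assumptions made for this example. The cap only mimics the effect of a per-row neighbor limit and is not how Lingo4G builds its matrices.

```python
# Hypothetical sparse similarity matrix: one {neighbor index: similarity} dict per row.
rows = [
    {1: 0.9, 2: 0.4, 3: 0.1},
    {0: 0.9, 2: 0.5},
    {0: 0.4, 1: 0.5, 3: 0.3},
    {0: 0.1, 2: 0.3},
]


def limit_neighbors(rows: list[dict[int, float]], max_neighbors: int) -> list[dict[int, float]]:
    """Keep only the max_neighbors strongest similarities in each row."""
    return [
        dict(sorted(row.items(), key=lambda item: item[1], reverse=True)[:max_neighbors])
        for row in rows
    ]


def average_neighbors(rows: list[dict[int, float]]) -> float:
    """Average number of neighbors per row, a simple proxy for matrix density."""
    return sum(len(row) for row in rows) / len(rows)


dense_average = average_neighbors(rows)
sparse_rows = limit_neighbors(rows, 2)
sparse_average = average_neighbors(sparse_rows)
```

With fewer neighbors per row, each element connects to fewer candidates, which is why a sparser matrix tends to produce more, smaller clusters.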
Clustering time
The time required to cluster a specific input depends on two factors:

Number of elements in the input matrix. The larger or the more dense the matrix, the longer it takes to cluster the matrix. Reducing the density of the input matrix speeds up processing but also creates a larger number of smaller clusters.
Alternatively, when you intend to cluster a large number of documents only to get an overview of the topics covered by the documents, you can speed up processing by taking a random sample of the input collection.

Number of clustering iterations. Clustering time depends on the number of processing iterations Lingo4G performs when processing the input matrix. Lowering maxIterations decreases the clustering time at the cost of lower clustering quality.
damping
Determines the speed of the updates to the clustering solution.
We recommend leaving damping at the default value of 0.9. If you notice that clustering does not converge for a specific data set (Lingo4G uses up all maxIterations of clustering), first try increasing maxIterations. If increasing the number of iterations does not lead to convergence, try increasing damping to reach the 0.95 to 0.98 range.
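The escalation order described above can be expressed as a small helper. The Python sketch below takes a clusters:ap configuration that did not converge and returns the next configuration to try: more iterations first, then higher damping. The doubling step, the cap of 10000 iterations and the function name next_attempt are arbitrary choices for this example, and detecting non-convergence is left to the caller.

```python
from typing import Optional


def next_attempt(
    config: dict,
    max_iterations_cap: int = 10000,
    damping_steps: tuple = (0.95, 0.98),
) -> Optional[dict]:
    """Return an adjusted copy of a non-converging config, or None if nothing is left to try."""
    adjusted = dict(config)
    # First, allow more iterations.
    if adjusted["maxIterations"] < max_iterations_cap:
        adjusted["maxIterations"] = min(adjusted["maxIterations"] * 2, max_iterations_cap)
        return adjusted
    # Only then raise damping, step by step.
    for damping in damping_steps:
        if adjusted["damping"] < damping:
            adjusted["damping"] = damping
            return adjusted
    return None


config = {"type": "clusters:ap", "damping": 0.9, "maxIterations": 2000}

# Collect the sequence of settings the helper would suggest.
attempts = []
current = config
while (current := next_attempt(current)) is not None:
    attempts.append((current["maxIterations"], current["damping"]))
```

The helper never modifies the configuration it receives, so you can keep the original settings for comparison.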
inputPreference
Influences the number of clusters the algorithm produces.
The lower the input preference value, the lower the number of clusters. When input preference is 0, the number of clusters is usually higher than practical. Use an input preference value of -5 or lower to get a smaller set of clusters.
See the Tuning section for more information on how the input preference affects the characteristics of clusters.
matrix
The matrix of similarities for clustering.
Affinity Propagation clustering requires a square similarity matrix. You can use the following stages as input for clustering:

matrix:cooccurrenceLabelSimilarity for co-occurrence-based clustering of labels.
matrix:keywordDocumentSimilarity for keyword-based clustering of documents.
matrix:knnVectorsSimilarity for embedding-based clustering of labels or documents.
matrix:knn2dDistanceSimilarity for spatial clustering of points on a 2d map.
maxIterations
The maximum number of clustering iterations to perform.
When clustering more than about 10k documents or labels, consider increasing the number of allowed iterations to 5000 or even 10000 for better clustering results (at the cost of longer processing time).
minPruningGain
The minimum estimated relationship pruning gain required to apply the pruning during clustering.
Pruning may reduce the time of clustering for dense relationship matrices at the cost of increasing memory usage by about 60%.
minSteadyIterations
The number of Affinity Propagation iterations during which the clusters must not change before Lingo4G assumes that the clustering process is complete.
If you notice clustering does not converge for a specific data set, try increasing maxIterations first, and then possibly increasing damping, if still required.
softening
Determines the amount of internal structure to generate for large label clusters.
A value of 0 keeps the internal structure to a minimum, producing a flat cluster structure for most inputs. As you increase softening, Lingo4G splits larger clusters into smaller, connected subclusters. Values close to 1.0 produce the richest internal structure of clusters.
threads
The number of parallel threads to use to perform clustering.
clusters:byValues
Creates one cluster for each unique value from the list of values you provide.
{
  "type": "clusters:byValues",
  "values": {
    "type": "values:reference",
    "auto": true
  }
}
The clusters:byValues stage outputs a flat list of clusters. Each cluster has the following properties:

name

The value that gave rise to this cluster.

members

Each member corresponds to one occurrence of the cluster's value on the list of values. The member's id property is the index of the value's occurrence on the input list; the member's weight is always 1.
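The grouping this stage performs can be approximated in Python as shown below. The sketch follows the properties described above (name, members, id and weight), but the input values are made up, the function name clusters_by_values is hypothetical, and the order in which Lingo4G returns the clusters may differ from the first-occurrence order used here.

```python
def clusters_by_values(values: list[str]) -> list[dict]:
    """Create one cluster per distinct value; members record where the value occurred."""
    grouped: dict[str, list[dict]] = {}
    for position, value in enumerate(values):
        # Each occurrence becomes a member whose id is its index on the input list.
        grouped.setdefault(value, []).append({"id": position, "weight": 1})
    return [{"name": name, "members": members} for name, members in grouped.items()]


# Hypothetical list of values, for example field values of four documents.
values = ["cs.AI", "cs.LG", "cs.AI", "physics.optics"]
clusters = clusters_by_values(values)
```

The number of members in each cluster is the number of times its value occurred on the input list.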
If you combine this stage with values:fromDocumentField, you can count how many times a specific field value occurred in a specific list of documents. A typical use case for such counting is a k-nearest-neighbors (kNN) classifier.
The following request is a simple kNN classifier that suggests an arXiv category for the piece of text you provide.
The classifier request performs the following steps:

1. The seedLabels stage extracts top-frequency labels from the input text using labels:fromText.
2. The keywordMlt stage uses documents:byQuery and query:forLabels to find more documents containing seedLabels.
3. The classes stage uses values:fromDocumentField to fetch the arXiv category field value for each document returned by the keywordMlt stage. Then, it uses clusters:byValues to compute the most frequent categories. These are likely to be a good category choice for the input passage.
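The final voting step of such a classifier can be sketched in Python. The example below takes clusters shaped like the output described for this stage and ranks categories by the share of neighboring documents that carry them. The category counts are invented and the function name rank_categories is hypothetical. The sketch covers only the counting, not the retrieval of similar documents.

```python
def rank_categories(clusters: list[dict]) -> list[tuple[str, float]]:
    """Rank cluster names by member count and report each one's share of all members."""
    total = sum(len(cluster["members"]) for cluster in clusters)
    ranked = sorted(clusters, key=lambda cluster: len(cluster["members"]), reverse=True)
    return [(cluster["name"], len(cluster["members"]) / total) for cluster in ranked]


# Hypothetical clusters for five neighboring documents and their categories.
clusters = [
    {"name": "stat.ML", "members": [{"id": 1, "weight": 1}]},
    {
        "name": "cs.LG",
        "members": [
            {"id": 0, "weight": 1},
            {"id": 2, "weight": 1},
            {"id": 4, "weight": 1},
        ],
    },
    {"name": "cs.CL", "members": [{"id": 3, "weight": 1}]},
]

ranking = rank_categories(clusters)
suggested_category = ranking[0][0]
```

A share close to 1 suggests the neighbors agree on a category, while evenly split shares suggest the input text sits between categories.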
values
The list of values for which to create clusters.
clusters:fromMatrixColumns
Creates clusters based on the matrix rows you provide. Each column gives rise to one cluster with cluster members corresponding to the values of the column.
{
  "type": "clusters:fromMatrixColumns",
  "limit": 100,
  "matrixRows": {
    "type": "matrixRows:reference",
    "auto": true
  },
  "sortOrder": "DESCENDING",
  "weightAggregation": "SUM"
}
This stage performs the following steps:

1. For each column of the input matrixRows, aggregate the column's values using the weightAggregation function.
2. Sort columns by their aggregated value computed in step 1, according to the sortOrder.
3. Return a flat list of clusters corresponding to up to limit first columns on the sorted list. Each cluster has the following properties:

exemplar

Describes the matrix column that gave rise to this cluster.

  index

  Index of the column that gave rise to this cluster.

  weight

  The aggregate of column values computed in step 1.

members

Describes the individual values (rows) of the column that gave rise to this cluster.

  index

  Index of the row.

  weight

  Matrix value at this member's row and this cluster's column coordinate.
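The three steps above can be approximated in Python. The sketch below represents matrix rows as one dictionary of column values per row, which is an assumption made for this example and not Lingo4G's internal format. The aggregate parameter stands in for weightAggregation and defaults to summation, and the largest_first flag stands in for sortOrder. The function name and the sample data are hypothetical.

```python
def clusters_from_matrix_columns(matrix_rows, limit=100, largest_first=True, aggregate=sum):
    """Aggregate each column, sort columns by the aggregate, and emit one cluster per column."""
    # Step 1: gather each column's values together with the rows they came from.
    columns = {}
    for row_index, row in enumerate(matrix_rows):
        for column_index, value in row.items():
            columns.setdefault(column_index, []).append((row_index, value))
    scored = [
        (column_index, aggregate(value for _, value in entries), entries)
        for column_index, entries in columns.items()
    ]
    # Step 2: sort columns by their aggregated value.
    scored.sort(key=lambda column: column[1], reverse=largest_first)
    # Step 3: convert up to `limit` first columns into clusters.
    return [
        {
            "exemplar": {"index": column_index, "weight": weight},
            "members": [{"index": row, "weight": value} for row, value in entries],
        }
        for column_index, weight, entries in scored[:limit]
    ]


# Hypothetical sparse matrix rows: {column index: value} for each row.
matrix_rows = [
    {0: 0.5, 2: 0.125},
    {0: 0.25, 1: 1.0},
    {2: 0.5},
]

clusters = clusters_from_matrix_columns(matrix_rows, limit=2)
```

Each cluster keeps the row values that added up to its column's aggregate, which is the extra information this stage provides.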
The clusters:fromMatrixColumns stage is an extension of the documents:fromMatrixColumns stage. While the latter outputs only documents corresponding to the top-valued columns, this stage also outputs the indices and values in rows that contributed to the specific column's value.
As with documents:fromMatrixColumns, you can use this stage to select top-scoring documents where the score is an aggregation of a number of values. For example, if you build matrixRows of cross-similarities between a set of cs.* and physics.* arXiv papers, clusters:fromMatrixColumns can reveal the top physics.* papers that are most similar to cs.* papers, showing where the two areas overlap. Unlike documents:fromMatrixColumns, this stage also outputs the cs.* papers that contribute to the aggregated similarity value of each physics.* paper.
limit
The maximum number of top-scoring matrix columns to convert into clusters.
matrixRows
The matrix rows whose columns to aggregate.
sortOrder
Determines the sorting order for the aggregated column values.
ASCENDING

Creates up to limit clusters corresponding to columns with the smallest aggregated values.

DESCENDING

Creates up to limit clusters corresponding to columns with the largest aggregated values.

UNSPECIFIED

Creates up to limit clusters in the order their corresponding columns appear in the input matrixRows.
weightAggregation
The column value aggregation function.
clusters:withRemappedDocuments
Translates clusters from one document space to another, filtering out non-matching documents.
{
  "type": "clusters:withRemappedDocuments",
  "clusters": {
    "type": "clusters:reference",
    "auto": true
  },
  "exemplarsFrom": null,
  "exemplarsTo": null,
  "membersFrom": null,
  "membersTo": null
}
This stage is fairly specialized and has very rare use cases. You may need it when you have a list of clusters created for a certain set of documents, but want to remap cluster member indices to a different but related set of documents, such as a subset or a superset of the one that gave rise to clusters.
clusters
The clusters whose exemplar and member indices to remap.
exemplarsFrom
The list of documents which gave rise to cluster exemplars.
exemplarsTo
The list of documents to which to translate the cluster exemplars.
membersFrom
The list of documents which gave rise to cluster members.
membersTo
The list of documents to which to translate the cluster members.
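The remapping idea can be sketched in Python as an index translation between two document lists. The example assumes each document carries a stable identifier that appears in both lists. How Lingo4G matches documents internally is not described in this section, so the identifiers, the sample data and the function name remap_indices are hypothetical.

```python
def remap_indices(indices: list[int], documents_from: list[str], documents_to: list[str]) -> list[int]:
    """Translate indices into documents_from to indices into documents_to, dropping non-matches."""
    position_in_target = {doc_id: position for position, doc_id in enumerate(documents_to)}
    remapped = []
    for index in indices:
        doc_id = documents_from[index]
        # Documents absent from the target list are filtered out.
        if doc_id in position_in_target:
            remapped.append(position_in_target[doc_id])
    return remapped


# Hypothetical document identifiers for the source and target document lists.
documents_from = ["d10", "d11", "d12", "d13"]
documents_to = ["d13", "d11", "d99"]

# Member indices of one cluster, expressed against documents_from.
member_indices = [0, 1, 3]
remapped_members = remap_indices(member_indices, documents_from, documents_to)
```

In this example the member pointing at d10 disappears because d10 is not in the target list, while d11 and d13 receive their positions in the target list.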
Consumers of clusters:*
The following stages and components take clusters:* as input: