labelClusters

The labelClusters:* stages produce clusters of labels. A typical use case for these stages is generating label-based descriptions for clusters of documents.

You can use the following label clustering stages in your requests:

label​Clusters:​document​Cluster​Labels

Creates label clusters aligned with the document clusters you provide. Use this stage to generate label-based descriptions for clusters of documents.


label​Clusters:​reference

References the results of another label​Clusters:​* stage defined in the request.


The JSON output of the labelClusters stage has the following structure:

{
  "clusters": [
    {
      "clusters": [
        // sub-clusters (recursive structure)
      ],
      "labels": [
        {
          "label": "first-label",
          "weight": 44
        },
        // ... more labels
      ]
    },
    {
      // ... second cluster
    },
    // ... more clusters
  ]
}

The clusters property contains an array of clusters. Each cluster has an array of labels (labels property) and a nested array named clusters with recursive sub-clusters (the array is empty when no sub-clusters are present).

Each entry in labels has a display text (label property) and a weight (weight property).
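Because the structure is recursive, client code usually needs a recursive traversal to read it. The following is a minimal Python sketch, assuming the stage output has already been parsed into a dictionary; the function name iter_label_clusters and the sample data are made up for illustration and are not part of the API.

```python
from typing import Iterator


def iter_label_clusters(
    clusters: list[dict], depth: int = 0
) -> Iterator[tuple[int, list[str]]]:
    """Yield (depth, labels) for every cluster, visiting sub-clusters depth-first."""
    for cluster in clusters:
        labels = [entry["label"] for entry in cluster.get("labels", [])]
        yield depth, labels
        # The nested "clusters" array is empty when there are no sub-clusters.
        yield from iter_label_clusters(cluster.get("clusters", []), depth + 1)


# Hypothetical stage output, shaped like the structure described above.
stage_output = {
    "clusters": [
        {"clusters": [], "labels": [{"label": "cross section", "weight": 29}]},
        {
            "clusters": [
                {"clusters": [], "labels": [{"label": "particle", "weight": 2}]}
            ],
            "labels": [{"label": "black hole", "weight": 94}],
        },
    ]
}

# Print one line per cluster, indented by nesting depth.
for depth, labels in iter_label_clusters(stage_output["clusters"]):
    print("  " * depth + ", ".join(labels))
```

The generator keeps the traversal separate from the presentation, so the same function can feed a tree view, a flat list or a report.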

label​Clusters:​document​Cluster​Labels

Creates label clusters aligned with the document clusters you provide, so that each label cluster contains labels that describe the documents from the corresponding document cluster.

{
  "type": "labelClusters:documentClusterLabels",
  "clusters": {
    "type": "clusters:reference",
    "auto": true
  },
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "labelCollector": {
    "type": "labelCollector:topFromFeatureFields",
    "failIfEmbeddingsNotAvailable": true,
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "labelFilter": {
      "type": "labelFilter:reference",
      "auto": true
    },
    "labelListFilter": {
      "type": "labelListFilter:truncatedPhrases"
    },
    "labelWeighting": "EMBEDDING",
    "minTf": 0,
    "minWeight": 0,
    "minWeightMass": 1,
    "tieResolution": "AUTO"
  },
  "labelListFilter": {
    "type": "labelListFilter:acceptAll"
  },
  "maxLabels": 4,
  "maxLabelsPerDocument": 100,
  "mutualInformationWeight": 0.5,
  "tfIdfWeight": 0.5,
  "threads": "auto"
}

In the following example, we request the top documents matching the query photon, arrange them into clusters and describe each cluster with labels.

{
  "name": "Document clusters by More-Like-This similarity",
  "comment": "Clusters a set of top documents matching the provided query, based on the common labels the documents share. Attempts to describe the clusters by top-frequency labels from each cluster's documents. Fetches the content of clustered documents.",
  "variables": {
    "query": {
      "name": "Documents query",
      "comment": "Defines the set of documents to cluster.",
      "value": "photon"
    },
    "limit": {
      "name": "Max documents",
      "comment": "The maximum number of documents matching the query to select for clustering.",
      "value": 2000
    },
    "clusterCreationPreference": {
      "name": "Cluster creation preference",
      "comment": "How many clusters to create. The more negative the preference, the fewer clusters. The closer the preference to 0, the more clusters.",
      "value": -1000
    },
    "clusterLinkingPreference": {
      "name": "Cluster linking preference",
      "comment": "How many links to create between clusters. Softening of 0 creates unlinked, flat structure of clusters. Softening of 1.0 creates a highly-linked structure of clusters.",
      "value": 0
    },
    "maxSimilarDocuments": {
      "name": "Max similar documents",
      "comment": "How many similar documents to find for each document in the similarity matrix. The larger the number of similar documents, the larger and more general the clusters and the longer clustering time.",
      "value": 10
    },
    "maxClusterLabels": {
      "name": "Max cluster labels",
      "comment": "How many labels to use to label each cluster.",
      "value": 3
    }
  },
  "components": {
    "query": {
      "type": "query:string",
      "query": {
        "@var": "query"
      }
    }
  },
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:reference",
        "use": "query"
      },
      "limit": {
        "@var": "limit"
      }
    },
    "content": {
      "type": "documentContent",
      "limit": {
        "@var": "limit"
      }
    },
    "clusters": {
      "type": "clusters:ap",
      "matrix": {
        "type": "matrix:keywordDocumentSimilarity",
        "maxNeighbors": {
          "@var": "maxSimilarDocuments"
        }
      },
      "inputPreference": {
        "@var": "clusterCreationPreference"
      },
      "softening": {
        "@var": "clusterLinkingPreference"
      }
    },
    "labelClusters": {
      "type": "labelClusters:documentClusterLabels",
      "maxLabels": {
        "@var": "maxClusterLabels"
      },
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "labelFilter": {
          "type": "labelFilter:dictionary",
          "exclude": [
            {
              "type": "dictionary:queryTerms",
              "query": {
                "type": "query:reference",
                "use": "query"
              }
            }
          ]
        }
      }
    }
  },
  "output": {
    "stages": [
      "content",
      "clusters",
      "labelClusters"
    ]
  }
}

An excerpt of the resulting label clusters (four top-level clusters, one of them with a sub-cluster) is shown below:

"clusters": [
  {
    "clusters": [],
    "labels": [
      {
        "label": "cross section",
        "weight": 29
      },
      {
        "label": "hadronic",
        "weight": 11
      }
    ]
  },
  {
    "clusters": [
      {
        "clusters": [],
        "labels": [
          {
            "label": "particle",
            "weight": 2
          },
          {
            "label": "spinless particles",
            "weight": 2
          },
          {
            "label": "coupled",
            "weight": 2
          },
          {
            "label": "new",
            "weight": 2
          },
          {
            "label": "constraints",
            "weight": 2
          },
          {
            "label": "light",
            "weight": 2
          }
        ]
      }
    ],
    "labels": [
      {
        "label": "γ",
        "weight": 82
      },
      {
        "label": "γγ",
        "weight": 36
      },
      {
        "label": "e",
        "weight": 26
      }
    ]
  },
  {
    "clusters": [],
    "labels": [
      {
        "label": "baryon",
        "weight": 7
      },
      {
        "label": "running vacuum",
        "weight": 6
      }
    ]
  },
  {
    "clusters": [],
    "labels": [
      {
        "label": "black hole",
        "weight": 94
      },
      {
        "label": "photon ring",
        "weight": 32
      },
      {
        "label": "ring",
        "weight": 28
      }
    ]
  }
]
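The example request keeps its tunable values in the variables block, so you can re-run the same analysis with different settings by changing only those values. The sketch below shows one way to do this in Python, assuming the request is held as a dictionary; the helper name with_variables and the replacement query "neutrino" are illustrative and not part of the API.

```python
import copy


def with_variables(request: dict, **values) -> dict:
    """Return a copy of the request with the given variable values replaced."""
    updated = copy.deepcopy(request)
    for name, value in values.items():
        if name not in updated.get("variables", {}):
            raise KeyError(f"request does not declare variable {name!r}")
        updated["variables"][name]["value"] = value
    return updated


# Trimmed-down stand-in for the example request; only "variables" matters here.
request = {
    "variables": {
        "query": {"name": "Documents query", "value": "photon"},
        "maxClusterLabels": {"name": "Max cluster labels", "value": 3},
    }
}

tuned = with_variables(request, query="neutrino", maxClusterLabels=5)
print(tuned["variables"]["query"]["value"])
```

Working on a deep copy leaves the original request untouched, which makes it easy to compare the results of several variants.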

By default, this stage produces half of each cluster's labels using the Mutual Information method and the other half using the cluster-TF-IDF method. Use the mutualInformationWeight and tfIdfWeight properties to adjust the balance between the two methods.

The two methods produce labels with different characteristics:
Mutual Information

Maximizes the Mutual Information between the cluster and the cluster labels. Produces labels that are frequent in the cluster and rare in other clusters. Usually the labels are shorter (one or two words), but appear in many of the cluster's documents.

cluster-TF-IDF

Maximizes the TF-IDF of the labels with respect to the cluster. Usually the labels are longer (two or more words), but appear in fewer of the cluster's documents.

The weight of each label in the output JSON is the number of documents in the cluster that contain the label.
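Since the weight is a document count, dividing it by the size of the cluster gives the share of the cluster's documents that contain the label. The Python sketch below illustrates this; the function name label_coverage is hypothetical, and the cluster size of 200 is a made-up number, because the label cluster output itself does not include the document count.

```python
def label_coverage(labels: list[dict], cluster_size: int) -> list[tuple[str, float]]:
    """Return (label, share of cluster documents containing it), highest first."""
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    ranked = sorted(labels, key=lambda entry: entry["weight"], reverse=True)
    return [(entry["label"], entry["weight"] / cluster_size) for entry in ranked]


labels = [
    {"label": "photon ring", "weight": 32},
    {"label": "black hole", "weight": 94},
    {"label": "ring", "weight": 28},
]

# cluster_size=200 is an invented document count used only for illustration.
for label, share in label_coverage(labels, cluster_size=200):
    print(f"{label}: {share:.0%}")
```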

clusters

Type
clusters
Default
{
  "type": "clusters:reference",
  "auto": true
}
Required
no

The document clusters to create label clusters for.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The source documents of clusters referenced in clusters.

label​Collector

Type
labelCollector
Default
{
  "type": "labelCollector:topFromFeatureFields",
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "minTf": 0,
  "minWeight": 0,
  "minWeightMass": 1,
  "tieResolution": "AUTO",
  "labelWeighting": "EMBEDDING",
  "failIfEmbeddingsNotAvailable": true
}
Required
no

Configures the collection of labels from individual documents.

The default collector configuration should provide reasonable labels in typical cases.

Use this property to override the label collection configuration to, for example, apply custom label filtering.

label​List​Filter

Type
labelListFilter
Default
{
  "type": "labelListFilter:acceptAll"
}
Required
no

The label list filter to apply to the labels describing each cluster.

A particularly useful filter in this context is the labelListFilter:diversified filter, which attempts to remove repetitive labels from the cluster description. See the documentation of that filter for an example request.

max​Labels

Type
integer
Default
4
Constraints
value > 0
Required
no

The maximum number of labels to output for each cluster.

max​Labels​Per​Document

Type
integer
Default
100
Constraints
value > 0
Required
no

The maximum number of labels to collect from each document when describing clusters.

mutual​Information​Weight

Type
number
Default
0.5
Constraints
value >= 0 and value <= 1
Required
no

Determines the proportion of labels to collect using the Mutual Information method.

If both mutual​Information​Weight and tf​Idf​Weight are equal (e.g. both equal to 0.5), each method produces half of the labels you request for each cluster.

See the methods overview for more information about the two methods.
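One way to reason about the effect of the two weights is to treat them as shares of the maxLabels budget. The Python sketch below is only an illustration of that reading: it assumes the weights are normalized by their sum and that the Mutual Information share is rounded half up. The documentation above confirms only the equal-weights case, so the actual allocation and rounding in the engine may differ. The function name split_label_budget is hypothetical.

```python
def split_label_budget(
    max_labels: int, mutual_information_weight: float, tf_idf_weight: float
) -> tuple[int, int]:
    """Split max_labels into (Mutual Information count, cluster-TF-IDF count).

    Assumption: weights are normalized by their sum and the Mutual Information
    share is rounded half up. The real stage may allocate labels differently.
    """
    if max_labels <= 0:
        raise ValueError("max_labels must be greater than 0")
    for weight in (mutual_information_weight, tf_idf_weight):
        if not 0 <= weight <= 1:
            raise ValueError("weights must be between 0 and 1")
    total = mutual_information_weight + tf_idf_weight
    if total == 0:
        # Sketch-only choice: the behavior for two zero weights is not documented.
        raise ValueError("at least one weight must be positive")
    mi_count = int(max_labels * mutual_information_weight / total + 0.5)
    return mi_count, max_labels - mi_count


# Defaults from this page: maxLabels 4, both weights 0.5.
print(split_label_budget(4, 0.5, 0.5))
```

With the default settings the sketch returns an even split, which matches the documented behavior for equal weights.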

tf​Idf​Weight

Type
number
Default
0.5
Constraints
value >= 0 and value <= 1
Required
no

Determines the proportion of labels to collect using the cluster-TF-IDF method.

If both tfIdfWeight and mutualInformationWeight are equal (e.g. both equal to 0.5), each method produces half of the labels you request for each cluster.

See the methods overview for more information about the two methods.

threads

Type
threads
Default
auto
Required
no

The number of threads to use when collecting labels from documents.
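When you generate requests programmatically, it can help to check the numeric constraints listed above before sending the request. The following Python sketch builds a labelClusters:documentClusterLabels stage definition and validates the documented ranges; the function name document_cluster_labels_stage is hypothetical. Properties left out of the returned dictionary (clusters, documents, labelCollector, labelListFilter, threads) are not required, so the stage is expected to fall back to the defaults shown on this page.

```python
import json


def document_cluster_labels_stage(
    max_labels: int = 4,
    max_labels_per_document: int = 100,
    mutual_information_weight: float = 0.5,
    tf_idf_weight: float = 0.5,
) -> dict:
    """Build a stage definition, enforcing the documented constraints."""
    for name, value in (
        ("maxLabels", max_labels),
        ("maxLabelsPerDocument", max_labels_per_document),
    ):
        if value <= 0:
            raise ValueError(f"{name} must be greater than 0")
    for name, value in (
        ("mutualInformationWeight", mutual_information_weight),
        ("tfIdfWeight", tf_idf_weight),
    ):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    return {
        "type": "labelClusters:documentClusterLabels",
        "maxLabels": max_labels,
        "maxLabelsPerDocument": max_labels_per_document,
        "mutualInformationWeight": mutual_information_weight,
        "tfIdfWeight": tf_idf_weight,
    }


# Favor the Mutual Information method and ask for three labels per cluster.
stage = document_cluster_labels_stage(
    max_labels=3, mutual_information_weight=0.8, tf_idf_weight=0.2
)
print(json.dumps(stage, indent=2))
```

The resulting dictionary can be placed under a stage name in the stages section of a request, in the same position as the labelClusters stage in the example request.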