labelClusters

The labelClusters:* stages produce clusters of labels. A typical use case for these stages is generating label-based descriptions for clusters of documents.

You can use the following label clustering stages in your requests:

label​Clusters:​document​Cluster​Labels

Creates label clusters aligned with the document clusters you provide. Use this stage to generate label-based descriptions for clusters of documents.


label​Clusters:​reference

References the results of another label​Clusters:​* stage defined in the request.


The JSON output of the labelClusters stage has the following structure:

{
  "clusters": [
    {
      "clusters": [
        // sub-clusters (recursive structure)
      ],
      "labels": [
        {
          "label": "first-label",
          "weight": 44
        },
        ...
      ]
    },
    {
      ... second cluster
    },
    ... more clusters
  ]
}

The clusters property contains an array of clusters. Each cluster has an array of labels (labels property) and a nested array named clusters with recursive sub-clusters (the array is empty when no sub-clusters are present).

Each entry in labels has a display text (label property) and a numeric weight (weight property).
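
Because the structure is recursive, client code usually needs a recursive traversal to read it. The following is a minimal Python sketch, assuming the response has already been parsed into a dictionary. The function name walk_label_clusters and the sample data are hypothetical and used only for illustration.

```python
from typing import Iterator


def walk_label_clusters(clusters: list, depth: int = 0) -> Iterator[tuple]:
    """Yield (depth, labels) for every cluster, parents before sub-clusters."""
    for cluster in clusters:
        labels = [
            (entry["label"], entry["weight"])
            for entry in cluster.get("labels", [])
        ]
        yield depth, labels
        # The nested "clusters" array is empty when there are no sub-clusters.
        yield from walk_label_clusters(cluster.get("clusters", []), depth + 1)


# Hypothetical sample shaped like the structure described above.
response = {
    "clusters": [
        {"clusters": [], "labels": [{"label": "first-label", "weight": 44}]},
        {
            "clusters": [
                {"clusters": [], "labels": [{"label": "nested-label", "weight": 2}]}
            ],
            "labels": [{"label": "second-label", "weight": 36}],
        },
    ]
}

for depth, labels in walk_label_clusters(response["clusters"]):
    summary = ", ".join(f"{label} ({weight})" for label, weight in labels)
    # Indent sub-clusters by their nesting depth.
    print("  " * depth + summary)
```

Yielding the depth together with the labels keeps the traversal independent of how the caller presents the hierarchy.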

label​Clusters:​document​Cluster​Labels

Creates label clusters aligned with the document clusters you provide in such a way that each label cluster contains labels that occur most frequently in the documents from the corresponding document cluster.

{
  "type": "labelClusters:documentClusterLabels",
  "clusters": {
    "type": "clusters:reference",
    "auto": true
  },
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "labelAggregator": {
    "type": "labelAggregator:topWeight",
    "labelCollector": {
      "type": "labelCollector:topFromFeatureFields",
      "fields": {
        "type": "featureFields:reference",
        "auto": true
      },
      "labelFilter": {
        "type": "labelFilter:reference",
        "auto": true
      },
      "labelListFilter": {
        "type": "labelListFilter:truncatedPhrases"
      },
      "minTf": 0,
      "minTfMass": 1,
      "tieResolution": "AUTO"
    },
    "maxLabelsPerDocument": 10,
    "maxRelativeDf": 1,
    "minAbsoluteDf": 1,
    "minRelativeDf": 0,
    "minWeight": 0,
    "outputWeightFormula": "TF",
    "threads": "auto",
    "tieResolution": "AUTO"
  },
  "maxLabels": 3
}
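
The idea of picking the most frequent labels per document cluster can be illustrated with a small, self-contained Python sketch. This is a conceptual simplification, not the actual implementation: the function name, the input shapes and the weighting (a plain count of documents containing the label) are assumptions made for illustration. The real stage delegates filtering and weighting to the configured labelAggregator component.

```python
from collections import Counter


def top_labels_per_cluster(clusters, document_labels, max_labels=3):
    """For each cluster (a list of document ids), return its most frequent labels.

    Illustrative only: the weight is the number of documents in the cluster
    that contain the label.
    """
    result = []
    for doc_ids in clusters:
        counts = Counter()
        for doc_id in doc_ids:
            # Count each label at most once per document.
            counts.update(set(document_labels[doc_id]))
        # Highest count first; ties are broken alphabetically for determinism.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        result.append(
            [{"label": label, "weight": weight} for label, weight in ranked[:max_labels]]
        )
    return result


# Hypothetical input: labels per document and two document clusters.
document_labels = {
    "d1": ["black hole", "photon ring"],
    "d2": ["black hole", "ring"],
    "d3": ["cross section", "hadronic"],
    "d4": ["cross section"],
}
document_clusters = [["d1", "d2"], ["d3", "d4"]]

label_clusters = top_labels_per_cluster(document_clusters, document_labels, max_labels=2)
print(label_clusters)
```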

In the example below, we request the top documents matching the query photon, compute their clusters and describe them with cluster labels.

{
  "name": "Document clusters by More-Like-This similarity",
  "comment": "Clusters a set of top documents matching the provided query, based on the common labels the documents share. Attempts to describe the clusters by top-frequency labels from each cluster's documents. Fetches the content of clustered documents.",
  "variables": {
    "query": {
      "name": "Documents query",
      "comment": "Defines the set of documents to cluster.",
      "value": "photon"
    },
    "limit": {
      "name": "Max documents",
      "comment": "The maximum number of documents matching the query to select for clustering.",
      "value": 2000
    },
    "clusterCreationPreference": {
      "name": "Cluster creation preference",
      "comment": "How many clusters to create. The more negative the preference, the fewer clusters. The closer the preference to 0, the more clusters.",
      "value": -1000
    },
    "clusterLinkingPreference": {
      "name": "Cluster linking preference",
      "comment": "How many links to create between clusters. Softening of 0 creates unlinked, flat structure of clusters. Softening of 1.0 creates a highly-linked structure of clusters.",
      "value": 0
    },
    "maxSimilarDocuments": {
      "name": "Max similar documents",
      "comment": "How many similar documents to find for each document in the similarity matrix. The larger the number of similar documents, the larger and more general the clusters and the longer clustering time.",
      "value": 10
    },
    "maxClusterLabels": {
      "name": "Max cluster labels",
      "comment": "How many labels to use to label each cluster.",
      "value": 3
    }
  },
  "components": {
    "query": {
      "type": "query:string",
      "query": {
        "@var": "query"
      }
    }
  },
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:reference",
        "use": "query"
      },
      "limit": {
        "@var": "limit"
      }
    },
    "content": {
      "type": "documentContent",
      "limit": {
        "@var": "limit"
      }
    },
    "clusters": {
      "type": "clusters:ap",
      "matrix": {
        "type": "matrix:keywordDocumentSimilarity",
        "maxNeighbors": {
          "@var": "maxSimilarDocuments"
        }
      },
      "inputPreference": {
        "@var": "clusterCreationPreference"
      },
      "softening": {
        "@var": "clusterLinkingPreference"
      }
    },
    "labelClusters": {
      "type": "labelClusters:documentClusterLabels",
      "maxLabels": {
        "@var": "maxClusterLabels"
      },
      "labelAggregator": {
        "type": "labelAggregator:topWeight",
        "labelCollector": {
          "type": "labelCollector:topFromFeatureFields",
          "labelFilter": {
            "type": "labelFilter:dictionary",
            "exclude": [
              {
                "type": "dictionary:queryTerms",
                "query": {
                  "type": "query:reference",
                  "use": "query"
                }
              }
            ]
          }
        }
      }
    }
  },
  "output": {
    "stages": [
      "content",
      "clusters",
      "labelClusters"
    ]
  }
}

The label clusters for the first four document clusters are shown below:

"clusters": [
  {
    "clusters": [],
    "labels": [
      {
        "label": "cross section",
        "weight": 29
      },
      {
        "label": "hadronic",
        "weight": 11
      }
    ]
  },
  {
    "clusters": [
      {
        "clusters": [],
        "labels": [
          {
            "label": "particle",
            "weight": 2
          },
          {
            "label": "spinless particles",
            "weight": 2
          },
          {
            "label": "coupled",
            "weight": 2
          },
          {
            "label": "new",
            "weight": 2
          },
          {
            "label": "constraints",
            "weight": 2
          },
          {
            "label": "light",
            "weight": 2
          }
        ]
      }
    ],
    "labels": [
      {
        "label": "γ",
        "weight": 82
      },
      {
        "label": "γγ",
        "weight": 36
      },
      {
        "label": "e",
        "weight": 26
      }
    ]
  },
  {
    "clusters": [],
    "labels": [
      {
        "label": "baryon",
        "weight": 7
      },
      {
        "label": "running vacuum",
        "weight": 6
      }
    ]
  },
  {
    "clusters": [],
    "labels": [
      {
        "label": "black hole",
        "weight": 94
      },
      {
        "label": "photon ring",
        "weight": 32
      },
      {
        "label": "ring",
        "weight": 28
      }
    ]
  }
]

clusters

Type
clusters
Default
{
  "type": "clusters:reference",
  "auto": true
}
Required
no

The document clusters to create label clusters for.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The source documents of the document clusters referenced in clusters.

label​Aggregator

Type
labelAggregator
Default
{
  "type": "labelAggregator:topWeight",
  "labelCollector": {
    "type": "labelCollector:topFromFeatureFields",
    "labelFilter": {
      "type": "labelFilter:reference",
      "auto": true
    },
    "labelListFilter": {
      "type": "labelListFilter:truncatedPhrases"
    },
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "minTf": 0,
    "minTfMass": 1,
    "tieResolution": "AUTO"
  },
  "maxLabelsPerDocument": 10,
  "minAbsoluteDf": 1,
  "minRelativeDf": 0,
  "maxRelativeDf": 1,
  "minWeight": 0,
  "tieResolution": "AUTO",
  "outputWeightFormula": "TF",
  "threads": "auto"
}
Required
no

The label​Aggregator:​* component used to filter and aggregate labels from each document cluster.

max​Labels

Type
integer
Default
3
Constraints
value > 0
Required
no

The maximum number of labels to retrieve from labelAggregator for each cluster.
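
When building requests programmatically, the value > 0 constraint can be checked on the client before the request is sent. The sketch below is a hypothetical Python helper (the function name is an assumption) that builds a stage definition relying on the defaults of all other properties.

```python
import json


def document_cluster_labels_stage(max_labels: int = 3) -> dict:
    """Build a labelClusters:documentClusterLabels stage with the given maxLabels."""
    if isinstance(max_labels, bool) or not isinstance(max_labels, int):
        raise TypeError("maxLabels must be an integer")
    if max_labels <= 0:
        # Mirrors the documented constraint: value > 0.
        raise ValueError("maxLabels must be greater than 0")
    # All omitted properties fall back to the defaults listed above.
    return {
        "type": "labelClusters:documentClusterLabels",
        "maxLabels": max_labels,
    }


stage = document_cluster_labels_stage(5)
print(json.dumps(stage, indent=2))
```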