labelClusters

The labelClusters:* stages produce clusters of labels. A typical use case is generating label-based descriptions for clusters of documents.

You can use the following label clustering stages in your requests:

label​Clusters:​document​Cluster​Labels

Creates label clusters aligned with the document clusters you provide. Use this stage to generate label-based descriptions for clusters of documents.


label​Clusters:​reference

References the results of another label​Clusters:​* stage defined in the request.


The JSON output of the labelClusters stage has the following structure:

{
  "clusters": [
    {
      "clusters": [
        // sub-clusters (recursive structure)
      ],
      "labels": [
        {
          "label": "first-label",
          "weight": 44
        },
        ...
      ]
    },
    {
      // ... second cluster
    },
    // ... more clusters
  ]
}

The clusters property contains an array of clusters. Each cluster has an array of labels (labels property) and a nested array named clusters with recursive sub-clusters (the array is empty when no sub-clusters are present).

Each label inside labels has a display label and weight.

label​Clusters:​document​Cluster​Labels

Creates label clusters aligned with the document clusters you provide, so that each label cluster contains labels that describe the documents in the corresponding document cluster.

{
  "type": "labelClusters:documentClusterLabels",
  "clusters": {
    "type": "clusters:reference",
    "auto": true
  },
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "labelCollector": {
    "type": "labelCollector:topFromFeatureFields",
    "failIfEmbeddingsNotAvailable": true,
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "labelFilter": {
      "type": "labelFilter:reference",
      "auto": true
    },
    "labelListFilter": {
      "type": "labelListFilter:truncatedPhrases"
    },
    "labelWeighting": "EMBEDDING",
    "minTf": 0,
    "minWeight": 0,
    "minWeightMass": 1,
    "tieResolution": "AUTO"
  },
  "labelListFilter": {
    "type": "labelListFilter:acceptAll"
  },
  "maxLabels": 3,
  "maxLabelsPerDocument": 100,
  "mutualInformationWeight": 1,
  "threads": "auto"
}

In the following example, we request the top documents matching the query photon, arrange them into clusters and describe each cluster with labels.

{
  "name": "Document clusters by More-Like-This similarity",
  "comment": "Clusters a set of top documents matching the provided query, based on the common labels the documents share. Attempts to describe the clusters by top-frequency labels from each cluster's documents. Fetches the content of clustered documents.",
  "variables": {
    "query": {
      "name": "Documents query",
      "comment": "Defines the set of documents to cluster.",
      "value": "photon"
    },
    "limit": {
      "name": "Max documents",
      "comment": "The maximum number of documents matching the query to select for clustering.",
      "value": 2000
    },
    "clusterCreationPreference": {
      "name": "Cluster creation preference",
      "comment": "How many clusters to create. The more negative the preference, the fewer clusters. The closer the preference to 0, the more clusters.",
      "value": -1000
    },
    "clusterLinkingPreference": {
      "name": "Cluster linking preference",
      "comment": "How many links to create between clusters. Softening of 0 creates unlinked, flat structure of clusters. Softening of 1.0 creates a highly-linked structure of clusters.",
      "value": 0
    },
    "maxSimilarDocuments": {
      "name": "Max similar documents",
      "comment": "How many similar documents to find for each document in the similarity matrix. The larger the number of similar documents, the larger and more general the clusters and the longer clustering time.",
      "value": 10
    },
    "maxClusterLabels": {
      "name": "Max cluster labels",
      "comment": "How many labels to use to label each cluster.",
      "value": 3
    }
  },
  "components": {
    "query": {
      "type": "query:string",
      "query": {
        "@var": "query"
      }
    }
  },
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:reference",
        "use": "query"
      },
      "limit": {
        "@var": "limit"
      }
    },
    "content": {
      "type": "documentContent",
      "limit": {
        "@var": "limit"
      }
    },
    "clusters": {
      "type": "clusters:ap",
      "matrix": {
        "type": "matrix:keywordDocumentSimilarity",
        "maxNeighbors": {
          "@var": "maxSimilarDocuments"
        }
      },
      "inputPreference": {
        "@var": "clusterCreationPreference"
      },
      "softening": {
        "@var": "clusterLinkingPreference"
      }
    },
    "labelClusters": {
      "type": "labelClusters:documentClusterLabels",
      "maxLabels": {
        "@var": "maxClusterLabels"
      },
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "labelFilter": {
          "type": "labelFilter:dictionary",
          "exclude": [
            {
              "type": "dictionary:queryTerms",
              "query": {
                "type": "query:reference",
                "use": "query"
              }
            }
          ]
        }
      }
    }
  },
  "output": {
    "stages": [
      "content",
      "clusters",
      "labelClusters"
    ]
  }
}

The label clusters for the first few document clusters are shown below:

"clusters": [
  {
    "clusters": [],
    "labels": [
      {
        "label": "cross section",
        "weight": 29
      },
      {
        "label": "hadronic",
        "weight": 11
      }
    ]
  },
  {
    "clusters": [
      {
        "clusters": [],
        "labels": [
          {
            "label": "particle",
            "weight": 2
          },
          {
            "label": "spinless particles",
            "weight": 2
          },
          {
            "label": "coupled",
            "weight": 2
          },
          {
            "label": "new",
            "weight": 2
          },
          {
            "label": "constraints",
            "weight": 2
          },
          {
            "label": "light",
            "weight": 2
          }
        ]
      }
    ],
    "labels": [
      {
        "label": "γ",
        "weight": 82
      },
      {
        "label": "γγ",
        "weight": 36
      },
      {
        "label": "e",
        "weight": 26
      }
    ]
  },
  {
    "clusters": [],
    "labels": [
      {
        "label": "baryon",
        "weight": 7
      },
      {
        "label": "running vacuum",
        "weight": 6
      }
    ]
  },
  {
    "clusters": [],
    "labels": [
      {
        "label": "black hole",
        "weight": 94
      },
      {
        "label": "photon ring",
        "weight": 32
      },
      {
        "label": "ring",
        "weight": 28
      }
    ]
  }
]

By default, for each cluster the label​Clusters:​document​Cluster​Labels stage chooses the labels that maximize the Mutual Information with respect to the contents of the cluster. Use the mutual​Information​Weight property to set the balance between Mutual Information and simple maximum-weight label selection.

Note that label scoring only determines the selection and order of the labels for each cluster. The weight of the labels in the output JSON represents the number of documents in the cluster containing that label.
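
As a minimal sketch of the balance described above, the stage definition below blends the two selection methods. The value 0.5 is an arbitrary illustration rather than a recommendation, and all other properties are left at their defaults.

```json
{
  "type": "labelClusters:documentClusterLabels",
  "mutualInformationWeight": 0.5
}
```

Because scoring affects only selection and order, the weight values in the output remain document counts regardless of this setting.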

clusters

Type
clusters
Default
{
  "type": "clusters:reference",
  "auto": true
}
Required
no

The document clusters to create label clusters for.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The source documents of the clusters referenced in the clusters property.
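
With the defaults, both properties are resolved automatically ("auto": true). If a request defines several document or clustering stages, you may prefer to name the sources explicitly. The sketch below assumes that clusters:reference and documents:reference accept a use property naming the target stage, in the same way query:reference does in the example request; the stage names clusters and documents are taken from that example.

```json
{
  "type": "labelClusters:documentClusterLabels",
  "clusters": {
    "type": "clusters:reference",
    "use": "clusters"
  },
  "documents": {
    "type": "documents:reference",
    "use": "documents"
  }
}
```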

label​Collector

Type
labelCollector
Default
{
  "type": "labelCollector:topFromFeatureFields",
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "minTf": 0,
  "minWeight": 0,
  "minWeightMass": 1,
  "tieResolution": "AUTO",
  "labelWeighting": "EMBEDDING",
  "failIfEmbeddingsNotAvailable": true
}
Required
no

Configures the collection of labels from individual documents.

The default collector configuration should provide reasonable labels in typical cases.

Use this property to override the label collection configuration to, for example, apply custom label filtering.

label​List​Filter

Type
labelListFilter
Default
{
  "type": "labelListFilter:acceptAll"
}
Required
no

The label list filter to apply to the labels describing each cluster.

A particularly useful filter in this context is the labelListFilter:diversified filter, which attempts to remove repetitive labels from the cluster description. See the documentation of the filter for an example request.
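
As a hedged sketch, the stage definition below replaces the default acceptAll filter with the diversified filter. Only the filter type named above is used; any filter-specific properties are left at their defaults because they are not described in this section.

```json
{
  "type": "labelClusters:documentClusterLabels",
  "labelListFilter": {
    "type": "labelListFilter:diversified"
  }
}
```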

max​Labels

Type
integer
Default
3
Constraints
value > 0
Required
no

The maximum number of labels to output for each cluster.

max​Labels​Per​Document

Type
integer
Default
100
Constraints
value > 0
Required
no

The maximum number of labels to collect from each document when describing clusters.
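
For illustration, the following stage definition asks for longer cluster descriptions while collecting fewer labels from each document. The values 5 and 50 are arbitrary examples that satisfy the value > 0 constraint, not tuned recommendations.

```json
{
  "type": "labelClusters:documentClusterLabels",
  "maxLabels": 5,
  "maxLabelsPerDocument": 50
}
```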

mutual​Information​Weight

Type
number
Default
1
Constraints
value >= 0 and value <= 1
Required
no

Determines the type of scoring Lingo4G uses to select cluster labels.

This property accepts values in the 0.0...1.0 range.

Values Scoring
1.0

Selects labels that maximize Mutual Information with respect to the cluster. These labels frequently occur within the cluster and are less common in documents from other clusters.

0.0

Chooses labels that occur most frequently in the cluster's documents. This method may promote labels that are frequent also in other clusters.

greater than 0.0 and less than 1.0

Combines Mutual Information and occurrence-count scoring. The label score is a weighted geometric mean of the label's Mutual Information value and its in-cluster frequency. As mutual​Information​Weight approaches 1.0, the Mutual Information component of the score gains more significance.
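
One way to read the weighted geometric mean is sketched below. This is an interpretation of the description above, not a formula taken from the implementation: it assumes that mutualInformationWeight, written w, is used directly as the exponent, and it ignores any normalization Lingo4G may apply to the two components. The symbols MI(l, c) and freq(l, c) stand for the Mutual Information and in-cluster frequency of label l in cluster c.

```latex
\[
\mathrm{score}(l, c) \;=\; \mathrm{MI}(l, c)^{\,w} \cdot \mathrm{freq}(l, c)^{\,1 - w},
\qquad 0 \le w \le 1
\]
```

Under this reading, w = 1 reduces the score to Mutual Information alone and w = 0 reduces it to in-cluster frequency alone, which matches the two boundary cases in the table.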

threads

Type
threads
Default
auto
Required
no

The number of threads to use when collecting labels from documents.