labelClusters
The labelClusters:* stages produce clusters of labels. One typical use case of these stages is to generate
label-based descriptions for clusters of documents.
You can use the following label clustering stages in your requests:
- labelClusters:documentClusterLabels
  Creates label clusters aligned with the document clusters you provide. Use this stage to generate label-based descriptions for clusters of documents.
- labelClusters:reference
  References the results of another labelClusters:* stage defined in the request.
The JSON output of the labelClusters stage has the following structure:
{
"clusters": [
{
"clusters": [
// sub-clusters (recursive structure)
],
"labels": [
{
"label": "first-label",
"weight": 44
},
...
]
},
{
... second cluster
},
... more clusters
]
}
The clusters property contains an array of clusters. Each cluster has an array of labels (the labels property) and a nested array named clusters with recursive sub-clusters (the array is empty when no sub-clusters are present). Each entry inside labels has a display label and a weight.
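The recursive structure described above can be walked with a short depth-first traversal. The following is a minimal Python sketch, assuming the stage output has already been parsed into a dictionary; the function name describe_clusters and the sample data are illustrative and not part of Lingo4G.

```python
def describe_clusters(clusters, depth=0):
    """Return one (depth, description) pair per cluster, depth-first.

    `clusters` is the value of a "clusters" property: a list of objects,
    each with its own "clusters" and "labels" arrays.
    """
    rows = []
    for cluster in clusters:
        # Join display labels in the order the stage returned them.
        description = ", ".join(
            f'{entry["label"]} ({entry["weight"]})' for entry in cluster["labels"]
        )
        rows.append((depth, description))
        # Sub-clusters have the same shape, so recurse one level deeper.
        rows.extend(describe_clusters(cluster["clusters"], depth + 1))
    return rows


# Hypothetical, hand-written sample shaped like the stage output.
sample = {
    "clusters": [
        {
            "clusters": [
                {"clusters": [], "labels": [{"label": "light", "weight": 2}]}
            ],
            "labels": [
                {"label": "black hole", "weight": 94},
                {"label": "photon ring", "weight": 32},
            ],
        }
    ]
}

rows = describe_clusters(sample["clusters"])
for depth, description in rows:
    # Indent sub-clusters under their parent cluster.
    print("  " * depth + description)
```

Because an empty clusters array ends the recursion, the same function handles both flat and nested results.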
label​Clusters:​document​Cluster​Labels
Creates label clusters aligned with the document clusters you provide, so that each label cluster contains labels that describe the documents from the corresponding document cluster.
{
"type": "labelClusters:documentClusterLabels",
"clusters": {
"type": "clusters:reference",
"auto": true
},
"documents": {
"type": "documents:reference",
"auto": true
},
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"failIfEmbeddingsNotAvailable": true,
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labelFilter": {
"type": "labelFilter:reference",
"auto": true
},
"labelListFilter": {
"type": "labelListFilter:truncatedPhrases"
},
"labelWeighting": "EMBEDDING",
"minTf": 0,
"minWeight": 0,
"minWeightMass": 1,
"tieResolution": "AUTO"
},
"labelListFilter": {
"type": "labelListFilter:acceptAll"
},
"maxLabels": 3,
"maxLabelsPerDocument": 100,
"mutualInformationWeight": 1,
"threads": "auto"
}
In the following example, we request the top documents matching the query photon, arrange them into clusters and describe each cluster with labels.
{
"name": "Document clusters by More-Like-This similarity",
"comment": "Clusters a set of top documents matching the provided query, based on the common labels the documents share. Attempts to describe the clusters by top-frequency labels from each cluster's documents. Fetches the content of clustered documents.",
"variables": {
"query": {
"name": "Documents query",
"comment": "Defines the set of documents to cluster.",
"value": "photon"
},
"limit": {
"name": "Max documents",
"comment": "The maximum number of documents matching the query to select for clustering.",
"value": 2000
},
"clusterCreationPreference": {
"name": "Cluster creation preference",
"comment": "How many clusters to create. The more negative the preference, the fewer clusters. The closer the preference to 0, the more clusters.",
"value": -1000
},
"clusterLinkingPreference": {
"name": "Cluster linking preference",
"comment": "How many links to create between clusters. Softening of 0 creates unlinked, flat structure of clusters. Softening of 1.0 creates a highly-linked structure of clusters.",
"value": 0
},
"maxSimilarDocuments": {
"name": "Max similar documents",
"comment": "How many similar documents to find for each document in the similarity matrix. The larger the number of similar documents, the larger and more general the clusters and the longer clustering time.",
"value": 10
},
"maxClusterLabels": {
"name": "Max cluster labels",
"comment": "How many labels to use to label each cluster.",
"value": 3
}
},
"components": {
"query": {
"type": "query:string",
"query": {
"@var": "query"
}
}
},
"stages": {
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:reference",
"use": "query"
},
"limit": {
"@var": "limit"
}
},
"content": {
"type": "documentContent",
"limit": {
"@var": "limit"
}
},
"clusters": {
"type": "clusters:ap",
"matrix": {
"type": "matrix:keywordDocumentSimilarity",
"maxNeighbors": {
"@var": "maxSimilarDocuments"
}
},
"inputPreference": {
"@var": "clusterCreationPreference"
},
"softening": {
"@var": "clusterLinkingPreference"
}
},
"labelClusters": {
"type": "labelClusters:documentClusterLabels",
"maxLabels": {
"@var": "maxClusterLabels"
},
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"labelFilter": {
"type": "labelFilter:dictionary",
"exclude": [
{
"type": "dictionary:queryTerms",
"query": {
"type": "query:reference",
"use": "query"
}
}
]
}
}
}
},
"output": {
"stages": [
"content",
"clusters",
"labelClusters"
]
}
}
The label clusters produced for the first few document clusters are shown below:
"clusters": [
{
"clusters": [],
"labels": [
{
"label": "cross section",
"weight": 29
},
{
"label": "hadronic",
"weight": 11
}
]
},
{
"clusters": [
{
"clusters": [],
"labels": [
{
"label": "particle",
"weight": 2
},
{
"label": "spinless particles",
"weight": 2
},
{
"label": "coupled",
"weight": 2
},
{
"label": "new",
"weight": 2
},
{
"label": "constraints",
"weight": 2
},
{
"label": "light",
"weight": 2
}
]
}
],
"labels": [
{
"label": "γ",
"weight": 82
},
{
"label": "γγ",
"weight": 36
},
{
"label": "e",
"weight": 26
}
]
},
{
"clusters": [],
"labels": [
{
"label": "baryon",
"weight": 7
},
{
"label": "running vacuum",
"weight": 6
}
]
},
{
"clusters": [],
"labels": [
{
"label": "black hole",
"weight": 94
},
{
"label": "photon ring",
"weight": 32
},
{
"label": "ring",
"weight": 28
}
]
}
]
By default, for each cluster the labelClusters:documentClusterLabels stage chooses the labels that maximize the Mutual Information with respect to the contents of the cluster. Use the mutualInformationWeight property to set the balance between Mutual Information and simple maximum-weight label selection.
Note that label scoring only determines the selection and order of the labels for each cluster. The weight of the labels in the output JSON represents the number of documents in the cluster containing that label.
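The difference between the selection score and the reported weight can be illustrated on a toy data set. The Python sketch below assumes each document is represented as a set of labels and uses the textbook Mutual Information between two binary events ("document is in the cluster" and "document contains the label"). The exact estimator Lingo4G uses is not described here, so treat the formula, the function names and the data as assumptions.

```python
import math


def document_count(label, docs):
    """Number of documents in `docs` that contain the label."""
    return sum(1 for labels in docs if label in labels)


def mutual_information(label, cluster_docs, other_docs):
    """MI in bits between cluster membership and label presence."""
    n = len(cluster_docs) + len(other_docs)
    in_with = document_count(label, cluster_docs)
    out_with = document_count(label, other_docs)
    # 2x2 contingency table: (in cluster?, has label?) -> document count.
    cells = {
        (True, True): in_with,
        (True, False): len(cluster_docs) - in_with,
        (False, True): out_with,
        (False, False): len(other_docs) - out_with,
    }
    mi = 0.0
    for (in_cluster, has_label), count in cells.items():
        if count == 0:
            continue  # a zero cell contributes nothing to the sum
        p_joint = count / n
        p_cluster = (len(cluster_docs) if in_cluster else len(other_docs)) / n
        with_label = in_with + out_with
        p_label = (with_label if has_label else n - with_label) / n
        mi += p_joint * math.log2(p_joint / (p_cluster * p_label))
    return mi


# Hypothetical toy corpus: three documents in the cluster, three outside it.
cluster_docs = [{"black hole", "photon"}, {"black hole", "photon"}, {"photon"}]
other_docs = [{"photon", "baryon"}, {"photon"}, {"baryon"}]

for label in ("black hole", "photon"):
    weight = document_count(label, cluster_docs)  # what the output would report
    score = mutual_information(label, cluster_docs, other_docs)  # used for ranking
    print(f"{label}: weight={weight}, MI={score:.3f}")
```

In this toy data, "photon" occurs in more of the cluster's documents and so has the larger weight, but "black hole" has the higher Mutual Information because it never occurs outside the cluster. A label can therefore be listed first while carrying a smaller weight.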
clusters
The document clusters to create label clusters for.
documents
The source documents of the clusters referenced in clusters.
label​Collector
Configures the collection of labels from individual documents.
The default collector configuration should provide reasonable labels in typical cases.
Use this property to override the label collection configuration to, for example, apply custom label filtering.
label​List​Filter
The label list filter to apply to the labels describing each cluster.
A particularly useful filter in this context is the labelListFilter:diversified filter, which attempts to remove repetitive labels from the cluster description. See the documentation of that filter for an example request.
max​Labels
The maximum number of labels to output for each cluster.
max​Labels​Per​Document
The maximum number of labels to collect from each document when describing clusters.
mutual​Information​Weight
Determines the type of scoring Lingo4G uses to select cluster labels.
This property accepts values in the 0.0...1.0 range.
Values | Scoring |
---|---|
1.0 | Selects labels that maximize Mutual Information with respect to the cluster. These labels frequently occur within the cluster and are less common in documents from other clusters. |
0.0 | Chooses labels that occur most frequently in the cluster's documents. This method may also promote labels that are frequent in other clusters. |
less than 1.0 | Combines Mutual Information and occurrence-count scoring. The label score is a weighted geometric mean of the label's Mutual Information value and its in-cluster frequency. |
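As a rough illustration of the combined scoring, a weighted geometric mean of two values can be written as MI^w × frequency^(1−w), where w is the mutualInformationWeight. The Python sketch below assumes this form. The normalization Lingo4G applies to either value is not stated here, and the function names and input numbers are hypothetical.

```python
def combined_score(mutual_information, frequency, weight):
    """Weighted geometric mean of two non-negative values.

    weight=1.0 returns mutual_information, weight=0.0 returns frequency.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in the 0.0...1.0 range")
    return (mutual_information ** weight) * (frequency ** (1.0 - weight))


def rank(candidates, weight):
    """Order labels by descending combined score."""
    return sorted(
        candidates,
        key=lambda label: combined_score(*candidates[label], weight),
        reverse=True,
    )


# Hypothetical (Mutual Information, in-cluster frequency) pairs.
candidates = {"photon": (0.19, 3), "black hole": (0.46, 2)}

for weight in (1.0, 0.5, 0.0):
    print(weight, rank(candidates, weight))
```

With these made-up numbers, "black hole" ranks first at weights 1.0 and 0.5, while "photon" takes over at 0.0, where only the in-cluster frequency counts.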
threads
The number of threads to engage for collection of labels from documents.