labelCollector
labelCollector:* components extract labels from feature fields of a single document. Label collectors play a crucial role when fetching and aggregating labels from many documents or when computing similarities between documents.
You can use the following label collector components in your analysis requests:
- labelCollector:allFromContentFields
  Collects values of the document's content fields you specify.
- labelCollector:allFromFeatureFields
  Collects all labels from the document's feature fields you specify.
- labelCollector:topEmbeddingNearestNeighbors
  Collects labels whose embedding vectors are most similar to the document's embedding vector.
- labelCollector:topFromFeatureFields
  Collects the document's most frequent labels based on the frequency thresholds of your choice.
- labelCollector:reference
  References a labelCollector:* component defined in the request or in the project's default components.
labelCollector:allFromContentFields
Collects all values of the document's content fields you specify.
{
  "type": "labelCollector:allFromContentFields",
  "fields": {
    "type": "contentFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelFilterTargetFieldName": null
}
A typical use case for this collector is counting the occurrences of content field values across document clusters. For example, the following request clusters the top 10k documents matching the clustering search query and then labels the clusters based on the most frequent values of the documents' category field.
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      },
      "limit": 10000
    },
    "clusters": {
      "type": "clusters:cd",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:precomputedDocumentEmbeddings"
        }
      }
    },
    "categoriesInClusters": {
      "type": "labelClusters:documentClusterLabels",
      "labelCollector": {
        "type": "labelCollector:allFromContentFields",
        "fields": {
          "type": "contentFields:simple",
          "fields": {
            "category": {}
          }
        }
      }
    }
  }
}
The response contains, for each cluster, the three values of the category field that occur most frequently in that cluster's documents. Since the clusters group documents with similar content, the categories reflect that similarity.
{
  "result" : {
    "categoriesInClusters" : {
      "clusters" : [
        {
          "clusters" : [ ],
          "labels" : [
            {
              "label" : "nlin.AO",
              "weight" : 44.0
            },
            {
              "label" : "nlin.CD",
              "weight" : 37.0
            },
            {
              "label" : "cs.SY",
              "weight" : 21.0
            }
          ]
        },
        {
          "clusters" : [ ],
          "labels" : [
            {
              "label" : "astro-ph.GA",
              "weight" : 495.0
            },
            {
              "label" : "astro-ph",
              "weight" : 464.0
            },
            {
              "label" : "astro-ph.SR",
              "weight" : 425.0
            }
          ]
        },
        {
          "clusters" : [ ],
          "labels" : [
            {
              "label" : "astro-ph.HE",
              "weight" : 150.0
            },
            {
              "label" : "astro-ph.GA",
              "weight" : 140.0
            },
            {
              "label" : "gr-qc",
              "weight" : 98.0
            }
          ]
        },
        {
          "clusters" : [ ],
          "labels" : [
            {
              "label" : "cs.CV",
              "weight" : 401.0
            },
            {
              "label" : "cs.LG",
              "weight" : 369.0
            },
            {
              "label" : "cs.AI",
              "weight" : 274.0
            }
          ]
        }
      ],
      "unclustered" : [ ]
    }
  }
}
Weights of the labels collected by labelCollector:allFromContentFields correspond to the number of documents in the cluster that contain the specific field value (document frequency).
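The document-frequency weighting described above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not Lingo4G code: the function name and the in-memory document shape (a dict mapping a field name to a list of values) are assumptions made for the example.

```python
from collections import Counter

def document_frequency(cluster_docs, field):
    """Count, for each value of `field`, how many documents contain it."""
    counts = Counter()
    for doc in cluster_docs:
        # Use a set so that a value repeated within one document counts once.
        counts.update(set(doc.get(field, [])))
    return counts

# Hypothetical cluster of three documents with a multi-valued category field.
cluster = [
    {"category": ["cs.CV", "cs.LG"]},
    {"category": ["cs.CV"]},
    {"category": ["cs.LG", "cs.AI", "cs.LG"]},
]
weights = document_frequency(cluster, "category")
print(dict(weights))
```

In this toy cluster, cs.CV and cs.LG each get a weight of 2 and cs.AI a weight of 1, even though cs.LG appears twice in the third document.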
fields
The fields from which to extract the labels.
Note that this collector ignores the contentField specifications you provide. It always collects all values of the fields you specify.
labelFilter
Determines whether to allow Lingo4G to collect a specific field value from the document.
You can use the label filter to shape the label list to your liking, for example to remove one-word field values.
labelFilterTargetFieldName
Enables applying the label filter to a field other than the one whose values this component collects.
This mechanism is useful when you have two or more value-aligned fields and would like to collect values of one field, but include or exclude some of the values based on the corresponding values of the other field. For example, if your documents have the name and address fields, you might want to collect the name only if the address contains a specific city or country name. To achieve this, provide the name field in the fields property and set the labelFilterTargetFieldName property to address.
If labelFilterTargetFieldName is null, the labelFilter receives the field values being collected.
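The name and address scenario can be illustrated with a small Python sketch. It assumes that "value-aligned" means the i-th value of one field corresponds to the i-th value of the other; the function, the predicate, and the sample data are hypothetical and only mirror the behavior described above.

```python
def collect_with_target_filter(doc, collect_field, label_filter, target_field=None):
    """Collect values of `collect_field`, filtering on the aligned `target_field`."""
    values = doc[collect_field]
    # With no target field, the filter sees the collected values themselves.
    targets = doc[target_field] if target_field is not None else values
    return [value for value, target in zip(values, targets) if label_filter(target)]

def mentions_berlin(text):
    # Stand-in for a label filter that accepts values containing a city name.
    return "Berlin" in text

doc = {
    "name": ["Alice", "Bob", "Carol"],
    "address": ["Main St 1, Berlin", "High St 2, London", "Park Ave 3, Berlin"],
}
names = collect_with_target_filter(doc, "name", mentions_berlin, target_field="address")
print(names)  # ['Alice', 'Carol']
```

Calling the same function without target_field applies the predicate to the names themselves, which corresponds to leaving labelFilterTargetFieldName set to null.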
labelCollector:allFromFeatureFields
Collects all labels from the document's feature fields you specify.
{
  "type": "labelCollector:allFromFeatureFields",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "minOccurrences": 0
}
This is a simpler variant of the labelCollector:topFromFeatureFields collector. The labelCollector:topFromFeatureFields collector is better suited for most typical applications.
You may want to use the labelCollector:allFromFeatureFields collector to collect an exhaustive list of labels, for example, for document cluster labeling.
fields
The feature fields from which to collect labels.
labelFilter
Determines whether to allow Lingo4G to collect a specific label from the document.
labelListFilter
Performs filtering of the complete list of labels collected from one document.
Label list filters, such as the labelListFilter:truncatedPhrases filter, make decisions based on the relationships between labels on the list.
minOccurrences
The minimum number of occurrences a label must have in the document to be eligible for collection. The collector ignores labels appearing fewer than minOccurrences times in the document.
You can use this threshold to remove infrequent labels from the collection results.
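As a rough illustration of the minOccurrences threshold, the sketch below counts label occurrences in one document and drops the labels below the threshold. The function name and the flat list of occurrences are hypothetical; the code mirrors the description above and is not Lingo4G's implementation.

```python
from collections import Counter

def collect_all_labels(label_occurrences, min_occurrences=0):
    """Keep labels occurring at least `min_occurrences` times in one document.

    `label_occurrences` is a hypothetical flat list with one entry per
    occurrence of a label in the document's feature fields.
    """
    counts = Counter(label_occurrences)
    return {label: n for label, n in counts.items() if n >= min_occurrences}

occurrences = ["photon", "laser", "photon", "cavity", "photon", "laser"]
everything = collect_all_labels(occurrences)   # default threshold keeps all labels
frequent = collect_all_labels(occurrences, 2)  # drops labels seen only once
print(frequent)  # {'photon': 3, 'laser': 2}
```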
labelCollector:topEmbeddingNearestNeighbors
Collects labels whose embedding vectors are most similar to the document's embedding vector.
{
  "type": "labelCollector:topEmbeddingNearestNeighbors",
  "failIfEmbeddingsNotAvailable": true,
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  }
}
For each input document, this collector searches for labels whose embedding vectors are most similar to the document's embedding vector. Compared to the labelCollector:topFromFeatureFields collector, this collector favors longer multi-word labels.
Note that this collector may return labels that do not occur in the document. To take advantage of embedding-based weighting and at the same time ensure that the collected labels occur in the input documents, use the labelCollector:topFromFeatureFields collector with EMBEDDING labelWeighting.
failIfEmbeddingsNotAvailable
Determines the behavior of this stage if the index does not contain document or label embeddings.
If the index does not contain document or label embeddings and failIfEmbeddingsNotAvailable is:
true - this stage fails and logs an error.
false - this stage collects an empty set of labels.
If your request combines keyword- and embedding-based processing, you can set failIfEmbeddingsNotAvailable to false to have Lingo4G degrade gracefully to keyword-based processing if the index does not contain document embeddings.
labelFilter
Determines whether to allow Lingo4G to collect a specific label from the document.
You can use the label filter to shape the label list to your liking, for example to remove one-word labels.
labelListFilter
Performs filtering of the complete list of labels collected from one document.
Label list filters, such as the labelListFilter:truncatedPhrases filter, make decisions based on the relationships between labels on the list.
labelCollector:topFromFeatureFields
Collects the document's most frequent labels from one or more feature fields, based on the frequency thresholds of your choice.
{
  "type": "labelCollector:topFromFeatureFields",
  "failIfEmbeddingsNotAvailable": true,
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "labelWeighting": "EMBEDDING",
  "minTf": 0,
  "minWeight": 0,
  "minWeightMass": 1,
  "tieResolution": "AUTO"
}
This component works by collecting all labels occurring in the document's feature fields you choose to process. Then, Lingo4G removes the labels that don't pass the labelFilter criteria or the minimum weight thresholds. Finally, Lingo4G applies the label list filter to the entire result, for example to eliminate truncated phrases.
The following example request looks for the top three most frequent labels in each document returned for the provided query.
{
  "name": "Top 3 per-document labels aggregated from the title and abstract fields.",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 4
    },
    "documentLabels": {
      "type": "documentLabels",
      "maxLabels": 3,
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "fields": {
          "type": "featureFields:simple",
          "fields": [
            "title$phrases",
            "abstract$phrases"
          ]
        },
        "tieResolution": "TRUNCATE"
      }
    }
  },
  "output": {
    "stages": [
      "documentLabels"
    ]
  }
}
The above request produces the following response.
"documentLabels": {
  "documents": [
    {
      "id": 482237,
      "labels": [
        {
          "label": "jet",
          "weight": 2
        },
        {
          "label": "photon-photon",
          "weight": 2
        }
      ]
    },
    {
      "id": 298152,
      "labels": [
        {
          "label": "Collider",
          "weight": 3
        }
      ]
    },
    {
      "id": 275187,
      "labels": [
        {
          "label": "interference",
          "weight": 7
        }
      ]
    },
    {
      "id": 408761,
      "labels": [
        {
          "label": "Rydberg",
          "weight": 2
        },
        {
          "label": "photon-photon",
          "weight": 2
        },
        {
          "label": "quantum",
          "weight": 2
        }
      ]
    }
  ]
}
failIfEmbeddingsNotAvailable
Determines the behavior of this stage if the index does not contain document or label embeddings.
If the index does not contain document or label embeddings and failIfEmbeddingsNotAvailable is:
true - this stage fails and logs an error.
false - this stage collects an empty set of labels.
If your request combines keyword- and embedding-based processing, you can set failIfEmbeddingsNotAvailable to false to have Lingo4G degrade gracefully to keyword-based processing if the index does not contain document embeddings.
fields
One or more feature fields from which Lingo4G retrieves labels. The value of this property should contain or reference one of the featureFields:* components.
labelFilter
A labelFilter:* component you can use to remove undesired labels.
labelListFilter
A labelListFilter:* component you can use to remove undesired labels, similar to the label filter. Label list filters have access to the entire set of labels of each document, so they can make better global choices. The incomplete phrase removal filter is an example of this.
labelWeighting
Determines how Lingo4G computes the weights of the labels collected from the document.
The labelWeighting property supports the following values:
TF - Lingo4G uses the occurrence count in the document as the label's weight. Lingo4G sums up the occurrences across all the fields you provide. Frequency-based weighting favors single-word high-frequency labels.
EMBEDDING - Lingo4G uses the similarity between the label's embedding vector and the document's embedding vector as the label's weight. Embedding-based weighting favors longer, more descriptive labels.
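The two weighting modes can be contrasted with a minimal Python sketch. The TF part follows the description above (occurrences summed across fields). For the EMBEDDING part, the source only says "similarity", so the use of cosine similarity here is an assumption, as are the function names and the toy two-dimensional vectors.

```python
import math
from collections import Counter

def tf_weights(doc_fields):
    """Sum label occurrences across all fields (TF weighting).

    `doc_fields` is a hypothetical dict: field name -> list of label occurrences.
    """
    weights = Counter()
    for occurrences in doc_fields.values():
        weights.update(occurrences)
    return weights

def cosine(a, b):
    # Cosine similarity is an assumed similarity measure for this sketch.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def embedding_weights(doc_vector, label_vectors):
    """Weight each label by its vector's similarity to the document vector."""
    return {label: cosine(doc_vector, vec) for label, vec in label_vectors.items()}

doc_fields = {
    "title$phrases": ["photon", "photon-photon scattering"],
    "abstract$phrases": ["photon", "photon", "photon-photon scattering"],
}
tf = tf_weights(doc_fields)
emb = embedding_weights(
    [1.0, 0.0],
    {"photon": [0.6, 0.8], "photon-photon scattering": [1.0, 0.0]},
)
print(dict(tf))
print(emb)
```

With this toy data, TF ranks the single word "photon" first (3 occurrences versus 2), while the embedding-based weights rank the longer phrase first, which matches the tendencies described for the two modes.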
minTf
The minimum number of times the label must occur in the document for Lingo4G to collect it.
If the number of occurrences is lower than the threshold, Lingo4G ignores the label during the collection process.
Lingo4G aggregates the occurrence counts across all feature fields you choose to process.
minWeight
The minimum weight each label must have to be eligible for collection from the document.
If the label's weight is smaller than the threshold, Lingo4G ignores the label during the collection process.
For term frequency weighting, Lingo4G aggregates the occurrence counts across all feature fields you choose to process.
minWeightMass
The minimum fraction of the total weight mass the collected labels must represent.
The total document weight mass is the sum of the weights of all labels available in the document.
If minWeightMass is 1.0, Lingo4G collects all labels that meet the minimum weight and label filtering criteria. If minWeightMass is smaller than 1.0, for example 0.7, Lingo4G skips the lowest-weight labels that account for the remaining 30% of the weight mass.
You can use minWeightMass to dynamically lower the number of labels Lingo4G collects from a document without setting fixed weight thresholds. The minWeightMass threshold is especially useful when processing documents of varied lengths.
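One way to read the minWeightMass rule is sketched below in Python: take labels in descending weight order until they cover the requested fraction of the total weight. The function name and sample weights are hypothetical, and how Lingo4G treats the exact boundary or combines this threshold with the other filters is not covered by this sketch.

```python
def collect_by_weight_mass(weights, min_weight_mass=1.0):
    """Collect top-weight labels until they cover `min_weight_mass` of the total."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    total = sum(weights.values())
    collected, mass = [], 0.0
    for label, weight in ranked:
        # Stop once the labels collected so far cover the requested fraction.
        if total == 0 or mass >= min_weight_mass * total:
            break
        collected.append(label)
        mass += weight
    return collected

weights = {"photon": 5, "laser": 3, "cavity": 1, "mirror": 1}
top = collect_by_weight_mass(weights, 0.7)
print(top)  # ['photon', 'laser']
```

Here the total mass is 10, so a threshold of 0.7 is met after "photon" and "laser" (mass 8), and the two lowest-weight labels are skipped. A threshold of 1.0 keeps all four labels.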
tieResolution
The strategy for determining the number of returned labels when the weights at the tail of the sorted list are equal and the consuming component requests a fixed number of labels.
The tieResolution property supports the following values:
AUTO - Behaves the same as REDUCE, unless the returned list of labels would be empty, in which case it behaves like EXTEND.
EXTEND - Extends the list of labels past the limit to include all labels with the same weight.
REDUCE - Reduces the list of labels so that it includes only the labels with non-tied weights.
TRUNCATE - Truncates the output at the limit of labels set by the consumer component.
We don't recommend using the TRUNCATE resolution when collecting labels from large numbers of documents, as this mode requires additional computing overhead to ensure collection of top-weight labels. The only use case for the TRUNCATE resolution is collecting an exact number of labels from individual documents.
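The four strategies can be compared with the following Python sketch. It is one possible reading of the descriptions above, not Lingo4G's implementation; the function name, the (label, weight) tuples, and the sample data are made up for illustration.

```python
def resolve_ties(ranked, limit, tie_resolution="AUTO"):
    """Apply a tie resolution strategy to labels sorted by descending weight.

    `ranked` is a list of (label, weight) tuples; `limit` is the number of
    labels requested by the consuming component.
    """
    if limit <= 0:
        return []
    if limit >= len(ranked) or tie_resolution == "TRUNCATE":
        return ranked[:limit]
    boundary = ranked[limit - 1][1]
    if ranked[limit][1] != boundary:
        return ranked[:limit]  # no tie across the cut-off point
    extended = [item for item in ranked if item[1] >= boundary]
    reduced = [item for item in ranked if item[1] > boundary]
    if tie_resolution == "EXTEND":
        return extended
    if tie_resolution == "REDUCE":
        return reduced
    if tie_resolution == "AUTO":
        # Fall back to EXTEND only when REDUCE would return nothing.
        return reduced or extended
    raise ValueError(f"Unknown tie resolution: {tie_resolution}")

ranked = [("interference", 7), ("photon", 2), ("quantum", 2), ("Rydberg", 2)]
for mode in ("TRUNCATE", "EXTEND", "REDUCE", "AUTO"):
    print(mode, [label for label, _ in resolve_ties(ranked, 2, mode)])
```

With a limit of 2, TRUNCATE returns exactly two labels, EXTEND returns all four because three labels tie at weight 2, and REDUCE and AUTO return only "interference". If every label had the same weight, REDUCE would return an empty list and AUTO would return the extended list instead.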
Consumers of labelCollector:*
The following stages and components take labelCollector:* as input: