labelCollector

label​Collector:​* components extract labels from feature fields of a single document. Label collectors play a crucial role when fetching and aggregating labels from many documents or when computing similarities between documents.

You can use the following label collector components in your analysis requests:

label​Collector:​all​From​Content​Fields

Collects values of the document's content fields you specify.

label​Collector:​all​From​Feature​Fields

Collects all labels from the document's feature fields you specify.

label​Collector:​top​Embedding​Nearest​Neighbors

Collects labels whose embedding vectors are most similar to the document's embedding vector.

label​Collector:​top​From​Feature​Fields

Collects the document's most frequent labels based on the frequency thresholds of your choice.


label​Collector:​reference

References a label​Collector:​* component defined in the request or in the project's default components.


label​Collector:​all​From​Content​Fields

Collects all values of the document's content fields you specify.

{
  "type": "labelCollector:allFromContentFields",
  "fields": {
    "type": "contentFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelFilterTargetFieldName": null
}

A typical use case for this collector is counting the occurrences of content field values across document clusters. For example, the following request clusters the top 10k documents matching the clustering search query and then labels the clusters based on the most frequent values of the documents' category field.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      },
      "limit": 10000
    },
    "clusters": {
      "type": "clusters:cd",
      "matrix": {
        "type": "matrix:knnVectorsSimilarity",
        "vectors": {
          "type": "vectors:precomputedDocumentEmbeddings"
        }
      }
    },
    "categoriesInClusters": {
      "type": "labelClusters:documentClusterLabels",
      "labelCollector": {
        "type": "labelCollector:allFromContentFields",
        "fields": {
          "type": "contentFields:simple",
          "fields": {
            "category": {}
          }
        }
      }
    }
  }
}

For each cluster, the response contains the three values of the category field that occur most frequently in the cluster's documents. Since the clusters group documents with similar content, the categories reflect that similarity.

{
  "result" : {
    "categoriesInClusters" : {
      "clusters" : [
        {
          "clusters" : [ ],
          "labels" : [
            {
              "label" : "nlin.AO",
              "weight" : 44.0
            },
            {
              "label" : "nlin.CD",
              "weight" : 37.0
            },
            {
              "label" : "cs.SY",
              "weight" : 21.0
            }
          ]
        },
        {
          "clusters" : [ ],
          "labels" : [
            {
              "label" : "astro-ph.GA",
              "weight" : 495.0
            },
            {
              "label" : "astro-ph",
              "weight" : 464.0
            },
            {
              "label" : "astro-ph.SR",
              "weight" : 425.0
            }
          ]
        },
        {
          "clusters" : [ ],
          "labels" : [
            {
              "label" : "astro-ph.HE",
              "weight" : 150.0
            },
            {
              "label" : "astro-ph.GA",
              "weight" : 140.0
            },
            {
              "label" : "gr-qc",
              "weight" : 98.0
            }
          ]
        },
        {
          "clusters" : [ ],
          "labels" : [
            {
              "label" : "cs.CV",
              "weight" : 401.0
            },
            {
              "label" : "cs.LG",
              "weight" : 369.0
            },
            {
              "label" : "cs.AI",
              "weight" : 274.0
            }
          ]
        }
      ],
      "unclustered" : [ ]
    }
  }
}

Weights of the labels collected by label​Collector:​all​From​Content​Fields correspond to the number of documents in the cluster that contain the specific field value (Document Frequency).
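
The document-frequency weighting can be illustrated with a short sketch. This is not Lingo4G code: the document_frequency helper and the sample documents are hypothetical and only mimic the rule that a value counts once for each document containing it.

```python
from collections import Counter

def document_frequency(cluster_docs, field):
    """Count, for each field value, the number of documents containing it."""
    df = Counter()
    for doc in cluster_docs:
        # A set ensures a value repeated within one document counts once.
        df.update(set(doc.get(field, [])))
    return df

# Hypothetical cluster of three documents with a multi-valued "category" field.
cluster = [
    {"category": ["cs.CV", "cs.LG"]},
    {"category": ["cs.CV"]},
    {"category": ["cs.LG", "cs.LG", "cs.AI"]},
]

weights = document_frequency(cluster, "category")
for label, weight in weights.items():
    print(label, weight)
```

In this sketch, cs.LG receives a weight of 2 even though it appears three times in total, because only two documents contain it.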

fields

Type
contentFields
Default
{
  "type": "contentFields:reference",
  "auto": true
}
Required
no

The fields from which to extract the labels.

Note that this collector ignores the content​Field specifications you provide. It always collects all values of the fields you specify.

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

Determines whether to allow Lingo4G to collect a specific field value from the document.

You can use the label filter to shape the label list to your liking, for example to remove one-word field values.

label​Filter​Target​Field​Name

Type
one of [string, null]
Default
null
Required
no

Enables applying the label filter to a different field than the one whose value this component collects.

This mechanism is useful when you have two or more value-aligned fields and would like to collect values of one field, but include or exclude some of the values based on the corresponding value of the other field. For example, if your documents have the name and address fields, you might want to collect the name only if the address contains a specific city or country name. To achieve this, you'd provide the name field in the fields property and set the labelFilterTargetFieldName property to address.

If label​Filter​Target​Field​Name is null, the label​Filter receives the field values being collected.
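
A minimal sketch of the value-aligned filtering idea, assuming that "value-aligned" means the i-th value of one field corresponds to the i-th value of the other. The collect_aligned helper, the accepts predicate (standing in for a label filter), and the sample document are hypothetical and are not part of the Lingo4G API.

```python
def collect_aligned(doc, collect_field, accepts, filter_field=None):
    """Collect values of collect_field, filtering on an aligned field if given."""
    values = doc.get(collect_field, [])
    # Without a target field, the filter sees the collected values themselves.
    tested = values if filter_field is None else doc.get(filter_field, [])
    # Assumed alignment: the i-th tested value decides about the i-th collected value.
    return [value for value, probe in zip(values, tested) if accepts(probe)]

# Hypothetical document with two value-aligned fields.
doc = {
    "name": ["Alice", "Bob", "Carol"],
    "address": ["Warsaw, Poland", "Berlin, Germany", "Poznan, Poland"],
}

in_poland = lambda text: "Poland" in text
print(collect_aligned(doc, "name", in_poland, filter_field="address"))
# prints ['Alice', 'Carol']
```

Calling collect_aligned without filter_field corresponds to the null case: the predicate then receives the name values directly.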

label​Collector:​all​From​Feature​Fields

Collects all labels from the document's feature fields you specify.

{
  "type": "labelCollector:allFromFeatureFields",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "minOccurrences": 0
}

This is a simpler variant of the label​Collector:​top​From​Feature​Fields collector. The label​Collector:​top​From​Feature​Fields is better suited for most typical applications.

You may want to use the label​Collector:​all​From​Feature​Fields collector to collect an exhaustive list of labels, for example, for document cluster labeling.

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

The feature fields from which to collect labels.

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

Determines whether to allow Lingo4G to collect a specific label from the document.

label​List​Filter

Type
labelListFilter
Default
{
  "type": "labelListFilter:truncatedPhrases"
}
Required
no

Performs filtering of the complete list of labels collected from one document.

Label list filters, such as the labelListFilter:truncatedPhrases filter, make decisions based on the relationships between labels on the list.

min​Occurrences

Type
integer
Default
0
Constraints
value >= 0
Required
no

The minimum number of occurrences a label must have in the document to be eligible for collection. The collector ignores labels appearing fewer than min​Occurrences times in the document.

You can use this threshold to remove infrequent labels from the collection results.

label​Collector:​top​Embedding​Nearest​Neighbors

Collects labels whose embedding vectors are most similar to the document's embedding vector.

{
  "type": "labelCollector:topEmbeddingNearestNeighbors",
  "failIfEmbeddingsNotAvailable": true,
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  }
}

For each input document, this collector searches for labels whose embedding vectors are most similar to the document's embedding vector. Compared to the label​Collector:​top​From​Feature​Fields collector, this collector favors longer multi-word labels.

Note that this collector may return labels that do not occur in the document. To take advantage of embedding-based weighting while ensuring that the collected labels occur in the input documents, use the labelCollector:topFromFeatureFields collector with the EMBEDDING labelWeighting.
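
As an illustration of the alternative that keeps labels tied to the document's own content, the sketch below assembles a request that uses labelCollector:topFromFeatureFields with EMBEDDING weighting. It only combines properties documented on this page; the field names and the query are taken from the example request in the labelCollector:topFromFeatureFields section and will differ in your project.

```python
import json

# Collector restricted to labels occurring in the document, weighted by embeddings.
label_collector = {
    "type": "labelCollector:topFromFeatureFields",
    "fields": {
        "type": "featureFields:simple",
        "fields": ["title$phrases", "abstract$phrases"],
    },
    "labelWeighting": "EMBEDDING",
}

request = {
    "stages": {
        "documents": {
            "type": "documents:byQuery",
            "query": {"type": "query:string", "query": "photon"},
            "limit": 4,
        },
        "documentLabels": {
            "type": "documentLabels",
            "maxLabels": 3,
            "labelCollector": label_collector,
        },
    }
}

# Serialize the request body; how you send it to Lingo4G is up to your client code.
print(json.dumps(request, indent=2))
```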

fail​If​Embeddings​Not​Available

Type
boolean
Default
true
Required
no

Determines the behavior of this stage if the index does not contain document or label embeddings.

If the index does not contain document or label embeddings and fail​If​Embeddings​Not​Available is:

true
this stage fails and logs an error.
false
this stage collects an empty set of labels.

If your request combines keyword- and embedding-based processing, you can set failIfEmbeddingsNotAvailable to false to have Lingo4G degrade gracefully to keyword-based processing if the index does not contain document embeddings.

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

Determines whether to allow Lingo4G to collect a specific label from the document.

You can use the label filter to shape the label list to your liking, for example to remove one-word labels.

label​List​Filter

Type
labelListFilter
Default
{
  "type": "labelListFilter:truncatedPhrases"
}
Required
no

Performs filtering of the complete list of labels collected from one document.

Label list filters, such as the labelListFilter:truncatedPhrases filter, make decisions based on the relationships between labels on the list.

label​Collector:​top​From​Feature​Fields

Collects the document's most frequent labels from one or more feature fields, based on the frequency thresholds of your choice.

{
  "type": "labelCollector:topFromFeatureFields",
  "failIfEmbeddingsNotAvailable": true,
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "labelWeighting": "EMBEDDING",
  "minTf": 0,
  "minWeight": 0,
  "minWeightMass": 1,
  "tieResolution": "AUTO"
}

This component works by collecting all labels occurring in the document's feature fields you choose to process. Then, Lingo4G removes the labels that don't pass the labelFilter criteria or the minimum weight thresholds. Finally, Lingo4G applies the label list filter to the entire result, for example to eliminate truncated phrases.
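
The three steps can be sketched as follows. This is a simplified illustration rather than the actual implementation: it uses occurrence counts as weights, the collect_top_labels and drop_prefix_phrases helpers are hypothetical, and the prefix-based list filter is only a stand-in for labelListFilter:truncatedPhrases, whose exact rules are not described here.

```python
def collect_top_labels(field_labels, label_filter, list_filter, min_weight=0):
    # Step 1: gather every label occurring in the processed feature fields.
    weights = {}
    for labels in field_labels.values():
        for label in labels:
            weights[label] = weights.get(label, 0) + 1
    # Step 2: drop labels rejected by the label filter or below the weight threshold.
    kept = {l: w for l, w in weights.items() if label_filter(l) and w >= min_weight}
    # Step 3: let the list filter inspect the complete per-document list.
    return list_filter(kept)

def drop_prefix_phrases(weights):
    """Stand-in list filter: remove labels that are word-prefixes of a longer label."""
    labels = list(weights)
    return {
        l: w for l, w in weights.items()
        if not any(other != l and other.startswith(l + " ") for other in labels)
    }

# Hypothetical labels found in two feature fields of one document.
fields = {
    "title$phrases": ["Large Hadron", "Large Hadron Collider"],
    "abstract$phrases": ["Large Hadron Collider", "a", "jet"],
}
result = collect_top_labels(
    fields,
    label_filter=lambda label: len(label) > 1,  # hypothetical: reject one-character labels
    list_filter=drop_prefix_phrases,
)
print(result)
# prints {'Large Hadron Collider': 2, 'jet': 1}
```

The list filter runs last because it needs the complete list: "Large Hadron" can only be recognized as truncated once "Large Hadron Collider" is known to be present.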

In the example request shown below, we look for the top three most frequent labels in each document returned for the provided query.

{
  "name": "Top 3 per-document labels aggregated from the title and abstract fields.",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 4
    },
    "documentLabels": {
      "type": "documentLabels",
      "maxLabels": 3,
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "fields": {
          "type": "featureFields:simple",
          "fields": [
            "title$phrases",
            "abstract$phrases"
          ]
        },
        "tieResolution": "TRUNCATE"
      }
    }
  },
  "output": {
    "stages": [
      "documentLabels"
    ]
  }
}

The above request produces the following response.

"documentLabels": {
  "documents": [
    {
      "id": 482237,
      "labels": [
        {
          "label": "jet",
          "weight": 2
        },
        {
          "label": "photon-photon",
          "weight": 2
        }
      ]
    },
    {
      "id": 298152,
      "labels": [
        {
          "label": "Collider",
          "weight": 3
        }
      ]
    },
    {
      "id": 275187,
      "labels": [
        {
          "label": "interference",
          "weight": 7
        }
      ]
    },
    {
      "id": 408761,
      "labels": [
        {
          "label": "Rydberg",
          "weight": 2
        },
        {
          "label": "photon-photon",
          "weight": 2
        },
        {
          "label": "quantum",
          "weight": 2
        }
      ]
    }
  ]
}

fail​If​Embeddings​Not​Available

Type
boolean
Default
true
Required
no

Determines the behavior of this stage if the index does not contain document or label embeddings.

If the index does not contain document or label embeddings and fail​If​Embeddings​Not​Available is:

true
this stage fails and logs an error.
false
this stage collects an empty set of labels.

If your request combines keyword- and embedding-based processing, you can set failIfEmbeddingsNotAvailable to false to have Lingo4G degrade gracefully to keyword-based processing if the index does not contain document embeddings.

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

One or more fields from which labels are retrieved. The value of this property should contain or reference one of the featureFields:* components.

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

A label​Filter:​* component that can be used to remove undesired labels.

label​List​Filter

Type
labelListFilter
Default
{
  "type": "labelListFilter:truncatedPhrases"
}
Required
no

A labelListFilter:* component that can be used to remove undesired labels, similar to the label filter. Label list filters have access to the entire set of labels of each document, so they can make better global choices. The labelListFilter:truncatedPhrases filter, which removes incomplete phrases, is an example.

label​Weighting

Type
string
Default
"EMBEDDING"
Constraints
one of [TF, EMBEDDING]
Required
no

Determines how Lingo4G computes the weights of the labels collected from the document.

The label​Weighting property supports the following values:

T​F

Lingo4G uses the occurrence count in the document as the label's weight. Lingo4G sums up the occurrences across all the fields you provide.

Frequency-based weighting favors single-word high-frequency labels.

E​M​B​E​D​D​I​N​G

Lingo4G uses the similarity between the label's embedding vector and the document's embedding vector as label weight.

Embedding-based weighting favors longer, more descriptive labels.
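
The two weighting modes can be contrasted with a small sketch. The helper functions and the toy two-dimensional vectors are hypothetical, and cosine similarity is an assumption made for illustration: this page does not state which similarity measure Lingo4G uses.

```python
import math
from collections import Counter

def tf_weights(field_labels):
    """TF: sum each label's occurrences across all processed fields."""
    totals = Counter()
    for labels in field_labels.values():
        totals.update(labels)
    return dict(totals)

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def embedding_weights(label_vectors, doc_vector):
    """EMBEDDING: weight each label by its similarity to the document vector."""
    return {label: cosine(vec, doc_vector) for label, vec in label_vectors.items()}

# Hypothetical per-field label occurrences and toy embeddings.
fields = {
    "title$phrases": ["photon", "photon-photon scattering"],
    "abstract$phrases": ["photon", "photon", "photon-photon scattering"],
}
vectors = {"photon": [1.0, 0.0], "photon-photon scattering": [0.6, 0.8]}
doc_vector = [0.6, 0.8]

print(tf_weights(fields))
print(embedding_weights(vectors, doc_vector))
```

With these toy inputs, the frequent single word ranks first under TF, while the longer phrase whose vector points in the same direction as the document vector ranks first under the embedding-based weighting.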

min​Tf

Type
number
Default
0
Constraints
value >= 0
Required
no

The minimum number of times the label must occur in the document for Lingo4G to collect it.

If the number of occurrences is lower than the threshold, Lingo4G ignores the label during the collection process.

Lingo4G aggregates the occurrence counts across all feature fields you choose to process.

min​Weight

Type
number
Default
0
Constraints
value >= 0
Required
no

The minimum weight each label must have to be eligible for collection from the document.

If the label's weight is smaller than the threshold, Lingo4G ignores the label during the collection process.

For term frequency weighting, Lingo4G aggregates the occurrence counts across all feature fields you choose to process.

min​Weight​Mass

Type
number
Default
1
Constraints
value >= 0 and value <= 1
Required
no

The minimum fraction of the total weight mass the collected labels must represent.

The total document weight mass is the sum of the weights of all labels available in the document.

If minWeightMass is 1.0, Lingo4G collects all labels that meet the minimum weight and label filtering criteria. If minWeightMass is smaller than 1.0, for example 0.7, Lingo4G skips the lowest-weight labels that account for 30% of the weight mass.

You can use min​Weight​Mass to dynamically lower the number of labels Lingo4G collects from a document without setting fixed weight thresholds. The min​Weight​Mass threshold is especially useful when processing documents of varied lengths.
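
One way to picture the weight mass threshold is the sketch below. It is an interpretation of the description above, not Lingo4G's actual code: the collect_by_weight_mass helper is hypothetical and assumes labels are taken in order of decreasing weight until they cover the requested fraction of the total.

```python
def collect_by_weight_mass(weights, min_weight_mass=1.0):
    """Keep top-weight labels until they cover min_weight_mass of the total."""
    total = sum(weights.values())
    target = min_weight_mass * total
    collected, mass = [], 0.0
    for label, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if mass >= target:
            break  # The labels collected so far already represent enough mass.
        collected.append(label)
        mass += weight
    return collected

# Hypothetical label weights for one document; the total mass is 10.
weights = {"quantum": 5, "photon": 3, "Rydberg": 1, "jet": 1}
print(collect_by_weight_mass(weights, 0.7))
# prints ['quantum', 'photon']
```

Because the cut-off depends on the distribution of weights rather than on a fixed value, a short document with few labels and a long document with many labels each contribute their most significant labels.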

tie​Resolution

Type
string
Default
"AUTO"
Constraints
one of [TRUNCATE, EXTEND, REDUCE, AUTO]
Required
no

Determines how many labels the collector returns when labels at the tail of the weight-sorted list have equal weights and the consuming component requests a fixed number of labels.

The tie​Resolution property supports the following values:

A​U​T​O

Behaves the same as REDUCE, unless the returned list of labels would be empty, in which case it behaves like EXTEND.

E​X​T​E​N​D

Extend the list of labels past the limit to include all labels with the same weight.

R​E​D​U​C​E

Reduce the list of labels so that all labels with non-tied weights are included.

T​R​U​N​C​A​T​E

Truncate the output at the limit of labels set by the consumer component.

We don't recommend using the TRUNCATE resolution when collecting labels from large numbers of documents because this mode incurs additional computational overhead to ensure collection of top-weight labels.

The only use case for the T​R​U​N​C​A​T​E resolution is collection of an exact number of labels from individual documents.
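
The four strategies can be compared with a short sketch. It encodes one reading of the descriptions above and is not taken from Lingo4G: the resolve_ties helper is hypothetical, and the sample weights are made up.

```python
MODES = {"TRUNCATE", "EXTEND", "REDUCE", "AUTO"}

def resolve_ties(sorted_labels, limit, mode="AUTO"):
    """Pick labels from a list of (label, weight) pairs sorted by decreasing weight."""
    if mode not in MODES:
        raise ValueError(f"unknown tie resolution: {mode}")
    if limit <= 0:
        return []
    head = sorted_labels[:limit]
    if mode == "TRUNCATE" or len(sorted_labels) <= limit:
        return head
    boundary = head[-1][1]
    if sorted_labels[limit][1] != boundary:
        # No tie crosses the limit, so every mode returns the same list.
        return head
    extended = [pair for pair in sorted_labels if pair[1] >= boundary]
    reduced = [pair for pair in sorted_labels if pair[1] > boundary]
    if mode == "EXTEND":
        return extended
    if mode == "REDUCE":
        return reduced
    # AUTO: behave like REDUCE unless that leaves nothing, then like EXTEND.
    return reduced or extended

# Hypothetical labels: three of them tie at the point where the limit of 2 falls.
labels = [("interference", 7), ("photon", 2), ("jet", 2), ("quantum", 2)]
for mode in ("TRUNCATE", "EXTEND", "REDUCE", "AUTO"):
    print(mode, [label for label, _ in resolve_ties(labels, 2, mode)])
```

With a limit of 2, TRUNCATE returns exactly two labels, EXTEND returns all four, and REDUCE and AUTO return only the untied top label. If all labels were tied, REDUCE would return nothing, which is the case where AUTO falls back to EXTEND.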

Consumers of label​Collector:​*

The following stages and components take label​Collector:​* as input:

Stage or component
  • Property
documentLabels
  • labelCollector
labelAggregator:topWeight
  • labelCollector
labelClusters:documentClusterLabels
  • labelCollector
matrix:keywordDocumentSimilarity
  • labelCollector
matrixRows:keywordDocumentSimilarity
  • labelCollector