labelCollector

label​Collector:​* components extract labels from feature fields of a single document. Label collectors play a crucial role when fetching and aggregating labels from many documents or when computing similarities between documents.

You can use the following label collector components in your analysis requests:

label​Collector:​top​From​Feature​Fields

Collects the document's most frequent labels based on the frequency thresholds of your choice.


label​Collector:​reference

References a label​Collector:​* component defined in the request or in the project's default components.


label​Collector:​top​From​Feature​Fields

Collects the document's most frequent labels from one or more feature fields, based on the frequency thresholds of your choice.

{
  "type": "labelCollector:topFromFeatureFields",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "minTf": 0,
  "minTfMass": 1,
  "tieResolution": "AUTO"
}

This component sums the occurrences of all input labels across the provided fields. It then removes labels that don't pass the labelFilter criteria or the minimum frequency thresholds (minTf, minTfMass). Finally, a label list filter can be applied to the entire result, for example to eliminate truncated phrases.
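The counting and filtering steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the component's actual implementation: the function name collect_labels, the accept callback standing in for labelFilter, and the sample label lists are all hypothetical. The minTfMass threshold and the label list filter are left out of the sketch.

```python
from collections import Counter

def collect_labels(field_labels, accept, min_tf=0):
    """Hypothetical sketch: sum label occurrences across fields, then filter.

    field_labels: dict mapping a field name to the list of label
                  occurrences found in that field of one document.
    accept:       callable standing in for labelFilter (True = keep).
    min_tf:       minimum aggregated frequency, inclusive (as in minTf).
    """
    totals = Counter()
    for labels in field_labels.values():
        totals.update(labels)  # frequency is aggregated across all fields

    kept = {
        label: tf
        for label, tf in totals.items()
        if accept(label) and tf >= min_tf
    }
    # Most frequent first; ties ordered alphabetically for determinism.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Made-up occurrences for a single document.
doc_fields = {
    "title$phrases": ["photon", "jet"],
    "abstract$phrases": ["jet", "photon", "photon", "of the"],
}
top = collect_labels(doc_fields, accept=lambda l: l != "of the", min_tf=2)
```

Here "photon" occurs three times and "jet" twice across both fields, so both pass min_tf=2, while "of the" is rejected by the stand-in filter.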

In the example request shown below, we look for the top three most frequent labels in each document returned for the provided query.

{
  "name": "Top 3 per-document labels aggregated from the title and abstract fields.",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 4
    },
    "documentLabels": {
      "type": "documentLabels",
      "maxLabels": 3,
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "fields": {
          "type":"featureFields:simple",
          "fields": [
            "title$phrases",
            "abstract$phrases"
          ]
        },
        "tieResolution": "TRUNCATE",
        "labelListFilter":{
          "type": "labelListFilter:truncatedPhrases"
        },
        "labelFilter": {
          "type": "labelFilter:acceptLabels",
          "labels": {
            "type": "labels:fromDocuments"
          }
        }
      }
    }
  },
  "output": {
    "stages": [
      "documentLabels"
    ]
  }
}

The above request produces the following response.

"documentLabels": {
  "documents": [
    {
      "id": 482237,
      "labels": [
        {
          "label": "jet",
          "weight": 2
        },
        {
          "label": "photon-photon",
          "weight": 2
        }
      ]
    },
    {
      "id": 298152,
      "labels": [
        {
          "label": "Collider",
          "weight": 3
        }
      ]
    },
    {
      "id": 275187,
      "labels": [
        {
          "label": "interference",
          "weight": 7
        }
      ]
    },
    {
      "id": 408761,
      "labels": [
        {
          "label": "Rydberg",
          "weight": 2
        },
        {
          "label": "photon-photon",
          "weight": 2
        },
        {
          "label": "quantum",
          "weight": 2
        }
      ]
    }
  ]
}
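
A response like the one above can be read with any JSON library. The Python sketch below assumes that the documentLabels fragment is wrapped in an enclosing JSON object; the helper name labels_by_document is hypothetical, and the embedded text reuses the first two documents from the response shown above.

```python
import json

# Excerpt of the response above, wrapped in an enclosing object (assumption).
response_text = """
{
  "documentLabels": {
    "documents": [
      {"id": 482237, "labels": [{"label": "jet", "weight": 2},
                                {"label": "photon-photon", "weight": 2}]},
      {"id": 298152, "labels": [{"label": "Collider", "weight": 3}]}
    ]
  }
}
"""

def labels_by_document(text):
    """Map each document id to its list of (label, weight) pairs."""
    payload = json.loads(text)
    return {
        doc["id"]: [(entry["label"], entry["weight"]) for entry in doc["labels"]]
        for doc in payload["documentLabels"]["documents"]
    }

by_doc = labels_by_document(response_text)
```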

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

One or more feature fields from which labels are retrieved. The value of this property should contain or reference one of the featureFields:* components.

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

A label​Filter:​* component that can be used to remove undesired labels.

label​List​Filter

Type
labelListFilter
Default
{
  "type": "labelListFilter:truncatedPhrases"
}
Required
no

A labelListFilter:* component that can be used to remove undesired labels, similar to the label filter. Unlike label filters, label list filters have access to the entire set of labels of each document, so they can make better global choices. The truncated phrase removal filter (labelListFilter:truncatedPhrases) is an example of this.

min​Tf

Type
integer
Default
0
Constraints
value >= 0
Required
no

Minimum label frequency (inclusive). Label frequency is aggregated across all selected fields.

min​Tf​Mass

Type
number
Default
1
Constraints
value >= 0 and value <= 1
Required
no

Minimum relative term frequency of a label with respect to the total frequency of all labels (after filtering) retrieved from the document's fields.

The values of this parameter must be between 0 and 1.

tie​Resolution

Type
string
Default
"AUTO"
Constraints
one of [TRUNCATE, EXTEND, REDUCE, AUTO]
Required
no

The strategy for deciding how many labels to return when the consuming component requests a fixed number of labels and the labels at the tail of the sorted list, around that limit, have equal frequencies.

The tie​Resolution property supports the following values:

T​R​U​N​C​A​T​E

Truncate the output at the limit of labels set by the consumer component.

E​X​T​E​N​D

Extend the list of labels past the limit to include all labels with the same weight.

R​E​D​U​C​E

Reduce the list of labels below the limit so that only labels with non-tied weights are included; the labels tied at the limit are dropped.

A​U​T​O

Behaves the same as REDUCE, unless the returned list of labels would be empty, in which case it behaves like EXTEND.
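
The four strategies are easiest to compare on a small example. The Python sketch below is one possible reading of the descriptions above, not the component's actual code: the function resolve_ties, its parameters, and the sample weights are hypothetical, and limit stands for the number of labels requested by the consuming component.

```python
def resolve_ties(labels, limit, strategy="AUTO"):
    """Hypothetical sketch of tie resolution at the cut-off.

    labels:   list of (label, weight) pairs.
    limit:    number of labels requested by the consumer.
    strategy: one of TRUNCATE, EXTEND, REDUCE, AUTO.
    """
    if strategy not in ("TRUNCATE", "EXTEND", "REDUCE", "AUTO"):
        raise ValueError(f"unknown strategy: {strategy}")
    if limit <= 0:
        return []

    # Stable sort: heaviest first, input order preserved among equal weights.
    ranked = sorted(labels, key=lambda lw: -lw[1])
    if len(ranked) <= limit:
        return ranked

    boundary = ranked[limit - 1][1]
    if ranked[limit][1] != boundary:
        return ranked[:limit]  # no tie across the cut-off

    extended = [lw for lw in ranked if lw[1] >= boundary]  # keep the tied group
    reduced = [lw for lw in ranked if lw[1] > boundary]    # drop the tied group

    if strategy == "TRUNCATE":
        return ranked[:limit]
    if strategy == "EXTEND":
        return extended
    if strategy == "REDUCE":
        return reduced
    return reduced if reduced else extended  # AUTO


sample = [("a", 5), ("b", 3), ("c", 3), ("d", 3), ("e", 1)]
all_tied = [("x", 2), ("y", 2), ("z", 2)]
```

With limit set to 2, the sample list has a three-way tie across the cut-off: under this reading, TRUNCATE keeps a and b, EXTEND keeps a through d, and REDUCE and AUTO keep only a. For the all_tied list, REDUCE yields an empty list, which is the case where AUTO falls back to EXTEND.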

Consumers of label​Collector:​*

The following stages and components take label​Collector:​* as input:

Stage or component                      Property
documentLabels                          labelCollector
labelAggregator:topWeight               labelCollector
matrix:keywordDocumentSimilarity        labelCollector
matrixRows:keywordDocumentSimilarity    labelCollector