labelCollector

labelCollector:* components extract labels from feature fields of a single document. Label collectors play a crucial role when fetching and aggregating labels from many documents or when computing similarities between documents.

You can use the following label collector components in your analysis requests:

labelCollector:topFromFeatureFields

Collects the document's most frequent labels based on the frequency thresholds of your choice.


labelCollector:reference

References a labelCollector:* component defined in the request or in the project's default components.


labelCollector:topFromFeatureFields

Collects the document's most frequent labels from one or more feature fields, based on the frequency thresholds of your choice.

{
  "type": "labelCollector:topFromFeatureFields",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:truncatedPhrases"
  },
  "minTf": 0,
  "minTfMass": 1,
  "tieResolution": "AUTO"
}

This component works by summing up occurrences of all input labels from the provided fields. Then, labels that don't pass labelFilter criteria or the minimum frequency thresholds are removed. An additional label list filter can be applied to the entire result to eliminate truncated phrases, for example.
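The steps described above can be sketched in Python as follows. This is only an illustration of the described behavior, not the product's implementation: the function name collect_top_labels, its parameters, and the representation of a field as a plain list of label occurrences are assumptions made for this example. The minTfMass threshold and tie resolution are left out.

```python
from collections import Counter


def collect_top_labels(field_labels, label_filter=None,
                       label_list_filter=None, min_tf=0):
    """Hypothetical sketch: aggregate label occurrences, then filter.

    field_labels maps a field name to the list of label occurrences found
    in that field of one document (an assumed input representation).
    """
    # Sum occurrences of each label over all provided fields.
    counts = Counter()
    for labels in field_labels.values():
        counts.update(labels)

    # Drop labels rejected by the per-label filter or below min_tf.
    kept = [
        (label, tf)
        for label, tf in counts.items()
        if (label_filter is None or label_filter(label)) and tf >= min_tf
    ]

    # Sort by descending frequency. Ties are ordered alphabetically here
    # only to make the output deterministic.
    kept.sort(key=lambda item: (-item[1], item[0]))

    # The list filter sees the whole per-document list at once.
    if label_list_filter is not None:
        kept = label_list_filter(kept)
    return kept


# Made-up label occurrences for a single document.
doc = {
    "title$phrases": ["photon", "photon-jet"],
    "abstract$phrases": ["photon", "photon", "photon-jet", "jet"],
}
top = collect_top_labels(doc, label_filter=lambda label: label != "jet", min_tf=2)
print(top)  # [('photon', 3), ('photon-jet', 2)]
```

The per-label filter decides on each label in isolation, while the list filter receives the full sorted list, which is why the two are modeled as separate callbacks in this sketch.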

In the example request shown below, we look for the top three most frequent labels in each document returned for the provided query.

{
  "name": "Top 3 per-document labels aggregated from the title and abstract fields.",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 4
    },
    "documentLabels": {
      "type": "documentLabels",
      "maxLabels": 3,
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "fields": {
          "type":"featureFields:simple",
          "fields": [
            "title$phrases",
            "abstract$phrases"
          ]
        },
        "tieResolution": "TRUNCATE",
        "labelListFilter":{
          "type": "labelListFilter:truncatedPhrases"
        },
        "labelFilter": {
          "type": "labelFilter:acceptLabels",
          "labels": {
            "type": "labels:fromDocuments"
          }
        }
      }
    }
  },
  "output": {
    "stages": [
      "documentLabels"
    ]
  }
}

The above request produces the following response.

"documentLabels": {
  "documents": [
    {
      "id": 188201,
      "labels": [
        {
          "label": "photon",
          "weight": 16
        },
        {
          "label": "photon-jet",
          "weight": 5
        },
        {
          "label": "e⁺e",
          "weight": 5
        }
      ]
    },
    {
      "id": 62168,
      "labels": [
        {
          "label": "photon-photon",
          "weight": 3
        }
      ]
    },
    {
      "id": 252264,
      "labels": [
        {
          "label": "photon-photon",
          "weight": 2
        }
      ]
    },
    {
      "id": 404103,
      "labels": [
        {
          "label": "photon-photon",
          "weight": 3
        }
      ]
    }
  ]
}

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

One or more fields from which labels are retrieved. The value of this property should contain or reference one of the featureFields:* components.

labelFilter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

A labelFilter:* component that can be used to remove undesired labels.

labelListFilter

Type
labelListFilter
Default
{
  "type": "labelListFilter:truncatedPhrases"
}
Required
no

A labelListFilter:* component that can be used to remove undesired labels, similar to the label filter. Label list filters have access to the entire set of labels of each document, so they can make better global choices. The filter that removes truncated phrases is one example.

minTf

Type
integer
Default
0
Constraints
value >= 0
Required
no

Minimum label frequency (inclusive). Label frequency is aggregated across all selected fields.

minTfMass

Type
number
Default
1
Constraints
value >= 0 and value <= 1
Required
no

Minimum relative term frequency of a label with respect to the total frequency of all labels (after filtering) retrieved from the document's fields.

The values of this parameter must be between 0 and 1.
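As a small illustration, the relative term frequency of a label can be computed as its frequency divided by the total frequency of all labels that remain after filtering. The sketch below computes only that quantity. The function name relative_tf and the sample counts are made up for this example, and how the component applies the minTfMass threshold to these values is not shown.

```python
def relative_tf(counts):
    """Return each label's share of the total frequency of all labels.

    counts maps a label to its aggregated frequency in one document
    (hypothetical input, after label filtering).
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: tf / total for label, tf in counts.items()}


# Made-up frequencies for a single document.
shares = relative_tf({"photon": 6, "photon-jet": 3, "detector": 1})
print(shares)  # {'photon': 0.6, 'photon-jet': 0.3, 'detector': 0.1}
```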

tieResolution

Type
string
Default
"AUTO"
Constraints
one of [TRUNCATE, EXTEND, REDUCE, AUTO]
Required
no

The strategy for determining how many labels to return when the consuming component requests a fixed number of labels and the labels at the tail of the frequency-sorted list have equal frequencies.

The tieResolution property supports the following values:

TRUNCATE

Truncate the output at the limit of labels set by the consumer component.

EXTEND

Extend the list of labels past the limit to include all labels with the same weight.

REDUCE

Reduce the list of labels so that all labels with non-tied weights are included.

AUTO

Behaves the same as REDUCE, unless the returned list of labels would be empty, in which case it behaves like EXTEND.
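The four strategies can be compared with a short Python sketch. It reflects one reading of the descriptions above and is not the product's implementation: the function resolve_ties, its signature, and the sample labels are hypothetical. The sketch assumes the input is a list of (label, weight) pairs already sorted by descending weight, and that a tie matters only when it straddles the requested limit.

```python
STRATEGIES = {"TRUNCATE", "EXTEND", "REDUCE", "AUTO"}


def resolve_ties(labels, limit, strategy="AUTO"):
    """Hypothetical sketch: cut a weight-sorted label list at `limit`."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown tie resolution strategy: {strategy}")
    if limit <= 0:
        return []
    if len(labels) <= limit:
        return list(labels)  # nothing to cut, so no tie to resolve

    # Weight of the last label that still fits within the limit.
    boundary = labels[limit - 1][1]
    # Without a tie across the cut point, every strategy truncates.
    if strategy == "TRUNCATE" or labels[limit][1] != boundary:
        return list(labels[:limit])

    extended = [item for item in labels if item[1] >= boundary]
    reduced = [item for item in labels if item[1] > boundary]
    if strategy == "EXTEND":
        return extended
    if strategy == "REDUCE":
        return reduced
    return reduced or extended  # AUTO: REDUCE unless that leaves nothing


# Made-up labels: three of them tie at weight 5 around a limit of 3.
labels = [("a", 16), ("b", 5), ("c", 5), ("d", 5)]
for name in ("TRUNCATE", "EXTEND", "REDUCE", "AUTO"):
    print(name, resolve_ties(labels, 3, name))
# TRUNCATE [('a', 16), ('b', 5), ('c', 5)]
# EXTEND [('a', 16), ('b', 5), ('c', 5), ('d', 5)]
# REDUCE [('a', 16)]
# AUTO [('a', 16)]
```

When every label has the same weight, REDUCE in this sketch returns an empty list, which is the case where AUTO falls back to the EXTEND behavior.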

Consumers of labelCollector:*

The following stages and components take labelCollector:* as input:

Stage or component                      Property
documentLabels                          labelCollector
labelAggregator:topWeight               labelCollector
matrix:keywordDocumentSimilarity        labelCollector
matrixRows:keywordDocumentSimilarity    labelCollector