labelScorer

label​Scorer:​* components compute a numerical score for each label (such as document or term frequency). You can use label scores to output additional information about labels for display or filtering purposes or to recompute scores of a set of labels using information from different fields or document scopes.

You can use the following label scorers in your analysis requests:

label​Scorer:​composite

Aggregates the scores computed by the scorers you provide into a single score.

label​Scorer:​df

Computes the label's document frequency (DF).

label​Scorer:​identity

Returns the label's original weight.

label​Scorer:​idf

Computes the label's inverse document frequency (IDF).

label​Scorer:​probability​Ratio

Computes the probability ratio coefficient, which you can use to identify labels that are more likely to appear in a subset of documents of your choice than in the whole collection.

label​Scorer:​tf

Computes the label's term frequency (TF).

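The following example uses the labels:scored stage to recompute the weights of labels collected from the title field, using document frequencies retrieved from the abstract field of the reference arXiv data set. The stage name, tf, is only an identifier chosen for this request; the scorer applied is labelScorer:df.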

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 500
    },
    "tf": {
      "type": "labels:scored",
      "scorer": {
        "type": "labelScorer:df",
        "fields": {
          "type": "featureFields:simple",
          "fields": [
            "abstract$phrases"
          ]
        },
        "scope": {
          "type": "documents:reference",
          "use": "documents"
        }
      },
      "labels": {
        "type": "labels:fromDocuments",
        "documents": {
          "type": "documents:reference",
          "use": "documents"
        },
        "maxLabels": {
          "type": "labelCount:fixed",
          "value": 10
        },
        "labelAggregator": {
          "type": "labelAggregator:topWeight",
          "labelCollector": {
            "type": "labelCollector:topFromFeatureFields",
            "fields": {
              "type": "featureFields:simple",
              "fields": [
                "title$phrases"
              ]
            }
          }
        }
      }
    }
  },
  "output": {
    "stages": [
      "tf"
    ]
  }
}

The result of the above request, run against the reference arXiv index:

{
  "result" : {
    "tf" : {
      "labels" : [
        {
          "label" : "single-photon",
          "weight" : 1172.25
        },
        {
          "label" : "entangled photons",
          "weight" : 356.364
        },
        {
          "label" : "photon-photon",
          "weight" : 525.168
        },
        {
          "label" : "two-photon",
          "weight" : 759.618
        },
        {
          "label" : "interference",
          "weight" : 403.254
        },
        {
          "label" : "collisions",
          "weight" : 403.254
        },
        {
          "label" : "detector",
          "weight" : 637.704
        },
        {
          "label" : "entanglement",
          "weight" : 487.656
        },
        {
          "label" : "dark photon",
          "weight" : 187.56
        }
      ]
    }
  }
}
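
If you consume such a response for display or filtering, the labels can be ranked and thresholded on the client side. The following is a minimal sketch, assuming the response has already been received as JSON text; the helper name labels_by_weight and the min_weight threshold are hypothetical and not part of the API. The embedded response is a trimmed copy of the result shown above.

```python
import json

# A trimmed copy of the response shown above (three of the nine labels).
response_text = """
{
  "result": {
    "tf": {
      "labels": [
        {"label": "single-photon", "weight": 1172.25},
        {"label": "entangled photons", "weight": 356.364},
        {"label": "two-photon", "weight": 759.618}
      ]
    }
  }
}
"""


def labels_by_weight(response, stage, min_weight=0.0):
    """Return (label, weight) pairs of one output stage, heaviest first.

    Hypothetical client-side helper; min_weight is an illustrative filter.
    """
    labels = response["result"][stage]["labels"]
    kept = [
        (entry["label"], entry["weight"])
        for entry in labels
        if entry["weight"] >= min_weight
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


ranked = labels_by_weight(json.loads(response_text), "tf", min_weight=500)
print(ranked)
```

The stage key passed to the helper ("tf" here) must match the stage name listed in the output section of the request.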

label​Scorer:​composite

Computes an aggregate score that is a geometric mean of values returned by the scorers you provide.

{
  "type": "labelScorer:composite",
  "scorers": []
}

scorers

Type
array of labelScorer
Default
[]
Required
no

An array of scorers from which an aggregate should be computed.
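
To make the aggregation concrete, here is a minimal sketch of a geometric mean over the scores that individual scorers might return for one label. The function name composite_score is hypothetical, and raising an error for an empty list is an assumption of this sketch: the reference above does not say how an empty scorers array is handled.

```python
import math


def composite_score(scores):
    """Geometric mean of the scores produced by the individual scorers."""
    if not scores:
        # Assumption of this sketch; the documented behavior for an
        # empty scorers array is not specified.
        raise ValueError("at least one score is required")
    return math.prod(scores) ** (1.0 / len(scores))


# Two scorers returned 4.0 and 9.0 for the same label.
combined = composite_score([4.0, 9.0])
print(combined)
```

With a geometric mean, a label needs a reasonably high score from every scorer to rank well; a single very low score pulls the aggregate down more strongly than it would in an arithmetic mean.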

label​Scorer:​df

Computes the label's document frequency (DF), that is, the number of documents in which the label occurs. Unlike term frequency, multiple occurrences within a single document count as one.

{
  "type": "labelScorer:df",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "scope": {
    "type": "documents:reference",
    "auto": true
  }
}

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

A reference to the feature fields from which to retrieve label statistics.

scope

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

A reference to the set of documents over which to compute label statistics.
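
The difference between document frequency and term frequency can be illustrated with a small sketch. It assumes each document in scope is modeled as a plain list of the labels found in its fields; this in-memory representation and both function names are hypothetical, since the actual scorer reads these statistics from the index.

```python
def document_frequency(label, documents):
    """Number of documents in which the label occurs at least once."""
    return sum(1 for labels in documents if label in labels)


def term_frequency(label, documents):
    """Total number of occurrences of the label across all documents."""
    return sum(labels.count(label) for labels in documents)


# Three toy documents, each reduced to the labels found in its fields.
documents = [
    ["photon", "detector", "photon"],
    ["detector"],
    ["photon"],
]

df = document_frequency("photon", documents)
tf = term_frequency("photon", documents)
print(df, tf)
```

Here "photon" occurs three times in total but in only two documents, so its term frequency is 3 while its document frequency is 2.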

label​Scorer:​identity

An identity scorer: it returns the externally provided score of the label unchanged.

{
  "type": "labelScorer:identity"
}

label​Scorer:​idf

Computes the inverse document frequency (IDF) for each label.

{
  "type": "labelScorer:idf",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "scope": {
    "type": "documents:reference",
    "auto": true
  }
}

The inverse document frequency is computed using the following formula:

$$\log \frac{N + 1}{\min(N, \mathrm{df}(\mathit{label})) + 1}$$

where N is the total number of unique documents in scope and df(label) is the number of documents with occurrences of the label.
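
As a worked illustration, the sketch below evaluates the formula read as log((N + 1) / (min(N, df(label)) + 1)). The function name idf is hypothetical, and the use of the natural logarithm is an assumption, because the base of the logarithm is not stated above.

```python
import math


def idf(n_documents, df):
    """Inverse document frequency: log((N + 1) / (min(N, df) + 1)).

    The natural logarithm is an assumption of this sketch.
    """
    return math.log((n_documents + 1) / (min(n_documents, df) + 1))


rare = idf(99, 9)        # label occurs in 9 of 99 documents
everywhere = idf(99, 99)  # label occurs in every document
print(rare, everywhere)
```

A label that occurs in every document in scope scores 0, and the score grows as the label becomes rarer. The min(N, df(label)) term caps the document frequency at the scope size, so the score never becomes negative.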

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

A reference to the feature fields from which to retrieve label statistics.

scope

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

A reference to the set of documents over which to compute label statistics.

label​Scorer:​probability​Ratio

Computes the label's probability ratio coefficient, which you can use to identify labels that are more likely to appear in a subset of documents of your choice than in the whole collection.

{
  "type": "labelScorer:probabilityRatio",
  "baseScope": {
    "type": "documents:reference",
    "auto": true
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "referenceScope": {
    "type": "documents:sample",
    "limit": 10000,
    "query": {
      "type": "query:all"
    },
    "randomSeed": 0,
    "samplingRatio": 1
  }
}

The probability ratio coefficient is the ratio of the label's probability of occurrence in the baseScope to its probability of occurrence in the referenceScope. Labels that are more probable in the baseScope than in the referenceScope receive higher scores:

$$\mathit{ratio} = \frac{P(\mathit{label}, \text{base scope})}{P(\mathit{label}, \text{reference scope})}$$

The probability P for each scope is computed using the following formula, a form of additive probability smoothing, where N is the number of unique labels being scored:

$$P(\mathit{label}, \mathit{scope}) = \frac{\mathrm{tf}(\mathit{label}) + 1 + N}{\sum_{i \in \text{scope labels}} \mathrm{tf}(i) + 1 + N}$$
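
The sketch below works through the ratio on toy numbers. It rests on one loudly stated assumption: the probability formula is read as (tf(label) + 1 + N) / (sum of tf(i) over the scope's labels + 1 + N), with the sum covering only the tf(i) terms. If the intended grouping differs, only smoothed_probability needs to change. All function names and the per-scope term frequency dictionaries are hypothetical; the actual scorer takes these statistics from the index.

```python
def smoothed_probability(label, tf_in_scope, n_labels):
    """P(label, scope) under the assumed reading of the formula.

    tf_in_scope maps each scored label to its term frequency in the scope;
    n_labels is N, the number of unique labels being scored.
    """
    total_tf = sum(tf_in_scope.values())
    return (tf_in_scope.get(label, 0) + 1 + n_labels) / (total_tf + 1 + n_labels)


def probability_ratio(label, base_tf, reference_tf, n_labels):
    """Ratio of the label's probability in the base scope to the reference scope."""
    return (smoothed_probability(label, base_tf, n_labels)
            / smoothed_probability(label, reference_tf, n_labels))


# Toy term frequencies for two scored labels (N = 2).
base_tf = {"dark photon": 7, "detector": 3}
reference_tf = {"dark photon": 7, "detector": 93}

dark_photon = probability_ratio("dark photon", base_tf, reference_tf, 2)
detector = probability_ratio("detector", base_tf, reference_tf, 2)
print(dark_photon, detector)
```

In this toy data "dark photon" makes up most of the occurrences in the base scope but only a small share of the reference scope, so its ratio is above 1, while "detector" is relatively more common in the reference scope and scores below 1.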

base​Scope

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

A reference to a set of documents providing the "focus" document scope.

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

A reference to the feature fields from which to retrieve label statistics.

reference​Scope

Type
documents
Default
{
  "type": "documents:sample",
  "limit": 10000,
  "query": {
    "type": "query:all"
  },
  "randomSeed": 0,
  "samplingRatio": 1
}
Required
no

A reference to a set of documents providing the "reference" scope.

label​Scorer:​tf

Computes the term frequency (TF) for each label. The term frequency is the total number of times a label occurs in the scope, counting repeated occurrences within a single document's fields.

{
  "type": "labelScorer:tf",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "scope": {
    "type": "documents:reference",
    "auto": true
  }
}

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

A reference to the feature fields from which to retrieve label statistics.

scope

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

A reference to the set of documents over which to compute label statistics.

Consumers of label​Scorer:​*

The following stages and components take label​Scorer:​* as input:

Stage or component       Property
labelScorer:composite    scorers
labels:scored            scorer