labelScorer
labelScorer:* components compute a numerical score for each label (such as document or term frequency). You can use label scores to output additional information about labels for display or filtering purposes, or to recompute the scores of a set of labels using information from different fields or document scopes.
You can use the following label scorers in your analysis requests:
- labelScorer:composite
  Aggregates the scores computed by the scorers you provide into a single score.
- labelScorer:df
  Computes the label's document frequency (DF).
- labelScorer:identity
  Returns the label's original weight.
- labelScorer:idf
  Computes the label's inverse document frequency (IDF).
- labelScorer:probabilityRatio
  Computes the probability ratio coefficient, which you can use to identify labels that are more likely to appear in a subset of documents of your choice than in the whole collection.
- labelScorer:tf
  Computes the label's term frequency (TF).
Here is an example where we use the labels:scored stage to recompute the weights of labels from the title field, using document frequencies retrieved from the abstract field of the reference arXiv data set.
The result of the above request, on the reference arXiv index:
labelScorer:composite
Computes an aggregate score: the geometric mean of the values returned by the scorers you provide.
{
  "type": "labelScorer:composite",
  "scorers": []
}
scorers
An array of scorers whose values are combined into the aggregate score.
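As a rough illustration of the aggregation described above, the following Python sketch computes a geometric mean and builds a composite scorer that combines the tf and idf scorers. The helper name `geometric_mean` is hypothetical, the sketch assumes non-negative scores, and the request fragment assumes that properties left out of a nested scorer fall back to the defaults shown in this section. How the actual scorer treats zero or missing values is not covered here.

```python
import json
import math


def geometric_mean(scores):
    """Return the geometric mean of a non-empty sequence of non-negative scores."""
    if not scores:
        raise ValueError("at least one score is required")
    return math.prod(scores) ** (1.0 / len(scores))


# Hypothetical request fragment: a composite of the tf and idf scorers.
# Assumption: omitted properties (fields, scope) use their default values.
composite_scorer = {
    "type": "labelScorer:composite",
    "scorers": [
        {"type": "labelScorer:tf"},
        {"type": "labelScorer:idf"},
    ],
}

print(json.dumps(composite_scorer, indent=2))
print(geometric_mean([4.0, 9.0]))
```

With two scorers, the geometric mean is the square root of the product of their values, so a label needs a reasonably high value from both scorers to receive a high aggregate score.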
labelScorer:df
Computes the label's document frequency (DF): the number of documents in which the label occurs. Unlike term frequency, multiple occurrences within a single document count as one.
{
  "type": "labelScorer:df",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "scope": {
    "type": "documents:reference",
    "auto": true
  }
}
fields
A reference to the feature fields from which to retrieve label statistics.
scope
A reference to the set of documents over which to compute label statistics.
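The difference between document frequency and term frequency (see labelScorer:tf) can be sketched in a few lines of Python. This is an illustration only: the function name `label_frequencies` and the toy documents are made up, and each document is simplified to a plain list of label occurrences rather than real feature fields.

```python
from collections import Counter


def label_frequencies(documents):
    """Return (tf, df) counters for documents given as lists of label occurrences."""
    tf, df = Counter(), Counter()
    for labels in documents:
        tf.update(labels)       # TF: every occurrence counts
        df.update(set(labels))  # DF: a document counts once per label
    return tf, df


# Toy scope of three documents (made-up data).
docs = [
    ["neural network", "neural network", "training"],
    ["training"],
    ["neural network"],
]
tf, df = label_frequencies(docs)
print("tf:", dict(tf))
print("df:", dict(df))
```

In this toy scope, "neural network" occurs three times but in only two documents, so its term frequency is 3 while its document frequency is 2.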
labelScorer:identity
An identity scorer: it returns the externally provided score of the label unchanged.
{
  "type": "labelScorer:identity"
}
labelScorer:idf
Computes the inverse document frequency (IDF) for each label.
{
  "type": "labelScorer:idf",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "scope": {
    "type": "documents:reference",
    "auto": true
  }
}
The inverse document frequency is computed from two quantities: the total number of unique documents in scope and the number of documents with occurrences of the label.
fields
A reference to the feature fields from which to retrieve label statistics.
scope
A reference to the set of documents over which to compute label statistics.
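The exact IDF formula is not reproduced in this section. As a sketch only, the Python snippet below assumes the common textbook variant, the natural logarithm of the total number of documents in scope divided by the label's document frequency. The scorer's actual formula may differ, for example in the logarithm base or in smoothing terms, and the function name `idf` is hypothetical.

```python
import math


def idf(total_documents, document_frequency):
    """Textbook IDF, ln(N / df). An assumed variant, not necessarily the scorer's formula."""
    if not 0 < document_frequency <= total_documents:
        raise ValueError("document_frequency must be in the range 1..total_documents")
    return math.log(total_documents / document_frequency)


# Rare labels score higher than common ones (made-up counts).
print(idf(1000, 10))
print(idf(1000, 500))
```

Under this assumption, a label that occurs in every document in scope gets a score of 0, and the score grows as the label becomes rarer.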
labelScorer:probabilityRatio
Computes the label's probability ratio coefficient, which you can use to identify labels that are more likely to appear in a subset of documents of your choice than in the whole collection.
{
  "type": "labelScorer:probabilityRatio",
  "baseScope": {
    "type": "documents:reference",
    "auto": true
  },
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "referenceScope": {
    "type": "documents:sample",
    "limit": 10000,
    "query": {
      "type": "query:all"
    },
    "randomSeed": 0,
    "samplingRatio": 1
  }
}
The probability ratio coefficient is the ratio of the label's probability of occurrence in the base scope to its probability of occurrence in the reference scope. Labels with a higher probability of occurrence in baseScope than in referenceScope receive higher scores.
The probability for each scope is computed with a form of additive probability smoothing that takes into account the number of unique labels being scored.
baseScope
A reference to the set of documents that provides the "focus" document scope.
fields
A reference to the feature fields from which to retrieve label statistics.
referenceScope
A reference to the set of documents that provides the "reference" scope.
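The following Python sketch illustrates the idea of a smoothed probability ratio. The exact smoothing formula is not reproduced in this section, so the sketch assumes a simple add-one form, (df + 1) / (N + V), where df is the label's document frequency in a scope, N is the number of documents in that scope, and V is the number of unique labels being scored. Both function names and all counts are hypothetical, and the scorer's actual formula may differ.

```python
def smoothed_probability(label_df, scope_size, label_count):
    """Add-one smoothed probability of a label in a scope (assumed form)."""
    return (label_df + 1) / (scope_size + label_count)


def probability_ratio(base_df, base_size, ref_df, ref_size, label_count):
    """Ratio of the label's smoothed probability in the base scope to the reference scope."""
    p_base = smoothed_probability(base_df, base_size, label_count)
    p_ref = smoothed_probability(ref_df, ref_size, label_count)
    return p_base / p_ref


# Made-up counts: 100 base documents, 10,000 reference documents, 100 labels scored.
focused = probability_ratio(30, 100, 50, 10_000, 100)    # frequent in base, rare overall
common = probability_ratio(30, 100, 3_000, 10_000, 100)  # equally frequent everywhere
print(focused, common)
```

Under these assumptions, a label concentrated in the base scope scores far higher than a label that is equally common in both scopes, which is the behavior the coefficient is meant to capture.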
labelScorer:tf
Computes the term frequency (TF) for each label. The term frequency is the number of times a given label occurs throughout the scope, including repeated occurrences within a single document's fields.
{
  "type": "labelScorer:tf",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "scope": {
    "type": "documents:reference",
    "auto": true
  }
}
fields
A reference to the feature fields from which to retrieve label statistics.
scope
A reference to the set of documents over which to compute label statistics.
Consumers of labelScorer:*
The following stages and components take labelScorer:* as input:
Stage or component | Property |
---|---|
labelScorer:composite | scorers |
labels:scored | scorer |