labels

The labels:* stages provide various ways of producing lists of labels. You can display the labels directly or feed them as input to other stages, such as similarity matrix computation followed by clustering and 2D embedding.

You can use the following labels stages in your analysis requests:

labels:​by​Prefix

Returns labels containing at least one word that starts with the string prefix you provide.

labels:​composite

Returns a union or intersection of label lists you provide, aggregating their weights according to the provided criteria.

labels:​direct

Returns a list of labels whose text you provide directly.

labels:​embedding​Nearest​Neighbors

Selects labels that are most similar to the multidimensional vector you provide.

labels:​filtered

Applies the label filters of your choice to the list of labels you provide.

labels:​from​Documents

Collects labels occurring in the documents you provide.

labels:​from​Text

Extracts labels from the raw text you provide.

labels:​scored

Computes new weights for the labels you provide using the label scorer of your choice.

labels:​reference

References the results of another labels:​* stage defined in the request.
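
As a sketch of how labels:reference is typically wired, the request below defines a named labels:direct stage and points a second stage at it through the use property, following the pattern of the labels:filtered example later on this page. The stage names seedLabels and relatedLabels are hypothetical, and the request has not been run against an index.

```json
{
  "stages": {
    "seedLabels": {
      "type": "labels:direct",
      "labels": [
        { "label": "solar power" }
      ]
    },
    "relatedLabels": {
      "type": "labels:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:labelEmbedding",
        "labels": {
          "type": "labels:reference",
          "use": "seedLabels"
        }
      }
    }
  }
}
```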

labels:​by​Prefix

Returns labels containing at least one term starting with the string prefix you provide.

{
  "type": "labels:byPrefix",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "limit": 30,
  "prefix": ""
}

You can use this stage to return label suggestions drawn from a set of fields you provide. Typically, the result contains labels whose first word starts with the provided prefix. The suggestion engine may also return labels in which the matching word appears in the middle of the label.

For example, the following request fetches up to ten labels from the title field that contain a word starting with the prefix pha:

{
  "name": "Return most common labels in titles, starting with a prefix 'pha'.",
  "stages": {
    "labels": {
      "type": "labels:byPrefix",
      "prefix": "pha",
      "limit": 10,
      "fields": {
        "type": "featureFields:simple",
        "fields": [
          "title$phrases"
        ]
      }
    }
  }
}

The result of the above request, on the reference Arxiv index:

{
  "result" : {
    "labels" : {
      "labels" : [
        {
          "label" : "phase-field",
          "weight" : 223.0
        },
        {
          "label" : "phantom",
          "weight" : 158.0
        },
        {
          "label" : "phase-coherent",
          "weight" : 27.0
        },
        {
          "label" : "phase-locked",
          "weight" : 24.0
        },
        {
          "label" : "phase-resolved",
          "weight" : 21.0
        },
        {
          "label" : "phase-dependent",
          "weight" : 19.0
        },
        {
          "label" : "phase-matching",
          "weight" : 16.0
        },
        {
          "label" : "pharmaceutical",
          "weight" : 14.0
        },
        {
          "label" : "phase-flip",
          "weight" : 7.0
        },
        {
          "label" : "Pham",
          "weight" : 1.0
        }
      ]
    }
  }
}

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

One or more sources of labels (a featureFields:* component).

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

An optional labelFilter:* component used to filter out undesired labels.

limit

Type
integer
Default
30
Constraints
value >= 0
Required
no

The maximum number of labels to return.

prefix

Type
string
Default
<empty string>
Required
no

Case-insensitive prefix of at least one word contained in the label. The suggestion engine will favor labels starting with this prefix but may also return labels where the prefix is in the middle of the label.
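
The properties above can be combined in one request. The following sketch reuses the pha example and adds a labelFilter:rejectLabels filter to drop two suggestions by name. The stage name suggestions and the choice of rejected labels are illustrative assumptions, and how the filter matches labels (for example, with respect to letter case) is not covered on this page.

```json
{
  "stages": {
    "suggestions": {
      "type": "labels:byPrefix",
      "prefix": "pha",
      "limit": 5,
      "fields": {
        "type": "featureFields:simple",
        "fields": [
          "title$phrases"
        ]
      },
      "labelFilter": {
        "type": "labelFilter:rejectLabels",
        "labels": {
          "type": "labels:direct",
          "labels": [
            { "label": "phantom" },
            { "label": "Pham" }
          ]
        }
      }
    }
  }
}
```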

labels:​composite

Returns a union or intersection of label lists you provide, aggregating their weights according to the provided criteria.

{
  "type": "labels:composite",
  "operator": "OR",
  "sortOrder": "DESCENDING",
  "sources": [],
  "weightAggregation": "SUM"
}

operator

Type
string
Default
"OR"
Constraints
one of [OR, AND]
Required
no

Declares the way labels from sources are combined. The operator property supports the following values:

OR

Produces the union of all unique labels from all sources.

AND

Produces the intersection of all labels from all sources. A label must appear in all sources to appear in the output.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Controls the sort order of the output list of labels. Labels are sorted by their aggregated weight, in ascending or descending order.

See sort​Order in the documentation of common types for the list of possible values.

sources

Type
array of labels
Default
[]
Required
no

The list of labels:* components whose labels should be combined.

weight​Aggregation

Type
weightAggregation
Default
"SUM"
Required
no

Controls how weights are aggregated for labels that occur in more than one source, or more than once within a single source.

See weight​Aggregation in the documentation of common types for the list of possible values.
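
To make the interplay of operator and weightAggregation concrete, here is a minimal sketch that intersects two hand-written label lists. The stage name combined and the labels foo, bar and baz are placeholders, and the request has not been run against an index.

```json
{
  "stages": {
    "combined": {
      "type": "labels:composite",
      "operator": "AND",
      "weightAggregation": "SUM",
      "sortOrder": "DESCENDING",
      "sources": [
        {
          "type": "labels:direct",
          "labels": [
            { "label": "foo", "weight": 2 },
            { "label": "bar", "weight": 1 }
          ]
        },
        {
          "type": "labels:direct",
          "labels": [
            { "label": "foo", "weight": 3 },
            { "label": "baz", "weight": 1 }
          ]
        }
      ]
    }
  }
}
```

Going by the property descriptions above, foo is the only label present in both sources, so it should be the only label in the output, with its two weights summed. Switching operator to OR should return all three labels.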

labels:​direct

Returns a list of labels whose text you provide directly.

{
  "type": "labels:direct",
  "labels": []
}

labels

Type
array of objects
Default
[]
Required
no

An array of labels and their optional weights, for example:

"labels": [
  {
    "label": "foo",
    "weight": 2
  },
  {
    "label": "bar"
  }
]
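
A labels:direct component can be used wherever a labels:* input is accepted. As an illustration, the sketch below passes two hand-picked labels to labels:scored, reusing the labelScorer:df scorer from the labels:scored example on this page. The chosen labels and query are assumptions made for the example only.

```json
{
  "stages": {
    "labels": {
      "type": "labels:scored",
      "labels": {
        "type": "labels:direct",
        "labels": [
          { "label": "solar panels" },
          { "label": "renewable energy" }
        ]
      },
      "scorer": {
        "type": "labelScorer:df",
        "scope": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "photon"
          }
        }
      }
    }
  }
}
```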

labels:​embedding​Nearest​Neighbors

Selects labels that are most similar to the multidimensional embedding vector you provide.

{
  "type": "labels:embeddingNearestNeighbors",
  "failIfEmbeddingsNotAvailable": true,
  "labelFilter": {
    "type": "labelFilter:acceptAll"
  },
  "limit": 10,
  "vector": {
    "type": "vector:reference",
    "auto": true
  }
}

This stage requires label embeddings to be present in the index.

The following request finds the labels closest in the embedding space to an explicitly provided label. Such neighbors are typically synonymous or related labels:

{
  "stages": {
    "labels": {
      "type": "labels:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:labelEmbedding",
        "labels": {
          "type": "labels:direct",
          "labels": [
            {
              "label": "solar power"
            }
          ]
        }
      }
    }
  }
}

The result of the above request, on the reference Arxiv index:

{
  "result" : {
    "labels" : {
      "labels" : [
        {
          "label" : "solar power",
          "weight" : 1.0
        },
        {
          "label" : "solar panels",
          "weight" : 0.90619296
        },
        {
          "label" : "renewable energy sources",
          "weight" : 0.8924967
        },
        {
          "label" : "power demand",
          "weight" : 0.8894522
        },
        {
          "label" : "offshore",
          "weight" : 0.8869064
        },
        {
          "label" : "electrification",
          "weight" : 0.8851942
        },
        {
          "label" : "electricity generation",
          "weight" : 0.8846577
        },
        {
          "label" : "renewable energy",
          "weight" : 0.8820064
        },
        {
          "label" : "fossil fuels",
          "weight" : 0.88147855
        },
        {
          "label" : "power generation",
          "weight" : 0.88033044
        }
      ]
    }
  }
}

fail​If​Embeddings​Not​Available

Type
boolean
Default
true
Required
no

Determines the behavior of this stage if the index does not contain label embeddings.

If the index does not contain label embeddings and fail​If​Embeddings​Not​Available is:

true
The stage fails and logs an error.
false
The stage returns an empty list of labels.

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:acceptAll"
}
Required
no

An optional labelFilter:* component used to filter out undesired labels.

limit

Type
integer
Default
10
Constraints
value >= 0
Required
no

The maximum number of labels to return.

vector

Type
vector
Default
{
  "type": "vector:reference",
  "auto": true
}
Required
no

The source vector for which neighboring labels should be returned.
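
The following sketch varies the earlier solar power request by setting the optional properties explicitly: it asks for at most five neighbors and sets failIfEmbeddingsNotAvailable to false, so that an index without label embeddings yields an empty result instead of an error. The values are illustrative, not recommendations.

```json
{
  "stages": {
    "labels": {
      "type": "labels:embeddingNearestNeighbors",
      "limit": 5,
      "failIfEmbeddingsNotAvailable": false,
      "vector": {
        "type": "vector:labelEmbedding",
        "labels": {
          "type": "labels:direct",
          "labels": [
            { "label": "solar power" }
          ]
        }
      }
    }
  }
}
```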

labels:​filtered

Applies the label filters of your choice to the list of labels you provide.

{
  "type": "labels:filtered",
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labelListFilter": {
    "type": "labelListFilter:acceptAll"
  },
  "labels": {
    "type": "labels:reference",
    "auto": true
  }
}

In combination with the label​Filter:​accept​Labels and label​Filter:​reject​Labels filters, you can use this stage to compare two lists of labels.

The following request uses two different methods to extract labels from the same set of documents. The request uses the labels:​filtered stage to compare the methods by taking the intersection and the differences between label lists produced by the two methods.

{
  "name": "Comparing label lists",
  "comment": "Compares two methods of collecting labels from documents. Computes the intersection and asymmetric differences between the labels produced by the two methods.",
  "stages": {
    "topFrequencyLabels": {
      "comment": "Collects the top-frequency labels from each document",
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 50
      },
      "labelAggregator": {
        "type": "labelAggregator:topWeight",
        "labelCollector": {
          "type": "labelCollector:topFromFeatureFields"
        }
      }
    },
    "embeddingLabels": {
      "comment": "Collects labels whose vectors are most similar to the document vector.",
      "type": "labels:fromDocuments",
      "labelAggregator": {
        "type": "labelAggregator:topWeight",
        "labelCollector": {
          "type": "labelCollector:topFromFeatureFields",
          "labelWeighting": "EMBEDDING"
        }
      },
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 50
      }
    },
    "commonLabels": {
      "comment": "Labels returned by both methods.",
      "type": "labels:filtered",
      "labels": {
        "type": "labels:reference",
        "use": "topFrequencyLabels"
      },
      "labelFilter": {
        "type": "labelFilter:acceptLabels",
        "labels": {
          "type": "labels:reference",
          "use": "embeddingLabels"
        }
      }
    },
    "onlyInTopFrequencyLabels": {
      "comment": "Labels returned by the top-frequency method, but not by the embedding method.",
      "type": "labels:filtered",
      "labels": {
        "type": "labels:reference",
        "use": "topFrequencyLabels"
      },
      "labelFilter": {
        "type": "labelFilter:rejectLabels",
        "labels": {
          "type": "labels:reference",
          "use": "embeddingLabels"
        }
      }
    },
    "onlyInEmbeddingLabels": {
      "comment": "Labels returned by the embeddings method, but not by the top-frequency method.",
      "type": "labels:filtered",
      "labels": {
        "type": "labels:reference",
        "use": "embeddingLabels"
      },
      "labelFilter": {
        "type": "labelFilter:rejectLabels",
        "labels": {
          "type": "labels:reference",
          "use": "topFrequencyLabels"
        }
      }
    },
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    }
  },
  "output": {
    "stages": [
      "commonLabels",
      "topFrequencyLabels",
      "embeddingLabels",
      "onlyInTopFrequencyLabels",
      "onlyInEmbeddingLabels"
    ]
  }
}

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

The label filter to apply to the labels.

label​List​Filter

Type
labelListFilter
Default
{
  "type": "labelListFilter:acceptAll"
}
Required
no

The label list filter to apply to the labels.

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

The labels to filter.

labels:​from​Documents

Collects and aggregates labels occurring in the documents you provide, using the selected label aggregator and label count limits.

{
  "type": "labels:fromDocuments",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "labelAggregator": {
    "type": "labelAggregator:topWeight",
    "labelCollector": {
      "type": "labelCollector:topFromFeatureFields",
      "failIfEmbeddingsNotAvailable": true,
      "fields": {
        "type": "featureFields:reference",
        "auto": true
      },
      "labelFilter": {
        "type": "labelFilter:reference",
        "auto": true
      },
      "labelListFilter": {
        "type": "labelListFilter:truncatedPhrases"
      },
      "labelWeighting": "EMBEDDING",
      "minWeight": 0,
      "minWeightMass": 1,
      "tieResolution": "AUTO"
    },
    "maxLabelsPerDocument": 10,
    "maxRelativeDf": 1,
    "minAbsoluteDf": 1,
    "minRelativeDf": 0,
    "minWeight": 0,
    "outputWeightFormula": "TF",
    "threads": "auto",
    "tieResolution": "AUTO"
  },
  "maxLabels": {
    "type": "labelCount:fixed",
    "value": 10000
  }
}

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents:* component providing the documents from which labels should be collected. The labelAggregator property specifies which fields are used as label sources and how labels from these fields are aggregated.

label​Aggregator

Type
labelAggregator
Default
{
  "type": "labelAggregator:topWeight",
  "labelCollector": {
    "type": "labelCollector:topFromFeatureFields",
    "labelFilter": {
      "type": "labelFilter:reference",
      "auto": true
    },
    "labelListFilter": {
      "type": "labelListFilter:truncatedPhrases"
    },
    "fields": {
      "type": "featureFields:reference",
      "auto": true
    },
    "minWeight": 0,
    "minWeightMass": 1,
    "tieResolution": "AUTO",
    "labelWeighting": "EMBEDDING",
    "failIfEmbeddingsNotAvailable": true
  },
  "maxLabelsPerDocument": 10,
  "minAbsoluteDf": 1,
  "minRelativeDf": 0,
  "maxRelativeDf": 1,
  "minWeight": 0,
  "tieResolution": "AUTO",
  "outputWeightFormula": "TF",
  "threads": "auto"
}
Required
no

The label aggregator used to aggregate labels from input documents into the final list.

max​Labels

Type
labelCount
Default
{
  "type": "labelCount:fixed",
  "value": 10000
}
Required
no

The maximum number of labels to be returned after aggregation.
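
As a minimal sketch, the request below collects up to 20 labels from documents matching a query and overrides two aggregator settings. It relies on the default documents reference picking up the documents stage, as in the labels:filtered example on this page. The specific values (5, 2 and 20) are arbitrary choices for illustration; see the labelAggregator:topWeight documentation for the exact meaning of each threshold.

```json
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "clustering"
      }
    },
    "labels": {
      "type": "labels:fromDocuments",
      "labelAggregator": {
        "type": "labelAggregator:topWeight",
        "maxLabelsPerDocument": 5,
        "minAbsoluteDf": 2
      },
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 20
      }
    }
  }
}
```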

labels:​from​Text

Extracts labels from the raw text you provide.

{
  "type": "labels:fromText",
  "analyzer": "english",
  "featureExtractor": "",
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "text": ""
}

This stage can be used to retrieve labels that would be produced by the referenced feature extractor from a snippet of text, if it were indexed as a document. For example:

{
  "stages": {
    "labels": {
      "type": "labels:fromText",
      "analyzer": "english",
      "text": "Canonical quantization of the photon, a free massless vector field."
    }
  }
}

The result of the above request, on the reference Arxiv index:

{
  "result" : {
    "labels" : {
      "labels" : [
        {
          "label" : "canonical quantization",
          "weight" : 1.0
        },
        {
          "label" : "massless",
          "weight" : 1.0
        },
        {
          "label" : "quantization",
          "weight" : 1.0
        },
        {
          "label" : "vector field",
          "weight" : 1.0
        }
      ]
    }
  }
}

analyzer

Type
string
Default
"english"
Required
no

The analyzer pipeline to use when splitting the input text into words.

feature​Extractor

Type
string
Default
<empty string>
Required
no

The feature extractor to use.

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

An optional labelFilter:* component used to filter out undesired labels.

text

Type
string
Default
<empty string>
Required
no

The text to extract labels from.
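
The labelFilter property can narrow the extracted labels. The sketch below repeats the earlier extraction request and keeps only labels that also appear in a hand-written list, using the labelFilter:acceptLabels component mentioned in the labels:filtered section. The accepted labels are chosen for illustration, and the request has not been run against an index.

```json
{
  "stages": {
    "labels": {
      "type": "labels:fromText",
      "analyzer": "english",
      "text": "Canonical quantization of the photon, a free massless vector field.",
      "labelFilter": {
        "type": "labelFilter:acceptLabels",
        "labels": {
          "type": "labels:direct",
          "labels": [
            { "label": "quantization" },
            { "label": "vector field" }
          ]
        }
      }
    }
  }
}
```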

labels:​scored

Computes new weights for the labels you provide using the label scorer of your choice.

{
  "type": "labels:scored",
  "labels": {
    "type": "labels:reference",
    "auto": true
  },
  "scorer": {
    "type": "labelScorer:identity"
  }
}

You can use this stage to recompute the weights of labels retrieved from one source using statistics from another source. The following request takes labels occurring in documents matching the query photon and recomputes their weights as document frequencies counted within the set of documents matching the phrase query "solar power".

{
  "stages": {
    "labels": {
      "type": "labels:scored",
      "labels": {
        "type": "labels:fromDocuments",
        "documents": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "photon"
          }
        },
        "maxLabels": {
          "type": "labelCount:fixed",
          "value": 500
        }
      },
      "scorer": {
        "type": "labelScorer:df",
        "scope": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "\"solar power\""
          }
        }
      }
    }
  }
}

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

The source of labels to recompute weights for.

scorer

Type
labelScorer
Default
{
  "type": "labelScorer:identity"
}
Required
no

A labelScorer:* component used to recompute weights of the source labels.

Consumers of labels:​*

The following stages and components take labels:​* as input:

Stage or component                       Property
documents:rwmd                           labels
labelFilter:acceptLabels                 labels
labelFilter:rejectLabels                 labels
labels:composite                         sources
labels:filtered                          labels
labels:scored                            labels
matrix:cooccurrenceLabelSimilarity       labels
matrix:keywordLabelDocumentSimilarity    labels
query:forLabels                          labels
vector:labelEmbedding                    labels
vectors:precomputedLabelEmbeddings       labels