labelFilter
A labelFilter:* component accepts or rejects an individual label based on the label's text. You can use label filters to customize the processing inside label collectors as well as label fetching stages, such as labels:byPrefix or labels:embeddingNearestNeighbors.
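As a minimal sketch of where a label filter plugs in, the fragment below sets the labelFilter property of a label collector, mirroring the full requests later in this section. The specific filter and its thresholds are chosen only for illustration.

```json
{
  "type": "labelCollector:topFromFeatureFields",
  "labelFilter": {
    "type": "labelFilter:tokenCount",
    "minTokens": 2,
    "maxTokens": 5
  }
}
```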
You can use the following label filters in your analysis requests:
- labelFilter:acceptAll
  Accepts all labels.
- labelFilter:acceptLabels
  Accepts labels appearing in the set of labels provided by the referenced labels:* source.
- labelFilter:autoStopLabels
  Rejects any meaningless labels Lingo4G automatically identified during indexing.
- labelFilter:characterCount
  Accepts or rejects labels based on the number of characters they have.
- labelFilter:complement
  Rejects labels that are accepted by another filter (inverts the result of another label filter).
- labelFilter:composite
  Accepts labels if they are accepted by all or any of the label filters you provide.
- labelFilter:dictionary
  Rejects labels matched by any of the label dictionaries you provide.
- labelFilter:hasEmbedding
  Accepts labels that have a multidimensional embedding vector available.
- labelFilter:rejectLabels
  Rejects labels appearing in the set of labels provided by the referenced labels:* source.
- labelFilter:surface
  Accepts labels based on their exact appearance (case-sensitive characters). You can configure this filter to remove all-uppercase, capitalized or acronym-like labels.
- labelFilter:switch
  Enables or disables the label filter you provide.
- labelFilter:tokenCount
  Accepts labels based on the number of tokens (words) in the label.
- labelFilter:reference
  References a labelFilter:* component defined in the request or in the project's default components.
label​Filter:​accept​All
Accepts all labels.
{
"type": "labelFilter:acceptAll"
}
label​Filter:​accept​Labels
Accepts labels appearing in the set of labels provided by the referenced labels:* source.
{
"type": "labelFilter:acceptLabels",
"labels": {
"type": "labels:reference",
"auto": true
}
}
The example request below retrieves a list of top labels that appear in documents matching the query magnetic field, limited to those labels that also appear in documents matching the query pulsar. In effect, the result of this request is an intersection of the label sets of these two document queries.
{
"name": "Retrieve document labels from documents matching 'magnetic field' but only appearing in documents matching 'pulsar'.",
"stages": {
"labels": {
"type": "labels:fromDocuments",
"maxLabels": {
"type": "labelCount:fixed",
"value": 10
},
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "\"magnetic field\""
},
"limit": "unlimited"
},
"labelAggregator": {
"type": "labelAggregator:topWeight",
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"labelFilter": {
"type": "labelFilter:acceptLabels",
"labels": {
"type": "labels:fromDocuments",
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "pulsar"
},
"limit": "unlimited"
}
}
}
}
}
}
}
}
The result of the above request on the reference Arxiv project is shown below. Compare it to a similar request using the rejectLabels component.
{
"result" : {
"labels" : {
"labels" : [
{
"label" : "quantum",
"weight" : 6804.0
},
{
"label" : "electron",
"weight" : 4625.0
},
{
"label" : "star",
"weight" : 4096.0
},
{
"label" : "solar",
"weight" : 3070.0
},
{
"label" : "plasma",
"weight" : 2651.0
},
{
"label" : "Hall",
"weight" : 2350.0
},
{
"label" : "jet",
"weight" : 2252.0
},
{
"label" : "equation",
"weight" : 2044.0
},
{
"label" : "magnetization",
"weight" : 1889.0
},
{
"label" : "superconducting",
"weight" : 1889.0
}
]
}
}
}
labels
A reference source of labels to accept.
label​Filter:​auto​Stop​Labels
Rejects any meaningless labels Lingo4G automatically identified during indexing.
{
"type": "labelFilter:autoStopLabels",
"minCoverage": 0.4,
"removalStrength": 0.35
}
minCoverage
Rejects labels that appear in the stop labels list and have coverage lower than minCoverage.
removalStrength
Rejects labels that appear in the stop labels list and have a score lower than removalStrength.
label​Filter:​character​Count
Accepts or rejects labels based on the number of characters they have.
{
"type": "labelFilter:characterCount",
"minCharacters": 4,
"minCharactersAveragePerToken": 2.9
}
minCharacters
Rejects labels that have fewer than minCharacters Java characters (Unicode surrogate pairs count as two characters).
minCharactersAveragePerToken
Rejects labels where (label characters) / (word count) is smaller than minCharactersAveragePerToken. You can use this option to prune automatically discovered labels with very short tokens (for example, repeated MathML expressions).
label​Filter:​complement
Rejects labels that are accepted by another filter (inverts the result of another label filter).
{
"type": "labelFilter:complement",
"labelFilter": null
}
label​Filter
The label filter whose results should be negated.
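As an illustrative sketch, the fragment below wraps a labelFilter:tokenCount filter in labelFilter:complement. Assuming the nested filter accepts labels of two to five words, the complement should accept everything else: single-word labels and labels longer than five words.

```json
{
  "type": "labelFilter:complement",
  "labelFilter": {
    "type": "labelFilter:tokenCount",
    "minTokens": 2,
    "maxTokens": 5
  }
}
```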
label​Filter:​composite
Accepts labels if they are accepted by all or any of the label filters you provide.
{
"type": "labelFilter:composite",
"labelFilters": {},
"operator": "AND"
}
The keys of nested composite label filters are used for information purposes only. For example, this request retrieves labels from documents matching the query magnetic field, but limits the labels to phrases of two to five words that are not capitalized, all-uppercase or acronym-like and do not contain any of the query terms.
{
"name": "Retrieve document labels from documents matching 'magnetic field' with composite filtering criteria.",
"stages": {
"labels": {
"type": "labels:fromDocuments",
"maxLabels": {
"type": "labelCount:fixed",
"value": 10
},
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "\"magnetic field\""
},
"limit": "unlimited"
},
"labelAggregator": {
"type": "labelAggregator:topWeight",
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"labelFilter": {
"type": "labelFilter:composite",
"labelFilters": {
"phrases-only": {
"type": "labelFilter:tokenCount",
"minTokens": 2,
"maxTokens": 5
},
"acronym-filters": {
"type": "labelFilter:surface",
"removeCapitalized": true,
"removeUppercase": true,
"removeAcronyms": true
},
"avoid-query-terms": {
"type": "labelFilter:dictionary",
"exclude": [
{
"type": "dictionary:glob",
"entries": [
"* magnetic *",
"* field *"
]
}
]
}
}
}
}
}
}
}
}
The result of the above request on the reference Arxiv project is shown below.
{
"result" : {
"labels" : {
"labels" : [
{
"label" : "neutron star",
"weight" : 1259.0
},
{
"label" : "black hole",
"weight" : 1045.0
},
{
"label" : "quantum dot",
"weight" : 934.0
},
{
"label" : "cosmic rays",
"weight" : 839.0
},
{
"label" : "phase transition",
"weight" : 709.0
},
{
"label" : "ground state",
"weight" : 679.0
},
{
"label" : "solar wind",
"weight" : 582.0
},
{
"label" : "phase diagram",
"weight" : 532.0
},
{
"label" : "active region",
"weight" : 489.0
},
{
"label" : "angular momentum",
"weight" : 420.0
}
]
}
}
}
labelFilters
A set of other named labelFilter:* components.
operator
Declares how the label filters provided in labelFilters are combined. The operator property supports the following values:
- OR
  Creates a disjunction composite filter. A label is accepted if any of the nested filters accepts it.
- AND
  Creates a conjunction composite filter. A label is accepted only if all nested filters accept it.
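To make the difference between the two operators concrete, here is a hedged sketch of a disjunction. Going by the component summary above, a label should pass this filter if at least one nested filter accepts it, that is, if it is a phrase of two to five words or has an embedding vector. The keys multi-word and has-embedding are arbitrary names chosen for this illustration.

```json
{
  "type": "labelFilter:composite",
  "operator": "OR",
  "labelFilters": {
    "multi-word": {
      "type": "labelFilter:tokenCount",
      "minTokens": 2,
      "maxTokens": 5
    },
    "has-embedding": {
      "type": "labelFilter:hasEmbedding"
    }
  }
}
```

Switching the operator to AND would instead require both conditions to hold for a label to be accepted.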
label​Filter:​dictionary
Rejects labels that appear in any of the referenced dictionaries.
{
"type": "labelFilter:dictionary",
"exclude": []
}
exclude
An array of one or more dictionary:* components.
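For illustration, the fragment below reuses the dictionary:glob component from the composite filter example earlier in this section to reject labels matching a single pattern. The pattern itself is a made-up example; refer to the dictionary:glob documentation for the exact matching rules.

```json
{
  "type": "labelFilter:dictionary",
  "exclude": [
    {
      "type": "dictionary:glob",
      "entries": [
        "* pulsar *"
      ]
    }
  ]
}
```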
label​Filter:​has​Embedding
Accepts only those labels that have a multidimensional label embedding vector available.
{
"type": "labelFilter:hasEmbedding"
}
label​Filter:​reject​Labels
Rejects labels appearing in the set of labels provided by the referenced labels:* source.
{
"type": "labelFilter:rejectLabels",
"labels": {
"type": "labels:reference",
"auto": true
}
}
The example request below retrieves a list of top labels that appear in documents matching the query magnetic field, removing any labels that also appear in documents that match the query pulsar. In effect, the result of this request is the difference between the label sets of these two document queries.
{
"name": "Retrieve document labels from documents matching 'magnetic field' but not appearing in documents matching 'pulsar'.",
"stages": {
"labels": {
"type": "labels:fromDocuments",
"maxLabels": {
"type": "labelCount:fixed",
"value": 10
},
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "\"magnetic field\""
},
"limit": "unlimited"
},
"labelAggregator": {
"type": "labelAggregator:topWeight",
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"labelFilter": {
"type": "labelFilter:rejectLabels",
"labels": {
"type": "labels:fromDocuments",
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "pulsar"
},
"limit": "unlimited"
}
}
}
}
}
}
}
}
The result of the above request on the reference Arxiv project is shown below. Compare it to a similar request using the acceptLabels component.
{
"result" : {
"labels" : {
"labels" : [
{
"label" : "magnetic",
"weight" : 22570.0
},
{
"label" : "field",
"weight" : 21930.0
},
{
"label" : "magnetic field",
"weight" : 15999.0
},
{
"label" : "model",
"weight" : 5156.0
},
{
"label" : "spin",
"weight" : 5057.0
},
{
"label" : "state",
"weight" : 4698.0
},
{
"label" : "energy",
"weight" : 3598.0
},
{
"label" : "effects",
"weight" : 3571.0
},
{
"label" : "phase",
"weight" : 2832.0
},
{
"label" : "temperature",
"weight" : 2815.0
}
]
}
}
}
labels
A reference source of labels to reject.
label​Filter:​surface
Accepts labels based on conditions applying to their exact surface appearance (case-sensitive characters). You can configure this filter to remove all-uppercase, capitalized or acronym-like labels.
{
"type": "labelFilter:surface",
"removeAcronyms": false,
"removeCapitalized": false,
"removeUppercase": false
}
removeAcronyms
If true, rejects labels that appear to be acronyms (more than one letter, capitalized to non-capitalized letter count ratio >= 0.5).
removeCapitalized
If true, rejects labels that are capitalized (first letter in uppercase, remaining letters in lowercase).
removeUppercase
If true, rejects labels with all-uppercase letters.
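As a hedged sketch, the configuration below rejects all-uppercase and acronym-like labels but keeps capitalized ones. Going by the rules described above, a label such as Hall (one uppercase and three lowercase letters, a ratio of about 0.33) would not count as an acronym and would pass, while an all-uppercase label such as MHD would be rejected. The label MHD is a made-up example, not taken from the reference results.

```json
{
  "type": "labelFilter:surface",
  "removeAcronyms": true,
  "removeCapitalized": false,
  "removeUppercase": true
}
```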
label​Filter:​switch
Enables or disables the label filter you provide.
{
"type": "labelFilter:switch",
"enabled": true,
"labelFilter": null
}
You can use this filter to dynamically activate or deactivate any label filter based on the value of a boolean variable passed to the enabled property. Without the labelFilter:switch component, the only way to deactivate a filter would be to remove it from the request.
The following request shows how to control any label filter using a variable value. This requires three elements:
- Defining a boolean variable, called enableLabelFiltering in our request.
- Wrapping the filter to control with a labelFilter:switch filter.
- Referencing the boolean variable in the enabled property of the labelFilter:switch filter.
{
"name": "Enabling / disabling a label filter using a variable.",
"variables": {
"enableLabelFiltering": {
"name": "Enable label filtering",
"value": false
}
},
"stages": {
"documents": {
"type": "documents:byQuery",
"query": {
"type": "query:string",
"query": "clustering"
},
"limit": 1000
},
"labels": {
"type": "labels:fromDocuments",
"labelAggregator": {
"type": "labelAggregator:topWeight",
"labelCollector": {
"type": "labelCollector:topFromFeatureFields",
"labelFilter": {
"type": "labelFilter:switch",
"enabled": {
"@var": "enableLabelFiltering"
},
"labelFilter": {
"type": "labelFilter:tokenCount",
"minTokens": 2,
"maxTokens": 5
}
}
}
}
}
}
}
With labelFilter:switch, you can dynamically control label filters without changing the structure of the request. Additionally, boolean variables are represented as check boxes in the request variable editor.
enabled
Enables or disables the label filter you provide. If true, applies the labelFilter to all input labels. If false, does not apply any filtering and accepts all input labels.
labelFilter
The label filter to control.
label​Filter:​token​Count
Accepts labels based on the number of tokens (words) in the label.
{
"type": "labelFilter:tokenCount",
"maxTokens": 8,
"minTokens": 3
}
maxTokens
Rejects labels that have more than maxTokens words.
minTokens
Rejects labels that have fewer than minTokens words.
Consumers of labelFilter:*
The following stages and components take labelFilter:* as input: