labelFilter

label​Filter:​* accepts or rejects an individual label based on the label's text. You can use label filters to customize the processing inside label collectors as well as label fetching stages, such as labels:​by​Prefix or labels:​embedding​Nearest​Neighbors.

You can use the following label filters in your analysis requests:

label​Filter:​accept​All

Accepts all labels.

label​Filter:​accept​Labels

Accepts labels appearing in the set of labels provided by the referenced labels:​* source.

label​Filter:​auto​Stop​Labels

Rejects any meaningless labels Lingo4G automatically identified during indexing.

label​Filter:​character​Count

Accepts or rejects labels based on the number of characters they have.

label​Filter:​complement

Rejects labels that are accepted by another filter (inverts the result of another label filter).

label​Filter:​composite

Accepts labels if they are accepted by all the label filters you provide.

label​Filter:​dictionary

Rejects labels matched by any of the label dictionaries you provide.

label​Filter:​has​Embedding

Accepts labels that have a multidimensional embedding vector available.

label​Filter:​reject​Labels

Rejects labels appearing on the closed list you provide.

label​Filter:​surface

Accepts labels based on their exact appearance (case-sensitive characters). You can configure this filter to remove all-uppercase, capitalized or acronym-like labels.

label​Filter:​token​Count

Accepts labels based on the number of words in the label.


label​Filter:​reference

References a label​Filter:​* component defined in the request or in the project's default components.


label​Filter:​accept​All

Accepts all labels.

{
  "type": "labelFilter:acceptAll"
}

label​Filter:​accept​Labels

Accepts labels appearing in the set of labels provided by the referenced labels:​* source.

{
  "type": "labelFilter:acceptLabels",
  "labels": {
    "type": "labels:reference",
    "auto": true
  }
}

The example request below retrieves the top labels that appear in documents matching the query magnetic field, limited to those labels that also appear in documents matching the query pulsar. In effect, the result of this request is the intersection of the label sets of these two document queries.

{
  "name": "Retrieve document labels from documents matching 'magnetic field' but only appearing in documents matching 'pulsar'.",
  "stages": {
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 10
      },
      "documents": {
        "type": "documents:byQuery",
        "query": {
          "type": "query:string",
          "query": "\"magnetic field\""
        },
        "limit": "unlimited"
      },
      "labelAggregator": {
        "type": "labelAggregator:topWeight",
        "labelCollector": {
          "type": "labelCollector:topFromFeatureFields",
          "labelFilter": {
            "type": "labelFilter:acceptLabels",
            "labels": {
              "type": "labels:fromDocuments",
              "documents": {
                "type": "documents:byQuery",
                "query": {
                  "type": "query:string",
                  "query": "pulsar"
                },
                "limit": "unlimited"
              }
            }
          }
        }
      }
    }
  }
}

The result of the above request on the reference Arxiv project is shown below. Compare it to a similar request using the reject​Labels component.

{
  "result" : {
    "labels" : {
      "labels" : [
        {
          "label" : "quantum",
          "weight" : 6804.0
        },
        {
          "label" : "electron",
          "weight" : 4625.0
        },
        {
          "label" : "star",
          "weight" : 4096.0
        },
        {
          "label" : "solar",
          "weight" : 3070.0
        },
        {
          "label" : "plasma",
          "weight" : 2651.0
        },
        {
          "label" : "Hall",
          "weight" : 2350.0
        },
        {
          "label" : "jet",
          "weight" : 2252.0
        },
        {
          "label" : "equation",
          "weight" : 2044.0
        },
        {
          "label" : "magnetization",
          "weight" : 1889.0
        },
        {
          "label" : "superconducting",
          "weight" : 1889.0
        }
      ]
    }
  }
}

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

A reference source of labels to accept.

label​Filter:​auto​Stop​Labels

Rejects any meaningless labels Lingo4G automatically identified during indexing.

{
  "type": "labelFilter:autoStopLabels",
  "minCoverage": 0.4,
  "removalStrength": 0.35
}

min​Coverage

Type
number
Default
0.4
Constraints
value >= 0 and value <= 1
Required
no

Rejects labels that appear in the stop labels list and have coverage lower than minCoverage.

removal​Strength

Type
number
Default
0.35
Constraints
value >= 0 and value <= 1
Required
no

Rejects labels that appear in the stop labels list and have a score lower than removalStrength.

label​Filter:​character​Count

Accepts or rejects labels based on the number of characters they have.

{
  "type": "labelFilter:characterCount",
  "minCharacters": 4,
  "minCharactersAveragePerToken": 2.9
}

min​Characters

Type
number
Default
4
Constraints
value >= 0
Required
no

Rejects labels that have fewer than min​Characters Java characters (Unicode surrogate pairs count as two characters).

min​Characters​Average​Per​Token

Type
number
Default
2.9
Constraints
value >= 0
Required
no

Rejects labels for which (label characters) / (word count) is smaller than minCharactersAveragePerToken. Use this option to prune automatically discovered labels with very short tokens (for example, repeated MathML expressions).

label​Filter:​complement

Rejects labels that are accepted by another filter (inverts the result of another label filter).

{
  "type": "labelFilter:complement",
  "labelFilter": null
}

label​Filter

Type
labelFilter
Default
null
Required
yes

The label filter whose results should be negated.
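
As an illustration, the fragment below wraps a labelFilter:tokenCount filter in a complement. It is a sketch assembled only from the properties documented on this page and has not been run against a project. Since the nested filter accepts labels of 2 to 8 words, the complement should accept only labels outside that range, that is, single-word labels and labels longer than 8 words.

```json
{
  "type": "labelFilter:complement",
  "labelFilter": {
    "type": "labelFilter:tokenCount",
    "minTokens": 2,
    "maxTokens": 8
  }
}
```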

label​Filter:​composite

Combines several label filters. With the default AND operator, accepts labels only if they are accepted by all the label filters you provide.

{
  "type": "labelFilter:composite",
  "labelFilters": {},
  "operator": "AND"
}

The keys of nested composite label filters are used for information purposes only. For example, this request retrieves labels from documents matching the query magnetic field, but limits labels to multi-term phrases avoiding the use of any query terms.

{
  "name": "Retrieve document labels from documents matching 'magnetic field' with composite filtering criteria.",
  "stages": {
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 10
      },
      "documents": {
        "type": "documents:byQuery",
        "query": {
          "type": "query:string",
          "query": "\"magnetic field\""
        },
        "limit": "unlimited"
      },
      "labelAggregator": {
        "type": "labelAggregator:topWeight",
        "labelCollector": {
          "type": "labelCollector:topFromFeatureFields",
          "labelFilter": {
            "type": "labelFilter:composite",
            "labelFilters": {
              "phrases-only": {
                "type": "labelFilter:tokenCount",
                "minTokens": 2,
                "maxTokens": 5
              },
              "acronym-filters":  {
                "type": "labelFilter:surface",
                "removeCapitalized": true,
                "removeUppercase": true,
                "removeAcronyms": true
              },
              "avoid-query-terms": {
                "type": "labelFilter:dictionary",
                "exclude": [
                  {
                    "type": "dictionary:glob",
                    "entries": [
                      "* magnetic *",
                      "* field *"
                    ]
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}

The result of the above request on the reference Arxiv project is shown below.

{
  "result" : {
    "labels" : {
      "labels" : [
        {
          "label" : "neutron star",
          "weight" : 1259.0
        },
        {
          "label" : "black hole",
          "weight" : 1045.0
        },
        {
          "label" : "quantum dot",
          "weight" : 934.0
        },
        {
          "label" : "cosmic rays",
          "weight" : 839.0
        },
        {
          "label" : "phase transition",
          "weight" : 709.0
        },
        {
          "label" : "ground state",
          "weight" : 679.0
        },
        {
          "label" : "solar wind",
          "weight" : 582.0
        },
        {
          "label" : "phase diagram",
          "weight" : 532.0
        },
        {
          "label" : "active region",
          "weight" : 489.0
        },
        {
          "label" : "angular momentum",
          "weight" : 420.0
        }
      ]
    }
  }
}

label​Filters

Type
object of labelFilter
Default
{}
Required
no

A set of other named label​Filter:​* components.

operator

Type
string
Default
"AND"
Constraints
one of [OR, AND]
Required
no

Declares how the label filters from labelFilters are combined. The operator property supports the following values:

O​R

Creates a disjunction composite filter. A label is accepted if at least one of the nested filters accepts it.

A​N​D

Creates a conjunction composite filter. A label is accepted only if all nested filters accept it.
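
For comparison with the AND-based request shown earlier in this section, the fragment below sketches a composite filter that uses the OR operator. It combines only components documented on this page; the key names multi-word and has-embedding are arbitrary and chosen for illustration. Assuming OR accepts a label when at least one nested filter accepts it, this filter should let through labels that either consist of 2 to 8 words or have an embedding vector available.

```json
{
  "type": "labelFilter:composite",
  "operator": "OR",
  "labelFilters": {
    "multi-word": {
      "type": "labelFilter:tokenCount",
      "minTokens": 2,
      "maxTokens": 8
    },
    "has-embedding": {
      "type": "labelFilter:hasEmbedding"
    }
  }
}
```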

label​Filter:​dictionary

Rejects labels that appear in any of the referenced dictionaries.

{
  "type": "labelFilter:dictionary",
  "addAllProjectDictionaries": true,
  "exclude": []
}

add​All​Project​Dictionaries

Type
boolean
Default
true
Required
no

If true, all project dictionaries are automatically appended to the exclude property.

exclude

Type
array of dictionary
Default
[]
Required
no

An array of zero or more dictionary:* components. Labels matching any of these dictionaries are rejected.
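
The fragment below is a minimal sketch of a dictionary filter that skips the project dictionaries and rejects only labels matching request-specific glob entries. The dictionary:glob component and its entries property are taken from the composite filter example on this page; the patterns themselves are illustrative.

```json
{
  "type": "labelFilter:dictionary",
  "addAllProjectDictionaries": false,
  "exclude": [
    {
      "type": "dictionary:glob",
      "entries": [
        "* magnetic *",
        "* field *"
      ]
    }
  ]
}
```

Leaving addAllProjectDictionaries at its default of true would additionally apply all project dictionaries.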

label​Filter:​has​Embedding

Accepts only those labels that have a multidimensional label embedding vector available.

{
  "type": "labelFilter:hasEmbedding"
}

label​Filter:​reject​Labels

Rejects labels appearing in the set of labels provided by the referenced labels:​* source.

{
  "type": "labelFilter:rejectLabels",
  "labels": {
    "type": "labels:reference",
    "auto": true
  }
}

The example request below retrieves a list of top labels that appear in documents matching the query magnetic field, removing any labels that also appear in documents that match the query pulsar. In effect, the result of this request is a set difference: the labels of the first document query minus the labels of the second.

{
  "name": "Retrieve document labels from documents matching 'magnetic field' but not appearing in documents matching 'pulsar'.",
  "stages": {
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 10
      },
      "documents": {
        "type": "documents:byQuery",
        "query": {
          "type": "query:string",
          "query": "\"magnetic field\""
        },
        "limit": "unlimited"
      },
      "labelAggregator": {
        "type": "labelAggregator:topWeight",
        "labelCollector": {
          "type": "labelCollector:topFromFeatureFields",
          "labelFilter": {
            "type": "labelFilter:rejectLabels",
            "labels": {
              "type": "labels:fromDocuments",
              "documents": {
                "type": "documents:byQuery",
                "query": {
                  "type": "query:string",
                  "query": "pulsar"
                },
                "limit": "unlimited"
              }
            }
          }
        }
      }
    }
  }
}

The result of the above request on the reference Arxiv project is shown below. Compare it to a similar request using the accept​Labels component.

{
  "result" : {
    "labels" : {
      "labels" : [
        {
          "label" : "magnetic",
          "weight" : 22570.0
        },
        {
          "label" : "field",
          "weight" : 21930.0
        },
        {
          "label" : "magnetic field",
          "weight" : 15999.0
        },
        {
          "label" : "model",
          "weight" : 5156.0
        },
        {
          "label" : "spin",
          "weight" : 5057.0
        },
        {
          "label" : "state",
          "weight" : 4698.0
        },
        {
          "label" : "energy",
          "weight" : 3598.0
        },
        {
          "label" : "effects",
          "weight" : 3571.0
        },
        {
          "label" : "phase",
          "weight" : 2832.0
        },
        {
          "label" : "temperature",
          "weight" : 2815.0
        }
      ]
    }
  }
}

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

A reference source of labels to reject.

label​Filter:​surface

Accepts labels based on conditions applying to their exact surface appearance (case-sensitive characters). You can configure this filter to remove all-uppercase, capitalized or acronym-like labels.

{
  "type": "labelFilter:surface",
  "removeAcronyms": false,
  "removeCapitalized": false,
  "removeUppercase": false
}

remove​Acronyms

Type
boolean
Default
false
Required
no

If true, rejects labels that appear to be acronyms (more than one letter, capitalized to non-capitalized letter count ratio >= 0.5).

remove​Capitalized

Type
boolean
Default
false
Required
no

If true, rejects labels that are capitalized (first letter in uppercase, remaining letters in lowercase).

remove​Uppercase

Type
boolean
Default
false
Required
no

If true, rejects labels with all-uppercase letters.

label​Filter:​token​Count

Accepts labels based on the number of tokens (words) in the label.

{
  "type": "labelFilter:tokenCount",
  "maxTokens": 8,
  "minTokens": 3
}

max​Tokens

Type
integer
Default
8
Constraints
value >= 0
Required
no

Reject labels that have more than max​Tokens words.

min​Tokens

Type
integer
Default
3
Constraints
value >= 0
Required
no

Reject labels that have fewer than min​Tokens words.

Consumers of label​Filter:​*

The following stages and components take label​Filter:​* as input:

Stage or component                     Property
documents:rwmd                         labelFilter
labelCollector:topFromFeatureFields    labelFilter
labelFilter:complement                 labelFilter
labelFilter:composite                  labelFilters
labels:byPrefix                        labelFilter
labels:embeddingNearestNeighbors       labelFilter
labels:fromText                        labelFilter