dictionary

Ad-hoc dictionary:​* components are used to filter labels. Such dictionaries are typically used for per-query filtering of junk labels or to narrow down the set of labels to a specific subset.

The following dictionary:* component types are available for use in analysis request JSONs:

dictionary:​all

Combines the entries of all dictionaries declared at the project level.

dictionary:​glob

Filters labels matching wildcard expressions (example: * eclipse).

dictionary:​project

Uses the referenced dictionary declared in the project descriptor.

dictionary:​query​Terms

Excludes terms extracted from a query (if possible).

dictionary:​regex

Filters labels matching any provided regular expression.


dictionary:​reference

References a dictionary:​* component defined in the request or in the project's default components.


dictionary:​all

Includes entries from all project-level dictionaries defined in the dictionaries section of the project descriptor.

{
  "type": "dictionary:all"
}

dictionary:​glob

A glob dictionary allows filtering labels using word-based wildcard matching.

{
  "type": "dictionary:glob",
  "entries": []
}

The primary use case of the glob matcher is case-insensitive matching of entire phrases, as well as "begins with…", "ends with…" or "contains…" rules. Glob matcher entries are fast to parse and very fast to apply.

The request below retrieves the top aggregated labels from documents matching the "electric field" query, but filters out any label containing the word electric as well as the exact label state:

{
  "name": "Labels from documents with glob dictionary filtering",
  "comment": "Retrieves labels that occur most frequently in the provided documents, with filtering applied.",
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "\"electric field\""
      },
      "limit": 500
    },
    "labels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:fixed",
        "value": 10
      },
      "labelAggregator": {
        "type": "labelAggregator:topWeight",
        "labelCollector": {
          "type": "labelCollector:topFromFeatureFields",
          "labelFilter": {
            "type": "labelFilter:dictionary",
            "exclude": [
              {
                "type": "dictionary:glob",
                "entries": [
                  "* electric *",
                  "state"
                ]
              }
            ]
          }
        }
      }
    }
  },
  "output": {
    "stages": [
      "labels"
    ]
  }
}

entries

Type: array of string
Default: []
Required: no

An array of strings, each representing a single glob matching rule.

See the project descriptor reference for syntax specification and examples of glob matching rules.
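
The rule shapes mentioned earlier (entire phrase, "begins with…", "ends with…" and "contains…") could be sketched as the fragment below. This is an illustration only: it assumes that * stands for zero or more words, which is consistent with the * electric * entry in the request above, and the specific words are made up. The project descriptor reference remains the authority on the syntax.

```json
{
  "type": "dictionary:glob",
  "entries": [
    "electric field",
    "electric *",
    "* field",
    "* electric *"
  ]
}
```

Under that assumption, the entries would match, in order: the exact label electric field, labels beginning with electric, labels ending with field, and labels containing electric anywhere.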

dictionary:​project

This dictionary is a reference to a dictionary declared at the project descriptor level.

{
  "type": "dictionary:project",
  "dictionary": null
}

Project dictionaries are compiled once, so if a dictionary's content does not change between requests, it makes sense to declare it at the project level and reference it from the request.

dictionary

Type: string
Default: null
Required: yes

The identifier of the referenced dictionary at the project descriptor level.
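
As a minimal sketch, a label filter that excludes labels using a project-level dictionary might look like the fragment below. The identifier common-junk is hypothetical; it stands for whatever name the dictionary has in your project descriptor.

```json
{
  "type": "labelFilter:dictionary",
  "exclude": [
    {
      "type": "dictionary:project",
      "dictionary": "common-junk"
    }
  ]
}
```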

dictionary:​query​Terms

Excludes individual terms extracted from a query. For example, the string query cats OR dogs would construct a dictionary filtering the terms cat and dog.

{
  "type": "dictionary:queryTerms",
  "query": null
}

This implementation works on a best-effort basis. It is not always possible to extract query terms from complex Lucene queries (or other query implementations). Also, the shape of the extracted terms may depend on the query analyzer pipeline (for example, stemming options).

query

Type: query
Default: null
Required: yes

The query to extract excluded terms from.
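
For illustration, the cats OR dogs example could be written as the fragment below. It reuses the query:string component shown in the glob request earlier on this page. Which terms are actually extracted depends on the query analyzer, so treat the outcome as best-effort.

```json
{
  "type": "dictionary:queryTerms",
  "query": {
    "type": "query:string",
    "query": "cats OR dogs"
  }
}
```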

dictionary:​regex

This dictionary type excludes any labels that match one or more regular expressions. It offers a more expressive syntax than the glob dictionary, but is expensive to parse and apply.

{
  "type": "dictionary:regex",
  "entries": []
}

Use the glob dictionary whenever possible and practical

Glob dictionaries are fast to parse and very fast to apply. Regular expressions are an order of magnitude slower and have to be applied to all label candidates, which may slow down processing significantly.

Each entry in the regular expression dictionary must be a valid Java Regular Expression pattern. If a label's string (as a whole) matches at least one of the patterns defined in the dictionary, it is marked as a positive match and filtered out.

entries

Type: array of string
Default: []
Required: no

An array of strings, each containing a regular expression. Note that double quotes and backslashes are special characters in JSON strings and must be escaped with a backslash.

See the project descriptor reference for examples of regular expression dictionary entries.
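
The fragment below is a hedged sketch of the escaping involved; the patterns themselves are made up for illustration. After JSON decoding, the three entries are the Java patterns \d+, (?i).*electric.* and "[^"]+".

```json
{
  "type": "dictionary:regex",
  "entries": [
    "\\d+",
    "(?i).*electric.*",
    "\"[^\"]+\""
  ]
}
```

Because a pattern has to match the label as a whole, the first entry would match only labels consisting entirely of digits, the second labels containing electric in any letter case, and the third labels wrapped in double quotes.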

Consumers of dictionary:​*

The following stages and components take dictionary:​* as input:

Stage or component       Property
labelFilter:dictionary   exclude
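
Since exclude takes an array, several dictionaries can be listed in one filter. The sketch below combines a glob dictionary with a query terms dictionary, reusing the entries and query from the examples on this page. It assumes that a label is excluded as soon as any of the listed dictionaries matches it, which this page does not state explicitly.

```json
{
  "type": "labelFilter:dictionary",
  "exclude": [
    {
      "type": "dictionary:glob",
      "entries": [
        "* electric *"
      ]
    },
    {
      "type": "dictionary:queryTerms",
      "query": {
        "type": "query:string",
        "query": "\"electric field\""
      }
    }
  ]
}
```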