Dictionaries

The dictionaries section describes the static dictionaries you can reference at various stages of Lingo4G processing, for example to exclude labels from analysis.

The dictionaries property of the project descriptor must be an object. Keys of the object are unique dictionary identifiers; values specify the type and contents of the dictionary. Values must be objects of the following types:

glob

A dictionary written using intuitive, word-based syntax, possibly extended with wildcard patterns.

regex

A dictionary written using regular expression patterns.

A typical dictionaries section looks like this:

"dictionaries": {
  "common": {
    "type": "dictionary:glob",
    "files": [ "${l4g.project.dir}/resources/stoplabels.utf8.txt" ]
  },

  "domain-specific": {
    "type": "dictionary:glob",
    "entries": [
      "information about *",
      "overview of *"
    ]
  },

  "domain-specific-regex": {
    "type": "dictionary:regex",
    "entries": [
      "\\d+ mg"
    ]
  }
}

glob

This type of dictionary uses word-based matching expressions, with optional wildcards.

{
  "type": "glob",
  "entries": [],
  "files": []
}

Dictionaries of this type can match literal phrases, as well as "begins with…", "ends with…" or "contains…" types of expressions. Glob dictionary entries are very fast to parse and apply.

Syntax and matching rules

Matching rules for the glob dictionary have the following syntax.

  • Each entry must consist of one or more space-separated tokens.

  • A token is an arbitrary sequence of characters, such as a word, a number or an identifier.

  • Matching is case-insensitive by default. Letter case normalization is performed based on the ROOT Java locale, which performs language-neutral case conflation according to Unicode rules.

  • A token put in single or double quotes, for example "Rating***", is taken literally: matching is case-sensitive, and the * character inside quoted tokens is allowed and compared literally.

  • To include quote characters in the token, escape them with the \ character, for example: \"information\".

  • The following wildcard-matching tokens are recognized:

    • ? matches exactly one (any) word.

    • * matches zero or more words.

    • + matches one or more words. This token is functionally equivalent to: ? *.

    The * and + wildcards are possessive in the regular expression matching sense: they match the maximum sequence of tokens until the next token in the pattern. These wildcards will be suitable in most label matching scenarios. In rare cases, you may need to use the reluctant wildcards.

  • The following reluctant wildcard-matching tokens are recognized:

    • *? matches zero or more words (reluctant).

    • +? matches one or more words (reluctant). This token is functionally equivalent to: ? *?.

    The reluctant wildcards match the minimal sequence of tokens until the next token in the pattern.

  • The following restrictions apply to wildcard operators:

    • Wildcard characters (*, +) cannot be used to express prefixes or suffixes. For example, programm* is not supported.

    • Greedy operators are not supported.
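
The rules above can be sketched in a few lines of Python. This is an illustrative approximation only, not Lingo4G's implementation: the function names (parse, token_matches, matches) are hypothetical, Python's lower() stands in for the ROOT-locale case conflation, illegal entries are not rejected, and a wildcard is assumed to be followed by a literal token or to end the pattern.

```python
def parse(entry):
    """Turn a glob entry into a list of (kind, value) pattern tokens."""
    pattern = []
    for raw in entry.split():
        if raw == "?":
            pattern.append(("any", None))               # exactly one word
        elif raw in ("*", "*?"):
            pattern.append(("star", raw.endswith("?")))  # value: True if reluctant
        elif raw in ("+", "+?"):
            pattern.append(("any", None))               # '+' is '? *', '+?' is '? *?'
            pattern.append(("star", raw.endswith("?")))
        elif len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
            pattern.append(("exact", raw[1:-1]))        # quoted: case-sensitive literal
        else:
            # Unquoted token: unescape \" and compare case-insensitively.
            pattern.append(("word", raw.replace('\\"', '"').lower()))
    return pattern


def token_matches(token, word):
    kind, value = token
    return word == value if kind == "exact" else word.lower() == value


def matches(entry, text):
    """True if the whole text (split on whitespace) matches the entry."""
    pattern, words = parse(entry), text.split()
    i = 0  # index of the next unmatched word
    for p, (kind, value) in enumerate(pattern):
        if kind == "any":
            if i >= len(words):
                return False
            i += 1
        elif kind == "star":
            if p + 1 == len(pattern):
                i = len(words)  # a trailing wildcard takes all remaining words
                continue
            following = pattern[p + 1]
            if following[0] in ("any", "star"):
                raise ValueError("sketch: a wildcard must precede a literal token")
            hits = [j for j in range(i, len(words))
                    if token_matches(following, words[j])]
            if not hits:
                return False
            # Reluctant: stop at the first occurrence of the following token.
            # Possessive: run up to its last occurrence, with no backtracking.
            i = hits[0] if value else hits[-1]
        else:
            if i >= len(words) or not token_matches((kind, value), words[i]):
                return False
            i += 1
    return i == len(words)
```

With this sketch, matches("+ information", "too much information") is true, while matches("+ information", "information") is false because + needs at least one word before the literal token.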

Examples

The following examples show a number of matching patterns. Each non-matching string is followed by an explanation of why the rule does not match it.

Entry: more information

  Matching strings:
    • More information
    • MORE INFORMATION

  Non-matching strings:
    • more informations
      'informations' does not match pattern token 'information'.
    • more information about
      Pattern does not contain wildcards, so only 2-word strings can match.
    • some more information
      Pattern does not contain wildcards, so only 2-word strings can match.

Entry: more information *

  Matching strings:
    • more information
    • More information about
    • More information about a

  Non-matching strings:
    • more informations
      'informations' does not match pattern token 'information'.
    • more informations about
      'informations' does not match pattern token 'information'.
    • some more information
      Pattern does not have wildcards at the beginning, so matching strings must start with 'more information'.

Entry: * information *

  Matching strings:
    • information
    • more information
    • information about
    • a lot more information on

  Non-matching strings:
    • informations
      'informations' does not match pattern token 'information'.
    • more informations about
      'informations' does not match pattern token 'information'.
    • some more informations
      'informations' does not match pattern token 'information'.

Entry: + information

  Matching strings:
    • too much information
    • more information

  Non-matching strings:
    • information
      The + wildcard requires at least one word before 'information'.
    • more information about
      'about' is an extra word not covered by the pattern.

Entry: "Information" *

  Matching strings:
    • Information
    • Information about
    • Information ABOUT

  Non-matching strings:
    • information
      The "Information" token is case-sensitive, so it does not match 'information'.
    • information about
      The "Information" token is case-sensitive, so it does not match 'information'.
    • Informations about
      'Informations' does not match pattern token "Information".

Entry: data ?

  Matching strings:
    • data mining

  Non-matching strings:
    • data
      The ? operator requires a word after "data".
    • data mining research
      The "research" token does not match the pattern.

Entry: "Programm*"

  Matching strings:
    • Programm*

  Non-matching strings:
    • Programmer
      The "Programm*" token is taken literally, so it matches only 'Programm*'.
    • Programming
      The "Programm*" token is taken literally, so it matches only 'Programm*'.

Entry: \"information\"

  Matching strings:
    • "information"
    • "INFOrmation"
      Escaped quotes are taken literally, so the match is case-insensitive.

  Non-matching strings:
    • information
      Escaped quotes not found in the string being matched.
    • "information
      Escaped quotes not found in the string being matched.

Entry: * protein protein *

  This pattern will never match any input.

  The reason for this is that * makes a possessive match, that is, it matches the maximum number of words until the next token in the pattern. Therefore, the first occurrence of the protein token in the pattern will correspond to the last occurrence of that word in the input label, leaving no content to match the second occurrence of protein in the pattern. As a result, no sequence can ever match this pattern.

  To match labels with a doubled occurrence of some word, use the reluctant variant of the wildcard.

Entry: *? protein protein *

  Matching strings:
    • protein protein
    • selective protein protein interaction
    • protein protein protein

  Non-matching strings:
    • protein
      Two occurrences of "protein" on input are required.
    • selective protein-protein interaction
      The "protein-protein" string counts as one token and therefore does not match the two-token protein protein part of the pattern.

Illegal entries:

    • programm*
      Illegal pattern, combinations of the * wildcard and other characters are not supported.
    • "information
      Illegal pattern, unbalanced double quotes.
    • *
      Illegal pattern, there must be at least one non-wildcard token.

entries

Type
array of string
Default
[]
Required
no

An array of pattern matching entries, provided directly. See the syntax overview section for syntax rules and examples. Double quotes and backslashes in patterns must be escaped to form valid JSON.

"dictionaries": {
  "glob-inline": {
    "type": "glob",
    "entries": [
      "information about *",
      "\"Overview\""
    ]
  }
}

files

Type
array of string
Default
[]
Required
no

An array of strings with file names containing matching patterns. These external files must adhere to the following rules:

  • plain-text, UTF-8 encoded,
  • one matching pattern per line,
  • lines starting with # are treated as comments and ignored.

There is no need to escape the double quote and backslash characters in dictionary files. An example glob dictionary file may look like this:

# Common stop labels
information *
overview of *
* awards

# Domain-specific entries
supplementary table *
subject group

A file-based dictionary declaration reading such a file may look like this:

"dictionaries": {
  "glob": {
    "type": "glob",
    "files": [
      "${l4g.project.dir}/resources/stoplabels.utf8.txt"
    ]
  }
}

If multiple dictionary files or both files and inline entries are provided, the resulting dictionary will contain the union of patterns from all sources.
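
The file rules and the union behavior described above can be sketched as follows. This is a minimal illustration, not Lingo4G's loader: read_patterns and load_dictionary are hypothetical names, and skipping blank lines and removing duplicate patterns are assumptions of the sketch.

```python
from pathlib import Path


def read_patterns(text):
    """Extract patterns from dictionary file content, one pattern per line."""
    patterns = []
    for line in text.splitlines():
        # Lines starting with '#' are comments. Skipping blank lines is an
        # assumption of this sketch; the rules above do not mention them.
        if line.strip() and not line.startswith("#"):
            patterns.append(line)
    return patterns


def load_dictionary(entries=(), files=()):
    """Return the union of inline entries and patterns from UTF-8 files."""
    patterns = list(entries)
    for name in files:
        patterns += read_patterns(Path(name).read_text(encoding="utf-8"))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(patterns))
```

Because no JSON is involved when reading files, the lines are used verbatim and quotes or backslashes need no escaping.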

regex

A dictionary backed by a set of regular expression patterns. This type of dictionary offers more expressive syntax but is expensive to parse and apply.

{
  "type": "regex",
  "entries": [],
  "files": []
}

Use the glob dictionary type whenever possible and practical

Dictionaries of the glob type are fast to parse and very fast to apply. This should be the preferred type of dictionary to use. Regular expressions should be reserved for entries that cannot be expressed in the glob dictionary syntax.

Each entry in the regular expression dictionary must be a valid expression for the Java Pattern class (java.util.regex.Pattern). If an entire input value matches at least one of the patterns defined in the dictionary, the value is marked as a positive match.
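
The whole-value matching rule can be illustrated with a short sketch. It uses Python's re.fullmatch as a stand-in for Java's Pattern; the two regex dialects differ in some details, so treat this only as an approximation for simple patterns. The is_positive_match name and the sample patterns are hypothetical.

```python
import re


def is_positive_match(value, patterns):
    """True if the entire value matches at least one of the patterns."""
    # fullmatch requires the pattern to cover the whole value, not a substring.
    return any(re.fullmatch(p, value) for p in patterns)


patterns = [r"more information", r"(?i)overview of\b.*"]

for value in ["more information", "more information about", "Overview of results"]:
    print(value, "->", is_positive_match(value, patterns))
```

Here "more information about" is not a positive match: the first pattern covers only a prefix of the value, and the second one does not apply.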

Examples

The following examples show a few valid regular expressions. Each non-matching string is followed by a brief reason why the value does not match the expression.

Entry: more information

  Matching strings:
    • more information

  Non-matching strings:
    • More information
      Matching is case-sensitive by default.
    • more information about
      The whole string must match the pattern.

Entry: (?i)more information

  Matching strings:
    • more information
    • More Information

  Non-matching strings:
    • more information about
      The whole string must match the pattern.

Entry: (?i)more information .*

  Matching strings:
    • more information about

  Non-matching strings:
    • more information
      A trailing space is required for a match.

Entry: (?i)more information\b.*

  Matching strings:
    • more information
    • more information about

  Non-matching strings:
    • some more information
      The pattern has no leading wildcard, so the string must start with 'more information'.

Entry: Year\b\s*\d+

  Matching strings:
    • Year 2000

  Non-matching strings:
    • Year
      At least one trailing digit is required.

Entry: .*(low|high|top).*

  Matching strings:
    • low coverage
    • nice yellow dress
    • top coder
    • without stopping
    • showstopper
      The pattern matches a substring without word boundaries.

  Non-matching strings:
    • Low coverage
      Matching is case-sensitive.

Regular expressions are very powerful, but it is easy to make unintentional mistakes. For instance, the intention of the last example above may have been to match all strings containing the words low, high or top, but the pattern actually matches a much broader set of phrases, such as showstopper. For more predictable semantics and much faster matching, use the glob dictionary format.
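
The pitfall can be reproduced with a short sketch, again using Python's re.fullmatch as an approximation of Java's whole-value matching. The bounded variant with \b word boundaries is not part of the original examples; it is shown only as one possible way to tighten the pattern when a regular expression is unavoidable.

```python
import re

# The pattern from the examples: matches the alternatives anywhere, even inside words.
loose = re.compile(r".*(low|high|top).*")

# Hypothetical tightened variant: \b requires word boundaries around the alternatives.
bounded = re.compile(r".*\b(low|high|top)\b.*")

samples = ["low coverage", "top coder", "nice yellow dress",
           "without stopping", "showstopper"]

for s in samples:
    print(f"{s!r}: loose={bool(loose.fullmatch(s))}, "
          f"bounded={bool(bounded.fullmatch(s))}")
```

The loose pattern accepts all five samples, whereas the bounded one accepts only the first two, where low and top appear as separate words.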

entries

Type
array of string
Default
[]
Required
no

An array of entries with regular expressions. Double quotes and backslash characters that are part of the pattern must be escaped to form a valid JSON file.

"dictionaries": {
  "regex-inline": {
    "type": "regex",
    "entries": [
      "information about .*",
      "\"Overview\"",
      "overview of\\b.*"
    ]
  }
}
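
When entries are generated by a script, a JSON library can take care of the escaping. The sketch below, which assumes nothing about Lingo4G beyond the descriptor fragment shown above, serializes raw patterns with Python's json module; the variable names are illustrative.

```python
import json

# Raw patterns, written exactly as they would appear in a dictionary file.
raw_patterns = [
    r'information about .*',
    r'"Overview"',
    r'overview of\b.*',
]

# json.dumps escapes double quotes and backslashes inside string values.
fragment = json.dumps({"type": "regex", "entries": raw_patterns}, indent=2)
print(fragment)
```

Parsing the fragment back with json.loads returns the original, unescaped patterns, which is a quick way to check that no escaping was lost or doubled.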

files

Type
array of string
Default
[]
Required
no

An array of strings with file names containing regular expressions. These external files must adhere to the following rules:

  • plain-text, UTF-8 encoded,
  • one regular expression per line,
  • lines starting with # are treated as comments and ignored.

There is no need to escape the double quote and backslash characters in dictionary files. An example regex dictionary file may look like this:

# Common stop labels
information about .*
"Overview"
overview of\b.*

A file-based dictionary declaration reading such a file may look like this:

"dictionaries": {
  "regex": {
    "type": "regex",
    "files": [
      "${l4g.project.dir}/resources/stoplabels.regex.txt"
    ]
  }
}

If multiple dictionary files or both files and inline entries are provided, the resulting dictionary will contain the union of patterns from all sources.