Dictionaries
The dictionaries section describes the static dictionaries you can reference at various stages of Lingo4G processing, for example to exclude labels from analysis.
The dictionaries property of the project descriptor must be an object. Keys of the object are unique dictionary identifiers; values specify the type and contents of the dictionary. Values must be objects of one of the following types:
- glob: A dictionary written using an intuitive, word-based syntax, possibly extended with wildcard patterns.
- regex: A dictionary written using regular expression patterns.
A typical dictionaries section is the following:
"dictionaries": {
  "common": {
    "type": "dictionary:glob",
    "files": [ "${l4g.project.dir}/resources/stoplabels.utf8.txt" ]
  },
  "domain-specific": {
    "type": "dictionary:glob",
    "entries": [
      "information about *",
      "overview of *"
    ]
  },
  "domain-specific-regex": {
    "type": "dictionary:regex",
    "entries": [
      "\\d+ mg"
    ]
  }
}
glob
This type of dictionary uses word-based matching expressions, with optional wildcards.
{
  "type": "glob",
  "entries": [],
  "files": []
}
Dictionaries of this type can match literal phrases, as well as "begins with…", "ends with…" or "contains…" types of expressions. Glob dictionary entries are very fast to parse and apply.
Syntax and matching rules
Matching rules for the glob dictionary have the following syntax.
- Each entry must consist of one or more space-separated tokens.
- A token is a sequence of arbitrary characters, such as words, numbers or identifiers.
- Matching is case-insensitive by default. Letter case normalization is performed based on the ROOT Java locale, which performs language-neutral case conflation according to Unicode rules.
- A token put in single or double quotes, for example "Rating***", is taken literally: matching is case-sensitive, and the * character inside quoted tokens is allowed and compared literally.
- To include quote characters in the token, escape them with the \ character, for example: \"information\".
- The following wildcard-matching tokens are recognized:
  - ? matches exactly one (any) word.
  - * matches zero or more words.
  - + matches one or more words. This token is functionally equivalent to: ? *.
  The * and + wildcards are possessive in the regular expression matching sense: they match the maximum sequence of tokens until the next token in the pattern. These wildcards will be suitable in most label matching scenarios. In rare cases, you may need to use the reluctant wildcards.
- The following reluctant wildcard-matching tokens are recognized:
  - *? matches zero or more words (reluctant).
  - +? matches one or more words (reluctant). This token is functionally equivalent to: ? *?.
  The reluctant wildcards match the minimal sequence of tokens until the next token in the pattern.
- The following restrictions apply to wildcard operators:
  - Wildcard characters (*, +) cannot be used to express prefixes or suffixes. For example, programm* is not supported.
  - Greedy operators are not supported.
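The rules above can be illustrated with a small Python sketch. It is not Lingo4G's implementation: the function names (parse_entry, matches_entry) are hypothetical, str.lower() stands in for the ROOT-locale case folding, and the wildcard handling is only one possible reading of the rules, in which a possessive wildcard commits to the last position where the next pattern token occurs, a reluctant wildcard commits to the first such position, and neither backtracks.

```python
WILDCARDS = {"?", "*", "+", "*?", "+?"}
QUOTES = "\"'"


def parse_entry(entry):
    """Split an entry into (kind, value) tokens; raise ValueError if illegal."""
    tokens = []
    for raw in entry.split():
        if raw in WILDCARDS:
            tokens.append(("wild", raw))
        elif len(raw) >= 2 and raw[0] in QUOTES and raw[-1] == raw[0]:
            # Quoted token: compared literally and case-sensitively.
            tokens.append(("exact", raw[1:-1]))
        elif raw[0] in QUOTES:
            raise ValueError(f"unbalanced quotes: {raw}")
        else:
            # Unquoted token: \" and \' stand for literal quote characters.
            word = raw.replace('\\"', '"').replace("\\'", "'")
            if "*" in word:
                raise ValueError(f"wildcard mixed with other characters: {raw}")
            tokens.append(("fold", word.lower()))
    if all(kind == "wild" for kind, _ in tokens):
        raise ValueError("at least one non-wildcard token is required")
    return tokens


def _token_matches(token, word):
    kind, value = token
    return word == value if kind == "exact" else word.lower() == value


def _match(pattern, pi, words, wi):
    if pi == len(pattern):
        return wi == len(words)  # the whole label must be consumed
    kind, value = pattern[pi]
    if kind != "wild":
        return (wi < len(words) and _token_matches(pattern[pi], words[wi])
                and _match(pattern, pi + 1, words, wi + 1))
    if value == "?":
        return wi < len(words) and _match(pattern, pi + 1, words, wi + 1)
    first = wi + (1 if value[0] == "+" else 0)  # + and +? need one word
    if pi + 1 == len(pattern):
        return first <= len(words)  # a trailing wildcard takes the rest
    nxt = pattern[pi + 1]
    # Positions where the wildcard may stop: the next token must fit there.
    stops = [j for j in range(first, len(words) + 1)
             if nxt[0] == "wild"
             or (j < len(words) and _token_matches(nxt, words[j]))]
    if not stops:
        return False
    # Assumption: reluctant takes the first stop, possessive the last one,
    # and neither revisits the choice afterwards.
    stop = stops[0] if value.endswith("?") else stops[-1]
    return _match(pattern, pi + 1, words, stop)


def matches_entry(entry, label):
    """Return True if the whole label matches the glob entry."""
    return _match(parse_entry(entry), 0, label.split(), 0)


print(matches_entry("more information *", "More information about cats"))  # True
print(matches_entry("* protein protein *", "a protein protein b"))         # False
print(matches_entry("*? protein protein *", "a protein protein b"))        # True
```

Under this reading, the possessive * in the second call runs up to the last occurrence of protein, so the second protein token has nothing left to match, while the reluctant *? in the third call stops at the first occurrence.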
Examples
The following table shows a number of matching pattern examples. The Non-matching strings column has an additional explanation why there is no match for a particular rule.
Entry | Matching strings | Non-matching strings |
---|---|---|
more information | | |
more information * | | |
* information * | | |
+ information | | |
"Information" * | | |
data ? | | |
"Programm*" | | |
\"information\" | | |
* protein protein * | This pattern will never match any input. The reason for this is that the possessive * matches the maximum sequence of tokens, which includes the first of the two protein tokens, so the pattern's two consecutive protein tokens can never both be matched. To match labels with a doubled occurrence of some word, use the reluctant variant of the wildcard. | |
*? protein protein * | | |
programm* | Illegal pattern: combinations of the * wildcard and other characters are not supported. | |
"information | Illegal pattern: unbalanced double quotes. | |
* | Illegal pattern: there must be at least one non-wildcard token. | |
entries
An array of pattern matching entries, provided directly. See the syntax overview section for syntax rules and examples. Double quotes and backslashes in patterns must be escaped to form valid JSON.
"dictionaries": {
  "glob-inline": {
    "type": "glob",
    "entries": [
      "information about *",
      "\"Overview\""
    ]
  }
}
files
An array of strings with file names containing matching patterns. These external files must adhere to the following rules:
- plain-text, UTF-8 encoded,
- one matching pattern per line,
- lines starting with # are treated as comments and ignored.
There is no need to escape the double quote and backslash characters in dictionary files. An example glob dictionary file may look like this:
# Common stop labels
information *
overview of *
* awards
# Domain-specific entries
supplementary table *
subject group
A file-based dictionary declaration reading such a file may look like this:
"dictionaries": {
  "glob": {
    "type": "glob",
    "files": [
      "${l4g.project.dir}/resources/stoplabels.utf8.txt"
    ]
  }
}
If multiple dictionary files or both files and inline entries are provided, the resulting dictionary will contain the union of patterns from all sources.
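The file rules and the union behavior can be sketched in Python as follows. This is an illustration only, not Lingo4G code: the helper names (read_dictionary_file, load_patterns) are hypothetical, and skipping blank lines and dropping duplicate patterns are assumptions the text above does not state.

```python
from pathlib import Path


def read_dictionary_file(path):
    """Read one pattern per line from a UTF-8 file, ignoring # comments."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Assumption: blank lines carry no pattern and are skipped as well.
    return [line for line in lines if line.strip() and not line.startswith("#")]


def load_patterns(files=(), entries=()):
    """Return the union of patterns from all files and inline entries."""
    patterns = [p for path in files for p in read_dictionary_file(path)]
    patterns.extend(entries)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(patterns))
```

Calling load_patterns with a list of file paths and a list of inline entries returns every distinct pattern once, in the order in which it was first seen.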
regex
A dictionary backed by a set of regular expression patterns. This type of dictionary offers more expressive syntax but is expensive to parse and apply.
{
  "type": "regex",
  "entries": [],
  "files": []
}
Dictionaries of the glob type are fast to parse and very fast to apply. This should be the preferred type of dictionary to use. Regular expressions should be reserved for entries that are impossible to express in the glob dictionary syntax.
Each entry in the regular expression dictionary must be a valid regular expression as accepted by the Java Pattern class. A value is marked as a positive match if the entire value matches at least one of the patterns defined in the dictionary.
Examples
The following table shows a few example valid regular expressions. The Non-matching strings column contains a brief reason why the value does not match the expression.
Entry | Matching strings | Non-matching strings |
---|---|---|
more information | | |
(?i)more information | | |
(?i)more information .* | | |
(?i)more information\b.* | | |
Year\b\d+ | | |
.*(low|high|top).* | | |
Regular expressions are very powerful, but it is easy to make unintentional mistakes. For instance, the intention of the last example in the table above may have been to match all strings containing the words low, high or top, but the pattern actually matches a much broader set of phrases: any string containing one of these character sequences, such as showstopper. For more predictable semantics and much faster matching, use the glob dictionary format.
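The whole-value matching rule and the pitfall described above can be tried out with a short Python sketch. Python's re module stands in for the Java Pattern class here, which is an assumption that holds for these simple patterns but not for every Java regular expression feature. The is_match helper is hypothetical, and the word-boundary variant in the last two lines is just one possible way to tighten the pattern.

```python
import re


def is_match(patterns, value):
    """True if the entire value matches at least one of the patterns."""
    return any(re.fullmatch(pattern, value) for pattern in patterns)


# The whole value must match, and matching is case-sensitive unless (?i) is used.
print(is_match(["more information"], "more information about"))               # False
print(is_match(["more information"], "More information"))                     # False
print(is_match([r"(?i)more information\b.*"], "More information about"))      # True

# The broad pattern matches substrings inside longer words.
print(is_match([r".*(low|high|top).*"], "showstopper"))                       # True

# One possible tightening: require word boundaries around the alternatives.
print(is_match([r".*\b(low|high|top)\b.*"], "showstopper"))                   # False
print(is_match([r".*\b(low|high|top)\b.*"], "top results"))                   # True
```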
entries
An array of entries with regular expressions. Double quotes and backslash characters that are part of the pattern must be escaped to form a valid JSON file.
"dictionaries": {
  "regex-inline": {
    "type": "regex",
    "entries": [
      "information about .*",
      "\"Overview\"",
      "overview of\\b.*"
    ]
  }
}
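If you generate inline entries programmatically, a JSON library can take care of the escaping. The following Python sketch, which is not part of Lingo4G, serializes the three patterns from the example above with the standard json module and checks that they survive a round trip unchanged. The variable names are illustrative.

```python
import json

# Patterns as they would appear in a plain-text dictionary file (unescaped).
patterns = ["information about .*", '"Overview"', r"overview of\b.*"]

fragment = {"regex-inline": {"type": "regex", "entries": patterns}}

# json.dumps escapes double quotes and backslashes inside each entry.
text = json.dumps(fragment, indent=2)
print(text)

# Parsing the JSON back yields the original, unescaped patterns.
assert json.loads(text)["regex-inline"]["entries"] == patterns
```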
files
An array of strings with file names containing regular expressions. These external files must adhere to the following rules:
- plain-text, UTF-8 encoded,
- one regular expression per line,
- lines starting with # are treated as comments and ignored.
There is no need to escape the double quote and backslash characters in dictionary files. An example regex dictionary file may look like this:
# Common stop labels
information about .*
"Overview"
overview of\b.*
A file-based dictionary declaration reading such a file may look like this:
"dictionaries": {
  "regex": {
    "type": "regex",
    "files": [
      "${l4g.project.dir}/resources/stoplabels.regex.txt"
    ]
  }
}
If multiple dictionary files or both files and inline entries are provided, the resulting dictionary will contain the union of patterns from all sources.