Tag dictionaries
Tag dictionaries categorize individual words based on their grammatical or semantic function, such as preposition, verb or proper name. You can reference the categories in the label dictionary to, for example, remove labels ending in a preposition, such as information about.
Built-in tag dictionaries
Lingo3G comes with built-in tag dictionaries for several languages. These dictionaries are precompiled into space- and lookup-efficient data structures, so you cannot modify them. However, you can use the custom tag dictionary to override the built-in category for specific words.
To turn off the built-in tag dictionaries, set the useBuiltInWordDatabaseForLabelFiltering parameter to false.
Custom tag dictionaries
Lingo3G reads the tag dictionary from a JSON file named language.tags.json, where language is the language name. The tag dictionary must be an array of objects with the following structure:
[
  // zero or more includes.
  {
    "include": "relative-path-to-include.json",
    "required": true
  },
  // one or more tags, with word entries.
  {
    "comment": "Optional comment.",
    "tag": "[tag]",
    "words": [
      "word1",
      "word2"
    ]
  }
]
The structure of a Lingo3G tag dictionary.
Each object in the array must be of one of the following types:
- include: Includes all entries from another tag dictionary into this dictionary. You can use includes to split a large tag dictionary into smaller ones or extract a common part of a set of tag dictionaries.
  { "include": "name.json", "required": true }
  - include: (required) Path to the tag dictionary to include, relative to the location of this dictionary.
  - required: If true and the file to include does not exist, Lingo3G will trigger an error. If false, Lingo3G will ignore a non-existing include file.
- word category set: Assigns a category tag to a set of words:
  { "tag": "...", "words": [ "a", "b", ... ] }
  - tag: The category tag or tags to assign for the words listed in the words array. Use one or more comma-separated categories from the following list:
    - fnc: function word (about, have)
    - verb: verb (have, allows)
    - noun: noun (website, test)
    - adj: adjective (cool)
    - adv: adverb (fully)
    - geo: geographical reference (London)
    - name: proper noun (John)
    - numeric: numeric values (10.4, 2021, 14:30, 3/4)
    - sep: phrase separator, such as e.g. or ie. Lingo3G removes phrase separators from processing and therefore will not allow them to appear in cluster labels at all.
  - words: The list of words to which to apply the tag. Lingo3G matches the words against the input text in a case-insensitive way and does not apply any further processing, such as grammatical form normalization. Therefore, include all alternative grammatical and spelling variants of a word if necessary, for example: naïve, naive, naïvely, naively.
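The entry types above are easy to get wrong by hand, so it can help to check a dictionary before deploying it. The following Python sketch is not part of Lingo3G: validate_tag_dictionary is a hypothetical helper that checks a parsed dictionary against the structure described in this section. It assumes that whitespace around the commas in a tag list is acceptable, which this section does not confirm.

```python
import json

# Category tags listed in this section.
KNOWN_TAGS = {"fnc", "verb", "noun", "adj", "adv", "geo", "name", "numeric", "sep"}


def validate_tag_dictionary(entries):
    """Return a list of problems found in a parsed tag dictionary (hypothetical helper)."""
    if not isinstance(entries, list):
        return ["top-level value must be an array of objects"]
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected an object")
        elif "include" in entry:
            if not isinstance(entry["include"], str):
                problems.append(f"entry {index}: 'include' must be a path string")
            if not isinstance(entry.get("required", True), bool):
                problems.append(f"entry {index}: 'required' must be true or false")
        elif "tag" in entry and "words" in entry:
            # Assumption: whitespace around commas in the tag list is tolerated.
            tags = [t.strip() for t in str(entry["tag"]).split(",")]
            unknown = [t for t in tags if t not in KNOWN_TAGS]
            if unknown:
                problems.append(f"entry {index}: unknown tag(s) {unknown}")
            words = entry["words"]
            if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
                problems.append(f"entry {index}: 'words' must be an array of strings")
        else:
            problems.append(f"entry {index}: expected 'include' or 'tag' plus 'words'")
    return problems


sample = json.loads("""
[
  { "include": "legal-words.json", "required": true },
  { "tag": "fnc,verb", "words": [ "have" ] },
  { "tag": "preposition", "words": [ "about" ] }
]
""")
for problem in validate_tag_dictionary(sample):
    print(problem)  # entry 2: unknown tag(s) ['preposition']
```

The check only covers structure and tag names. It does not verify that included files exist, because that depends on where the dictionary is stored.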
A real-world dictionary can look like this:
[
  { "include": "legal-words.json", "required": true },
  {
    "tag": "fnc",
    "words": [ "a", "about" ]
  },
  {
    "tag": "fnc,verb",
    "words": [ "have" ]
  },
  {
    "tag": "verb",
    "words": [ "go", "allow" ]
  },
  {
    "tag": "numeric",
    "words": [ "thousand" ]
  }
]
Tag dictionary example.
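To illustrate how includes and case-insensitive matching fit together, here is a minimal Python sketch that resolves a dictionary like the one above into a word-to-tags lookup. It does not reproduce Lingo3G's actual loader. The load_tags and tags_for helpers and the english.tags.json file name are hypothetical. The sketch also assumes that an omitted "required" property means false and that tags from repeated entries for the same word are merged, neither of which this section specifies.

```python
import json
import tempfile
from pathlib import Path


def load_tags(path, lookup=None):
    """Build a lower-cased word -> set-of-tags lookup from a tag dictionary file."""
    lookup = {} if lookup is None else lookup
    path = Path(path)
    for entry in json.loads(path.read_text(encoding="utf-8")):
        if "include" in entry:
            # Include paths are relative to the location of the including dictionary.
            included = path.parent / entry["include"]
            if included.exists():
                load_tags(included, lookup)
            elif entry.get("required", False):  # assumption: omitted means false
                raise FileNotFoundError(f"required include not found: {included}")
        else:
            tags = {t.strip() for t in entry["tag"].split(",")}
            for word in entry["words"]:
                # Assumption: tags from repeated entries are merged.
                lookup.setdefault(word.lower(), set()).update(tags)
    return lookup


def tags_for(lookup, word):
    """Case-insensitive lookup with no grammatical form normalization."""
    return lookup.get(word.lower(), set())


with tempfile.TemporaryDirectory() as folder:
    main = Path(folder) / "english.tags.json"  # hypothetical file name
    main.write_text(json.dumps([
        {"include": "missing.json", "required": False},
        {"tag": "fnc,verb", "words": ["have"]},
        {"tag": "verb", "words": ["go", "allow"]},
    ]), encoding="utf-8")
    lookup = load_tags(main)

print(sorted(tags_for(lookup, "Have")))    # ['fnc', 'verb']
print(sorted(tags_for(lookup, "allows")))  # [] because "allows" is not listed
```

The second lookup returns nothing because only "allow" is listed, which is why a dictionary needs every grammatical variant of a word. The sketch does not detect circular includes.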
The default label dictionary uses the part of speech information in the following way:
- removes labels that consist of, start with or end with a function word or verb
- removes labels that consist of or end with an adjective or adverb
- removes labels that consist of or begin with a number
- slightly boosts labels containing proper nouns or geographic terms
You can customize this default behavior by editing the common-rules.labels.json dictionary.
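The four default rules listed above can be pictured with the following Python sketch. It is an illustration only. The real rules live in common-rules.labels.json, whose syntax is not shown in this section. The score_label function, the 1.1 boost value and the sample lookup are hypothetical.

```python
def score_label(label_words, lookup, boost=1.1):
    """Return None if the label would be removed, otherwise a weight multiplier."""
    tags = [lookup.get(word.lower(), set()) for word in label_words]
    if not tags:
        return None  # choice made for this sketch: an empty label is dropped
    first, last = tags[0], tags[-1]
    # Labels that are, start with or end with a function word or verb.
    if (first | last) & {"fnc", "verb"}:
        return None
    # Labels that are or end with an adjective or adverb.
    if last & {"adj", "adv"}:
        return None
    # Labels that are or begin with a number.
    if "numeric" in first:
        return None
    # Labels containing a proper noun or geographic term get a slight boost.
    if any(t & {"name", "geo"} for t in tags):
        return boost  # hypothetical value, the real boost is not documented here
    return 1.0


# Hypothetical lookup of lower-cased words to tags.
lookup = {
    "about": {"fnc"},
    "information": {"noun"},
    "london": {"geo"},
    "cool": {"adj"},
    "2021": {"numeric"},
}

print(score_label(["information", "about"], lookup))  # None
print(score_label(["London", "weather"], lookup))     # 1.1
print(score_label(["2021", "report"], lookup))        # None
print(score_label(["weather", "report"], lookup))     # 1.0
```

Words that have no entry in the lookup carry no tags, so they never trigger a removal or a boost in this sketch.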