Tag dictionaries

Tag dictionaries categorize individual words by their grammatical or semantic function, such as preposition, verb or proper name. You can reference these categories in the label dictionary, for example to remove labels that end in a preposition, such as "information about".

Built-in tag dictionaries

Lingo3G comes with built-in tag dictionaries for several languages. These dictionaries are precompiled into space- and lookup-efficient data structures, so you cannot modify them. However, you can use the custom tag dictionary to override the built-in category for specific words.
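As a sketch of such an override, suppose the built-in English dictionary treats apple as a common noun, while in your documents the word usually names a company. The built-in category of apple is an assumption made purely for illustration, and the file name english.tags.json assumes the language.tags.json naming pattern described under Custom tag dictionaries. Under those assumptions, a custom dictionary could reassign the word like this:

```json
[
  {
    "comment": "Hypothetical override: treat 'apple' as a proper noun.",
    "tag": "name",
    "words": [ "apple" ]
  }
]
```

Only the words listed in the custom dictionary are affected. All other words keep their built-in categories.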

To turn off the built-in tag dictionaries, set the useBuiltInWordDatabaseForLabelFiltering parameter to false.

Custom tag dictionaries

Lingo3G reads the tag dictionary from a JSON file named language.tags.json, where language is the language name. The tag dictionary must be an array of objects with the following structure:

[
  // zero or more includes.
  {
    "include": "relative-path-to-include.json",
    "required": true
  },

  // one or more tags, with word entries.
  {
    "comment": "Optional comment.",
    "tag": "[tag]",
    "words": [
      "word1",
      "word2"
    ]
  }
]

The structure of a Lingo3G tag dictionary.

Each object in the array must be of one of the following types:

include

Includes all entries from another tag dictionary in this dictionary. You can use includes to split a large tag dictionary into smaller ones or to factor out the entries shared by several tag dictionaries.

{ "include": "name.json", "required": true }
include

(required) Path to the tag dictionary to include, relative to the location of this dictionary.

required

If true and the file to include does not exist, Lingo3G will trigger an error.

If false, Lingo3G will ignore a non-existing include file.
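For instance, a dictionary could combine a mandatory shared file with an optional, site-specific one. The file names below are hypothetical and serve only to illustrate the two settings of required:

```json
[
  { "include": "shared-words.json", "required": true },
  { "include": "site-words.json", "required": false }
]
```

With this setup, a missing shared-words.json triggers an error, while a missing site-words.json is ignored.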

word category set

Assigns a category tag to a set of words:

{ "tag": "...", "words": [ "a", "b", ... ] }
tag

The category tag or tags to assign to the words listed in the words array.

Use one or more comma-separated categories from the following list:

  • fnc: function word (about, have)
  • verb: verb (have, allows)
  • noun: noun (website, test)
  • adj: adjective (cool)
  • adv: adverb (fully)
  • geo: geographical reference (London)
  • name: proper noun (John)
  • numeric: numeric values (10.4, 2021, 14:30, 3/4)
  • sep: phrase separator, such as e.g. or i.e. Lingo3G removes phrase separators from processing, so they never appear in cluster labels.
words

The list of words to which to apply the tag.

Lingo3G matches the words against the input text in a case-insensitive way and does not apply any further processing, such as grammatical form normalization. Therefore, include all alternative grammatical and spelling variants of a word if necessary, for example: naïve, naive, naïvely, naively.
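A minimal sketch of how the variants above might be listed, assuming you want the adjective and adverb forms tagged separately. Capitalized forms such as Naive need no entry of their own because matching is case-insensitive, but the accented and unaccented spellings must both be present:

```json
[
  {
    "tag": "adj",
    "words": [ "naïve", "naive" ]
  },
  {
    "tag": "adv",
    "words": [ "naïvely", "naively" ]
  }
]
```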

A real-world dictionary can look like this:

[
  { "include": "legal-words.json", "required": true },
  {
    "tag": "fnc",
    "words": [ "a", "about" ]
  },
  {
    "tag": "fnc,verb",
    "words": [ "have" ]
  },
  {
    "tag": "verb",
    "words": [ "go", "allow" ]
  },
  {
    "tag": "numeric",
    "words": [ "thousand" ]
  }
]

Tag dictionary example.

Tip: how Lingo3G uses word categories by default.

The default label dictionary uses the part-of-speech information in the following way:

  • removes labels that consist of, start with or end with a function word or verb
  • removes labels that consist of or end with an adjective or adverb
  • removes labels that consist of or begin with a number
  • slightly boosts labels containing proper nouns or geographic terms

You can customize this default behavior by editing the common-rules.labels.json dictionary.