Tag dictionaries

This type of dictionary tells the algorithm which functional category each word in the input belongs to. Tags correspond to major part-of-speech groups, but also to other, more "semantic" categories, such as geographic references or proper names.

Tags can be used in label filtering dictionaries to restrict (or boost) candidate cluster labels that have a certain structure. For example, a filter can omit labels that start or end with a preposition ("information about") or boost labels that contain information-rich words, such as proper nouns.
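
The idea can be illustrated with a small conceptual sketch in Python. It is not Lingo3G's API or its label filtering syntax: the TAGS table, tags_of, is_acceptable and boost are hypothetical names, and the 1.5 multiplier is an arbitrary value chosen for the example.

```python
# Conceptual sketch only: the names below are hypothetical, not Lingo3G's API.

# Hypothetical word -> tags table with lower-cased keys.
TAGS = {
    "about": {"fnc"},
    "of": {"fnc"},
    "john": {"name"},
}


def tags_of(word):
    """Return the set of tags for a word, or an empty set if it is unknown."""
    return TAGS.get(word.lower(), set())


def is_acceptable(label):
    """Reject labels that start or end with a function word."""
    words = label.split()
    if not words:
        return False
    return "fnc" not in tags_of(words[0]) and "fnc" not in tags_of(words[-1])


def boost(label, factor=1.5):
    """Return a score multiplier: `factor` if the label contains a proper noun."""
    has_name = any("name" in tags_of(w) for w in label.split())
    return factor if has_name else 1.0
```

With this table, "information about" is rejected because it ends in a function word, while a label containing "John" receives the higher multiplier.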

Built-in dictionaries

Lingo3G comes with built-in exact mappings between words, their lemmas and tags that roughly reflect part-of-speech categories for several languages. These dictionaries are precompiled into space- and lookup-efficient data structures and cannot be modified by users.

It is possible to turn off these built-in dictionaries by setting the useBuiltInWordDatabaseForLabelFiltering parameter to false.

Customizable dictionaries

Custom word dictionaries are JSON files that follow the naming convention language.tag-dictionary.json, where language is replaced with the name of a supported language.

Even if a built-in word dictionary is available and enabled, the definition found in the user-defined word dictionary completely overrides the information from the built-in PoS database.
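
The override rule can be pictured as a two-level lookup. The Python sketch below is an illustration under stated assumptions, not the actual implementation: resolve_tags is a hypothetical function, both dictionaries are modelled as plain word-to-tags mappings with lower-cased keys, and the override is assumed to apply per word.

```python
def resolve_tags(word, custom, builtin, use_builtin=True):
    """Return the tags for a word; a custom entry replaces the built-in one.

    Assumption: both mappings use lower-cased words as keys.
    """
    key = word.lower()
    if key in custom:
        # The custom definition wins outright; tags are not merged.
        return custom[key]
    if use_builtin:
        return builtin.get(key, frozenset())
    return frozenset()
```

Here, a word tagged verb in the built-in mapping and fnc in the custom mapping resolves to fnc only. Passing use_builtin=False mimics turning the built-in dictionaries off.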

An outline structure of a word dictionary is shown below.

[
  // zero or more includes.
  {
    "include": "relative-path-to-include.json",
    "required": true
  },

  // one or more tags, with word entries.
  {
    "comment": "Optional comment.",
    "tag": "[tag]",
    "words": [
      "word1",
      "word2"
    ]
  }
]

A more realistic dictionary may look like this:

[
  { "include": "common-legal-jargon.json", "required": true },
  {
    "tag": "fnc",
    "words": [ "a", "about" ]
  },
  {
    "tag": "fnc,verb",
    "words": [ "have" ]
  },
  {
    "tag": "verb",
    "words": [ "go", "allow" ]
  },
  {
    "tag": "noun",
    "words": [ "website" ]
  }
]

The dictionary definition is an array of the following types of objects.

Resource includes

The dictionary file can be split into smaller files and assembled together with an include:

{ "include": "name.json", "required": true }

The include object must have an include property whose string value is the path of the resource to include, relative to the including dictionary. Each included file must itself be a valid dictionary.

The optional Boolean property required indicates whether the included resource must exist. If the resource is not required and cannot be found, it is silently omitted.
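
The include mechanism can be sketched as a recursive loader. The Python below is a minimal illustration, not Lingo3G's loader: load_dictionary is a hypothetical name, a missing required property is assumed to mean "not required", the files are assumed to be plain JSON without // comments, and circular includes are not detected.

```python
import json
from pathlib import Path


def load_dictionary(path):
    """Return a flat list of tag/words entries with all includes expanded."""
    path = Path(path)
    entries = []
    for item in json.loads(path.read_text(encoding="utf-8")):
        if "include" in item:
            # Include paths are resolved relative to the including file.
            target = path.parent / item["include"]
            if target.exists():
                entries.extend(load_dictionary(target))
            elif item.get("required", False):  # default is an assumption
                raise FileNotFoundError(f"Required include not found: {target}")
            # A missing optional include is silently skipped.
        else:
            entries.append(item)
    return entries
```

Entries from an included file appear at the position of the include object, so the order of the resulting list follows the order of the dictionary file.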

PoS word group

This element assigns a category code tag to a set of words:

{ "tag": "...", "words": [ "a", "b", ... ] }

The tag property provides the word category code for the entire array of words specified in the words property.

The following part-of-speech codes are defined (any comma- or space-delimited combination of these codes is permitted in a single tag):

  • fnc: structural "function" words ("about", "have"),
  • verb: verbs ("have", "allows"),
  • noun: nouns ("website", "test"),
  • adj: adjectives ("cool"),
  • adv: adverbs ("fully"),
  • geo: geographical references ("London"),
  • name: proper nouns ("John"),
  • sep: phrase separator tokens. These tokens are removed early from the processing stream and will not appear in cluster labels (at any position).

Words listed in the words array are matched against the input tokens in a case-insensitive way. They are not processed with other conflation methods (they are not stemmed), so it may be necessary to enumerate all surface forms of a word (for example, inflected forms and alternate spellings).
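
The tag syntax and the matching rule can be summarized in a short Python sketch. It is an illustration only: parse_tags and build_index are hypothetical names, rejecting unknown codes is a choice made for this example, and merging the tags of a word that appears in several groups is an assumption.

```python
import re

# The codes listed above.
KNOWN_TAGS = {"fnc", "verb", "noun", "adj", "adv", "geo", "name", "sep"}


def parse_tags(tag_value):
    """Split a comma- or space-delimited tag string into a set of codes."""
    tags = {t for t in re.split(r"[,\s]+", tag_value.strip()) if t}
    unknown = tags - KNOWN_TAGS
    if unknown:
        raise ValueError(f"Unknown tag codes: {sorted(unknown)}")
    return tags


def build_index(entries):
    """Map each lower-cased word to the set of tags assigned to it."""
    index = {}
    for entry in entries:
        tags = parse_tags(entry["tag"])
        for word in entry["words"]:
            # Lower-casing the key gives case-insensitive matching.
            # No stemming is applied, so only the listed forms match.
            index.setdefault(word.lower(), set()).update(tags)
    return index
```

An index built from a group that lists "color" and "colour" matches both spellings in any letter case, but it does not match "colors" unless that form is listed as well.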