Synonym dictionary

The synonym dictionary lets you define words and phrases that represent the same concept, such as photo and picture, so that Lingo3G puts documents discussing those concepts in the same cluster.

Lingo3G reads the synonym dictionary from a JSON file named language.synonyms.json, where language is the language name. The synonym dictionary must be an array of objects with the following structure:

[
  // zero or more includes.
  {
    "include": "relative-path-to-include.json",
    "required": true
  },

  // zero or more synonym sets consisting of label matching rules
  {
    "comment": "Example synonym set for various words describing a 'picture'.",
    "glob": [
      "picture",
      "pictures",
      "pic",
      "pics"
    ]
  }
]

The structure of a Lingo3G synonym dictionary.

The dictionary definition is an array of the following types of objects.

include

Includes all entries from another synonym dictionary into this dictionary. You can use includes to split a large synonym dictionary into smaller ones or extract a common part of a set of synonym dictionaries.

{ "include": "name.json", "required": true }
include

(required) Path to the synonym dictionary to include, relative to the location of this dictionary.

required

If true and the file to include does not exist, Lingo3G reports an error.

If false, Lingo3G ignores a missing include file.
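To preview what a dictionary looks like once its includes are expanded, the rules above can be approximated outside Lingo3G. The following is a minimal sketch, not Lingo3G's own loader. It assumes the files are plain JSON without // comments, treats a missing required property as true (an assumption, since the default is not stated here), and does not guard against circular includes. The function name load_synonym_sets and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_synonym_sets(path):
    """Return the synonym sets of a dictionary with its includes expanded."""
    path = Path(path)
    entries = json.loads(path.read_text(encoding="utf-8"))
    synonym_sets = []
    for entry in entries:
        if "include" in entry:
            # Include paths are relative to the dictionary that declares them.
            target = path.parent / entry["include"]
            if target.exists():
                synonym_sets.extend(load_synonym_sets(target))
            elif entry.get("required", True):  # assumed default: required
                raise FileNotFoundError(f"Required include not found: {target}")
            # A missing include with "required": false is skipped.
        else:
            synonym_sets.append(entry)
    return synonym_sets


# Demo with throwaway files; the file names are examples only.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "common.json").write_text(
        json.dumps([{"glob": ["movie", "film"]}]), encoding="utf-8"
    )
    (root / "english.synonyms.json").write_text(
        json.dumps(
            [
                {"include": "common.json", "required": True},
                {"include": "missing.json", "required": False},
                {"glob": ["picture", "pic"]},
            ]
        ),
        encoding="utf-8",
    )
    merged = load_synonym_sets(root / "english.synonyms.json")

print(merged)
```

In the demo, the sets from common.json come first, the missing optional include is skipped, and the dictionary's own synonym set follows.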

synonym set

Represents a single synonym set. An array of glob rules defines the set of synonymous words and phrases.

{
  "comment": "The 'movie' concept",
  "label": "Movie",
  "glob": [
    "film", "motion picture", "movie"
  ]
}
glob

(required) An array of glob rules defining the synonymous labels.

label

(optional) The label Lingo3G should use to describe a cluster formed from any word or phrase in this synonym set.

If you don't provide a preferred label, Lingo3G will describe the cluster with an arbitrary label from the synonym set.

comment

(optional) A comment for this synonym set.
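The two entry types described above can be checked before handing a dictionary to Lingo3G. The sketch below is an illustrative check only, not part of Lingo3G. It works on an already parsed dictionary (for example the result of json.load on a file without // comments) and assumes that include is a string and that glob is an array of strings, as in the examples on this page. The function name validate_dictionary is hypothetical.

```python
def validate_dictionary(entries):
    """Return a list of human-readable problems found in a parsed dictionary."""
    if not isinstance(entries, list):
        return ["top level: expected an array of objects"]

    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected an object")
        elif "include" in entry:
            # Include entry: the path is required.
            if not isinstance(entry["include"], str):
                problems.append(f"entry {index}: 'include' must be a string path")
        elif "glob" in entry:
            # Synonym set: the array of glob rules is required.
            rules = entry["glob"]
            if not isinstance(rules, list) or not all(isinstance(r, str) for r in rules):
                problems.append(f"entry {index}: 'glob' must be an array of strings")
        else:
            problems.append(f"entry {index}: expected an 'include' or a 'glob' key")
    return problems


sample = [
    {"include": "common.json", "required": True},
    {"label": "Movie", "glob": ["film", "motion picture", "movie"]},
    {"comment": "This entry has no rules."},
]
problems = validate_dictionary(sample)
print(problems)
```

Here the third entry is reported because it declares neither an include nor a glob array.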

Heads up: synonym dictionary limitations.
  • You can only use glob matching rules in the synonym dictionary. Lingo3G does not support regexp or exact matchers in synonym definitions.

  • Lingo3G does not apply synonyms when matching label dictionary entries. To match synonyms in the label dictionary, add a matcher entry for each synonym explicitly.

  • Synonym sets are not transitive with respect to matching rule definitions. For example, Lingo3G does not collapse the following two declarations, which share the first rule, into one synonym set:

    [
      {
        "glob": [ "dm", "data mining" ]
      },
      {
        "glob": [ "dm", "drum machine" ]
      }
    ]
    

    Make sure each synonym set covers a distinct subset of potential cluster labels. When a label matches more than one synonym set, Lingo3G assigns the label to an arbitrary synonym set.
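A quick way to spot declarations like the dm example above is to look for rules that appear in more than one synonym set. The following sketch is a rough check under stated assumptions: it compares rule text only, after lowercasing and collapsing whitespace (a simplification made for this example), and it does not evaluate glob semantics, so two differently written rules that match the same label go unnoticed. The function name find_shared_rules is hypothetical.

```python
from collections import defaultdict


def find_shared_rules(dictionary):
    """Map each rule used by several synonym sets to the indexes of those sets."""
    seen = defaultdict(list)
    for index, entry in enumerate(dictionary):
        # Include entries have no "glob" key and are skipped.
        for rule in entry.get("glob", []):
            key = " ".join(rule.lower().split())
            if index not in seen[key]:
                seen[key].append(index)
    return {rule: sets for rule, sets in seen.items() if len(sets) > 1}


dictionary = [
    {"glob": ["dm", "data mining"]},
    {"glob": ["dm", "drum machine"]},
]
shared = find_shared_rules(dictionary)
print(shared)  # {'dm': [0, 1]}
```

Each reported rule should be kept in at most one synonym set, so that every label has a single set it can match.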

A real-world synonym dictionary can look like this:

[
  {
    "label": "New York",
    "glob": [
      "nyc",
      "new york city",
      "big apple"
    ]
  },
  {
    "glob": [
      "movie",
      "film",
      "motion picture"
    ]
  }
]

Example synonym dictionary.

With the above dictionary, Lingo3G puts documents containing NYC, New York City, or Big Apple in a single cluster labelled New York.