Synonyms

This section describes label synonyms: groups of different words or phrases that should map to the same concept (cluster).

Synonyms give Lingo3G a hint that a set of words or phrases represents the same thing and that any occurrence of its members should be treated as synonymous during clustering. For example, photo, photograph, pic and picture typically all refer to the same real-world concept and should be considered the same thing for clustering.

Synonym dictionaries are specified in JSON files and follow the naming convention of language.synonyms.json (where language is substituted with each supported language's name).

The outline of a synonym dictionary is shown below.

[
  // zero or more includes.
  {
    "include": "relative-path-to-include.json",
    "required": true
  },

  // zero or more synonym sets consisting of label matching rules
  {
    "comment": "Example synonym set for various words describing a 'picture'.",
    "glob": [
      "picture",
      "pictures",
      "pic",
      "pics"
    ]
  }
]

A more realistic file could look something like this:

[
  {
    "glob": [
      "nyc",
      "new york city",
      "big apple"
    ]
  },
  {
    "glob": [
      "picture",
      "pictures",
      "pic",
      "pics"
    ]
  }
]

The dictionary definition is an array of the following types of objects.

resource includes

The dictionary file can be split into smaller files and assembled together with an include:

{ "include": "name.json", "required": true }

The include object must have the include property, whose string value is the path of the resource to include, relative to the dictionary file. Each included file must itself be a valid dictionary.

An optional Boolean property required indicates whether the included resource must be present; if it is false, a resource that cannot be found is silently omitted.
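The include mechanism described above can be sketched in a few lines of Python. This is only an illustration, not Lingo3G's actual loader: load_dictionary is a hypothetical helper, it assumes the files are strict JSON (no // comments), it does not guard against circular includes, and the default used when required is absent (required_default) is an assumption because the text above does not state it.

```python
import json
from pathlib import Path


def load_dictionary(path, required_default=True):
    """Return the synonym groups in `path` with all includes expanded.

    Hypothetical helper, not part of Lingo3G. Assumes strict JSON and that
    `required` falls back to `required_default` when it is not given.
    """
    path = Path(path)
    groups = []
    for entry in json.loads(path.read_text(encoding="utf-8")):
        if "include" in entry:
            # Includes are resolved relative to the including dictionary.
            target = path.parent / entry["include"]
            if target.is_file():
                groups.extend(load_dictionary(target, required_default))
            elif entry.get("required", required_default):
                raise FileNotFoundError(f"Required include not found: {target}")
            # Otherwise: a missing optional include is skipped silently.
        else:
            groups.append(entry)
    return groups
```

Calling load_dictionary("english.synonyms.json") on a file that includes other dictionaries would return one flat list of synonym groups, in the order in which they are encountered.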

synonym group

A single synonym group is defined by all labels covered by its glob matching rules.

{
  "comment": "Optional comment.",
  "label": "Optional label",
  "glob": [ ... ]
}

Each synonym group can contain an optional label, which will be used as the cluster's description if any of the labels covered by the group is selected as a cluster label.

Important!
  • Synonym entries can use glob matching rules only: regexp and exact rules are not supported within a synonym entry.

  • Glob expressions in synonym dictionaries are preprocessed with the current algorithm settings (stemming, term normalization). This means that synonym rules should list all surface form variants of words if stemming and term normalization are turned off.

  • Synonym sets are not transitive with respect to matching rule definitions. For example, the following two declarations will not be collapsed into one synonym set (they share the same first rule):

    [
      {
        "glob": [ "dm", "data mining" ]
      },
      {
        "glob": [ "dm", "drum machine" ]
      }
    ]
    

    Each synonym set should cover a distinct subset of potential cluster labels. When a given label matches more than one synonym set, the result is undefined (that label will be assigned to an arbitrary synonym set and thus to any of the weights defined by these synonym sets).
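Because overlapping synonym sets lead to undefined results, it can help to check a dictionary for shared entries before using it. The sketch below is a rough approximation under stated assumptions: find_shared_entries is a hypothetical helper, and it compares glob entries as plain strings after lowercasing and collapsing whitespace. It does not interpret glob syntax, stemming or term normalization, so it only catches entries that are textually identical.

```python
from collections import defaultdict


def find_shared_entries(groups):
    """Map each glob entry listed in more than one synonym set to the
    sorted indices of the sets that list it.

    Hypothetical helper: entries are compared as literal strings
    (lowercased, whitespace-collapsed), not as glob patterns.
    """
    seen = defaultdict(set)
    for index, group in enumerate(groups):
        for entry in group.get("glob", []):
            key = " ".join(entry.lower().split())
            seen[key].add(index)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}


groups = [
    {"glob": ["dm", "data mining"]},
    {"glob": ["dm", "drum machine"]},
]
print(find_shared_entries(groups))  # {'dm': [0, 1]}
```

An empty result means that no two synonym sets list the same literal entry; it does not prove that two different glob expressions cannot match the same label.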