Label dictionary

The label dictionary lets you tune the labels Lingo3G chooses to describe clusters. Using the label dictionary, you can prevent Lingo3G from selecting certain words or phrases, such as domain-common phrases and abusive language. You can also boost other labels, such as brand names.

Lingo3G reads the label dictionary from a JSON file named language.labels.json, where language is the language name. The label dictionary must be an array of objects with the following structure:

[
  // zero or more includes.
  {
    "include": "relative-path-to-include.json",
    "required": true
  },

  // zero or more weighted matcher entries.
  {
    "comment": "Optional comment.",
    "weight": 0,

    // One or more label matchers
    "exact": [],
    "glob": [],
    "regexp": []
  }
]

The structure of a Lingo3G label dictionary.

Each object in the array must be of one of the following types:

include

Includes all entries from another label dictionary into this dictionary. You can use includes to split a large label dictionary into smaller ones or extract a common part of a set of label dictionaries.

{ "include": "name.json", "required": true }
include

(required) Path to the label dictionary to include, relative to the location of this dictionary.

required

If true and the file to include does not exist, Lingo3G will trigger an error.

If false, Lingo3G will ignore a non-existing include file.
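As a minimal sketch of the two settings, the dictionary below pulls in one mandatory and one optional file. Both file names are hypothetical and chosen only for illustration. With this setup, a missing shared-stop-labels.json would trigger an error, while a missing local-overrides.json would be ignored.

```json
[
  { "include": "shared-stop-labels.json", "required": true },
  { "include": "local-overrides.json", "required": false }
]
```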

weighted rules

Defines one or more label matchers and their associated label weight. If a label matches any rule in any matcher, Lingo3G applies the weight to the label.

{
  "comment": "Comment",
  "weight": 0,

  "exact": [],
  "glob": [],
  "regexp": []
}
weight

The weight to apply to the matching labels.

A weight of 0 excludes the labels from processing – Lingo3G will not use the labels to describe clusters. Weights between 0 and 1 decrease the chances that the corresponding labels appear in clusters. Values larger than 1 increase the likelihood that Lingo3G selects the labels to describe a cluster.

If you don't specify the weight, Lingo3G assumes a weight of 0 and removes all the matching labels from processing.

glob, exact, regexp

Label matchers that determine which labels the weight applies to. You can use any combination of the available matcher types.

comment

An optional comment for this dictionary section.
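The following sketch shows one entry for each weight range described above: removal (0), demotion (between 0 and 1) and boosting (larger than 1). The labels and the specific values 0.5 and 3 are illustrative assumptions, not recommended settings. Only exact matchers are used to keep the example simple.

```json
[
  {
    "comment": "Remove navigation boilerplate.",
    "weight": 0,
    "exact": ["click here"]
  },
  {
    "comment": "Make a generic phrase less likely to appear.",
    "weight": 0.5,
    "exact": ["press release"]
  },
  {
    "comment": "Make a brand name more likely to appear.",
    "weight": 3,
    "exact": ["Lingo3G"]
  }
]
```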

If a label matches multiple rules with different weights, Lingo3G applies the lowest weight. For example, if a label matches two rule sets with weights 0 and 2.5, Lingo3G uses the zero weight and removes the label from processing.
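A small sketch of such a conflict, using a hypothetical label: both entries below match the same text, so by the lowest-weight rule the label would be removed from processing despite the boosting entry.

```json
[
  {
    "comment": "Boost.",
    "weight": 2.5,
    "exact": ["text clustering"]
  },
  {
    "comment": "Remove.",
    "weight": 0,
    "exact": ["text clustering"]
  }
]
```

To keep the boost in a case like this, you would need to narrow the removing entry so that it no longer matches the label.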

A real-world label dictionary can look like this:

[
  { "include": "common-legal-jargon.json", "required": true },
  {
    "comment": "Remove common expressions.",
    "weight": 0,
    "exact": [
      "main page"
    ],
    "glob": [
      "* home page *",
      "{fnc}", "+ {fnc}", "{fnc} +",
      "* banned *"
    ],
    "regexp": [
      "\\d{1,2}(am|pm)"
    ]
  },
  {
    "comment": "Boost product names.",
    "weight": 2,
    "exact": [
      "Lingo3G", "Lingo4G"
    ],
    "glob": [
      "* text clustering *"
    ]
  }
]

Example label dictionary.

With the above dictionary, Lingo3G removes the following labels from processing: main page, Welcome to the home page, for clustering, 5pm. It promotes the following labels in the clusters: Lingo3G, Lingo4G, text clustering tools.

See the label matchers section for more examples of label matching.