Label filtering

Label filtering provides some control over the way Lingo3G chooses labels that describe clusters. You can prevent Lingo3G from selecting certain words or phrases (domain-common phrases, abusive language), or boost and promote other labels (product or brand names).

Label dictionaries are JSON files named language.labels.json, where language is replaced with the name of a supported language.

The general structure of a label filtering dictionary is outlined below.

[
  // zero or more includes.
  {
    "include": "relative-path-to-include.json",
    "required": true
  },

  // zero or more weighted matcher entries.
  {
    "comment": "Optional comment.",
    "weight": 0,

    // One or more label matchers
    "exact": [],
    "glob": [],
    "regexp": []
  }
]

A real dictionary could look something like this:

[
  { "include": "common-legal-jargon.json", "required": true },
  {
    "comment": "Remove common junk expressions.",
    "weight": 0,
    "exact": [
      "main page"
    ],
    "glob": [
      "* home page *",
      "{fnc}", "+ {fnc}", "{fnc} +",
      "* banned *"
    ],
    "regexp": [
      "\\d{1,2}(am|pm)"
    ]
  },
  {
    "comment": "Boost these",
    "weight": 2,
    "exact": [
      "E.T."
    ],
    "glob": [
      "* clustering engine *",
      "orange +"
    ],
    "regexp": []
  }
]

The dictionary definition is an array of the following types of objects.

resource includes

The dictionary file can be split into smaller files and assembled together with an include:

{ "include": "name.json", "required": true }

The include object must have an include property whose string value is the path of the resource to include, relative to the including dictionary. Each included file must itself be a valid dictionary.

The optional Boolean property required indicates whether the included resource must be present. When required is false, a resource that cannot be found is silently omitted.
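The include handling described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Lingo3G's implementation: the function name load_entries is hypothetical, the files are assumed to be plain JSON without comments, and a missing required property is assumed to mean true, which this section does not state.

```python
import json
from pathlib import Path


def load_entries(path):
    """Return the weighted rules of a dictionary file, with includes expanded."""
    path = Path(path)
    entries = []
    for entry in json.loads(path.read_text(encoding="utf-8")):
        if "include" not in entry:
            entries.append(entry)  # a weighted rule: keep it as-is
            continue
        # Resolve the include relative to the including dictionary file.
        target = path.parent / entry["include"]
        if target.is_file():
            entries.extend(load_entries(target))
        elif entry.get("required", True):  # assumed default, see the note above
            raise FileNotFoundError(f"required include not found: {target}")
        # A missing include with "required": false is skipped silently.
    return entries
```

The sketch does not guard against circular includes; a real loader would need to track the files it has already visited.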

weighted rules

A weighted rule defines one or more label matchers and the numeric weight associated with them.

The weight property determines whether the rule penalizes or boosts matching cluster labels. A weight of zero (the default) excludes any label matching this entry. Weights between 0 and 1 make matching labels less likely to appear in clusters. Weights greater than 1 make matching labels more likely to be selected as cluster labels.
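The weight ranges above can be summarized as a small classification function. This is a minimal sketch with a hypothetical name (weight_effect); the treatment of a weight of exactly 1 as neutral and the rejection of negative weights are assumptions, since this section describes neither case.

```python
def weight_effect(weight=0):
    """Classify a rule weight using the ranges described in this section."""
    if weight < 0:
        # Assumption: negative weights are not described, so reject them.
        raise ValueError("negative weights are not described in this section")
    if weight == 0:
        return "exclude"   # zero (the default) removes matching labels
    if weight < 1:
        return "penalize"  # between 0 and 1: less likely to be selected
    if weight > 1:
        return "boost"     # above 1: more likely to be selected
    return "neutral"       # assumption: exactly 1 leaves the label unchanged
```

For the sample dictionary shown earlier, the first rule (weight 0) falls into the "exclude" class and the second (weight 2) into the "boost" class.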

Label filtering dictionaries support all of the matchers described in the matching rules chapter.

If a candidate cluster label matches more than one entry, Lingo3G applies pruning rules first (rules with a weight of 0), then any other rules in descending order of their weights.
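The ordering described above (pruning rules first, then the remaining rules from the highest weight to the lowest) can be expressed as a sort key. The function name application_order is hypothetical, and the sketch only illustrates the ordering; how Lingo3G combines the effects of several matching rules is not covered here.

```python
def application_order(matching_rules):
    """Order matching rules: weight 0 first, then descending weight."""
    def key(rule):
        weight = rule.get("weight", 0)  # zero is the documented default
        # False sorts before True, so pruning rules (weight 0) come first;
        # negating the weight sorts the remaining rules in descending order.
        return (weight != 0, -weight)
    return sorted(matching_rules, key=key)


rules = [
    {"comment": "boost", "weight": 2},
    {"comment": "prune", "weight": 0},
    {"comment": "penalize", "weight": 0.5},
    {"comment": "strong boost", "weight": 3},
]
ordered = [rule["comment"] for rule in application_order(rules)]
# ordered == ["prune", "strong boost", "boost", "penalize"]
```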