Label matchers

This chapter describes the common syntax of label matching rules, used to define label filters and synonyms.

An example label filter entry is shown below.

{
  "exact": [
    "Literal case-sensitive match"
  ],
  "glob": [
    "starts with phrase *",
    "* contains words *"
  ],
  "regexp": [
    "(?).+BadLabel.+",
    "(?)^[0-9]\\s*.*"
  ]
}

An example entry in a label filter dictionary containing three different matcher types.
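Any standard JSON parser can read an entry of this shape. The minimal sketch below uses Python's standard json module, which is not part of Lingo3G and serves only as an illustration; ENTRY_JSON and is_valid_json are names made up for the example. It also shows why backslashes in regular expressions have to be doubled in the dictionary file: a lone \s is not a valid JSON escape sequence.

```python
import json

# The example entry as raw JSON text (ENTRY_JSON is a hypothetical name).
# In the file, the regular expression \s has to be written as \\s.
ENTRY_JSON = r'''
{
  "exact": ["Literal case-sensitive match"],
  "glob": ["starts with phrase *", "* contains words *"],
  "regexp": ["(?).+BadLabel.+", "(?)^[0-9]\\s*.*"]
}
'''

entry = json.loads(ENTRY_JSON)

# After parsing, each matcher type maps to a plain list of pattern strings.
for matcher_type, patterns in entry.items():
    print(matcher_type, patterns)


def is_valid_json(text):
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A single backslash before "s" is rejected, a doubled one is accepted.
single_backslash = r'{"regexp": ["^[0-9]\s*"]}'
double_backslash = r'{"regexp": ["^[0-9]\\s*"]}'
print(is_valid_json(single_backslash), is_valid_json(double_backslash))
```

Once parsed, the doubled backslash collapses to a single one, so the pattern handed to the regular expression engine is (?)^[0-9]\s*.* as intended.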

A matcher is a predicate that decides whether a string literal (a word or phrase) should be accepted or not. Several implementations of the matching strategy are possible. Lingo3G implements the matchers described in the sections below.

Exact matcher

Exact matchers require exact, case-sensitive equality between the word or phrase and the dictionary entry. Exact matcher entries are fast to parse and very fast to apply during clustering.

{
  "exact": [
    "DevOps",
    "Windows 2000"
  ]
}

An example label filter dictionary with two exact matcher entries.

The dictionary above matches the labels DevOps and Windows 2000, but does not match Devops or Windows 2000 machine.
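Conceptually, an exact matcher behaves like a lookup in a set of strings. The following Python sketch illustrates the behavior under that assumption and says nothing about how Lingo3G stores or applies its entries; EXACT_ENTRIES and exact_match are hypothetical names.

```python
# Entries from the example dictionary above (hypothetical variable name).
EXACT_ENTRIES = {"DevOps", "Windows 2000"}


def exact_match(label, entries=EXACT_ENTRIES):
    """Return True only if the whole label equals an entry, character for character."""
    return label in entries


for label in ["DevOps", "Windows 2000", "Devops", "Windows 2000 machine"]:
    print(f"{label!r}: {exact_match(label)}")
```

The first two labels are accepted; Devops fails because of letter case, and Windows 2000 machine fails because the comparison covers the whole label rather than a prefix.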

For case-insensitive matching, use glob matchers (preferably) or case-insensitive regular expression matchers.

Glob matcher

The glob matcher allows simple word-based wildcard matching. Its primary use case is case-insensitive matching of literal phrases, as well as "begins with…", "ends with…" or "contains…" types of expressions. Glob matcher entries are fast to parse and very fast to apply.

{
  "glob": [
    "more information",
    "more information *",
    "* about *",
    "big ?",
    "+ apple"
  ]
}

An example label filter dictionary with glob matcher entries.

Matching rules

  • Each entry must consist of one or more space-separated tokens.

  • A token is a sequence of arbitrary characters, such as words, numbers, identifiers.

  • Matching is case-insensitive by default. Letter case normalization is performed based on the ROOT Java locale, which performs language-neutral case conflation according to Unicode rules.

  • A token enclosed in single or double quotes, for example "Rating***", is taken literally: matching is case-sensitive, and any * character inside the quoted token is allowed and compared literally.

  • To include quote characters in the token, escape them with the \ character, for example: \"information\".

  • The following wildcard-matching tokens are recognized:

    • The ? token matches exactly one (any) word (possessive operator).

    • The * token matches zero or more words (possessive operator).

    • The *? token matches zero or more words (reluctant operator).

    • The + token matches one or more words (possessive operator). This operator is functionally equivalent to: ? *.

    • The +? token matches one or more words (reluctant operator). This operator is functionally equivalent to: ? *?.

    A few restrictions apply to wildcard operators.

    • Wildcard characters (*, +) cannot be used to express prefixes or suffixes. For example, programm* is not supported.

    • The *? and +? wildcards are reluctant matchers in the regular expression sense: they match the minimal sequence of tokens until the next token in the pattern.

    • The * and + wildcards are possessive matchers in the regular expression sense: they match the maximum sequence of tokens until the next token in the pattern.

    • Greedy operators are not supported.
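To make the rules above concrete, here is a simplified Python sketch of a word-based glob matcher. It is not Lingo3G's implementation, and all names (WILDCARDS, parse_glob, glob_match) are hypothetical. It rests on several assumptions: labels and entries are split on whitespace, quoted tokens contain no spaces, Python's str.lower() stands in for the ROOT-locale case normalization, and only accept/reject is computed through exhaustive search. The possessive and reluctant operators are therefore treated alike, so results may differ from Lingo3G for patterns where that distinction matters.

```python
# Wildcard token -> (minimum words, maximum words or None for unbounded).
WILDCARDS = {"?": (1, 1), "*": (0, None), "*?": (0, None), "+": (1, None), "+?": (1, None)}


def parse_glob(entry):
    """Turn a glob entry into a list of (kind, value) pattern tokens."""
    tokens = []
    for raw in entry.split():
        if raw in WILDCARDS:
            tokens.append(("wild", WILDCARDS[raw]))
        elif raw[0] in "\"'":
            if len(raw) < 2 or raw[-1] != raw[0]:
                raise ValueError(f"unbalanced quotes: {raw}")
            tokens.append(("literal", raw[1:-1]))  # case-sensitive, '*' kept as is
        else:
            # Escaped quotes become ordinary characters of the token.
            word = raw.replace('\\"', '"').replace("\\'", "'")
            if "*" in word:
                raise ValueError(f"wildcard inside a word is not supported: {raw}")
            tokens.append(("word", word.lower()))  # case-insensitive
    if not any(kind != "wild" for kind, _ in tokens):
        raise ValueError("pattern needs at least one non-wildcard token")
    return tokens


def glob_match(entry, label):
    """Return True if the whole label matches the glob entry."""
    pattern = parse_glob(entry)
    words = label.split()

    def match(p, w):
        if p == len(pattern):
            return w == len(words)  # every word must be consumed
        kind, value = pattern[p]
        if kind == "wild":
            low, high = value
            remaining = len(words) - w
            limit = remaining if high is None else min(high, remaining)
            return any(match(p + 1, w + n) for n in range(low, limit + 1))
        if w == len(words):
            return False
        same = words[w] == value if kind == "literal" else words[w].lower() == value
        return same and match(p + 1, w + 1)

    return match(0, 0)


print(glob_match("more information *", "More information about"))  # True
print(glob_match("+ information", "information"))                   # False
```

The exhaustive search keeps the sketch short at the cost of speed; it is meant for experimenting with entries, not as a model of how the real matcher achieves its performance.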

Example entries

The following examples show a number of glob entries together with strings they match and strings they do not match. Each non-matching string is followed by an explanation of why there is no match.

Entry: more information

  • Matching strings:
    • More information
    • MORE INFORMATION
  • Non-matching strings:
    • more informations: 'informations' does not match pattern token 'information'.
    • more information about: Pattern does not contain wildcards, so only 2-word strings can match.
    • some more information: Pattern does not contain wildcards, so only 2-word strings can match.

Entry: more information *

  • Matching strings:
    • more information
    • More information about
    • More information about a
  • Non-matching strings:
    • more informations: 'informations' does not match pattern token 'information'.
    • more informations about: 'informations' does not match pattern token 'information'.
    • some more information: Pattern does not have wildcards at the beginning, so matching strings must start with 'more information'.

Entry: * information *

  • Matching strings:
    • information
    • more information
    • information about
    • a lot more information on
  • Non-matching strings:
    • informations: 'informations' does not match pattern token 'information'.
    • more informations about: 'informations' does not match pattern token 'information'.
    • some more informations: 'informations' does not match pattern token 'information'.

Entry: + information

  • Matching strings:
    • too much information
    • more information
  • Non-matching strings:
    • information: The + wildcard requires at least one word before 'information'.
    • more information about: 'about' is an extra word not covered by the pattern.

Entry: "Information" *

  • Matching strings:
    • Information
    • Information about
    • Information ABOUT
  • Non-matching strings:
    • information: The "Information" token is case-sensitive, so it does not match 'information'.
    • information about: The "Information" token is case-sensitive, so it does not match 'information'.
    • Informations about: 'Informations' does not match pattern token "Information".

Entry: data ?

  • Matching strings:
    • data mining
  • Non-matching strings:
    • data: The ? operator requires a word after "data".
    • data mining research: The "research" token does not match the pattern.

Entry: "Programm*"

  • Matching strings:
    • Programm*
  • Non-matching strings:
    • Programmer: The "Programm*" token is taken literally, so it matches only 'Programm*'.
    • Programming: The "Programm*" token is taken literally, so it matches only 'Programm*'.

Entry: \"information\"

  • Matching strings:
    • "information"
    • "INFOrmation" (escaped quotes are taken literally, so the match is case-insensitive)
  • Non-matching strings:
    • information: Escaped quotes not found in the string being matched.
    • "information: Escaped quotes not found in the string being matched.

Entry: programm*

  • Illegal pattern: combinations of the * wildcard and other characters are not supported.

Entry: "information

  • Illegal pattern: unbalanced double quotes.

Entry: *

  • Illegal pattern: there must be at least one non-wildcard token.

Regular expression matcher

The regular expression matcher checks words or labels against a list of regular expressions you provide.

{
  "regexp": [
    "Windows 9[58]",
    "(?)^[0-9]\\s*.*"
  ]
}

An example label filter dictionary with two regexp matcher entries.

The regular expressions must follow Java syntax. If any fragment of a word or label matches any regular expression provided in the dictionary, the word or label will be filtered out.
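The "any fragment" rule corresponds to searching for the pattern anywhere in the label, rather than requiring the pattern to cover the whole label. The sketch below illustrates this with Python's re module, which only stands in for Java's regular expression engine; the two engines differ in details, and the (?) prefix from the example above is left out because Python's re does not accept it. REGEXP_ENTRIES and is_filtered are hypothetical names.

```python
import re

# Hypothetical entries, written as Python raw strings (no JSON escaping needed here).
REGEXP_ENTRIES = [r"Windows 9[58]", r"^[0-9]\s*.*"]
COMPILED = [re.compile(pattern) for pattern in REGEXP_ENTRIES]


def is_filtered(label):
    """Return True if any fragment of the label matches any of the expressions."""
    return any(rx.search(label) is not None for rx in COMPILED)


for label in ["Windows 95", "Upgrading Windows 98 drivers", "Windows 2000",
              "3 easy steps", "Step 3"]:
    print(f"{label!r}: {is_filtered(label)}")
```

In this sketch, Upgrading Windows 98 drivers is filtered out because a fragment of it matches the first expression, while Step 3 passes because the second expression is anchored to the start of the label with ^. Anchoring with ^ and $ is the usual way to restrict a fragment search to the whole string.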

Heads up, performance impact!

Regular expression matching is a powerful mechanism, but it can dramatically decrease clustering performance. Use it only when a similar effect cannot be achieved with a reasonable number of exact and glob matcher entries.