Label filtering

Label filtering lets you shape the labels you retrieve by excluding labels based on various criteria, such as the number of words.

Typical use cases of label filtering include:

  • Excluding the specific labels provided by the user. These may be some meaningless labels Lingo4G was not able to filter out automatically.

  • Excluding labels from the globally-defined exclusions dictionary. These would usually be the meaningless labels specific to the domain of the collection you are processing.

  • Excluding a broader class of labels from the result. For example, when looking for embedding-based synonyms for a word, you may want to exclude any labels containing that word.

Most label retrieval stages offer an option to filter the label lists they produce. If a label retrieval stage supports filtering, it usually exposes the labelFilter property, in which you can provide a labelFilter component.

For example, the following request retrieves labels similar to the word photon but not containing that exact word:

{
  "stages": {
    "similarLabels": {
      "type": "labels:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:labelEmbedding",
        "labels": {
          "type": "labels:direct",
          "labels": [
            {
              "label": "photon"
            }
          ]
        }
      },
      "labelFilter": {
        "type": "labelFilter:dictionary",
        "exclude": [
          {
            "type": "dictionary:regex",
            "entries": [
              ".*photon.*"
            ]
          }
        ]
      }
    }
  }
}
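The same dictionary filter can also cover the first use case listed above, excluding specific labels supplied by the user. The fragment below is a minimal sketch: the two labels are hypothetical placeholders, and it assumes each regex entry must match the whole label, as the `.*photon.*` pattern in the request above suggests.

```json
{
  "type": "labelFilter:dictionary",
  "exclude": [
    {
      "type": "dictionary:regex",
      "entries": [
        "et al",
        "supplementary material"
      ]
    }
  ]
}
```

You would place this object in the labelFilter property of the stage, in the same position as the filter in the request above.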

Preserving globally-defined filters

The example project descriptors contain the following default label filter definition (and so should your custom Lingo4G projects!):

"labelFilter": {
  "type": "labelFilter:composite",
  "labelFilters": {
    "auto": {
      "type": "labelFilter:autoStopLabels"
    },
    "project": {
      "type": "labelFilter:dictionary",
      "exclude": [
        {
          "type": "dictionary:all"
        }
      ]
    }
  }
}

This filter excludes labels listed in the project-wide stop label dictionaries as well as the meaningless labels Lingo4G discovered automatically during indexing.

If you'd like to preserve the global label filters when adding custom label filtering, use the labelFilter:composite filter to combine the global filter with the custom one you want to add:

{
  "type": "labelFilter:composite",
  "labelFilters": {
    "default": {
      "type": "labelFilter:reference",
      "use": "labelFilter"
    },
    "custom": {
      "type": "labelFilter:dictionary",
      "exclude": [
        {
          "type": "dictionary:regex",
          "entries": [
            ".*photon.*"
          ]
        }
      ]
    }
  }
}
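For illustration, here is how the composite filter might look when placed inline in the earlier photon request. This is a sketch that only combines the two fragments shown in this section. It assumes the stage's labelFilter property accepts a composite filter like any other filter component, and it has not been verified against a running Lingo4G instance.

```json
{
  "stages": {
    "similarLabels": {
      "type": "labels:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:labelEmbedding",
        "labels": {
          "type": "labels:direct",
          "labels": [
            {
              "label": "photon"
            }
          ]
        }
      },
      "labelFilter": {
        "type": "labelFilter:composite",
        "labelFilters": {
          "default": {
            "type": "labelFilter:reference",
            "use": "labelFilter"
          },
          "custom": {
            "type": "labelFilter:dictionary",
            "exclude": [
              {
                "type": "dictionary:regex",
                "entries": [
                  ".*photon.*"
                ]
              }
            ]
          }
        }
      }
    }
  }
}
```

With this variant, the stage should drop labels containing photon and still apply the project-wide and automatically discovered stop labels.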

To reuse such a combined filter, shadow the default label filter component in your request.

Label list filters

You can use the labelFilter:acceptLabels and labelFilter:rejectLabels filters to apply filtering based on a closed, explicit list of labels. Any labels stage can provide the list of labels to filter by.

This kind of filtering is useful, for example, when you would like to limit the analysis to a set of labels appearing in a specific set of documents. The following example returns labels matching the elec prefix, but the search is limited to the labels appearing in the documents defined by the set:math query.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "set:math"
      },
      "limit": 20000
    },
    "documentLabels": {
      "type": "labels:fromDocuments",
      "maxLabels": {
        "type": "labelCount:progressive",
        "min": 10000
      }
    },
    "prefixLabels": {
      "type": "labels:byPrefix",
      "prefix": "elec",
      "limit": 100,
      "labelFilter": {
        "type": "labelFilter:acceptLabels",
        "labels": {
          "type": "labels:reference",
          "use": "documentLabels"
        }
      }
    }
  },
  "output": {
    "stages": [
      "prefixLabels"
    ]
  }
}

The request starts by selecting documents matching the set:math query and extracting a sizeable set of labels from those documents. Then, the request searches for labels starting with the elec prefix, but limits the results to the labels appearing in the document set.
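The inverse case can be sketched with labelFilter:rejectLabels. The fragment below assumes that this filter takes the same labels property as labelFilter:acceptLabels. That is an assumption, since this section shows only the accept variant.

```json
{
  "type": "labelFilter:rejectLabels",
  "labels": {
    "type": "labels:reference",
    "use": "documentLabels"
  }
}
```

Used as the labelFilter of the prefixLabels stage, this filter would be expected to keep only the elec-prefixed labels that do not appear in the set:math documents.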