Features

Features are small units of text, such as words or phrases, that characterize the content of a document. Lingo4G uses features to perform all analytical processing, including clustering, 2D embedding, and finding similar documents.

A feature is a piece of text that represents an entity or a concept present in the document. Each occurrence of a feature consists of two elements:

label

A textual, human-friendly representation of the feature. Typically, the label will be short: a word or a short phrase. Lingo4G uses feature labels as unique identifiers, so features with exactly the same label are considered identical.

occurrence context (surface form)

A fragment of the source document's text in which the feature is present or to which it is related. This text may be the same as the feature's label, but it may also be a synonym or entirely different content (for example, the abbreviation 2HCl for the full label histamine dihydrochloride).

Features and labels

In Lingo4G, the terms feature and label are used interchangeably. Note, however, that one feature (label) may represent many different occurrence contexts (surface forms).
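The relationship between labels and surface forms can be illustrated with a minimal Python sketch. It is not part of Lingo4G's API: the `Occurrence` class and the `group_by_label` function are hypothetical names introduced here only to show that occurrences sharing a label collapse into a single feature.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """One occurrence of a feature in a document (hypothetical structure)."""
    label: str         # human-friendly label, doubles as the feature's identity
    surface_form: str  # the text fragment actually found in the document


def group_by_label(occurrences):
    """Group surface forms under their label; equal labels mean the same feature."""
    features = defaultdict(set)
    for occ in occurrences:
        features[occ.label].add(occ.surface_form)
    return dict(features)


occurrences = [
    Occurrence("histamine dihydrochloride", "histamine dihydrochloride"),
    Occurrence("histamine dihydrochloride", "2HCl"),
    Occurrence("clustering", "clustering"),
]
features = group_by_label(occurrences)
```

Here the three occurrences yield two features, because the first two share the same label even though their surface forms differ.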

Good features should accurately represent the topics and entities present in a document and be specific to that document, that is, discriminate it well from at least the majority of other documents. So where do we get features from? Certain domains (like chemistry or law) may come with existing term ontologies and vocabularies that should be used as the source of potential feature patterns. In the vast majority of cases, however, features are not known in advance and have to be discovered automatically from the documents present in Lingo4G's index.
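To make the specificity criterion more concrete, the sketch below filters candidate features by document frequency: a candidate that occurs in most documents discriminates poorly and is dropped. This is only an illustration of the idea, not Lingo4G's actual feature discovery algorithm; the `specific_features` function and the `max_df_ratio` threshold are assumptions made for this example.

```python
from collections import Counter


def specific_features(doc_features, max_df_ratio=0.5):
    """Keep, per document, the labels found in at most max_df_ratio of all documents."""
    n_docs = len(doc_features)
    # Document frequency: the number of documents in which each label occurs.
    df = Counter()
    for labels in doc_features.values():
        df.update(set(labels))
    return {
        doc_id: sorted(
            label for label in set(labels) if df[label] / n_docs <= max_df_ratio
        )
        for doc_id, labels in doc_features.items()
    }


docs = {
    "d1": ["study", "histamine dihydrochloride"],
    "d2": ["study", "patent law"],
    "d3": ["study", "clustering"],
}
result = specific_features(docs)
```

In this toy collection, "study" occurs in every document and is filtered out, while each remaining label is kept because it is specific to a single document.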

Lingo4G comes with feature extractors covering both automated feature extraction and existing feature dictionaries. Feature extractors are defined in the indexer section of the project descriptor, and we discuss their details there.