Stemming

Stemmers (word conflation algorithms) normalize the different surface forms of a word (concept) into a single common root (lemma). In English, for example, plurals are conflated into their singular forms (flowers → flower).

This step is required to make sure that a cluster labeled Programming contains documents referencing all variants of the word, such as programs, programmer or programmed.

Exact stemming (lemmatisation)

Lingo3G comes with built-in exact mappings between surface forms of words and their lemmas for several languages (English, Dutch). These dictionaries are precompiled into space- and lookup-efficient data structures and cannot be modified by users.
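
As a rough illustration of what an exact mapping does, the sketch below looks up a surface form in a small table. The table contents and the exact_lemma function are hypothetical and exist only for this example; they are not Lingo3G's dictionaries or API, and the real data structures are precompiled rather than plain hash maps.

```python
# Hypothetical, tiny surface-form -> lemma table (not Lingo3G's actual data).
LEMMAS = {
    "flowers": "flower",
    "mice": "mouse",
    "went": "go",
}


def exact_lemma(word, table=LEMMAS):
    """Return the lemma listed for the word, or None if the word is unknown."""
    return table.get(word.lower())


print(exact_lemma("Mice"))   # mouse
print(exact_lemma("blorf"))  # None
```

An exact mapping handles irregular forms such as mice correctly, but it has nothing to say about words that are missing from the table.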

It is possible to turn off these built-in dictionaries entirely by setting the useBuiltInWordDatabaseForStemming parameter to false.

Heuristic stemming

Heuristic stemming is an algorithmic approach to word conflation. The algorithm follows a predefined set of rules to derive the lemma from an inflected surface form.

The advantage of heuristics is that they apply to any word, even one that is very rare or the result of novel word formation. The downside is that, as with any heuristic, the algorithm can make occasional errors.
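
The trade-off can be seen in a deliberately naive suffix-stripping sketch. The rule list and the heuristic_stem function below are assumptions made for illustration; they are far simpler than the Snowball stemmers and are not what Lingo3G runs.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def heuristic_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


print(heuristic_stem("flowers"))  # flower
print(heuristic_stem("tweets"))   # tweet (works for words no dictionary lists)
print(heuristic_stem("news"))     # new (an error: the rule fires blindly)
```

The rules conflate tweets without any dictionary entry, yet the same rule wrongly turns news into new, which is the kind of occasional error mentioned above.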

Lingo3G uses state-of-the-art stemming implementations from the Apache Lucene project, which in turn are compiled from the Snowball project.

It is possible to turn off heuristic stemming by setting the useHeuristicStemming parameter to false.
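
To make the effect of the two switches concrete, here is a minimal sketch of one way a dictionary lookup and a heuristic stemmer could be combined. The order (dictionary first, heuristics as a fallback) is an assumption, not documented Lingo3G behavior, and the conflate function, its use_dictionary and use_heuristics arguments, and the helper names are hypothetical stand-ins that only loosely mirror useBuiltInWordDatabaseForStemming and useHeuristicStemming.

```python
# Illustrative only: the lookup order and the fallback are assumptions.
LEMMAS = {"mice": "mouse", "flowers": "flower"}  # hypothetical exact mappings


def strip_plural(word):
    """Stand-in for a real heuristic stemmer: drops a trailing "s"."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def conflate(word, use_dictionary=True, use_heuristics=True):
    """Try the exact mapping first, then the heuristic, else return the word."""
    word = word.lower()
    if use_dictionary and word in LEMMAS:
        return LEMMAS[word]
    if use_heuristics:
        return strip_plural(word)
    return word


print(conflate("mice"))                          # mouse
print(conflate("mice", use_dictionary=False))    # mice
print(conflate("tweets"))                        # tweet
print(conflate("tweets", use_heuristics=False))  # tweets
```

With both switches off in this sketch, every surface form is left as it is, so variants of the same word are no longer conflated.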