Lingo3G parameters

You can tune various aspects of Lingo3G clustering by changing some of the parameters that control the algorithm.

Below is the list of algorithm parameters along with their default values. When pasting the JSON to your REST API requests, choose one of the available configuration variants where noted.

{
  "dictionaries": {
    "synonyms": [],
    "tags": []
  },
  "incremental": {
    "unknownWordHandlingStrategy": "ignore-cluster"
  },
  "license": null
}

allowOneDocumentClusters

Type
Boolean
Default
false
Path
clusters.allowOneDocumentClusters
Java snippet
algorithmInstance.clusters.allowOneDocumentClusters

When enabled, the algorithm will not prune clusters containing only one document.

Tip: For collections larger than 100 documents, to get one-document clusters, you also need to set wordDfThresholdScalingFactor and phraseDfThresholdScalingFactor to 0.0.

Tip: When one-document clusters are allowed, the number of larger clusters may decrease. To obtain more larger clusters while keeping the one-document ones, increase maxClusteringPassesTop and maxClusteringPassesSub or set them to 0.

Performance impact: medium.
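
The two tips above can be combined into a single request fragment. The sketch below only assembles the JSON; the parameter paths come from this reference, while the way the fragment is embedded in a full REST request is not shown and depends on your setup.

```python
import json

# Parameter paths (clusters.*, documents.*, hierarchy.*) are taken from this
# reference. The surrounding request structure is intentionally left out.
one_document_clusters = {
    "clusters": {"allowOneDocumentClusters": True},
    "documents": {
        # Needed for collections larger than 100 documents (first tip).
        "wordDfThresholdScalingFactor": 0.0,
        "phraseDfThresholdScalingFactor": 0.0,
    },
    "hierarchy": {
        # 0 lifts the limit on clustering passes (second tip).
        "maxClusteringPassesTop": 0,
        "maxClusteringPassesSub": 0,
    },
}

print(json.dumps(one_document_clusters, indent=2))
```

Setting both pass limits to 0 is only one of the options mentioned in the tip; increasing them to a higher non-zero value is the alternative.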

combinedClusterScoreBalance

Type
Double
Default
0.5
Constraints
value >= 0.0 and value <= 1.0
Path
clusters.combinedClusterScoreBalance
Java snippet
algorithmInstance.clusters.combinedClusterScoreBalance

Decides whether document count or cluster label score should have larger impact on the cluster score. Setting this parameter to 0.5 will cause the clustering engine to assign equal weight to document count and cluster label score during cluster score calculation. A value equal to 1.0 will cause the clustering engine to use only document count for cluster scoring. Similarly, with the 0.0 value, only the cluster label score will be used.

Performance impact: none
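
One way to picture the balance is as a linear blend of the two scores. This is an illustrative assumption that matches the three documented cases (0.0, 0.5 and 1.0); the actual formula used by Lingo3G is not stated here, and the function and argument names are hypothetical.

```python
def combined_cluster_score(document_count_score, label_score, balance=0.5):
    """Blend two scores; linear interpolation is an assumption, not the
    documented Lingo3G formula."""
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must be between 0.0 and 1.0")
    return balance * document_count_score + (1.0 - balance) * label_score


# balance = 1.0 uses only the document count score,
# balance = 0.0 uses only the label score.
print(combined_cluster_score(0.8, 0.2, balance=1.0))
print(combined_cluster_score(0.8, 0.2, balance=0.0))
print(combined_cluster_score(0.8, 0.2, balance=0.5))
```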

maxClusterSize

Type
Double
Default
0.4
Constraints
value >= 0.0 and value <= 1.0
Path
clusters.maxClusterSize
Java snippet
algorithmInstance.clusters.maxClusterSize

Determines the maximum allowed size of a cluster in relation to the parent cluster size. For example, a value of 0.4 means that clusters must not contain more than 40% of the parent cluster's documents (of all documents in case of top-level clusters). This parameter is meaningful only if documentCountLabelScorerWeight is greater than 0.

Performance impact: none

minClusterSize

Type
Double
Default
0
Constraints
value >= 0.0 and value <= 1.0
Path
clusters.minClusterSize
Java snippet
algorithmInstance.clusters.minClusterSize

Determines the minimum allowed size of a cluster in relation to the parent cluster size. For example, a value of 0.4 means that clusters must not contain fewer than 40% of the parent cluster's documents (of all documents in case of top-level clusters). This parameter is meaningful only if documentCountLabelScorerWeight is greater than 0.

Performance impact: none
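
The relative size limits set by maxClusterSize and minClusterSize can be read as a simple range check on the fraction of the parent's documents. The sketch below is a minimal illustration with hypothetical function and argument names; it treats both bounds as inclusive, which follows the "must not contain more than" and "must not contain fewer than" wording.

```python
def cluster_size_allowed(cluster_docs, parent_docs,
                         min_cluster_size=0.0, max_cluster_size=0.4):
    """Check a cluster's size against the relative size limits.

    parent_docs is the parent cluster's document count, or the total
    number of input documents for top-level clusters.
    """
    if parent_docs <= 0:
        raise ValueError("parent_docs must be positive")
    fraction = cluster_docs / parent_docs
    return min_cluster_size <= fraction <= max_cluster_size


print(cluster_size_allowed(40, 100))   # exactly 40% of the parent
print(cluster_size_allowed(41, 100))   # above the 40% limit
```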

normalizeScores

Type
Boolean
Default
true
Path
clusters.normalizeScores
Java snippet
algorithmInstance.clusters.normalizeScores

Cluster and label score normalization switch. When switched on, the clustering engine will normalize cluster and label scores so that they fall in the 0.0 to 1.0 range.

Performance impact: none

Results impact: As the value of this parameter does not have any impact on the order and structure of clusters generated by the clustering engine, this switch will be useful only for applications that depend on absolute values of cluster or label scores.

preciseDocumentAssignment

Type
Boolean
Default
false
Path
clusters.preciseDocumentAssignment
Java snippet
algorithmInstance.clusters.preciseDocumentAssignment

When precise document assignment is switched off, clusters with multi-word labels will contain all documents that contain the label's words in any order and at any position. When precise document assignment is switched on, only documents containing all of the cluster label's words close to each other (but still in any order) will be placed in the cluster.

The level of proximity between words enforced by this setting can be configured by the preciseDocumentAssignmentSlopMultiplier and preciseDocumentAssignmentSlopOffset attributes. The window in which all label words must occur in the document is defined as follows: numberOfLabelWords * multiplier + offset. For example, if the label consists of 3 words, multiplier is 2 and offset is 1, all words of the label must appear in the document within a window of 3 * 2 + 1 = 7 consecutive words (possibly separated by non-label words).

Performance impact: medium
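
The window formula above can be sketched as follows. The formula itself comes from the description; the proximity check is a simplified illustration, and rounding a fractional window down is an assumption (the default multiplier of 1.5 can produce fractional windows, and the rounding rule is not documented here). All function names are hypothetical.

```python
import math


def label_window(label_word_count, multiplier=1.5, offset=0):
    """Window size: numberOfLabelWords * multiplier + offset."""
    return label_word_count * multiplier + offset


def words_within_window(tokens, label_words, window):
    """True if all label words occur, in any order, within `window`
    consecutive tokens. Fractional windows are rounded down (assumption)."""
    size = math.floor(window)
    wanted = set(label_words)
    for start in range(0, max(1, len(tokens) - size + 1)):
        if wanted <= set(tokens[start:start + size]):
            return True
    return False


tokens = "text mining and large scale data analysis".split()
print(label_window(3, multiplier=2, offset=1))
print(words_within_window(tokens, ["data", "mining"], label_window(2, 2, 1)))
print(words_within_window(tokens, ["data", "mining"], label_window(2)))
```

In the sample text, "mining" and "data" are four positions apart, so they fit in a window of 5 words but not in the default window of 3.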

preciseDocumentAssignmentSlopMultiplier

Type
Double
Default
1.5
Constraints
value >= 1.0 and value <= 10.0
Path
clusters.preciseDocumentAssignmentSlopMultiplier
Java snippet
algorithmInstance.clusters.preciseDocumentAssignmentSlopMultiplier

Configures the level of proximity of words enforced by the preciseDocumentAssignment setting. Please see the description of the preciseDocumentAssignment attribute for details.

preciseDocumentAssignmentSlopOffset

Type
Integer
Default
0
Constraints
value >= 0 and value <= 10
Path
clusters.preciseDocumentAssignmentSlopOffset
Java snippet
algorithmInstance.clusters.preciseDocumentAssignmentSlopOffset

Configures the level of proximity of words enforced by the preciseDocumentAssignment setting. Please see the description of the preciseDocumentAssignment attribute for details.

labelFilters

Type
com.carrotsearch.lingo3g.parameters.LabelMatcher[]
Default
[]
Path
dictionaries.labelFilters
Java snippet
algorithmInstance.dictionaries.labelFilters

Additional label filtering (scoring) dictionaries.

One or more dictionaries can be supplied. The default implementation in com.carrotsearch.lingo3g.parameters.LabelMatcher supports multiple label matching rules.

REST-style example using the default implementation:

"labelFilters": [{
   "exact": ["Cluster Label 1", "Foo Bar"],
   "glob": [
     "lemon *",
     "? mining"
   ],
   "regexp": [
     "(?i).+pattern1.+",
     "(?i).+[0-9]{2}.+"
   ]
 }]

synonyms

Type
com.carrotsearch.lingo3g.parameters.SynonymSet[]
Default
[]
Path
dictionaries.synonyms
Java snippet
algorithmInstance.dictionaries.synonyms

Additional synonym dictionaries.

One or more dictionaries can be supplied. Note that, unlike the com.carrotsearch.lingo3g.parameters.LabelMatcher, synonym dictionaries only support glob-type rules.

REST-style example using the default implementation:

"synonyms": [{
   "label": "Citrus",
   "glob": [
     "orange peel",
     "lemon peel"
   ]
 }]

tags

Type
com.carrotsearch.lingo3g.parameters.Tag[]
Default
[]
Path
dictionaries.tags
Java snippet
algorithmInstance.dictionaries.tags

A set of tags and words they apply to.

REST-style example using the default implementation:

"dictionaries": {
   "tags": [{
       "tag": "fnc",
       "words": ["peel"]
     }]
 }

accentFolding

Type
Boolean
Default
true
Path
documents.accentFolding
Java snippet
algorithmInstance.documents.accentFolding

Converts accented characters to their basic ASCII counterparts. When accent folding is switched on, all accents (like 'ü', 'ç', 'ó') will be internally replaced with their ASCII counterparts ('u', 'c', 'o'). This can be used to make words like "Bücher" and "Bucher" equivalent.

Performance impact: high
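
The effect of accent folding can be approximated with Unicode decomposition. This is a generic sketch of the technique, not Lingo3G's own implementation, which may cover more characters; the function name is hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks after canonical decomposition (NFD).

    Characters without a decomposition (for example 'ß' or 'ø') are
    left unchanged by this simple approach.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("Bücher"))
print(fold_accents("Bücher") == fold_accents("Bucher"))
```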

boostFields

Type
String[]
Default
[]
Path
documents.boostFields
Java snippet
algorithmInstance.documents.boostFields

Specifies a list of document field names that are boosted by the boostedFieldScorerWeight attribute. Content of the fields listed in this attribute can be given more weight during clustering.

nominal

Type
com.carrotsearch.lingo3g.parameters.ClusterScoringFields$NominalClusterScoringField[]
Default
[]
Path
documents.clusterScoringFields.nominal
Java snippet
algorithmInstance.documents.clusterScoringFields.nominal

An array of field names to be used with nominal scoring.

numeric

Type
com.carrotsearch.lingo3g.parameters.ClusterScoringFields$NumericClusterScoringField[]
Default
[]
Path
documents.clusterScoringFields.numeric
Java snippet
algorithmInstance.documents.clusterScoringFields.numeric

An array of field names to be used with numeric scoring.

dashedWordsSynonymMarkerEnabled

Type
Boolean
Default
true
Path
documents.dashedWordsSynonymMarkerEnabled
Java snippet
algorithmInstance.documents.dashedWordsSynonymMarkerEnabled

When switched on, the clustering engine will treat as synonymous words that are written together or separated by a space (' '), period ('.'), slash ('/') or dash ('-'), along with the corresponding phrases. For example, "data-mining", "data.mining", "datamining", "data/mining" and "data mining" will all be treated as equivalent.

Performance impact: medium
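
A simple way to see why these variants are equivalent is to reduce each of them to a common key. The sketch below is an illustration only (how Lingo3G marks such synonyms internally is not described here), and the helper name is hypothetical.

```python
import re

# The four separators named in the description: space, period, slash, dash.
_SEPARATORS = re.compile(r"[ ./-]")


def joined_form(phrase):
    """Remove the separators and lowercase, giving one key per variant group."""
    return _SEPARATORS.sub("", phrase).lower()


variants = ["data-mining", "data.mining", "datamining", "data/mining", "data mining"]
print({joined_form(v) for v in variants})
```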

dictionarySynonymMarkerEnabled

Type
Boolean
Default
true
Path
documents.dictionarySynonymMarkerEnabled
Java snippet
algorithmInstance.documents.dictionarySynonymMarkerEnabled

When switched on, the clustering engine will apply synonyms defined in synonym dictionaries.

Performance impact: medium

maxTokensPerDocument

Type
Integer
Default
0
Constraints
value >= 0 and value <= 10000
Path
documents.maxTokensPerDocument
Java snippet
algorithmInstance.documents.maxTokensPerDocument

Maximum tokens per document to read. Determines the maximum number of tokens (words) the clustering engine will read from each input document. When this parameter is set to 0, all tokens will be read.

Performance impact: high

maxWordDf

Type
Double
Default
1
Constraints
value >= 0.0 and value <= 1.0
Path
documents.maxWordDf
Java snippet
algorithmInstance.documents.maxWordDf

Maximum word document frequency. The maximum document frequency allowed for words as a fraction of all documents. Words with larger document frequency will be ignored.

For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.

This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, for example clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.

Performance impact: low
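
The upper document frequency limit can be illustrated with a small filter. The word counts below are made-up sample data and the function name is hypothetical; the rule itself (ignore words appearing in more than the given fraction of documents) follows the description above.

```python
def words_within_max_df(word_df, document_count, max_word_df=1.0):
    """Keep words whose document frequency, as a fraction of all
    documents, does not exceed max_word_df."""
    return {
        word
        for word, df in word_df.items()
        if df / document_count <= max_word_df
    }


# Hypothetical document frequencies in a collection of 100 documents.
word_df = {"acme": 95, "search": 41, "clustering": 30}
print(words_within_max_df(word_df, 100, max_word_df=0.4))
```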

phraseDfThresholdScalingFactor

Type
Double
Default
0.2
Constraints
value >= 0.0 and value <= 5.0
Path
documents.phraseDfThresholdScalingFactor
Java snippet
algorithmInstance.documents.phraseDfThresholdScalingFactor

Phrase-level Document Frequency (DF) cut-off scaling factor. This factor is used to compute the minimum document frequency (DF) threshold for phrases (longer than one word), relative to the number of input documents, according to the formula below:

df = floor((documents on input) * threshold / 100)

So for threshold=0.2 the DF cut-off will increase by 0.2 every 100 documents. This means that an input of, for example, 2500 documents, will have minimum phrase df set to floor(2500 * 0.2 / 100) = 5 and any phrases appearing in fewer than 5 documents will be ignored.

Performance impact: very high

Results impact: Setting low values for this parameter will preserve infrequent phrases, which can result in more accurate clustering (especially at subcluster level), at the cost of slower processing. Setting high values of this parameter will increase performance at the cost of lower clustering accuracy.
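
The threshold formula above translates directly into code. The sketch below uses a hypothetical function name and simply evaluates the documented formula for a few collection sizes.

```python
import math


def min_document_frequency(document_count, scaling_factor):
    """df = floor(document_count * scaling_factor / 100)"""
    return math.floor(document_count * scaling_factor / 100)


# Default phrase-level scaling factor is 0.2.
for documents in (100, 500, 2500):
    print(documents, min_document_frequency(documents, 0.2))
```

With the default factor, the cut-off stays at 0 for small collections and reaches 5 at 2500 documents, which matches the worked example.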

useBuiltInWordDatabaseForStemming

Type
Boolean
Default
false
Path
documents.useBuiltInWordDatabaseForStemming
Java snippet
algorithmInstance.documents.useBuiltInWordDatabaseForStemming

Use built-in word database for stemming. If enabled, Lingo3G will use the built-in word inflection and part of speech database rather than an algorithmic stemmer.

Stemmers or word inflection databases transform various forms of a word to one common root. This is required to make sure that a cluster labeled e.g. Programming contains documents referencing all variants of the word, such as programs, programmer or programmed.

Results impact: Algorithmic stemming tends to be more aggressive compared to stemming based on word inflection dictionaries shipping with Lingo3G. This means that with algorithmic stemming all the following forms: program, programming, programmer and programmable will be treated as the same concept, while with the word database based stemming, they will be treated as separate, different concepts. As a result, with algorithmic stemming, a cluster labeled Program will contain documents referring to any of program, programs, programming, programmer and programmable, while with the word database based stemming, the cluster will contain only documents referring to program and programs.

Enabling this option is recommended only when it is important to distinguish between slight variations of the same general concept, e.g. programming and program.

It is possible to disable heuristic stemming by setting the useHeuristicStemming attribute to false, but still apply the dictionary-based stemming (by enabling this option).

Performance impact: small.

useHeuristicStemming

Type
Boolean
Default
true
Path
documents.useHeuristicStemming
Java snippet
algorithmInstance.documents.useHeuristicStemming

This option enables or disables algorithmic stemming. The useBuiltInWordDatabaseForStemming attribute contains relevant discussion on how stemming affects clustering results.

Performance impact: small.

wordDfThresholdScalingFactor

Type
Double
Default
0.7
Constraints
value >= 0.0 and value <= 5.0
Path
documents.wordDfThresholdScalingFactor
Java snippet
algorithmInstance.documents.wordDfThresholdScalingFactor

Word-level Document Frequency (DF) cut-off scaling factor. This factor is used to compute the minimum document frequency (DF) threshold for words, relative to the number of input documents, according to the formula below:

df = floor((documents on input) * threshold / 100)

So for threshold=1 the DF cut-off will increase by 1.0 every 100 documents. This means that an input of, for example, 350 documents, will have minimum word df set to floor(350 * 1 / 100) = 3 and any words appearing in fewer than 3 documents will be ignored.

Performance impact: very high

Results impact: Setting low values for this parameter will preserve infrequent words, which can result in more accurate clustering (especially at subcluster level), at the cost of slower processing. Setting high values of this parameter will increase performance at the cost of lower clustering accuracy.

clusterCountBase

Type
Integer
Default
7
Constraints
value >= 2 and value <= 100
Path
hierarchy.clusterCountBase
Java snippet
algorithmInstance.hierarchy.clusterCountBase

The number of clusters discovered in each clustering pass. The higher the value of this parameter, the larger the total number of clusters.

Performance impact: medium

documentCoverageTarget

Type
Double
Default
0.95
Constraints
value >= 0.0 and value <= 1.0
Path
hierarchy.documentCoverageTarget
Java snippet
algorithmInstance.hierarchy.documentCoverageTarget

The percentage of input documents to be put in clusters. Determines the percentage of documents the clustering engine should assign to clusters. After each clustering pass, the clustering engine will check if the required document coverage has been achieved. If so, it will not perform further clustering passes. The required document coverage may not always be achieved, especially if the maximum number of clustering passes is set to a low value. To cause the clustering engine to always perform the maximum number of clustering passes, set the value of this parameter to 1.0.

Performance impact: high

maxClusteringPassesSub

Type
Integer
Default
2
Constraints
value >= 0 and value <= 10
Path
hierarchy.maxClusteringPassesSub
Java snippet
algorithmInstance.hierarchy.maxClusteringPassesSub

Maximum number of clustering passes to perform on subclusters. Determines the maximum number of cluster discovery passes the clustering engine should perform to discover subclusters. The first clustering pass discovers large/more general clusters, while further passes find smaller/more specific clusters. Setting the maximum number of passes to 0 will force the algorithm to stop clustering only when no more subclusters can be created or the documentCoverageTarget has been reached for the parent cluster.

Performance impact: high

Results impact: With the lowest value of this parameter, the clustering engine will discover only the largest clusters, while with higher values, smaller and more specific clusters will also be created. Setting this parameter to 0 will cause the clustering algorithm to create the maximum possible number of subclusters for each cluster.

maxClusteringPassesTop

Type
Integer
Default
4
Constraints
value >= 0 and value <= 10
Path
hierarchy.maxClusteringPassesTop
Java snippet
algorithmInstance.hierarchy.maxClusteringPassesTop

Maximum number of clustering passes to perform on the top hierarchy level. Determines the maximum number of cluster discovery passes the clustering engine should perform to discover the top-level clusters. The first clustering pass discovers large/more general clusters, while further passes find smaller/more specific clusters. Setting the maximum number of passes to 0 will force the algorithm to stop clustering only when no more clusters can be created or the documentCoverageTarget has been reached.

Performance impact: high

Results impact: With the lowest value of this parameter, the clustering engine will discover only the largest clusters, while with higher values, smaller and more specific clusters will also be created. Setting this parameter to 0 will cause the clustering algorithm to create the maximum possible number of clusters.
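
The interplay between the pass limit and documentCoverageTarget can be sketched as a simple loop. This is a simulation of the stopping rules described above, not the engine's code: the input list of cumulative coverage values is hypothetical data, and the end of the list stands for "no more clusters can be created".

```python
def run_passes(coverage_after_pass, max_passes=4, coverage_target=0.95):
    """Return (passes performed, coverage reached).

    coverage_after_pass: cumulative fraction of documents in clusters
    after each pass that could be performed (hypothetical input).
    max_passes == 0 means no limit on the number of passes.
    """
    passes = 0
    coverage = 0.0
    for coverage in coverage_after_pass:
        passes += 1
        if coverage >= coverage_target:
            break  # required document coverage achieved
        if max_passes != 0 and passes >= max_passes:
            break  # pass limit reached before the target
    return passes, coverage


simulated = [0.5, 0.7, 0.85, 0.9, 0.96, 0.99]
print(run_passes(simulated))                       # stops at the pass limit
print(run_passes(simulated, max_passes=0))         # stops at the coverage target
print(run_passes(simulated, max_passes=0, coverage_target=1.0))
```

The last call shows the effect of a coverage target of 1.0: the target is never met in this simulation, so all available passes are performed.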

maxHierarchyDepth

Type
Integer
Default
2
Constraints
value >= 1 and value <= 5
Path
hierarchy.maxHierarchyDepth
Java snippet
algorithmInstance.hierarchy.maxHierarchyDepth

The maximum number of cluster levels to create. Setting this parameter to 1 will disable hierarchical clustering. In such a case, it is also recommended to disable hierarchical merging, which will preserve smaller clusters.

Performance impact: high

maxImprovementIterations

Type
Integer
Default
5
Constraints
value >= 0 and value <= 50
Path
hierarchy.maxImprovementIterations
Java snippet
algorithmInstance.hierarchy.maxImprovementIterations

The number of clustering improvement iterations to perform. Determines the maximum number of clustering improvement cycles the clustering engine should perform. During each cycle, it will examine cluster arrangements similar to the current one, and if any of them is better, the current one will be replaced.

Performance impact: very high

minClusterSizeForSubclusters

Type
Integer
Default
10
Constraints
value >= 3 and value <= 50
Path
hierarchy.minClusterSizeForSubclusters
Java snippet
algorithmInstance.hierarchy.minClusterSizeForSubclusters

The minimum number of documents that must be assigned to a cluster before the clustering engine attempts to create subclusters for that cluster.

Performance impact: high

neighborhoodSize

Type
Integer
Default
20
Constraints
value >= 10 and value <= 200
Path
hierarchy.neighborhoodSize
Java snippet
algorithmInstance.hierarchy.neighborhoodSize

Maximum similar cluster arrangements to examine. Determines the maximum number of similar cluster arrangements the clustering engine should examine during each heuristic improvement cycle. This parameter is meaningful only when maxImprovementIterations is greater than 0.

Performance impact: very high

unknownWordHandlingStrategy

Type
com.carrotsearch.lingo3g.parameters.UnknownWordHandlingStrategy
Default
ignore-cluster
Constraints
value in [ignore-cluster, assign]
Path
incremental.unknownWordHandlingStrategy
Java snippet
algorithmInstance.incremental.unknownWordHandlingStrategy

Handling of unknown words in persistent clusters. Defines how Lingo3G should treat unknown words in labels of persistent clusters. A word is unknown when it occurs in the persistent cluster's label but it is not present in any of the documents being clustered.

The two available options are:

  • ignore-cluster: ignore the persistent cluster as a whole. No documents will be assigned to persistent clusters with unknown words in their labels. This option favours assignment precision at the cost of some potentially relevant documents not being assigned to persistent clusters.
  • assign: ignore the missing word. Documents will be assigned to persistent clusters even if some of their label's words do not occur in the input documents. This option favours assignment recall at the cost of some potentially irrelevant documents being assigned to persistent clusters.

Performance impact: none
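
The difference between the two strategies can be sketched as follows. The function, its arguments and the sample data are hypothetical; the sketch only decides which label words would be used for document assignment under each strategy.

```python
def resolve_label_words(label_words, known_words, strategy="ignore-cluster"):
    """Return the label words to match against documents, or None when
    the persistent cluster is ignored as a whole."""
    unknown = [w for w in label_words if w not in known_words]
    if not unknown:
        return list(label_words)
    if strategy == "ignore-cluster":
        return None  # no documents are assigned to this cluster
    if strategy == "assign":
        # Drop the unknown words and match on the remaining ones.
        return [w for w in label_words if w in known_words]
    raise ValueError(f"unknown strategy: {strategy}")


known = {"solar", "energy", "storage"}
print(resolve_label_words(["solar", "panels"], known))
print(resolve_label_words(["solar", "panels"], known, strategy="assign"))
```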

allowNumbersInLabels

Type
Boolean
Default
true
Path
labels.allowNumbersInLabels
Java snippet
algorithmInstance.labels.allowNumbersInLabels

Allow numbers in labels switch. When switched on, the clustering engine will allow tokens identified as numbers to appear in cluster labels.

Performance impact: low

capitalizeNonFunctionWords

Type
Boolean
Default
true
Path
labels.capitalizeNonFunctionWords
Java snippet
algorithmInstance.labels.capitalizeNonFunctionWords

Capitalize non-function words in labels. When switched on, the clustering engine will capitalize all non-function words in labels. When switched off, each word will appear in labels in the case in which it appeared in the majority of input documents.

Performance impact: low

dashedWordsLabelFilter

Type
Boolean
Default
true
Path
labels.filtering.dashedWordsLabelFilter
Java snippet
algorithmInstance.labels.filtering.dashedWordsLabelFilter

Filters out labels containing words starting or ending in a dash character ('-').

Performance impact: low

dictionaryLabelFilter

Type
Boolean
Default
true
Path
labels.filtering.dictionaryLabelFilter
Java snippet
algorithmInstance.labels.filtering.dictionaryLabelFilter

Removes or boosts labels based on a predefined dictionary of words, phrases and regular expressions. Impact on performance depends on the number of regular expression entries in the label dictionary: the more regular expression entries, the lower the processing speed.

Performance impact: medium to very high

leftCompleteLabelFilter

Type
Boolean
Default
true
Path
labels.filtering.leftCompleteLabelFilter
Java snippet
algorithmInstance.labels.filtering.leftCompleteLabelFilter

Truncated labels filter. Heuristically eliminates truncated cluster labels ("York Restaurants"), replacing them with more complete phrases, "New York Restaurants", based on the context. It is recommended to use this filter in combination with rightCompleteLabelFilter. The strength of truncated label elimination is determined by the labelOverrideThreshold attribute.

Performance impact: medium

minLengthLabelFilter

Type
Boolean
Default
true
Path
labels.filtering.minLengthLabelFilter
Java snippet
algorithmInstance.labels.filtering.minLengthLabelFilter

Filters out labels whose string representation (excluding spaces) is shorter than 3 characters.

Performance impact: low

numberOnlyLabelFilter

Type
Boolean
Default
true
Path
labels.filtering.numberOnlyLabelFilter
Java snippet
algorithmInstance.labels.filtering.numberOnlyLabelFilter

Filters out labels that consist only of numeric tokens.

Performance impact: low

oneLetterWordLabelFilter

Type
Boolean
Default
true
Path
labels.filtering.oneLetterWordLabelFilter
Java snippet
algorithmInstance.labels.filtering.oneLetterWordLabelFilter

Filters out labels containing only one-letter words ("M a f").

Performance impact: low

repeatedWordsLabelFilter

Type
Boolean
Default
false
Path
labels.filtering.repeatedWordsLabelFilter
Java snippet
algorithmInstance.labels.filtering.repeatedWordsLabelFilter

Filters out labels containing repeated words ("New York York").

Performance impact: low

rightCompleteLabelFilter

Type
Boolean
Default
true
Path
labels.filtering.rightCompleteLabelFilter
Java snippet
algorithmInstance.labels.filtering.rightCompleteLabelFilter

Truncated labels filter. Heuristically eliminates truncated cluster labels ("York Restaurants"), replacing them with more complete phrases, "New York Restaurants", based on the context. It is recommended to use this filter in combination with leftCompleteLabelFilter. The strength of truncated label elimination is determined by the labelOverrideThreshold attribute.

Performance impact: medium

trailingGenitiveLabelFilter

Type
Boolean
Default
true
Path
labels.filtering.trailingGenitiveLabelFilter
Java snippet
algorithmInstance.labels.filtering.trailingGenitiveLabelFilter

Filters out phrases ending in a Saxon genitive of an English noun ("Discover World's", "For your computers'").

Performance impact: low
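
Several of the simpler label filters described in this section can be approximated with short predicates. The sketch below is a rough illustration using the examples from the descriptions; the function names are hypothetical, and the real filters likely use tokenization and linguistic data that this code does not (for instance, the genitive check here only looks at the label's ending).

```python
def too_short(label):
    """minLengthLabelFilter: fewer than 3 characters, excluding spaces."""
    return len(label.replace(" ", "")) < 3


def number_only(label):
    """numberOnlyLabelFilter: every token consists of digits (simplified)."""
    tokens = label.split()
    return bool(tokens) and all(t.isdigit() for t in tokens)


def only_one_letter_words(label):
    """oneLetterWordLabelFilter: every word has a single letter."""
    words = label.split()
    return bool(words) and all(len(w) == 1 for w in words)


def has_repeated_words(label):
    """repeatedWordsLabelFilter: some word occurs more than once."""
    words = [w.lower() for w in label.split()]
    return len(set(words)) < len(words)


def has_dash_edge_word(label):
    """dashedWordsLabelFilter: a word starts or ends with a dash."""
    return any(w.startswith("-") or w.endswith("-") for w in label.split())


def ends_in_genitive(label):
    """trailingGenitiveLabelFilter: label ends in 's or s' (simplified)."""
    return label.endswith(("'s", "s'"))


def passes_simple_filters(label):
    checks = (too_short, number_only, only_one_letter_words,
              has_repeated_words, has_dash_edge_word, ends_in_genitive)
    return not any(check(label) for check in checks)


for label in ["New York Restaurants", "M a f", "New York York", "Discover World's"]:
    print(label, passes_simple_filters(label))
```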

labelOverrideThreshold

Type
Double
Default
0.5
Constraints
value >= 0.2 and value <= 1.0
Path
labels.labelOverrideThreshold
Java snippet
algorithmInstance.labels.labelOverrideThreshold

Determines the strength of the truncated label filters. The lowest value means strongest truncated labels elimination, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels.

Performance impact: low

lowercaseFunctionWords

Type
Boolean
Default
true
Path
labels.lowercaseFunctionWords
Java snippet
algorithmInstance.labels.lowercaseFunctionWords

Use lower case for function words in labels. When switched on, the clustering engine will convert all function words in labels into lower case. When switched off, particular function words will appear in labels in the case they appeared in the majority of input documents.

Performance impact: low

maxLabelWords

Type
Integer
Default
8
Constraints
value >= 1 and value <= 8
Path
labels.maxLabelWords
Java snippet
algorithmInstance.labels.maxLabelWords

Determines the maximum label length in words. Labels consisting of more words will not be generated.

Performance impact: none

Results impact: Setting the maximum label length to some lower value (e.g. 2 or 3) may create more general clusters.

This setting can also be useful when the input collection contains duplicate documents. In such cases, Lingo3G may create overlong cluster labels taken directly from the duplicate documents. While the best solution to this problem would be eliminating duplicate documents from input, lowering the maximum label length can serve as a simple workaround.

minLabelWords

Type
Integer
Default
1
Constraints
value >= 1 and value <= 8
Path
labels.minLabelWords
Java snippet
algorithmInstance.labels.minLabelWords

Determines the minimum label length in words. Labels consisting of fewer words will not be generated.

Performance impact: none

Results impact: Setting the minimum label length to some higher value (e.g. 4 or 5) may create more specific clusters.

preferredLabelLength

Type
Double
Default
2.5
Constraints
value >= 0.0 and value <= 8.0
Path
labels.preferredLabelLength
Java snippet
algorithmInstance.labels.preferredLabelLength

Instructs the clustering engine to prefer cluster labels consisting of the specified number of words. The strength of the preference is determined by the preferredLabelLengthDeviation attribute.

Fractional preferred label lengths are also allowed. For example, a preferred label length of 2.5 will result in labels of length 2 and 3 being treated as equally preferred; a value of 2.2 will prefer two-word labels more than three-word ones.

Performance impact: none

preferredLabelLengthDeviation

Type
Double
Default
2.5
Constraints
value >= 0.0 and value <= 20.0
Path
labels.preferredLabelLengthDeviation
Java snippet
algorithmInstance.labels.preferredLabelLengthDeviation

Allowed deviation from the preferred label length. Determines how far the clustering engine is allowed to deviate from the preferredLabelLength. A value of 0.0 allows no deviation: all labels must have the preferred length. Larger values allow more and more deviation, with the value of 20.0 meaning almost no preference at all.

When the preferred label length deviation is 0.0 and the fractional part of the preferred label length is 0.5, then the only allowed label lengths will be the two integers closest to the preferred label length value. For example, if preferred label length deviation is 0.0 and preferred label length is 2.5, the clustering engine will create only labels consisting of 2 or 3 words. If the fractional part of the preferred label length is other than 0.5, only the closest integer label length will be preferred.

Performance impact: none
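
The zero-deviation rule described above can be sketched as a small function. The function name is hypothetical, and only the special case of a deviation of 0.0 is covered; how the preference is weighted for larger deviations is not specified here.

```python
import math


def allowed_lengths_with_zero_deviation(preferred_label_length):
    """Label lengths allowed when preferredLabelLengthDeviation is 0.0."""
    lower = math.floor(preferred_label_length)
    if preferred_label_length - lower == 0.5:
        # Fractional part of exactly 0.5: the two closest integers.
        return [lower, lower + 1]
    # Otherwise: only the closest integer length.
    return [math.floor(preferred_label_length + 0.5)]


for preferred in (2.5, 2.2, 2.8, 3.0):
    print(preferred, allowed_lengths_with_zero_deviation(preferred))
```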

putPromotedLabelsAtHierarchyRoot

Type
Boolean
Default
false
Path
labels.putPromotedLabelsAtHierarchyRoot
Java snippet
algorithmInstance.labels.putPromotedLabelsAtHierarchyRoot

Put promoted labels at hierarchy root. When switched on, labels promoted using the label dictionary will be always put at the top level of the cluster hierarchy. When switched off, promoted labels will not be forced to appear at the hierarchy root and will be placed where they naturally belong, e.g. as subclusters of larger clusters.

Results impact: a lot of labels can get promoted as a result of boosting e.g. proper nouns defined in the built-in POS database. With this option enabled, all such labels will be put at the root of cluster hierarchy, which may result in a clearly visible cluster overlap. For example, clusters Bill Clinton, President Bill Clinton and U.S. President Bill Clinton will all show at the root of the cluster tree, while with this option disabled, only the Bill Clinton cluster would be placed at root of the hierarchy.

Performance impact: low

queryHint

Type
String
Default
null
Path
labels.queryHint
Java snippet
algorithmInstance.labels.queryHint

Query that produced the documents. The query terms can be penalized (see queryWordLabelWeight), which may help the algorithm to create better clusters. Providing the query is optional but desirable.

queryWordLabelWeight

Type
Double
Default
0.5
Constraints
value >= 0.0 and value <= 1.0
Path
labels.queryWordLabelWeight
Java snippet
algorithmInstance.labels.queryWordLabelWeight

Determines the weight of labels containing query words (queryHint). Lower values mean that phrases containing query words are less likely to appear as cluster labels. In particular, the value of 0.0 will totally eliminate query words from cluster labels. The value of 1.0, on the other hand, will cause the clustering engine to treat equally labels with and without query words.

Performance impact: low
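
One way to picture this weight is as a multiplier applied to the score of any label that shares a word with the query. The multiplicative form is an assumption chosen to match the documented endpoints (0.0 eliminates such labels, 1.0 leaves them untouched); the function and argument names are hypothetical.

```python
def weighted_label_score(label, score, query_hint, query_word_label_weight=0.5):
    """Scale the score of labels that contain a query word (assumed
    multiplicative; not the documented Lingo3G formula)."""
    query_words = set(query_hint.lower().split())
    label_words = set(label.lower().split())
    if query_words & label_words:
        return score * query_word_label_weight
    return score


print(weighted_label_score("Data Mining Tools", 1.0, "data mining"))
print(weighted_label_score("Neural Networks", 1.0, "data mining"))
```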

removeRepeatedSynonymsFromLabels

Type
Boolean
Default
true
Path
labels.removeRepeatedSynonymsFromLabels
Java snippet
algorithmInstance.labels.removeRepeatedSynonymsFromLabels

Remove repeated synonyms from labels. When switched on, no synonymous words will appear in a single label. For example, if "photos" and "pictures" are declared synonyms, labels such as "Tiger Photos Pictures" or "Photos and Pictures" will not be generated.

Performance impact: low
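
The check described above can be sketched as a test for two or more words from the same synonym group within one label. The function name and the representation of synonym groups as Python sets are illustrative assumptions.

```python
def has_repeated_synonyms(label, synonym_groups):
    """True if the label contains at least two words from one synonym group."""
    words = {w.lower() for w in label.split()}
    return any(len(words & group) > 1 for group in synonym_groups)


groups = [{"photos", "pictures"}]
for label in ["Tiger Photos Pictures", "Photos and Pictures", "Tiger Photos"]:
    print(label, has_repeated_synonyms(label, groups))
```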

singleWordLabelWeight

Type
Double
Default
0.5
Constraints
value >= 0.0 and value <= 1.0
Path
labels.singleWordLabelWeight
Java snippet
algorithmInstance.labels.singleWordLabelWeight

Determines how willing the clustering engine will be to select single words as cluster labels. The higher the value of this parameter, the more clusters described with single-word labels will be produced.

Performance impact: none

useBuiltInWordDatabaseForLabelFiltering

Type
Boolean
Default
true
Path
labels.useBuiltInWordDatabaseForLabelFiltering
Java snippet
algorithmInstance.labels.useBuiltInWordDatabaseForLabelFiltering

Use built-in word database for label filtering. If enabled, Lingo3G will perform label filtering based on the built-in word databases in addition to word dictionary files.

Results impact: If this option is enabled, Lingo3G should produce better-formed cluster labels. For example, labels consisting of, starting with or ending with a verb or adjective should appear less frequently. However, because of the limitations of the current part of speech tagging model (please see below), enabling this option is also likely to prevent certain well-formed cluster labels, e.g. if the built-in word database mistakes a noun for a verb.

Limitations of the part of speech tagging model. Currently, Lingo3G uses a unigram model for assigning part of speech tags to words. This means that for each word having multiple part of speech tags (such as "program" in English, which, depending on the context, can be either a verb or a noun), one of the available tags needs to be chosen. To do that, Lingo3G employs a heuristic that takes into account the word frequency and the set of part of speech tags the word has. While the heuristic is fairly effective in general, some words may be tagged erroneously. To provide a solution for such cases, the built-in part of speech database tags can be overridden in the user-defined XML word dictionary.

Performance impact: small.
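
If mis-tagged words cause too many well-formed labels to be filtered out, one option is to switch the built-in database off and rely on the word dictionary files alone. A minimal request fragment for that, assuming the nesting follows the Path above:

```json
{
  "labels": {
    "useBuiltInWordDatabaseForLabelFiltering": false
  }
}
```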

license

Type
String
Default
null
Path
license
Java snippet
algorithmInstance.license

An explicit license string. If not provided, default locations are scanned for licenses.

aggressiveCloningControl

Type
Boolean
Default
false
Path
merging.aggressiveCloningControl
Java snippet
algorithmInstance.merging.aggressiveCloningControl

Aggressive cluster cloning control switch. When switched on, the clustering engine will not allow the same label to appear more than once anywhere in the hierarchy, regardless of level. This parameter is meaningful only if cloningControl is switched on.

Performance impact: low

cloningControl

Type
Boolean
Default
true
Path
merging.cloningControl
Java snippet
algorithmInstance.merging.cloningControl

Cluster cloning control switch. When switched on, the clustering engine will not allow the same cluster label to appear both at the top- and subcluster-level of the hierarchy.

Performance impact: low
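
Because aggressiveCloningControl is meaningful only when cloningControl is switched on, the two switches are naturally set together. The fragment below is a sketch of that combination, with the nesting assumed from the merging.* Paths shown for both parameters:

```json
{
  "merging": {
    "cloningControl": true,
    "aggressiveCloningControl": true
  }
}
```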

flatMerging

Type
Boolean
Default
true
Path
merging.flatMerging
Java snippet
algorithmInstance.merging.flatMerging

Flat merging switch. When switched on, the clustering engine will perform cluster merging using a strategy specific for flat (non-hierarchical) clusters. With this strategy the clustering engine will merge only clusters of similar size.

Performance impact: low

hierarchicalMerging

Type
Boolean
Default
true
Path
merging.hierarchicalMerging
Java snippet
algorithmInstance.merging.hierarchicalMerging

Hierarchical merging switch. When switched on, the clustering engine will use a cluster merging strategy specially designed for hierarchical clustering, and will be more eager to move clusters from top-level positions to subclusters. If the algorithm is set to perform flat clustering (maxHierarchyDepth = 1), disabling hierarchical merging is recommended to preserve smaller clusters.

Performance impact: low
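
For flat clustering, the recommendation above translates into the following request fragment. This is a sketch that assumes the nesting follows the Path above. The maxHierarchyDepth setting itself is left out because its Path is not listed in this part of the reference.

```json
{
  "merging": {
    "flatMerging": true,
    "hierarchicalMerging": false
  }
}
```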

hierarchicalMergingWithLabels

Type
Boolean
Default
true
Path
merging.hierarchicalMergingWithLabels
Java snippet
algorithmInstance.merging.hierarchicalMergingWithLabels

Label merging switch. When switched on, the clustering engine will take cluster labels into account during hierarchical merging of clusters. This parameter is meaningful only when hierarchicalMerging is switched on.

Performance impact: low

Results impact: With label merging switched on, the clustering engine may move some additional clusters from the top level to subclusters.

mergeThreshold

Type
Double
Default
0.7
Constraints
value >= 0.0 and value <= 1.0
Path
merging.mergeThreshold
Java snippet
algorithmInstance.merging.mergeThreshold

Cluster merge threshold. If the overlap between clusters is larger than the value of this parameter, these clusters will be merged.

Performance impact: none

Results impact: Low values of this parameter will cause the clustering engine to eagerly merge clusters, which will create larger clusters in which some documents may be irrelevant. High values of this parameter will cause it to merge clusters rarely, which will result in large numbers of small clusters with more relevant documents.
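
To favor a larger number of small, tightly focused clusters, the threshold can be raised above the default of 0.7. In the sketch below the nesting is assumed from the Path above, and 0.85 is an illustrative value only; a suitable setting depends on the collection.

```json
{
  "merging": {
    "mergeThreshold": 0.85
  }
}
```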

preciseHierarchicalMerging

Type
Boolean
Default
false
Path
merging.preciseHierarchicalMerging
Java snippet
algorithmInstance.merging.preciseHierarchicalMerging

Precise hierarchical merging switch. When switched on, the hierarchically merged group will contain only those documents that contain the label of the merged group. Enable this option if you would like to avoid a situation where, due to standard merging, a cluster contains documents in which the cluster's label does not appear.

Performance impact: low

Results impact: With precise hierarchical merging switched on, certain small groups removed from the top level may not re-emerge as children of the large group they were merged into. As a result, some documents of such a group may end up unassigned to any cluster.
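
A request fragment enabling precise hierarchical merging might look like this, again assuming the nesting follows the Path above:

```json
{
  "merging": {
    "preciseHierarchicalMerging": true
  }
}
```

Given the results impact described above, it may be worth comparing the number of unassigned documents with and without this setting.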

boostedFieldScorerWeight

Type
Double
Default
0.6
Constraints
value >= 0.0
Path
scoring.boostedFieldScorerWeight
Java snippet
algorithmInstance.scoring.boostedFieldScorerWeight

Assigns higher scores to labels that contain words appearing in input documents' titles.

Performance impact: low

capitalizedWordLabelScorerWeight

Type
Double
Default
0.1
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.capitalizedWordLabelScorerWeight
Java snippet
algorithmInstance.scoring.capitalizedWordLabelScorerWeight

Assigns higher scores to labels that contain capitalized words.

Performance impact: low

clusterSetDocumentOverlapLabelScorerWeight

Type
Double
Default
1
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.clusterSetDocumentOverlapLabelScorerWeight
Java snippet
algorithmInstance.scoring.clusterSetDocumentOverlapLabelScorerWeight

Assigns higher scores to labels that cover documents not present in the current cluster set.

Performance impact: low

dictionaryWeightLabelScorerWeight

Type
Double
Default
1
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.dictionaryWeightLabelScorerWeight
Java snippet
algorithmInstance.scoring.dictionaryWeightLabelScorerWeight

Boosts label scores by a factor specified in the label dictionary file. If this scorer has weight 0, label boosting will not be applied.

Performance impact: low

documentCountLabelScorerWeight

Type
Double
Default
1
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.documentCountLabelScorerWeight
Java snippet
algorithmInstance.scoring.documentCountLabelScorerWeight

Assigns higher scores to clusters whose number of documents in relation to the total number of documents is equal to or smaller than specified by the maxClusterSize parameter.

Performance impact: low

grammaticalVariantLabelScorerWeight

Type
Double
Default
1
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.grammaticalVariantLabelScorerWeight
Java snippet
algorithmInstance.scoring.grammaticalVariantLabelScorerWeight

Strength of penalization of the less frequent variants of stem-equivalent labels. For example, if the input documents contain phrases "Fuel efficiency" and "Fuel efficient", the less frequent phrase variant will be less likely to appear as a cluster label.

When the value of this parameter is 1.0, the less frequent phrases will be penalized proportionally to the difference between the frequency of that phrase and that of the most frequent variant. Lower values will decrease the penalty; setting the value to 0.0 will cause Lingo3G to treat all grammatical variants equally.

Performance impact: low
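
For instance, to treat variants such as "Fuel efficiency" and "Fuel efficient" equally, the penalty can be switched off with a value of 0.0. The fragment is a sketch whose nesting is assumed from the Path above:

```json
{
  "scoring": {
    "grammaticalVariantLabelScorerWeight": 0.0
  }
}
```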

queryWordLabelScorerWeight

Type
Double
Default
0.1
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.queryWordLabelScorerWeight
Java snippet
algorithmInstance.scoring.queryWordLabelScorerWeight

Penalizes labels that contain query words.

Performance impact: low

tfDfRatioLabelScorerWeight

Type
Double
Default
0.2
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.tfDfRatioLabelScorerWeight
Java snippet
algorithmInstance.scoring.tfDfRatioLabelScorerWeight

Assigns higher scores to more general/shorter labels.

Performance impact: low

tfLabelScorerWeight

Type
Double
Default
1
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.tfLabelScorerWeight
Java snippet
algorithmInstance.scoring.tfLabelScorerWeight

Assigns higher scores to labels with higher Term Frequency (TF).

Performance impact: low

unindexedWordLabelScorerWeight

Type
Double
Default
0.1
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.unindexedWordLabelScorerWeight
Java snippet
algorithmInstance.scoring.unindexedWordLabelScorerWeight

Penalizes labels that contain too many function words.

Performance impact: low

wordCountLabelScorerWeight

Type
Double
Default
1
Constraints
value >= 0.0 and value <= 1.0
Path
scoring.wordCountLabelScorerWeight
Java snippet
algorithmInstance.scoring.wordCountLabelScorerWeight

Assigns higher scores to labels that consist of 2, 3 or 4 words. Longer labels are penalized: the longer the label, the higher the penalty.

Performance impact: low
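
All scorer weights share the scoring prefix in their Paths, so several of them can presumably be adjusted in a single object. The sketch below assumes that nesting and uses illustrative values: it sets dictionaryWeightLabelScorerWeight to 0.0, which according to its description disables dictionary-based label boosting, and raises boostedFieldScorerWeight from its default of 0.6 to give title words more influence.

```json
{
  "scoring": {
    "boostedFieldScorerWeight": 1.0,
    "dictionaryWeightLabelScorerWeight": 0.0
  }
}
```

Weights not listed in the request are expected to keep their default values.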