Lingo3G parameters
You can tune various aspects of Lingo3G clustering by changing some of the parameters that control the algorithm.
The following list presents the available parameters along with their default values.
{}
clusters
- Type
- com.carrotsearch.lingo3g.parameters.Clusters
- Default
- l3g::Clusters
- Path
- clusters
- Java snippet
- algorithmInstance.clusters
Cluster and document assignment control parameters.
{}
allowOneDocumentClusters
- Type
- Boolean
- Default
- false
- Path
- clusters.allowOneDocumentClusters
- Java snippet
- algorithmInstance.clusters.allowOneDocumentClusters
When enabled, the algorithm will not prune clusters containing only one document.
Tip: For collections larger than 100 documents, to get one-document clusters, you also need to set wordDfThresholdScalingFactor and phraseDfThresholdScalingFactor to 0.0.
Tip: When one-document clusters are allowed, the number of larger clusters may decrease. To obtain a greater number of larger clusters while keeping the one-document ones, increase maxClusteringPassesTop and maxClusteringPassesSub or set them to 0.
Performance impact: medium.
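The two tips above can be combined into a single parameter set. The sketch below only assembles the JSON using the attribute paths listed on this page; how the JSON is delivered to Lingo3G (REST request body, Java setters, and so on) is not shown, and the variable name is illustrative.

```python
import json

# Parameter paths come from this page; the delivery mechanism is out of scope.
one_document_cluster_params = {
    "clusters": {"allowOneDocumentClusters": True},
    "documents": {
        # Needed for collections larger than 100 documents (first tip).
        "wordDfThresholdScalingFactor": 0.0,
        "phraseDfThresholdScalingFactor": 0.0,
    },
    "hierarchy": {
        # 0 removes the limit on clustering passes (second tip).
        "maxClusteringPassesTop": 0,
        "maxClusteringPassesSub": 0,
    },
}

print(json.dumps(one_document_cluster_params, indent=2))
```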
combinedClusterScoreBalance
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- clusters.combinedClusterScoreBalance
- Java snippet
- algorithmInstance.clusters.combinedClusterScoreBalance
Decides whether document count or cluster label score should have larger impact on the cluster score. Setting this parameter to 0.5 will cause the clustering engine to assign equal weight to document count and cluster label score during cluster score calculation. A value equal to 1.0 will cause the clustering engine to use only document count for cluster scoring. Similarly, with the 0.0 value, only the cluster label score will be used.
Performance impact: none
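One way to picture the balance is as a linear blend of the two component scores. The formula below is an assumption made for illustration: the description only fixes the behavior at 0.0, 0.5 and 1.0, and the actual scoring function used by Lingo3G may differ. The function name and the notion of pre-normalized component scores are hypothetical.

```python
def combined_cluster_score(document_count_score, label_score, balance=0.5):
    """Blend two component scores; ASSUMED linear, not confirmed by the docs."""
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must be within [0.0, 1.0]")
    # balance = 1.0 -> document count only, balance = 0.0 -> label score only.
    return balance * document_count_score + (1.0 - balance) * label_score
```

With such a blend, a cluster with many documents but a weak label ranks higher as the balance moves toward 1.0.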
maxClusterSize
- Type
- Double
- Default
- 0.4
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- clusters.maxClusterSize
- Java snippet
- algorithmInstance.clusters.maxClusterSize
Determines the maximum allowed size of a cluster in relation to the parent cluster size. For example, a value of 0.4 means that clusters must not contain more than 40% of the parent cluster's documents (of all documents in case of top-level clusters). This parameter is meaningful only if documentCountLabelScorerWeight is greater than 0.
Performance impact: none
minClusterSize
- Type
- Double
- Default
- 0
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- clusters.minClusterSize
- Java snippet
- algorithmInstance.clusters.minClusterSize
Determines the minimum allowed size of a cluster in relation to the parent cluster size. For example, a value of 0.4 means that clusters must not contain fewer than 40% of the parent cluster's documents (of all documents in case of top-level clusters). This parameter is meaningful only if documentCountLabelScorerWeight is greater than 0.
Performance impact: none
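Because minClusterSize and maxClusterSize are both fractions of the parent cluster, the document-count bounds they imply are simple to compute. The sketch below treats them as hard limits purely for illustration; the function names are hypothetical, and the engine may apply the limits through label scoring rather than as a strict check.

```python
def size_bounds(parent_size, min_cluster_size=0.0, max_cluster_size=0.4):
    """Return (lowest, highest) document counts implied by the two fractions."""
    return min_cluster_size * parent_size, max_cluster_size * parent_size


def is_size_allowed(cluster_size, parent_size,
                    min_cluster_size=0.0, max_cluster_size=0.4):
    """Check a cluster's document count against the implied bounds."""
    low, high = size_bounds(parent_size, min_cluster_size, max_cluster_size)
    return low <= cluster_size <= high
```

For a top-level cluster, parent_size is the total number of input documents.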
normalizeScores
- Type
- Boolean
- Default
- true
- Path
- clusters.normalizeScores
- Java snippet
- algorithmInstance.clusters.normalizeScores
Cluster and label score normalization switch. When switched on, the clustering engine will normalize cluster and label scores so that they fall in the 0.0 to 1.0 range.
Performance impact: none
Results impact: As the value of this parameter does not have any impact on the order and structure of clusters generated by the clustering engine, this switch will be useful only for applications that depend on absolute values of cluster or label scores.
preciseDocumentAssignment
- Type
- Boolean
- Default
- false
- Path
- clusters.preciseDocumentAssignment
- Java snippet
- algorithmInstance.clusters.preciseDocumentAssignment
When precise document assignment is switched off, clusters with multi-word labels will contain all documents that contain the label's words in any order and at any position. When precise document assignment is switched on, only documents containing all of the cluster label's words close to each other (but still in any order) will be placed in the cluster.
The level of proximity between words enforced by this setting can be configured by the preciseDocumentAssignmentSlopMultiplier and preciseDocumentAssignmentSlopOffset attributes.
The window in which all label words must occur in the document is defined as follows: numberOfLabelWords * multiplier + offset. For example, if the label consists of 3 words, multiplier is 2 and offset is 1, all words of the label must appear in the document within a window of 3 * 2 + 1 = 7 consecutive words (possibly separated by non-label words).
Performance impact: medium
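The window formula and the proximity check can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: tokens are compared as plain strings (no stemming or case folding), a fractional window is used without rounding, and both function names are hypothetical.

```python
def slop_window(label_word_count, multiplier, offset):
    """Window size: numberOfLabelWords * multiplier + offset."""
    return label_word_count * multiplier + offset


def within_window(doc_tokens, label_words, window):
    """True if some run of at most `window` consecutive tokens
    contains every label word, in any order."""
    needed = set(label_words)
    counts = {}  # label word -> occurrences inside the current run
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
        # Shrink the run from the left while it is longer than the window.
        while right - left + 1 > window:
            dropped = doc_tokens[left]
            if dropped in needed:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    del counts[dropped]
            left += 1
        if len(counts) == len(needed):
            return True
    return False
```

For the example above, slop_window(3, 2, 1) gives the 7-word window; a document passes only if all three label words fall inside one such run.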
preciseDocumentAssignmentSlopMultiplier
- Type
- Double
- Default
- 1.5
- Constraints
- value >= 1.0 and value <= 10.0
- Path
- clusters.preciseDocumentAssignmentSlopMultiplier
- Java snippet
- algorithmInstance.clusters.preciseDocumentAssignmentSlopMultiplier
Configures the level of proximity of words enforced by the preciseDocumentAssignment
setting. Please see the description of the preciseDocumentAssignment
attribute for details.
preciseDocumentAssignmentSlopOffset
- Type
- Integer
- Default
- 0
- Constraints
- value >= 0 and value <= 10
- Path
- clusters.preciseDocumentAssignmentSlopOffset
- Java snippet
- algorithmInstance.clusters.preciseDocumentAssignmentSlopOffset
Configures the level of proximity of words enforced by the preciseDocumentAssignment
setting. Please see the description of the preciseDocumentAssignment
attribute for details.
dictionaries
- Type
- com.carrotsearch.lingo3g.parameters.EphemeralDictionaries
- Default
- l3g::EphemeralDictionaries
- Path
- dictionaries
- Java snippet
- algorithmInstance.dictionaries
Per-request overrides of language components (dictionaries).
labels
- Type
- com.carrotsearch.lingo3g.parameters.LabelMatcher[]
- Default
- []
- Path
- dictionaries.labels
- Java snippet
- algorithmInstance.dictionaries.labels
Ephemeral (per-request) label filtering dictionaries.
One or more dictionaries can be supplied. The default implementation in com.carrotsearch.lingo3g.parameters.LabelMatcher
supports multiple label matching rules.
REST-style example using the default implementation:
"labels": [{
  "exact": ["Cluster Label 1", "Foo Bar"],
  "glob": [
    "lemon *",
    "? mining"
  ],
  "regexp": [
    "(?).+pattern1.+",
    "(?).+[0-9]{2}.+"
  ]
}]
synonyms
- Type
- com.carrotsearch.lingo3g.parameters.SynonymSet[]
- Default
- []
- Path
- dictionaries.synonyms
- Java snippet
- algorithmInstance.dictionaries.synonyms
Ephemeral synonym dictionaries.
One or more dictionaries can be supplied. Note that, unlike the com.carrotsearch.lingo3g.parameters.LabelMatcher, synonym dictionaries only support glob-type rules.
REST-style example using the default implementation:
"synonyms": [{
  "label": "Citrus",
  "glob": [
    "orange peel",
    "lemon peel"
  ]
}]
documents
- Type
- com.carrotsearch.lingo3g.parameters.Documents
- Default
- l3g::Documents
- Path
- documents
- Java snippet
- algorithmInstance.documents
Document fields, text preprocessing and segmentation control parameters.
{"wordDfThresholdScalingFactor": 0.7}
accentFolding
- Type
- Boolean
- Default
- true
- Path
- documents.accentFolding
- Java snippet
- algorithmInstance.documents.accentFolding
Converts accented characters to their basic ASCII counterparts. When accent folding is switched on, all accents (like 'ü', 'ç', 'ó') will be internally replaced with their ASCII counterparts ('u', 'c', 'o'). This can be used to make words like "Bücher" and "Bucher" equivalent.
Performance impact: high
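The effect of accent folding can be approximated with Unicode decomposition. This is a minimal sketch of the general technique, not the routine Lingo3G uses: it strips combining marks after NFD normalization, so characters without a canonical decomposition (for example 'ß' or 'ø') are left unchanged. The function name is hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Replace accented letters with their base letters where Unicode
    defines a canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks (accents, cedillas, umlauts) have a non-zero class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

After folding, "Bücher" and "Bucher" compare equal, which is the equivalence described above.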
boostFields
- Type
- String[]
- Default
- []
- Path
- documents.boostFields
- Java snippet
- algorithmInstance.documents.boostFields
Specifies the names of document fields to be boosted according to the boostedFieldScorerWeight attribute. Content of the fields listed here can be given more weight during clustering.
dashedWordsSynonymMarkerEnabled
- Type
- Boolean
- Default
- true
- Path
- documents.dashedWordsSynonymMarkerEnabled
- Java snippet
- algorithmInstance.documents.dashedWordsSynonymMarkerEnabled
When switched on, the clustering engine will treat as synonymous words that are written together or separated by a space (' '), period ('.'), slash ('/') or dash ('-'), along with the corresponding phrases, for example: "data-mining", "data.mining", "datamining", "data/mining" and "data mining".
Performance impact: medium
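The equivalence described above can be illustrated by mapping each variant to a common key with the separators removed. This is only a sketch of the idea; the lowercasing step, the regular expression and the function name are assumptions, and the engine's own matching is not documented here.

```python
import re

# The four separators named in the description: space, period, slash, dash.
_SEPARATORS = re.compile(r"[ ./\-]")


def synonym_key(phrase):
    """Collapse separator variants of a phrase into one comparison key."""
    return _SEPARATORS.sub("", phrase.lower())


variants = ["data-mining", "data.mining", "datamining", "data/mining", "data mining"]
keys = {synonym_key(v) for v in variants}
```

All five variants listed in the description end up with the same key, so `keys` holds a single element.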
dictionarySynonymMarkerEnabled
- Type
- Boolean
- Default
- true
- Path
- documents.dictionarySynonymMarkerEnabled
- Java snippet
- algorithmInstance.documents.dictionarySynonymMarkerEnabled
When switched on, the clustering engine will apply synonyms defined in synonym dictionaries.
Performance impact: medium
maxTokensPerDocument
- Type
- Integer
- Default
- 0
- Constraints
- value >= 0
- Path
- documents.maxTokensPerDocument
- Java snippet
- algorithmInstance.documents.maxTokensPerDocument
Maximum tokens per document to read. Determines the maximum number of tokens (words) the clustering engine will read from each input document. When this parameter is set to 0, all tokens will be read.
Performance impact: high
maxWordDf
- Type
- Double
- Default
- 1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- documents.maxWordDf
- Java snippet
- algorithmInstance.documents.maxWordDf
Maximum word document frequency. The maximum document frequency allowed for words as a fraction of all documents. Words with larger document frequency will be ignored.
For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter how many documents they appear in.
This attribute may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such cases, setting maxWordDf to a value lower than 1.0, e.g. 0.9, may improve the clusters.
Another useful application of this attribute is when there is a need to generate only very specific clusters, for example clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.
Performance impact: low
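The header/footer scenario can be sketched with a small document-frequency filter. The function below is an illustration only: documents are assumed to be pre-tokenized lists of words, and the function name is hypothetical.

```python
def words_within_max_df(documents, max_word_df=1.0):
    """Return the words whose document frequency, as a fraction of all
    documents, does not exceed max_word_df."""
    total = len(documents)
    if total == 0:
        return set()
    df = {}
    for tokens in documents:
        for word in set(tokens):  # count each word once per document
            df[word] = df.get(word, 0) + 1
    return {word for word, count in df.items() if count / total <= max_word_df}


docs = [
    ["acme", "data"], ["acme", "mining"], ["acme", "search"],
    ["acme", "data"], ["acme", "text"],
]
kept = words_within_max_df(docs, max_word_df=0.9)
```

Here "acme" occurs in every document, so a limit of 0.9 drops it while the topical words remain.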
phraseDfThresholdScalingFactor
- Type
- Double
- Default
- 0.2
- Constraints
- value >= 0.0
- Path
- documents.phraseDfThresholdScalingFactor
- Java snippet
- algorithmInstance.documents.phraseDfThresholdScalingFactor
Phrase-level Document Frequency (DF) cut-off scaling factor. This factor is used to compute the minimum document frequency (DF) threshold for phrases (longer than one word), relative to the number of input documents, according to the formula below:
df = floor((documents on input) * phraseDfThresholdScalingFactor / 100)
So for phraseDfThresholdScalingFactor=0.2, the DF cut-off will increase by 0.2 every 100 documents. This means that an input of, for example, 2500 documents, will have minimum phrase df set to floor(2500 * 0.2 / 100) = 5 and any phrases appearing in fewer than 5 documents will be ignored.
Performance impact: very high
Results impact: Setting low values for this parameter will preserve infrequent phrases, which can result in more accurate clustering (especially at subcluster level), at the cost of slower processing. Setting high values of this parameter will increase performance at the cost of lower clustering accuracy.
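The cut-off formula is easy to evaluate for a given collection size. The helper below is a direct transcription of the formula above; the function name is hypothetical.

```python
import math


def min_document_frequency(document_count, scaling_factor):
    """df = floor(document_count * scaling_factor / 100)"""
    return math.floor(document_count * scaling_factor / 100)


phrase_cutoff = min_document_frequency(2500, 0.2)
```

For 2500 documents and the default factor of 0.2 this yields the cut-off of 5 quoted above, while a factor of 0.0 yields 0 for any collection size, so no phrase is dropped for being rare.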
useBuiltInWordDatabaseForStemming
- Type
- Boolean
- Default
- false
- Path
- documents.useBuiltInWordDatabaseForStemming
- Java snippet
- algorithmInstance.documents.useBuiltInWordDatabaseForStemming
Use built-in word database for stemming. If enabled, Lingo3G will use the built-in word inflection and part of speech database rather than an algorithmic stemmer.
Stemmers or word inflection databases transform various forms of a word to one common root. This is required to make sure that a cluster labeled e.g. Programming contains documents referencing all variants of the word, such as programs, programmer or programmed.
Results impact: Algorithmic stemming tends to be more aggressive than stemming based on the word inflection dictionaries shipping with Lingo3G. This means that with algorithmic stemming all the following forms: program, programming, programmer and programmable will be treated as the same concept, while with word-database-based stemming they will be treated as separate, different concepts. As a result, with algorithmic stemming, a cluster labeled Program will contain documents referring to all of program, programs, programming, programmer and programmable, while with word-database-based stemming, the cluster will contain only documents referring to program and programs.
Enabling this option is recommended only when it is important to distinguish between slight variations of the same general concept, e.g. programming and program.
It is possible to disable heuristic stemming by setting the useHeuristicStemming attribute to false, but still apply the dictionary-based stemming (by enabling this option).
Performance impact: small.
useHeuristicStemming
- Type
- Boolean
- Default
- true
- Path
- documents.useHeuristicStemming
- Java snippet
- algorithmInstance.documents.useHeuristicStemming
This option enables or disables algorithmic stemming. The useBuiltInWordDatabaseForStemming
attribute contains
relevant discussion on how stemming affects clustering results.
Performance impact: small.
wordDfThresholdScalingFactor
- Type
- Double
- Default
- 0.7
- Constraints
- value >= 0.0
- Path
- documents.wordDfThresholdScalingFactor
- Java snippet
- algorithmInstance.documents.wordDfThresholdScalingFactor
Word-level Document Frequency (DF) cut-off scaling factor. This factor is used to compute the minimum document frequency (DF) threshold for words, relative to the number of input documents, according to the formula below:
df = floor((documents on input) * wordDfThresholdScalingFactor / 100)
So for wordDfThresholdScalingFactor=1, the DF cut-off will increase by 1.0 every 100 documents. This means that an input of, for example, 350 documents, will have minimum word df set to floor(350 * 1 / 100) = 3 and any words appearing in fewer than 3 documents will be ignored.
Performance impact: very high
Results impact: Setting low values for this parameter will preserve infrequent words, which can result in more accurate clustering (especially at subcluster level), at the cost of slower processing. Setting high values of this parameter will increase performance at the cost of lower clustering accuracy.
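To see which words a given factor would preserve, the cut-off can be applied to a table of word document frequencies. This is an illustrative sketch: the frequency table and the function name are made up, and only the threshold formula comes from the description above.

```python
import math


def surviving_words(word_df, document_count, scaling_factor=0.7):
    """Keep words whose document frequency reaches the computed cut-off."""
    threshold = math.floor(document_count * scaling_factor / 100)
    # Words appearing in fewer than `threshold` documents are ignored.
    return {word for word, df in word_df.items() if df >= threshold}


# Hypothetical document frequencies for a 350-document collection.
word_df = {"mining": 12, "lucene": 3, "typo": 2, "hapax": 1}
with_factor_1 = surviving_words(word_df, 350, scaling_factor=1.0)
with_default = surviving_words(word_df, 350)
```

With a factor of 1 the cut-off for 350 documents is 3, so "typo" and "hapax" are ignored; with the default of 0.7 the cut-off drops to 2 and "typo" is preserved as well.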
hierarchy
- Type
- com.carrotsearch.lingo3g.parameters.Hierarchy
- Default
- l3g::Hierarchy
- Path
- hierarchy
- Java snippet
- algorithmInstance.hierarchy
Cluster hierarchy control parameters.
{"neighborhoodSize": 20}
clusterCountBase
- Type
- Integer
- Default
- 7
- Constraints
- value >= 2
- Path
- hierarchy.clusterCountBase
- Java snippet
- algorithmInstance.hierarchy.clusterCountBase
The number of clusters discovered in each clustering pass. The higher the value of this parameter, the larger the total number of clusters.
Performance impact: medium
documentCoverageTarget
- Type
- Double
- Default
- 0.95
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- hierarchy.documentCoverageTarget
- Java snippet
- algorithmInstance.hierarchy.documentCoverageTarget
The percentage of input documents to be put in clusters. Determines the percentage of documents the clustering engine should assign to clusters. After each clustering pass, the clustering engine will check if the required document coverage has been achieved. If so, it will not perform further clustering passes. The required document coverage may not always be achieved, especially if the maximum number of clustering passes is set to a low value. To cause the clustering engine to always perform the maximum number of clustering passes, set the value of this parameter to 1.0.
Performance impact: high
maxClusteringPassesSub
- Type
- Integer
- Default
- 2
- Constraints
- value >= 0 and value <= 10
- Path
- hierarchy.maxClusteringPassesSub
- Java snippet
- algorithmInstance.hierarchy.maxClusteringPassesSub
Maximum number of clustering passes to perform on subclusters. Determines the maximum number of
cluster discovery passes the clustering engine should perform to discover subclusters. The
first clustering pass discovers large/more general clusters, while further passes find
smaller/more specific clusters. Setting the maximum number of passes to 0 will force the
algorithm to stop clustering only when no more subclusters can be created or the documentCoverageTarget
has been reached for the parent cluster.
Performance impact: high
Results impact: With the lowest value of this parameter, the clustering engine will discover only the largest clusters, while with higher values, smaller and more specific clusters will also be created. Setting this parameter to 0 will cause the clustering algorithm to create the maximum possible number of subclusters for each cluster.
maxClusteringPassesTop
- Type
- Integer
- Default
- 4
- Constraints
- value >= 0 and value <= 10
- Path
- hierarchy.maxClusteringPassesTop
- Java snippet
- algorithmInstance.hierarchy.maxClusteringPassesTop
Maximum number of clustering passes to perform on top hierarchy level. Determines the maximum
number of cluster discovery passes the clustering engine should perform to discover the
top-level clusters. The first clustering pass discovers large/more general clusters, while
further passes find smaller/more specific clusters. Setting the maximum number of passes to 0
will force the algorithm to stop clustering only when no more clusters can be created or the
documentCoverageTarget
has been reached.
Performance impact: high
Results impact: With the lowest value of this parameter, the clustering engine will discover only the largest clusters, while with higher values, smaller and more specific clusters will also be created. Setting this parameter to 0 will cause the clustering algorithm to create the maximum possible number of clusters.
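The interplay between the pass limit and documentCoverageTarget can be pictured as a loop with two stopping conditions. The sketch below is a simplified model based only on the descriptions above, not the engine's actual control flow: `discover_pass` is a hypothetical callback that returns the set of document ids clustered in a given pass (empty when no more clusters can be created).

```python
def run_clustering_passes(discover_pass, document_count,
                          coverage_target=0.95, max_passes=4):
    """Return (passes performed, ids of covered documents)."""
    covered = set()
    passes = 0
    # max_passes == 0 means "no limit on the number of passes".
    while max_passes == 0 or passes < max_passes:
        newly_covered = discover_pass(passes)
        passes += 1
        if not newly_covered:
            break  # no more clusters can be created
        covered |= set(newly_covered)
        if len(covered) / document_count >= coverage_target:
            break  # required document coverage achieved
    return passes, covered


# Toy data: each pass clusters a smaller group of a 100-document collection.
batches = [set(range(0, 60)), set(range(60, 90)),
           set(range(90, 97)), set(range(97, 99))]


def toy_pass(index):
    return batches[index] if index < len(batches) else set()
```

In this toy setup the defaults stop after three passes because coverage reaches 97%, whereas a limit of 2 stops at 90% without reaching the target.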
maxHierarchyDepth
- Type
- Integer
- Default
- 2
- Constraints
- value >= 1 and value <= 5
- Path
- hierarchy.maxHierarchyDepth
- Java snippet
- algorithmInstance.hierarchy.maxHierarchyDepth
The maximum number of cluster levels to create. Setting this parameter to 1 will disable hierarchical clustering. In that case, it is also recommended to disable hierarchical merging, which will preserve smaller clusters.
Performance impact: high
maxImprovementIterations
- Type
- Integer
- Default
- 5
- Constraints
- value >= 0 and value <= 50
- Path
- hierarchy.maxImprovementIterations
- Java snippet
- algorithmInstance.hierarchy.maxImprovementIterations
The number of clustering improvement iterations to perform. Determines the maximum number of clustering improvement cycles the clustering engine should perform. During each cycle, it will examine cluster arrangements similar to the current one, and if any of them is better, the current one will be replaced.
Performance impact: very high
minClusterSizeForSubclusters
- Type
- Integer
- Default
- 10
- Constraints
- value >= 3
- Path
- hierarchy.minClusterSizeForSubclusters
- Java snippet
- algorithmInstance.hierarchy.minClusterSizeForSubclusters
The minimum number of documents that must be assigned to a cluster before the clustering engine attempts to create subclusters for that cluster.
Performance impact: high
neighborhoodSize
- Type
- Integer
- Default
- 20
- Constraints
- value >= 10 and value <= 200
- Path
- hierarchy.neighborhoodSize
- Java snippet
- algorithmInstance.hierarchy.neighborhoodSize
Maximum similar cluster arrangements to examine. Determines the maximum number of similar
cluster arrangements the clustering engine should examine during each heuristic improvement
cycle. This parameter is meaningful only when maxImprovementIterations
is greater than 0.
Performance impact: very high
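The roles of maxImprovementIterations and neighborhoodSize resemble those of the two limits in a generic local search. The sketch below shows that generic technique on a toy problem; it is not Lingo3G's optimizer, and the `neighbors` and `score` callbacks are hypothetical stand-ins for "similar cluster arrangements" and their quality.

```python
import itertools


def improve(arrangement, neighbors, score,
            max_improvement_iterations=5, neighborhood_size=20):
    """Generic local search bounded by iteration count and neighborhood size."""
    current = arrangement
    for _ in range(max_improvement_iterations):
        # Examine at most `neighborhood_size` similar arrangements per cycle.
        candidates = list(itertools.islice(neighbors(current), neighborhood_size))
        if not candidates:
            break
        best = max(candidates, key=score)
        if score(best) > score(current):
            current = best  # replace the current arrangement with a better one
    return current


# Toy problem: "arrangements" are integers and the best one is 10.
def toy_neighbors(x):
    step = 1
    while True:
        yield x + step
        yield x - step
        step += 1


def toy_score(x):
    return -(x - 10) ** 2
```

In the toy problem, a larger neighborhood lets each cycle move further, which mirrors why both parameters carry a very high performance cost: the work per run grows with the product of the two limits.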
labels
- Type
- com.carrotsearch.lingo3g.parameters.Labels
- Default
- l3g::Labels
- Path
- labels
- Java snippet
- algorithmInstance.labels
Cluster label control parameters.
{}
allowNumbersInLabels
- Type
- Boolean
- Default
- true
- Path
- labels.allowNumbersInLabels
- Java snippet
- algorithmInstance.labels.allowNumbersInLabels
Allow numbers in labels switch. When switched on, the clustering engine will allow tokens identified as numbers to appear in cluster labels.
Performance impact: low
capitalizeNonFunctionWords
- Type
- Boolean
- Default
- true
- Path
- labels.capitalizeNonFunctionWords
- Java snippet
- algorithmInstance.labels.capitalizeNonFunctionWords
Capitalize non-function words in labels. When switched on, the clustering engine will capitalize all non-function words in labels. When switched off, particular words will appear in labels in the case in which they appeared in the majority of input documents.
Performance impact: low
filtering
- Type
- com.carrotsearch.lingo3g.parameters.Filtering
- Default
- l3g::Filtering
- Path
- labels.filtering
- Java snippet
- algorithmInstance.labels.filtering
Cluster label filtering control parameters.
{"trailingGenitiveLabelFilter": true}
dashedWordsLabelFilter
- Type
- Boolean
- Default
- true
- Path
- labels.filtering.dashedWordsLabelFilter
- Java snippet
- algorithmInstance.labels.filtering.dashedWordsLabelFilter
Filters out labels containing words starting or ending in a dash character ('-').
Performance impact: low
dictionaryLabelFilter
- Type
- Boolean
- Default
- true
- Path
- labels.filtering.dictionaryLabelFilter
- Java snippet
- algorithmInstance.labels.filtering.dictionaryLabelFilter
Removes or boosts labels based on a predefined dictionary of words, phrases and regular expressions. Impact on performance depends on the number of regular expression entries in the label dictionary: the more regular expression entries, the lower the processing speed.
Performance impact: medium to very high
leftCompleteLabelFilter
- Type
- Boolean
- Default
- true
- Path
- labels.filtering.leftCompleteLabelFilter
- Java snippet
- algorithmInstance.labels.filtering.leftCompleteLabelFilter
Truncated labels filter. Heuristically eliminates truncated cluster labels ("York Restaurants"), replacing them with more complete phrases ("New York Restaurants"), based on the context. It is recommended to use this filter in combination with rightCompleteLabelFilter. The strength of truncated label elimination is determined by the labelOverrideThreshold attribute.
Performance impact: medium
minLengthLabelFilter
- Type
- Boolean
- Default
- true
- Path
- labels.filtering.minLengthLabelFilter
- Java snippet
- algorithmInstance.labels.filtering.minLengthLabelFilter
Filters out labels whose string representation (excluding spaces) is shorter than 3 characters.
Performance impact: low
numberOnlyLabelFilter
- Type
- Boolean
- Default
- true
- Path
- labels.filtering.numberOnlyLabelFilter
- Java snippet
- algorithmInstance.labels.filtering.numberOnlyLabelFilter
Filters out labels that consist only of numeric tokens.
Performance impact: low
oneLetterWordLabelFilter
- Type
- Boolean
- Default
- true
- Path
- labels.filtering.oneLetterWordLabelFilter
- Java snippet
- algorithmInstance.labels.filtering.oneLetterWordLabelFilter
Filters out labels containing only one-letter words ("M a f").
Performance impact: low
repeatedWordsLabelFilter
- Type
- Boolean
- Default
- false
- Path
- labels.filtering.repeatedWordsLabelFilter
- Java snippet
- algorithmInstance.labels.filtering.repeatedWordsLabelFilter
Filters out labels containing repeated words ("New York York").
Performance impact: low
rightCompleteLabelFilter
- Type
- Boolean
- Default
- true
- Path
- labels.filtering.rightCompleteLabelFilter
- Java snippet
- algorithmInstance.labels.filtering.rightCompleteLabelFilter
Truncated labels filter. Heuristically eliminates truncated cluster labels ("York Restaurants"), replacing them with more complete phrases ("New York Restaurants"), based on the context. It is recommended to use this filter in combination with leftCompleteLabelFilter. The strength of truncated label elimination is determined by the labelOverrideThreshold attribute.
Performance impact: medium
trailingGenitiveLabelFilter
- Type
- Boolean
- Default
- true
- Path
- labels.filtering.trailingGenitiveLabelFilter
- Java snippet
- algorithmInstance.labels.filtering.trailingGenitiveLabelFilter
Filters out phrases ending in a Saxon genitive of an English noun ("Discover World's", "For your computers'").
Performance impact: low
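Several of the filters in this group amount to simple predicates over a label's words. The functions below sketch those checks as described on this page; they are illustrations with hypothetical names, they split labels on whitespace only, and the definition of a "numeric token" (digits only) is an assumption.

```python
def has_dashed_word(label):
    """dashedWordsLabelFilter: a word starts or ends with '-'."""
    return any(w.startswith("-") or w.endswith("-") for w in label.split())


def is_too_short(label, min_chars=3):
    """minLengthLabelFilter: fewer than 3 characters, spaces excluded."""
    return len(label.replace(" ", "")) < min_chars


def is_number_only(label):
    """numberOnlyLabelFilter: every token is numeric (digits only, assumed)."""
    words = label.split()
    return bool(words) and all(w.isdigit() for w in words)


def has_only_one_letter_words(label):
    """oneLetterWordLabelFilter: every word is a single character."""
    words = label.split()
    return bool(words) and all(len(w) == 1 for w in words)


def has_repeated_words(label):
    """repeatedWordsLabelFilter: some word occurs more than once."""
    words = [w.lower() for w in label.split()]
    return len(set(words)) < len(words)


def ends_in_genitive(label):
    """trailingGenitiveLabelFilter: last word ends in 's or s'."""
    words = label.split()
    return bool(words) and words[-1].endswith(("'s", "s'"))
```

A label would be filtered out when the predicate for any enabled filter returns True.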
labelOverrideThreshold
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.2 and value <= 1.0
- Path
- labels.labelOverrideThreshold
- Java snippet
- algorithmInstance.labels.labelOverrideThreshold
Determines the strength of the truncated label filters. The lowest value means strongest truncated labels elimination, which may lead to overlong cluster labels and many unclustered documents. The highest value effectively disables the filter, which may result in short or truncated labels.
Performance impact: low
lowercaseFunctionWords
- Type
- Boolean
- Default
- true
- Path
- labels.lowercaseFunctionWords
- Java snippet
- algorithmInstance.labels.lowercaseFunctionWords
Use lower case for function words in labels. When switched on, the clustering engine will convert all function words in labels into lower case. When switched off, particular function words will appear in labels in the case they appeared in the majority of input documents.
Performance impact: low
maxLabelWords
- Type
- Integer
- Default
- 8
- Constraints
- value >= 1 and value <= 8
- Path
- labels.maxLabelWords
- Java snippet
- algorithmInstance.labels.maxLabelWords
Determines the maximum label length in words. Labels consisting of more words will not be generated.
Performance impact: none
Results impact: Setting the maximum label length to some lower value (e.g. 2 or 3) may create more general clusters.
This setting can also be useful when the input collection contains duplicate documents. In such cases, Lingo3G may create overlong cluster labels taken directly from the duplicate documents. While the best solution to this problem would be eliminating duplicate documents from input, lowering the maximum label length can serve as a simple workaround.
minLabelWords
- Type
- Integer
- Default
- 1
- Constraints
- value >= 1 and value <= 8
- Path
- labels.minLabelWords
- Java snippet
- algorithmInstance.labels.minLabelWords
Determines the minimum label length in words. Labels consisting of fewer words will not be generated.
Performance impact: none
Results impact: Setting the minimum label length to some higher value (e.g. 4 or 5) may create more specific clusters.
preferredLabelLength
- Type
- Double
- Default
- 2.5
- Constraints
- value >= 0.0 and value <= 8.0
- Path
- labels.preferredLabelLength
- Java snippet
- algorithmInstance.labels.preferredLabelLength
Instructs the clustering engine to prefer cluster labels consisting of the specified number of
words. The strength of the preference is determined by the preferredLabelLengthDeviation
attribute.
Fractional preferred label lengths are also allowed. For example, preferred label length of 2.5 will result in labels of length 2 and 3 being treated equally preferred; a value of 2.2 will prefer two-word labels more than three-word ones.
Performance impact: none
preferredLabelLengthDeviation
- Type
- Double
- Default
- 2.5
- Constraints
- value >= 0.0 and value <= 20.0
- Path
- labels.preferredLabelLengthDeviation
- Java snippet
- algorithmInstance.labels.preferredLabelLengthDeviation
Allowed deviation from the preferred label length. Determines how far the clustering engine is allowed to deviate from the preferredLabelLength. A value of 0.0 allows no deviation: all labels must have the preferred length. Larger values allow more and more deviation, with the value of 20.0 meaning almost no preference at all.
When the preferred label length deviation is 0.0 and the fractional part of the preferred label length is 0.5, then the only allowed label lengths will be the two integers closest to the preferred label length value. For example, if preferred label length deviation is 0.0 and preferred label length is 2.5, the clustering engine will create only labels consisting of 2 or 3 words. If the fractional part of the preferred label length is other than 0.5, only the closest integer label length will be preferred.
Performance impact: none
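The zero-deviation rule can be written down directly. The sketch below covers only the case where preferredLabelLengthDeviation is 0.0, as described in the paragraph above; the function name is hypothetical and the scoring used for non-zero deviations is not modeled.

```python
import math


def allowed_lengths_without_deviation(preferred_label_length):
    """Label lengths (in words) allowed when the deviation is 0.0."""
    lower = math.floor(preferred_label_length)
    upper = math.ceil(preferred_label_length)
    # A fractional part of exactly 0.5 makes both neighbors equally close.
    if preferred_label_length - lower == upper - preferred_label_length:
        return {lower, upper}
    # Otherwise only the closest integer length is allowed.
    return {round(preferred_label_length)}
```

With the default preferred length of 2.5 this gives two- and three-word labels, while 2.2 narrows the result to two-word labels only.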
putPromotedLabelsAtHierarchyRoot
- Type
- Boolean
- Default
- false
- Path
- labels.putPromotedLabelsAtHierarchyRoot
- Java snippet
- algorithmInstance.labels.putPromotedLabelsAtHierarchyRoot
Put promoted labels at hierarchy root. When switched on, labels promoted using the label dictionary will be always put at the top level of the cluster hierarchy. When switched off, promoted labels will not be forced to appear at the hierarchy root and will be placed where they naturally belong, e.g. as subclusters of larger clusters.
Results impact: a lot of labels can get promoted as a result of boosting e.g. proper nouns defined in the built-in POS database. With this option enabled, all such labels will be put at the root of cluster hierarchy, which may result in a clearly visible cluster overlap. For example, clusters Bill Clinton, President Bill Clinton and U.S. President Bill Clinton will all show at the root of the cluster tree, while with this option disabled, only the Bill Clinton cluster would be placed at root of the hierarchy.
Performance impact: low
queryHint
- Type
- String
- Default
- null
- Path
- labels.queryHint
- Java snippet
- algorithmInstance.labels.queryHint
Query that produced the documents. The query terms can be penalized (see com.carrotsearch.lingo3g.parameters.Labels#queryWordLabelWeight), and this may help the algorithm to create better clusters. Providing the query is optional but desirable.
queryWordLabelWeight
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- labels.queryWordLabelWeight
- Java snippet
- algorithmInstance.labels.queryWordLabelWeight
Determines the weight of labels containing query words (queryHint). Lower values mean that phrases containing query words are less likely to appear as cluster labels. In particular, the value of 0.0 will totally eliminate query words from cluster labels. The value of 1.0, on the other hand, will cause the clustering engine to treat equally labels with and without query words.
Performance impact: low
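One simple model consistent with the description is to multiply the score of any label that shares a word with the query by the weight. This multiplicative model is an assumption made for illustration: the function name is hypothetical, words are matched by lowercased whitespace tokens, and the engine's real scoring is not documented here.

```python
def weighted_label_score(label, base_score, query_hint,
                         query_word_label_weight=0.5):
    """Down-weight labels that contain query words (ASSUMED multiplicative)."""
    query_words = set((query_hint or "").lower().split())
    label_words = set(label.lower().split())
    if query_words & label_words:
        return base_score * query_word_label_weight
    return base_score
```

Under this model a weight of 0.0 zeroes out every label containing a query word, and a weight of 1.0 leaves all scores untouched, matching the two extremes described above.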
removeRepeatedSynonymsFromLabels
- Type
- Boolean
- Default
- true
- Path
- labels.removeRepeatedSynonymsFromLabels
- Java snippet
- algorithmInstance.labels.removeRepeatedSynonymsFromLabels
Remove repeated synonyms from labels. When switched on, no synonymous words will appear in a single label. For example, if "photos" and "pictures" are declared synonyms, labels such as "Tiger Photos Pictures" or "Photos and Pictures" will not be generated.
Performance impact: low
singleWordLabelWeight
- Type
- Double
- Default
- 0.5
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- labels.singleWordLabelWeight
- Java snippet
- algorithmInstance.labels.singleWordLabelWeight
Determines how willing the clustering engine will be to select single words as cluster labels. The higher the value of this parameter, the more clusters described with single-word labels will be produced.
Performance impact: none
useBuiltInWordDatabaseForLabelFiltering
- Type
- Boolean
- Default
- true
- Path
- labels.useBuiltInWordDatabaseForLabelFiltering
- Java snippet
- algorithmInstance.labels.useBuiltInWordDatabaseForLabelFiltering
Use built-in word database for label filtering. If enabled, Lingo3G will perform label filtering based on the built-in word databases in addition to word dictionary files.
Results impact: If this option is enabled, Lingo3G should produce better-formed cluster labels. For example, labels being, starting or ending with a verb or adjective should appear less frequently. However, because of the limitations of the current part of speech tagging model (please see below), enabling this option is also likely to prevent certain well-formed cluster labels, e.g. if the built-in word database mistakes a noun for a verb.
Limitations of the part of speech tagging model. Currently, Lingo3G uses a unigram model for assigning part of speech tags to words. This means that for each word having multiple part of speech tags (such as "program" in English, which, depending on the context, can be either a verb or a noun), one of the available tags needs to be chosen. To do that, Lingo3G employs a heuristic that takes into account the word frequency and the set of part of speech tags the word has. While the heuristic is fairly effective in general, some words may be tagged erroneously. To provide a solution for such cases, the built-in part of speech database tags can be overridden in the user-defined XML word dictionary.
Performance impact: small.
license
- Type
- String
- Default
- null
- Path
- license
- Java snippet
- algorithmInstance.license
An explicit license string. If not provided, default locations are scanned for licenses.
merging
- Type
- com.carrotsearch.lingo3g.parameters.Merging
- Default
- l3g::Merging
- Path
- merging
- Java snippet
- algorithmInstance.merging
Cluster merging control parameters.
{"preciseHierarchicalMerging": false}
aggressiveCloningControl
- Type
- Boolean
- Default
- false
- Path
- merging.aggressiveCloningControl
- Java snippet
- algorithmInstance.merging.aggressiveCloningControl
Aggressive cluster cloning control switch. When switched on, the clustering engine will not allow the same label to appear at any level of the hierarchy. This parameter is meaningful only if cloningControl is switched on.
Performance impact: low
cloningControl
- Type
- Boolean
- Default
- true
- Path
- merging.cloningControl
- Java snippet
- algorithmInstance.merging.cloningControl
Cluster cloning control switch. When switched on, the clustering engine will not allow the same cluster label to appear both at the top- and subcluster-level of the hierarchy.
Performance impact: low
flatMerging
- Type
- Boolean
- Default
- true
- Path
- merging.flatMerging
- Java snippet
- algorithmInstance.merging.flatMerging
Flat merging switch. When switched on, the clustering engine will perform cluster merging using a strategy specific to flat (non-hierarchical) clusters. With this strategy, the clustering engine will merge only clusters of similar size.
Performance impact: low
hierarchicalMerging
- Type
- Boolean
- Default
- true
- Path
- merging.hierarchicalMerging
- Java snippet
- algorithmInstance.merging.hierarchicalMerging
Hierarchical merging switch. When switched on, the clustering engine will use a cluster merging strategy specially designed for hierarchical clustering, and will be more eager to move clusters from the top level positions to subclusters. If the algorithm is set to perform flat clustering (maxHierarchyDepth = 1), disabling hierarchical merging is recommended to preserve smaller clusters.
Performance impact: low
hierarchicalMergingWithLabels
- Type
- Boolean
- Default
- true
- Path
- merging.hierarchicalMergingWithLabels
- Java snippet
- algorithmInstance.merging.hierarchicalMergingWithLabels
Label merging switch. When switched on, the clustering engine will take cluster labels into account during hierarchical merging of clusters. This parameter is meaningful only when hierarchicalMerging is switched on.
Performance impact: low
Results impact: With label merging switched on, the clustering engine may move some additional clusters from the top level to subclusters.
mergeThreshold
- Type
- Double
- Default
- 0.7
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- merging.mergeThreshold
- Java snippet
- algorithmInstance.merging.mergeThreshold
Cluster merge threshold. If the overlap between clusters is larger than the value of this parameter, these clusters will be merged.
Performance impact: none
Results impact: Low values of this parameter will cause the clustering engine to eagerly merge clusters, which will create larger clusters in which some documents may be irrelevant. High values of this parameter will cause it to merge clusters rarely, which will result in large numbers of small clusters with more relevant documents.
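To make the threshold comparison concrete, here is a minimal sketch. The documentation does not state how Lingo3G measures overlap, so the formula below (shared documents divided by the size of the smaller cluster) is an assumption, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not part of the Lingo3G API.
final class MergeThresholdSketch {
  // Assumed overlap measure: shared documents / size of the smaller cluster.
  static double overlap(Set<Integer> a, Set<Integer> b) {
    if (a.isEmpty() || b.isEmpty()) {
      return 0.0;
    }
    Set<Integer> shared = new HashSet<>(a);
    shared.retainAll(b);
    return (double) shared.size() / Math.min(a.size(), b.size());
  }

  // Clusters are merged only if the overlap is strictly larger than the threshold.
  static boolean shouldMerge(Set<Integer> a, Set<Integer> b, double mergeThreshold) {
    return overlap(a, b) > mergeThreshold;
  }

  public static void main(String[] args) {
    Set<Integer> a = Set.of(1, 2, 3, 4);    // document ids in the first cluster
    Set<Integer> b = Set.of(2, 3, 4, 5, 6); // document ids in the second cluster
    System.out.println(overlap(a, b));          // 0.75
    System.out.println(shouldMerge(a, b, 0.7)); // true: merged at the default threshold
    System.out.println(shouldMerge(a, b, 0.9)); // false: a higher threshold keeps them apart
  }
}
```

Under this assumed measure, the same pair of clusters is merged at the default value of 0.7 but kept separate at 0.9, which matches the results impact described above.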
preciseHierarchicalMerging
- Type
- Boolean
- Default
- false
- Path
- merging.preciseHierarchicalMerging
- Java snippet
- algorithmInstance.merging.preciseHierarchicalMerging
Precise hierarchical merging switch. When switched on, the hierarchically merged group will contain only those documents that contain the label of the merged group. Enable this option if you would like to avoid a situation where, due to standard merging, a cluster contains documents in which the cluster's label does not appear.
Performance impact: low
Results impact: With precise hierarchical merging switched on, certain small groups removed from the top level may not re-emerge as children of the large group they were merged into. As a result, some documents of such a group may end up unassigned to any cluster.
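The effect of precise merging on document assignment can be sketched as follows. This is an illustration under simplifying assumptions, not the engine's code: documents are plain strings, a label "appears" in a document when it occurs as a case-insensitive substring, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of the Lingo3G API.
final class PreciseMergingSketch {
  // Keeps only those documents of the merged group that contain the group's label.
  // Assumption: a case-insensitive substring match stands in for label matching.
  static List<String> keepDocumentsWithLabel(List<String> mergedDocuments, String label) {
    String needle = label.toLowerCase(Locale.ROOT);
    List<String> kept = new ArrayList<>();
    for (String document : mergedDocuments) {
      if (document.toLowerCase(Locale.ROOT).contains(needle)) {
        kept.add(document);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> merged = List.of(
        "Tiger photos from India",
        "Bengal tiger habitat",
        "Wildlife reserves in Asia"); // came from a smaller merged group, lacks the label
    System.out.println(keepDocumentsWithLabel(merged, "tiger"));
    // [Tiger photos from India, Bengal tiger habitat]
  }
}
```

The third document is dropped from the merged group. If its original group does not re-emerge as a subcluster, it ends up unassigned, as noted in the results impact above.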
scoring
- Type
- com.carrotsearch.lingo3g.parameters.Scoring
- Default
- l3g::Scoring
- Path
- scoring
- Java snippet
- algorithmInstance.scoring
Cluster scoring parameters.
{}
boostedFieldScorerWeight
- Type
- Double
- Default
- 0.6
- Constraints
- value >= 0.0
- Path
- scoring.boostedFieldScorerWeight
- Java snippet
- algorithmInstance.scoring.boostedFieldScorerWeight
Assigns higher scores to labels that contain words appearing in input documents' titles.
Performance impact: low
capitalizedWordLabelScorerWeight
- Type
- Double
- Default
- 0.1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.capitalizedWordLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.capitalizedWordLabelScorerWeight
Assigns higher scores to labels that contain capitalized words.
Performance impact: low
clusterSetDocumentOverlapLabelScorerWeight
- Type
- Double
- Default
- 1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.clusterSetDocumentOverlapLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.clusterSetDocumentOverlapLabelScorerWeight
Assigns higher scores to labels that cover documents not present in the current cluster set.
Performance impact: low
dictionaryWeightLabelScorerWeight
- Type
- Double
- Default
- 1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.dictionaryWeightLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.dictionaryWeightLabelScorerWeight
Boosts label scores by a factor specified in the label dictionary file. If this scorer has weight 0, label boosting will not be applied.
Performance impact: low
documentCountLabelScorerWeight
- Type
- Double
- Default
- 1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.documentCountLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.documentCountLabelScorerWeight
Assigns higher scores to clusters whose number of documents, relative to the total number of documents, is equal to or smaller than the value specified by the maxClusterSize parameter.
Performance impact: low
grammaticalVariantLabelScorerWeight
- Type
- Double
- Default
- 1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.grammaticalVariantLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.grammaticalVariantLabelScorerWeight
Strength of penalization of the less frequent variants of stem-equivalent labels. For example, if the input documents contain phrases "Fuel efficiency" and "Fuel efficient", the less frequent phrase variant will be less likely to appear as a cluster label.
When the value of this parameter is 1.0, the less frequent phrases will be penalized proportionally to the difference between the frequency of that phrase and the frequency of the most frequent variant. Lower values of this parameter will decrease the penalty; setting the value to 0.0 will cause Lingo3G to treat all grammatical variants equally.
Performance impact: low
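One way to picture how grammaticalVariantLabelScorerWeight scales the penalty is the sketch below. The exact formula used by Lingo3G is not given here, so the multiplier (1 minus the weight times the relative frequency gap) is an assumption chosen only to match the described behavior at 0.0 and 1.0. The class name, method name and phrase frequencies are hypothetical.

```java
// Hypothetical illustration only; not part of the Lingo3G API.
final class GrammaticalVariantSketch {
  // Assumed score multiplier for a label variant:
  //   1 - weight * (mostFrequent - variant) / mostFrequent
  // weight = 0.0 treats all variants equally; weight = 1.0 applies the full
  // penalty, proportional to the frequency difference.
  static double multiplier(int variantFrequency, int mostFrequentVariantFrequency, double weight) {
    if (mostFrequentVariantFrequency <= 0) {
      return 1.0;
    }
    double relativeGap =
        (double) (mostFrequentVariantFrequency - variantFrequency) / mostFrequentVariantFrequency;
    return 1.0 - weight * relativeGap;
  }

  public static void main(String[] args) {
    int fuelEfficiency = 40; // made-up frequency of "Fuel efficiency"
    int fuelEfficient = 10;  // made-up frequency of "Fuel efficient"
    System.out.println(multiplier(fuelEfficient, fuelEfficiency, 1.0));  // 0.25
    System.out.println(multiplier(fuelEfficient, fuelEfficiency, 0.5));  // 0.625
    System.out.println(multiplier(fuelEfficient, fuelEfficiency, 0.0));  // 1.0
    System.out.println(multiplier(fuelEfficiency, fuelEfficiency, 1.0)); // 1.0
  }
}
```

In this sketch the most frequent variant is never penalized, and lowering the weight moves the less frequent variant's multiplier back toward 1.0.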
queryWordLabelScorerWeight
- Type
- Double
- Default
- 0.1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.queryWordLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.queryWordLabelScorerWeight
Penalizes labels that contain query words.
Performance impact: low
tfDfRatioLabelScorerWeight
- Type
- Double
- Default
- 0.2
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.tfDfRatioLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.tfDfRatioLabelScorerWeight
Assigns higher scores to more general/shorter labels.
Performance impact: low
tfLabelScorerWeight
- Type
- Double
- Default
- 1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.tfLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.tfLabelScorerWeight
Assigns higher scores to labels with higher Term Frequency (TF).
Performance impact: low
unindexedWordLabelScorerWeight
- Type
- Double
- Default
- 0.1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.unindexedWordLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.unindexedWordLabelScorerWeight
Penalizes labels that contain too many function words.
Performance impact: low
wordCountLabelScorerWeight
- Type
- Double
- Default
- 1
- Constraints
- value >= 0.0 and value <= 1.0
- Path
- scoring.wordCountLabelScorerWeight
- Java snippet
- algorithmInstance.scoring.wordCountLabelScorerWeight
Assigns higher scores to labels that consist of 2, 3 or 4 words. Longer labels are penalized: the longer the label, the higher the penalty.
Performance impact: low