Parameter tuning

This chapter answers the most frequently asked questions about tuning Lingo3G clustering quality and performance.

How can I remove or promote certain phrases?

Lingo3G has an advanced model for boosting or penalizing potential cluster phrases. See the Label dictionary chapter for more information.

Why are not all documents assigned to clusters?

When Lingo3G discovers a set of clusters, it assigns to each cluster the documents relevant to it and may leave part of the input unassigned. Documents not referenced from any cluster are essentially outliers: they are not sufficiently related to any other document in the input.

The fraction of unassigned documents increases with the number of input documents because, by default, Lingo3G tries to output a set of clusters that is concise and readable to human users. In certain applications, for example when you pass cluster labels to another automated step, you can increase the number of clusters and thereby decrease the number of unassigned documents.
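To judge whether tuning is needed, it helps to measure how many documents are left unassigned. The following is a minimal sketch, not part of Lingo3G: it assumes the clustering result is available as a list of dictionaries, each with a "documents" list of input document indices and an optional "clusters" list of subclusters. The function name and the sample data are hypothetical; adapt the field names to the result format your Lingo3G API actually returns.

```python
def unassigned_documents(clusters, doc_count):
    """Return the sorted indices of documents not referenced by any cluster."""
    assigned = set()
    stack = list(clusters)
    while stack:
        cluster = stack.pop()
        assigned.update(cluster.get("documents", []))
        # Descend into subclusters, if the result is hierarchical.
        stack.extend(cluster.get("clusters", []))
    return sorted(set(range(doc_count)) - assigned)


# Hypothetical clustering result for 8 input documents (indices 0..7).
clusters = [
    {"labels": ["Data Mining"], "documents": [0, 2, 5], "clusters": []},
    {"labels": ["Search"], "documents": [1], "clusters": [
        {"labels": ["Search Engines"], "documents": [1, 6], "clusters": []},
    ]},
]

outliers = unassigned_documents(clusters, doc_count=8)
fraction = len(outliers) / 8
print(outliers, fraction)  # prints: [3, 4, 7] 0.375
```

Comparing this fraction before and after a parameter change shows whether the change had the intended effect.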

To reduce the number of unassigned documents, apply the following parameter changes:

  1. Increase maxClusteringPassesTop above the default value or set it to zero to force Lingo3G to create as many clusters as possible.

  2. Increase documentCoverageTarget above the default value.

  3. Increase singleWordLabelWeight above the default value. This will increase the number of one-word labels, which may not always be desirable.

  4. When you cluster more than a hundred documents, you can further reduce the number of unassigned documents by lowering wordDfThresholdScalingFactor and phraseDfThresholdScalingFactor. This will force Lingo3G to consider lower-frequency words and phrases and hence create more clusters. Note that lowering these parameters may significantly increase processing time.
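The four adjustments above can be sketched as a helper that derives new values from the ones you currently use. This is an illustration only: the function name, the flat dictionary layout, the step sizes, and the cap at 1.0 are assumptions, and the baseline numbers are placeholders rather than Lingo3G defaults. Consult the parameter reference for the real defaults and valid ranges.

```python
def reduce_unassigned(params, doc_count, step=0.1, df_factor=0.5):
    """Return a copy of `params` nudged so that fewer documents stay unassigned.

    `step` and `df_factor` are arbitrary illustration values.
    """
    tuned = dict(params)

    # Step 1: zero asks Lingo3G to create as many clusters as possible.
    tuned["maxClusteringPassesTop"] = 0

    # Steps 2 and 3: raise both values (the 1.0 cap is an assumption).
    for name in ("documentCoverageTarget", "singleWordLabelWeight"):
        tuned[name] = round(min(1.0, params[name] + step), 4)

    # Step 4: only for more than a hundred documents, admit lower-frequency
    # words and phrases. This may noticeably increase processing time.
    if doc_count > 100:
        for name in ("wordDfThresholdScalingFactor",
                     "phraseDfThresholdScalingFactor"):
            tuned[name] = round(params[name] * df_factor, 4)

    return tuned


# Placeholder values, NOT the Lingo3G defaults.
current = {
    "maxClusteringPassesTop": 3,
    "documentCoverageTarget": 0.8,
    "singleWordLabelWeight": 0.5,
    "wordDfThresholdScalingFactor": 0.02,
    "phraseDfThresholdScalingFactor": 0.02,
}
tuned = reduce_unassigned(current, doc_count=500)
```

In practice it is safer to change one parameter at a time and re-check the results, because step 3 also changes the character of the labels and step 4 affects processing time.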

How can I make clusters more general?

To make the clusters more general (containing more documents, covering broader topics), apply the following parameter changes:

  1. Increase singleWordLabelWeight above the default value, possibly up to 1.00. Note that this will increase the number of one-word labels, which may not always be desirable.

  2. Increase maxClusterSize above the default value, possibly up to 1.00.

  3. Increase minClusterSize in steps of 0.01 to eliminate the clusters with the fewest documents.

  4. To further increase the size of clusters, try lowering mergeThreshold. This will cause Lingo3G to merge similar clusters into larger ones.
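Because these changes are best applied gradually, one way to work through them is to generate a short series of progressively more general settings and try each in turn. The sketch below assumes a flat dictionary of parameter values; the function name, the 0.1 step, the lower bound of 0.0 for mergeThreshold, and the baseline numbers are all hypothetical and are not Lingo3G defaults. Only the 0.01 step for minClusterSize and the 1.00 upper limits come from the list above.

```python
def generalization_candidates(params, rounds=3, step=0.1, lower_merge=False):
    """Yield progressively more general parameter sets to try in turn."""
    tuned = dict(params)
    for _ in range(rounds):
        tuned = dict(tuned)  # each yielded candidate is an independent copy
        # Steps 1 and 2: raise towards, but not beyond, 1.00.
        for name in ("singleWordLabelWeight", "maxClusterSize"):
            tuned[name] = round(min(1.0, tuned[name] + step), 2)
        # Step 3: raise in steps of 0.01 to drop the smallest clusters.
        tuned["minClusterSize"] = round(tuned["minClusterSize"] + 0.01, 2)
        # Step 4 (optional): lower the threshold so similar clusters merge.
        if lower_merge:
            tuned["mergeThreshold"] = round(
                max(0.0, tuned["mergeThreshold"] - step), 2)
        yield tuned


# Placeholder values, NOT the Lingo3G defaults.
current = {
    "singleWordLabelWeight": 0.5,
    "maxClusterSize": 0.85,
    "minClusterSize": 0.01,
    "mergeThreshold": 0.7,
}

for candidate in generalization_candidates(current):
    print(candidate)
```

Stopping at the first candidate that yields acceptable clusters avoids pushing the parameters further than necessary.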

How can I make clusters more focused?

To make the clusters more specific (containing fewer documents, covering narrower topics), apply the following parameter changes:

  1. Decrease maxClusterSize below the default value to eliminate large clusters.

  2. Set maxClusteringPassesTop to 0 to force Lingo3G to create as many clusters as possible.

  3. If there are too many meaningless one-word cluster labels, try lowering singleWordLabelWeight. Setting this parameter to 0.00 will eliminate one-word labels altogether.
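The three changes above can be expressed as a small helper. As before, this is a sketch under stated assumptions: the function name, the `one_word_labels` option, the flat dictionary layout, and the 0.1 step are hypothetical, and the baseline numbers are placeholders rather than Lingo3G defaults.

```python
def focus_clusters(params, step=0.1, one_word_labels="keep"):
    """Return a copy of `params` adjusted to produce more specific clusters.

    one_word_labels: "keep" leaves singleWordLabelWeight untouched,
    "fewer" lowers it by `step`, "none" sets it to 0.0.
    """
    tuned = dict(params)

    # Step 1: a smaller maximum cluster size eliminates large clusters.
    new_max = round(params["maxClusterSize"] - step, 2)
    if new_max <= 0:
        raise ValueError("maxClusterSize must stay positive; use a smaller step")
    tuned["maxClusterSize"] = new_max

    # Step 2: zero asks Lingo3G to create as many clusters as possible.
    tuned["maxClusteringPassesTop"] = 0

    # Step 3: only touch one-word labels if they are actually a problem.
    if one_word_labels == "fewer":
        tuned["singleWordLabelWeight"] = round(
            max(0.0, params["singleWordLabelWeight"] - step), 2)
    elif one_word_labels == "none":
        tuned["singleWordLabelWeight"] = 0.0
    elif one_word_labels != "keep":
        raise ValueError("one_word_labels must be 'keep', 'fewer' or 'none'")

    return tuned


# Placeholder values, NOT the Lingo3G defaults.
current = {
    "maxClusterSize": 0.85,
    "maxClusteringPassesTop": 3,
    "singleWordLabelWeight": 0.5,
}
focused = focus_clusters(current, one_word_labels="fewer")
```

Keeping the one-word label change optional mirrors the list above, where it applies only when such labels turn out to be meaningless.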