Quality tuning

This chapter answers the most commonly asked questions about tuning Lingo3G for quality or speed.

How can I remove or promote certain phrases?

Lingo3G has an advanced model for boosting or penalizing potential cluster phrases. Please see the label filtering chapter for more information.

Why are not all documents assigned to clusters?

When Lingo3G discovers a set of clusters, it assigns relevant documents to these clusters and may leave a certain portion of the input unassigned. Documents not referenced from any cluster are essentially outliers: they are not (sufficiently) related to any other document in the input.

The fraction of unassigned documents will increase with the number of input documents because Lingo3G (by default) tries to output a set of clusters that is fairly concise and readable by human users. In certain applications, for example where cluster labels are passed to another automated step, the number of output clusters can be increased (and the number of unassigned documents thereby decreased).
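
Before tuning, it can help to measure how many documents are actually left unassigned. The following is a minimal sketch in Python, not part of the Lingo3G API: it assumes the clustering result has already been flattened into plain lists of document identifiers (one list per cluster, with any subclusters included as separate lists). The function name and the sample data are hypothetical.

```python
from typing import Iterable, Set


def unassigned_fraction(all_doc_ids: Iterable[int],
                        clusters: Iterable[Iterable[int]]) -> float:
    """Return the fraction of input documents not referenced from any cluster."""
    all_ids = set(all_doc_ids)
    if not all_ids:
        return 0.0
    assigned: Set[int] = set()
    for members in clusters:
        # A document may belong to several clusters; the set removes duplicates.
        assigned.update(members)
    return len(all_ids - assigned) / len(all_ids)


# Hypothetical data: 10 input documents, 3 clusters.
docs = range(10)
clusters = [[0, 1, 2], [2, 3], [7]]
fraction = unassigned_fraction(docs, clusters)
print(f"{fraction:.0%} of documents are unassigned")
```

Tracking this number before and after each attribute change makes it easier to tell whether a given adjustment has the intended effect.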

To reduce the number of unassigned documents:

  1. Increase the maxClusteringPassesTop above the default value or set it to zero to force Lingo3G to create as many clusters as possible.
  2. Increase the documentCoverageTarget above the default value.
  3. Increase the singleWordLabelWeight above the default value. This will increase the number of one-word labels, which may not always be desirable.
  4. When clustering more than a hundred documents, further reductions in the number of unassigned documents can be achieved by lowering wordDfThresholdScalingFactor and phraseDfThresholdScalingFactor. This will force Lingo3G to consider lower-frequency words and phrases when clustering and hence to create more clusters. Please note that lowering these values may significantly increase processing time.
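
The four steps above can be sketched as a small helper that derives adjusted attribute values from the current ones. This is only an illustration in Python: the attribute names come from the list above, but the helper, the step sizes, and the starting values are hypothetical (they are placeholders, not Lingo3G defaults), and the cap of 1.0 on singleWordLabelWeight is an assumption. How the resulting values are passed to Lingo3G depends on the integration and is not shown.

```python
def reduce_unassigned(current, step=0.1, large_input=False, df_scale=0.5):
    """Return a copy of `current` adjusted to leave fewer documents unassigned."""
    tuned = dict(current)
    # 1. Zero asks Lingo3G to create as many clusters as possible.
    tuned["maxClusteringPassesTop"] = 0
    # 2. Raise the document coverage target.
    tuned["documentCoverageTarget"] = current["documentCoverageTarget"] + step
    # 3. Raise the one-word label weight (assumed here to be capped at 1.0).
    tuned["singleWordLabelWeight"] = min(
        1.0, current["singleWordLabelWeight"] + step)
    # 4. For larger inputs only: lower both DF threshold scaling factors.
    #    This may significantly increase processing time.
    if large_input:
        for name in ("wordDfThresholdScalingFactor",
                     "phraseDfThresholdScalingFactor"):
            tuned[name] = current[name] * df_scale
    return tuned


# Placeholder starting values, NOT the actual Lingo3G defaults.
current = {
    "maxClusteringPassesTop": 3,
    "documentCoverageTarget": 0.5,
    "singleWordLabelWeight": 0.5,
    "wordDfThresholdScalingFactor": 1.0,
    "phraseDfThresholdScalingFactor": 1.0,
}
tuned = reduce_unassigned(current, step=0.25, large_input=True)
print(tuned)
```

Applying one change at a time and re-checking the output is usually safer than changing all attributes at once, since step 3 in particular also affects label quality.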

How to make more general clusters?

To make the clusters more general (containing more documents, covering broader topics):

  1. Increase the singleWordLabelWeight above the default value, possibly up to 1.00. Note that this will increase the number of one-word labels, which may not always be desirable.
  2. Increase the maxClusterSize above the default value, possibly up to 1.00.
  3. Increase the minClusterSize in steps of 0.01 to eliminate the clusters with smallest numbers of documents.
  4. To further increase the size of clusters, try lowering the mergeThreshold. This will cause Lingo3G to merge similar clusters into larger ones.
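
Because minClusterSize is meant to be raised gradually, one way to experiment is to generate a short series of candidate settings and compare the resulting clusters for each. The sketch below, in Python, assumes the attributes are held in a plain dictionary. The helper name and the starting values are hypothetical placeholders, not Lingo3G defaults. It covers steps 1 to 3 and leaves mergeThreshold untouched.

```python
def general_cluster_candidates(current, steps=3):
    """Build settings that raise minClusterSize in increments of 0.01."""
    candidates = []
    for i in range(1, steps + 1):
        tuned = dict(current)
        # Steps 1 and 2: push both attributes to the suggested upper bound.
        tuned["singleWordLabelWeight"] = 1.0
        tuned["maxClusterSize"] = 1.0
        # Step 3: raise minClusterSize by 0.01 per candidate; rounding
        # keeps floating-point noise out of the values.
        tuned["minClusterSize"] = round(current["minClusterSize"] + 0.01 * i, 2)
        candidates.append(tuned)
    return candidates


# Placeholder starting values, NOT the actual Lingo3G defaults.
current = {
    "singleWordLabelWeight": 0.5,
    "maxClusterSize": 0.5,
    "minClusterSize": 0.02,
}
candidates = general_cluster_candidates(current)
for candidate in candidates:
    print(candidate["minClusterSize"])
```

Each candidate can then be used for one clustering run. The first one that removes the unwanted small clusters without discarding useful ones is a reasonable stopping point.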

How to make more focused clusters?

To make the clusters more specific (containing fewer documents, covering narrower topics):

  1. Decrease the maxClusterSize below the default value to eliminate large clusters.
  2. Set the maxClusteringPassesTop to 0 to force Lingo3G to create as many clusters as possible.
  3. If there are too many meaningless one-word cluster labels, try lowering the singleWordLabelWeight. Setting this attribute to 0.00 will eliminate one-word labels altogether.
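
The same suggestions can be collected into a set of attribute overrides. The following Python sketch is illustrative only: the helper, the shrink factor, and the sample maxClusterSize value are hypothetical, and the way the overrides are handed to Lingo3G is not shown.

```python
def focused_overrides(current_max_cluster_size, shrink=0.5,
                      keep_single_word_labels=True):
    """Return attribute overrides intended to produce more specific clusters."""
    overrides = {
        # 1. Lower maxClusterSize to eliminate large clusters.
        "maxClusterSize": current_max_cluster_size * shrink,
        # 2. Zero asks Lingo3G to create as many clusters as possible.
        "maxClusteringPassesTop": 0,
    }
    if not keep_single_word_labels:
        # 3. A weight of 0.0 eliminates one-word labels altogether.
        overrides["singleWordLabelWeight"] = 0.0
    return overrides


# 0.5 is a placeholder, NOT the actual Lingo3G default for maxClusterSize.
overrides = focused_overrides(0.5, keep_single_word_labels=False)
print(overrides)
```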