pairwiseSimilarity

pairwiseSimilarity:* components define how the similarity between two documents is computed in duplicate detection and document overlap highlighting.

The following pairwise​Similarity:​* stage types are available for use in analysis request JSONs:

pairwise​Similarity:​document​Overlap​Ratio

For two documents A and B, computes the ratio of text spans that are duplicated in both documents.

pairwise​Similarity:​feature​Intersection​Min​Ratio

For two documents with feature sets A and B, returns the count of features present in both A and B, divided by the number of features in the smaller set: |A ∩ B| / min(|A|, |B|).

pairwise​Similarity:​feature​Intersection​Size

For two documents with feature sets A and B, returns the count of features present in both A and B: |A ∩ B|.

pairwise​Similarity:​feature​Intersection​To​Union​Ratio

For two documents with feature sets A and B, returns the count of features present in both A and B, divided by the total number of unique features: |A ∩ B| / |A ∪ B|.


pairwise​Similarity:​reference

References a pairwise​Similarity:​* component defined in the request or in the project's default components.


pairwise​Similarity:​document​Overlap​Ratio

For two documents A and B, computes the ratio of text spans that are duplicated in both documents. This is a computationally expensive similarity metric that is best used as a final step in duplicate detection. This is also the similarity metric used internally by the text overlap stage.

{
  "type": "pairwiseSimilarity:documentOverlapRatio",
  "allowedGapRatio": 0,
  "computeDifferences": false,
  "crossFieldOverlaps": true,
  "fields": {
    "type": "fields:reference",
    "auto": true
  },
  "ngramWindow": 6
}

Document overlap ratio splits the text in both documents into smaller fragments and then looks for copies of these fragments in both documents. All shared fragments are then aggregated into contiguous regions that are considered to be an overlap between the two documents.

A ratio of 1 means that one document contains all text passages of the other document. Note that text spans are matched regardless of their order, so the two texts may not be identical even if the ratio is 1.
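The fragment-and-aggregate idea described above can be sketched in a few lines of Python. This is only an illustration, not the actual implementation: the whitespace tokenization, the function names and the choice to express the ratio as the covered share of the first document's words are all assumptions made for the example.

```python
def ngram_set(words, n):
    """Return the set of word n-grams (as tuples) found in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_regions(text_a, text_b, n=6):
    """Word ranges [start, end) of text_a covered by n-grams that also occur in text_b."""
    words_a, words_b = text_a.split(), text_b.split()
    shared = ngram_set(words_b, n)
    covered = [False] * len(words_a)
    for i in range(len(words_a) - n + 1):
        if tuple(words_a[i:i + n]) in shared:
            for j in range(i, i + n):
                covered[j] = True
    # Aggregate covered words into contiguous regions.
    regions, start = [], None
    for i, flag in enumerate(covered + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            regions.append((start, i))
            start = None
    return regions


def overlap_ratio(text_a, text_b, n=6):
    """Share of text_a's words that fall inside overlap regions (assumed definition)."""
    words_a = text_a.split()
    if not words_a:
        return 0.0
    covered = sum(end - start for start, end in overlap_regions(text_a, text_b, n))
    return covered / len(words_a)


a = "the quick brown fox jumps over the lazy dog today"
b = "yesterday the quick brown fox jumps over the fence"
print(overlap_regions(a, b, n=3))  # [(0, 7)]
print(overlap_ratio(a, b, n=3))    # 0.7
```

In this sketch, a text that is fully embedded in a longer one yields a ratio of 1 even though the two texts differ, which mirrors the remark above.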

allowed​Gap​Ratio

Type
number
Default
0
Required
no

Allowed character ratio of non-overlapping "gaps" between consecutive text spans. If the length of a gap, divided by the total length of the neighboring text spans, is smaller than this number, the neighboring spans and the gap between them are concatenated into a single overlapping span.

By default the allowed gap ratio is zero, meaning no concatenations are allowed.
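As a rough sketch of the gap rule, the Python snippet below merges consecutive character spans when the gap between them is small relative to their combined length. The function name, the (start, end) span representation and the greedy left-to-right merging order are assumptions of this example; the actual merging strategy may differ.

```python
def merge_spans(spans, allowed_gap_ratio):
    """Merge consecutive (start, end) character spans separated by small gaps.

    Spans are assumed to be non-empty and non-overlapping.
    """
    merged = []
    for start, end in sorted(spans):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            neighbors = (prev_end - prev_start) + (end - start)
            # Strict comparison: with a ratio of 0 nothing is ever merged.
            if gap / neighbors < allowed_gap_ratio:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


spans = [(0, 40), (42, 80), (120, 160)]
print(merge_spans(spans, 0.05))  # [(0, 80), (120, 160)]
print(merge_spans(spans, 0))     # [(0, 40), (42, 80), (120, 160)]
```

The two-character gap is absorbed at a ratio of 0.05 (2 / 78 is about 0.026), while the 40-character gap is not.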

compute​Differences

Type
boolean
Default
false
Required
no

When set to true, the similarity is inverted and carries the ratio of different (non-overlapping) regions between the two documents.

This option can be used to compute fragments​In​Fields highlights to visually inspect differences between text values. Note that this option cannot be used to retrieve alignedFragments in overlap analysis (because it computes different text fragments, not identical text fragments).

cross​Field​Overlaps

Type
boolean
Default
true
Required
no

If set to true, the duplicated text spans are found across all fields. Otherwise, each field is treated separately and the result is averaged.
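The difference between the two modes can be illustrated with a toy example. The overlap measure below (share of distinct words) is a deliberately simple stand-in, and pooling field values by joining them is an assumption; neither reflects how overlaps are actually computed internally.

```python
def shared_word_ratio(text_a, text_b):
    """Toy stand-in for an overlap measure: share of text_a's distinct words found in text_b."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    return len(words_a & words_b) / len(words_a) if words_a else 0.0


def field_overlap(doc_a, doc_b, fields, cross_field_overlaps=True):
    if cross_field_overlaps:
        # Look for shared text across all fields at once.
        pooled_a = " ".join(doc_a[f] for f in fields)
        pooled_b = " ".join(doc_b[f] for f in fields)
        return shared_word_ratio(pooled_a, pooled_b)
    # Treat each field separately and average the per-field results.
    per_field = [shared_word_ratio(doc_a[f], doc_b[f]) for f in fields]
    return sum(per_field) / len(per_field)


doc_a = {"title": "solar panels", "body": "cheap energy"}
doc_b = {"title": "cheap energy", "body": "solar panels"}
print(field_overlap(doc_a, doc_b, ["title", "body"], True))   # 1.0
print(field_overlap(doc_a, doc_b, ["title", "body"], False))  # 0.0
```

Text that moved from one field to another is found only when fields are considered together.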

fields

Type
fields
Default
{
  "type": "fields:reference",
  "auto": true
}
Required
no

Text fields to use for computing overlaps. If more than one text field is used, the overlap between pairs of corresponding fields is averaged, unless crossFieldOverlaps is enabled.

ngram​Window

Type
integer
Default
6
Constraints
value > 0
Required
no

The length of the word n-gram window used to detect overlapping text spans (these are aggregated into overlap ranges if they're contiguous). Setting the n-gram window too low may result in false positives, because short word sequences tend to repeat in otherwise unrelated texts.
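To see why a small window can produce false positives, the following Python sketch counts word n-grams shared by two unrelated sentences for several window sizes. The lowercasing, the whitespace tokenization and the helper name are assumptions of this example only.

```python
def shared_ngrams(text_a, text_b, n):
    """Return the word n-grams (as tuples) that occur in both texts."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(text_a) & ngrams(text_b)


a = "the report was sent to the board on monday"
b = "the invoice was sent to the client on friday"
for n in (1, 3, 6):
    print(n, len(shared_ngrams(a, b, n)))
# 1 5
# 3 2
# 6 0
```

With a window of 1, common words alone create matches between the two sentences; at the default window of 6 no match remains.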

pairwise​Similarity:​feature​Intersection​Min​Ratio

For two documents with feature sets A and B, returns the count of features present in both A and B, divided by the number of features in the smaller set: |A ∩ B| / min(|A|, |B|).

{
  "type": "pairwiseSimilarity:featureIntersectionMinRatio",
  "features": {
    "type": "featureSource:reference",
    "auto": true
  }
}

This similarity is useful for validation of "relative containment" of features from one document in another. For example, one could use it to detect documents in which the set of indexed labels constitutes a 90%-99% subset of another document's labels, indicating very similar (but not identical) content.

The output value is a number in the range of [0, 1].
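The formula can be written directly with Python sets. The function name is hypothetical, and returning 0 when either feature set is empty is an assumption of this sketch, since the behavior for empty sets is not specified here.

```python
def feature_intersection_min_ratio(features_a, features_b):
    """|A ∩ B| / min(|A|, |B|) over unique features."""
    a, b = set(features_a), set(features_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # assumption: no features means no similarity
    return len(a & b) / smaller


labels_a = {f"label{i}" for i in range(10)}
labels_b = {f"label{i}" for i in range(9)}              # strict subset of labels_a
labels_c = {f"label{i}" for i in range(9)} | {"other"}  # 9 of 10 labels shared

print(feature_intersection_min_ratio(labels_a, labels_b))  # 1.0
print(feature_intersection_min_ratio(labels_a, labels_c))  # 0.9
```

A value of 1 therefore signals that the smaller feature set is fully contained in the larger one, regardless of how much larger the other set is.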

features

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

A feature source for both documents. Only top-level, unique features are counted (sets). Composite features are not flattened.

pairwise​Similarity:​feature​Intersection​Size

For two documents with feature sets A and B, returns the count of features present in both A and B: |A ∩ B|.

{
  "type": "pairwiseSimilarity:featureIntersectionSize",
  "features": {
    "type": "featureSource:reference",
    "auto": true
  }
}

This similarity is useful for validation of absolute feature values (for example an absolute number of shared entire field values or an absolute number of duplicate sentences).

The output value is an integer in the range of [0, ∞).
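As a minimal sketch, the count can be computed with a set intersection. The function name and the use of whole sentences as features are assumptions chosen for illustration.

```python
def feature_intersection_size(features_a, features_b):
    """|A ∩ B|: the number of unique features present in both inputs."""
    # Converting to sets first means a repeated feature counts only once.
    return len(set(features_a) & set(features_b))


sentences_a = ["Terms apply.", "All rights reserved.", "Terms apply."]
sentences_b = ["All rights reserved.", "Terms apply.", "Contact us."]
print(feature_intersection_size(sentences_a, sentences_b))  # 2
```

Because the value is an absolute count, a fixed threshold such as "at least 2 shared sentences" behaves the same for short and long documents.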

features

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

A feature source for both documents. Only top-level, unique features are counted (sets). Composite features are not flattened.

pairwise​Similarity:​feature​Intersection​To​Union​Ratio

For two documents with feature sets A and B, returns the count of features present in both A and B, divided by the number of all unique features: |A ∩ B| / |A ∪ B|.

{
  "type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
  "features": {
    "type": "featureSource:reference",
    "auto": true
  }
}

This similarity is useful for validation of "absolute intersection" of features between two documents.

The output value is a number in the range of [0, 1].
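This ratio is commonly known as the Jaccard index. A minimal Python sketch follows; the function name is hypothetical, and returning 0 for two empty feature sets is an assumption of this example.

```python
def feature_intersection_to_union_ratio(features_a, features_b):
    """|A ∩ B| / |A ∪ B| over unique features."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty feature sets yield 0
    return len(a & b) / len(union)


small = {"x", "y"}
large = {"x", "y", "z", "w"}
print(feature_intersection_to_union_ratio(small, large))  # 0.5
print(feature_intersection_to_union_ratio(large, large))  # 1.0
```

Unlike a ratio normalized by the smaller set, this value reaches 1 only when both feature sets are identical, so a document fully contained in a much larger one still scores low.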

features

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

A feature source for both documents. Only top-level, unique features are counted (sets). Composite features are not flattened.

Consumers of pairwise​Similarity:​*

The following stages and components take pairwise​Similarity:​* as input:

Stage or component    Property
documentOverlap       pairwiseSimilarity