pairwiseSimilarity
pairwiseSimilarity:* components define how the similarity between two documents is computed in duplicate detection and document overlap highlighting.
The following pairwiseSimilarity:* stage types are available for use in analysis request JSONs:
- pairwiseSimilarity:documentOverlapRatio
  For two documents A and B, computes the ratio of text spans that are duplicated in both documents.
- pairwiseSimilarity:featureIntersectionMinRatio
  For two documents with feature sets A and B, returns the count of features present in both A and B, divided by the number of features in the smaller set: |A ∩ B| / min(|A|, |B|).
- pairwiseSimilarity:featureIntersectionSize
  For two documents with feature sets A and B, returns the count of features present in both A and B (|A ∩ B|).
- pairwiseSimilarity:featureIntersectionToUnionRatio
  For two documents with feature sets A and B, returns the count of features present in both A and B, divided by the total number of unique features: |A ∩ B| / |A ∪ B|.
- pairwiseSimilarity:reference
  References a pairwiseSimilarity:* component defined in the request or in the project's default components.
pairwise​Similarity:​document​Overlap​Ratio
For two documents A and B, computes the ratio of text spans that are duplicated in both documents. This is a computationally expensive similarity metric that is best used as a final step in duplicate detection. This is also the similarity metric used internally by the text overlap stage.
{
  "type": "pairwiseSimilarity:documentOverlapRatio",
  "allowedGapRatio": 0,
  "computeDifferences": false,
  "crossFieldOverlaps": true,
  "fields": {
    "type": "fields:reference",
    "auto": true
  },
  "ngramWindow": 6
}
Document overlap ratio splits the text in both documents into smaller fragments and then looks for copies of these fragments in both documents. All shared fragments are then aggregated into contiguous regions that are considered to be an overlap between the two documents.
A ratio of 1 means one document contains all text passages copied from the other document. Note that text spans are not necessarily ordered so the text may not be identical, even if the ratio is 1.
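The fragment-and-aggregate idea described above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names shared_ranges and coverage are hypothetical, tokenization is a plain whitespace split, and normalizing by the length of the first document is an assumption, since the exact normalization is not stated here.

```python
def shared_ranges(a_words, b_words, n=6):
    """Return contiguous [start, end) word ranges of a_words that are
    covered by word n-grams also occurring somewhere in b_words."""
    # All word n-grams of the second document (empty if it is shorter than n).
    b_ngrams = {tuple(b_words[i:i + n]) for i in range(len(b_words) - n + 1)}
    ranges = []
    for i in range(len(a_words) - n + 1):
        if tuple(a_words[i:i + n]) in b_ngrams:
            start, end = i, i + n
            if ranges and start <= ranges[-1][1]:
                # Overlaps or touches the previous range: extend it.
                ranges[-1][1] = max(ranges[-1][1], end)
            else:
                ranges.append([start, end])
    return [tuple(r) for r in ranges]


def coverage(a_words, b_words, n=6):
    """Fraction of a_words covered by shared ranges (assumed normalization)."""
    if not a_words:
        return 0.0
    covered = sum(end - start for start, end in shared_ranges(a_words, b_words, n))
    return covered / len(a_words)


a = "the quick brown fox jumps over the lazy dog".split()
b = "a quick brown fox jumps high".split()
print(shared_ranges(a, b, n=3))  # [(1, 5)]

# Reordered passages still give full coverage, although the texts differ.
x = "x y z p q r".split()
y = "p q r x y z".split()
print(coverage(x, y, n=3))  # 1.0
```

The second call shows why a ratio of 1 does not imply identical text: every passage of one document is found in the other, only in a different order.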
allowed​Gap​Ratio
The maximum allowed ratio, measured in characters, of non-overlapping "gaps" between consecutive text spans. If the length of a gap, divided by the total length of its neighboring text spans, is smaller than this number, the neighboring spans are concatenated into a single overlapping span.
By default the allowed gap ratio is zero, meaning no concatenations are allowed.
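As an illustration of the gap rule, here is a minimal Python sketch. The function merge_spans is hypothetical, spans are assumed to be non-overlapping (start, end) character offsets, and "total length of neighboring text spans" is interpreted as the sum of the lengths of the two spans around the gap, with an already merged span counting as the left neighbor. The actual implementation may differ in these details.

```python
def merge_spans(spans, allowed_gap_ratio=0.0):
    """Concatenate consecutive spans whose separating gap is small
    relative to the combined length of the two neighboring spans."""
    merged = []
    for start, end in sorted(spans):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            neighbors = (prev_end - prev_start) + (end - start)
            # Strict comparison: with a ratio of 0 nothing is concatenated.
            if neighbors > 0 and gap / neighbors < allowed_gap_ratio:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


spans = [(0, 40), (42, 80), (120, 130)]
print(merge_spans(spans, allowed_gap_ratio=0.05))  # [(0, 80), (120, 130)]
print(merge_spans(spans, allowed_gap_ratio=0.0))   # [(0, 40), (42, 80), (120, 130)]
```

In the first call the 2-character gap is small compared with the 78 characters of its neighbors, so the two spans are joined, while the 40-character gap before the last span is too large.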
compute​Differences
When set to true, the similarity is inverted and carries the ratio of different (non-overlapping) regions between the two documents.
This option can be used to compute fragmentsInFields highlights to visually inspect differences between text values. Note that this option cannot be used to retrieve alignedFragments in overlap analysis (because it computes different text fragments, not identical text fragments).
cross​Field​Overlaps
If set to true, the duplicated text spans are found across all fields. Otherwise, each field is treated separately and the result is averaged.
fields
Text fields to use for computing overlaps. If more than one text field is used, the overlap between pairs of corresponding fields is averaged, unless cross field overlap is enabled.
ngram​Window
The length of the word n-gram window used to detect overlapping text spans (these are aggregated into overlap ranges if they're contiguous). Setting the n-gram window too low may result in false positives, because short word sequences are more likely to occur in both documents by chance.
pairwise​Similarity:​feature​Intersection​Min​Ratio
For two documents with feature sets A and B, returns the count of features present in both A and B, divided by the number of features in the smaller set: |A ∩ B| / min(|A|, |B|).
{
  "type": "pairwiseSimilarity:featureIntersectionMinRatio",
  "features": {
    "type": "featureSource:reference",
    "auto": true
  }
}
This similarity is useful for validation of "relative containment" of features from one document in another. For example, one could use it to detect documents in which the set of indexed labels constitutes a 90% to 99% subset of another document's labels, indicating very similar (but not identical) content.
The output value is a real number in the range [0, 1].
features
A feature source for both documents. Only top-level, unique features are counted (sets). Composite features are not flattened.
pairwise​Similarity:​feature​Intersection​Size
For two documents with feature sets A and B, returns the count of features present in both A and B (|A ∩ B|).
{
  "type": "pairwiseSimilarity:featureIntersectionSize",
  "features": {
    "type": "featureSource:reference",
    "auto": true
  }
}
This similarity is useful for validation of absolute feature counts (for example, the absolute number of shared whole field values or of duplicate sentences).
The output value is an integer in the range of [0, ∞).
features
A feature source for both documents. Only top-level, unique features are counted (sets). Composite features are not flattened.
pairwise​Similarity:​feature​Intersection​To​Union​Ratio
For two documents with feature sets A and B, returns the count of features present in both A and B, divided by the number of all unique features: |A ∩ B| / |A ∪ B|.
{
  "type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
  "features": {
    "type": "featureSource:reference",
    "auto": true
  }
}
This similarity is useful for validation of "absolute intersection" of features between two documents.
The output value is a real number in the range [0, 1].
features
A feature source for both documents. Only top-level, unique features are counted (sets). Composite features are not flattened.
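To compare the three feature-based similarities on the same input, the following Python sketch computes each of them for one pair of feature sets. The function names are hypothetical and only mirror the component names. Returning 0.0 when a set is empty is an assumption, since the behavior for empty feature sets is not described here.

```python
def intersection_size(a, b):
    """Count of unique features present in both collections."""
    return len(set(a) & set(b))


def intersection_min_ratio(a, b):
    """Shared features divided by the size of the smaller set."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0  # assumed empty-set behavior


def intersection_to_union_ratio(a, b):
    """Shared features divided by the number of all unique features."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0  # assumed empty-set behavior


# Every feature of doc_b also occurs in doc_a, which has two more.
doc_a = ["apple", "banana", "cherry", "date"]
doc_b = ["banana", "cherry"]

print(intersection_size(doc_a, doc_b))            # 2
print(intersection_min_ratio(doc_a, doc_b))       # 1.0
print(intersection_to_union_ratio(doc_a, doc_b))  # 0.5
```

The min ratio reaches 1 as soon as the smaller set is fully contained in the larger one, while the intersection-to-union ratio also penalizes the features that only one document has.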
Consumers of pairwiseSimilarity:*
The following stages and components take pairwiseSimilarity:* as input:
Stage or component | Property |
---|---|
document​Overlap | pairwise​Similarity |