documentPairs
Document pairs stages produce a stream of paired documents. Such pairs can be the output of complex computations or a simple Cartesian set of all possibilities. For example, the documentPairs:duplicates stage finds pairs of similar (or dissimilar) documents (as explained in the duplicate detection chapter), while the documentPairs:all stage can be useful to produce a report of overlapping regions among all pairs of documents in a (small) scope.
The following documentPairs:* stage types are available for use in analysis request JSONs:
- documentPairs:all: A stream of all pairs of documents in the referenced document scope.
- documentPairs:duplicates: A stream of pairs of similar (or dissimilar) documents in the referenced document scope.
- documentPairs:reference: References the results of another documentPairs:* stage defined in the request.
The common JSON output of any documentPairs stage is an array of objects, each having a pair property with an array of two (internal) document identifiers. Here is an example:
[
  {
    "pair" : [ 1627, 1848 ]
  },
  {
    "pair" : [ 1838, 1988 ]
  },
  {
    ...
  }
]
Because this output contains internal document identifiers, it is typically used together with another stage or component that can accept a stream of pairs and convert it to some more useful output. Examples of such components include the document overlap stage or the document content stage.
document​Pairs:​all
Produces a stream of all unique pairs of documents in the provided document set.
{
  "type": "documentPairs:all",
  "documents": {
    "type": "documents:reference",
    "auto": true
  }
}
The returned set contains unique (unordered) pairs of identifiers.
The number of unique pairs in a set of n documents is n(n-1)/2. Passing a large set of documents to this stage may return a very large result.
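To get a feel for how quickly the output grows, the following sketch enumerates unordered pairs for a small, made-up list of identifiers and checks the count against the n(n-1)/2 formula. The identifiers and the helper name unique_pairs are illustrative assumptions; the stage itself computes pairs internally.

```python
from itertools import combinations
from math import comb


def unique_pairs(doc_ids):
    """Return all unordered pairs of distinct identifiers (hypothetical helper)."""
    return list(combinations(sorted(set(doc_ids)), 2))


doc_ids = [1627, 1848, 1838, 1988]  # made-up internal identifiers
pairs = unique_pairs(doc_ids)
n = len(doc_ids)

# The enumerated count agrees with n(n-1)/2.
assert len(pairs) == n * (n - 1) // 2 == comb(n, 2)

# Growth is quadratic: 100,000 documents yield close to 5 billion pairs.
large_n = 100_000
print(large_n * (large_n - 1) // 2)
```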
documents
A reference to a documents:* component providing the source documents for pairs.
document​Pairs:​duplicates
Produces a stream of pairs of documents that fulfill the provided similarity thresholds and have at least one hash collision on the provided features. This stage is typically used to find pairs of identical or very similar documents. We discuss its internal workings and provide examples of full configuration arrangements in a dedicated chapter talking specifically about duplicate detection.
{
  "type": "documentPairs:duplicates",
  "documentPairFilter": {
    "countCondition": "ONE_OR_MORE",
    "query": {
      "type": "query:all"
    }
  },
  "hashGrouping": {
    "features": {
      "type": "featureSource:reference",
      "auto": true
    },
    "pairing": {
      "maxHashBitsDifferent": 0,
      "maxHashGroupSize": 200
    }
  },
  "output": {
    "diagnostics": true,
    "explanations": false
  },
  "query": null,
  "validation": {
    "debug": false,
    "max": 1.7976931348623157e+308,
    "min": 0,
    "pairwiseSimilarity": {
      "type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
      "features": {
        "type": "featureSource:reference",
        "auto": true
      }
    }
  },
  "validationFilters": []
}
The output of the documentPairs:duplicates stage contains additional properties for each pair of documents.
[
  {
    "pair" : [ 1627, 1848 ],
    "similarity" : 1.0,
    "explanation" : "Documents share 4 out of 4 distinct features (flattened composites of words from fields: title, summary)."
  },
  {
    ...
  }
]
The similarity property contains the value returned by the pairwise similarity function. The explanation property describes how the similarity function computed that value and is intended for debugging. It is an optional property that needs to be enabled using the explanations configuration option.
Note that the documentPairs:duplicates stage is also a document selector: the unique set of document identifiers present in the output pairs can be consumed by any component accepting a documents:* reference. Alternatively, an explicit documents:fromDocumentPairs adapter can be used to transform document pairs into a document set.
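Conceptually, turning pairs into a document set means collecting the distinct identifiers that occur in any pair. The sketch below shows this on a made-up response; the function name documents_from_pairs and the similarity values are assumptions for illustration only, since the actual conversion happens inside the engine.

```python
def documents_from_pairs(pairs_output):
    """Collect the distinct document identifiers that occur in any pair."""
    ids = set()
    for entry in pairs_output:
        ids.update(entry["pair"])
    return sorted(ids)


# Made-up stage output in the shape shown earlier in this chapter.
pairs_output = [
    {"pair": [1627, 1848], "similarity": 1.0},
    {"pair": [1838, 1988], "similarity": 0.8},
    {"pair": [1627, 1988], "similarity": 0.75},
]

doc_ids = documents_from_pairs(pairs_output)
print(doc_ids)  # each identifier appears once, even if it occurs in several pairs
```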
The remaining part of this chapter contains the documentation for configuration options.
document​Pair​Filter
A filter to narrow down the resulting set of pairs prior to computing validation scores. The filter consists of a set of documents (sub-scope) and the count condition. If one document from the candidate pair is present in the sub-scope, the count is one. If both documents are present, the count is two. If no documents are present in the sub-scope, the count is zero. The count condition determines whether the pair should be accepted or rejected, depending on the document count in sub-scope.
This property can be used to search for similarities across a larger set of documents while restricting the output to pairs that involve a smaller set. For example, one could search for similar article abstracts but keep only those pairs where at least (or exactly) one document was published in 2021 or authored by a particular person.
count​Condition
Count condition for the number of documents in the pair present in the scope.
The countCondition property supports the following values:
- ZERO: The pair contains exactly zero documents from the filter query. The filter query acts as an exclusion filter.
- ONE: The pair contains exactly one document from the filter query. This condition can be used to select pairs of documents from exclusively different sets. For example, pairs of similar documents where one document references a particular author (the other document will not reference that author because the count condition must be exactly 1).
- ONE_OR_MORE: The pair contains one or two documents from the filter query. This condition can be used to select pairs of documents where at least one document belongs to a particular set. Contrary to the ONE condition, both documents from the pair can be present in the filter query.
- TWO: The pair contains exactly two documents from the filter query. This condition is rarely used; the same restriction should instead be expressed by adding a filtering clause to the duplicate scope query.
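The counting rule and the four conditions can be sketched in a few lines. This is only an illustration of the semantics described above: the sub-scope is modelled as a plain set of made-up identifiers, and the helper names are hypothetical.

```python
def count_in_scope(pair, scope):
    """Number of documents from the pair (0, 1 or 2) present in the sub-scope."""
    return sum(1 for doc_id in pair if doc_id in scope)


# One predicate per countCondition value.
CONDITIONS = {
    "ZERO": lambda count: count == 0,
    "ONE": lambda count: count == 1,
    "ONE_OR_MORE": lambda count: count >= 1,
    "TWO": lambda count: count == 2,
}


def accept(pair, scope, count_condition):
    """True if the pair passes the filter under the given count condition."""
    return CONDITIONS[count_condition](count_in_scope(pair, scope))


scope = {1627, 1838, 1988}  # made-up documents matched by the filter query
candidates = [(1627, 1848), (1838, 1988), (2001, 2002)]

accepted = {
    name: [pair for pair in candidates if accept(pair, scope, name)]
    for name in CONDITIONS
}
for name, pairs in accepted.items():
    print(name, pairs)
```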
query
The query that provides the set of documents for this pair filter.
hash​Grouping
The initial phase, which finds candidate pairs of documents with at least one collision on any document-derived feature.
A feature source should provide a set of features for each document returned by the query. Numeric hashes of these features are then aggregated, and any hash collision points at documents that, with high probability, share the same feature (or features).
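The idea of grouping by feature hash can be sketched as an inverted index from hash to documents, with every group of two or more documents contributing candidate pairs. This is a simplified illustration, not the engine's implementation: CRC32 is an arbitrary stand-in hash, and the features and identifiers are made up.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def feature_hash(feature):
    # CRC32 is only a stand-in; the hash used by the engine is not specified here.
    return zlib.crc32(feature.encode("utf-8"))


def candidate_pairs(features_by_doc):
    """Pairs of documents that share at least one feature hash."""
    groups = defaultdict(set)
    for doc_id, features in features_by_doc.items():
        for feature in features:
            groups[feature_hash(feature)].add(doc_id)
    pairs = set()
    for docs in groups.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs


features_by_doc = {
    1627: {"deep learning", "image search"},
    1848: {"deep learning", "image search"},
    1838: {"graph databases"},
    1988: {"image search", "query planning"},
}

candidates = sorted(candidate_pairs(features_by_doc))
print(candidates)
```

Document 1838 shares no feature with the others, so it never appears in a candidate pair and is never passed to the more expensive validation phase.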
High performance of this phase is critical for the overall performance and depends on:
- uniqueness of features,
- feature extraction cost,
- overall number of different features (hashes).
There are several feature sources to choose from. They vary in computational cost and can produce a different number of features for the same input. For example, sentence-based features are much faster to compute than term n-gram features and their number is much smaller for the same input (n-grams overlap, sentences don't). That said, the choice of the feature source depends heavily on the task, so none of them is universally better than the others. Ideally, one should try to assemble a feature source where features with identical hashes are shared only by documents that will eventually be considered duplicates.
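The difference in feature counts is easy to see on a toy input. The two extractors below are deliberately naive approximations (regular-expression sentence splitting and word 3-grams) and are not the feature sources shipped with the product.

```python
import re


def sentence_features(text):
    """One feature per sentence; sentences do not overlap."""
    return {s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()}


def term_ngram_features(text, n=3):
    """One feature per window of n consecutive terms; windows overlap."""
    terms = re.findall(r"\w+", text.lower())
    return {" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)}


text = (
    "Duplicate detection starts with features. "
    "Features are hashed and grouped. "
    "Collisions yield candidate pairs."
)

sentences = sentence_features(text)
ngrams = term_ngram_features(text)
print(len(sentences), len(ngrams))
```

For this 14-term input, the sentence extractor yields 3 features while the 3-gram extractor yields 12, which illustrates why n-gram sources produce far more hashes to group.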
features
The source of features for hash collision grouping. See general remarks under hash grouping for more information.
pairing
Configures hash-grouping conditions and limits. This section is used to prevent poor features from blowing up pairwise similarity computation costs and to set up maximum bit difference thresholds used in tandem with the simhash feature source.
max​Hash​Bits​Different
This parameter makes sense only in combination with the top-level simhash feature source. It specifies the maximum number of bits by which two feature hashes may differ and still be considered conflicting.
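The number of differing bits between two hashes is their Hamming distance. A minimal sketch of the comparison follows, assuming integer hash values; the function names and the 8-bit sample hashes are illustrative only.

```python
def hamming_distance(hash_a, hash_b):
    """Number of bit positions at which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def hashes_conflict(hash_a, hash_b, max_hash_bits_different=0):
    """True if the hashes differ by at most the allowed number of bits."""
    return hamming_distance(hash_a, hash_b) <= max_hash_bits_different


a = 0b1011_0110
b = 0b1011_0011  # differs from a in two bit positions

print(hamming_distance(a, b))
print(hashes_conflict(a, b))                             # exact match required
print(hashes_conflict(a, b, max_hash_bits_different=2))  # near match allowed
```

With the value of 0 used in the sample configuration, only identical hashes conflict; larger values admit near-identical simhash values as well.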
max​Hash​Group​Size
Maximum size of a candidate document group sharing the same feature hash. For example, a term-based feature source could return a feature based on a very frequent term, shared by all documents. This would cause all the documents sharing the hash for that feature to be pairwise-compared, leading to quadratic growth in processing time. The algorithm will omit any such features, assuming other, more unique features will still be present to trigger a hash collision for similar document pairs.
The number of skipped groups is returned as part of the log from the analysis. A high number of skipped groups may indicate a problem with the feature source (features are not unique enough).
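The effect of the group size limit can be sketched as follows. The group contents are made up, and treating a group as oversized when it is strictly larger than the limit is an assumption; the exact boundary behavior is not stated here.

```python
def pairs_in_group(size):
    """Number of pairwise comparisons a group of this size would trigger."""
    return size * (size - 1) // 2


def split_groups(groups, max_hash_group_size=200):
    """Keep groups within the limit and count the ones that are skipped."""
    kept, skipped = {}, 0
    for feature_hash, docs in groups.items():
        if len(docs) > max_hash_group_size:  # assumed: strictly larger is skipped
            skipped += 1
        else:
            kept[feature_hash] = docs
    return kept, skipped


groups = {
    "rare-feature": set(range(3)),        # 3 documents share this hash
    "frequent-term": set(range(10_000)),  # nearly every document shares this one
}

kept, skipped = split_groups(groups)
print(sorted(kept), skipped)
print(pairs_in_group(3), pairs_in_group(10_000))
```

Skipping the single oversized group avoids almost 50 million comparisons, while the small group still produces its 3 candidate pairs.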
output
This element configures how the output pairs are emitted and the additional properties they contain.
diagnostics
If set to true, the response will include additional diagnostic information about each phase of duplicate detection. This will include the number of hashes, hash conflicts and other hints that help diagnose problems.
This property should be used as a debugging tool; it is not meant for production use.
explanations
If set to true, each output pair will contain a human-readable explanation of how the pairwise similarity value was computed. This explanation is typically straightforward but may become quite involved for complex nested definitions of similarity.
This property should be used as a debugging tool; it is not meant for production use.
limit
Specifies the maximum number of pairs to be returned. If defined, the output is limited to this number, even if more pairs satisfying the pairwise similarity have been found.
By default the output is not limited.
query
A query:* component defining all the input documents that are passed to the hash grouping phase. This is the set of all documents among which pairs of duplicates should be found, unless a secondary documentPairFilter is used to prune certain pairs.
validation
Configuration of the second phase of duplicate document detection, which measures the similarity between candidate document pairs that were found in hash grouping and passed the pair filter.
validation​Filters
Validation filters are an advanced technique for speeding up pairwise similarity computation. If one or more cheap lower bound approximations of the final similarity can be defined, candidate document pairs can be filtered out before the final, more costly, pairwise similarity is computed.
Validation filters are an array of zero or more pairwise similarity components. If any of the filters rejects a document pair, its evaluation stops early and the final similarity is never computed for that pair.
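The early-exit behavior can be sketched with one cheap check placed in front of a costlier similarity. Everything here is an illustrative assumption rather than the engine's logic: the cheap check is a set-size ratio, chosen because the intersection-to-union ratio can never exceed it, and features are single letters.

```python
def jaccard(a, b):
    """Intersection-to-union ratio of two feature sets (the costly check here)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def size_ratio(a, b):
    """Cheap check: Jaccard can never exceed min(|a|, |b|) / max(|a|, |b|)."""
    larger = max(len(a), len(b))
    return min(len(a), len(b)) / larger if larger else 0.0


def validate(a, b, filters, final_similarity, minimum):
    """Return the final similarity, or None if the pair is rejected."""
    for cheap_similarity in filters:
        if cheap_similarity(a, b) < minimum:
            return None  # rejected early; the final similarity is never computed
    score = final_similarity(a, b)
    return score if score >= minimum else None


final_calls = []  # records how often the costly function actually runs


def counted_jaccard(a, b):
    final_calls.append((a, b))
    return jaccard(a, b)


long_doc, short_doc = set("abcdefgh"), {"a"}  # very different sizes
left, right = set("abcd"), set("abce")        # same size, 3 shared features

rejected = validate(long_doc, short_doc, [size_ratio], counted_jaccard, minimum=0.5)
accepted = validate(left, right, [size_ratio], counted_jaccard, minimum=0.5)
print(rejected, accepted, len(final_calls))
```

The first pair is dropped by the size check alone, so the costly function runs only once, for the second pair.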
validation​Criteria
Validation criteria combine the definition of pairwise similarity between documents and the allowed range for this similarity (minimum and maximum values).
{
  "debug": false,
  "max": 1.7976931348623157e+308,
  "min": 0,
  "pairwiseSimilarity": {
    "type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
    "features": {
      "type": "featureSource:reference",
      "auto": true
    }
  }
}
This type is most typically used in the validation step of duplicate detection. Additional validation criteria can also be used to speed up processing by means of validation filters.
debug
Emits additional debugging information for each document pair in the output. The format and structure of the explanation are subject to change without notice.
max
Limits the returned document pairs to those with a similarity less than or equal to this value.
min
Limits the returned document pairs to those with a similarity greater than or equal to this value.
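Both bounds are inclusive, and the max value shown in the sample JSON is the largest finite double, which effectively means "no upper limit". The sketch below applies such a range to made-up similarity values; the helper name within_range is hypothetical.

```python
import sys

DEFAULT_MIN = 0.0
DEFAULT_MAX = sys.float_info.max  # 1.7976931348623157e+308, as in the sample JSON


def within_range(similarity, minimum=DEFAULT_MIN, maximum=DEFAULT_MAX):
    """True if the similarity lies within the inclusive [minimum, maximum] range."""
    return minimum <= similarity <= maximum


# Made-up similarity values keyed by document pair.
scores = {(1627, 1848): 1.0, (1838, 1988): 0.8, (1627, 1988): 0.35}

# A high minimum keeps near-duplicates; a low maximum keeps dissimilar pairs.
near_duplicates = [pair for pair, s in scores.items() if within_range(s, minimum=0.8)]
dissimilar = [pair for pair, s in scores.items() if within_range(s, maximum=0.35)]
print(near_duplicates, dissimilar)
```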
pairwise​Similarity
Any pairwise similarity function that returns a numeric value of similarity between two document candidates.
Ideally, the function should be fast to compute, especially if the number of feature collisions (document candidate pairs) is high. If the similarity is costly, consider adding one or more fast validation filters to limit the number of pairs to which the full similarity check is applied.
Consumers of documentPairs:*
The following stages and components take documentPairs:* as input:
Stage or component | Property |
---|---|
document​Overlap | document​Pairs |
documents:​from​Document​Pairs | document​Pairs |