documentPairs

Document pairs stages produce a stream of paired documents. Such pairs can be the output of complex computations or a simple Cartesian set of all possibilities. For example, the documentPairs:duplicates stage finds pairs of similar (or dissimilar) documents (as explained in the duplicate detection chapter), while the documentPairs:all stage can be useful for producing a report of overlapping regions among all pairs of documents in a (small) scope.

The following document​Pairs:​* stage types are available for use in analysis request JSONs:

document​Pairs:​all

A stream of all pairs of documents in the referenced document scope.

document​Pairs:​duplicates

A stream of pairs of similar (or dissimilar) documents in the referenced document scope.


document​Pairs:​reference

References the results of another document​Pairs:​* stage defined in the request.


The common JSON output of any document​Pairs stage is an array of objects, each having a pair property with an array of two (internal) document identifiers. Here is an example:

[
  {
    "pair" : [ 1627, 1848 ]
  },
  {
    "pair" : [ 1838, 1988 ]
  },
  {
    ...
  }
]

Because this output contains internal document identifiers, it is typically used together with another stage or component that can accept a stream of pairs and convert it to some more useful output. Examples of such components include the document overlap stage or the document content stage.

document​Pairs:​all

Produces a stream of all unique pairs of documents in the provided document set.

{
  "type": "documentPairs:all",
  "documents": {
    "type": "documents:reference",
    "auto": true
  }
}

The returned set contains unique (unordered) pairs of identifiers.

Heads up, high complexity warning

The number of unique pairs in a set of n documents is n(n-1)/2. Passing a large set of documents to this stage may return a very large result.
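To get a feel for how quickly the result grows, the formula can be checked with a short Python sketch. This is a standalone illustration, not part of the analysis API; the function name unique_pair_count is made up for this example.

```python
from itertools import combinations


def unique_pair_count(n: int) -> int:
    """Number of unordered pairs among n documents: n(n-1)/2."""
    return n * (n - 1) // 2


# Cross-check the formula against explicit enumeration for a small set.
doc_ids = [10, 11, 12, 13, 14]
pairs = list(combinations(doc_ids, 2))
assert len(pairs) == unique_pair_count(len(doc_ids))

# The count grows quadratically with the number of documents.
for n in (100, 10_000, 1_000_000):
    print(n, unique_pair_count(n))
```

A scope of 10,000 documents already yields 49,995,000 pairs, which is why this stage is best kept to small scopes.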

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

A reference to a documents:* component providing the source documents for pairs.

document​Pairs:​duplicates

Produces a stream of pairs of documents that fulfill the provided similarity thresholds and have at least one hash collision on the provided features. This stage is typically used to find pairs of identical or very similar documents. We discuss its internal workings and provide examples of full configuration arrangements in a dedicated chapter talking specifically about duplicate detection.

{
  "type": "documentPairs:duplicates",
  "documentPairFilter": {
    "countCondition": "ONE_OR_MORE",
    "query": {
      "type": "query:all"
    }
  },
  "hashGrouping": {
    "features": {
      "type": "featureSource:reference",
      "auto": true
    },
    "pairing": {
      "maxHashBitsDifferent": 0,
      "maxHashGroupSize": 200
    }
  },
  "output": {
    "diagnostics": true,
    "explanations": false
  },
  "query": null,
  "validation": {
    "debug": false,
    "max": 1.7976931348623157e+308,
    "min": 0,
    "pairwiseSimilarity": {
      "type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
      "features": {
        "type": "featureSource:reference",
        "auto": true
      }
    }
  },
  "validationFilters": []
}

The output of the document​Pairs:​duplicates stage contains additional properties for each pair of documents.

[
  {
    "pair" : [ 1627, 1848 ],
    "similarity" : 1.0,
    "explanation" : "Documents share 4 out of 4 distinct features (flattened composites of words from fields: title, summary)."
  },
  {
    ...
  }
]

The similarity property contains the value returned by the pairwise similarity function. The explanation property describes how the similarity function computed that value and is intended for debugging. It is optional and must be enabled using the explanations configuration option.

Note that the documentPairs:duplicates stage is also a document selector. The unique set of document identifiers present in the output pairs can be consumed by any component accepting a documents:* reference. Alternatively, an explicit documents:fromDocumentPairs adapter can be used to transform document pairs into a document set.
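What "unique set of document identifiers" means can be shown with a small client-side Python sketch. It only illustrates the concept on a response in the format shown above; the sample values and the helper name unique_document_ids are assumptions, and the engine performs this conversion internally.

```python
import json

# Sample response in the output format shown above (values are illustrative).
response = json.loads("""
[
  {"pair": [1627, 1848], "similarity": 1.0},
  {"pair": [1838, 1988], "similarity": 0.8},
  {"pair": [1627, 1988], "similarity": 0.75}
]
""")


def unique_document_ids(pairs):
    """Collect the distinct document identifiers occurring in any pair."""
    ids = set()
    for entry in pairs:
        ids.update(entry["pair"])
    return sorted(ids)


print(unique_document_ids(response))
```

Document 1627 occurs in two pairs but appears only once in the resulting set.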

The remaining part of this chapter contains the documentation for configuration options.

document​Pair​Filter

Type
object
Default
{
  "query": {
    "type": "query:all"
  },
  "countCondition": "ONE_OR_MORE"
}
Required
no

A filter to narrow down the resulting set of pairs prior to computing validation scores. The filter consists of a set of documents (sub-scope) and the count condition. If one document from the candidate pair is present in the sub-scope, the count is one. If both documents are present, the count is two. If no documents are present in the sub-scope, the count is zero. The count condition determines whether the pair should be accepted or rejected, depending on the document count in sub-scope.

This property can be used to search for similarities across a larger set of documents while filtering the output to pairs that involve a smaller set. For example, one can search for similar article abstracts across the whole collection but keep only those pairs in which at least (or exactly) one document was published in 2021 or authored by a particular person.

count​Condition

Type
string
Default
"ONE_OR_MORE"
Constraints
one of [ZERO, ONE, ONE_OR_MORE, TWO]
Required
no

Count condition for the number of documents in the pair present in the scope.

The count​Condition property supports the following values:

Z​E​R​O

The pair contains exactly zero documents from the filter query. The filter query acts as an exclusion filter.

O​N​E

The pair contains exactly one document from the filter query. This condition can be used to select pairs of documents from exclusively different sets. For example, pairs of similar documents where one document references a particular author (the other document will not reference that author because the count condition must be exactly 1).

O​N​E_​O​R_​M​O​R​E

The pair contains one or two documents from the filter query. This condition can be used to select pairs of documents where at least one document belongs to a particular set. Contrary to the O​N​E condition, both documents from the pair can be present in the filter query.

T​W​O

The pair contains exactly two documents from the filter query. This condition is rarely used and should be expressed by adding a filtering clause to the duplicate scope query.
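The four count conditions can be summarized in a few lines of Python. This is a sketch of the semantics described above, not the engine's implementation; the function name pair_accepted and the sample identifiers are hypothetical.

```python
def pair_accepted(pair, sub_scope, count_condition="ONE_OR_MORE"):
    """Return True if the pair satisfies the count condition.

    pair: tuple of two document identifiers.
    sub_scope: set of identifiers matched by the filter query.
    """
    count = sum(1 for doc_id in pair if doc_id in sub_scope)
    if count_condition == "ZERO":
        return count == 0
    if count_condition == "ONE":
        return count == 1
    if count_condition == "ONE_OR_MORE":
        return count >= 1
    if count_condition == "TWO":
        return count == 2
    raise ValueError(f"Unknown count condition: {count_condition}")


sub_scope = {1, 2}  # e.g. documents matched by the filter query
candidates = [(1, 2), (1, 3), (3, 4)]

for condition in ("ZERO", "ONE", "ONE_OR_MORE", "TWO"):
    accepted = [p for p in candidates if pair_accepted(p, sub_scope, condition)]
    print(condition, accepted)
```

With this input, ZERO keeps only (3, 4), ONE keeps only (1, 3), ONE_OR_MORE keeps (1, 2) and (1, 3), and TWO keeps only (1, 2).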

query

Type
query
Default
{
  "type": "query:all"
}
Required
no

The query that provides the set of documents for this pair filter.

hash​Grouping

Type
object
Default
{
  "features": {
    "type": "featureSource:reference",
    "auto": true
  },
  "pairing": {
    "maxHashBitsDifferent": 0,
    "maxHashGroupSize": 200
  }
}
Required
no

The initial phase finding candidate pairs of documents with at least one collision on any document-derived feature.

A feature source should provide a set of features for each document returned by the query. Numeric hashes of these features are then aggregated; any hash collision points to documents that, with high probability, share the same feature (or features).

High performance of this phase is critical for the overall performance and depends on:

  • uniqueness of features,
  • feature extraction cost,
  • overall number of different features (hashes).

There are several feature sources to choose from. They vary in computational cost and can produce a different number of features for the same input. For example, sentence-based features are much faster to compute than term n-gram features and their number is much smaller for the same input (n-grams overlap, sentences don't). This said, the choice of the feature source depends heavily on the task so none of them are better than others. Ideally, one should try to assemble a feature source where features with identical hashes are shared only by documents that will eventually be considered duplicates.
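The idea of hash grouping can be sketched in Python as follows. This is a simplified illustration under stated assumptions: features are plain strings, CRC32 stands in for whatever hash function the engine actually uses, and the function name candidate_pairs is hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import zlib


def candidate_pairs(doc_features):
    """Group documents by feature hash and pair up documents that collide.

    doc_features: dict mapping document id -> iterable of feature strings.
    Returns a set of unordered (smaller_id, larger_id) candidate pairs.
    """
    groups = defaultdict(set)
    for doc_id, features in doc_features.items():
        for feature in features:
            # CRC32 is an assumption made for this sketch only.
            groups[zlib.crc32(feature.encode("utf-8"))].add(doc_id)

    pairs = set()
    for doc_ids in groups.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


docs = {
    1: ["the quick brown fox", "jumps over the lazy dog"],
    2: ["the quick brown fox", "sleeps all day"],
    3: ["something else entirely"],
}
print(candidate_pairs(docs))
```

Documents 1 and 2 share a sentence-like feature, so they become a candidate pair. Document 3 shares no feature with the others and is paired only in the unlikely event of an accidental hash collision. Candidate pairs still have to pass the validation phase before they are returned.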

features

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

The source of features for hash collision grouping. See general remarks under hash grouping for more information.

pairing

Type
object
Default
{
  "maxHashBitsDifferent": 0,
  "maxHashGroupSize": 200
}
Required
no

Configures hash-grouping conditions and limits. This section is used to prevent poor features from blowing up pairwise similarity computation costs and to set up maximum bit difference thresholds used in tandem with the simhash feature source.

max​Hash​Bits​Different
Type
integer
Default
0
Constraints
value >= 0 and value <= 5
Required
no

This parameter makes sense only if used in combination with the top-level simhash feature source and specifies the maximum number of bits by which feature hashes may differ in order to still be considered as conflicting.
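A minimal Python sketch of this comparison, assuming that "bits different" means the Hamming distance between two integer hashes (the usual interpretation for simhash). The function names and sample hash values are illustrative only.

```python
def bits_different(hash_a: int, hash_b: int) -> int:
    """Hamming distance: the number of bit positions where the hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def hashes_conflict(hash_a: int, hash_b: int, max_hash_bits_different: int = 0) -> bool:
    """True if the hashes are close enough to be treated as a collision."""
    return bits_different(hash_a, hash_b) <= max_hash_bits_different


a = 0b1011_0110
b = 0b1011_0100  # differs from a in one bit
c = 0b0011_0101  # differs from a in three bits

print(bits_different(a, b), bits_different(a, c))
print(hashes_conflict(a, b), hashes_conflict(a, b, max_hash_bits_different=1))
```

With the default of 0, only identical hashes conflict. Raising the threshold to 1 also treats a and b as conflicting, while a and c would need a threshold of at least 3.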

max​Hash​Group​Size
Type
integer
Default
200
Constraints
value >= 0
Required
no

Maximum size of a candidate document group sharing the same feature hash. For example, a term-based feature source could return a feature based on a very frequent term shared by all documents. All the documents sharing the hash for that feature would then be compared pairwise, and the number of comparisons grows quadratically with the group size. The algorithm will omit any such features, assuming that other, more unique features will still be present to trigger a hash collision for similar document pairs.

The number of skipped groups is returned as part of the log from the analysis. A high number of skipped groups may indicate a problem with the feature source (features are not unique enough).
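The effect of the group size limit can be sketched in Python as follows. This is an illustration only: the function name pairs_from_groups is hypothetical, and whether the engine skips groups strictly larger than the limit (as here) or at the limit is an assumption.

```python
from itertools import combinations


def pairs_from_groups(groups, max_hash_group_size=200):
    """Emit candidate pairs, skipping hash groups larger than the limit.

    groups: dict mapping feature hash -> set of document ids.
    Returns (set of candidate pairs, number of skipped groups).
    """
    pairs = set()
    skipped = 0
    for doc_ids in groups.values():
        if len(doc_ids) > max_hash_group_size:
            skipped += 1  # feature is too common to be worth comparing
            continue
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs, skipped


groups = {
    0xA1: {1, 2},               # distinctive feature shared by two documents
    0xB2: set(range(1, 1001)),  # very frequent feature shared by 1000 documents
}
pairs, skipped = pairs_from_groups(groups)
print(pairs, skipped)
```

Here the frequent feature is skipped, which avoids 1000 * 999 / 2 = 499,500 pairwise comparisons, while documents 1 and 2 are still paired through the distinctive feature.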

output

Type
object
Default
{
  "explanations": false,
  "diagnostics": true
}
Required
no

This element configures how the output pairs are emitted and the additional properties they contain.

diagnostics

Type
boolean
Default
true
Required
no

If set to true, the response will include additional diagnostic information about each phase of duplicate detection. This will include the number of hashes, hash conflicts and other hints that help diagnose problems.

This property should be used as a debugging tool. It is not meant for production use.

explanations

Type
boolean
Default
false
Required
no

If set to true, each output pair will contain a human-readable explanation of how the pairwise similarity value was computed. This explanation is typically straightforward but may become quite complex for complex nested definitions of similarity.

This property should be used as a debugging tool. It is not meant for production use.

limit

Type
integer
Default
undefined
Constraints
value >= 0
Required
no

Specifies the maximum number of pairs to be returned. If defined, the output is limited to this number, even if more pairs satisfying the pairwise similarity have been found.

By default the output is not limited.

query

Type
query
Default
null
Required
yes

A query:​* component defining all the input documents that are passed to the hash grouping phase. This is the set of all documents among which pairs of duplicates should be found, unless a secondary document​Pair​Filter is used to prune certain pairs.

validation

Type
validationCriteria
Default
{
  "min": 0,
  "max": 1.7976931348623157e+308,
  "debug": false,
  "pairwiseSimilarity": {
    "type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
    "features": {
      "type": "featureSource:reference",
      "auto": true
    }
  }
}
Required
no

Configuration of the second phase of duplicate document detection, measuring the similarity between any candidate document pairs found in hash grouping and passing the pair filter.

validation​Filters

Type
array of validationCriteria
Default
[]
Required
no

Validation filters are an advanced technique for speeding up pairwise similarity computation. If one or more cheap lower bound approximations of the final similarity can be defined, candidate document pairs can be filtered out before the final, more costly, pairwise similarity is computed.

Validation filters are an array of zero or more pairwise similarity components. If any of the filters rejects a document pair, its evaluation stops early and the final similarity is never computed for that pair.
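The early-rejection idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's implementation: the intersection-to-union ratio stands in for a costly final similarity, the size-ratio function is one possible cheap filter chosen for this example, and all names are hypothetical.

```python
def intersection_to_union(a, b):
    """Feature intersection-to-union ratio; stands in for a costly similarity."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def size_ratio(a, b):
    """Cheap filter: ratio of the smaller to the larger feature set size.

    The intersection cannot exceed the smaller set and the union cannot be
    smaller than the larger set, so this value is never below the ratio above.
    """
    small, large = sorted((len(set(a)), len(set(b))))
    return small / large if large else 0.0


def validate(a, b, filters, final_similarity, minimum):
    """Return the final similarity, or None if the pair is rejected."""
    for cheap in filters:
        if cheap(a, b) < minimum:
            return None  # rejected early; the final similarity is not computed
    score = final_similarity(a, b)
    return score if score >= minimum else None


short = {"x", "y"}
long_ = {"x", "y", "z", "w", "v", "u"}
print(validate(short, long_, [size_ratio], intersection_to_union, 0.5))

left = {"x", "y", "z"}
right = {"x", "y", "q"}
print(validate(left, right, [size_ratio], intersection_to_union, 0.5))
```

The first pair is rejected by the cheap filter alone (2/6 is below 0.5), so the final similarity is never evaluated for it. The second pair passes the filter and is accepted with a final similarity of 0.5.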

validation​Criteria

Validation criteria combine the definition of pairwise similarity between documents and the allowed range for this similarity (minimum and maximum values).

{
  "debug": false,
  "max": 1.7976931348623157e+308,
  "min": 0,
  "pairwiseSimilarity": {
    "type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
    "features": {
      "type": "featureSource:reference",
      "auto": true
    }
  }
}

This type is most typically used in the validation step of duplicate detection. Additional validation criteria can also be used to speed up processing using validation filters.
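How a similarity value and its allowed range work together can be sketched in Python. The formula used for the intersection-to-union ratio is an assumption inferred from the component's name and from the example explanation shown earlier ("share 4 out of 4 distinct features"); the function names and sample features are hypothetical.

```python
import sys

# The default "max" shown above equals the largest finite double-precision
# value, so by default there is effectively no upper limit.
DEFAULT_MAX = sys.float_info.max


def intersection_to_union(features_a, features_b):
    """Assumed formula: shared distinct features / all distinct features."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def in_range(similarity, minimum=0.0, maximum=DEFAULT_MAX):
    """Both bounds are inclusive, as described for min and max."""
    return minimum <= similarity <= maximum


doc_a = {"solar", "panel", "efficiency", "study"}
doc_b = {"solar", "panel", "efficiency", "study"}
doc_c = {"solar", "panel", "market", "report"}

print(intersection_to_union(doc_a, doc_b))  # 4 shared out of 4 distinct
print(intersection_to_union(doc_a, doc_c))  # 2 shared out of 6 distinct
print(in_range(intersection_to_union(doc_a, doc_c), minimum=0.5))
```

Raising min keeps only near-identical pairs, while lowering max below 1 is one way to look for dissimilar pairs among the hash-grouping candidates.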

debug

Type
boolean
Default
false
Required
no

Emits additional debugging information for each document pair in the output. The format and structure of this information are subject to change without notice.

max

Type
number
Default
1.7976931348623157e+308
Constraints
value >= 0
Required
no

Limits the returned document pairs to those with a similarity less than or equal to this value.

min

Type
number
Default
0
Constraints
value >= 0
Required
no

Limits the returned document pairs to those with a similarity greater than or equal to this value.

pairwise​Similarity

Type
pairwiseSimilarity
Default
{
  "type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
  "features": {
    "type": "featureSource:reference",
    "auto": true
  }
}
Required
no

Any pairwise similarity function that returns a numeric value of similarity between two document candidates.

Ideally, the function should be fast to compute, especially if the number of feature collisions (document candidate pairs) is high. If the similarity is costly, it is worth considering one or more fast validation filters to limit the number of pairs to which the full similarity check is applied.

Consumers of document​Pairs:​*

The following stages and components take document​Pairs:​* as input:

Stage or component             Property
documentOverlap                documentPairs
documents:fromDocumentPairs    documentPairs