documentOverlap
Document overlap analysis provides various reports about duplicate text regions in one or more pairs of documents.
Here is an example rendered output of a document overlap request in Lingo4G Explorer:
A complete example of how this functionality works and can be leveraged in applications is in the tutorial on highlighting duplicate document regions.
documentOverlap
Detects and returns contiguous duplicated fragments of text in one or more pairs of documents. The returned fragments are sorted by their length but may be unordered with respect to each other.
{
  "type": "documentOverlap",
  "alignedFragments": {
    "contextChars": 160,
    "fields": {
      "type": "contentFields:empty"
    },
    "maxFragments": "unlimited"
  },
  "documentPairs": {
    "type": "documentPairs:reference",
    "auto": true
  },
  "fragmentsInFields": {
    "contextChars": 160,
    "fields": {
      "type": "contentFields:empty"
    }
  },
  "pairwiseSimilarity": {
    "type": "pairwiseSimilarity:documentOverlapRatio",
    "allowedGapRatio": 0,
    "computeDifferences": false,
    "crossFieldOverlaps": true,
    "fields": {
      "type": "fields:reference",
      "auto": true
    },
    "ngramWindow": 6
  }
}
The output of this stage is an array of objects, each corresponding to one document pair sourced from documentPairs:
[
  {
    "pair": [ 1627, 1848 ],
    "stats": {
      "a": {
        "field": {
          "fragments": 2,
          "fieldLength": 80,
          "overlapLength": 43,
          "ratio": 0.5375
        }
      },
      "b": {
        "field": {
          "fragments": 2,
          "fieldLength": 84,
          "overlapLength": 43,
          "ratio": 0.5119047619047619
        }
      }
    },
    "alignedFragments": [
      ...
    ],
    "fragmentsInFields": {
    }
  }
  // ... more pairs
]
The pair property contains internal document identifiers. The stats property contains per-field information about the number of discovered duplicated text fragments, the total field value length and the overlap length (in characters). The alignedFragments and fragmentsInFields properties are optional output report sections; their structure is discussed with each corresponding configuration property.
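As a minimal sketch of consuming this output in an application, the Python snippet below flattens the stats section into one row per document, field and pair. The sample response reuses the values from the example above; the function name summarize_pairs is hypothetical, and recomputing the ratio as overlapLength divided by fieldLength is an assumption that happens to match the numbers in the example (43 / 80 = 0.5375).

```python
import json

# Sample response shaped like the example above (values copied from it).
response_text = """
[
  {
    "pair": [1627, 1848],
    "stats": {
      "a": {"field": {"fragments": 2, "fieldLength": 80,
                      "overlapLength": 43, "ratio": 0.5375}},
      "b": {"field": {"fragments": 2, "fieldLength": 84,
                      "overlapLength": 43, "ratio": 0.5119047619047619}}
    }
  }
]
"""


def summarize_pairs(pairs):
    """Return (doc_a, doc_b, side, field, ratio) rows, one per field entry."""
    rows = []
    for entry in pairs:
        doc_a, doc_b = entry["pair"]
        for side in ("a", "b"):
            for field, st in entry.get("stats", {}).get(side, {}).items():
                length = st["fieldLength"]
                # Assumption: ratio is overlapLength / fieldLength.
                ratio = st["overlapLength"] / length if length else 0.0
                rows.append((doc_a, doc_b, side, field, ratio))
    return rows


rows = summarize_pairs(json.loads(response_text))
for row in rows:
    print(row)
```

Rows of this shape are convenient for sorting pairs by their overlap ratio or for filtering out pairs below a chosen threshold.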
alignedFragments
Configures the report that displays the same duplicate fragments of text in the context of their occurrences in both documents. This view shows the different "context" of the same passage in both documents. Here is an example output for a single document pair:
"alignedFragments": [
  {
    "length": 584,
    "context": {
      "a": " ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\nfrequentist literature, little is known about the properties of such priors and\nthe convergence and concentration of the corresponding posterior…",
      "b": " ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\ncorresponding frequentist literature, very little is known about the properties\nof such priors. Focusing on a broad class of shrinkage priors, we…"
    }
  },
  {
    "length": 189,
    "context": {
      "a": "…this article, we propose a new class of Dirichlet--Laplace (DL) priors,\nwhich possess optimal posterior concentration and ⁌overlap⁍lead to efficient posterior\ncomputation exploiting results from normalized random measure theory. Finite\nsample performance of Dirichlet--Laplace priors relative to alternatives is\nassessed⁌\\overlap⁍ in simulated and real data examples.\n",
      "b": "…Bayesian\nLasso, are suboptimal in high-dimensional settings. A new class of Dirichlet\nLaplace (DL) priors are proposed, which are optimal and ⁌overlap⁍lead to efficient\nposterior computation exploiting results from normalized random measure theory.\nFinite sample performance of Dirichlet Laplace priors relative to alternatives\nis assessed⁌\\overlap⁍ in simulations.\n"
    }
  }
],
Each aligned fragment is an object with a length property containing the duplicated fragment's size in characters and a context sub-element for both documents (the constant properties "a" and "b" correspond to the first and second document of the pair). The context values in the above example contain an ellipsis character (…) at the position where the value was truncated and a pair of special markers that indicate the start and end of the fragment: ⁌overlap⁍ and ⁌\overlap⁍. These markers contain rarely used Unicode characters, so they are unlikely to appear anywhere in the text. They can be used directly for highlighting; here is how Lingo4G Explorer displays aligned fragments:
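Outside the Explorer, the same markers can be processed in application code. The Python sketch below converts a context value into an HTML snippet and extracts the raw duplicated text. It assumes the default markers are the word overlap wrapped in the characters U+204C and U+204D (with a backslash in the end marker); the function names and the choice of the mark element are hypothetical. Both markers are parameters, so custom markers can be passed instead.

```python
import html
import re

# Assumption: default markers are U+204C + "overlap" + U+204D, with a
# backslash before "overlap" in the end marker.
START = "\u204coverlap\u204d"
END = "\u204c\\overlap\u204d"


def to_html(context, start=START, end=END):
    """Escape the context for HTML and turn the markers into <mark> tags."""
    escaped = html.escape(context, quote=False)
    # The markers are escaped too, so markers such as ">>" still match.
    escaped = escaped.replace(html.escape(start, quote=False), "<mark>")
    return escaped.replace(html.escape(end, quote=False), "</mark>")


def extract_fragments(context, start=START, end=END):
    """Return the text found between each start/end marker pair."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, context, flags=re.DOTALL)


sample = "a < b " + START + "lead to efficient\nposterior" + END + " tail"
print(to_html(sample))
print(extract_fragments(sample))
```

Escaping the text before substituting the markers keeps any markup present in the field value from being interpreted by the browser.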
contextChars
The number of context characters to the left and right of the duplicated fragment.
fields
This element configures the truncation and highlight marker details for the report. The duplicated fragment of text from both documents is passed to the highlighting engine for formatting so any options applicable to highlighting can be used to format the report as well. For example, contentFields:grouped can be used to declare the maximum length of the quoted fragment, tune the highlight markers or the truncation character:
"fields": {
  "type": "contentFields:grouped",
  "groups": [
    {
      "fields": [
        "title", "abstract"
      ],
      "config": {
        "maxValues": 1,
        "maxValueLength": 600,
        "highlighting": {
          "startMarker": ">>",
          "endMarker": "<<"
        }
      }
    }
  ]
}
maxFragments
The maximum number of fragments to display or the constant string unlimited for no limit. Fragments are sorted in descending order of their score, which is derived from their length.
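The effect of this limit can be approximated on the client side. The Python sketch below is only an illustration: it assumes the score can be approximated by the length property of each fragment, and the function name limit_fragments is hypothetical.

```python
def limit_fragments(fragments, max_fragments="unlimited"):
    """Sort fragments by descending length and keep at most max_fragments."""
    # Assumption: the score is approximated by the fragment's length.
    ordered = sorted(fragments, key=lambda f: f["length"], reverse=True)
    if max_fragments == "unlimited":
        return ordered
    return ordered[:int(max_fragments)]


fragments = [{"length": 189}, {"length": 584}, {"length": 40}]
top = limit_fragments(fragments, 2)
print([f["length"] for f in top])  # [584, 189]
```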
documentPairs
A reference to the source stream of document pairs to analyze. The output contains a report of overlaps that is generated independently for each input pair.
fragmentsInFields
Configures the report that displays the document field values and all (or a subset) of duplicate fragments found in both documents. This view can be used to display a contiguous view of how much of the field's value was found to be a duplicate of another document's field (or fields).
Here is the relevant JSON output in the response. Note the ⁌overlap⁍ and ⁌\\overlap⁍ highlight markers around duplicated fragments. We used a very large maxValueLength of 3000 so the entire field value is returned as one passage, with no truncations.
"fragmentsInFields": {
  "a": {
    "abstract": [
      " ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\nfrequentist literature, little is known about the properties of such priors and\nthe convergence and concentration of the corresponding posterior distribution.\nIn this article, we propose a new class of Dirichlet--Laplace (DL) priors,\nwhich possess optimal posterior concentration and ⁌overlap⁍lead to efficient posterior\ncomputation exploiting results from normalized random measure theory. Finite\nsample performance of Dirichlet--Laplace priors relative to alternatives is\nassessed⁌\\overlap⁍ in simulated and real data examples.\n"
    ]
  },
  "b": {
    "abstract": [
      " ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\ncorresponding frequentist literature, very little is known about the properties\nof such priors. Focusing on a broad class of shrinkage priors, we provide\nprecise results on prior and posterior concentration. Interestingly, we\ndemonstrate that most commonly used shrinkage priors, including the Bayesian\nLasso, are suboptimal in high-dimensional settings. A new class of Dirichlet\nLaplace (DL) priors are proposed, which are optimal and ⁌overlap⁍lead to efficient\nposterior computation exploiting results from normalized random measure theory.\nFinite sample performance of Dirichlet Laplace priors relative to alternatives\nis assessed⁌\\overlap⁍ in simulations.\n"
    ]
  }
}
Here is how this information (combined with the extra fields output) is displayed in the Explorer:
The way field values are presented can be configured using the fields specification, similarly to how it was done for the aligned fragments report.
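Because the field values carry the highlight markers inline, an application can estimate how much of a returned value is covered by duplicated fragments. The Python sketch below does this for arbitrary start and end markers; the function name overlap_coverage is hypothetical, and the usage line borrows the >> and << markers from the aligned fragments configuration example. The counts are computed on the returned string only, so they may differ from the values in the stats section when the value was truncated.

```python
def overlap_coverage(value, start, end):
    """Return (overlap_chars, total_chars), ignoring the marker characters."""
    overlap = 0
    total = 0
    inside = False
    i = 0
    while i < len(value):
        if not inside and value.startswith(start, i):
            inside = True          # entering a duplicated fragment
            i += len(start)
        elif inside and value.startswith(end, i):
            inside = False         # leaving a duplicated fragment
            i += len(end)
        else:
            total += 1
            if inside:
                overlap += 1
            i += 1
    return overlap, total


covered, total = overlap_coverage("ab>>cde<<f", ">>", "<<")
print(covered, total)  # 3 6
```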
contextChars
The number of context characters to the left and right of the duplicated fragment.
fields
This element configures the truncation and highlight marker details for the report. Field values and all duplicated fragments of text from both documents are passed to the highlighting engine for formatting so any options applicable to highlighting can be used to format the report as well. For example, contentFields:grouped can be used to declare the maximum length of the quoted fragment, tune the highlight markers or the truncation character:
"fragmentsInFields": {
  "contextChars": 100,
  "fields": {
    "type": "contentFields:grouped",
    "groups": [
      {
        "fields": [
          "title", "abstract"
        ],
        "config": {
          "maxValues": 10,
          "maxValueLength": 300
        }
      }
    ]
  }
}
pairwiseSimilarity
Document overlap ratio similarity used to detect duplicate text regions. This property must reference precisely this implementation of pairwise similarity; it will not work with any other type.
The length of individual n-grams, whether cross-field detection is allowed and which fields are used as source text will affect the generated duplication report. Refer to the documentation of Document overlap ratio similarity for more information on how the overlapping regions are computed.