documentOverlap

Document overlap analysis provides various reports about duplicate text regions in one or more pairs of documents.

Here is an example rendered output of a document overlap request in Lingo4G Explorer:

Field view with duplicated text fragments highlighted.

A complete example of how this functionality works and can be leveraged in applications is in the tutorial on highlighting duplicate document regions.

document​Overlap

Detects and returns contiguous duplicated fragments of text in one or more pairs of documents. The returned fragments are sorted by length; their order otherwise (for example, relative to their position in the text) is not guaranteed.

{
  "type": "documentOverlap",
  "alignedFragments": {
    "contextChars": 160,
    "fields": {
      "type": "contentFields:empty"
    },
    "maxFragments": "unlimited"
  },
  "documentPairs": {
    "type": "documentPairs:reference",
    "auto": true
  },
  "fragmentsInFields": {
    "contextChars": 160,
    "fields": {
      "type": "contentFields:empty"
    }
  },
  "pairwiseSimilarity": {
    "type": "pairwiseSimilarity:documentOverlapRatio",
    "allowedGapRatio": 0,
    "computeDifferences": false,
    "crossFieldOverlaps": true,
    "fields": {
      "type": "fields:reference",
      "auto": true
    },
    "ngramWindow": 6
  }
}

The output of this stage is an array of objects, each corresponding to one document pair sourced from documentPairs:

[
  {
    "pair": [ 1627, 1848 ],
    "stats" : {
      "a" : {
        "field" : {
          "fragments" : 2,
          "fieldLength" : 80,
          "overlapLength" : 43,
          "ratio" : 0.5375
        }
      },
      "b" : {
        "field" : {
          "fragments" : 2,
          "fieldLength" : 84,
          "overlapLength" : 43,
          "ratio" : 0.5119047619047619
        }
      }
    },
    "alignedFragments" : [
      ...
    ],
    "fragmentsInFields": {
    }
  }

  // ... more pairs
]

The pair property contains internal document identifiers. The stats property contains per-field information: the number of discovered duplicated text fragments, the total field value length and the overlap length (both in characters). The alignedFragments and fragmentsInFields properties are optional output report sections; their structure is discussed with each corresponding configuration property.
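
The stats section can be post-processed on the client side. Below is a minimal Python sketch, not part of the Lingo4G API, that flattens the output into one row per pair, document and field. The `summarize` helper and the `SAMPLE` constant are hypothetical names; the sample values are copied from the example above. The sketch assumes that ratio equals overlapLength divided by fieldLength, which matches the example numbers but is not stated explicitly here.

```python
import json

# Sample shaped like the stage output shown above (values copied from that example).
SAMPLE = """
[
  {
    "pair": [1627, 1848],
    "stats": {
      "a": {"field": {"fragments": 2, "fieldLength": 80,
                      "overlapLength": 43, "ratio": 0.5375}},
      "b": {"field": {"fragments": 2, "fieldLength": 84,
                      "overlapLength": 43, "ratio": 0.5119047619047619}}
    }
  }
]
"""


def summarize(pairs):
    """Return one (pair, side, field, fragments, ratio) row per field entry."""
    rows = []
    for entry in pairs:
        for side in ("a", "b"):
            for field, stats in entry["stats"][side].items():
                length = stats["fieldLength"]
                # Assumption: ratio == overlapLength / fieldLength.
                ratio = stats["overlapLength"] / length if length else 0.0
                rows.append(
                    (tuple(entry["pair"]), side, field, stats["fragments"], ratio)
                )
    return rows


rows = summarize(json.loads(SAMPLE))
for row in rows:
    print(row)
```

Recomputing the ratio rather than reading it directly makes it easy to spot responses where the assumption above does not hold.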

aligned​Fragments

Type
object
Default
{
  "maxFragments": "unlimited",
  "contextChars": 160,
  "fields": {
    "type": "contentFields:empty"
  }
}
Required
no

Configures the report that displays the same duplicate fragments of text in the context of their occurrences in both documents. This view shows the different "context" of the same passage in both documents. Here is an example output for a single document pair:

"alignedFragments": [
  {
    "length": 584,
    "context": {
      "a": "  ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\nfrequentist literature, little is known about the properties of such priors and\nthe convergence and concentration of the corresponding posterior…",
      "b": "  ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\ncorresponding frequentist literature, very little is known about the properties\nof such priors. Focusing on a broad class of shrinkage priors, we…"
    }
  },
  {
    "length": 189,
    "context": {
      "a": "…this article, we propose a new class of Dirichlet--Laplace (DL) priors,\nwhich possess optimal posterior concentration and ⁌overlap⁍lead to efficient posterior\ncomputation exploiting results from normalized random measure theory. Finite\nsample performance of Dirichlet--Laplace priors relative to alternatives is\nassessed⁌\\overlap⁍ in simulated and real data examples.\n",
      "b": "…Bayesian\nLasso, are suboptimal in high-dimensional settings. A new class of Dirichlet\nLaplace (DL) priors are proposed, which are optimal and ⁌overlap⁍lead to efficient\nposterior computation exploiting results from normalized random measure theory.\nFinite sample performance of Dirichlet Laplace priors relative to alternatives\nis assessed⁌\\overlap⁍ in simulations.\n"
    }
  }
],

Each aligned fragment is an object with a length property containing the duplicated fragment's size in characters, and a context sub-element for both documents (the constant properties "a" and "b" correspond to the first and second document of the pair).

Each context value in the above example contains an ellipsis character (…) at the position where the value was truncated, and a pair of special markers that indicate the start and end of the fragment: ⁌overlap⁍, ⁌\overlap⁍. These markers use rarely seen Unicode characters, so they are unlikely to appear anywhere in the text. They can be used directly for highlighting; here is how Lingo4G Explorer displays aligned fragments:

Context-aligned duplicated text fragments.
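
Because the markers are plain text, a client can turn them into markup with simple string replacement. The following Python sketch shows one way to do this; the `to_html` function and its parameters are hypothetical and not part of Lingo4G. It assumes the context value has already been JSON-decoded, so the end marker contains a single backslash.

```python
import html

# Default markers as they appear in the decoded response text.
START = "⁌overlap⁍"
END = "⁌\\overlap⁍"  # Python literal for the marker with one backslash


def to_html(context, start=START, end=END, tag="mark"):
    """Escape the text for HTML and wrap marked regions in <tag> elements."""
    escaped = html.escape(context)
    # Escape the markers the same way, so custom markers such as ">>" still match.
    escaped = escaped.replace(html.escape(start), f"<{tag}>")
    return escaped.replace(html.escape(end), f"</{tag}>")


snippet = "optimal and ⁌overlap⁍lead to efficient⁌\\overlap⁍ in simulations."
print(to_html(snippet))
```

Escaping the text before inserting the tags keeps any HTML-special characters in the document content from being interpreted as markup.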

context​Chars

Type
integer
Default
160
Required
no

The number of context characters to the left and right of the duplicated fragment.

fields

Type
contentFields
Default
{
  "type": "contentFields:empty"
}
Required
no

This element configures the truncation and highlight marker details for the report. The duplicated fragment of text from both documents is passed to the highlighting engine for formatting so any options applicable to highlighting can be used to format the report as well. For example, contentFields:grouped can be used to declare the maximum length of the quoted fragment, tune the highlight markers or the truncation character:

"fields": {
  "type": "contentFields:grouped",
  "groups": [
    {
      "fields": [
        "title", "abstract"
      ],
      "config": {
        "maxValues": 1,
        "maxValueLength": 600,
        "highlighting" : {
          "startMarker" : ">>",
          "endMarker" : "<<"
        }
      }
    }
  ]
}

max​Fragments

Type
limit
Default
unlimited
Required
no

The maximum number of fragments to display, or the constant string unlimited for no limit. Fragments are sorted in descending order of their score, which is derived from their length.

document​Pairs

Type
documentPairs
Default
{
  "type": "documentPairs:reference",
  "auto": true
}
Required
no

A reference to the source stream of document pairs to analyze. The output contains a report of overlaps that is generated independently for each input pair.

fragments​In​Fields

Type
object
Default
{
  "fields": {
    "type": "contentFields:empty"
  },
  "contextChars": 160
}
Required
no

Configures the report that displays the document field values and all (or a subset) of duplicate fragments found in both documents. This view can be used to display a contiguous view of how much of the field's value was found to be a duplicate of another document's field (or fields).

Here is the relevant JSON output in the response. Note the ⁌overlap⁍ and ⁌\overlap⁍ highlight markers around duplicated fragments (inside JSON strings the backslash appears escaped, as \\). We used a very large maxValueLength of 3000 so that the entire field value is returned as one passage, with no truncation.

"fragmentsInFields": {
  "a": {
    "abstract": [
      "  ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\nfrequentist literature, little is known about the properties of such priors and\nthe convergence and concentration of the corresponding posterior distribution.\nIn this article, we propose a new class of Dirichlet--Laplace (DL) priors,\nwhich possess optimal posterior concentration and ⁌overlap⁍lead to efficient posterior\ncomputation exploiting results from normalized random measure theory. Finite\nsample performance of Dirichlet--Laplace priors relative to alternatives is\nassessed⁌\\overlap⁍ in simulated and real data examples.\n"
    ]
  },
  "b": {
    "abstract": [
      "  ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\ncorresponding frequentist literature, very little is known about the properties\nof such priors. Focusing on a broad class of shrinkage priors, we provide\nprecise results on prior and posterior concentration. Interestingly, we\ndemonstrate that most commonly used shrinkage priors, including the Bayesian\nLasso, are suboptimal in high-dimensional settings. A new class of Dirichlet\nLaplace (DL) priors are proposed, which are optimal and ⁌overlap⁍lead to efficient\nposterior computation exploiting results from normalized random measure theory.\nFinite sample performance of Dirichlet Laplace priors relative to alternatives\nis assessed⁌\\overlap⁍ in simulations.\n"
    ]
  }
}

Here is how this information (combined with the extra fields output) is displayed in the Explorer:

Field view with duplicated text fragments highlighted.

The way field values are presented can be configured using the fields specification, similarly to how it was done for the aligned fragments report.
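
The marked field values can also be analyzed programmatically. The Python sketch below extracts the duplicated fragments from one field value and estimates how much of the visible text they cover. The `marked_fragments` and `marked_ratio` helpers are hypothetical names. The sketch assumes the default markers and a value that was not truncated, and its character counts may not match the stats section exactly.

```python
import re

START = "⁌overlap⁍"
END = "⁌\\overlap⁍"  # decoded form of the end marker, with one backslash

# Non-greedy match of everything between a start and an end marker.
_FRAGMENT = re.compile(re.escape(START) + "(.*?)" + re.escape(END), re.DOTALL)


def marked_fragments(value):
    """Return the texts enclosed by the overlap markers, in order of appearance."""
    return _FRAGMENT.findall(value)


def marked_ratio(value):
    """Share of the visible text (markers removed) that lies inside markers."""
    plain = value.replace(START, "").replace(END, "")
    covered = sum(len(fragment) for fragment in marked_fragments(value))
    return covered / len(plain) if plain else 0.0


value = "ab⁌overlap⁍cdef⁌\\overlap⁍gh⁌overlap⁍ij⁌\\overlap⁍"
print(marked_fragments(value), marked_ratio(value))
```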

context​Chars

Type
integer
Default
160
Required
no

The number of context characters to the left and right of the duplicated fragment.

fields

Type
contentFields
Default
{
  "type": "contentFields:empty"
}
Required
no

This element configures the truncation and highlight marker details for the report. Field values and all duplicated fragments of text from both documents are passed to the highlighting engine for formatting so any options applicable to highlighting can be used to format the report as well. For example, contentFields:grouped can be used to declare the maximum length of the quoted fragment, tune the highlight markers or the truncation character:

"fragmentsInFields": {
  "contextChars": 100,
  "fields": {
    "type": "contentFields:grouped",
    "groups": [
      {
        "fields": [
          "title", "abstract"
        ],
        "config": {
          "maxValues": 10,
          "maxValueLength": 300
        }
      }
    ]
  }
}

pairwise​Similarity

Type
pairwiseSimilarity
Default
{
  "type": "pairwiseSimilarity:documentOverlapRatio",
  "fields": {
    "type": "fields:reference",
    "auto": true
  },
  "ngramWindow": 6,
  "crossFieldOverlaps": true,
  "allowedGapRatio": 0,
  "computeDifferences": false
}
Required
no

Document overlap ratio similarity used to detect duplicate text regions. This property must reference exactly this implementation of pairwise similarity; it does not work with any other type.

The length of individual n-grams, whether cross-field detection is allowed and which fields are used as source text will affect the generated duplication report. Refer to the documentation of Document overlap ratio similarity for more information on how the overlapping regions are computed.
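
When experimenting with these settings, it can help to build the stage definition in code. The Python sketch below starts from the default values listed in this section and overrides a few of them. The `overlap_stage` function and the `DEFAULT_STAGE` constant are hypothetical helpers, and how the resulting JSON is sent to Lingo4G is outside the scope of this sketch.

```python
import copy
import json

# Defaults copied from the stage definition shown in this section.
DEFAULT_STAGE = {
    "type": "documentOverlap",
    "alignedFragments": {
        "contextChars": 160,
        "fields": {"type": "contentFields:empty"},
        "maxFragments": "unlimited",
    },
    "documentPairs": {"type": "documentPairs:reference", "auto": True},
    "fragmentsInFields": {
        "contextChars": 160,
        "fields": {"type": "contentFields:empty"},
    },
    "pairwiseSimilarity": {
        "type": "pairwiseSimilarity:documentOverlapRatio",
        "allowedGapRatio": 0,
        "computeDifferences": False,
        "crossFieldOverlaps": True,
        "fields": {"type": "fields:reference", "auto": True},
        "ngramWindow": 6,
    },
}


def overlap_stage(ngram_window=6, cross_field_overlaps=True, max_fragments="unlimited"):
    """Return a copy of the default stage with selected properties overridden."""
    stage = copy.deepcopy(DEFAULT_STAGE)  # never mutate the shared defaults
    stage["pairwiseSimilarity"]["ngramWindow"] = ngram_window
    stage["pairwiseSimilarity"]["crossFieldOverlaps"] = cross_field_overlaps
    stage["alignedFragments"]["maxFragments"] = max_fragments
    return stage


stage = overlap_stage(ngram_window=8, cross_field_overlaps=False, max_fragments=5)
print(json.dumps(stage, indent=2))
```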