Highlighting duplicate text regions

The document​Overlap request stage locates identical text fragments in a pair of documents and returns information useful for visualizing such regions.

Here is an example rendered output of a document overlap request in Lingo4G Explorer:

Field view with duplicated text fragments highlighted.

Context-aligned duplicated text fragments.

In the remainder of this chapter, we will see how the document overlap request is built and combined with other request stages. You should be familiar with duplicate detection requests and the concepts discussed there (preliminary grouping, pairwise similarity, hashes).

Request structure

The illustration below shows an outline structure of a document overlap (text duplication) request:

{
  "type": "documentOverlap",
  "documentPairs": {
      ...
  },
  "pairwiseSimilarity": {
    "type": "pairwiseSimilarity:documentOverlapRatio",
    ...
  },
  "alignedFragments": {
      ...
  },
  "fragmentsInFields": {
      ...
  }
}    

We intentionally omit the full specification of all properties, leaving only the key blocks. The document​Pairs element points at the stream of document pairs; each pair is processed independently, and overlap information is emitted for that pair only. Typically, this property will be a reference to the duplicate detection stage discussed earlier.

The pairwise​Similarity element describes how to identify identical text fragments in both documents. We already discussed the concept of pairwise similarity between document fields and gave a few examples of how it can be defined. The document overlap stage uses one fixed, non-customizable type of similarity: the document​Overlap​Ratio. The similarity type is fixed because it emits a useful byproduct of the score computation: a list of contiguous word sequences shared by the two documents. Even though the type of similarity is fixed, its parameters (primarily the n-gram window) can be fine-tuned, as we will show shortly.

The two remaining properties (aligned​Fragments and fragments​In​Fields) specify the output requested from the component. We will explain their meaning and parameters with concrete examples later in this chapter.

Finding cloned text passages in arXiv abstracts

Let's reuse the arXiv example data set to find cloned text passages in the "abstract" field, where the overall abstract text similarity falls between 60% and 70%.

Run the indexing command and the server (this will take a while):

l4g index  -p datasets/dataset-arxiv
l4g server -p datasets/dataset-arxiv

Then open the sandbox API editor (it should be available at https://localhost:8080).

A little unconventionally, we'll start with the full, final request and then break it down into smaller pieces, explaining what they do along the way. Here is the full request; run it in the API editor.

{
  "variables": {
    "fieldsToCompare": {
      "value": [
        "abstract"
      ]
    }
  },

  "components": {
    "sourceFields": {
      "type": "fields:simple",
      "fields": {
        "@var": "fieldsToCompare"
      }
    },
    "documentSimilarity": {
      "type": "pairwiseSimilarity:documentOverlapRatio",
      "fields": {
        "type": "fields:reference",
        "use": "sourceFields"
      },
      "ngramWindow": 10
    }
  },

  "stages": {

    "similarPairs": {
      "type": "documentPairs:duplicates",

      "query": {
        "type": "query:string",
        "query": "created:[2015 TO 2017]"
      },

      "hashGrouping": {
        "pairing": {
          "maxHashGroupSize": 200
        },
        "features": {
          "type": "featureSource:sentences",
          "fields":{
            "type": "fields:reference",
            "use": "sourceFields"
          }
        }
      },

      "validation": {
        "pairwiseSimilarity": {
          "type": "pairwiseSimilarity:reference",
          "use": "documentSimilarity"
        },
        "min": 0.6,
        "max": 0.7
      }
    },

    "documents": {
      "type": "documentContent",
      "limit": "unlimited",

      "documents": {
        "type": "documents:fromDocumentPairs",
        "documentPairs": {
          "type": "documentPairs:reference",
          "use": "similarPairs"
        }
      },

      "fields":{
        "type": "contentFields:simple",
        "fields": {
          "id": {},
          "title": {},
          "author_name": {},
          "created": {},
          "updated": {}
        }
      }
    },

    "overlaps": {
      "type": "documentOverlap",

      "documentPairs": {
        "type": "documentPairs:reference",
        "use": "similarPairs"
      },

      "pairwiseSimilarity": {
        "type": "pairwiseSimilarity:reference",
        "use": "documentSimilarity"
      },

      "alignedFragments": {
        "contextChars": 80,
        "maxFragments": 10,
        "fields": {
          "type": "contentFields:grouped",
          "groups": [
            {
              "fields": {
                "@var": "fieldsToCompare"
              },
              "config": {
                "maxValueLength": 3000
              }
            }
          ]
        }
      },

      "fragmentsInFields": {
        "contextChars": 600,
        "fields": {
          "type": "contentFields:grouped",
          "groups": [
            {
              "fields": {
                "@var": "fieldsToCompare"
              },
              "config": {
                "maxValues": 10,
                "maxValueLength": 3000
              }
            }
          ]
        }
      }
    }
  }
}

An API request to find abstracts with a similarity score between 60% and 70%, including pointers to duplicated text fragments in each abstract.
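
If you prefer to script the call instead of pasting it into the API editor, a minimal Python sketch could look like this. The endpoint path below is an assumption; check the REST API reference of your Lingo4G version for the exact URL.

import json
import urllib.request

# The request shown above, saved to a local file.
with open("overlap-request.json", "rb") as f:
    body = f.read()

req = urllib.request.Request(
    "https://localhost:8080/api/v1/analysis",   # assumed endpoint path
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as response:
    result = json.load(response)

# The response should contain each stage's output under the stage's name;
# the exact envelope may differ between Lingo4G versions.
print(json.dumps(result["overlaps"], indent=2)[:2000])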

You'll probably notice that the request is rather long; that's because it combines three separate stages, each fulfilling a different role. Let's discuss them in the order they appear in the request.

Variables and reusable components

At the very beginning of the request we leverage Lingo4G's ability to inject variables and reference other components so that we won't have to repeat the same code over and over:

"variables": {
  "fieldsToCompare": {
    "value": [
      "abstract"
    ]
  }
},

"components": {
  "sourceFields": {
    "type": "fields:simple",
    "fields": {
      "@var": "fieldsToCompare"
    }
  },
  "documentSimilarity": {
    "type": "pairwiseSimilarity:documentOverlapRatio",
    "fields": {
      "type": "fields:reference",
      "use": "sourceFields"
    },
    "ngramWindow": 10
  }
},

The fields​To​Compare variable contains the field we will use for computing pairwise document similarity and text overlaps (abstract). We also declare a component with this list of fields (source​Fields), and a pairwise similarity component that computes the similarity score based on overlapping word 10-grams (document​Similarity).

Documents with 60-70% abstract similarity

Here is the document duplicate detection stage part of our original request:

"similarPairs": {
  "type": "documentPairs:duplicates",

  "documents": {
    "type": "documents:byQuery",
    "query": {
      "type": "query:string",
      "query": "*"
    },
    "limit": "unlimited"
  },

  "hashGrouping": {
    "pairing": {
      "maxHashGroupSize": 200
    },
    "features": {
      "type": "featureSource:sentences",
      "fields":{
        "type": "fields:reference",
        "use": "sourceFields"
      }
    }
  },

  "validation": {
    "pairwiseSimilarity": {
      "type": "pairwiseSimilarity:reference",
      "use": "documentSimilarity"
    },
    "min": 0.6,
    "max": 0.7
  }
},

Since we're looking for abstracts with 60-70% text similarity, it seems safe to assume that such abstracts will share at least one full sentence. This assumption is reflected in the definition of the preliminary grouping phase (hash​Grouping), where we use a feature source returning sentences from the text of the abstract. We could have used n-grams or other feature types here as well, but entire sentences are relatively unique, which makes them more efficient (the number of features is much smaller than with n-grams).
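
To build an intuition for what hash​Grouping does, here is a conceptual sketch in Python (an illustration of the idea, not Lingo4G's actual implementation): documents that share at least one sentence hash land in the same candidate group, and only those candidates reach the expensive pairwise comparison.

import itertools
import re
from collections import defaultdict

docs = {
    "A": "Lucy has a gray cat. Mary has a blue pigeon. Bob has a wild turkey.",
    "B": "Bob has a wild turkey. Katie has a purple fish. Lucy has a gray cat.",
    "C": "A completely unrelated abstract.",
}

# Hash every sentence and remember which documents contain it.
# (Python's hash() is per-process; a real system would use a stable hash.)
groups = defaultdict(set)
for doc_id, text in docs.items():
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        groups[hash(sentence)].add(doc_id)

# Candidate pairs: documents co-occurring under the same sentence hash.
# Oversized groups are skipped, which is what maxHashGroupSize controls.
candidates = set()
for ids in groups.values():
    if 1 < len(ids) <= 200:
        candidates.update(itertools.combinations(sorted(ids), 2))

print(candidates)   # {('A', 'B')}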

After the preliminary grouping collects documents sharing at least one sentence, we compare them using the pairwise​Similarity:​document​Overlap​Ratio similarity referenced from the reusable components block. We're only interested in document pairs with a similarity score between 0.6 and 0.7, which the min and max properties express.

Including other fields

The second part of the request declares the document​Content stage, which fetches additional fields for every document that passes similarity validation. We convert similar​Pairs into a flat list of documents using documents:​from​Document​Pairs and use that flat list as the source of documents for field value retrieval.

"documents": {
  "type": "documentContent",
  "limit": "unlimited",
  "documents": {
    "type": "documents:fromDocumentPairs",
    "documentPairs": {
      "type": "documentPairs:reference",
      "use": "similarPairs"
    }
  },
  "fields":{
    "type": "contentFields:simple",
    "fields": {
      "id": {},
      "title": {},
      "author_name": {},
      "created": {},
      "updated": {}
    }
  }
},

Retrieving duplicated text fragments

The last part of the request contains the definition of the text overlap stage. It starts by referencing the similar​Pairs stage to provide the source pairs of documents to run the analysis on:

"overlaps": {
  "type": "documentOverlap",

  "documentPairs": {
    "type": "documentPairs:reference",
    "use": "similarPairs"
  },

  "pairwiseSimilarity": {
    "type": "pairwiseSimilarity:reference",
    "use": "documentSimilarity"
  },

...
  

Then it references the shared pairwise similarity component (document​Similarity) we also used previously for duplicate detection:

...

"pairwiseSimilarity": {
  "type": "pairwiseSimilarity:reference",
  "use": "documentSimilarity"
},

...

The next section contains the specification of the output we expect to receive from the component. Let's start by explaining a bit of the mechanics of how the algorithm works. The document​Overlap stage detects duplicated sequences of tokens in text. It is important to understand that these duplicated fragments may appear anywhere, for example at the beginning of one document but at the end of the other. The algorithm implemented in Lingo4G is not an ordered, linear "difference" between the two documents (corresponding to potential insertions or deletions of text) but rather a set of pointers to maximum-length subsequences containing the same text. The difference is subtle but important. Consider these two example documents:

document A: Lucy has a gray cat. Mary has a blue pigeon. Bob has a wild turkey.
document B: Bob has a wild turkey. Katie has a purple fish. Lucy has a gray cat.

Assuming the n-gram window for the pairwise​Similarity is 3, the algorithm would detect two overlapping fragments:

fragment  text
1         Bob has a wild turkey.
2         Lucy has a gray cat.
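
To make these mechanics concrete, here is a minimal Python sketch (an illustration of the idea only, not Lingo4G's implementation) that finds maximal shared token runs of at least window tokens and orders them longest-first:

def shared_fragments(text_a, text_b, window=3):
    a, b = text_a.split(), text_b.split()
    # dp[i][j] = length of the common token run ending at a[i-1] and b[j-1].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
    fragments = []
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            run = dp[i][j]
            # Keep only runs that cannot be extended any further (maximal runs).
            is_maximal = i == len(a) or j == len(b) or dp[i + 1][j + 1] == 0
            if run >= window and is_maximal:
                fragments.append(" ".join(a[i - run:i]))
    return sorted(fragments, key=len, reverse=True)

print(shared_fragments(
    "Lucy has a gray cat. Mary has a blue pigeon. Bob has a wild turkey.",
    "Bob has a wild turkey. Katie has a purple fish. Lucy has a gray cat."))
# ['Bob has a wild turkey.', 'Lucy has a gray cat.']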

Note that if these documents were very long and the number of overlapping fragments larger, it would quickly become difficult to visually locate each duplicated fragment, even if we knew they must exist somewhere. That's why the document​Overlap stage can return two different views of the same overlap information: one that highlights fragments directly in each document's field value, and another that returns a list of duplicated fragments, each with some left and right context from the text of each document.

The first option is called fragments​In​Fields and would look something like this if we applied it to our example, with duplicated fragments enclosed in the ⁌overlap⁍…⁌\overlap⁍ markers the API uses (we'll see the real API output in a moment):

document A: ⁌overlap⁍Lucy has a gray cat.⁌\overlap⁍ Mary has a blue pigeon. ⁌overlap⁍Bob has a wild turkey.⁌\overlap⁍
document B: ⁌overlap⁍Bob has a wild turkey.⁌\overlap⁍ Katie has a purple fish. ⁌overlap⁍Lucy has a gray cat.⁌\overlap⁍

The second type of view is called aligned​Fragments and it returns the information transformed into something like the table below.

fragment  document A                                document B
1         ... blue pigeon. Bob has a wild turkey.   Bob has a wild turkey. Katie has a ...
2         Lucy has a gray cat. Mary has a ...       ... purple fish. Lucy has a gray cat.

Note that the aligned view shows text duplications in order of their length (longer passages first) and shows a limited context of the fragment in each document. Both views have benefits in different applications. The advantage of the fragments-in-fields view is that it gives a general glimpse of how much text is duplicated and where it occurs in the field value. The aligned view is more useful for quickly inspecting each individual duplicate and its context.

Let's look at how these views are configured in our request and at the real API output.

aligned​Fragments view

The aligned​Fragments view is configured with this block in our request:

"alignedFragments": {
  "contextChars": 80,
  "maxFragments": 10,
  "fields": {
    "type": "contentFields:grouped",
    "groups": [
      {
        "fields": {
          "@var": "fieldsToCompare"
        },
        "config": {
          "maxValueLength": 3000
        }
      }
    ]
  }
},
  

The most important properties are max​Fragments, which declares the maximum number of the longest duplicated fragments to return, and context​Chars, which tells how many characters to the left and right of the duplicated passage should be returned. An additional fields component configures constraints on the returned values. In the case of very large duplicated fragments (or very large field values), it may be useful to truncate the returned fragments; this is the role of max​Value​Length above.
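
To illustrate how context​Chars shapes each aligned fragment, here is a small sketch (a hypothetical helper, not part of the Lingo4G API) that wraps a fragment's character range in the overlap markers and adds surrounding context with ellipsis truncation markers:

def aligned_context(value, start, end, context_chars=80):
    """Wrap value[start:end] in overlap markers, with up to
    context_chars characters of context on each side."""
    left = max(0, start - context_chars)
    right = min(len(value), end + context_chars)
    snippet = (value[left:start]
               + "⁌overlap⁍" + value[start:end] + "⁌\\overlap⁍"
               + value[end:right])
    # Ellipses mark the places where the value was truncated.
    prefix = "…" if left > 0 else ""
    suffix = "…" if right < len(value) else ""
    return prefix + snippet + suffix

text = "Bob has a wild turkey. Katie has a purple fish. Lucy has a gray cat."
print(aligned_context(text, text.index("Lucy"), len(text), context_chars=20))
# … has a purple fish. ⁌overlap⁍Lucy has a gray cat.⁌\overlap⁍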

Here is an excerpt from the JSON output for our original request:

"overlaps": [
  {
    "pair": [
      870338,
      1172681
    ],
    "stats": {
      "a": {
        "abstract": {
          "fragments": 2,
          "fieldLength": 1105,
          "overlapLength": 773,
          "ratio": 0.6995475113122172
        }
      },
      "b": {
        "abstract": {
          "fragments": 2,
          "fieldLength": 1235,
          "overlapLength": 772,
          "ratio": 0.6251012145748988
        }
      }
    },
    ...
    

The output of the stage (overlaps) contains an array of objects corresponding to the document pairs from the similarity analysis. The pair property contains the internal identifiers of the documents for which duplicated text fragments follow. The stats property contains per-field information about the number of fragments, the total field value length and the overlap length (in characters).
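
The ratio values are simply overlapLength divided by fieldLength, which is easy to verify:

print(773 / 1105)   # 0.6995475113122172 ("ratio" for document a above)
print(772 / 1235)   # 0.6251012145748988 ("ratio" for document b above)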

What follows is the output for the aligned fragments view:

"alignedFragments": [
  {
    "length": 584,
    "context": {
      "a": "  ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\nfrequentist literature, little is known about the properties of such priors and\nthe convergence and concentration of the corresponding posterior
",
      "b": "  ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\ncorresponding frequentist literature, very little is known about the properties\nof such priors. Focusing on a broad class of shrinkage priors, we
"
    }
  },
  {
    "length": 189,
    "context": {
      "a": "
this article, we propose a new class of Dirichlet--Laplace (DL) priors,\nwhich possess optimal posterior concentration and ⁌overlap⁍lead to efficient posterior\ncomputation exploiting results from normalized random measure theory. Finite\nsample performance of Dirichlet--Laplace priors relative to alternatives is\nassessed⁌\\overlap⁍ in simulated and real data examples.\n",
      "b": "
Bayesian\nLasso, are suboptimal in high-dimensional settings. A new class of Dirichlet\nLaplace (DL) priors are proposed, which are optimal and ⁌overlap⁍lead to efficient\nposterior computation exploiting results from normalized random measure theory.\nFinite sample performance of Dirichlet Laplace priors relative to alternatives\nis assessed⁌\\overlap⁍ in simulations.\n"
    }
  }
],

Each aligned fragment is an object with the length of the duplicated fragment (in characters) and a text value for each document ("a" indicates the first and "b" the second identifier from the pair element). The text value contains an ellipsis character (…) where the value was truncated, and a pair of special markers indicating the start and end of the fragment: ⁌overlap⁍ and ⁌\overlap⁍. These markers contain fairly unusual Unicode characters, so they are unlikely to appear anywhere in the text itself. They can be used directly for highlighting; here is how Lingo4G Explorer shows aligned fragments:

Context-aligned duplicated text fragments.
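
Because the markers are plain text, turning them into HTML highlighting on the client side can be a simple substitution; here is a minimal sketch (Explorer's internal rendering may differ):

import html

def highlight(value):
    # Escape the raw text first, then swap the markers for <mark> tags;
    # the markers contain no characters affected by HTML escaping.
    return (html.escape(value)
            .replace("⁌overlap⁍", "<mark>")
            .replace("⁌\\overlap⁍", "</mark>"))

print(highlight("… purple fish. ⁌overlap⁍Lucy has a gray cat.⁌\\overlap⁍"))
# … purple fish. <mark>Lucy has a gray cat.</mark>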

fragments​In​Fields view

The fragments-in-fields view is configured similarly to the aligned fragments view, although the interpretation of the highlighting directives is slightly different.

"fragmentsInFields": {
  "contextChars": 600,
  "fields": {
    "type": "contentFields:grouped",
    "groups": [
      {
        "fields": {
          "@var": "fieldsToCompare"
        },
        "config": {
          "maxValues": 10,
          "maxValueLength": 3000
        }
      }
    ]
  }
}
  

The algorithm takes each document's field value and all of its overlap regions, then tries to find a display window that maximizes the number of overlaps inside it. The intuition is to show the context of each overlap within the value, much like snippets from a matching web page in a search engine. The max​Values property defines the maximum number of value ranges to display, and max​Value​Length is the maximum length (in characters) of each range. If max​Value​Length is larger than the actual field length, the entire field value is shown, with all overlapping fragments highlighted. An additional context​Chars property controls the leading and trailing context of each value range selected for display.
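
Here is a conceptual sketch of that window selection (an illustration of the idea, not Lingo4G's actual algorithm): among display windows of at most max_value_length characters anchored at overlap starts, pick the one that fully covers the most overlap ranges.

def best_window(field_length, overlaps, max_value_length):
    # overlaps: (start, end) character ranges of duplicated fragments in the field.
    if field_length <= max_value_length:
        return 0, field_length                # the entire value fits, show all of it
    best, best_count = (0, max_value_length), -1
    for start, _ in overlaps:                 # candidate windows begin at each overlap
        lo = min(start, field_length - max_value_length)
        hi = lo + max_value_length
        covered = sum(1 for s, e in overlaps if lo <= s and e <= hi)
        if covered > best_count:
            best, best_count = (lo, hi), covered
    return best

print(best_window(2000, [(100, 220), (300, 380), (1700, 1800)], 300))
# (100, 400) -- the window covering the first two overlaps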

Here is the relevant JSON output in the response. Note the ⁌overlap⁍ and ⁌\overlap⁍ highlight markers. We used a very large max​Value​Length of 3000, so the entire field value is returned as one passage, with no truncation.

"fragmentsInFields": {
  "a": {
    "abstract": [
      "  ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\nfrequentist literature, little is known about the properties of such priors and\nthe convergence and concentration of the corresponding posterior distribution.\nIn this article, we propose a new class of Dirichlet--Laplace (DL) priors,\nwhich possess optimal posterior concentration and ⁌overlap⁍lead to efficient posterior\ncomputation exploiting results from normalized random measure theory. Finite\nsample performance of Dirichlet--Laplace priors relative to alternatives is\nassessed⁌\\overlap⁍ in simulated and real data examples.\n"
    ]
  },
  "b": {
    "abstract": [
      "  ⁌overlap⁍Penalized regression methods, such as undefined regularization, are routinely\nused in high-dimensional applications, and there is a rich literature on\noptimality properties under sparsity assumptions. In the Bayesian paradigm,\nsparsity is routinely induced through two-component mixture priors having a\nprobability mass at zero, but such priors encounter daunting computational\nproblems in high dimensions. This has motivated an amazing variety of\ncontinuous shrinkage priors, which can be expressed as global-local scale\nmixtures of Gaussians, facilitating computation. In sharp contrast⁌\\overlap⁍ to the\ncorresponding frequentist literature, very little is known about the properties\nof such priors. Focusing on a broad class of shrinkage priors, we provide\nprecise results on prior and posterior concentration. Interestingly, we\ndemonstrate that most commonly used shrinkage priors, including the Bayesian\nLasso, are suboptimal in high-dimensional settings. A new class of Dirichlet\nLaplace (DL) priors are proposed, which are optimal and ⁌overlap⁍lead to efficient\nposterior computation exploiting results from normalized random measure theory.\nFinite sample performance of Dirichlet Laplace priors relative to alternatives\nis assessed⁌\\overlap⁍ in simulations.\n"
    ]
  }
}

Here is how this information (combined with the extra fields output) is displayed in the Explorer:

Field view with duplicated text fragments highlighted.

If we wanted to display shorter snippets with overlap highlights instead of full values, we could set max​Value​Length to a smaller value (300 characters) and context​Chars to 100 characters, like this:

"fragmentsInFields": {
  "contextChars": 100,
  "fields": {
    "type": "contentFields:grouped",
    "groups": [
      {
        "fields": {
          "@var": "fieldsToCompare"
        },
        "config": {
          "maxValues": 10,
          "maxValueLength": 300
        }
      }
    ]
  }
}
  

The output would then look as shown below (note ellipsis truncation markers):

Snippet-like field view with duplicated text fragments.

Finally, note that the algorithm selects snippets independently for each document in a pair, so different regions of each document (with different duplicated fragments) may be chosen and returned. For a side-by-side view, it is recommended to set max​Value​Length to a number exceeding the maximum field length of any document.