Duplicate detection
Duplicate detection identifies pairs of documents with overlapping content in one or more of their fields. The degree of overlap can range from entire field values (exact duplicates), through almost all of the content (near duplicates), to just partial overlap (shared sentences or paragraphs). You can tune many parameters to define the nature of the overlap and vary the sensitivity of detection.
Duplicate detection introduction
Before we discuss the details of duplicate detection and similarity analysis in Lingo4G, it's worth taking a look at a toy example that demonstrates the difficulties of the problem itself and introduces the key concepts used later in this chapter: features, hash grouping and pairwise similarity.
Let's imagine the following situation:
There is a parent-teacher conference at a large school. Many kids, teachers and their parents or legal guardians showed up:
🧔🏼 👵 👨🏼 👵🏻 👧🏽 👵🏼 👨🏽🦱 👴🏽 🧓🏻 👵🏽 👩🏼🦱 👵🏿 🧑🏼 👵🏾 👦🏿 👱🏼 👨🦰 👴 👱 👨🦲 👨🏽🦲 👴🏻 🧑🏿 👨🦱 👩🏾🦱 👴🏼 👩🏾 👨🏾🦰 👨🏾🦱 👴🏾 👨 👧 🧓🏾 👴🏿 👨🏾🦲 👦🏼 👨🏿 🧓 👧🏿 🧑 👩🏼 🧓🏼 👧🏼 👨🏽🦰 👨🏽 🧓🏽 👱🏾 👨🏻🦱 👨🏻🦰 🧓🏿 👦🏻 🧑🏾 👱🏻 👩 👧🏾 👩🦰 🧑🏽 👩🏻 🧔🏾 👱🏿 👨🏻🦲 👩🏽 👨🏻 👩🏾🦰 👦 👩🏾 👨🏾 👧🏻 👦🏾 👩🏿 👨🏼🦰 👦🏽 🧑🏻 👩🏻🦰 👨🏿🦰 🧔🏻 👩🏻🦲 👩🏼🦰 👨🏼🦱 👩🏿🦱 👩🏽🦲 👩🏽🦰 👨🏿🦱 👩🦲 👩🏿🦲 👩🏿🦰 👨🏼🦲 👩🏼🦲 🧔🏻 👩🦱 👨🏿🦲 👩🏾🦲 🧔🏽 👩🏻🦱 👱🏽 🧔 🧔🏿 👩🏽🦱
The problem to consider is this:
Are there any two (or more) people that are lookalikes?
The first thing to realize is that the definition of a "lookalike" is really vague. Say, a stunt double should be very similar to the actor they stand in for on the set (but they're rarely identical). "Very similar" is hardly a useful definition: we need a more concrete, measurable notion of pairwise similarity between two objects.
Features and pairwise similarity
Each face in our example has a number of attributes; we will call them (quite appropriately) features:
- sex (male, female, other),
- hair color (blond, brown, dark, white, red, none),
- skin tone (yellow, white, light brown, brown, dark brown, black),
- facial hair (beard, moustache, none),
- age (child, teen, adult, elderly),
- glasses (yes, no).
Note that some features may have a discrete set of values (glasses, facial hair, hair color), while others may be continuous (age). We will use discrete values everywhere for simplicity.
It seems intuitive that people sharing more features should be better candidates for a "lookalike" (or a stunt double). Our pairwise similarity between any two faces can be defined as the number of shared features and our goal would be to maximize this similarity. Here are three example faces and their features:
face | sex | hair color | skin tone | facial hair | age | glasses
👩🏻 | female | dark | white | none | teen | no
👩🏻🦱 | female | dark | white | none | teen | no
👩🏽🦱 | female | brown | light brown | none | teen | no
We can now compute the similarity for all combinations:
similarity(👩🏻, 👩🏻🦱) = 6, similarity(👩🏻, 👩🏽🦱) = 4, similarity(👩🏻🦱, 👩🏽🦱) = 4.
The faces in the first two rows agree on all the features: these would be "perfect lookalikes". Even if they are not identical (perhaps we missed a feature - hair "type"), they are clearly more similar to each other than they are to the face in the third row, at least according to our notion of pairwise similarity.
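To make this notion concrete, here is a minimal Python sketch; the dictionaries simply restate the three rows of the table above:

# Each face is described by a dictionary of feature values,
# mirroring the three rows of the table above.
FACE_1 = {"sex": "female", "hair": "dark", "skin": "white",
          "facial hair": "none", "age": "teen", "glasses": "no"}
FACE_2 = dict(FACE_1)  # identical features: a "perfect lookalike" of FACE_1
FACE_3 = {"sex": "female", "hair": "brown", "skin": "light brown",
          "facial hair": "none", "age": "teen", "glasses": "no"}

def similarity(a, b):
    """Pairwise similarity: the number of features two faces share."""
    return sum(1 for feature in a if a[feature] == b[feature])

print(similarity(FACE_1, FACE_2))  # 6
print(similarity(FACE_1, FACE_3))  # 4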
Equipped with pairwise similarity and features, we can now go back to the original problem and apply the following simple algorithm:
- Take the first person and compare them to all the remaining ones. If we find any faces that share a certain minimum number of features, they (all) form a lookalike group and are added to the result.
- If there are any people left, take the first one and repeat the same process until no one remains.
This solution, while correct, has one fundamental flaw: the number of pairwise comparisons required to establish the similarity among all the faces is enormous. In fact, it equals n × (n − 1) / 2 in the worst case, so for the 90 or so faces above, we'd need to compare over 4,000 pairs of faces...
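Expressed in the same Python sketch, the naive procedure looks as follows (simplified to collect matching pairs rather than whole groups); comparing every pair explicitly is exactly what makes it quadratic:

from itertools import combinations

def naive_lookalikes(faces, min_shared=6):
    # Compares every pair of faces directly:
    # n * (n - 1) / 2 similarity computations in the worst case.
    pairs = []
    for a, b in combinations(faces, 2):
        if similarity(a, b) >= min_shared:
            pairs.append((a, b))
    return pairs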
Preliminary grouping
A better idea is to leverage the fact that people who are identical to each other (according to our measure of similarity!) must agree on every feature, and in particular on any single feature we choose up front. We can therefore split them into disjoint subsets based on that feature and apply the same naive pairwise-comparison procedure we used previously, restricted to only the people in each subset. For example, if we take the hair color feature, we'd have the following subsets:
subset | hair | faces
A | blond | 🧔🏼 👨🏼 👩🏼🦱 🧑🏼 👱🏼 👱 👦🏼 👩🏼 👧🏼 👱🏾 👱🏻 👱🏿 👨🏼🦱 👱🏽 |
B | brown | 👧🏽 👨🏽🦱 👨 👧 🧑 👨🏽 👩 🧑🏽 👩🏽 👦 👦🏽 🧔🏽 🧔 👩🏽🦱 |
C | dark | 👦🏿 🧑🏿 👨🦱 👩🏾🦱 👩🏾 👨🏾🦱 👨🏿 👧🏿 👨🏻🦱 👦🏻 🧑🏾 👧🏾 👩🏻 🧔🏾 👨🏻 👩🏾 👨🏾 👧🏻 👦🏾 👩🏿 🧑🏻 🧔🏻 👩🏿🦱 👨🏿🦱 🧔🏻 👩🦱 👩🏻🦱 🧔🏿 |
D | white | 👵 👵🏻 👵🏼 👴🏽 🧓🏻 👵🏽 👵🏿 👵🏾 👴 👴🏻 👴🏼 👴🏾 🧓🏾 👴🏿 🧓 🧓🏼 🧓🏽 🧓🏿 |
E | red | 👨🦰 👨🏾🦰 👨🏽🦰 👨🏻🦰 👩🦰 👩🏾🦰 👨🏼🦰 👩🏻🦰 👨🏿🦰 👩🏼🦰 👩🏽🦰 👩🏿🦰 |
F | other | 👨🦲 👨🏽🦲 👨🏾🦲 👨🏻🦲 👩🏻🦲 👩🏽🦲 👩🦲 👩🏿🦲 👨🏼🦲 👩🏼🦲 👨🏿🦲 👩🏾🦲 |
After this preliminary grouping, the worst-case number of required pairwise comparisons drops to 845; a much smaller number than previously!
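In code, preliminary grouping is little more than bucketing by the grouping feature and then running the naive procedure inside each (much smaller) bucket; a sketch reusing the functions above:

from collections import defaultdict

def grouped_lookalikes(faces, group_by="hair", min_shared=6):
    # Split the faces into disjoint subsets keyed by one feature...
    buckets = defaultdict(list)
    for face in faces:
        buckets[face[group_by]].append(face)
    # ...then compare pairs only within each subset.
    pairs = []
    for subset in buckets.values():
        pairs.extend(naive_lookalikes(subset, min_shared))
    return pairs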
We can do even more: we can create those disjoint sets based on combinations of features, which makes them more unique and results in smaller groups of candidates for the naive algorithm. For example, we can split all the people based on the combination of their hair color and facial hair. The groups would then look like this:
subset | hair | facial hair | faces
A | blond | beard, moustache | 🧔🏼 |
B | white | none | 👵 👵🏻 👵🏼 👴🏽 🧓🏻 👵🏽 👵🏿 👵🏾 👴 👴🏻 👴🏼 👴🏾 🧓🏾 👴🏿 🧓 🧓🏼 🧓🏽 🧓🏿 |
C | brown | none | 👧🏽 👧 🧑 👩 🧑🏽 👩🏽 👦 👦🏽 👩🏽🦱 |
D | black | none | 👦🏿 🧑🏿 👩🏾🦱 👩🏾 👧🏿 👦🏻 🧑🏾 👧🏾 👩🏻 👩🏾 👧🏻 👦🏾 👩🏿 🧑🏻 👩🏿🦱 👩🦱 👩🏻🦱 |
E | red | moustache | 👨🦰 👨🏾🦰 👨🏽🦰 👨🏻🦰 👨🏼🦰 👨🏿🦰 |
F | black | beard, moustache | 🧔🏾 🧔🏻 🧔🏻 🧔🏿 |
G | red | none | 👩🦰 👩🏾🦰 👩🏻🦰 👩🏼🦰 👩🏽🦰 👩🏿🦰 |
... (and so on) ... |
Note that many groups are now much smaller. In certain cases we can skip a group immediately: group A contains a single person, so we know no perfect lookalike of 🧔🏼 can exist.
Preliminary grouping is an important concept. It brings down the number of required pairwise comparisons to a computationally feasible range. It also matters because certain features may be cheaper to compute than others. The cheap features can be used for grouping and the costly features (in our example these could be retinal scan profiles or DNA sequencing) can be computed only for pairs within each subgroup - where there is a non-zero chance of elements being similar.
There is one caveat we quietly omitted in the discussion so far. Preliminary grouping assumes that similar objects must share all the features used for separating candidates into different groups. The two faces 👩🏻🦰 and 👩🏻 in our example are clearly very similar, but they would never be compared directly because they ended up in different subsets based on the set of features we chose for grouping.
This raises the question of which feature (or combination of features) to pick for pregrouping. The short answer is: it's a bit of an art form... The choice of pregrouping features depends on how costly these features are to compute, how well they distribute objects across subgroups, and perhaps other factors that depend on the task in question. For the needs of our example, we assumed that lookalikes must have the same hair color, with all the remaining features contributing to the final similarity score. While it is possible to relax this assumption, doing so goes beyond the scope of this simple introductory example, so let's postpone it for later.
Hashing
There is one more important concept that we have to introduce: hashing. While a computer can deal with explicit symbolic features like the ones we defined, it is much better (and faster!) at handling numbers. If we can convert our features to numbers and then use these numbers for pregrouping, the algorithm will run faster.
Hashing is a very broad topic, but the idea itself is quite simple. Let's imagine every letter in the English alphabet is assigned a unique natural number, for example like this:
a→1, b→2, c→3, d→4, e→5, f→6, g→7, h→8, i→9, j→10, k→11, l→12, m→13, n→14, o→15, p→16, q→17, r→18, s→19, t→20, u→21, v→22, w→23, x→24, y→25, z→26
We can then encode any letter-based word in the English language as a natural number by taking the letters of that word and using some creative function to compute the result from the combination of letters the word contains. The function can be as simple as addition, for example:
brown = (b→2) + (r→18) + (o→15) + (w→23) + (n→14) = 72
blond = (b→2) + (l→12) + (o→15) + (n→14) + (d→4) = 47
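A Python sketch of this toy additive hash (letters map to their positions in the alphabet; anything that is not a letter is ignored):

def additive_hash(text):
    # Sum of alphabet positions (a=1, b=2, ..., z=26) of all letters in the text.
    return sum(ord(c) - ord("a") + 1 for c in text.lower() if c.isalpha())

print(additive_hash("brown"))  # 72
print(additive_hash("blond"))  # 47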
We can now convert all the symbolic values of features describing our faces into their numeric hashes and use these for preliminary grouping. We'd have something like this then:
subset | hash(hair) | hash(facial hair) | faces
A | 47 | 135 | 🧔🏼 |
B | 65 | 48 | 👵 👵🏻 👵🏼 👴🏽 🧓🏻 👵🏽 👵🏿 👵🏾 👴 👴🏻 👴🏼 👴🏾 🧓🏾 👴🏿 🧓 🧓🏼 🧓🏽 🧓🏿 |
C | 72 | 48 | 👧🏽 👧 🧑 👩 🧑🏽 👩🏽 👦 👦🏽 👩🏽🦱 |
D | 29 | 48 | 👦🏿 🧑🏿 👩🏾🦱 👩🏾 👧🏿 👦🏻 🧑🏾 👧🏾 👩🏻 👩🏾 👧🏻 👦🏾 👩🏿 🧑🏻 👩🏿🦱 👩🦱 👩🏻🦱 |
E | 27 | 105 | 👨🦰 👨🏾🦰 👨🏽🦰 👨🏻🦰 👨🏼🦰 👨🏿🦰 |
F | 29 | 135 | 🧔🏾 🧔🏻 🧔🏻 🧔🏿 |
G | 27 | 48 | 👩🦰 👩🏾🦰 👩🏻🦰 👩🏼🦰 👩🏽🦰 👩🏿🦰 |
... (and so on) ... |
These are the same feature "values" as before, only represented by their hash numbers. We can do one more clever trick, though: we can hash the resulting hashes and have just one number represent an arbitrary combination of features. If we use addition, like before, we'd arrive at the following output:
subset | hash([hair, facial hair]) | faces
A | 182 | 🧔🏼 |
B | 113 | 👵 👵🏻 👵🏼 👴🏽 🧓🏻 👵🏽 👵🏿 👵🏾 👴 👴🏻 👴🏼 👴🏾 🧓🏾 👴🏿 🧓 🧓🏼 🧓🏽 🧓🏿 |
C | 120 | 👧🏽 👧 🧑 👩 🧑🏽 👩🏽 👦 👦🏽 👩🏽🦱 |
D | 77 | 👦🏿 🧑🏿 👩🏾🦱 👩🏾 👧🏿 👦🏻 🧑🏾 👧🏾 👩🏻 👩🏾 👧🏻 👦🏾 👩🏿 🧑🏻 👩🏿🦱 👩🦱 👩🏻🦱 |
E | 132 | 👨🦰 👨🏾🦰 👨🏽🦰 👨🏻🦰 👨🏼🦰 👨🏿🦰 |
F | 164 | 🧔🏾 🧔🏻 🧔🏻 🧔🏿 |
G | 75 | 👩🦰 👩🏾🦰 👩🏻🦰 👩🏼🦰 👩🏽🦰 👩🏿🦰 |
... (and so on) ... |
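In our sketch, hashing the hashes is just one more addition; the group keys of subsets A and B above, for example, can be reproduced like this:

def combined_hash(*feature_values):
    # Hash each feature value, then hash (here: add) the resulting hashes.
    return sum(additive_hash(value) for value in feature_values)

print(combined_hash("blond", "beard, moustache"))  # 47 + 135 = 182 (subset A)
print(combined_hash("white", "none"))              # 65 + 48 = 113 (subset B)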
A desirable property of a hash function is to return a unique number for each unique input (be it word letters or a combination of numbers). This is sadly not always possible and sometimes two different inputs hash to the same number. We call this situation a hash collision. For example, with our addition-based hash function we could have:
act = (a→1) + (c→3) + (t→20) = 24
cat = (c→3) + (a→1) + (t→20) = 24.
This is not really a big problem if collisions are rare (and most hash functions are cleverly designed to minimize the probability of this happening). Even if a hash collision occurs, all that happens is that objects from two subsets that should be disjoint end up in a single subgroup. The result will still be correct: we still need to compare all pairs of objects within a subgroup, after all.
The gain from using hash functions in preliminary grouping is great. No matter how many and how complex features we combine, we will always end up with a single, very likely unique, number that characterizes those features. For example, we can now easily find identical faces at our school by combining all six features into a single hash. After preliminary grouping, the result would be pretty much this:
subset | hash([all features]) | faces
A | 202 | 👦🏿 |
B | 210 | 🧑🏿 |
C | 235 | 👩🏿, 👩🏿🦱 |
D | 271 | 👩🏻, 👩🏻🦱, 👱🏽 |
... (lots more) ... |
E | 312 | 👩🏾🦱, 👩🏾, 👩🏾 |
F | 319 | 👦🏼 |
G | 347 | 🧔🏻, 🧔🏻 |
... (lots more) ... |
The two pairs of identical faces at our school are visible above: groups E and G each contain a repeated face. Note that the additive hash function does result in a number of collisions (group E also holds a face with different features, yet the same resulting hash; other groups with more than one face contain non-identical elements), but overall each group after the preliminary grouping is very, very tiny. In fact, most groups contain just a single, unique face.
Any duplicate detection pipeline in Lingo4G is built using the key concepts described above, all of which should by now be easier to understand:
- document features,
- pairwise document similarity,
- preliminary document grouping,
- feature hashes and hash functions applied to other hashes.
Time to leave our introductory example aside and get to some more realistic Lingo4G API code!
API request structure
Lingo4G's
documentPairs:duplicates
request stage is responsible for finding similar document pairs. There are three key elements of this request and
they correspond exactly to the description in our toy example above. Here is an example request structure:
{
"type": "documentPairs:duplicates",
"query": {
...
},
"hashGrouping": {
...
},
"validation": {
...
}
}
The
query
property specifies which documents from the entire index should be included in the pairwise similarity search. You can use it to, for example, restrict duplicate detection to a range of dates or a given author, or to apply any other valid query.
The
hashGrouping
property configures preliminary document grouping: how document features are defined, how their hashes are
computed and how they're compared.
Finally, the validation element provides a way of computing the final pairwise document similarity for any pair of documents with colliding hashes. This element also includes the thresholds of the similarity range that makes up the stage's result.
Every duplicate detection request will have the three elements described above. All the different intents and applications of this request stage are expressed by varying the source of features, how hashes are computed and how the pairwise similarity is defined.
Let's dive straight into a real-life example to discuss the internal structure of each of these primary building blocks.
Finding identical duplicates
Let's find pairs of Arxiv publications with identical titles published between 2015 and 2017. Lingo4G comes with a ready-to-use example dataset descriptor (which also downloads a preprocessed Arxiv data dump). Run the indexing command and the server (this will take a while):
l4g index -p datasets/dataset-arxiv
l4g server -p datasets/dataset-arxiv
Then open the sandbox API editor (it should be available at https://localhost:8080) and enter the full request, built from the three main blocks discussed below. How do they work together?
- The query property defines the documents among which to find duplicates:

"query": {
  "type": "query:string",
  "query": "created:[2015 TO 2017]"
}

We use the query:string component and provide a range query that matches documents created between 2015 and 2017.

- The hashGrouping property configures preliminary grouping:

"hashGrouping": {
  "pairing": {
    "maxHashGroupSize": 200
  },
  "features": {
    "type": "featureSource:values",
    "fields": {
      "type": "fields:reference",
      "use": "sourceFields"
    }
  }
}

The maxHashGroupSize property declares the maximum cardinality of a hash-colliding group: groups containing more than this number of documents will be omitted from pairwise comparisons. Remember that the number of such comparisons grows quadratically with the group size, so this is a safety limit that prevents the algorithm from slowing down on accidental, very large collision groups. In our case it is unlikely that more than 200 Arxiv papers share an identical title, so it's a reasonable setting.

The features property specifies a feature source. Here, we use entire field values as the source of features from which hashes are then computed. This means each unique title will have a hash value that collides with those of all other documents carrying the same title. We use a reference to the sourceFields component to specify which fields should actually be used as a source; we do this because the same set of fields is used in more than one place in the request. Here is the definition of this reusable component:

"components": {
  "sourceFields": {
    "type": "fields:simple",
    "fields": [
      "title"
    ]
  }
}

- Finally, we declare pairwise document similarity for documents that ended up in the same group after preliminary hash grouping:

"validation": {
  "pairwiseSimilarity": {
    "type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
    "features": {
      "type": "featureSource:values",
      "fields": {
        "type": "fields:reference",
        "use": "sourceFields"
      }
    }
  },
  "min": 1,
  "max": 1
}

The key element here is pairwiseSimilarity, for which we use the featureIntersectionToUnionRatio implementation. This type of pairwise similarity computes a set of unique features for each document and then divides the number of features both documents have in common by the total number of unique features (a conceptual sketch of this ratio follows after this list). We declare the source of features in an identical way as for hash grouping (entire values, referencing the same set of fields), so two documents will either have a similarity of 1 (the same value in the title field) or 0 (an accidental hash collision of different values). Accordingly, we declare the threshold of similarity values we're interested in as [1, 1] (the min and max properties above; interval boundaries are inclusive).

Note that we could declare a different source of features for hash grouping and for the pairwise similarity computation. Later on, we show where such a setup can be useful.
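As a side note, the intersection-to-union ratio is also known as the Jaccard index and is easy to express in a few lines of Python; this is a conceptual sketch of what the similarity computes, not Lingo4G's actual implementation:

def intersection_to_union_ratio(features_a, features_b):
    # |A ∩ B| / |A ∪ B| over the sets of unique features of two documents.
    a, b = set(features_a), set(features_b)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

# With entire field values as features, the result is binary:
print(intersection_to_union_ratio(["some title"], ["some title"]))   # 1.0
print(intersection_to_union_ratio(["some title"], ["other title"]))  # 0.0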
Once the above request executes, you should see a response with the structure described below.
The response of our similarPairs stage is a JSON structure with a few nested elements. It is very often useful to retrieve not just the result of the request, but also to gain some insight into the internal stages of the algorithm. This is the purpose of the leading diagnostics block, which contains human-readable information that can be used to tune (or diagnose) similarity requests. For example:
"Computing feature hashes": {
"Features": "entire values from field: title",
"Documents": "50,434",
"Hash count": "50,434"
}
This section shows how many hashes were computed (Hash count), from how many documents in scope (Documents), and what the source of features was (entire values from field: title). In this example these numbers are the same, but only because we take entire values from the title field and the title holds a single value only. We will see later how we can compute hashes from sub-fragments of a field (such as sentences, word n-grams or even individual terms).
In the next section, we can gain some insight into hash grouping:
"Aggregating results": {
"Hash collision groups": "7",
"Skipped large hash collision groups": "0",
"Pairwise hash comparisons": "9",
"Candidate pairs": "9",
"Out of scope pairs": "0"
}
We had 7 hash collision groups (the Hash collision groups entry) among all the computed feature hashes. No group was larger than the maxHashGroupSize threshold we specified in the request, because Skipped large hash collision groups is zero. These hash collision groups further resulted in a number of pairs promoted for direct pairwise similarity computation (the Candidate pairs entry).
The Computing hash collisions block is a very low-level diagnostic concerning an implementation detail, and we will skip it for now. More useful and interesting information is at the very end:
"Validating pairs using pairwiseSimilarity:featureIntersectionToUnionRatio [entire values from field: title]": {
"Pairwise similarity function": "pairwiseSimilarity:featureIntersectionToUnionRatio [entire values from field: title]",
"Similarity thresholds": "[1.0, 1.0]",
"Pairs passing": "9"
}
The Pairs passing element indicates the number of document pairs that passed the thresholds specified in the pairwiseSimilarity element above. This means that out of all candidate pairs with a hash collision, we ended up with this number of document pairs that actually had the same title. This is exactly what the pairs block contains: an array of objects, each describing a pair of documents similar according to our notion of pairwise similarity. Here is an example pair:
{
"pair": [
86,
226443
],
"similarity": 1,
"explanation": "Documents share 1 out of 1 distinct feature (entire values from field: title)."
},
The pair sub-element contains exactly two (Lingo4G-internal) document identifiers. The similarity between these two documents is 1 and the (optional) explanation element precisely describes how the similarity was computed:
Documents share 1 out of 1 distinct feature (entire values from field: title).
Many applications will require not just the internal identifiers of duplicate documents but also additional information, such as field values. To accomplish this, we need to combine the documentContent stage, which retrieves document fields, with documents:fromDocumentPairs, which converts document pairs into a flat list of documents. All we need to do then is add the documentContent stage to our request and provide our similarPairs stage, converted to a flat list, as the source of documents for content fetching:
"content": {
"type": "documentContent",
"limit": "unlimited",
"documents": {
"type": "documents:fromDocumentPairs",
"documentPairs": {
"type": "documentPairs:reference",
"use": "similarPairs"
}
},
"fields": {
"type": "contentFields:simple",
"fields": {
"id": {},
"title": {},
"author_name": {},
"created": {},
"updated": {},
"abstract": {
"maxValueLength": 250
}
}
}
}
A response to such an extended request contains both the similar document pairs and their field values. Client-side code needs to join these two structures manually, but this should not be a problem.
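For example, a minimal client-side join could look like the sketch below. Note that the shape of the document objects is an assumption here (a hypothetical "id" key holding the internal document identifier); only the pair objects follow the structure shown earlier:

def join_pairs_with_content(pairs, documents):
    # "documents" is assumed to be a list of objects exposing the internal
    # document identifier under an "id" key (hypothetical field name).
    by_id = {doc["id"]: doc for doc in documents}
    return [(by_id.get(p["pair"][0]), by_id.get(p["pair"][1]), p["similarity"])
            for p in pairs]

Here is how the Explorer displays the result of such a combined request: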
The title of these two documents is identical (as is the abstract), but the date of document submission and the author list are different. Whether this is coincidental or not is irrelevant for now, but it leads to an important question: how do we construct a similarity request for cases when the text in two documents is almost the same, but not quite identical? This problem is called near-duplicate detection, and we will discuss it next.
Finding near-duplicates
We will modify the request for finding identical arxiv paper titles, provided in the previous example. This time, we want to search for pairs of documents where the title is almost the same but not quite.
Perhaps the first thing we'd have to do is redefine pairwise document similarity. We want a continuous measure of similarity between document titles: one that distributes the score between identical titles and very different titles over some predetermined range of numbers. Many strategies can be used here, but we can start with something intuitive: let's use the words of each title as features. A normalized similarity score based on such features could be the number of unique words the two titles have in common, divided by the total number of unique words in both. Here is an example:
document | title | words |
1 | Lucy had a gray cat. | lucy, had, a, gray, cat |
2 | Lucy had a blue cat. | lucy, had, a, blue, cat |
The number of unique terms across both documents is 6 (lucy, had, a, blue, gray, cat). The number of terms they have in common is 4 (lucy, had, a, cat). The pairwise similarity between these two documents is therefore 4/6 ≈ 66%.
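As a sanity check, we can plug word features into the intersection_to_union_ratio sketch from the previous section:

print(intersection_to_union_ratio(
    "lucy had a gray cat".split(),
    "lucy had a blue cat".split()))  # 4/6 ≈ 0.667

Here is the definition of such a pairwise similarity in the Lingo4G API: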
"validation": {
"pairwiseSimilarity": {
"type": "pairwiseSimilarity:featureIntersectionToUnionRatio",
"features": {
"type": "featureSource:flatten",
"source": {
"type": "featureSource:words",
"fields": {
"type": "fields:reference",
"use": "sourceFields"
}
}
}
},
"min": 0.9,
"max": 0.999
},
Note that we use the same similarity implementation as previously (featureIntersectionToUnionRatio), but this time we restrict the score threshold to be between 0.9 and 0.999 (we're not interested in identical duplicates, but everything close to identical is interesting).
The source of features is a bit more complex and requires a comment. We use the
featureSource:words
implementation type to extract a list of individual words from the title field (via the
sourceFields
reference). However, this implementation produces composite
features — higher-level features that contain smaller features. In this particular case,
featureSource:words
produces a single feature from each field value; each such feature internally
contains an array of smaller features representing individual words. If we used this implementation directly as
input to the pairwise similarity implementation (featureIntersectionToUnionRatio
), it would be no
different from using entire titles because individual words are "enclosed" by the composite feature and it is
those top-level features that would be compared. We need to "open up" composites and get access to individual
features contained by the composite. This is precisely what the top-level
featureSource:flatten
does.
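In terms of our earlier Python sketches (and only as an illustration, not Lingo4G's internal representation), the difference looks like this:

# A composite feature: one top-level feature per field value,
# internally wrapping the individual word features.
composite = [("lucy", "had", "a", "gray", "cat")]

# After flattening, the word features become comparable on their own.
flattened = [word for feature in composite for word in feature]
# ['lucy', 'had', 'a', 'gray', 'cat']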
Now that we have the pairwise similarity, we need to decide which features to use for preliminary grouping. Previously, we used the entire title but this will no longer work: if two titles differ by even a single word, the hash code of a feature constructed on top of such entire value would be (with high probability) different and the two documents would never make it to the same group of candidates for pairwise comparisons.
A different attempt would be to copy-paste the same feature source we used for the pairwise similarity above: (flattened) words. This approach does work, but it is very, very inefficient. Single words are not unique: the same words occur in many, many titles, and their hashes would (in the majority of cases) create collision groups close to, or even exceeding, the maxHashGroupSize threshold.
Here is the diagnostic output from a request that uses flattened words for preliminary grouping but restricts the documents to just those having the word "solar" in the title:
"Computing feature hashes": {
"Features": "flattened composites of words from field: title",
"Documents": "42,276",
"Hash count": "367,799"
},
"Aggregating results": {
"Hash collisions": "9,539",
"Pairwise hash comparisons": "5,290,430",
"Candidate pairs": "4,975,598",
"Skipped large hash collision groups": "359"
},
"Validating pairs using pairwiseSimilarity:featureIntersectionToUnionRatio [flattened composites of words from field: title]": {
"Pairwise similarity function": "pairwiseSimilarity:featureIntersectionToUnionRatio [flattened composites of words from field: title]",
"Similarity thresholds": "[0.9, 0.999]",
"Pairs passing": "25"
},
The number of documents matching "solar" is 42,276 and the number of hash collision groups is a mere 9,539, but each collision group contained many elements and the number of candidate pairs shot up to almost 5 million! In fact, 359 collision groups were skipped entirely because they contained too many candidates to even consider. In the end, out of the 5 million candidate pairs checked, we got an output of 25 document pairs that met the similarity threshold criteria. The result is accurate, but we did a lot of needless work.
There is clearly a need for a more intelligent feature selection strategy for preliminary grouping. Features should be unique enough to redistribute potential candidates into small groups with documents that have a high probability of meeting the pairwise similarity criteria.
In this example, we will use overlapping, fixed-length sub-sequences of words. This technique is known as shingling or term n-grams. Consider 3-grams from our example sentences above:
document | title | 3-grams |
1 | Lucy had a gray cat. | lucy had a, had a gray, a gray cat |
2 | Lucy had a blue cat. | lucy had a, had a blue, a blue cat |
The number of n-grams is approximately the same as the number of terms, but n-grams are typically much more unique to the document they come from. You are very likely to encounter the term "a" in any English text, but "a blue cat" occurs far less frequently.
Longer n-grams will be more globally unique, but if we make them too long, we run the risk of failing to group shorter titles or titles whose changed words are scattered around. In the example above, the two documents do share a 3-gram ("lucy had a"), but they don't share any 4-grams. This tradeoff between possibly "missing" a similar document pair because n-grams are too long and having n-grams that are not unique enough is difficult to resolve. It requires some domain knowledge (How long are document fields? What is the content being analyzed? What is the cost of missing a similar pair vs. waiting too long?) and perhaps a few rounds of trial and error on actual documents.
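A minimal Python sketch of word shingling:

def word_ngrams(text, n=3):
    # Overlapping word n-grams (shingles) of the given length.
    words = text.lower().replace(".", "").split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("Lucy had a gray cat."))
# ['lucy had a', 'had a gray', 'a gray cat']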
Continuing the example, we will pick 5-grams of words as features for preliminary grouping, leaving pairwise similarity defined on top of individual words. Below is the full API request. Note the highlighted preliminary grouping block:
Once run, the output should display pairs of documents with similar but not identical titles. Here is an example:
It's clear these titles fulfill our initial goal: they're very similar, but not identical. Now picture the same kind of similarity search over a long text field (for example, the abstract in the Arxiv data): spotting such differences by eye would be much harder, especially when the similarity between fields is high. Luckily, Lingo4G has a built-in stage that can highlight duplicated text blocks found in a pair of documents, and this is what we will do next.