featureSource
Feature source components convert one or more text fields into a stream of comparable and hashable objects. Feature sources configure the operation of duplicate detection and the detection and highlighting of overlapping text regions.
The following featureSource:* stage types are available for use in analysis request JSONs:
- featureSource:chunks: Constructs features from randomized sub-ranges of terms called chunks.
- featureSource:count: A filter that passes through another source of features only if it fulfills the count criteria (minimum, maximum number of features).
- featureSource:flatten: Flattens one or more composite features into a stream of their most atomic components.
- featureSource:group: Groups one or more features into a composite feature.
- featureSource:labels: Constructs features from indexed labels.
- featureSource:minhash: A feature source emitting minhashes from a stream of other features.
- featureSource:ngrams: Builds composite features as a moving window over a stream of other features.
- featureSource:sentences: Constructs features from each full sentence in the text.
- featureSource:simhash: A feature source emitting a simhash from a stream of other features (typically minhashes).
- featureSource:unique: A filter that leaves only the set of unique features from a source of other features.
- featureSource:values: Constructs features from entire field values.
- featureSource:words: Constructs features from each term in the text.
- featureSource:reference: References a featureSource:* component defined in the request or in the project's default components.
The selection and configuration of the feature source affects both the runtime performance and the quality of the tasks that consume features. There is no single recipe for the best configuration: the data set and the task at hand will affect this choice.
All available feature source implementations can be grouped into the following categories: primary text conversion, composition and decomposition, filters, and advanced techniques.
Primary text conversion sources emit a stream of atomic (or composite) features that are directly connected to the source text of one or more fields. For example, the values source computes a stream of atomic features for each value from the set of designated fields. This can be used to detect identical field values across the document selector scope. Primary features can be more complex, though: the sentences source emits a stream of features where each feature corresponds to a single sentence, as determined by Java's built-in sentence boundary iterator. This can be used to detect occurrences of the same sentence across many documents, but also to measure how many sentences two documents have in common (and highlight those sentences).
All other categories of feature sources (filters, composition) are used to manipulate or limit the output of primary feature sources in some way. We leave the details to the documentation of each source.
featureSource:chunks
Constructs features from randomized sub-ranges of terms called chunks.
{
"type": "featureSource:chunks",
"fields": {
"type": "fields:reference",
"auto": true
},
"minCharacters": 80,
"modulo": 5
}
A chunk feature source tries to strike a balance between featureSource:sentences and featureSource:ngrams. Unlike sentences, the start and end of a chunk do not depend on punctuation.
- The text is broken into tokens.
- An integer hash value is computed for each token.
- Tokens for which the hash value modulo the modulo parameter is zero become tombstones.
- Chunks extend from one tombstone (inclusive) to the next (exclusive).
- Chunks shorter than the minCharacters parameter are filtered out.
The resulting features are more probabilistic and fine-grained than sentences but, since chunks do not overlap, their number is much smaller than the number of full n-grams.
Returns a stream of flat atomic chunk features for all fields.
fields
Declares one or more fields from which features should be computed.
minCharacters
Chunks smaller than this minimum number of characters are omitted.
modulo
Determines the tombstone marker frequency (probabilistic chunk length). A value of 5 means that, on average, every fifth word will be a tombstone. In reality the standard deviation may be high and tombstones may be next to each other or far apart. Statistically this should not affect feature collisions if the repeated text is long enough.
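The chunking steps above can be sketched in Python. This is an illustrative sketch of the idea only, not Lingo4G's implementation: the whitespace tokenization and the CRC-32 token hash are simplified assumptions.

```python
import zlib

def chunks(text, min_characters=80, modulo=5):
    """Split text into probabilistic chunks delimited by tombstone tokens."""
    tokens = text.split()  # simplified tokenization (assumption)
    result, current = [], []
    for token in tokens:
        # A token becomes a tombstone when its hash is divisible by `modulo`.
        if zlib.crc32(token.encode()) % modulo == 0:
            if current:
                # The next tombstone closes the current chunk (exclusive)...
                result.append(" ".join(current))
            current = [token]  # ...and starts a new one (inclusive).
        elif current:
            current.append(token)
    if current:
        result.append(" ".join(current))  # trailing chunk kept in this sketch
    # Chunks shorter than the minimum character length are dropped.
    return [c for c in result if len(c) >= min_characters]
```

Every emitted chunk starts at a tombstone token and contains no further tombstones, so identical text passages produce identical chunks regardless of where they appear in a document.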
featureSource:count
A filtering feature source that accepts another source and filters out feature vectors smaller or larger than the provided thresholds.
{
"type": "featureSource:count",
"maxFeatureCount": "unlimited",
"minFeatureCount": 1,
"source": {
"type": "featureSource:reference",
"auto": true
}
}
maxFeatureCount
Maximum count of features for a document. If the number of features returned from the delegate source is larger, an empty feature vector is returned.
minFeatureCount
Minimum count of features for a document. If the number of features returned from the delegate source is smaller, an empty feature vector is returned.
source
The delegate source of features to be filtered. Note that composite features are not expanded automatically (each composite counts as one feature). Use featureSource:flatten to flatten composites, if needed.
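The filtering semantics can be sketched with a hypothetical helper (not the actual implementation); note that a composite feature counts as a single feature:

```python
def count_filter(features, min_count=1, max_count=None):
    """Pass the feature vector through only when its size is within bounds."""
    n = len(features)  # composites are not expanded; each counts as one
    if n < min_count or (max_count is not None and n > max_count):
        return []  # out of bounds: an empty feature vector is returned
    return features
```

Here `max_count=None` plays the role of the "unlimited" setting shown in the JSON above.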
featureSource:flatten
A filtering feature source that flattens a stream of composite features into a stream of atomic features.
{
"type": "featureSource:flatten",
"source": {
"type": "featureSource:reference",
"auto": true
}
}
source
The delegate source of composite or atomic features to be flattened.
featureSource:group
A filtering feature source that creates a composite feature from a stream of features returned by another source.
{
"type": "featureSource:group",
"source": {
"type": "featureSource:reference",
"auto": true
}
}
source
The delegate source of composite or atomic features to be grouped into a composite feature.
featureSource:labels
A feature source constructing features from document labels.
{
"type": "featureSource:labels",
"fields": {
"type": "featureFields:reference",
"auto": true
},
"maxDocFrequency": "unlimited",
"minDocFrequency": 1
}
Returns a stream of per-field composites (containing atomic label features).
Additional parameters can be used to control the minimum and maximum document frequency of allowed labels. For example, a request could look for documents that have between 80 and 99% of their labels in common, with additional restrictions added to prevent the number of candidate pairs from blowing up: labels must occur at least twice in each document, at least 5 labels must be present in each document, and only features with a relatively low collision rate (the maxHashGroupSize parameter) are used to compute candidate pairs. This last restriction limits the results to duplicates that contain a label that is relatively unique across the set of all documents.
fields
Declares one or more fields from which features should be computed.
maxDocFrequency
Maximum label frequency (within the document). More frequent labels are omitted.
minDocFrequency
Minimum label frequency (within the document). Less frequent labels are omitted.
featureSource:minhash
This feature source takes a set of source features on input and produces a set of derived features representing minhash vectors for this set.
{
"type": "featureSource:minhash",
"functionCount": 128,
"source": {
"type": "featureSource:reference",
"auto": true
}
}
Minhashing is a locality-sensitive hashing scheme. Minhash vectors computed for two sets of features contain identical elements only if the two source sets share elements; the more similar the sets, the more minhash values they are expected to have in common.
This is a rather advanced feature source and typically not the most intuitive one, since the explanation of pairwise similarity will contain the number of minhash values in common, not the original features. Minhashes are best applied to large inputs, when the number of features for each document is large.
Returns a stream of exactly functionCount atomic features, each representing a minhash of the source, derived from a different hash function.
functionCount
The number of different hash functions (minhash vectors) to produce.
source
The source of features from which minhashes should be computed. Top-level features are used (composites are not flattened).
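The underlying idea can be illustrated with a generic sketch (not Lingo4G's implementation; the CRC-32 base hash and the per-function XOR seed are arbitrary choices made for this example):

```python
import zlib

def minhash(features, function_count):
    """One minhash value per hash function over a set of features."""
    base = [zlib.crc32(f.encode()) for f in features]
    vector = []
    for i in range(function_count):
        # Derive the i-th hash function by mixing a per-function seed
        # into the base hash, then keep the minimum over the whole set.
        vector.append(min((h ^ (i * 0x9E3779B9)) & 0xFFFFFFFF for h in base))
    return vector
```

Identical feature sets always produce identical vectors, and the more elements two sets share, the more positions of their minhash vectors are expected to match.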
featureSource:ngrams
Constructs composite features using a rolling window over a stream of features from another source.
{
"type": "featureSource:ngrams",
"source": {
"type": "featureSource:reference",
"auto": true
},
"window": 10
}
N-gram features are useful when one is looking for unique features representing a sub-sequence of something else. For example, a stream of word features could be broken down into 3-grams (triplets) that are far more unique than individual words.
Composite features will overlap. A stream of N source features and a window size of W results in at most N - W + 1 composite features in the output.
Returns a stream of flat composite features of another source.
source
The source of features from which composite n-grams should be computed. The source must return composite
features as well (n-grams are computed for each composite feature's sub-features). If the source does not return
composite features, use
featureSource:group
to create a composite synthetically.
window
The length of the n-gram window (number of sub-features combined into a single composite).
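The rolling window can be sketched in Python (illustrative only, not the engine's code):

```python
def ngrams(features, window):
    """Combine consecutive sub-features into overlapping composite n-grams."""
    return [tuple(features[i:i + window])
            for i in range(len(features) - window + 1)]
```

For example, `ngrams(["a", "b", "c", "d"], 3)` yields the two overlapping triplets `("a", "b", "c")` and `("b", "c", "d")`.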
featureSource:sentences
A feature source constructing features from punctuation-demarcated sentences in the input text.
{
"type": "featureSource:sentences",
"fields": {
"type": "fields:reference",
"auto": true
},
"minCharacters": 40
}
The text is broken into sentences using Java's built-in Unicode rules (the BreakIterator class).
Returns a stream of flat atomic sentence features.
This feature source is very useful for providing fast and relatively unique features. It can be used as a source of hash collisions if there is a justifiable assumption that similar document pairs have at least one sentence in common. The number of identical sentences can be a good similarity and validation condition in many contexts as well.
For example, sentences can be used as a fast hash collision source, with the final validation condition being a much more costly text overlap similarity; such a request can end by computing text overlaps so that repeated text fragments are easier to spot.
fields
Declares one or more fields from which features should be computed.
minCharacters
Sentences shorter than this threshold will be omitted. This can be used to omit very short sentences, which are likely not unique enough to be good indicators of similarity.
featureSource:simhash
A feature source that computes simhashes from a set of other features (typically minhashes).
{
"type": "featureSource:simhash",
"source": {
"type": "featureSource:reference",
"auto": true
}
}
Simhashes aggregate several bitfields of the same length into a single value that can be compared using Hamming distance. In Lingo4G, simhashes are computed over hash values of features read from another feature source.
Simhashes can be very useful to speed up computations when the values to be discovered are identical or nearly identical. For example, they are frequently used for detecting near-duplicates (minor edits or changes in otherwise longer documents). When using simhashes, make sure to increase maxHashBitsDifferent from its default value so that hashes with a Hamming distance larger than 1 can be considered a hash collision.
For example, pairs of documents with a near-identical abstract field can be computed efficiently using a simhash of minhashes of all sentences from that field.
source
The source of features whose hash values are used as bit fields for the computation of simhash features.
Typically, the source will use
minhashes
computed from yet other features.
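The aggregation and the Hamming-distance comparison can be sketched generically (not Lingo4G's exact implementation; the 32-bit hash width is an arbitrary choice here):

```python
def simhash(hashes, bits=32):
    """Aggregate feature hash values into a single simhash."""
    counts = [0] * bits
    for h in hashes:
        for b in range(bits):
            # Each feature hash votes +1 or -1 on every bit position.
            counts[b] += 1 if (h >> b) & 1 else -1
    # A simhash bit is set where the positive votes win.
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming_distance(a, b):
    """Number of bits on which two simhashes differ."""
    return bin(a ^ b).count("1")
```

Two documents whose simhashes differ on at most maxHashBitsDifferent bits would then be treated as a hash collision.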
featureSource:unique
A filtering feature source leaving only unique features from the source.
{
"type": "featureSource:unique",
"source": {
"type": "featureSource:reference",
"auto": true
}
}
This feature source can be used to leave only the set of unique features when their order and number do not play a key role. For example, it could be used to compute the set of unique word features in a document.
source
The delegate source of features from which the set of unique features is computed.
featureSource:values
Constructs features from entire values in one or more fields.
{
"type": "featureSource:values",
"fields": {
"type": "fields:reference",
"auto": true
}
}
This type of feature source is useful when looking for collisions (or similarity) over entire field values. For example, a request could use the duplicate detection stage to find pairs of arxiv documents published between 2015 and 2017 that have identical titles.
In such a request, the featureSources component is configured to emit features from entire values of the title field. The component is then referenced from both the hash grouping phase and the validation phase of the duplicate detection stage to compute pairs with a similarity of at least 1, that is, at least one value in common, because pairwiseSimilarity:featureIntersectionSize is used to compute the similarity.
The output can be limited to a fixed number of pairs, and more details regarding the similarity computation are available in the diagnostics section of the response.
This feature source is also useful for detecting repeated subsets of values in multi-valued fields. This can be achieved by combining field-value features with featureIntersectionSize pairwise document similarity. For example, a request could look for pairs of arxiv documents mentioning solar energy that have a repeated subset of 10 to 20 identical authors (yes, it is quite ridiculous). As surprising as it may sound, such a request returns dozens of documents.
fields
Declares one or more fields to be used to compute features. Each unique field value is translated into one feature.
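In this reading, pairwiseSimilarity:featureIntersectionSize computes the number of distinct features two documents share, which can be sketched as follows (illustrative, not the engine's code):

```python
def feature_intersection_size(features_a, features_b):
    """Similarity as the number of distinct features in common."""
    return len(set(features_a) & set(features_b))
```

With value features computed from the title field, a similarity of at least 1 means the two documents share at least one identical title value.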
featureSource:words
A feature source constructing features from individual words in the input text.
{
"type": "featureSource:words",
"fields": {
"type": "fields:reference",
"auto": true
}
}
Returns a stream of per-field composites (containing atomic word features).
For example, for all documents written by Robert Williams (the name is picked arbitrarily), one could compute the ratio of words the documents have in common and keep the top 10 scoring pairs. Words can serve as hash collision features when the number of documents in scope is relatively small; for larger queries, the number of collision pairs should be limited by picking a more unique (selective) hash feature source.
fields
Declares one or more fields from which features should be computed.
Consumers of featureSource:*
The following stages and components take featureSource:* as input: