featureSource

Feature source components convert one or more text fields into a stream of comparable and hashable objects (features). Feature sources configure the operation of duplicate detection as well as the detection and highlighting of overlapping text regions.

The following feature​Source:​* stage types are available for use in analysis request JSONs:

feature​Source:​chunks

Constructs features from randomized sub-ranges of terms called chunks.

feature​Source:​count

A filter that passes through another source of features only if it fulfills the count criteria (minimum, maximum number of features).

feature​Source:​flatten

Flattens one or more composite features into a stream of their most atomic components.

feature​Source:​group

Groups one or more features into a composite feature.

feature​Source:​labels

Constructs features from indexed labels.

feature​Source:​minhash

A feature source emitting minhashes from a stream of other features.

feature​Source:​ngrams

Builds composite features as a moving window over a stream of other features.

feature​Source:​sentences

Constructs features from each full sentence in the text.

feature​Source:​simhash

A feature source emitting a simhash from a stream of other features (typically minhashes).

feature​Source:​unique

A filter that leaves only a set of unique features from the source of other features.

feature​Source:​values

Constructs features from entire field values.

feature​Source:​words

Constructs features from each term in the text.


feature​Source:​reference

References a feature​Source:​* component defined in the request or in the project's default components.


The selection and configuration of the feature source affects both runtime performance and quality of tasks that consume features. There is no single recipe for the best configuration: the data set and the task at hand will affect this choice.

All available feature source implementations can be grouped into the following categories:

primary text conversion, composition and decomposition, filters, and advanced techniques.

Primary text conversion sources emit a stream of atomic (or composite) features that are directly connected to the source text of one or more fields. For example, the values source computes a stream of atomic features for each value from the set of designated fields. This can be used to detect identical field values across the document selector scope. Primary features can be more complex though: the sentences source emits a stream of features where each one corresponds to a single sentence, as determined by Java's built-in sentence boundary iterator. This can be used both to detect occurrences of the same sentence across many documents and to measure how many sentences two documents have in common (and highlight those sentences).

All other categories of feature sources (filters, composition) are used to manipulate or limit the output of primary feature sources in some way. We leave the details to the documentation of each source.

feature​Source:​chunks

Constructs features from randomized sub-ranges of terms called chunks.

{
  "type": "featureSource:chunks",
  "fields": {
    "type": "fields:reference",
    "auto": true
  },
  "minCharacters": 80,
  "modulo": 5
}

A chunk feature source tries to strike a balance between featureSource:sentences and featureSource:ngrams. Unlike sentences, the start and end of a chunk do not depend on punctuation.

  1. The text is broken into tokens.
  2. An integer hash value is computed for each token.
  3. Tokens for which the hash value modulo the provided parameter is zero become tombstones.
  4. Chunks extend from one tombstone (inclusive) to the next (exclusive).
  5. Any chunks shorter than the minCharacters parameter are filtered out.

The resulting features should be more probabilistic and fine-grained than sentences but (since they are not overlapping) their number should be much smaller compared to full n-grams.

Returns a stream of flat atomic chunk features for all fields.
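The steps above can be sketched in a few lines of Python. This is only an illustration of the tombstone idea, not the actual implementation: the tokenizer, the CRC32 hash, the function names and the handling of text before the first tombstone and after the last one are all assumptions made for the sketch.

```python
import re
import zlib


def crc32_hash(token):
    # Assumption: CRC32 is only a stand-in for the real token hash.
    return zlib.crc32(token.encode("utf-8"))


def chunk_features(text, modulo=5, min_characters=80, token_hash=crc32_hash):
    """Split text into chunks delimited by tombstone tokens."""
    tokens = re.findall(r"\w+", text)            # step 1: tokenize
    chunks, current = [], None
    for token in tokens:
        if token_hash(token) % modulo == 0:      # steps 2-3: tombstone test
            if current is not None:
                chunks.append(" ".join(current))
            current = [token]                    # step 4: a tombstone opens a chunk
        elif current is not None:
            current.append(token)
        # Assumption: tokens before the first tombstone are skipped.
    if current is not None:
        # Assumption: the end of the text closes the last chunk.
        chunks.append(" ".join(current))
    # Step 5: drop chunks shorter than the minimum length.
    return [c for c in chunks if len(c) >= min_characters]


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    # Token length is used as a toy hash so the tombstones are easy to follow.
    print(chunk_features(sample, modulo=3, min_characters=10, token_hash=len))
```

With the toy hash, every token whose length is divisible by 3 becomes a tombstone, which makes it easy to trace where chunks start and which ones the length filter removes.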

fields

Type
fields
Default
{
  "type": "fields:reference",
  "auto": true
}
Required
no

Declares one or more fields from which features should be computed.

min​Characters

Type
integer
Default
80
Required
no

Chunks smaller than this minimum number of characters are omitted.

modulo

Type
integer
Default
5
Required
no

Determines the tombstone marker frequency (probabilistic chunk length). A value of 5 means that, on average, every fifth word will be a tombstone. In reality the standard deviation may be high and tombstones may be next to each other or far apart. Statistically this should not affect feature collisions if the repeated text is long enough.

feature​Source:​count

A filtering feature source that accepts another source and filters out feature vectors smaller or larger than the provided thresholds.

{
  "type": "featureSource:count",
  "maxFeatureCount": "unlimited",
  "minFeatureCount": 1,
  "source": {
    "type": "featureSource:reference",
    "auto": true
  }
}

max​Feature​Count

Type
limit
Default
unlimited
Required
no

Maximum count of features for a document. If the number of features returned from the delegate source is larger, an empty feature vector is returned.

min​Feature​Count

Type
integer
Default
1
Required
no

Minimum count of features for a document. If the number of features returned from the delegate source is smaller, an empty feature vector is returned.

source

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

The delegate source of features to be filtered. Note that composite features are not expanded automatically (they count as one). Use feature​Source:​flatten to flatten composites, if needed.
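The filtering rule can be illustrated with a short Python sketch. The function name and the use of tuples to model composite features are assumptions made for this illustration; `None` stands in for the "unlimited" value.

```python
def count_filter(features, min_feature_count=1, max_feature_count=None):
    """Return the features unchanged if their count is within bounds, else an empty list."""
    count = len(features)  # a composite (tuple) counts as a single feature
    if count < min_feature_count:
        return []
    if max_feature_count is not None and count > max_feature_count:
        return []
    return features


if __name__ == "__main__":
    # One composite with three sub-features counts as one feature.
    print(count_filter([("a", "b", "c")], min_feature_count=2))
    # Three atomic features pass the same threshold.
    print(count_filter(["a", "b", "c"], min_feature_count=2))
```

The first call returns an empty list because the composite is not expanded, which is why flattening the source first may be needed.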

feature​Source:​flatten

A filtering feature source that flattens a stream of composite features into a stream of atomic features.

{
  "type": "featureSource:flatten",
  "source": {
    "type": "featureSource:reference",
    "auto": true
  }
}

source

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

The delegate source of composite or atomic features to be flattened.
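As a rough illustration, flattening can be thought of as a depth-first walk over nested features. In the Python sketch below, composites are modeled as tuples and atomic features as strings; both the representation and the function name are assumptions made for the example.

```python
def flatten(features):
    """Yield atomic features depth-first; composites are modeled as tuples."""
    for feature in features:
        if isinstance(feature, tuple):
            yield from flatten(feature)
        else:
            yield feature


if __name__ == "__main__":
    nested = [("a", ("b", "c")), "d"]
    print(list(flatten(nested)))
```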

feature​Source:​group

A filtering feature source that creates a composite feature from a stream of features returned by another source.

{
  "type": "featureSource:group",
  "source": {
    "type": "featureSource:reference",
    "auto": true
  }
}

source

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

The delegate source of composite or atomic features to be grouped into a composite feature.

feature​Source:​labels

A feature source constructing features from document labels.

{
  "type": "featureSource:labels",
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "maxDocFrequency": "unlimited",
  "minDocFrequency": 1
}

Returns a stream of per-field composites (containing atomic label features).

Additional parameters can be used to control the minimum and maximum document frequency of allowed labels. In the example below, we look for document pairs that have between 80% and 99% of their labels in common. Additional restrictions are added to prevent the number of candidate pairs from blowing up: labels must occur at least twice in each document, at least 5 labels must be present in each document and only features with a relatively low collision rate (maxHashGroupSize parameter) will be used to compute candidate pairs. This last parameter restricts the result to duplicates that contain a relatively unique label across the set of all documents.

{
  "comment": [
    "Find document pairs referencing 'solar energy' with 80-99% of indexed labels in common.",
    "Each label must occur at least 2 times in the document and each document must contain at least 5 labels."
  ],
  "components": {
    "featureSources": {
      "type": "featureSource:count",
      "minFeatureCount": 5,
      "source":{
        "type": "featureSource:flatten",
        "source": {
          "type": "featureSource:labels",
          "minDocFrequency": 2,
          "fields": {
            "type": "featureFields:simple",
            "fields": [
              "abstract$phrases"
            ]
          }
        }
      }
    }
  },
  "stages": {
    "pairs": {
      "type": "documentPairs:duplicates",
      "query": {
        "type": "query:string",
        "query": "solar energy"
      },
      "hashGrouping": {
        "pairing": {
          "maxHashGroupSize": 50
        },
        "features": {
          "type": "featureSource:reference",
          "use": "featureSources"
        }
      },
      "validation": {
        "min": 0.8,
        "max": 0.99,
        "pairwiseSimilarity": {
          "type": "pairwiseSimilarity:featureIntersectionMinRatio",
          "features": {
            "type": "featureSource:reference",
            "use": "featureSources"
          }
        }
      },
      "output": {
        "comment": "Limit the output to the top 10 pairs.",
        "explanations": true,
        "limit": 10
      }
    },

    "documents": {
      "type": "documentContent",
      "limit": "unlimited",

      "documents": {
        "type": "documents:fromDocumentPairs",
        "documentPairs": {
          "type": "documentPairs:reference",
          "use": "pairs"
        }
      },

      "fields":{
        "type": "contentFields:simple",
        "fields": {
          "title": {},
          "author_and_inst": {
            "maxValues": 100
          }
        }
      }
    }
  }
}

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

Declares one or more fields from which features should be computed.

max​Doc​Frequency

Type
limit
Default
unlimited
Required
no

Maximum label frequency (within the document). Labels more frequent will be omitted.

min​Doc​Frequency

Type
integer
Default
1
Required
no

Minimum label frequency (within the document). Labels less frequent will be omitted.

feature​Source:​minhash

This feature source takes a set of source features as input and produces a set of derived features representing the minhash vector for this set.

{
  "type": "featureSource:minhash",
  "functionCount": 128,
  "source": {
    "type": "featureSource:reference",
    "auto": true
  }
}

Minhashing is a locality-sensitive hashing scheme. Minhashes computed for two sets of features should contain identical elements only if the two source sets of features contained identical elements.

This is a rather advanced feature source and is, typically, not the most intuitive (since the explanation of pairwise similarity will contain the number of minhash vectors in common, not the original features). Minhashes should be applied for large inputs, when the number of features for each document is large.

Returns a stream of exactly functionCount atomic features, each representing a minhash of the source, derived from a different hash function.
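A minimal Python sketch of the general minhash technique is shown below. It is not the Lingo4G implementation: the BLAKE2-based seeded hash, the function names and the empty-input behavior are assumptions chosen to keep the example self-contained.

```python
import hashlib


def seeded_hash(feature, seed):
    # Assumption: a seeded BLAKE2 digest stands in for a family of hash functions.
    data = f"{seed}:{feature}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(features, function_count=128):
    """Return one minimum hash value per hash function."""
    distinct = set(features)
    if not distinct:
        return []  # assumption: no features in, no minhashes out
    return [
        min(seeded_hash(feature, seed) for feature in distinct)
        for seed in range(function_count)
    ]


def matching_positions(minhashes_a, minhashes_b):
    """Count positions at which two minhash vectors agree."""
    return sum(1 for a, b in zip(minhashes_a, minhashes_b) if a == b)


if __name__ == "__main__":
    doc_a = minhash(["solar", "energy", "panel", "cost"])
    doc_b = minhash(["solar", "energy", "panel", "price"])
    print(matching_positions(doc_a, doc_b), "of", len(doc_a), "minhashes in common")
```

In the general technique, the fraction of matching positions estimates the Jaccard similarity of the two source sets, which is why the explanation of a pairwise similarity refers to minhashes in common rather than to the original features.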

function​Count

Type
integer
Default
128
Required
no

The number of different hash functions (minhash vectors) to produce.

source

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

The source of features from which minhashes should be computed. Top-level features are used (composites are not flattened).

feature​Source:​ngrams

Constructs composite features using a rolling window over a stream of features from another source.

{
  "type": "featureSource:ngrams",
  "source": {
    "type": "featureSource:reference",
    "auto": true
  },
  "window": 10
}

N-gram features are useful when one is looking for unique features representing a sub-sequence of something else. For example, a stream of word features could be broken down into 3-grams (triplets) that are far more unique than individual words.

Composite features will overlap. A stream of N source features and a window size W will result in at most (N - W + 1) composite features in the output.

Returns a stream of flat composite features of another source.
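The rolling window can be sketched in Python as follows. Composites are modeled as tuples, and the helper names are assumptions made for the illustration. This is a plain sliding window; the exact treatment of window boundaries in the real implementation is not confirmed here.

```python
def group(features):
    """Wrap a stream of features into a single composite (modeled as a tuple)."""
    return [tuple(features)]


def ngrams(composites, window=10):
    """Slide a window over the sub-features of each composite."""
    result = []
    for composite in composites:
        for start in range(len(composite) - window + 1):
            result.append(tuple(composite[start:start + window]))
    return result


if __name__ == "__main__":
    words = "to be or not to be".split()
    # Words are atomic features, so they are grouped into a composite first.
    for trigram in ngrams(group(words), window=3):
        print(trigram)
```

Note that the input is grouped before the window is applied, mirroring the requirement that the source of n-grams returns composite features.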

source

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

The source of features from which composite n-grams should be computed. The source must return composite features as well (n-grams are computed for each composite feature's sub-features). If the source does not return composite features, use feature​Source:​group to create a composite synthetically.

window

Type
integer
Default
10
Required
no

The length of the n-gram window (number of sub-features combined into a single composite).

feature​Source:​sentences

A feature source constructing features from punctuation-demarcated sentences in the input text.

{
  "type": "featureSource:sentences",
  "fields": {
    "type": "fields:reference",
    "auto": true
  },
  "minCharacters": 40
}

The text is broken down into sentences using Java's built-in Unicode rules (BreakIterator class).

Returns a stream of flat atomic sentence features.

This feature source is very useful for providing fast and relatively unique features. It can be used as a source of hash collisions if there is a justifiable assumption that similar document pairs have at least one sentence in common. The number of identical sentences can be a good similarity and validation condition in many contexts as well.

In the example below, we use sentences as a fast hash collision source, with the final validation condition being a much more costly text overlap similarity. The request ends by computing text overlaps so that repeated text fragments are easier to spot.

{
  "comment": [
    "Find documents with a repeated text overlap ratio between 50 and 90% and create text overlap analysis."
  ],
  "variables": {
    "fieldsToCompare": {
      "value": [
        "abstract"
      ]
    }
  },
  "components": {
    "fields": {
      "type": "fields:simple",
      "fields": {
        "@var": "fieldsToCompare"
      }
    },
    "documentOverlapSimilarity": {
      "type": "pairwiseSimilarity:documentOverlapRatio",
      "fields": {
        "type": "fields:reference",
        "use": "fields"
      },
      "allowedGapRatio": 0,
      "ngramWindow": 5
    }
  },
  "stages": {
    "pairs": {
      "type": "documentPairs:duplicates",
      "query": {
        "type": "query:string",
        "query": "solar energy"
      },
      "hashGrouping": {
        "features": {
          "type": "featureSource:unique",
          "source": {
            "type": "featureSource:sentences",
            "minCharacters": 80,
            "fields": {
              "type": "fields:reference",
              "use": "fields"
            }
          }
        }
      },
      "validation": {
        "min": 0.5,
        "max": 0.9,
        "pairwiseSimilarity": {
          "type": "pairwiseSimilarity:reference",
          "use": "documentOverlapSimilarity"
        }
      },
      "output": {
        "comment": "Limit the output to the top 10 pairs.",
        "explanations": true,
        "limit": 10
      }
    },

    "documents": {
      "type": "documentContent",
      "limit": "unlimited",

      "documents": {
        "type": "documents:fromDocumentPairs",
        "documentPairs": {
          "type": "documentPairs:reference",
          "use": "pairs"
        }
      },

      "fields":{
        "type": "contentFields:simple",
        "fields": {
          "title": {},
          "author_and_inst": {
            "maxValues": 100
          }
        }
      }
    },

    "overlaps": {
      "type": "documentOverlap",

      "documentPairs": {
        "type": "documentPairs:reference",
        "use": "pairs"
      },

      "pairwiseSimilarity": {
        "type": "pairwiseSimilarity:reference",
        "use": "documentOverlapSimilarity"
      },

      "alignedFragments": {
        "contextChars": 80,
        "maxFragments": 10,
        "fields": {
          "type": "contentFields:grouped",
          "groups": [
            {
              "fields": {
                "@var": "fieldsToCompare"
              },
              "config": {
                "maxValueLength": 3000
              }
            }
          ]
        }
      },

      "fragmentsInFields": {
        "contextChars": 600,
        "fields": {
          "type": "contentFields:grouped",
          "groups": [
            {
              "fields": {
                "@var": "fieldsToCompare"
              },
              "config": {
                "maxValues": 10,
                "maxValueLength": 3000
              }
            }
          ]
        }
      }
    }
  }
}

fields

Type
fields
Default
{
  "type": "fields:reference",
  "auto": true
}
Required
no

Declares one or more fields from which features should be computed.

min​Characters

Type
integer
Default
40
Required
no

Sentences shorter than this threshold will be omitted. This can be used to omit very short sentences, which are likely not unique enough to be good indicators of similarity.

feature​Source:​simhash

A feature source that computes simhashes from a set of other features (typically minhashes).

{
  "type": "featureSource:simhash",
  "source": {
    "type": "featureSource:reference",
    "auto": true
  }
}

Simhashes aggregate several bitfields of the same length into a single value that can be compared using Hamming distance. In Lingo4G, simhashes are computed over hash values of features read from another feature source.

Simhashes can be very useful to speed up computations when values to be discovered are identical or nearly identical. For example, they are frequently used for detecting near-duplicates (minor edits or changes in otherwise longer documents). When using simhashes, make sure to increase max​Hash​Bits​Different from the default value so that hashes with a Hamming difference larger than 1 can be considered a hash collision.
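To make the aggregation step more concrete, here is a small Python sketch of the textbook simhash construction: each bit of the result is decided by a majority vote over the corresponding bits of the input hashes. The function names and the 64-bit width are assumptions, and the real implementation may differ in details.

```python
def simhash(hashes, bits=64):
    """Combine equal-length bit fields into one value by per-bit majority vote."""
    votes = [0] * bits
    for value in hashes:
        for bit in range(bits):
            votes[bit] += 1 if (value >> bit) & 1 else -1
    result = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            result |= 1 << bit
    return result


def hamming_distance(a, b):
    """Number of bit positions at which two hashes differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = simhash([0b1100, 0b1010, 0b1000], bits=4)
    edited = simhash([0b1100, 0b1010, 0b1001], bits=4)
    print(format(original, "04b"), format(edited, "04b"),
          hamming_distance(original, edited))
```

Because a small change in the input hashes flips only a few votes, near-identical inputs tend to produce simhashes that differ in very few bits, which is what a Hamming distance threshold exploits.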

The example below efficiently computes pairs of documents with a near-identical abstract field. It uses a simhash of minhashes of all sentences from that field.

{
  "comment": [
    "Finds pairs of documents with similar but not identical content in the 'abstract' field.",
    "You can use requests of this type to identify potentially plagiarized content, where large",
    " parts of text are identical or nearly identical."
  ],
  "variables": {
    "fieldsToCompare": {
      "name": "Fields to compare",
      "comment": "Fields to check for duplicated content.",
      "value": [
        "abstract"
      ]
    },
    "scopeQuery": {
      "name": "Duplicate search scope query",
      "comment": "Determines the set of documents to search for duplicated content.",
      "value": "created:[2015 TO 2017]"
    },
    "scopeMaxDocuments": {
      "name": "Duplicate search scope size",
      "comment": "Determines how many of the in-scope documents to search for duplicated content.",
      "value": "unlimited"
    }
  },
  "components": {
    "fields": {
      "type": "fields:simple",
      "fields": {
        "@var": "fieldsToCompare"
      }
    }
  },
  "stages": {
    "duplicates": {
      "type": "documentPairs:duplicates",
      "query": {
        "type": "query:string",
        "query": {
          "@var": "scopeQuery"
        }
      },
      "hashGrouping": {
        "pairing": {
          "comment": "Maximum Hamming distance between final hashes that cause a conflict (candidate pair).",
          "maxHashBitsDifferent": 2,
          "maxHashGroupSize": 500
        },
        "features": {
          "comment": "Use a simhash of minhashes of sentences in the document.",
          "type": "featureSource:simhash",
          "source": {
            "type": "featureSource:minhash",
            "source": {
              "type": "featureSource:sentences",
              "fields":{
                "type": "fields:reference",
                "use": "fields"
              }
            }
          }
        }
      },
      "validation": {
        "comment": "We want near-duplicates, so exclude 1 from maximum.",
        "min": 0.9,
        "max": 0.999,
        "pairwiseSimilarity": {
          "type": "pairwiseSimilarity:documentOverlapRatio",
          "fields": {
            "type": "fields:reference",
            "use": "fields"
          },
          "allowedGapRatio": 0,
          "ngramWindow": 5
        }
      },
      "output": {
        "explanations": true,
        "limit": 10
      }
    },
    "content": {
      "type": "documentContent",
      "documents": {
        "type": "documents:fromDocumentPairs",
        "documentPairs": {
          "type": "documentPairs:reference",
          "use": "duplicates"
        }
      },
      "fields": {
        "type": "contentFields:grouped",
        "groups": [
          {
            "fields": {
              "@var": "fieldsToCompare"
            },
            "config": {
              "maxValues": 1,
              "maxValueLength": 500
            }
          }
        ]
      }
    }
  }
}

source

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

The source of features whose hash values are used as bit fields for the computation of simhash features. Typically, the source will use minhashes computed from yet other features.

feature​Source:​unique

A filtering feature source leaving only unique features from the source.

{
  "type": "featureSource:unique",
  "source": {
    "type": "featureSource:reference",
    "auto": true
  }
}

This feature source can be used to leave only a set of unique features if their order and number do not play a key role. For example, we could use it to compute a set of unique word features in a document.
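As a tiny illustration, the filter behaves like a deduplication step. The Python sketch below keeps the first occurrence of each feature; preserving that order is an assumption of the sketch, not a documented guarantee.

```python
def unique(features):
    """Keep only the first occurrence of each feature."""
    return list(dict.fromkeys(features))


if __name__ == "__main__":
    words = "to be or not to be".split()
    print(unique(words))
```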

source

Type
featureSource
Default
{
  "type": "featureSource:reference",
  "auto": true
}
Required
no

The delegate source of features from which only unique features are retained.

feature​Source:​values

Constructs features from entire values in one or more fields.

{
  "type": "featureSource:values",
  "fields": {
    "type": "fields:reference",
    "auto": true
  }
}

This type of feature source is useful when looking for collisions (or similarity) over entire field values. For example, consider this request, which uses the duplicate detection stage to find pairs of arXiv documents published between 2015 and 2017 that have identical titles:

{
  "comment": "Find document pairs which share the same title.",
  "components": {
    "featureSources": {
      "type": "featureSource:values",
      "fields": {
        "type": "fields:simple",
        "fields": [
          "title"
        ]
      }
    }
  },
  "stages": {
    "pairs": {
      "type": "documentPairs:duplicates",
      "query": {
        "type": "query:string",
        "query": "created:[2015 TO 2017]"
      },
      "hashGrouping": {
        "features": {
          "type": "featureSource:reference",
          "use": "featureSources"
        }
      },
      "validation": {
        "min": 1,
        "pairwiseSimilarity": {
          "type": "pairwiseSimilarity:featureIntersectionSize",
          "features": {
            "type": "featureSource:reference",
            "use": "featureSources"
          }
        }
      },
      "output": {
        "explanations": true,
        "comment": "We just need a few example pairs, not all of them.",
        "limit": 5
      }
    }
  }
}

Note how the featureSources component is configured to emit features from entire values of the title field. This component is then referenced from both the hash grouping phase and the validation phase of the duplicate detection stage to compute pairs with a similarity of at least 1 (at least one value in common, because pairwiseSimilarity:featureIntersectionSize is used to compute the similarity).
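The validation step can be pictured with a short Python sketch: the similarity is the number of distinct features two documents share, and a pair passes when that number falls within the configured thresholds. The function names and the sample titles are made up for illustration.

```python
def feature_intersection_size(features_a, features_b):
    """Number of distinct features shared by two documents."""
    return float(len(set(features_a) & set(features_b)))


def passes_validation(similarity, minimum=1.0, maximum=float("inf")):
    """Check a similarity value against inclusive thresholds."""
    return minimum <= similarity <= maximum


if __name__ == "__main__":
    # Hypothetical title values of three documents.
    doc_1 = ["A survey of solar cells"]
    doc_2 = ["A survey of solar cells"]
    doc_3 = ["Wind turbine design"]
    print(passes_validation(feature_intersection_size(doc_1, doc_2)))
    print(passes_validation(feature_intersection_size(doc_1, doc_3)))
```

With a minimum of 1 and no maximum, any pair sharing at least one entire field value passes.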

The output is limited to at most five pairs, but more details regarding similarity computation are available in the diagnostics section of the response, as shown below:

{
  "result" : {
    "pairs" : {
      "diagnostics" : {
        "Computing feature hashes" : {
          "Features" : "entire values from field: title",
          "Documents" : "50,434",
          "Hash count" : "50,434"
        },
        "Computing hash collisions" : {
          "Hash count" : "50,434",
          "Band count" : "1",
          "Hash batches" : "6"
        },
        "Preparing hash groups" : { },
        "Scanning hash groups" : { },
        "Aggregating results" : {
          "Hash collision groups" : "7",
          "Skipped large hash collision groups" : "0",
          "Pairwise hash comparisons" : "9",
          "Candidate pairs" : "9",
          "Out of scope pairs" : "0"
        },
        "Validating pairs using pairwiseSimilarity:featureIntersectionSize [entire values from field: title]" : {
          "Pairwise similarity function" : "pairwiseSimilarity:featureIntersectionSize [entire values from field: title]",
          "Similarity thresholds" : "[1.0, ∞]",
          "Pairs passing" : "9"
        },
        "Preparing output" : {
          "Output limit" : "5"
        }
      },
      "similarityDistribution" : {
        "[1.00, 1.00]" : 9
      },
      "matches" : {
        "value" : 9,
        "relation" : "EXACT"
      },
      "pairs" : [
        {
          "pair" : [
            7899,
            341930
          ],
          "similarity" : 1.0,
          "explanation" : "Documents share 1 distinct feature (entire values from field: title)."
        },
        {
          "pair" : [
            236736,
            271696
          ],
          "similarity" : 1.0,
          "explanation" : "Documents share 1 distinct feature (entire values from field: title)."
        },
        {
          "pair" : [
            240752,
            315839
          ],
          "similarity" : 1.0,
          "explanation" : "Documents share 1 distinct feature (entire values from field: title)."
        },
        {
          "pair" : [
            270337,
            359781
          ],
          "similarity" : 1.0,
          "explanation" : "Documents share 1 distinct feature (entire values from field: title)."
        },
        {
          "pair" : [
            270337,
            459282
          ],
          "similarity" : 1.0,
          "explanation" : "Documents share 1 distinct feature (entire values from field: title)."
        }
      ]
    }
  }
}

This feature source is also useful to detect repeated subsets of values in multi-valued fields. This can be achieved using field-value features and featureIntersectionSize pairwise document similarity. Consider this example, which looks for pairs of arXiv documents mentioning solar energy that have a repeated subset of 10 to 20 identical authors (yes, it is quite ridiculous):

{
  "comment": "Find document pairs mentioning 'solar energy' with 10 to 20 identical authors.",
  "components": {
    "featureSources": {
      "type": "featureSource:values",
      "fields": {
        "type": "fields:simple",
        "fields": [
          "author_and_inst"
        ]
      }
    }
  },
  "stages": {
    "pairs": {
      "type": "documentPairs:duplicates",
      "query": {
        "type": "query:string",
        "query": "solar energy"
      },
      "hashGrouping": {
        "features": {
          "type": "featureSource:reference",
          "use": "featureSources"
        }
      },
      "validation": {
        "min": 10,
        "max": 20,
        "pairwiseSimilarity": {
          "type": "pairwiseSimilarity:featureIntersectionSize",
          "features": {
            "type": "featureSource:reference",
            "use": "featureSources"
          }
        }
      },
      "output": {
        "comment": "We just need one example pair, not all of them.",
        "explanations": true,
        "limit": 1
      }
    },

    "documents": {
      "type": "documentContent",
      "limit": "unlimited",

      "documents": {
        "type": "documents:fromDocumentPairs",
        "documentPairs": {
          "type": "documentPairs:reference",
          "use": "pairs"
        }
      },

      "fields":{
        "type": "contentFields:simple",
        "fields": {
          "title": {},
          "author_and_inst": {
            "maxValues": 100
          }
        }
      }
    }
  }
}

As surprising as it may sound, there are dozens of documents returned for this request. Here is the diagnostics section and a sample pair of documents:

{
  "result" : {
    "pairs" : {
      "diagnostics" : {
        "Computing feature hashes" : {
          "Features" : "entire values from field: author_and_inst",
          "Documents" : "2,192",
          "Hash count" : "16,620"
        },
        "Computing hash collisions" : {
          "Hash count" : "16,620",
          "Band count" : "1",
          "Hash batches" : "2"
        },
        "Preparing hash groups" : { },
        "Scanning hash groups" : { },
        "Aggregating results" : {
          "Hash collision groups" : "1,725",
          "Skipped large hash collision groups" : "0",
          "Pairwise hash comparisons" : "5,013",
          "Candidate pairs" : "2,331",
          "Out of scope pairs" : "0"
        },
        "Validating pairs using pairwiseSimilarity:featureIntersectionSize [entire values from field: author_and_inst]" : {
          "Pairwise similarity function" : "pairwiseSimilarity:featureIntersectionSize [entire values from field: author_and_inst]",
          "Similarity thresholds" : "[10.0, 20.0]",
          "Pairs passing" : "20"
        },
        "Preparing output" : {
          "Output limit" : "1"
        }
      },
      "similarityDistribution" : {
        "[10.00, 11.00)" : 4,
        "[12.00, 13.00)" : 1,
        "[13.00, 14.00)" : 4,
        "[14.00, 15.00)" : 6,
        "[16.00, 17.00)" : 1,
        "[17.00, 18.00)" : 2,
        "[19.00, 20.00]" : 2
      },
      "matches" : {
        "value" : 20,
        "relation" : "EXACT"
      },
      "pairs" : [
        {
          "pair" : [
            256985,
            340697
          ],
          "similarity" : 20.0,
          "explanation" : "Documents share 20 distinct features (entire values from field: author_and_inst)."
        }
      ]
    },
    "documents" : {
      "documents" : [
        {
          "id" : 256985,
          "fields" : {
            "title" : {
              "values" : [
                "Helium fluxes measured by the PAMELA experiment from the minimum to the maximum solar activity for solar cycle 24"
              ]
            },
            "author_and_inst" : {
              "values" : [
                "Marcelli, N. → (unassigned)",
                "Boezio, M. → (unassigned)",
                "Lenni, A. → (unassigned)",
                "Menn, W. → (unassigned)",
                "Munini, R. → (unassigned)",
                "Aslam, O. P. M. → (unassigned)",
                "Bisschoff, D. → (unassigned)",
                "Ngobeni, M. D. → (unassigned)",
                "Potgieter, M. S. → (unassigned)",
                "Adriani, O. → (unassigned)",
                "Barbarino, G. C. → (unassigned)",
                "Bazilevskaya, G. A. → (unassigned)",
                "Bellotti, R. → (unassigned)",
                "Bogomolov, E. A. → (unassigned)",
                "Bongi, M. → (unassigned)",
                "Bonvicini, V. → (unassigned)",
                "Bruno, A. → (unassigned)",
                "Cafagna, F. → (unassigned)",
                "Campana, D. → (unassigned)",
                "Carlson, P. → (unassigned)",
                "Casolino, M. → (unassigned)",
                "Castellini, G. → (unassigned)",
                "De Santis, C. → (unassigned)",
                "Galper, A. M. → (unassigned)",
                "Koldashov, S. V. → (unassigned)",
                "Koldobskiy, S. → (unassigned)",
                "Kvashnin, A. N. → (unassigned)",
                "Leonov, A. A. → (unassigned)",
                "Malakhov, V. V. → (unassigned)",
                "Marcelli, L. → (unassigned)",
                "Martucci, M. → (unassigned)",
                "Mayorov, A. G. → (unassigned)",
                "Merge, M. → (unassigned)",
                "Mocchiutti, E. → (unassigned)",
                "Monaco, A. → (unassigned)",
                "Mori, N. → (unassigned)",
                "Mikhailov, V. V. → (unassigned)",
                "Osteria, G. → (unassigned)",
                "Panico, B. → (unassigned)",
                "Papini, P. → (unassigned)",
                "Pearce, M. → (unassigned)",
                "Picozza, P. → (unassigned)",
                "Ricci, M. → (unassigned)",
                "Ricciarini, S. B. → (unassigned)",
                "Simon, M. → (unassigned)",
                "Sotgiu, A. → (unassigned)",
                "Sparvoli, R. → (unassigned)",
                "Spillantini, P. → (unassigned)",
                "Stozhkov, Y. I. → (unassigned)",
                "Vacchi, A. → (unassigned)",
                "Vannuccini, E. → (unassigned)",
                "Vasilyev, G. I. → (unassigned)",
                "Voronov, S. A. → (unassigned)",
                "Yurkin, Y. T. → (unassigned)",
                "Zampa, G. → (unassigned)",
                "Zampa, N. → (unassigned)"
              ]
            }
          }
        },
        {
          "id" : 340697,
          "fields" : {
            "title" : {
              "values" : [
                "GAMMA-400 gamma-ray observatory"
              ]
            },
            "author_and_inst" : {
              "values" : [
                "Topchiev, N. P. → (unassigned)",
                "Galper, A. M. → (unassigned)",
                "Bonvicini, V. → (unassigned)",
                "Adriani, O. → (unassigned)",
                "Aptekar, R. L. → (unassigned)",
                "Arkhangelskaja, I. V. → (unassigned)",
                "Arkhangelskiy, A. I. → (unassigned)",
                "Bakaldin, A. V. → (unassigned)",
                "Bergstrom, L. → (unassigned)",
                "Berti, E. → (unassigned)",
                "Bigongiari, G. → (unassigned)",
                "Bobkov, S. G. → (unassigned)",
                "Boezio, M. → (unassigned)",
                "Bogomolov, E. A. → (unassigned)",
                "Bonechi, L. → (unassigned)",
                "Bongi, M. → (unassigned)",
                "Bottai, S. → (unassigned)",
                "Castellini, G. → (unassigned)",
                "Cattaneo, P. W. → (unassigned)",
                "Cumani, P. → (unassigned)",
                "Dalkarov, O. D. → (unassigned)",
                "Dedenko, G. L. → (unassigned)",
                "De Donato, C. → (unassigned)",
                "Dogiel, V. A. → (unassigned)",
                "Finetti, N. → (unassigned)",
                "Gascon, D. → (unassigned)",
                "Gorbunov, M. S. → (unassigned)",
                "Gusakov, Yu. V. → (unassigned)",
                "Hnatyk, B. I. → (unassigned)",
                "Kadilin, V. V. → (unassigned)",
                "Kaplin, V. A. → (unassigned)",
                "Kaplun, A. A. → (unassigned)",
                "Kheymits, M. D. → (unassigned)",
                "Korepanov, V. E. → (unassigned)",
                "Larsson, J. → (unassigned)",
                "Leonov, A. A. → (unassigned)",
                "Loginov, V. A. → (unassigned)",
                "Longo, F. → (unassigned)",
                "Maestro, P. → (unassigned)",
                "Marrocchesi, P. S. → (unassigned)",
                "Martinez, M. → (unassigned)",
                "Menshenin, A. L. → (unassigned)",
                "Mikhailov, V. V. → (unassigned)",
                "Mocchiutti, E. → (unassigned)",
                "Moiseev, A. A. → (unassigned)",
                "Mori, N. → (unassigned)",
                "Moskalenko, I. V. → (unassigned)",
                "Naumov, P. Yu. → (unassigned)",
                "Papini, P. → (unassigned)",
                "Paredes, J. M. → (unassigned)",
                "Pearce, M. → (unassigned)",
                "Picozza, P. → (unassigned)",
                "Rappoldi, A. → (unassigned)",
                "Ricciarini, S. → (unassigned)",
                "Runtso, M. F. → (unassigned)",
                "Ryde, F. → (unassigned)",
                "Serdin, O. V. → (unassigned)",
                "Sparvoli, R. → (unassigned)",
                "Spillantini, P. → (unassigned)",
                "Stozhkov, Yu. I. → (unassigned)",
                "Suchkov, S. I. → (unassigned)",
                "Taraskin, A. A. → (unassigned)",
                "Tavani, M. → (unassigned)",
                "Tiberio, A. → (unassigned)",
                "Tyurin, E. M. → (unassigned)",
                "Ulanov, M. V. → (unassigned)",
                "Vacchi, A. → (unassigned)",
                "Vannuccini, E. → (unassigned)",
                "Vasilyev, G. I. → (unassigned)",
                "Ward, J. E. → (unassigned)",
                "Yurkin, Yu. T. → (unassigned)",
                "Zampa, N. → (unassigned)",
                "Zirakashvili, V. N. → (unassigned)",
                "Zverev, V. G. → (unassigned)"
              ]
            }
          }
        }
      ]
    }
  }
}

fields

Type
fields
Default
{
  "type": "fields:reference",
  "auto": true
}
Required
no

Declares one or more fields to be used to compute features. Each unique field value is translated into one feature.
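
As a minimal sketch, the fragment below points feature​Source:​values at a single named field instead of relying on the default field reference. It assumes the project contains a field named author_and_inst, as in the response above, and that fields:simple accepts a list of field names in the same way as in the feature​Source:​words example in this section.

```json
{
  "type": "featureSource:values",
  "fields": {
    "type": "fields:simple",
    "fields": [
      "author_and_inst"
    ]
  }
}
```

Under these assumptions, each distinct value of author_and_inst, such as "Galper, A. M. → (unassigned)", would become one feature.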

feature​Source:​words

A feature source constructing features from individual words in the input text.

{
  "type": "featureSource:words",
  "fields": {
    "type": "fields:reference",
    "auto": true
  }
}

Returns a stream of per-field composites (containing atomic word features).

The example below computes, for all documents written by Robert Williams (the name is picked arbitrarily), the ratio of words that each pair of documents has in common, and returns the 10 top-scoring pairs. Words serve as hash collision features here because the number of documents in scope is known to be relatively small. For queries that match more documents, limit the number of collision pairs by choosing a more selective hash feature source.

{
  "comment": [
    "Find word overlap among pairs of documents authored by Robert Williams."
  ],
  "components": {
    "featureSources": {
      "type": "featureSource:unique",
      "source": {
        "type": "featureSource:flatten",
        "source": {
          "type": "featureSource:words",
          "fields": {
            "type": "fields:simple",
            "fields": [
              "abstract"
            ]
          }
        }
      }
    }
  },
  "stages": {
    "pairs": {
      "type": "documentPairs:duplicates",
      "query": {
        "type": "query:string",
        "query": "author_name:\"Williams, Robert\""
      },
      "hashGrouping": {
        "features": {
          "type": "featureSource:reference",
          "use": "featureSources"
        }
      },
      "validation": {
        "comment": "We are interested in validation scores of all document pairs.",
        "min": 0,
        "max": 1,
        "pairwiseSimilarity": {
          "type": "pairwiseSimilarity:featureIntersectionMinRatio",
          "features": {
            "type": "featureSource:reference",
            "use": "featureSources"
          }
        }
      },
      "output": {
        "comment": "Limit the output to the top 10 pairs.",
        "explanations": true,
        "limit": 10
      }
    },

    "documents": {
      "type": "documentContent",
      "limit": "unlimited",

      "documents": {
        "type": "documents:fromDocumentPairs",
        "documentPairs": {
          "type": "documentPairs:reference",
          "use": "pairs"
        }
      },

      "fields": {
        "type": "contentFields:simple",
        "fields": {
          "title": {},
          "author_and_inst": {
            "maxValues": 100
          }
        }
      }
    }
  }
}

fields

Type
fields
Default
{
  "type": "fields:reference",
  "auto": true
}
Required
no

Declares one or more fields from which features should be computed.
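
The fields property can also name the source fields explicitly. The sketch below assumes a project with title and abstract fields (both names are borrowed from the examples in this section) and that fields:simple accepts more than one name in its fields array.

```json
{
  "type": "featureSource:words",
  "fields": {
    "type": "fields:simple",
    "fields": [
      "title",
      "abstract"
    ]
  }
}
```

Because feature​Source:​words returns per-field composites, a consumer that expects a single stream of word features may need this source wrapped in feature​Source:​flatten, as the request above does.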

Consumers of feature​Source:​*

The following stages and components take feature​Source:​* as input:

Stage or component                                    Property
document​Pairs:​duplicates                              features
feature​Source:​count                                   source
feature​Source:​flatten                                 source
feature​Source:​group                                   source
feature​Source:​minhash                                 source
feature​Source:​ngrams                                  source
feature​Source:​simhash                                 source
feature​Source:​unique                                  source
pairwise​Similarity:​feature​Intersection​Min​Ratio        features
pairwise​Similarity:​feature​Intersection​Size            features
pairwise​Similarity:​feature​Intersection​To​Union​Ratio    features
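
To illustrate the consumers listed above, here is a hedged sketch of a component that receives a feature source through its features property. It reuses the feature​Source:​unique, feature​Source:​flatten and feature​Source:​words chain from the feature​Source:​words example, but places it inline instead of going through feature​Source:​reference. Two things are assumptions here: that the features property accepts an inline definition, and that pairwise​Similarity:​feature​Intersection​Size needs no other properties, since this section does not document the rest of that component.

```json
{
  "type": "pairwiseSimilarity:featureIntersectionSize",
  "features": {
    "type": "featureSource:unique",
    "source": {
      "type": "featureSource:flatten",
      "source": {
        "type": "featureSource:words",
        "fields": {
          "type": "fields:simple",
          "fields": [
            "abstract"
          ]
        }
      }
    }
  }
}
```

Each nested source property corresponds to one row of the list above: feature​Source:​unique consumes feature​Source:​flatten, which in turn consumes feature​Source:​words.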