documents
The documents:*
stages group various ways of producing lists of documents. You can feed the results of
the documents stage as input to another stage, such as
document content retrieval, collecting
labels from documents
or creating a matrix of
similarities between documents.
You can use the following documents stages in your analysis requests:
-
documents:byId
-
Selects documents by their internal identifiers.
-
documents:byQuery
-
Selects documents matching the query you provide.
-
documents:byWeight
-
Given a list of documents, filters out documents with weights smaller than the threshold you provide.
-
documents:byWeightMass
-
Given a list of documents, selects the top scoring documents that account for the specified percentage of the total weight of the documents.
-
documents:composite
-
Takes a union or intersection of the document lists you provide.
-
documents:contrastScore
-
Computes contrast score for the documents you provide. For certain collections, contrast score may reveal documents that introduce novel concepts, not seen in preceding documents.
-
documents:embeddingNearestNeighbors
-
Selects documents that are most similar to the multidimensional vector you provide.
-
documents:fromClusterExemplars
-
Collects exemplars of the document clusters you provide into a flat document list.
-
documents:fromClusterMembers
-
Collects members of the document clusters you provide into a flat document list.
-
documents:fromDocumentPairs
-
Converts the list of document pairs you provide into a flat list of unique documents occurring in at least one input pair.
-
documents:fromMatrixColumns
-
Given a matrix with columns corresponding to documents, selects the top-scoring columns and returns them as a document list.
-
documents:rwmd
-
Computes an approximation of the Relaxed Word Mover's Distance between the documents and labels you provide.
-
documents:sample
-
Applies random sampling to the documents you provide.
-
documents:scored
-
Computes new weights for the set of documents you provide based on the scoring component of your choice.
-
documents:vectorFieldNearestNeighbors
-
Selects documents which are most similar to the vector you provide (using externally provided vector field).
-
documents:reference
-
References the results of another
documents:*
stage defined in the request.
The JSON output of a documents stage has the following outline structure:
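As an illustrative sketch, the output of a documents:byQuery stage could look as follows. The identifiers, weights and match count are made up; only the property names come from the descriptions in this document.

```json
{
  "matches": {
    "value": 1204,
    "relation": "GREATER_OR_EQUAL"
  },
  "documents": [
    { "id": 236661, "weight": 12.74 },
    { "id": 98017, "weight": 11.92 }
  ]
}
```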
The documents
array is mandatory for all implementations and consists of objects with the following
fields:
id
- Internal identifier of the document. Internal document identifiers may change between subsequent commits and reindexing runs.
weight
- Weight of the document. The semantics of the weight depends on the specific stage.
Different implementations may return additional properties. Here, the
documents:byQuery
stage returned an additional
matches
element which contains the number of matching documents and whether this number is approximate or exact.
documents:byId
Returns documents matching the provided internal identifiers. This component can be helpful for debugging or when internal identifiers are returned from another source. Note that internal document identifiers need not be contiguous and can change after document or label reindexing.
{
"type": "documents:byId",
"documents": []
}
documents
An array of objects, each with an id
property pointing at an internal document identifier. For
example:
Identifiers corresponding to non-existing or deleted documents will cause an error.
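A minimal sketch of such a stage is shown below. The identifier values 1024 and 2048 are made up and would have to exist in your index for the stage to succeed.

```json
{
  "type": "documents:byId",
  "documents": [
    { "id": 1024 },
    { "id": 2048 }
  ]
}
```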
documents:byQuery
Returns documents matching the provided query.
{
"type": "documents:byQuery",
"accurateHitCount": false,
"limit": 10000,
"query": null,
"requireScores": true
}
Use the documents:byQuery
stage to select documents matching a query provided by the user. One common
query type is query:string
, which parses the Apache Lucene query syntax.
documents:byQuery
returns the following JSON structure:
matches
-
Information about the total number of documents matching the query.
value
-
Total number of documents matching the query. Note that this number may be approximate and larger than the
number of documents actually returned in the
documents
array.
relation
-
Indicates whether the total number of documents is exact or an approximation.
EXACT
-
value
contains the exact number of matches.
GREATER_OR_EQUAL
-
value
is a lower bound on the number of matches. To force Lingo4G to compute the exact number of matches, set the accurateHitCount
property to true
.
documents
-
Array of selected documents. The
weight
property contains the search score returned by Lucene for the specific document. The length of the array is not greater than limit
.
accurateHitCount
If true
, the returned number of matching documents is guaranteed to be accurate, otherwise it may
be an approximation. Accurate results are typically more costly to compute.
Here is an example stage requesting approximate total:
The output for the above stage is:
Compare the above to the result below, when an accurate hit count is requested:
limit
The maximum number of documents to select.
Value must be an integer >= 0 or the string unlimited
, in which case the stage returns all
matches.
If limit is smaller than the number of documents matching the query, the stage returns the top-scoring documents.
If limit is zero, Lingo4G computes the number of documents matching the query and returns the result in the
matches
section of the response, leaving the
documents
array empty. Counting the number of matches is often faster than selecting the identifiers of matching
documents, so if you only want to
count query matches, set limit to 0.
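A count-only stage could be sketched as follows. The `query` property name inside query:string and the `title` field are assumptions made for illustration; adjust them to your index.

```json
{
  "type": "documents:byQuery",
  "query": {
    "type": "query:string",
    "query": "title:clustering"
  },
  "limit": 0,
  "accurateHitCount": true
}
```

With these settings the documents array stays empty and the matches section carries the exact count.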
query
The query to execute and retrieve matching documents for. The following
example uses the
query:string
component:
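A minimal sketch, assuming that query:string accepts the Lucene query text in a property named `query` and that the index has an `abstract` field (both are illustrative assumptions):

```json
{
  "type": "documents:byQuery",
  "query": {
    "type": "query:string",
    "query": "abstract:\"deep learning\""
  },
  "limit": 100
}
```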
requireScores
If false
, the selector will query document identifiers only (without scores or score-implied sort
order). This can be used to accelerate large queries where scores are not used or are irrelevant to the result.
Document order is not guaranteed and may be random in scoreless query mode.
documents:byWeight
Given a list of documents, filters out documents with weights smaller than the threshold you provide.
{
"type": "documents:byWeight",
"documents": {
"type": "documents:reference",
"auto": true
},
"minWeight": 0.7
}
You can use this stage to select documents with a dynamically-changing limit based on the minimum document weight you provide. For example, the following request selects documents whose embedding vectors are similar to the embedding vector of the clustering algorithm label and where the similarity between the document and search label vector is not smaller than 0.7.
To make sure that the pool of candidates contains documents with large enough weights, we set the
limit
on the underlying
documents:embeddingNearestNeighbors
stage to 20000.
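The request described above can be sketched as the following stage. The stage name `labelVector` is hypothetical: it stands for a vector stage defined elsewhere in the request that produces the embedding vector of the search label.

```json
{
  "type": "documents:byWeight",
  "minWeight": 0.7,
  "documents": {
    "type": "documents:embeddingNearestNeighbors",
    "limit": 20000,
    "vector": {
      "type": "vector:reference",
      "use": "labelVector"
    }
  }
}
```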
documents
The documents to which to apply filtering.
minWeight
The minimum weight each document must have to be included in the result.
documents:byWeightMass
Given a list of documents, selects the top scoring documents that account for the specified percentage of the total weight of the documents.
{
"type": "documents:byWeightMass",
"applyToEqualScores": false,
"documents": {
"type": "documents:reference",
"auto": true
},
"minWeightMass": 1
}
applyToEqualScores
Determines if filtering should also apply if all input documents have equal scores.
If false
and all input documents have equal scores, Lingo4G does not apply filtering and returns
all documents
.
If true
and all input documents have equal scores, Lingo4G applies the filtering, which effectively
results in returning the first minWeightMass
* 100 percent of the input documents.
documents
The input documents to filter.
minWeightMass
The accumulated document weight threshold at which to filter the input documents.
To perform the filtering, Lingo4G computes the sum of the weights of all input documents. Then, it passes to the
output the first top-scoring documents that account for at least minWeightMass
of the total weight of the documents.
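As a sketch of how the threshold plays out, consider the stage below. The stage name `candidates` is hypothetical and stands for any documents stage defined elsewhere in the request.

```json
{
  "type": "documents:byWeightMass",
  "minWeightMass": 0.75,
  "documents": {
    "type": "documents:reference",
    "use": "candidates"
  }
}
```

Assuming `candidates` returns four documents with weights 5, 3, 1 and 1, the total weight is 10 and the threshold is 0.75 * 10 = 7.5. The top document alone accounts for 5, which is below the threshold; the top two account for 8, which reaches it, so under this reading of the description the stage would return the first two documents.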
documents:composite
Takes a union or intersection of the nested document selectors, aggregating weights of identical documents in multiple selectors and sorting documents by those weights.
{
"type": "documents:composite",
"operator": "OR",
"selectors": [],
"sortOrder": "DESCENDING",
"weightAggregation": "SUM"
}
operator
Declares the way documents from selectors
are combined. The
operator
property supports the following values:
OR
-
Produces the union of all unique documents from all selectors.
AND
-
Produces the intersection of all documents from all selectors. A document must appear in all selectors to appear in the output.
selectors
An array of nested
documents:*
selectors to combine.
sortOrder
Controls the order of the documents on output. Documents are sorted by their weight. See
sortOrder
for the list of possible sorting orders.
weightAggregation
Controls how document weights (scores) are aggregated for documents that exist in more than one selector.
See
weightAggregation
in the documentation of common types for the list of possible values.
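For illustration, the following sketch intersects two document lists and sums the weights of documents present in both. The stage names `byKeywords` and `byEmbeddings` are hypothetical and stand for documents stages defined elsewhere in the request.

```json
{
  "type": "documents:composite",
  "operator": "AND",
  "selectors": [
    { "type": "documents:reference", "use": "byKeywords" },
    { "type": "documents:reference", "use": "byEmbeddings" }
  ],
  "weightAggregation": "SUM",
  "sortOrder": "DESCENDING"
}
```

A document appears in the output only if both referenced stages return it, and its output weight is the sum of its two input weights.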
documents:contrastScore
Computes contrast score for the documents you provide. For certain collections, contrast score may reveal documents that introduce novel concepts, not seen in preceding documents.
{
"type": "documents:contrastScore",
"contextTimestamps": null,
"documentTimestamps": null,
"documents": {
"type": "documents:reference",
"auto": true
},
"forceSymmetricalContext": true,
"limit": 10000,
"matrixRows": {
"type": "matrixRows:reference",
"auto": true
},
"minSimilarDocuments": 0,
"sortOrder": "DESCENDING"
}
Contrast score of a document depends on how many similar documents precede and follow the document. For example, an arXiv paper that is similar to very few papers published earlier, and at the same time similar to very many papers published later, will have a high contrast score, which may indicate that the paper introduces some novel ideas that inspired follow-up research. Conversely, a paper similar to many preceding papers and at the same time similar to very few succeeding papers will have a low contrast score, which may indicate it does not introduce any novel ideas.
Algorithm
Lingo4G requires the following pieces of data to compute contrast scores:
-
documents
- The list of input documents for which to compute contrast scores.
- Context documents
-
The pool of documents to use as the before / after context. For each input document, Lingo4G computes similarities to all context documents to determine the number of similar documents that precede and follow the document being scored.
You provide the context documents indirectly through the document similarity
matrixRows
property.
-
matrixRows
- Rows of the similarity matrix between input and context documents. Rows in the matrix must correspond to input documents, columns must correspond to context documents.
-
documentTimestamps
,
contextTimestamps
-
Time stamps, such as dates, for input and context documents. Lingo4G uses them to determine which context documents were written before and which after the input documents.
Time stamps must be of number or string type. Lingo4G compares document time stamps using natural and lexicographic order, respectively.
To compute contrast scores Lingo4G performs the following steps:
-
For each row from
matrixRows
, which contains context documents that are most similar to one input document, split the context documents into those that were written before and after the input document. Use natural order for numeric time stamps and lexicographic order for string time stamps.
-
If
forceSymmetricalContext
is true
, truncate the larger of the "before" and "after" context pools to make the numbers of documents in both pools equal.
-
If the total size of the "before" and "after" context document pools is less than
minSimilarDocuments
, do not compute contrast score for this input document.
The
minSimilarDocuments
threshold prevents computation of contrast scores based on very few context documents. In an extreme case, for minSimilarDocuments
equal to 0
, you could receive a perfect contrast score of 1.0 for just one "after" context document and zero "before" context documents.
-
Add up similarities of the input document to the "before" and "after" context documents to form the "before" and "after" similarity sums.
-
Compute the input document's contrast score as:
The contrast score can take values in the -1...1 range. Contrast score of -1 means there are no succeeding context documents similar to the input document, so the input document probably does not introduce any novel ideas. Conversely, a contrast score of +1 means that all similar context documents were written after the input document, which may suggest that the input document introduces novel ideas.
-
Sort the results by contrast score and return the top
limit
highest-scoring documents.
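The score computed in the steps above can be sketched as a formula. This is an assumption inferred from the -1...1 range and the two boundary cases described in the steps, not a statement of the exact Lingo4G implementation; the symbols S_before and S_after are our own notation for the "before" and "after" similarity sums.

```latex
\[
\text{contrast} \;=\; \frac{S_{\text{after}} - S_{\text{before}}}{S_{\text{after}} + S_{\text{before}}}
\]
% Boundary cases under this assumption:
%   S_after  = 0  gives  contrast = -1  (no similar succeeding documents)
%   S_before = 0  gives  contrast = +1  (all similar documents are succeeding)
% Example: S_before = 2.0, S_after = 6.0  gives  (6.0 - 2.0) / (6.0 + 2.0) = 0.5
```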
Results format
The documents:contrastScore
stage produces the following JSON structure:
The documents
array contains up to limit
input documents, sorted decreasingly by contrast score. Each object in the array corresponds to one input
document and contains the following properties:
id
- Internal identifier of the document.
score
, weight
-
Contrast score of the document. Both the
score
and weight
properties contain the same value.
-
confidence
-
Summarizes the quality of the contrast score of this document. Confidence is 1.0 if all the available context documents were eligible for contrast score computation. If
forceSymmetricalContext
is true
, confidence may be lower than 1.0 to indicate that Lingo4G had to ignore some of the context documents to make the "before" and "after" context pools contain equal numbers of documents.
Lingo4G computes the confidence factor using the following formula:
-
balance
-
Summarizes the quality of the "before" and "after" context of this document. Balance is 1.0 if the pools of "before" and "after" context documents for this input document are equal. If the pools are not equal, balance is less than 1.0 and falls to 0.0 if either of the context document pools is empty.
Lingo4G computes the balance factor using the following formula:
-
before
, after
-
Statistics about the "before" and "after" pools of documents for this input document.
-
similar
-
The number of similar documents in the respective context pool.
-
similarity
-
The sum of input-to-context document similarities in the respective context pool.
-
context
-
The total number of context documents in the respective pool.
Note that this number will most of the time be larger than the
similar
property because not all documents in the context pool are similar to the input documents. (In fact, most documents in the context pool are not similar to the input document.)
Example request
The following request computes contrast scores for arXiv papers published in 2014.
The documents
stage selects documents for which to compute the contrast score. Our request uses a
range query to select all documents created in 2014. The
context
stage selects the context documents, which the contrast score computation algorithm splits into "before" and
"after" pools. Our request uses a window of +/- 4 years for the context, so it selects papers created between
2010 and 2018.
The similarities
component defines the rows of the similarity matrix between the input and context
documents. It uses
matrixRows:knnVectorsSimilarity
to select the 200 most similar context documents for each input document. Our request passes to the component a
reference to the input documents to be used as rows of the similarity matrix, and a reference to the context
documents to be used as matrix columns. As an alternative to embedding-based similarities, you could use
matrixRows:keywordDocumentSimilarity
, which does not require document embeddings to be present in the index but takes much longer to compute.
The documentTimestamps
and contextTimestamps
use
values:fromDocumentField
to retrieve values of the created
field for input and context documents. The
created
field contains paper creation dates in the
YYYY-MM-DD
format, which is suitable for lexicographic comparisons.
The scores
stage uses
documents:contrastScore
to compute the scores. The request
passes most required data as references. One exception is the similarity
component, which Lingo4G
resolves as an automatic reference.
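The scores stage described above could be sketched as the fragment below. The stage names follow the description of this example request; the `values:reference` type is an assumption made by analogy with the other reference types in this document, and the `minSimilarDocuments` value is illustrative. The `matrixRows` property is omitted so that Lingo4G resolves it as an automatic reference.

```json
"scores": {
  "type": "documents:contrastScore",
  "documents": {
    "type": "documents:reference",
    "use": "documents"
  },
  "documentTimestamps": {
    "type": "values:reference",
    "use": "documentTimestamps"
  },
  "contextTimestamps": {
    "type": "values:reference",
    "use": "contextTimestamps"
  },
  "minSimilarDocuments": 100,
  "sortOrder": "DESCENDING"
}
```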
Finally, the content
stage retrieves titles and abstracts for the documents with the highest
contrast scores.
If you run the request in the JSON Sandbox app, you should receive a response similar to the following JSON:
The documents
stage result contains information about the contrast score and other related
statistics. The above sample response contains only one document in the array; real-world responses contain up
to limit
results.
The content
stage result shows titles and abstracts of the three documents with top scores. Notice
how they revolve around deep learning, which was a new hot topic around that time.
Notes
-
Not for real-time trend detection. Currently, Lingo4G can compute contrast scores only when it has access to documents that both precede and follow the document in question. Due to this, the method is useful only ex post: it is not suitable for novelty detection in real-time.
-
Provide a suitable window of context documents. For best results, ensure that the context documents fall in a symmetrical window centered around the period of input documents. For example, if you compute contrast score for papers written in 2015, make the context documents cover a period of +/- 3, 4 or 5 years.
-
Examine contrast score confidence. A contrast score close to 1.0 does not always mean a document contains innovative ideas. For example, when there is only one "before" document available in the context pool, a score close to 1.0 is ill-founded.
Therefore, always examine the
confidence
of the score. As a rule of thumb, a confidence below 0.2 means the high contrast score is probably ill-founded. Consider setting a non-zero minSimilarDocuments
to filter out such documents from scoring. Alternatively, increase the period of time covered by the context documents and see if this improves the confidence of the contrast scores.
contextTimestamps
Time stamps of the context documents to use for contrast score computation.
In typical cases, you can use the
values:fromDocumentField
stage to collect values of a specific document field, such as creation date, to serve as context time stamps.
Use the same set of context documents to compute the time stamp values and the columns of the similarity
matrixRows
. If time stamps don't match similarity matrix columns, Lingo4G throws an error.
The time stamp values must be numbers or strings. Lingo4G compares document time stamps using natural and lexicographic order, respectively.
documentTimestamps
Time stamps of the input documents to use for contrast score computation.
In typical cases, you can use the
values:fromDocumentField
stage to collect values of a specific document field, such as creation date, to serve as input document time
stamps.
Use the same set of input documents to compute the time stamp values and the rows of the similarity
matrixRows
. If time stamps don't match similarity matrix rows, Lingo4G throws an error.
The time stamp values must be numbers or strings. Lingo4G compares document time stamps using natural and lexicographic order, respectively.
documents
The input documents for which to compute contrast scores.
Provide the same documents as rows in the similarity
matrixRows
computation and in the
documentTimestamps
value collection. If the document sets don't match, Lingo4G throws an error.
forceSymmetricalContext
Ignores certain context documents to keep the window of "before" and "after" context documents symmetrical and centered around the input document.
When you compute contrast scores for a set of documents spanning, for example, one year, you will likely use
queries similar to created:[2014-01-01 TO 2014-12-31]
and
created:[2010-01-01 TO 2018-12-31]
for input and context documents. If you take a specific input
document published on 2014-01-01, the entire context document window will not be perfectly centered around that
document – the "after" part of the window is larger than the "before" part.
If you set forceSymmetricalContext
to true
, Lingo4G discards some of the context
documents to keep the context window symmetrical. Note that this may
lower the contrast score confidence.
limit
The number of top-scoring documents to return.
matrixRows
Defines similarities between input and context documents for contrast score computation.
In most cases, you can use the
matrixRows:knnVectorsSimilarity
or
matrixRows:keywordDocumentSimilarity
to compute the similarities.
Provide input documents
as rows and the context documents as columns for the similarity matrix rows computation.
Regardless of which matrixRows
component you choose, set its
maxNeighbors
property to at least 100 for meaningful contrast scores.
minSimilarDocuments
The minimum number of documents in the context window required for contrast score computation.
Lingo4G ignores input documents that have fewer than
minSimilarDocuments
context documents in their context window. If you see high contrast score
documents with low confidence, increase
minSimilarDocuments
above zero to filter out documents with such low-quality scores. A good starting point is setting
minSimilarDocuments
to half of the
maxNeighbors
value you used for the similarity matrix rows computation.
sortOrder
Controls the order of documents on output.
Lingo4G sorts the output documents based on their contrast score. The default value of
DESCENDING
puts the documents with the highest contrast score first. To see the documents with the lowest contrast score,
set
sortOrder
to
ASCENDING
. Finally, if you set sortOrder
to
UNSPECIFIED
, Lingo4G returns the input documents in their original order.
documents:embeddingNearestNeighbors
Selects documents whose embedding vectors are most similar to the vector you provide.
{
"type": "documents:embeddingNearestNeighbors",
"failIfEmbeddingsNotAvailable": true,
"filterQuery": {
"type": "query:all"
},
"limit": 100,
"searcher": "AUTO",
"vector": {
"type": "vector:reference",
"auto": true
}
}
See the document selection tutorial for examples of document- and label-based selection of embedding-wise similar documents.
failIfEmbeddingsNotAvailable
Determines the behavior of this stage if the index does not contain document embeddings.
If the index does not contain document embeddings and failIfEmbeddingsNotAvailable
is:
true
- this stage fails and logs an error.
false
- this stage returns an empty list of documents.
If your request combines keyword- and embedding-based processing, you can set
failIfEmbeddingsNotAvailable
to false
to have Lingo4G degrade gracefully to keyword-based
processing if the index does not contain document embeddings.
filterQuery
Narrows down the returned documents to those matching the query you provide.
If you provide the filterQuery
property, Lingo4G narrows down the results of this stage to documents that
match the query.
For example, the following request limits the results of embedding-based document selection to arXiv papers in the cs.* category.
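A sketch of such a stage is shown below. The `category` field name, the `query` property of query:string and the `inputVector` stage name are assumptions made for illustration; they are not confirmed by this document.

```json
{
  "type": "documents:embeddingNearestNeighbors",
  "limit": 100,
  "filterQuery": {
    "type": "query:string",
    "query": "category:cs.*"
  },
  "vector": {
    "type": "vector:reference",
    "use": "inputVector"
  }
}
```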
limit
The maximum number of documents to select.
searcher
Determines the document searching algorithm.
Lingo4G can use one of two algorithms to find documents whose embedding vectors lie close to the input vector
you provide. The searcher
property determines the algorithm to use.
AUTO
-
Automatic algorithm choice based on the number of documents to select and the number of documents matching the
filterQuery
. Use automatic algorithm selection unless you notice this stage performs slowly for a specific search.
-
APPROXIMATE
-
Forces Lingo4G to use the approximate search algorithm, which traverses a graph of similar vectors. This algorithm is only efficient for searches with low
limit
values or searches without results filtering.
-
COMPLETE
-
Forces Lingo4G to perform a complete search of all document embedding vectors. If you notice slow performance of a search under the
AUTO
searcher, try the COMPLETE
searcher, which may offer better performance for that particular search.
vector
The input vector for the similar document search.
You can use the following vector sources for this property:
-
vector:documentEmbedding
returns documents that are similar to the document that served as the source of the input vector.
-
vector:labelEmbedding
returns documents that are similar to the label that served as the source of the input vector.
See the document selection tutorial for examples of document- and label-based selection of embedding-wise similar documents.
documents:fromClusterExemplars
Collects highest-weight top-level exemplars of the document clusters you provide into a flat document list.
{
"type": "documents:fromClusterExemplars",
"clusters": {
"type": "clusters:reference",
"auto": true
},
"documents": {
"type": "documents:reference",
"auto": true
},
"limit": 10000,
"sortOrder": "DESCENDING"
}
You can use this stage, combined with
clusters:ap
, which clusters documents into related groups, to reduce a large collection of documents into a much smaller set
of salient documents representing different themes present in the original collection.
Another use case of this stage is in combination with the
clusters:fromMatrixColumns
stage to process the result of synthetic clustering of matrix columns.
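The first use case could be sketched as follows. The stage names `clusters` and `documents` are hypothetical: they stand for a clusters:ap stage and for the documents stage that supplied its input, both defined elsewhere in the request. The limit of 50 is illustrative.

```json
{
  "type": "documents:fromClusterExemplars",
  "clusters": {
    "type": "clusters:reference",
    "use": "clusters"
  },
  "documents": {
    "type": "documents:reference",
    "use": "documents"
  },
  "limit": 50,
  "sortOrder": "DESCENDING"
}
```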
clusters
The clusters from which to collect exemplars.
documents
The documents that gave rise to the input clusters.
The input clusters
and
documents
must be compatible: the clusters must have been created based on the documents. Otherwise, Lingo4G logs an
error.
limit
The maximum number of exemplar documents to collect.
sortOrder
Determines the order in which to collect document exemplars.
ASCENDING
-
Collects up to
limit
of exemplar documents with the lowest exemplar weight values.
-
DESCENDING
-
Collects up to
limit
of exemplar documents with the highest exemplar weight values.
-
UNSPECIFIED
-
Collects up to
limit
of exemplar documents in the order they appear in the cluster list.
documents:fromClusterMembers
Collects members of the document clusters you provide into a flat document list.
{
"type": "documents:fromClusterMembers",
"clusters": {
"type": "clusters:reference",
"auto": true
},
"documents": {
"type": "documents:reference",
"auto": true
},
"limit": 10000,
"sortOrder": "DESCENDING"
}
clusters
The clusters from which to collect document members.
documents
The documents that gave rise to the input clusters.
The input clusters
and
documents
must be compatible: the clusters must have been created based on the documents. Otherwise, Lingo4G logs an
error.
limit
The maximum number of member documents to collect.
sortOrder
Determines the order in which to collect document members.
ASCENDING
-
Collects up to
limit
of member documents with the lowest member weight values.
-
DESCENDING
-
Collects up to
limit
of member documents with the highest member weight values.
-
UNSPECIFIED
-
Collects up to
limit
of member documents in the order they appear in the cluster list.
documents:fromDocumentPairs
Converts the list of document pairs you provide into a flat list of unique documents occurring in at least one input pair.
{
"type": "documents:fromDocumentPairs",
"documentPairs": {
"type": "documentPairs:reference",
"auto": true
}
}
You can combine this stage with
documentContent
to fetch contents of documents involved in at least one of the pairs:
"content": {
"type": "documentContent",
"limit": "unlimited",
"documents": {
"type": "documents:fromDocumentPairs",
"documentPairs": {
"type": "documentPairs:reference",
"use": "similarPairs"
}
},
"fields": {
"type": "contentFields:simple",
"fields": {
"id": {},
"title": {},
"author_name": {},
"created": {},
"updated": {},
"abstract": {
"maxValueLength": 250
}
}
}
}
documentPairs
The document pairs to convert into a flat document list.
documents:fromMatrixColumns
Given a matrix with columns corresponding to documents, selects the top-scoring columns and returns them as a document list.
{
"type": "documents:fromMatrixColumns",
"documents": {
"type": "documents:reference",
"auto": true
},
"limit": 10000,
"matrixRows": {
"type": "matrixRows:reference",
"auto": true
},
"sortOrder": "DESCENDING",
"weightAggregation": "SUM"
}
This stage performs the following steps:
-
For each column of the input
matrixRows
, aggregate the column's values using the weightAggregation
function.
-
Sort columns by their aggregated value computed in step 1, according to the
sortOrder
.
-
Return a list of documents corresponding to up to
limit
first columns on the sorted list.
You can use the documents:fromMatrixColumns
stage to select top-scoring documents where the score is
an aggregation of a number of values. For example, if you build
matrixRows
of cross-similarities between a set of
cs.* and physics.* arXiv papers,
documents:fromMatrixColumns
can reveal the top
physics.* papers that are most similar to cs.* papers, showing where the two areas overlap.
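The three steps can be traced on a small worked example. The matrix values below are made up; the example assumes the default SUM aggregation and DESCENDING sort order.

```latex
\[
M = \begin{pmatrix} 0.9 & 0.1 & 0.4 \\ 0.8 & 0.0 & 0.5 \end{pmatrix},
\qquad
\text{column sums} = (0.9 + 0.8,\; 0.1 + 0.0,\; 0.4 + 0.5) = (1.7,\; 0.1,\; 0.9)
\]
% Sorted in descending order of the sums: column 1 (1.7), column 3 (0.9), column 2 (0.1).
% With limit = 2, the stage would return the documents behind columns 1 and 3.
```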
documents
The documents that correspond to columns of the input matrix rows.
Make sure that the documents you provide in this property also gave rise to the columns of the input
matrixRows
. If the two are incompatible, Lingo4G logs an error.
limit
The maximum number of documents to select.
matrixRows
The matrix rows whose columns to aggregate.
Make sure that the documents
you provide gave rise to the columns of the input
matrixRows
. If the two are incompatible, Lingo4G logs an error.
sortOrder
Determines the sorting order for the aggregated column values.
ASCENDING
-
Collects up to
limit
of documents corresponding to columns with the smallest aggregated values.
-
DESCENDING
-
Collects up to
limit
of documents corresponding to columns with the largest aggregated values.
-
UNSPECIFIED
-
Collects up to
limit
of documents in the order their corresponding columns appear in the input matrixRows
.
weightAggregation
The column value aggregation function.
documents:rwmd
Computes an approximation of the Relaxed Word Mover's Distance between the documents and labels you provide.
{
"type": "documents:rwmd",
"documents": {
"type": "documents:reference",
"auto": true
},
"failIfEmbeddingsNotAvailable": true,
"fields": {
"type": "featureFields:reference",
"auto": true
},
"labelFilter": {
"type": "labelFilter:reference",
"auto": true
},
"labels": {
"type": "labels:reference",
"auto": true
}
}
Relaxed Word Mover's Distance (RWMD) aims to compute similarities between documents using multidimensional embedding vectors of the words appearing in the documents. Lingo4G's formulation computes the similarity between a list of labels and a list of documents you provide.
For each document in the document list, Lingo4G computes the RWMD similarity in the following way:
-
Collect all labels occurring in the document.
-
For each of the input
labels
, find the document's label with the highest embedding-wise similarity.
-
Compute the document's RWMD score as the search score of the document against a union of the input labels and labels computed in step 2.
The above formulation makes it possible to compare RWMD scores with the regular keyword search scores you get
when combining the
documents:byQuery
stage with the
query:forLabels
query component. Therefore, a typical use case for the documents:rwmd
stage is to compute a unified score for keyword- and embedding-based similar document searches (MLT).
documents
The input documents for which to compute the RWMD score.
failIfEmbeddingsNotAvailable
Determines the behavior of this stage if the index does not contain document embeddings.
If the index does not contain document embeddings and failIfEmbeddingsNotAvailable
is:
true
- this stage fails and logs an error.
false
-
this stage returns documents with
weight
values equal to the weights of the input documents.
If your request combines keyword- and embedding-based processing, you can set
failIfEmbeddingsNotAvailable
to false
to have Lingo4G degrade gracefully to keyword-based
processing if the index does not contain document embeddings.
fields
Determines the document feature fields to use for label collection and document scoring.
labelFilter
Performs filtering of labels collected from individual documents.
labels
The labels against which to score the input documents.
documents:sample
Returns a uniform sample of documents matching the provided
query
.
{
"type": "documents:sample",
"limit": 10000,
"query": null,
"randomSeed": 0,
"samplingRatio": 1
}
limit
The maximum number of documents to select.
Value must be an integer >= 0 or the string unlimited
.
query
One of the query components.
randomSeed
The random seed to use for sampling.
samplingRatio
The sampling ratio between 0 (exclusive) and 1 (inclusive). The
documents:sample
component will attempt to return a uniform sample of size
samplingRatio * sourceDocumentCount
documents.
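As a sketch, the following stage samples roughly one tenth of all documents in the index. The seed value 42 is arbitrary; query:all is the same query component that appears in the default filterQuery values elsewhere in this document.

```json
{
  "type": "documents:sample",
  "query": {
    "type": "query:all"
  },
  "samplingRatio": 0.1,
  "randomSeed": 42,
  "limit": "unlimited"
}
```

For an index of 50,000 documents, this stage would attempt to return about 0.1 * 50,000 = 5,000 documents.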
documents:scored
Computes new weights for the set of documents you provide based on the
documentScorer:*
of your choice.
{
"type": "documents:scored",
"documents": {
"type": "documents:reference",
"auto": true
},
"limit": 10000,
"scorer": {
"type": "documentScorer:reference",
"auto": true
},
"sortOrder": "DESCENDING"
}
By default, this stage re-orders the documents in the decreasing order
of the score and returns up to limit
top-scoring documents.
documents
The documents for which to compute new weights.
limit
The maximum number of top-scoring documents to return.
scorer
The scoring component to use to compute new document weights.
sortOrder
Controls the order of the documents on output. Lingo4G sorts the documents using the weight computed by the
scorer
you provide.
See sortOrder
for the list of possible sorting orders.
documents:vectorFieldNearestNeighbors
Selects documents which are most similar to the vector you provide. This stage uses vector fields for which vector data must be provided from outside Lingo4G and indexed together with other document fields.
{
"type": "documents:vectorFieldNearestNeighbors",
"fieldName": null,
"filterQuery": {
"type": "query:all"
},
"limit": 100,
"vector": {
"type": "vector:reference",
"auto": true
}
}
fieldName
Document field containing external vector data added during indexing.
filterQuery
Narrows down the set of returned documents to those that are similar to the input vector and match the filtering query.
limit
The maximum number of documents to return.
vector
The input vector for the nearest-neighbor similarity search.
You can use the following vector sources for this property:
-
vector:fromVectorField
returns the content of the vector field from one or more documents matching the provided query.
Consumers of
documents:*
The following stages and components take documents:*
as
input: