documents

The documents:* stages group various ways of producing lists of documents. You can feed the result of a documents stage into another stage, such as document content retrieval, collecting labels from documents, or building a matrix of similarities between documents.

You can use the following documents stages in your analysis requests:

documents:​by​Id

Selects documents by their internal identifiers.

documents:​by​Query

Selects documents matching the query you provide.

documents:​by​Weight

Given a list of documents, filters out documents with weights smaller than the threshold you provide.

documents:​by​Weight​Mass

Given a list of documents, selects the top scoring documents that account for the specified percentage of the total weight of the documents.

documents:​composite

Takes a union or intersection of the document lists you provide.

documents:​contrast​Score

Computes contrast scores for the documents you provide. For certain collections, contrast scores may reveal documents that introduce novel concepts not seen in preceding documents.

documents:​embedding​Nearest​Neighbors

Selects documents that are most similar to the multidimensional vector you provide.

documents:​from​Cluster​Exemplars

Collects exemplars of the document clusters you provide into a flat document list.

documents:​from​Cluster​Members

Collects members of the document clusters you provide into a flat document list.

documents:​from​Document​Pairs

Converts the list of document pairs you provide into a flat list of unique documents occurring in at least one input pair.

documents:​from​Matrix​Columns

Given a matrix with columns corresponding to documents, selects the top-scoring columns and returns them as a document list.

documents:​rwmd

Computes an approximation of the Relaxed Word Mover's Distance between the documents and labels you provide.

documents:​sample

Applies random sampling to the documents you provide.

documents:​scored

Computes new weights for the set of documents you provide based on the scoring component of your choice.

documents:​vector​Field​Nearest​Neighbors

Selects documents that are most similar to the vector you provide (using an externally provided vector field).


documents:​reference

References the results of another documents:​* stage defined in the request.


The JSON output of a documents stage has the following outline structure:

{
  "matches": {
    "value": 1276,
    "relation": "GREATER_OR_EQUAL"
  },
  "documents": [
    {
      "id": 482237,
      "weight": 6.6440954
    },
    {
      "id": 298152,
      "weight": 6.631213
    },
    {
      "id": 275187,
      "weight": 6.5777597
    }
  ]
}

The documents array is mandatory for all documents:* stages and consists of objects with the following fields:

id
Internal identifier of the document. Internal document identifiers may change between subsequent commits and reindexing runs.
weight
Weight of the document. The semantics of the weight depends on the specific stage.

Different stages may return additional properties. In the example above, the documents:byQuery stage returned an additional matches element, which contains the number of matching documents and indicates whether this number is approximate or exact.

documents:​by​Id

Returns documents matching the provided internal identifiers. This component can be helpful for debugging or when internal identifiers are returned from another source. Note that internal document identifiers need not be contiguous and can change after document or label reindexing.

{
  "type": "documents:byId",
  "documents": []
}

documents

Type
array of object
Default
[]
Required
no

An array of objects, each with an id property pointing at an internal document identifier. For example:

{
  "stages": {
    "documentsById": {
      "type": "documents:byId",
      "documents": [
        { "id": 1 },
        { "id": 2 },
        { "id": 10221 },
        { "id": 18548 }
      ]
    }
  }
}

Identifiers referring to non-existent or deleted documents will cause an error.

documents:​by​Query

Returns documents matching the provided query.

{
  "type": "documents:byQuery",
  "accurateHitCount": false,
  "limit": 10000,
  "query": null,
  "requireScores": true
}

Use the documents:​by​Query stage to select documents matching a query provided by the user. One common query type is query:​string, which parses the Apache Lucene query syntax.

documents:​by​Query returns the following JSON structure:

{
  "matches": {
    "value": 1276,
    "relation": "GREATER_OR_EQUAL"
  },
  "documents": [
    {
      "id": 482237,
      "weight": 6.6440954
    },
    {
      "id": 298152,
      "weight": 6.631213
    },
    {
      "id": 275187,
      "weight": 6.5777597
    }
  ]
}
matches
Information about the total number of documents matching the query.
value
Total number of documents matching the query. Note that this number may be approximate and larger than the number of documents actually returned in the documents array.
relation
Indicates whether the total number of documents is exact or an approximation.
E​X​A​C​T
value contains the exact number of matches.
G​R​E​A​T​E​R_​O​R_​E​Q​U​A​L
value is a lower bound on the number of matches. To force Lingo4G to compute the exact number of matches, set the accurate​Hit​Count property to true.
documents
Array of selected documents. The weight property contains the search score returned by Lucene for the specific document. Length of the array is not greater than limit.

accurate​Hit​Count

Type
boolean
Default
false
Required
no

If true, the returned number of matching documents is guaranteed to be accurate, otherwise it may be an approximation. Accurate results are typically more costly to compute.

Here is an example stage requesting approximate total:

{
  "type": "documents:byQuery",
  "limit": 3,
  "query": {
    "type": "query:string",
    "query": "photon"
  },
  "accurateHitCount": false
}

The output for the above stage is:

{
  "matches": {
    "value": 1276,
    "relation": "GREATER_OR_EQUAL"
  },
  "documents": [
    {
      "id": 482237,
      "weight": 6.6440954
    },
    {
      "id": 298152,
      "weight": 6.631213
    },
    {
      "id": 275187,
      "weight": 6.5777597
    }
  ]
}

Compare the above to the result below, when an accurate hit count is requested:

{
  "matches": {
    "value": 13718,
    "relation": "EXACT"
  },
  "documents": [
    {
      "id": 482237,
      "weight": 6.6440954
    },
    {
      "id": 298152,
      "weight": 6.631213
    },
    {
      "id": 275187,
      "weight": 6.5777597
    }
  ]
}

limit

Type
limit
Default
10000
Required
no

The maximum number of documents to select.

Value must be an integer >= 0 or the string unlimited, in which case the stage returns all matches.

If limit is smaller than the number of documents matching the query, the stage returns the top-scoring documents.

If limit is zero, Lingo4G computes the number of documents matching the query and returns the result in the matches section of the response, leaving the documents array empty. Counting the number of matches is often faster than selecting the identifiers of matching documents, so if you only want to count query matches, set limit to 0.
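
For example, a minimal sketch of a request that only counts the documents matching a query (the stage name and query are illustrative):

{
  "stages": {
    "matchCount": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 0,
      "accurateHitCount": true
    }
  }
}

The documents array in the response is empty, and the matches element carries the number of matching documents, exact in this case because accurateHitCount is true.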

query

Type
query
Default
null
Required
yes

The query to execute and retrieve matching documents for. The following example uses the query:​string component:

{
  "type": "documents:byQuery",
  "limit": 3,
  "query": {
    "type": "query:string",
    "query": "photon"
  },
  "accurateHitCount": false
}

require​Scores

Type
boolean
Default
true
Required
no

If false, the selector will query document identifiers only (without scores or score-implied sort order). This can be used to accelerate large queries where scores are not used or are irrelevant to the result.

Document order is not guaranteed and may be random in scoreless query mode.
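
For example, a sketch of a stage that selects all documents in a category without computing scores (the query is illustrative):

{
  "type": "documents:byQuery",
  "query": {
    "type": "query:string",
    "query": "category:cs.*"
  },
  "limit": "unlimited",
  "requireScores": false
}

Because the stage selects all matches, the lack of a score-implied order does not affect the result.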

documents:​by​Weight

Given a list of documents, filters out documents with weights smaller than the threshold you provide.

{
  "type": "documents:byWeight",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "minWeight": 0.7
}

You can use this stage to select a dynamically sized set of documents based on the minimum document weight you provide. For example, the following request selects documents whose embedding vectors are similar to the embedding vector of the label clustering algorithm, keeping only the documents whose similarity to the label vector is at least 0.7.

{
  "stages": {
    "documents":{
      "type": "documents:byWeight",
      "documents": {
        "type": "documents:embeddingNearestNeighbors",
        "vector": {
          "type": "vector:labelEmbedding",
          "labels": {
            "type": "labels:direct",
            "labels": [
              {
                "label": "clustering algorithm"
              }
            ]
          }
        },
        "limit": 20000
      },
      "minWeight": 0.7
    }
  }
}

To make sure that the pool of candidates contains the documents with large enough weights, we set the limit on the underlying documents:embeddingNearestNeighbors stage to 20000.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents to which to apply filtering.

min​Weight

Type
number
Default
0.7
Constraints
value >= 0 and value <= 1
Required
no

The minimum weight each document must have to be included in the result.

documents:​by​Weight​Mass

Given a list of documents, selects the top scoring documents that account for the specified percentage of the total weight of the documents.

{
  "type": "documents:byWeightMass",
  "applyToEqualScores": false,
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "minWeightMass": 1
}

apply​To​Equal​Scores

Type
boolean
Default
false
Required
no

Determines if filtering should also apply if all input documents have equal scores.

If false and all input documents have equal scores, Lingo4G does not apply filtering and returns all documents.

If true and all input documents have equal scores, Lingo4G applies the filtering, which effectively results in returning the first min​Weight​Mass * 100 percent of the input documents.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The input documents to filter.

min​Weight​Mass

Type
number
Default
1
Constraints
value >= 0 and value <= 1
Required
no

The accumulated document weight threshold at which to filter the input documents.

To perform the filtering, Lingo4G computes the sum of the weights of all input documents. Then, it passes to the output the top-scoring documents whose weights add up to at least minWeightMass of that total.
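
For example, the following sketch keeps the top-scoring query matches that account for 80% of the total search score mass (the query and the 0.8 threshold are illustrative):

{
  "stages": {
    "documents": {
      "type": "documents:byWeightMass",
      "documents": {
        "type": "documents:byQuery",
        "query": {
          "type": "query:string",
          "query": "photon"
        }
      },
      "minWeightMass": 0.8
    }
  }
}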

documents:​composite

Takes a union or intersection of the nested document selectors, aggregating the weights of documents that appear in more than one selector and sorting the output documents by those weights.

{
  "type": "documents:composite",
  "operator": "OR",
  "selectors": [],
  "sortOrder": "DESCENDING",
  "weightAggregation": "SUM"
}

operator

Type
string
Default
"OR"
Constraints
one of [OR, AND]
Required
no

Declares the way documents from selectors are combined. The operator property supports the following values:

O​R

Produces the union of all unique documents from all selectors.

A​N​D

Produces the intersection of all documents from all selectors. A document must appear in all selectors to appear in the output.

selectors

Type
array of documents
Default
[]
Required
no

An array of nested documents:* selectors to combine.
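
For example, the following sketch intersects a keyword query with an embedding-based nearest-neighbor search, summing the weights of documents returned by both selectors (the query, label and limit are illustrative):

{
  "stages": {
    "documents": {
      "type": "documents:composite",
      "operator": "AND",
      "weightAggregation": "SUM",
      "selectors": [
        {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "photon"
          }
        },
        {
          "type": "documents:embeddingNearestNeighbors",
          "vector": {
            "type": "vector:labelEmbedding",
            "labels": {
              "type": "labels:direct",
              "labels": [
                { "label": "quantum optics" }
              ]
            }
          },
          "limit": 10000
        }
      ]
    }
  }
}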

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Controls the order of the documents on output. Documents are sorted by their weight. See sort​Order for the list of possible sorting orders.

weight​Aggregation

Type
weightAggregation
Default
"SUM"
Required
no

Controls how document weights (scores) are aggregated for documents that exist in more than one selector.

See weight​Aggregation in the documentation of common types for the list of possible values.

documents:​contrast​Score

Computes contrast scores for the documents you provide. For certain collections, contrast scores may reveal documents that introduce novel concepts not seen in preceding documents.

{
  "type": "documents:contrastScore",
  "contextTimestamps": null,
  "documentTimestamps": null,
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "forceSymmetricalContext": true,
  "limit": 10000,
  "matrixRows": {
    "type": "matrixRows:reference",
    "auto": true
  },
  "minSimilarDocuments": 0,
  "sortOrder": "DESCENDING"
}

Contrast score of a document depends on how many similar documents precede and follow it. For example, an arXiv paper that is similar to very few papers published earlier, and at the same time similar to very many papers published later, will have a high contrast score, which may indicate that the paper introduces novel ideas that inspired follow-up research. Conversely, a paper similar to many preceding papers and at the same time similar to very few succeeding papers will have a low contrast score, which may indicate that it does not introduce any novel ideas.

Algorithm

Lingo4G requires the following pieces of data to compute contrast scores:

documents
The list of input documents for which to compute contrast scores.
Context documents

The pool of documents to use as the before / after context. For each input document, Lingo4G computes similarities to all context documents to determine the number of similar documents that precede and follow the document being scored.

You provide the context documents indirectly through the document similarity matrix​Rows property.

matrix​Rows
Rows of the similarity matrix between input and context documents. Rows in the matrix must correspond to input documents, columns must correspond to context documents.
document​Timestamps context​Timestamps

Time stamps, such as dates, for the input and context documents. Lingo4G uses them to determine which context documents were written before and which after each input document.

Time stamps must be strings or numbers. Lingo4G compares numeric time stamps using natural order and string time stamps using lexicographic order.

To compute contrast scores Lingo4G performs the following steps:

  1. For each row from matrixRows, which contains the context documents most similar to one input document, split the context documents into those written before and those written after the input document. Use natural order for numeric time stamps and lexicographic order for string time stamps.

  2. If force​Symmetrical​Context is true, truncate the larger of the "before" and "after" context pools to make the numbers of documents in both pools equal.

  3. If the total size of the "before" and "after" context document pools is less than min​Similar​Documents, do not compute contrast score for this input document.

    The min​Similar​Documents threshold prevents computation of contrast scores based on very few context documents. In an extreme case, for min​Similar​Documents equal to 0, you could receive a perfect contrast score of 1.0 for just one "after" context document and zero "before" context documents.

  4. Add up similarities of the input document to the "before" and "after" context documents to form the similarity before and similarity after values.

  5. Compute the input document's contrast score as:

    score = (similarity_after - similarity_before) / (similarity_after + similarity_before)

    The contrast score can take values in the -1...1 range. Contrast score of -1 means there are no succeeding context documents similar to the input document, so the input document probably does not introduce any novel ideas. Conversely, a contrast score of +1 means that all similar context documents were written after the input document, which may suggest that the input document introduces novel ideas.

  6. Sort the results by contrast score and return the top limit highest-scoring documents.

Results format

The documents:​contrast​Score stage produces the following JSON structure:

{
  "documents": [
    {
      "id": 167063,
      "score": 1,
      "weight": 1,
      "confidence": 0.51,
      "balance": 1,
      "before": {
        "similarity": 0,
        "similar": 0,
        "context": 35637
      },
      "after": {
        "similarity": 91.219955,
        "similar": 102,
        "context": 35637
      }
    }
  ],
  "matches": {
    "value": 10000,
    "relation": "EXACT"
  }
}

JSON response of the documents:​contrast​Score stage.

The documents array contains up to limit input documents, sorted decreasingly by contrast score. Each object in the array corresponds to one input document and contains the following properties:

id
Internal identifier of the document.
score, weight
Contrast score of the document. Both the score and weight properties contain the same value.
confidence

Summarizes the quality of the contrast score of this document. Confidence is 1.0 if all the available context documents were eligible for contrast score computation. If force​Symmetrical​Context is true, confidence may be lower than 1.0 to indicate that Lingo4G had to ignore some of the context documents to make the "before" and "after" context pools contain equal numbers of documents.

Lingo4G computes the confidence factor using the following formula:

confidence = (similar_before + similar_after) / similar_total, where similar_total is the number of similar context documents found for the input document before any were ignored to keep the context symmetrical.
balance

Summarizes the quality of the "before" and "after" context of this document. Balance is 1.0 if the pools of "before" and "after" context documents for this input document are equal. If the pools are not equal, balance is less than 1.0 and falls to 0.0 if any of the context document pools is empty.

Lingo4G computes the balance factor using the following formula:

balance = 1 - |context_before - context_after| / (context_before + context_after)
before, after

Statistics about the "before" and "after" pools of documents for this input document.

similar

The number of similar documents in the respective context pool.

similarity

The sum of input-to-context document similarities in the respective context pool.

context

The total number of context documents in the respective pool.

Note that this number will usually be larger than the similar property because not all documents in the context pool are similar to the input document. (In fact, most documents in the context pool are not.)

Example request

The following request computes contrast scores for arXiv papers published in 2014.

{
  "name": "Contrast scores (embeddings)",
  "components": {
    "similarities": {
      "type": "matrixRows:knnVectorsSimilarity",
      "vectors": {
        "rows": {
          "type": "vectors:precomputedDocumentEmbeddings",
          "documents": {
            "type": "documents:reference",
            "use": "documents"
          }
        },
        "columns": {
          "type": "vectors:precomputedDocumentEmbeddings",
          "documents": {
            "type": "documents:reference",
            "use": "context"
          }
        }
      },
      "maxNeighbors": 200
    }
  },
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "created:[2014-01-01 TO 2014-12-31]"
      }
    },
    "context": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "created:[2010-01-01 TO 2018-12-31]"
      },
      "limit": "unlimited"
    },
    "documentTimestamps": {
      "type": "values:fromDocumentField",
      "documents": {
        "type": "documents:reference",
        "use": "documents"
      },
      "fieldName": "created"
    },
    "contextTimestamps": {
      "type": "values:fromDocumentField",
      "documents": {
        "type": "documents:reference",
        "use": "context"
      },
      "fieldName": "created"
    },
    "scores": {
      "type": "documents:contrastScore",
      "documents": {
        "type": "documents:reference",
        "use": "documents"
      },
      "matrixRows": {
        "type": "matrixRows:reference",
        "use": "similarities"
      },
      "documentTimestamps": {
        "type": "values:reference",
        "use": "documentTimestamps"
      },
      "contextTimestamps": {
        "type": "values:reference",
        "use": "contextTimestamps"
      },
      "minSimilarDocuments": 50,
      "forceSymmetricalContext": true,
      "limit": 20
    },
    "content": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "scores"
      },
      "fields": {
        "type": "contentFields:grouped",
        "groups": [
          {
            "fields": [
              "title",
              "abstract"
            ]
          }
        ]
      }
    }
  },
  "output": {
    "stages": [
      "scores",
      "content"
    ]
  }
}

The documents stage selects the documents for which to compute contrast scores. Our request uses a range query to select all documents created in 2014. The context stage selects the context documents, which the contrast score algorithm splits into "before" and "after" pools. Our request uses a window of +/- 4 years around 2014 for the context, so it selects papers created between 2010 and 2018.

The similarities component defines the rows of the similarity matrix between the input and context documents. It uses matrixRows:knnVectorsSimilarity to select the 200 most similar context documents for each input document. Our request passes the component a reference to the input documents to be used as rows of the similarity matrix, and a reference to the context documents to be used as matrix columns. As an alternative to embedding-based similarities, you could use matrixRows:keywordDocumentSimilarity, which does not require document embeddings to be present in the index, but takes much longer to compute.

The documentTimestamps and contextTimestamps stages use values:fromDocumentField to retrieve values of the created field for the input and context documents. The created field contains paper creation dates in the YYYY-MM-DD format, which is suitable for lexicographic comparisons.

The scores stage uses documents:contrastScore to compute the scores. The request passes the required data (the input documents, the similarity matrix rows, and the two sets of time stamps) as explicit references to the stages and components defined earlier in the request.

Finally, the content stage retrieves titles and abstracts for the documents with the highest contrast scores.

If you run the request in the JSON Sandbox app, you should receive a response similar to the following JSON:

{
  "result" : {
    "scores" : {
      "documents" : [
        {
          "id" : 167063,
          "score" : 1.0,
          "weight" : 1.0,
          "confidence" : 0.51,
          "balance" : 1.0,
          "before" : {
            "similarity" : 0.0,
            "similar" : 0,
            "context" : 35637
          },
          "after" : {
            "similarity" : 91.219955,
            "similar" : 102,
            "context" : 35637
          }
        }
      ],
      "matches" : {
        "value" : 10000,
        "relation" : "EXACT"
      }
    },
    "content" : {
      "documents" : [
        {
          "id" : 167063,
          "fields" : {
            "title" : {
              "values" : [
                "Face Detection with a 3D Model"
              ]
            },
            "abstract" : {
              "values" : [
                " This paper presents a part-based face detection approach where the spatial relationship between the face parts is represented by a hidden 3D model with six parameters. The computational complexity of the search in the six dimensional pose space is…"
              ]
            }
          }
        },
        {
          "id" : 199806,
          "fields" : {
            "title" : {
              "values" : [
                "Multiple Object Tracking: A Literature Review"
              ]
            },
            "abstract" : {
              "values" : [
                " Multiple Object Tracking (MOT) has gained increasing attention due to its academic and commercial potential. Although different approaches have been proposed to tackle this problem, it still remains challenging due to factors like abrupt appearance…"
              ]
            }
          }
        },
        {
          "id" : 227568,
          "fields" : {
            "title" : {
              "values" : [
                "Ambiguous Proximity Distribution"
              ]
            },
            "abstract" : {
              "values" : [
                " Proximity Distribution Kernel is an effective method for bag-of-featues based image representation. In this paper, we investigate the soft assignment of visual words to image features for proximity distribution. Visual word contribution function is…"
              ]
            }
          }
        }
      ]
    }
  }
}

JSON response to the contrast score computation query.

The scores stage result contains the contrast scores and related statistics for each document. The sample response above contains only one document in the array; real-world responses contain up to limit results.

The content stage result shows the titles and abstracts of the three documents with top scores. Notice how they revolve around deep learning, which was a new hot topic around that time.

Notes

  • Not for real-time trend detection. Currently, Lingo4G can compute contrast scores only when it has access to documents that both precede and follow the document in question. Due to this, the method is useful only ex post: it is not suitable for novelty detection in real-time.

  • Provide a suitable window of context documents. For best results, ensure that the context documents fall in a symmetrical window centered around the period of input documents. For example, if you compute contrast score for papers written in 2015, make the context documents cover a period of +/- 3, 4 or 5 years.

  • Examine contrast score confidence. A contrast score close to 1.0 does not always mean a document contains innovative ideas. For example, when there is only one "before" document available in the context pool, a score close to 1.0 is ill-founded.

    Therefore, always examine the confidence of the score. As a rule of thumb, if the confidence is below 0.2, this means the high contrast score is probably ill-founded. Consider setting a non-zero min​Similar​Documents to filter out such documents from scoring. Alternatively, increase the period of time covered by the context documents and see if this improves the confidence of the contrast scores.

context​Timestamps

Type
values
Default
null
Required
yes

Time stamps of the context documents to use for contrast score computation.

In typical cases, you can use the values:​from​Document​Field stage to collect values of a specific document field, such as creation date, to serve as context time stamps.

Use the same set of context documents to compute the time stamp values and the columns of the similarity matrix​Rows. If time stamps don't match similarity matrix columns, Lingo4G throws an error.

The time stamp values must be strings or numbers. Lingo4G compares numeric time stamps using natural order and string time stamps using lexicographic order.

document​Timestamps

Type
values
Default
null
Required
yes

Time stamps of the input documents to use for contrast score computation.

In typical cases, you can use the values:​from​Document​Field stage to collect values of a specific document field, such as creation date, to serve as input document time stamps.

Use the same set of context documents to compute the time stamp values and the rows of the similarity matrix​Rows. If time stamps don't match similarity matrix rows, Lingo4G throws an error.

The time stamp values must be strings or numbers. Lingo4G compares numeric time stamps using natural order and string time stamps using lexicographic order.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The input documents for which to compute contrast scores.

Provide the same documents as rows in the similarity matrix​Rows computation and in the document​Timestamps value collection. If the document sets don't match, Lingo4G throws an error.

force​Symmetrical​Context

Type
boolean
Default
true
Required
no

Ignores certain context documents to keep the window of "before" and "after" context documents symmetrical and centered around the input document.

When you compute contrast scores for a set of documents spanning, for example, one year, you will likely use queries similar to created:​[2014-01-01 ​T​O 2014-12-31] and created:​[2010-01-01 ​T​O 2018-12-31] for input and context documents. If you take a specific input document published on 2014-01-01, the entire context document window will not be perfectly centered around that document – the "after" part of the window is larger than the "before" part.

If you set force​Symmetrical​Context to true, Lingo4G discards some of the context documents to keep the context window symmetrical. Note that this may lower the contrast score confidence.

limit

Type
limit
Default
10000
Required
no

The number of top-scoring documents to return.

matrix​Rows

Type
matrixRows
Default
{
  "type": "matrixRows:reference",
  "auto": true
}
Required
no

Defines similarities between input and context documents for contrast score computation.

In most cases, you can use the matrixRows:knnVectorsSimilarity or matrixRows:keywordDocumentSimilarity components to compute the similarities.

Provide input documents as rows and the context documents as columns for the similarity matrix rows computation.

Regardless of which matrix​Rows component you choose, set its max​Neighbors property to at least 100 for meaningful contrast scores.

min​Similar​Documents

Type
integer
Default
0
Constraints
value >= 0
Required
no

The minimum number of documents in the context window required for contrast score computation.

Lingo4G ignores input documents that have fewer than minSimilarDocuments context documents in their context window. If you see high-contrast-score documents with low confidence, increase minSimilarDocuments above zero to filter out documents with such low-quality scores. A good starting point is to set minSimilarDocuments to half of the maxNeighbors value you used for the similarity matrix rows computation.
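
For example, if you computed the similarity matrix rows with maxNeighbors set to 200, as in the contrast score example above, a sketch of the corresponding setting would be:

{
  "type": "documents:contrastScore",
  "minSimilarDocuments": 100
}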

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Controls the order of documents on output.

Lingo4G sorts the output documents based on their contrast score. The default value of D​E​S​C​E​N​D​I​N​G puts the documents with the highest contrast score first. To see the documents with the lowest contrast score, set sort​Order to A​S​C​E​N​D​I​N​G. Finally, if you set sort​Order to U​N​S​P​E​C​I​F​I​E​D, Lingo4G returns the input documents in their original order.

documents:​embedding​Nearest​Neighbors

Selects documents whose embedding vectors are most similar to the vector you provide.

{
  "type": "documents:embeddingNearestNeighbors",
  "failIfEmbeddingsNotAvailable": true,
  "filterQuery": {
    "type": "query:all"
  },
  "limit": 100,
  "searcher": "AUTO",
  "vector": {
    "type": "vector:reference",
    "auto": true
  }
}

See the document selection tutorial for examples of document- and label-based selection of embedding-wise similar documents.

fail​If​Embeddings​Not​Available

Type
boolean
Default
true
Required
no

Determines the behavior of this stage if the index does not contain document embeddings.

If the index does not contain document embeddings and fail​If​Embeddings​Not​Available is:

true
this stage fails and logs an error.
false
this stage returns an empty set of document embeddings.

If your request combines keyword- and embedding-based processing, you can set fail​If​Embeddings​Not​Available to false to have Lingo4G degrade gently to keyword-based processing if the index does not contain document embeddings.

filter​Query

Type
query
Default
{
  "type": "query:all"
}
Required
no

Narrows down the returned documents to those matching the query you provide.

If you provide the filterQuery property, Lingo4G narrows down the results of this stage to documents that match the query.

For example, the following request limits the results of embedding-based document selection to arXiv papers in the cs.* category.

{
  "stages": {
    "similarDocuments": {
      "type": "documents:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:documentEmbedding",
        "documents": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "photon"
          },
          "limit": 1
        }
      },
      "filterQuery": {
        "type": "query:string",
        "query": "category:cs.*"
      }
    }
  }
}

Using the filterQuery property to narrow down the results of the documents:embeddingNearestNeighbors stage to documents matching a query.

limit

Type
limit
Default
100
Required
no

The maximum number of documents to select.

searcher

Type
string
Default
"AUTO"
Constraints
one of [AUTO, APPROXIMATE, COMPLETE]
Required
no

Determines the document searching algorithm.

Lingo4G can use one of two algorithms to find documents whose embedding vectors lie closely to the input vector you provide. The searcher property determines the algorithm to use.

A​U​T​O

Automatic algorithm choice based on the number of documents to select and the number of documents matching the filter​Query. Use automatic algorithm selection unless you notice this stage performs slowly for a specific search.

A​P​P​R​O​X​I​M​A​T​E

Forces Lingo4G to use the approximate search algorithm, which traverses a graph of similar vectors. This algorithm is only efficient for searches with low limit values or searches without results filtering.

C​O​M​P​L​E​T​E

Forces Lingo4G to perform a complete search of all document embedding vectors. If you notice slow performance of a search under the A​U​T​O searcher, try the C​O​M​P​L​E​T​E searcher, which may offer better performance for that particular search.

vector

Type
vector
Default
{
  "type": "vector:reference",
  "auto": true
}
Required
no

The input vector for the similar document search.

You can use various vector:* components as the source for this property.

See the document selection tutorial for examples of document- and label-based selection of embedding-wise similar documents.

documents:​from​Cluster​Exemplars

Collects highest-weight top-level exemplars of the document clusters you provide into a flat document list.

{
  "type": "documents:fromClusterExemplars",
  "clusters": {
    "type": "clusters:reference",
    "auto": true
  },
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "limit": 10000,
  "sortOrder": "DESCENDING"
}

You can use this stage, combined with clusters:​ap, which clusters documents into related groups, to reduce a large collection of documents into a much smaller set of salient documents representing different themes present in the original collection.

Another use case is to combine this stage with the clusters:fromMatrixColumns stage to process the results of synthetic clustering of matrix columns.
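
For example, the following fragment (a sketch) collects up to 100 highest-weight exemplars, assuming the request defines a documentClusters stage (for instance, a clusters:ap stage) and a documents stage that gave rise to those clusters:

"exemplars": {
  "type": "documents:fromClusterExemplars",
  "clusters": {
    "type": "clusters:reference",
    "use": "documentClusters"
  },
  "documents": {
    "type": "documents:reference",
    "use": "documents"
  },
  "limit": 100
}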

clusters

Type
clusters
Default
{
  "type": "clusters:reference",
  "auto": true
}
Required
no

The clusters from which to collect exemplars.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents that gave rise to the input clusters.

The input clusters and documents must be compatible: the clusters must have been created based on the documents. Otherwise, Lingo4G logs an error.

limit

Type
limit
Default
10000
Required
no

The maximum number of exemplar documents to collect.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Determines the order in which to collect document exemplars.

A​S​C​E​N​D​I​N​G

Collects up to limit exemplar documents with the lowest exemplar weight values.

D​E​S​C​E​N​D​I​N​G

Collects up to limit exemplar documents with the highest exemplar weight values.

U​N​S​P​E​C​I​F​I​E​D

Collects up to limit exemplar documents in the order they appear in the cluster list.

documents:​from​Cluster​Members

Collects members of the document clusters you provide into a flat document list.

{
  "type": "documents:fromClusterMembers",
  "clusters": {
    "type": "clusters:reference",
    "auto": true
  },
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "limit": 10000,
  "sortOrder": "DESCENDING"
}

clusters

Type
clusters
Default
{
  "type": "clusters:reference",
  "auto": true
}
Required
no

The clusters from which to collect document members.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents that gave rise to the input clusters.

The input clusters and documents must be compatible: the clusters must have been created based on the documents. Otherwise, Lingo4G logs an error.

limit

Type
limit
Default
10000
Required
no

The maximum number of member documents to collect.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Determines the order in which to collect document members.

A​S​C​E​N​D​I​N​G

Collects up to limit member documents with the lowest member weight values.

D​E​S​C​E​N​D​I​N​G

Collects up to limit member documents with the highest member weight values.

U​N​S​P​E​C​I​F​I​E​D

Collects up to limit member documents in the order they appear in the cluster list.

documents:​from​Document​Pairs

Converts the list of document pairs you provide into a flat list of unique documents occurring in at least one input pair.

{
  "type": "documents:fromDocumentPairs",
  "documentPairs": {
    "type": "documentPairs:reference",
    "auto": true
  }
}

You can combine this stage with document​Content to fetch contents of documents involved in at least one of the pairs:

"content": {
  "type": "documentContent",
  "limit": "unlimited",
  "documents": {
    "type": "documents:fromDocumentPairs",
    "documentPairs": {
      "type": "documentPairs:reference",
      "use": "similarPairs"
    }
  },
  "fields": {
    "type": "contentFields:simple",
    "fields": {
      "id": {},
      "title": {},
      "author_name": {},
      "created": {},
      "updated": {},
      "abstract": {
        "maxValueLength": 250
      }
    }
  }
}

document​Pairs

Type
documentPairs
Default
{
  "type": "documentPairs:reference",
  "auto": true
}
Required
no

The document pairs to convert into a flat document list.

documents:​from​Matrix​Columns

Given a matrix with columns corresponding to documents, selects the top-scoring columns and returns them as a document list.

{
  "type": "documents:fromMatrixColumns",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "limit": 10000,
  "matrixRows": {
    "type": "matrixRows:reference",
    "auto": true
  },
  "sortOrder": "DESCENDING",
  "weightAggregation": "SUM"
}

This stage performs the following steps:

  1. For each column of the input matrix​Rows, aggregate the column's values using the weight​Aggregation function.

  2. Sort columns by their aggregated value computed in step 1, according to the sort​Order.

  3. Return a list of documents corresponding to up to limit first columns on the sorted list.

You can use the documents:​from​Matrix​Columns stage to select top-scoring documents where the score is an aggregation of a number of values. For example, if you build matrix​Rows of cross-similarities between a set of cs.* and physics.* arXiv papers, documents:​from​Matrix​Columns can reveal the top physics.* papers that are most similar to cs.* papers, showing where the two areas overlap.
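
A sketch of that use case, reusing the matrixRows:knnVectorsSimilarity setup from the contrast score example above (the queries, stage names and limits are illustrative):

{
  "stages": {
    "csPapers": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "category:cs.*"
      }
    },
    "physicsPapers": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "category:physics.*"
      }
    },
    "crossSimilarities": {
      "type": "matrixRows:knnVectorsSimilarity",
      "vectors": {
        "rows": {
          "type": "vectors:precomputedDocumentEmbeddings",
          "documents": {
            "type": "documents:reference",
            "use": "csPapers"
          }
        },
        "columns": {
          "type": "vectors:precomputedDocumentEmbeddings",
          "documents": {
            "type": "documents:reference",
            "use": "physicsPapers"
          }
        }
      }
    },
    "overlap": {
      "type": "documents:fromMatrixColumns",
      "matrixRows": {
        "type": "matrixRows:reference",
        "use": "crossSimilarities"
      },
      "documents": {
        "type": "documents:reference",
        "use": "physicsPapers"
      },
      "limit": 100,
      "weightAggregation": "SUM"
    }
  }
}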

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents that correspond to columns of the input matrix rows.

Make sure that the documents you provide in this property also gave rise to the columns of the input matrix​Rows. If the two are incompatible, Lingo4G logs an error.

limit

Type
limit
Default
10000
Required
no

The maximum number of documents to select.

matrix​Rows

Type
matrixRows
Default
{
  "type": "matrixRows:reference",
  "auto": true
}
Required
no

The matrix rows whose columns to aggregate.

Make sure that the documents you provide gave rise to the columns of the input matrix​Rows. If the two are incompatible, Lingo4G logs an error.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Determines the sorting order for the aggregated column values.

A​S​C​E​N​D​I​N​G

Collects up to limit documents corresponding to columns with the smallest aggregated values.

D​E​S​C​E​N​D​I​N​G

Collects up to limit documents corresponding to columns with the largest aggregated values.

U​N​S​P​E​C​I​F​I​E​D

Collects up to limit documents in the order in which their corresponding columns appear in the input matrixRows.

weight​Aggregation

Type
weightAggregation
Default
"SUM"
Required
no

The column value aggregation function.

documents:​rwmd

Computes an approximation of the Relaxed Word Mover's Distance between the documents and labels you provide.

{
  "type": "documents:rwmd",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "failIfEmbeddingsNotAvailable": true,
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labels": {
    "type": "labels:reference",
    "auto": true
  }
}

Relaxed Word Mover's Distance (RWMD) aims to compute similarities between documents using the multidimensional embedding vectors of the words appearing in the documents. Lingo4G's formulation computes the similarity between a list of labels and a list of documents you provide.

For each document in the document list, Lingo4G computes the RWMD similarity in the following way:

  1. Collect all labels occurring in the document.

  2. For each of the input labels, find the document's label with the highest embedding-wise similarity.

  3. Compute the document's RWMD score as the search score of the document against the union of the input labels and the labels found in step 2.

The above formulation makes it possible to compare RWMD scores with the regular keyword search scores you get when combining the documents:byQuery stage with the query:forLabels query component. Therefore, a typical use case for the documents:rwmd stage is to compute a unified score for keyword- and embedding-based similar document searches (MLT).
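
For example, a sketch of a request that scores the matches of a keyword query against two manually provided labels (the query and labels are illustrative):

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      }
    },
    "rwmd": {
      "type": "documents:rwmd",
      "documents": {
        "type": "documents:reference",
        "use": "documents"
      },
      "labels": {
        "type": "labels:direct",
        "labels": [
          { "label": "quantum optics" },
          { "label": "photon entanglement" }
        ]
      }
    }
  }
}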

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The input documents for which to compute the RWMD score.

fail​If​Embeddings​Not​Available

Type
boolean
Default
true
Required
no

Determines the behavior of this stage if the index does not contain document embeddings.

If the index does not contain document embeddings and fail​If​Embeddings​Not​Available is:

true
this stage fails and logs an error.
false
this stage returns documents with weight values equal to the weights of the input documents.

If your request combines keyword- and embedding-based processing, you can set fail​If​Embeddings​Not​Available to false to have Lingo4G degrade gently to keyword-based processing if the index does not contain document embeddings.

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

Determines the document feature fields to use for label collection and document scoring.

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

Performs filtering of labels collected from individual documents.

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

The labels against which to score the input documents.

documents:​sample

Returns a uniform sample of documents returned by the provided query.

{
  "type": "documents:sample",
  "limit": 10000,
  "query": null,
  "randomSeed": 0,
  "samplingRatio": 1
}

limit

Type
limit
Default
10000
Required
no

The maximum number of documents to select.

Value must be an integer >= 0 or the string unlimited.

query

Type
query
Default
null
Required
yes

The query whose matching documents to sample. Use one of the query:* components.

random​Seed

Type
integer
Default
0
Required
no

The random seed to use for sampling.

sampling​Ratio

Type
number
Default
1
Constraints
value >= 0 and value <= 1
Required
no

The sampling ratio between 0 (exclusive) and 1 (inclusive). The documents:​sample component will attempt to return a uniform sample of size sampling​Ratio * source​Document​Count documents.
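
For example, a sketch of a stage that draws a 10% uniform sample of the documents matching a query (the query and ratio are illustrative):

{
  "type": "documents:sample",
  "query": {
    "type": "query:string",
    "query": "category:cs.*"
  },
  "samplingRatio": 0.1,
  "limit": "unlimited"
}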

documents:​scored

Computes new weights for the set of documents you provide based on the documentScorer:* component of your choice.

{
  "type": "documents:scored",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "limit": 10000,
  "scorer": {
    "type": "documentScorer:reference",
    "auto": true
  },
  "sortOrder": "DESCENDING"
}

By default, this stage re-orders the documents in decreasing order of the new score and returns up to limit top-scoring documents.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents for which to compute new weights.

limit

Type
limit
Default
10000
Required
no

The maximum number of top-scoring documents to return.

scorer

Type
documentScorer
Default
{
  "type": "documentScorer:reference",
  "auto": true
}
Required
no

The scoring component to use to compute new document weights.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Controls the order of the documents on output. Lingo4G sorts the documents using the weight computed by the scorer you provide.

See sort​Order for the list of possible sorting orders.

documents:​vector​Field​Nearest​Neighbors

Selects documents that are most similar to the vector you provide. This stage uses vector fields, whose vector data must be provided from outside Lingo4G and indexed together with other document fields.

{
  "type": "documents:vectorFieldNearestNeighbors",
  "fieldName": null,
  "filterQuery": {
    "type": "query:all"
  },
  "limit": 100,
  "vector": {
    "type": "vector:reference",
    "auto": true
  }
}

field​Name

Type
project:vectorFields
Default
null
Required
yes

Document field containing external vector data added during indexing.

filter​Query

Type
query
Default
{
  "type": "query:all"
}
Required
no

Narrows down the returned documents to those that match the query you provide, in addition to being similar to the input vector.

limit

Type
limit
Default
100
Required
no

The maximum number of documents to return.

vector

Type
vector
Default
{
  "type": "vector:reference",
  "auto": true
}
Required
no

The input vector for the nearest-neighbor similarity search.

You can use various vector:* components as the source for this property.
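
For example, a sketch of a similar-document search against an externally indexed vector field. The embedding field name is hypothetical and must match a vector field defined in your project; any additional configuration of the vector:fromVectorField component is omitted here:

{
  "stages": {
    "similarDocuments": {
      "type": "documents:vectorFieldNearestNeighbors",
      "fieldName": "embedding",
      "vector": {
        "type": "vector:fromVectorField",
        "documents": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "photon"
          },
          "limit": 1
        }
      },
      "limit": 50
    }
  }
}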

Consumers of documents:​*

The following stages and components take documents:​* as input:

Stage or component and the properties that accept documents:* input:

clusters:withRemappedDocuments
  • exemplarsFrom
  • exemplarsTo
  • membersFrom
  • membersTo
debug:explain
  • documents
documentContent
  • documents
documentLabels
  • documents
documentPairs:all
  • documents
documents:byWeight
  • documents
documents:byWeightMass
  • documents
documents:composite
  • selectors
documents:contrastScore
  • documents
documents:fromClusterExemplars
  • documents
documents:fromClusterMembers
  • documents
documents:fromMatrixColumns
  • documents
documents:rwmd
  • documents
documents:scored
  • documents
labelClusters:documentClusterLabels
  • documents
labelScorer:df
  • scope
labelScorer:idf
  • scope
labelScorer:probabilityRatio
  • baseScope
  • referenceScope
labelScorer:tf
  • scope
labels:fromDocuments
  • documents
matrix:cooccurrenceLabelSimilarity
  • documents
matrix:keywordDocumentSimilarity
  • documents
matrix:keywordLabelDocumentSimilarity
  • documents
matrixRows:byQuery
  • rows
  • columns
matrixRows:keywordDocumentSimilarity
  • documents
query:forDocumentFields
  • documents
query:fromDocuments
  • documents
stats:documents
  • documents
values:fromDocumentField
  • documents
vector:documentEmbedding
  • documents
vector:fromVectorField
  • documents
vectors:fromVectorField
  • documents
vectors:precomputedDocumentEmbeddings
  • documents