documents

The documents:​* stages group various ways of producing lists of documents. You can feed the results of the documents stage as input to another stage, such as document content retrieval, collecting labels from documents or creating a matrix of similarities between documents.

You can use the following documents stages in your analysis requests:

documents:​by​Id

Selects documents by their internal identifiers.

documents:​by​Query

Selects documents matching the query you provide.

documents:​composite

Takes a union or intersection of the document lists you provide.

documents:​contrast​Score

Computes contrast score for the documents you provide. For certain collections, contrast score may reveal documents that introduce novel concepts, not seen in preceding documents.

documents:​embedding​Nearest​Neighbors

Selects documents that are most similar to the multidimensional vector you provide.

documents:​from​Cluster​Exemplars

Collects exemplars of the document clusters you provide into a flat document list.

documents:​from​Cluster​Members

Collects members of the document clusters you provide into a flat document list.

documents:​from​Document​Pairs

Converts the list of document pairs you provide into a flat list of unique documents occurring in at least one input pair.

documents:​from​Matrix​Columns

Given a matrix with columns corresponding to documents, selects the top-scoring columns and returns them as a document list.

documents:​rwmd

Computes an approximation of the Relaxed Word Movers Distance between the documents and labels you provide.

documents:​sample

Applies random sampling to the documents you provide.


documents:​reference

References the results of another documents:​* stage defined in the request.


The JSON output of a documents stage has the following outline structure:

{
  "matches": {
    "value": 1008,
    "relation": "GREATER_OR_EQUAL"
  },
  "documents": [
    {
      "id": 188201,
      "weight": 6.808791
    },
    {
      "id": 62168,
      "weight": 6.7709503
    },
    {
      "id": 252264,
      "weight": 6.6507382
    }
  ]
}

The documents array is mandatory for all implementations and consists of objects with the following fields:

id
Internal identifier of the document. Internal document identifiers may change between subsequent commits and reindexing runs.
weight
Weight of the document. The semantics of the weight depend on the specific stage.

Different implementations may return additional properties. Here, the documents:​by​Query stage returned an additional matches element which contains the number of matching documents and whether this number is approximate or exact.

documents:​by​Id

Returns documents matching the provided internal identifiers. This component can be helpful for debugging or when internal identifiers are returned from another source. Note that internal document identifiers need not be contiguous and can change after document or label reindexing.

{
  "type": "documents:byId",
  "documents": []
}

documents

Type
array of object
Default
[]
Required
no

An array of objects, each with an id property pointing at an internal document identifier. For example:

{
  "stages": {
    "documentsById": {
      "type": "documents:byId",
      "documents": [
        { "id": 1 },
        { "id": 2 },
        { "id": 10221 },
        { "id": 18548 }
      ]
    }
  }
}

Identifiers corresponding to non-existing or deleted documents will cause an error.
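
When identifiers come from another source, such as the documents array of an earlier response, you can assemble the stage programmatically. The following Python sketch is illustrative only: the by_id_stage helper is a hypothetical name, not part of Lingo4G, and it uses only the type and documents properties described above.

```python
import json


def by_id_stage(ids):
    """Build a documents:byId stage from an iterable of internal identifiers."""
    return {
        "type": "documents:byId",
        "documents": [{"id": doc_id} for doc_id in ids],
    }


# Reuse the identifiers found in an earlier documents:* response.
earlier_response = {
    "documents": [
        {"id": 188201, "weight": 6.808791},
        {"id": 62168, "weight": 6.7709503},
    ]
}
stage = by_id_stage(doc["id"] for doc in earlier_response["documents"])
print(json.dumps({"stages": {"documentsById": stage}}, indent=2))
```

Because internal identifiers can change after reindexing, it is safest to treat such stored identifiers as short-lived.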

documents:​by​Query

Returns documents matching the provided query.

{
  "type": "documents:byQuery",
  "accurateHitCount": false,
  "limit": 10000,
  "query": null,
  "requireScores": true
}

Use the documents:​by​Query stage to select documents matching a query provided by the user. One common query type is query:​string, which parses the Apache Lucene query syntax.

documents:​by​Query returns the following JSON structure:

{
  "matches": {
    "value": 1008,
    "relation": "GREATER_OR_EQUAL"
  },
  "documents": [
    {
      "id": 188201,
      "weight": 6.808791
    },
    {
      "id": 62168,
      "weight": 6.7709503
    },
    {
      "id": 252264,
      "weight": 6.6507382
    }
  ]
}
matches
Information about the total number of documents matching the query.
value
Total number of documents matching the query. Note that this number may be approximate and larger than the number of documents actually returned in the documents array.
relation
Indicates whether the total number of documents is exact or an approximation.
E​X​A​C​T
value contains the exact number of matches.
G​R​E​A​T​E​R_​O​R_​E​Q​U​A​L
value is a lower bound on the number of matches. To force Lingo4G to compute the exact number of matches, set the accurate​Hit​Count property to true.
documents
Array of selected documents. The weight property contains the search score returned by Lucene for the specific document. Length of the array is not greater than limit.
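
Client code that reads this response needs to distinguish an exact count from a lower bound. The Python sketch below shows one way to do so; describe_matches is a hypothetical helper name, and the sketch assumes only the two relation values documented above.

```python
def describe_matches(result):
    """Turn the matches element of a documents:byQuery result into text."""
    matches = result["matches"]
    if matches["relation"] == "EXACT":
        return f"exactly {matches['value']} matching documents"
    if matches["relation"] == "GREATER_OR_EQUAL":
        return f"at least {matches['value']} matching documents"
    # Only the two relations above are documented; fail loudly on anything else.
    raise ValueError(f"unexpected relation: {matches['relation']}")


approximate = {"matches": {"value": 1008, "relation": "GREATER_OR_EQUAL"}, "documents": []}
exact = {"matches": {"value": 14288, "relation": "EXACT"}, "documents": []}
print(describe_matches(approximate))
print(describe_matches(exact))
```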

accurate​Hit​Count

Type
boolean
Default
false
Required
no

If true, the returned number of matching documents is guaranteed to be accurate. Otherwise, it may be an approximation. Accurate results are typically more costly to compute.

Here is an example stage requesting an approximate total:

{
  "type": "documents:byQuery",
  "limit": 3,
  "query": {
    "type": "query:string",
    "query": "photon"
  },
  "accurateHitCount": false
}

The output for the above stage is:

{
  "matches": {
    "value": 1008,
    "relation": "GREATER_OR_EQUAL"
  },
  "documents": [
    {
      "id": 188201,
      "weight": 6.808791
    },
    {
      "id": 62168,
      "weight": 6.7709503
    },
    {
      "id": 252264,
      "weight": 6.6507382
    }
  ]
}

Compare the above to the result below, when an accurate hit count is requested:

{
  "matches": {
    "value": 14288,
    "relation": "EXACT"
  },
  "documents": [
    {
      "id": 188201,
      "weight": 6.808791
    },
    {
      "id": 62168,
      "weight": 6.7709503
    },
    {
      "id": 252264,
      "weight": 6.6507382
    }
  ]
}

limit

Type
limit
Default
10000
Required
no

The maximum number of documents to select.

Value must be an integer >= 0 or the string unlimited, in which case the stage returns all matches.

If limit is smaller than the number of documents matching the query, the stage returns the top-scoring documents.

If limit is zero, Lingo4G computes the number of documents matching the query and returns the result in the matches section of the response, leaving the documents array empty. Counting the number of matches is often faster than selecting the identifiers of matching documents, so if you only want to count query matches, set limit to 0.
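
A count-only stage can be assembled as in the Python sketch below. The count_only_stage helper is a hypothetical name used for illustration; the properties it sets (limit, accurateHitCount, query) are the ones described in this section.

```python
import json


def count_only_stage(query_string, accurate=False):
    """Build a documents:byQuery stage that counts matches without selecting them."""
    return {
        "type": "documents:byQuery",
        "limit": 0,  # leave the documents array empty
        "accurateHitCount": accurate,  # True requests an exact count
        "query": {"type": "query:string", "query": query_string},
    }


counting_stage = count_only_stage("photon", accurate=True)
print(json.dumps(counting_stage, indent=2))
```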

query

Type
query
Default
null
Required
yes

The query to execute and retrieve matching documents for. The following example uses the query:​string component:

{
  "type": "documents:byQuery",
  "limit": 3,
  "query": {
    "type": "query:string",
    "query": "photon"
  },
  "accurateHitCount": false
}

require​Scores

Type
boolean
Default
true
Required
no

If false, the selector will query document identifiers only (without scores or score-implied sort order). This can be used to accelerate large queries where scores are not used or are irrelevant to the result.

Document order is not guaranteed and may be random in scoreless query mode.

documents:​composite

Takes a union or intersection of the nested document selectors, aggregating weights of identical documents in multiple selectors and sorting documents by those weights.

{
  "type": "documents:composite",
  "operator": "OR",
  "selectors": [],
  "sortOrder": "DESCENDING",
  "weightAggregation": "SUM"
}

operator

Type
string
Default
"OR"
Constraints
one of [OR, AND]
Required
no

Declares the way documents from selectors are combined. The operator property supports the following values:

O​R

Produces the union of all unique documents from all selectors.

A​N​D

Produces the intersection of all documents from all selectors. A document must appear in all selectors to appear in the output.

selectors

Type
array of documents
Default
[]
Required
no

An array of nested documents:* selectors to combine.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Controls the sort order for the output list of documents. Documents are sorted by their weight (score); the sort order can be ascending or descending.

See sort​Order in the documentation of common types for the list of possible values.

weight​Aggregation

Type
weightAggregation
Default
"SUM"
Required
no

Controls how document weights (scores) are aggregated for documents that exist in more than one selector.

See weight​Aggregation in the documentation of common types for the list of possible values.
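
To make the semantics of operator and the default SUM aggregation concrete, here is a minimal Python sketch that works on plain lists of id and weight objects. It is not Lingo4G's implementation: the composite function is hypothetical, it covers SUM aggregation only, and it assumes each selector lists a document at most once.

```python
def composite(selectors, operator="OR", descending=True):
    """Combine document lists by union (OR) or intersection (AND), summing weights."""
    weights = {}  # document id -> summed weight
    seen_in = {}  # document id -> number of selectors containing the document
    for selector in selectors:
        for doc in selector:
            weights[doc["id"]] = weights.get(doc["id"], 0.0) + doc["weight"]
            seen_in[doc["id"]] = seen_in.get(doc["id"], 0) + 1

    if operator == "AND":
        ids = [i for i in weights if seen_in[i] == len(selectors)]
    elif operator == "OR":
        ids = list(weights)
    else:
        raise ValueError(f"unsupported operator: {operator}")

    ids.sort(key=weights.get, reverse=descending)
    return [{"id": i, "weight": weights[i]} for i in ids]


first = [{"id": 1, "weight": 1.0}, {"id": 2, "weight": 2.0}]
second = [{"id": 2, "weight": 0.5}, {"id": 3, "weight": 3.0}]
print(composite([first, second], operator="OR"))
print(composite([first, second], operator="AND"))
```

In this example, document 2 appears in both lists, so it is the only result under AND, and its weight is the sum of its two input weights under either operator.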

documents:​contrast​Score

Computes contrast score for the documents you provide. For certain collections, contrast score may reveal documents that introduce novel concepts, not seen in preceding documents.

{
  "type": "documents:contrastScore",
  "contextTimestamps": null,
  "documentTimestamps": null,
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "forceSymmetricalContext": true,
  "limit": 10000,
  "matrixRows": {
    "type": "matrixRows:reference",
    "auto": true
  },
  "minSimilarDocuments": 0,
  "sortOrder": "DESCENDING"
}

The contrast score of a document depends on how many similar documents precede and follow the document. For example, an arXiv paper that is similar to very few papers published earlier, and at the same time similar to very many papers published later, will have a high contrast score, which may indicate that the paper introduces some novel ideas that inspired follow-up research. Similarly, a paper similar to many preceding papers and at the same time similar to very few succeeding papers will have a low contrast score, which may indicate it does not introduce any novel ideas.

Algorithm

Lingo4G requires the following pieces of data to compute contrast scores:

documents
The list of input documents for which to compute contrast scores.
Context documents

The pool of documents to use as the before / after context. For each input document, Lingo4G computes similarities to all context documents to determine the number of similar documents that precede and follow the document being scored.

You provide the context documents indirectly through the document similarity matrix​Rows property.

matrix​Rows
Rows of the similarity matrix between input and context documents. Rows in the matrix must correspond to input documents, columns must correspond to context documents.
documentTimestamps, contextTimestamps

Time stamps, such as dates, for input and context documents. Lingo4G uses them to determine which context documents were written before and which after the input documents.

Time stamps must be of number or string type. Lingo4G compares numeric time stamps using natural order and string time stamps using lexicographic order.

To compute contrast scores Lingo4G performs the following steps:

  1. For each row from matrixRows, which contains context documents that are most similar to one input document, split the context documents into those that were written before and after the input document. Use natural order for numeric time stamps and lexicographic order for string time stamps.

  2. If force​Symmetrical​Context is true, truncate the larger of the "before" and "after" context pools to make the numbers of documents in both pools equal.

  3. If the total size of the "before" and "after" context document pools is less than min​Similar​Documents, do not compute contrast score for this input document.

    The min​Similar​Documents threshold prevents computation of contrast scores based on very few context documents. In an extreme case, for min​Similar​Documents equal to 0, you could receive a perfect contrast score of 1.0 for just one "after" context document and zero "before" context documents.

  4. Add up similarities of the input document to the "before" and "after" context documents to form the similarity before and similarity after values.

  5. Compute the input document's contrast score as:

    $$\text{score} = \frac{\text{similarity}_{\text{after}} - \text{similarity}_{\text{before}}}{\text{similarity}_{\text{after}} + \text{similarity}_{\text{before}}}$$

    The contrast score can take values in the -1...1 range. Contrast score of -1 means there are no succeeding context documents similar to the input document, so the input document probably does not introduce any novel ideas. Conversely, a contrast score of +1 means that all similar context documents were written after the input document, which may suggest that the input document introduces novel ideas.

  6. Sort the results by contrast score and return the top limit highest-scoring documents.
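
The per-document part of the steps above (steps 1, 3, 4 and 5) can be sketched in Python as follows. This is a simplified illustration, not Lingo4G's code: contrast_score is a hypothetical function, it skips the symmetric truncation of step 2, and it assumes that context documents whose time stamp equals the input document's time stamp belong to neither pool, which the description above does not specify.

```python
def contrast_score(doc_timestamp, neighbors, min_similar_documents=0):
    """Score one input document from one row of the similarity matrix.

    neighbors is a list of (context_timestamp, similarity) pairs. All time
    stamps must be of the same type: numbers or strings.
    """
    # Step 1: split similar context documents into "before" and "after".
    before = [sim for ts, sim in neighbors if ts < doc_timestamp]
    after = [sim for ts, sim in neighbors if ts > doc_timestamp]

    # Step 3: too few context documents, so no score for this document.
    if len(before) + len(after) < min_similar_documents:
        return None

    # Step 4: add up similarities in each pool.
    similarity_before, similarity_after = sum(before), sum(after)
    total = similarity_after + similarity_before
    if total == 0:
        return None  # assumption: the score is undefined without any similarity

    # Step 5: normalized difference in the -1...1 range.
    return (similarity_after - similarity_before) / total


only_followers = contrast_score("2014-06-01", [("2015-01-01", 0.9)])
balanced = contrast_score("2014-06-01", [("2013-01-01", 0.5), ("2015-01-01", 0.5)])
print(only_followers, balanced)
```

The first call reproduces the extreme case mentioned in step 3: with minSimilarDocuments at 0, a single "after" document is enough for a perfect score.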

Results format

The documents:​contrast​Score stage produces the following JSON structure:

{
  "documents": [
    {
      "id": 232694,
      "score": 1,
      "weight": 1,
      "confidence": 0.76,
      "balance": 1,
      "before": {
        "similarity": 0,
        "similar": 0,
        "context": 75631
      },
      "after": {
        "similarity": 134.00526,
        "similar": 152,
        "context": 75631
      }
    }
  ],
  "matches": {
    "value": 10000,
    "relation": "EXACT"
  }
}

JSON response of the documents:​contrast​Score stage.

The documents array contains up to limit input documents, sorted decreasingly by contrast score. Each object in the array corresponds to one input document and contains the following properties:

id
Internal identifier of the document.
score, weight
Contrast score of the document. Both the score and weight properties contain the same value.
confidence

Summarizes the quality of the contrast score of this document. Confidence is 1.0 if all the available context documents were eligible for contrast score computation. If force​Symmetrical​Context is true, confidence may be lower than 1.0 to indicate that Lingo4G had to ignore some of the context documents to make the "before" and "after" context pools contain equal numbers of documents.

Lingo4G computes the confidence factor using the following formula:

$$\text{confidence} = \frac{\text{similar}_{\text{before}} + \text{similar}_{\text{after}}}{\text{context}_{\text{before}}}$$
balance

Summarizes the quality of the "before" and "after" context of this document. Balance is 1.0 if the pools of "before" and "after" context documents for this input document are equal. If the pools are not equal, balance is less than 1.0 and falls to 0.0 if any of the context document pools is empty.

Lingo4G computes the balance factor using the following formula:

$$\text{balance} = 1 - \frac{\left| \text{context}_{\text{before}} - \text{context}_{\text{after}} \right|}{\text{context}_{\text{before}} + \text{context}_{\text{after}}}$$
before, after

Statistics about the "before" and "after" pools of documents for this input document.

similar

The number of similar documents in the respective context pool.

similarity

The sum of input-to-context document similarities in the respective context pool.

context

The total number of context documents in the respective pool.

Note that this number will most of the time be larger than the similar property because not all documents in the context pool are similar to the input document. (In fact, most documents in the context pool are not similar to the input document.)
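
As a cross-check of the balance property, the Python sketch below recomputes it from the context counts of the "before" and "after" pools. It reads the balance definition as one minus the absolute difference of the two counts divided by their sum; the balance function name and the handling of two empty pools are assumptions made for this illustration.

```python
def balance(context_before, context_after):
    """Balance of the "before" and "after" context pools, in the 0...1 range."""
    total = context_before + context_after
    if total == 0:
        return 0.0  # assumption: treat two empty pools as fully unbalanced
    return 1 - abs(context_before - context_after) / total


# Context counts taken from the sample response above.
sample_balance = balance(75631, 75631)
print(sample_balance)
```

With equal pools of 75631 documents each, the function agrees with the balance of 1 reported in the sample response.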

Example request

The following request computes contrast scores for arXiv papers published in 2014.

{
  "name": "Contrast scores (embeddings)",
  "components": {
    "similarities": {
      "type": "matrixRows:knnVectorsSimilarity",
      "vectors": {
        "rows": {
          "type": "vectors:precomputedDocumentEmbeddings",
          "documents": {
            "type": "documents:reference",
            "use": "documents"
          }
        },
        "columns": {
          "type": "vectors:precomputedDocumentEmbeddings",
          "documents": {
            "type": "documents:reference",
            "use": "context"
          }
        }
      },
      "maxNeighbors": 200
    }
  },
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "created:[2014-01-01 TO 2014-12-31]"
      }
    },
    "context": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "created:[2010-01-01 TO 2018-12-31]"
      },
      "limit": "unlimited"
    },
    "documentTimestamps": {
      "type": "values:fromDocumentField",
      "documents": {
        "type": "documents:reference",
        "use": "documents"
      },
      "fieldName": "created"
    },
    "contextTimestamps": {
      "type": "values:fromDocumentField",
      "documents": {
        "type": "documents:reference",
        "use": "context"
      },
      "fieldName": "created"
    },
    "scores": {
      "type": "documents:contrastScore",
      "documents": {
        "type": "documents:reference",
        "use": "documents"
      },
      "matrixRows": {
        "type": "matrixRows:reference",
        "use": "similarities"
      },
      "documentTimestamps": {
        "type": "values:reference",
        "use": "documentTimestamps"
      },
      "contextTimestamps": {
        "type": "values:reference",
        "use": "contextTimestamps"
      },
      "minSimilarDocuments": 50,
      "forceSymmetricalContext": true,
      "limit": 20
    },
    "content": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "scores"
      },
      "fields": {
        "type": "contentFields:grouped",
        "groups": [
          {
            "fields": [
              "title",
              "abstract"
            ]
          }
        ]
      }
    }
  },
  "output": {
    "stages": [
      "scores",
      "content"
    ]
  }
}

The documents stage selects documents for which to compute the contrast score. Our request uses a range query to select all documents created in 2014. The context stage selects the context documents, which the contrast score computation algorithm splits into "before" and "after" pools. Our request uses a window of +/- 4 years for the context, so it selects papers created between 2010 and 2018.

The similarities component defines the rows of the similarity matrix between the input and context documents. It uses matrixRows:knnVectorsSimilarity to select the 200 most similar context documents for each input document. Our request passes to the component a reference to the input documents to be used as rows of the similarity matrix, and a reference to the context documents to be used as matrix columns. As an alternative to embedding-based similarities, you could use matrixRows:keywordDocumentSimilarity, which does not require document embeddings to be present in the index, but takes much longer to compute.

The document​Timestamps and context​Timestamps use values:​from​Document​Field to retrieve values of the created field for input and context documents. The created field contains paper creation dates in the Y​Y​Y​Y-​M​M-​D​D format, which is suitable for lexicographic comparisons.
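
The reason the YYYY-MM-DD format suits lexicographic comparison is that plain string order then coincides with chronological order. The short Python check below illustrates this and contrasts it with a month/day/year format, which is an invented counter-example and not a format used in the request.

```python
from datetime import date

# ISO-style dates: sorting as strings equals sorting as dates.
stamps = ["2014-12-31", "2010-01-01", "2014-01-01", "2018-12-31"]
lexicographic = sorted(stamps)
chronological = sorted(stamps, key=date.fromisoformat)
print(lexicographic == chronological)

# Counter-example: in M/D/YYYY notation, October 1 sorts before September 1.
misordered = sorted(["9/1/2014", "10/1/2014"])
print(misordered)
```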

The scores stage uses documents:contrastScore to compute the scores. The request passes all required data, including the similarities component, as explicit references.

Finally, the content stage retrieves titles and abstracts for the documents with the highest contrast scores.

If you run the request in the JSON Sandbox app, you should receive a response similar to the following JSON:

{
  "result" : {
    "scores" : {
      "documents" : [
        {
          "id" : 232694,
          "score" : 1.0,
          "weight" : 1.0,
          "confidence" : 0.76,
          "balance" : 1.0,
          "before" : {
            "similarity" : 0.0,
            "similar" : 0,
            "context" : 75631
          },
          "after" : {
            "similarity" : 134.00526,
            "similar" : 152,
            "context" : 75631
          }
        }
      ],
      "matches" : {
        "value" : 10000,
        "relation" : "EXACT"
      }
    },
    "content" : {
      "documents" : [
        {
          "id" : 232694,
          "fields" : {
            "title" : {
              "values" : [
                "Training a Multilingual Sportscaster: Using Perceptual Context to Learn Language"
              ]
            },
            "abstract" : {
              "values" : [
                " We present a novel framework for learning to interpret and generate language using only perceptual context as supervision. We demonstrate its capabilities by developing a system that learns to sportscast simulated robot soccer games in both English…"
              ]
            }
          }
        },
        {
          "id" : 254162,
          "fields" : {
            "title" : {
              "values" : [
                "Automatic Tracker Selection w.r.t Object Detection Performance"
              ]
            },
            "abstract" : {
              "values" : [
                " The tracking algorithm performance depends on video content. This paper presents a new multi-object tracking approach which is able to cope with video content variations. First the object detection is improved using Kanade- Lucas-Tomasi (KLT)…"
              ]
            }
          }
        },
        {
          "id" : 254186,
          "fields" : {
            "title" : {
              "values" : [
                "Learning Deep Convolutional Features for MRI Based Alzheimer's Disease Classification"
              ]
            },
            "abstract" : {
              "values" : [
                " Effective and accurate diagnosis of Alzheimer's disease (AD) or mild cognitive impairment (MCI) can be critical for early treatment and thus has attracted more and more attention nowadays. Since first introduced, machine learning methods have been…"
              ]
            }
          }
        }
      ]
    }
  }
}

JSON response to the contrast score computation query.

The scores stage result contains the contrast score and related statistics for each document. The sample response above shows only one document in the array; real-world responses contain up to limit results.

The content stage result shows titles and abstracts of the three documents with top scores. Notice how they revolve around deep learning, which was a new hot topic around that time.

Notes

  • Not for real-time trend detection. Currently, Lingo4G can compute contrast scores only when it has access to documents that both precede and follow the document in question. Due to this, the method is useful only ex post: it is not suitable for novelty detection in real-time.

  • Provide a suitable window of context documents. For best results, ensure that the context documents fall in a symmetrical window centered around the period of input documents. For example, if you compute contrast score for papers written in 2015, make the context documents cover a period of +/- 3, 4 or 5 years.

  • Examine contrast score confidence. A contrast score close to 1.0 does not always mean a document contains innovative ideas. For example, when there is only one "before" document available in the context pool, a score close to 1.0 is ill-founded.

    Therefore, always examine the confidence of the score. As a rule of thumb, if the confidence is below 0.2, this means the high contrast score is probably ill-founded. Consider setting a non-zero min​Similar​Documents to filter out such documents from scoring. Alternatively, increase the period of time covered by the context documents and see if this improves the confidence of the contrast scores.
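
The rule of thumb from the last note can be applied on the client side, as in the Python sketch below. The well_founded helper is a hypothetical name; the 0.2 threshold is the one suggested above, and the second document in the sample data is invented for illustration.

```python
def well_founded(documents, min_confidence=0.2):
    """Keep only scored documents whose confidence reaches the threshold."""
    return [doc for doc in documents if doc["confidence"] >= min_confidence]


scored = [
    {"id": 232694, "score": 1.0, "confidence": 0.76},
    {"id": 1, "score": 1.0, "confidence": 0.05},  # invented low-confidence entry
]
kept = well_founded(scored)
print([doc["id"] for doc in kept])
```

Filtering on the client leaves the request unchanged; setting minSimilarDocuments instead prevents such documents from being scored in the first place.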

context​Timestamps

Type
values
Default
null
Required
yes

Time stamps of the context documents to use for contrast score computation.

In typical cases, you can use the values:​from​Document​Field stage to collect values of a specific document field, such as creation date, to serve as context time stamps.

Use the same set of context documents to compute the time stamp values and the columns of the similarity matrix​Rows. If time stamps don't match similarity matrix columns, Lingo4G throws an error.

The time stamp values must be numbers or strings. Lingo4G compares numeric time stamps using natural order and string time stamps using lexicographic order.

document​Timestamps

Type
values
Default
null
Required
yes

Time stamps of the input documents to use for contrast score computation.

In typical cases, you can use the values:​from​Document​Field stage to collect values of a specific document field, such as creation date, to serve as input document time stamps.

Use the same set of input documents to compute the time stamp values and the rows of the similarity matrixRows. If time stamps don't match similarity matrix rows, Lingo4G throws an error.

The time stamp values must be numbers or strings. Lingo4G compares numeric time stamps using natural order and string time stamps using lexicographic order.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The input documents for which to compute contrast scores.

Provide the same documents as rows in the similarity matrix​Rows computation and in the document​Timestamps value collection. If the document sets don't match, Lingo4G throws an error.

force​Symmetrical​Context

Type
boolean
Default
true
Required
no

Ignores certain context documents to keep the window of "before" and "after" context documents symmetrical and centered around the input document.

When you compute contrast scores for a set of documents spanning, for example, one year, you will likely use queries similar to created:​[2014-01-01 ​T​O 2014-12-31] and created:​[2010-01-01 ​T​O 2018-12-31] for input and context documents. If you take a specific input document published on 2014-01-01, the entire context document window will not be perfectly centered around that document – the "after" part of the window is larger than the "before" part.

If you set force​Symmetrical​Context to true, Lingo4G discards some of the context documents to keep the context window symmetrical. Note that this may lower the contrast score confidence.
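
The asymmetry described above is easy to quantify. The Python sketch below measures how many days of context lie before and after a given publication date for the example context window; context_spans is a hypothetical helper, and the sketch only measures the window, it does not reproduce how Lingo4G chooses which context documents to discard.

```python
from datetime import date

# Context window from the example queries above.
CONTEXT_START = date(2010, 1, 1)
CONTEXT_END = date(2018, 12, 31)


def context_spans(published):
    """Return the (before, after) lengths of the context window in days."""
    return (published - CONTEXT_START).days, (CONTEXT_END - published).days


before_days, after_days = context_spans(date(2014, 1, 1))
print(before_days, after_days)  # 1461 1825
```

For a paper published on 2014-01-01, the "after" part of the window is about a year longer than the "before" part; for a paper published on 2014-12-31 the situation is reversed.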

limit

Type
limit
Default
10000
Required
no

The number of top-scoring documents to return.

matrix​Rows

Type
matrixRows
Default
{
  "type": "matrixRows:reference",
  "auto": true
}
Required
no

Defines similarities between input and context documents for contrast score computation.

In most cases, you can use the matrix​Rows:​knn​Vectors​Similarity or matrix​Rows:​keyword​Document​Similarity to compute the similarities.

Provide input documents as rows and the context documents as columns for the similarity matrix rows computation.

Regardless of which matrix​Rows component you choose, set its max​Neighbors property to at least 100 for meaningful contrast scores.

min​Similar​Documents

Type
integer
Default
0
Constraints
value >= 0
Required
no

The minimum number of documents in the context window required for contrast score computation.

Lingo4G ignores input documents that have fewer than minSimilarDocuments context documents in their context window. If you see high contrast score documents with low confidence, increase minSimilarDocuments above zero to filter out documents with such low-quality scores. A good starting point is to set minSimilarDocuments to half of the maxNeighbors value you used for the similarity matrix rows computation.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Controls the order of documents on output.

Lingo4G sorts the output documents based on their contrast score. The default value of D​E​S​C​E​N​D​I​N​G puts the documents with the highest contrast score first. To see the documents with the lowest contrast score, set sort​Order to A​S​C​E​N​D​I​N​G. Finally, if you set sort​Order to U​N​S​P​E​C​I​F​I​E​D, Lingo4G returns the input documents in their original order.

documents:​embedding​Nearest​Neighbors

Selects documents whose embedding vectors are most similar to the vector you provide.

{
  "type": "documents:embeddingNearestNeighbors",
  "failIfEmbeddingsNotAvailable": true,
  "filterQuery": {
    "type": "query:all"
  },
  "limit": 100,
  "searcher": "AUTO",
  "vector": {
    "type": "vector:reference",
    "auto": true
  }
}

See the document selection tutorial for examples of document- and label-based selection of embedding-wise similar documents.

fail​If​Embeddings​Not​Available

Type
boolean
Default
true
Required
no

Determines the behavior of this stage if the index does not contain document embeddings.

If the index does not contain document embeddings and fail​If​Embeddings​Not​Available is:

true
this stage fails and logs an error.
false
this stage returns an empty list of documents.

If your request combines keyword- and embedding-based processing, you can set failIfEmbeddingsNotAvailable to false to have Lingo4G degrade gracefully to keyword-based processing if the index does not contain document embeddings.

filter​Query

Type
query
Default
{
  "type": "query:all"
}
Required
no

Narrows down the documents this stage returns to those matching the query you provide in the filterQuery property.

For example, the following request limits the results of embedding-based document selection to arXiv papers in the cs.* category.

{
  "stages": {
    "similarDocuments": {
      "type": "documents:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:documentEmbedding",
        "documents": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:string",
            "query": "photon"
          },
          "limit": 1
        }
      },
      "filterQuery": {
        "type": "query:string",
        "query": "category:cs.*"
      }
    }
  }
}

Using the filterQuery property to narrow down the results of the documents:embeddingNearestNeighbors stage to documents matching a query.

limit

Type
limit
Default
100
Required
no

The maximum number of documents to select.

searcher

Type
string
Default
"AUTO"
Constraints
one of [AUTO, APPROXIMATE, COMPLETE]
Required
no

Determines the document searching algorithm.

Lingo4G can use one of two algorithms to find documents whose embedding vectors lie close to the input vector you provide. The searcher property determines the algorithm to use.

A​U​T​O

Automatic algorithm choice based on the number of documents to select and the number of documents matching the filter​Query. Use automatic algorithm selection unless you notice this stage performs slowly for a specific search.

A​P​P​R​O​X​I​M​A​T​E

Forces Lingo4G to use the approximate search algorithm, which traverses a graph of similar vectors. This algorithm is only efficient for searches with low limit values or searches without results filtering.

C​O​M​P​L​E​T​E

Forces Lingo4G to perform a complete search of all document embedding vectors. If you notice slow performance of a search under the A​U​T​O searcher, try the C​O​M​P​L​E​T​E searcher, which may offer better performance for that particular search.

vector

Type
vector
Default
{
  "type": "vector:reference",
  "auto": true
}
Required
no

The input vector for the similar document search.

You can use the following vector sources for this property:

See the document selection tutorial for examples of document- and label-based selection of embedding-wise similar documents.

documents:​from​Cluster​Exemplars

Collects highest-weight top-level exemplars of the document clusters you provide into a flat document list.

{
  "type": "documents:fromClusterExemplars",
  "clusters": {
    "type": "clusters:reference",
    "auto": true
  },
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "limit": 10000,
  "sortOrder": "DESCENDING"
}

You can use this stage, combined with clusters:​ap, which clusters documents into related groups, to reduce a large collection of documents into a much smaller set of salient documents representing different themes present in the original collection.

Another use case is to combine this stage with the clusters:​from​Matrix​Columns stage to process the result of synthetic clustering of matrix columns.
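
The fragment below sketches how this stage might be wired to clustering results defined elsewhere in the request. The stage names themes and sourceDocuments are hypothetical, and the fragment assumes that clusters:reference and documents:reference accept the same use property that documentPairs:reference shows in the documents:fromDocumentPairs example.

```json
"exemplars": {
  "type": "documents:fromClusterExemplars",
  "clusters": {
    "type": "clusters:reference",
    "use": "themes"
  },
  "documents": {
    "type": "documents:reference",
    "use": "sourceDocuments"
  },
  "limit": 100,
  "sortOrder": "DESCENDING"
}
```

With these settings, the stage would return at most 100 exemplars, starting from those with the highest exemplar weights.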

clusters

Type
clusters
Default
{
  "type": "clusters:reference",
  "auto": true
}
Required
no

The clusters from which to collect exemplars.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents that gave rise to the input clusters.

The input clusters and documents must be compatible: the clusters must have been created based on the documents. Otherwise, Lingo4G logs an error.

limit

Type
limit
Default
10000
Required
no

The maximum number of exemplar documents to collect.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Determines the order in which to collect document exemplars.

A​S​C​E​N​D​I​N​G

Collects up to limit of exemplar documents with the lowest exemplar weight values.

D​E​S​C​E​N​D​I​N​G

Collects up to limit of exemplar documents with the highest exemplar weight values.

U​N​S​P​E​C​I​F​I​E​D

Collects up to limit of exemplar documents in the order they appear in the cluster list.

documents:​from​Cluster​Members

Collects members of the document clusters you provide into a flat document list.

{
  "type": "documents:fromClusterMembers",
  "clusters": {
    "type": "clusters:reference",
    "auto": true
  },
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "limit": 10000,
  "sortOrder": "DESCENDING"
}

clusters

Type
clusters
Default
{
  "type": "clusters:reference",
  "auto": true
}
Required
no

The clusters from which to collect document members.

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents that gave rise to the input clusters.

The input clusters and documents must be compatible: the clusters must have been created based on the documents. Otherwise, Lingo4G logs an error.

limit

Type
limit
Default
10000
Required
no

The maximum number of member documents to collect.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Determines the order in which to collect document members.

A​S​C​E​N​D​I​N​G

Collects up to limit of member documents with the lowest member weight values.

D​E​S​C​E​N​D​I​N​G

Collects up to limit of member documents with the highest member weight values.

U​N​S​P​E​C​I​F​I​E​D

Collects up to limit of member documents in the order they appear in the cluster list.
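
As a sketch of consuming this stage, the fragment below nests it inside a documentContent stage to fetch the contents of cluster members in cluster-list order. The stage name memberContent is hypothetical, the title field is borrowed from the documents:fromDocumentPairs example and depends on your index, and the clusters and documents properties are left at their default automatic references.

```json
"memberContent": {
  "type": "documentContent",
  "limit": "unlimited",
  "documents": {
    "type": "documents:fromClusterMembers",
    "limit": 500,
    "sortOrder": "UNSPECIFIED"
  },
  "fields": {
    "type": "contentFields:simple",
    "fields": {
      "title": {}
    }
  }
}
```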

documents:​from​Document​Pairs

Converts the list of document pairs you provide into a flat list of unique documents occurring in at least one input pair.

{
  "type": "documents:fromDocumentPairs",
  "documentPairs": {
    "type": "documentPairs:reference",
    "auto": true
  }
}

You can combine this stage with document​Content to fetch contents of documents involved in at least one of the pairs:

"content": {
  "type": "documentContent",
  "limit": "unlimited",
  "documents": {
    "type": "documents:fromDocumentPairs",
    "documentPairs": {
      "type": "documentPairs:reference",
      "use": "similarPairs"
    }
  },
  "fields": {
    "type": "contentFields:simple",
    "fields": {
      "id": {},
      "title": {},
      "author_name": {},
      "created": {},
      "updated": {},
      "abstract": {
        "maxValueLength": 250
      }
    }
  }
}

document​Pairs

Type
documentPairs
Default
{
  "type": "documentPairs:reference",
  "auto": true
}
Required
no

The document pairs to convert into a flat document list.

documents:​from​Matrix​Columns

Given a matrix with columns corresponding to documents, selects the top-scoring columns and returns them as a document list.

{
  "type": "documents:fromMatrixColumns",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "limit": 10000,
  "matrixRows": {
    "type": "matrixRows:reference",
    "auto": true
  },
  "sortOrder": "DESCENDING",
  "weightAggregation": "SUM"
}

This stage performs the following steps:

  1. For each column of the input matrix​Rows, aggregate the column's values using the weight​Aggregation function.

  2. Sort columns by their aggregated value computed in step 1, according to the sort​Order.

  3. Return a list of documents corresponding to up to limit first columns on the sorted list.

You can use the documents:​from​Matrix​Columns stage to select top-scoring documents where the score is an aggregation of a number of values. For example, if you build matrix​Rows of cross-similarities between a set of cs.* and physics.* arXiv papers, documents:​from​Matrix​Columns can reveal the top physics.* papers that are most similar to cs.* papers, showing where the two areas overlap.
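
To illustrate the three steps, assume a hypothetical input of two matrix rows over three columns: [0.2, 0.9, 0.4] and [0.1, 0.7, 0.8]. With SUM aggregation the column totals are 0.3, 1.6 and 1.2. Assuming DESCENDING places the largest totals first, in line with the stage selecting top-scoring columns, the sorted order is second, third, first, and a limit of 2 returns the documents behind the second and third columns. The fragment below sketches such a request; the stage names topColumns and crossSimilarities are hypothetical, and the use property on matrixRows:reference is assumed to work as it does for documentPairs:reference.

```json
"topColumns": {
  "type": "documents:fromMatrixColumns",
  "matrixRows": {
    "type": "matrixRows:reference",
    "use": "crossSimilarities"
  },
  "limit": 2,
  "sortOrder": "DESCENDING",
  "weightAggregation": "SUM"
}
```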

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The documents that correspond to columns of the input matrix rows.

Make sure that the documents you provide in this property also gave rise to the columns of the input matrix​Rows. If the two are incompatible, Lingo4G logs an error.

limit

Type
limit
Default
10000
Required
no

The maximum number of documents to select.

matrix​Rows

Type
matrixRows
Default
{
  "type": "matrixRows:reference",
  "auto": true
}
Required
no

The matrix rows whose columns to aggregate.

Make sure that the documents you provide gave rise to the columns of the input matrix​Rows. If the two are incompatible, Lingo4G logs an error.

sort​Order

Type
sortOrder
Default
"DESCENDING"
Required
no

Determines the sorting order for the aggregated column values.

A​S​C​E​N​D​I​N​G

Collects up to limit of documents corresponding to columns with the smallest aggregated values.

D​E​S​C​E​N​D​I​N​G

Collects up to limit of documents corresponding to columns with the largest aggregated values.

U​N​S​P​E​C​I​F​I​E​D

Collects up to limit of documents in the order their corresponding columns appear in the input matrix​Rows.

weight​Aggregation

Type
weightAggregation
Default
"SUM"
Required
no

The column value aggregation function.

documents:​rwmd

Computes an approximation of the Relaxed Word Movers Distance between the documents and labels you provide.

{
  "type": "documents:rwmd",
  "documents": {
    "type": "documents:reference",
    "auto": true
  },
  "failIfEmbeddingsNotAvailable": true,
  "fields": {
    "type": "featureFields:reference",
    "auto": true
  },
  "labelFilter": {
    "type": "labelFilter:reference",
    "auto": true
  },
  "labels": {
    "type": "labels:reference",
    "auto": true
  }
}

Relaxed Word Movers Distance (RWMD) aims to compute similarities between documents using multidimensional embedding vectors of the words appearing in the documents. Lingo4G's formulation computes the similarity between a list of labels and a list of documents you provide.

For each document in the document list, Lingo4G computes the RWMD similarity in the following way:

  1. Collect all labels occurring in the document.

  2. For each of the input labels, find the document's label with the highest embedding-wise similarity.

  3. Compute the document's RWMD score as the search score of the document against a union of the input labels and labels computed in step 2.

The above formulation makes it possible to compare RWMD scores with the regular keyword search scores you get when combining the documents:​by​Query stage with the query:​for​Labels query component. Therefore, a typical use case for the documents:​rwmd stage is to compute a unified score for keyword- and embedding-based similar document searches (MLT).

documents

Type
documents
Default
{
  "type": "documents:reference",
  "auto": true
}
Required
no

The input documents for which to compute the RWMD score.

fail​If​Embeddings​Not​Available

Type
boolean
Default
true
Required
no

Determines the behavior of this stage if the index does not contain document embeddings.

If the index does not contain document embeddings and fail​If​Embeddings​Not​Available is:

true
this stage fails and logs an error.
false
this stage returns documents with weight values equal to the weights of the input documents.

If your request combines keyword- and embedding-based processing, you can set fail​If​Embeddings​Not​Available to false to have Lingo4G degrade gracefully to keyword-based processing if the index does not contain document embeddings.
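
A minimal sketch of such a fallback-friendly stage follows. The stage names rwmdScored, candidates and seedLabels are hypothetical, and the fragment assumes that documents:reference and labels:reference accept the use property shown for documentPairs:reference in the documents:fromDocumentPairs example.

```json
"rwmdScored": {
  "type": "documents:rwmd",
  "documents": {
    "type": "documents:reference",
    "use": "candidates"
  },
  "labels": {
    "type": "labels:reference",
    "use": "seedLabels"
  },
  "failIfEmbeddingsNotAvailable": false
}
```

If the index has no document embeddings, this stage would pass the input documents through with their original weights instead of failing.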

fields

Type
featureFields
Default
{
  "type": "featureFields:reference",
  "auto": true
}
Required
no

Determines the document feature fields to use for label collection and document scoring.

label​Filter

Type
labelFilter
Default
{
  "type": "labelFilter:reference",
  "auto": true
}
Required
no

Performs filtering of labels collected from individual documents.

labels

Type
labels
Default
{
  "type": "labels:reference",
  "auto": true
}
Required
no

The labels against which to score the input documents.

documents:​sample

Returns a uniform sample of documents returned by the provided query.

{
  "type": "documents:sample",
  "limit": 10000,
  "query": null,
  "randomSeed": 0,
  "samplingRatio": 1
}

limit

Type
limit
Default
10000
Required
no

The maximum number of documents to select.

Value must be an integer >= 0 or the string unlimited.

query

Type
query
Default
null
Required
yes

One of the query components.

random​Seed

Type
integer
Default
0
Required
no

The random seed to use for sampling.

sampling​Ratio

Type
number
Default
1
Constraints
value >= 0 and value <= 1
Required
no

The sampling ratio, between 0 (exclusive) and 1 (inclusive). The documents:​sample component attempts to return a uniform sample of sampling​Ratio * source​Document​Count documents.
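
As a sketch, the fragment below samples roughly one tenth of the documents produced by another stage. The stage name candidates is hypothetical, and the use property on documents:reference is assumed to work as it does for documentPairs:reference. The query:fromDocuments component is used here only because it accepts a documents input; any other query component could take its place.

```json
{
  "type": "documents:sample",
  "limit": "unlimited",
  "query": {
    "type": "query:fromDocuments",
    "documents": {
      "type": "documents:reference",
      "use": "candidates"
    }
  },
  "randomSeed": 42,
  "samplingRatio": 0.1
}
```

For example, if the query matched 50,000 documents, a samplingRatio of 0.1 would target a sample of about 5,000 documents.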

Consumers of documents:​*

The following stages and components take documents:​* as input:

Stage or component Property
clusters:​with​Remapped​Documents
  • exemplars​From
  • exemplars​To
  • members​From
  • members​To
debug:​explain
  • documents
document​Content
  • documents
document​Labels
  • documents
document​Pairs:​all
  • documents
documents:​composite
  • selectors
documents:​contrast​Score
  • documents
documents:​from​Cluster​Exemplars
  • documents
documents:​from​Cluster​Members
  • documents
documents:​from​Matrix​Columns
  • documents
documents:​rwmd
  • documents
label​Clusters:​document​Cluster​Labels
  • documents
label​Scorer:​df
  • scope
label​Scorer:​idf
  • scope
label​Scorer:​probability​Ratio
  • base​Scope
  • reference​Scope
label​Scorer:​tf
  • scope
labels:​from​Documents
  • documents
matrix:​cooccurrence​Label​Similarity
  • documents
matrix:​keyword​Document​Similarity
  • documents
matrix:​keyword​Label​Document​Similarity
  • documents
matrix​Rows:​keyword​Document​Similarity
  • documents
  • documents
query:​from​Documents
  • documents
stats:​documents
  • documents
values:​from​Document​Field
  • documents
vector:​document​Embedding
  • documents
vectors:​precomputed​Document​Embeddings
  • documents