Document content retrieval
Most Lingo4G-based applications will ultimately need to display the contents of some documents from the index. This is where the content and label retrieval stages come in handy.
The documentContent stage retrieves values of stored fields, such as title, abstract or list of authors, for each document in the
document set you provide.
The following request selects the top 10 documents matching the
photon query and retrieves their
For the above request, Lingo4G produces a result JSON with two arrays:
an array of document identifiers and search scores produced by the documents stage,
an array of document field values produced by the documentContent stage.
Following the general principle of the Lingo4G analysis API, the two arrays are index-aligned:
entries at index n in both arrays correspond to the same document.
To see a visual representation of the document content, execute the request in the JSON sandbox app and switch to the documents list tab.
Note that when the start property is greater than 0, the documents and the document content arrays
are aligned with an offset: the entry at index n in the document content array
corresponds to the entry at index n + start in the documents array.
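The offset rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the Lingo4G API: the function name pair_with_offset and the id and title keys are hypothetical, and the two lists merely stand in for the documents and document content arrays of a response.

```python
from typing import Any, Dict, List, Tuple

Entry = Dict[str, Any]


def pair_with_offset(
    documents: List[Entry], contents: List[Entry], start: int = 0
) -> List[Tuple[Entry, Entry]]:
    """Pair contents[n] with documents[n + start].

    With start == 0 this reduces to plain index alignment.
    """
    if start < 0:
        raise ValueError("start must not be negative")
    if start + len(contents) > len(documents):
        raise ValueError("content array extends past the documents array")
    return [(documents[n + start], content) for n, content in enumerate(contents)]


# Hypothetical data: four selected documents, content retrieved from start = 2.
documents = [{"id": 10}, {"id": 11}, {"id": 12}, {"id": 13}]
contents = [{"title": "third"}, {"title": "fourth"}]
pairs = pair_with_offset(documents, contents, start=2)
```

Here the first content entry is paired with the document at index 2 and the second with the document at index 3.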
The default value of the limit property is unlimited. Therefore, if you don't provide an explicit lower limit value, Lingo4G will retrieve
the content of all the documents on input. Make sure your requests don't accidentally retrieve the content of
tens of thousands of documents, as this will be resource-intensive on both the server and the client side.
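One way to keep each request bounded is to retrieve content page by page. The Python sketch below only computes the start and limit values for consecutive pages. The function name page_windows is hypothetical, and how the values are placed in the request depends on your request JSON, which is not shown here.

```python
from typing import Iterator, Tuple


def page_windows(total: int, page_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, limit) pairs that cover `total` documents in pages."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, total, page_size):
        # The last page may be shorter than page_size.
        yield start, min(page_size, total - start)


# 25 documents in pages of 10 need three requests.
windows = list(page_windows(25, 10))
```

Each pair can then be used as the start and limit of one retrieval request, so no single request asks for more than page_size documents.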
Field output configuration
property to specify which fields Lingo4G should return for each document and how Lingo4G should format the
You can use any of the
components to provide the above specification. The request below returns a full set of complete values of the
abstract fields (limited to the first two documents matching the
"twin photon" correlations query).
The above request returns the following:
In most scenarios, the full content of long fields is not needed and a leading excerpt of a certain length is
sufficient. In the request below, the
title field is configured to always return the full value,
author_name fields are limited to at most two values, each truncated
to at most 160 characters.
Compare the result below to the full content of those fields retrieved in the previous request. Note the ellipsis marks where values have been truncated.
Query in context is a standard technique of presenting search results by highlighting short fragments of text that directly correspond to the search query issued by the user. For example, for the query "twin photon" correlations we would expect those phrases to be highlighted in the returned set of fields for each document.
Use the queries property of the
documentContent stage to specify one or more queries for which Lingo4G should
highlight the matching text regions. Typically, the
queries element will contain the
same query as the one issued by the user, but it is not limited to just one (or even the same) query.
In the example below, we request two documents matching
"twin photon" correlations and configure the
queries property to highlight text fragments
matching two queries: "twin photon" correlations and interference:
Note there is no guarantee that all matching text regions will be included in the response (this depends on how the field value limits are configured). Lingo4G will try to return those regions within each document field's value that contain a maximum number of hits. For the query above, the returned response includes marked-up passages as shown below:
The documentLabels stage retrieves labels contained in each document of the document set you provide. You can combine it with the
documentContent stage to present the content and labels contained in a set of documents.
The following request selects the top 10 documents matching the photon query and retrieves up to 5 most frequent labels contained in each document.
Run the request in JSON sandbox to see what the label retrieval JSON response looks like. If you switch to the documents list view, you should see a graphical representation of the documents and their labels.
Label and document retrieval stages are similar and complementary:
Both stages produce an array that is index-aligned with the input documents array: entries at index
n in the documents and the content or labels array refer to the same document.
Both stages support the
start and limit properties for paged retrieval.
Use the documentLabels stage results only for presentation purposes.
If you need to collect an aggregate list of labels occurring in a set of documents, use the
Label frequency thresholds
To apply frequency thresholds to the labels the
documentLabels stage collects, override properties of the stage's underlying
In its default configuration, the
documentLabels stage does not apply any filtering to the list of
labels it retrieves (except the label filter
default component). One common label retrieval
scenario is to collect a list of salient labels from a larger set of documents and then retrieve the occurrences
of those labels in individual documents:
The above request consists of three stages:
The documents stage selects the top 1000 documents matching the photon query.
The labels stage collects a set of labels that best describe the documents from the documents stage.
The documentLabels stage retrieves the occurrences of the salient labels for each document. We achieve this by applying the
labelFilter:acceptLabels filter configured to accept only the salient labels from the labels stage.