Document content retrieval
Most Lingo4G-based applications will ultimately need to display the contents of some documents from the index. This is where the content and label retrieval stages come in handy.
Content retrieval
The documentContent stage retrieves values of stored fields, such as title, abstract or list of authors, for each document in the document set you provide.
The following request selects the top 10 documents matching the photon query and retrieves their title and abstract fields.
For the above request, Lingo4G produces a result JSON with two arrays:
- an array of document identifiers and search scores produced by the documents stage
- an array of document field values produced by the documentContent stage
Following the general principle of the Lingo4G analysis API, the two arrays are index-aligned: entries at index n in both arrays correspond to the same document.
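On the client side, index alignment means the two arrays can be walked in parallel. The Python sketch below shows one way to do that. The response layout it uses (a list of id and score objects plus a list of field-value objects) and every key name are assumptions made for illustration, not the exact Lingo4G response schema.

```python
def pair_documents_with_content(documents, content):
    """Pair each document entry with the content entry at the same index."""
    if len(documents) != len(content):
        raise ValueError("expected index-aligned arrays of equal length")
    return list(zip(documents, content))


# Hypothetical, simplified stage outputs; key names are illustrative only.
example_documents = [{"id": 101, "score": 2.4}, {"id": 57, "score": 1.9}]
example_content = [
    {"title": ["Photon pair generation"]},
    {"title": ["Single-photon detectors"]},
]

pairs = pair_documents_with_content(example_documents, example_content)
for doc, fields in pairs:
    print(doc["id"], doc["score"], fields["title"][0])
```

The length check is a defensive choice for the unpaged case: if the arrays differ in length, the pairing would silently drop entries, so the sketch fails loudly instead.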
To see a visual representation of the document content, execute the request in the JSON sandbox app and switch to the documents list tab.
Results paging
You can use the start and limit properties of the documentContent stage to retrieve document content in a paged fashion:
Note that when the start property is greater than 0, the documents and the document content arrays are aligned with an offset: the entry at index n in the document content array corresponds to the entry at index n + start in the documents array.
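The offset rule can be expressed as a small client-side helper. This is a minimal sketch, assuming the documents array is complete and the content array holds one page. The function name and the data layout are hypothetical.

```python
def align_page(documents, content_page, start=0):
    """Match content entry n with the document at index n + start."""
    if start < 0:
        raise ValueError("start must be non-negative")
    if start + len(content_page) > len(documents):
        raise ValueError("content page extends past the documents array")
    return [
        (documents[n + start], fields)
        for n, fields in enumerate(content_page)
    ]


# Hypothetical data: five selected documents, content fetched with start=2, limit=2.
all_documents = [{"id": i} for i in (11, 22, 33, 44, 55)]
page = [{"title": ["third"]}, {"title": ["fourth"]}]

aligned = align_page(all_documents, page, start=2)
for doc, fields in aligned:
    print(doc["id"], fields["title"][0])
```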
The default value of the limit property is unlimited. Therefore, if you don't explicitly set a lower limit, Lingo4G will retrieve the content of all input documents. Make sure your requests don't accidentally retrieve the content of tens of thousands of documents, as this is resource-intensive on both the server and the client.
Field output configuration
Use the fields property to specify which fields Lingo4G should return for each document and how Lingo4G should format the fields' values.
You can use any of the contentFields:* components to provide the above specification. The request below returns all values, each in full, of the title and abstract fields (limited to the first two documents matching the "twin photon" correlations query).
The above request returns the following:
In most scenarios, the full content of long fields is not needed and a leading fragment of limited length is sufficient. In the request below, the title field is configured to always return the full value, but the abstract and author_name fields are limited to at most two values, each truncated to at most 160 characters.
Compare the result below to the full content of those fields retrieved in the previous request. Note the ellipsis marks where values have been truncated.
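To make the effect of those limits concrete, here is a minimal Python sketch of the same idea applied on the client side: keep at most two values and cut each to 160 characters, marking the cut with an ellipsis. It is only an illustration. The function name is hypothetical, and Lingo4G's own truncation (for example, where exactly it cuts and whether the ellipsis counts toward the limit) may differ.

```python
def limit_field_values(values, max_values=2, max_chars=160, ellipsis="..."):
    """Keep at most max_values values, each cut to at most max_chars characters.

    Assumption: the ellipsis is appended after the cut and does not count
    toward max_chars.
    """
    limited = []
    for value in values[:max_values]:
        if len(value) > max_chars:
            value = value[:max_chars].rstrip() + ellipsis
        limited.append(value)
    return limited


# Hypothetical field values for one document.
author_names = ["A. Author", "B. Writer", "C. Third"]
abstracts = ["x" * 200]

short_authors = limit_field_values(author_names)
short_abstracts = limit_field_values(abstracts)
print(short_authors)
print(len(short_abstracts[0]))
```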
Query highlighting
Query in context is a standard technique of presenting search results by highlighting short fragments of text that directly correspond to the search query issued by the user. For example, for the query "twin photon" correlations we would expect those phrases to be highlighted in the returned set of fields for each document.
Use the queries property of the documentContent stage to specify one or more queries whose matching text regions Lingo4G should highlight. Typically, the queries element contains the same query the user issued, but it is not limited to just one (or even the same) query.
In the example below, we request two documents matching "twin photon" correlations and configure the queries property to highlight text fragments matching two queries: "twin photon" correlations and interference:
Note there is no guarantee that all matching text regions will be included in the response (this depends on how the field value limits are configured). Lingo4G will try to return those regions within each document field's value that contain a maximum number of hits. For the query above, the returned response includes marked-up passages as shown below:
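The idea of preferring regions with the most hits can be sketched as follows. This is not Lingo4G's algorithm, which this document does not describe. It is a simplified illustration that treats each query term as a literal, case-insensitive string and picks the fixed-width window containing the most complete matches. The function name and the width parameter are hypothetical.

```python
import re


def best_window(text, terms, width=160):
    """Return the width-character slice of text containing the most term hits."""
    if not terms:
        return text[:width]
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = [(m.start(), m.end()) for m in pattern.finditer(text)]
    if not hits:
        return text[:width]
    best_start, best_count = hits[0][0], 0
    # Try a window starting at each hit; count hits that fit entirely inside it.
    for start, _ in hits:
        count = sum(1 for s, e in hits if s >= start and e <= start + width)
        if count > best_count:
            best_start, best_count = start, count
    return text[best_start:best_start + width]


sample_text = (
    "Photon sources are common. "
    + "Filler sentence here. " * 10
    + "Twin photon correlations show interference."
)
passage = best_window(
    sample_text, ["twin photon", "correlations", "interference"], width=60
)
print(passage)
```

With a tight width, a field that has hits in several distant places yields only the densest region, which is why some matching regions may be absent from a truncated value.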
Label retrieval
The documentLabels stage retrieves labels contained in each document of the document set you provide. You can combine it with the documentContent stage to present the content and labels contained in a set of documents.
The following request selects the top 10 documents matching the photon query and retrieves up to 5 most frequent labels contained in each document.
Run the request in JSON sandbox to see what the label retrieval JSON response looks like. If you switch to the documents list view, you should see a graphical representation of the documents and their labels.
Label and document retrieval stages are similar and complementary:
- Both stages produce an array that is index-aligned with the input documents array: entries at index n in the documents and the content or labels array refer to the same document.
- Both stages support the start and limit properties for paged retrieval.
Use the documentLabels stage results only for presentation purposes. If you need to collect an aggregate list of labels occurring in a set of documents, use the labels:fromDocuments stage.
Label frequency thresholds
To apply frequency thresholds to the labels the documentLabels stage collects, override properties of the stage's underlying labelCollector:
Label filtering
In its default configuration, the documentLabels stage does not apply any filtering to the list of labels it retrieves (except the label filter default component). One common label retrieval scenario is to collect a list of salient labels from a larger set of documents and then retrieve the occurrences of those labels in individual documents:
The above request consists of three stages:
- The documents stage selects the top 1000 documents matching the photon query.
- The labels stage collects a set of labels that best describe the documents from the documents stage.
- The documentLabels stage retrieves the occurrences of the salient labels for each document. We achieve this by applying the labelFilter:acceptLabels filter configured to accept only the salient labels from the labels stage.
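The effect of the accept-labels filter in the last stage can be pictured with a short Python sketch. It only illustrates the filtering semantics on plain lists. The function name and data are hypothetical, and the sketch does not reproduce how Lingo4G implements labelFilter:acceptLabels.

```python
def accept_only(per_document_labels, salient_labels):
    """Keep, for each document, only the labels present in the salient set."""
    accepted = set(salient_labels)
    return [
        [label for label in labels if label in accepted]
        for labels in per_document_labels
    ]


# Hypothetical data: salient labels for the whole set, raw labels per document.
salient = ["photon", "entanglement"]
raw_labels = [
    ["photon", "laser", "entanglement"],
    ["detector"],
    ["entanglement", "photon"],
]

filtered = accept_only(raw_labels, salient)
print(filtered)
```

A document with no salient labels keeps an empty list rather than being dropped, so the result stays index-aligned with the documents array.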