Document content retrieval
Most Lingo4G-based applications will ultimately need to display the contents of some documents from the index. This is where the content and label retrieval stages come in handy.
Content retrieval
The documentContent stage retrieves values of stored fields, such as title, abstract or list of authors, for each document in the document set you provide.
The following request selects the top 10 documents matching the photon query and retrieves their title and abstract fields.
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 10
    },
    "documentContent": {
      "type": "documentContent",
      "fields": {
        "type": "contentFields:simple",
        "fields": {
          "title": {},
          "abstract": {
            "maxValueLength": 512
          }
        }
      }
    }
  },
  "output": {
    "stages": [
      "documents",
      "documentContent"
    ]
  }
}
Retrieving the content of the title and abstract fields for a set of documents matching the photon query.
For the above request, Lingo4G produces a result JSON with two arrays:
- an array of document identifiers and search scores produced by the documents stage
- an array of document field values produced by the documentContent stage
Following the general principle of the Lingo4G analysis API, the two arrays are index-aligned: entries at index n in both arrays correspond to the same document.
To see a visual representation of the document content, execute the request in the JSON sandbox app and switch to the documents list tab.


Lingo4G JSON Sandbox app showing document content retrieval analysis request (on the left) and the retrieved fields (on the right).
Results paging
You can use the start and limit properties of the documentContent stage to retrieve document content in a paged fashion:
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 100
    },
    "documentContent": {
      "type": "documentContent",
      "fields": {
        "type": "contentFields:simple",
        "fields": {
          "title": {},
          "abstract": {
            "maxValueLength": 512
          }
        }
      },
      "start": 50,
      "limit": 25
    }
  },
  "output": {
    "stages": [
      "documents",
      "documentContent"
    ]
  }
}
Paged retrieval of document content. The request divides the 100 search results into 25-result pages and retrieves field values for page 3 of the results.
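The start and limit values for a given page follow from simple arithmetic. The Python sketch below is a hypothetical client-side helper: the function name page_window and the 1-based page numbering are assumptions made for illustration and are not part of the Lingo4G API. It only fills in the start and limit properties used in the request above.

```python
def page_window(page: int, page_size: int) -> dict:
    """Return the start/limit properties for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"start": (page - 1) * page_size, "limit": page_size}


# Stage definition mirroring the request above, with paging filled in
# for page 3 of 25-result pages.
document_content_stage = {
    "type": "documentContent",
    "fields": {
        "type": "contentFields:simple",
        "fields": {"title": {}, "abstract": {"maxValueLength": 512}},
    },
    **page_window(page=3, page_size=25),
}
```

With 25-result pages, page 3 covers entries 50 to 74 of the documents array, which matches the start and limit values in the request above.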
Note that when the start property is greater than 0, the documents and the document content arrays are aligned with an offset: the entry at index n in the document content array corresponds to the entry at index n + start in the documents array.
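On the client side, the offset rule can be applied when joining the two arrays. The following Python sketch assumes both arrays have already been extracted from the response as plain lists; the helper name align_content and the placeholder entries are hypothetical, and the exact shape of each array entry is deliberately left out.

```python
def align_content(documents, content, start=0):
    """Pair each content entry with the document it describes.

    Entry n of `content` corresponds to entry n + start of `documents`.
    With start=0 the two arrays are simply index-aligned.
    """
    pairs = []
    for n, content_entry in enumerate(content):
        pairs.append((documents[n + start], content_entry))
    return pairs


# Placeholder data standing in for the two result arrays:
# 100 search results and the content of entries 50..74.
documents = [f"doc-{i}" for i in range(100)]
content = [f"content-{i}" for i in range(50, 75)]

pairs = align_content(documents, content, start=50)
```

Here the first pair joins entry 50 of the documents array with entry 0 of the content array.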
The default value of the limit property is unlimited. Therefore, if you don't provide an explicit lower limit value, Lingo4G will retrieve the content of all the documents on input. Make sure your requests don't accidentally retrieve the content of tens of thousands of documents, as this will be resource-intensive on both the server and the client side.
Field output configuration
Use the fields property to specify which fields Lingo4G should return for each document and how Lingo4G should format the fields' values.
You can use any of the contentFields:* components to provide the above specification. The request below returns all values, at full length, of the title and abstract fields for the first two documents matching the "twin photon" correlations query.
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "\"twin photon\" correlations"
      },
      "limit": 2
    },
    "documentContent": {
      "type": "documentContent",
      "fields": {
        "type": "contentFields:simple",
        "fields": {
          "title": {
            "maxValues": "unlimited",
            "maxValueLength": "unlimited"
          },
          "abstract": {
            "maxValues": "unlimited",
            "maxValueLength": "unlimited"
          }
        }
      }
    }
  },
  "output": {
    "stages": [
      "documentContent"
    ]
  }
}
Retrieving the full content of the title and abstract fields.
The above request returns the following:
{
  "result" : {
    "documentContent" : {
      "documents" : [
        {
          "id" : 5442,
          "fields" : {
            "title" : {
              "values" : [
                "⁌q⁍Correlation⁌\\q⁍ Imaging through a scattering medium: experiment and comparison with simulations of the biphoton wave function"
              ]
            },
            "abstract" : {
              "values" : [
                " We first extend our recent experiments of ⁌q⁍correlation⁌\\q⁍ imaging through scattering media to the case of a thick medium, composed of two phase scatterers placed respectively in the image and the Fourier planes of the crystal. The spatial ⁌q⁍correlations⁌\\q⁍ between ⁌q⁍twin photons⁌\\q⁍ are still detected but no more in the form of a speckle. Second, a numerical simulation of the biphoton wave function is developed and applied to our experimental situation, with a good agreement. "
              ]
            }
          }
        },
        {
          "id" : 347972,
          "fields" : {
            "title" : {
              "values" : [
                "Disclosing the spatio-temporal structure of PDC entanglement through frequency up-conversion"
              ]
            },
            "abstract" : {
              "values" : [
                " In this work we propose and analyse a scheme where the full spatio-temporal ⁌q⁍correlation⁌\\q⁍ of ⁌q⁍twin photons⁌\\q⁍/beams generated by parametric down-conversion is detected by using its inverse process, i.e. sum frequency generation. Our main result is that, by imposing independently a temporal delay Δ t and a transverse spatial shift Δ x between two twin components of PDC light, the up-converted light intensity provides information on the ⁌q⁍correlation⁌\\q⁍ of the PDC light in the full spatio-temporal domain, and should enable the reconstruction of the peculiar X-shaped structure of the ⁌q⁍correlation⁌\\q⁍ predicted in [gatti2009,caspani2010,brambilla2010]. Through both a semi-analytical and a numerical modeling of the proposed optical system, we analyse the feasibility of the experiment and identify the best conditions to implement it. In particular, the tolerance of the phase-sensitive measurement against the presence of dispersive elements, imperfect imaging conditions and possible misalignments of the two crystals is evaluated. "
              ]
            }
          }
        }
      ]
    }
  }
}
The result of retrieving the full content of the title and abstract fields.
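A client application typically walks this structure to pull out the values of one field per document. The Python sketch below assumes the response has already been parsed into a dictionary (for example with json.loads) and that the content stage is named documentContent, as in the request above. The helper name field_values is hypothetical, and the sample dictionary is an abbreviated copy of the response shown above.

```python
def field_values(response: dict, field: str) -> dict:
    """Map document id -> list of values of `field` in a documentContent result."""
    documents = response["result"]["documentContent"]["documents"]
    return {
        doc["id"]: doc["fields"].get(field, {}).get("values", [])
        for doc in documents
    }


# Abbreviated copy of the response above (one document, title only).
response = {
    "result": {
        "documentContent": {
            "documents": [
                {
                    "id": 347972,
                    "fields": {
                        "title": {
                            "values": [
                                "Disclosing the spatio-temporal structure of PDC "
                                "entanglement through frequency up-conversion"
                            ]
                        }
                    },
                }
            ]
        }
    }
}

titles = field_values(response, "title")
```

Fields missing from a document yield an empty list rather than an error, which keeps rendering code simple when different field groups are requested.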
In most scenarios, the full content of long fields is not needed and a lead line of a certain length is sufficient. In the request below, the title field is configured to always return the full value, but the abstract and author_name fields are limited to at most two values, each truncated to at most 160 characters.
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "\"twin photon\" correlations"
      },
      "limit": 2
    },
    "documentContent": {
      "type": "documentContent",
      "fields": {
        "type": "contentFields:grouped",
        "groups": [
          {
            "fields": ["title"],
            "config": {
              "maxValues": "unlimited",
              "maxValueLength": "unlimited"
            }
          },
          {
            "fields": ["abstract", "author_name"],
            "config": {
              "maxValues": 2,
              "maxValueLength": 160
            }
          }
        ]
      }
    }
  },
  "output": {
    "stages": [
      "documentContent"
    ]
  }
}
Limiting and truncating the content of selected fields.
Compare the result below to the full content of those fields retrieved in the previous request. Note the ellipsis marks where values have been truncated.
{
  "result" : {
    "documentContent" : {
      "documents" : [
        {
          "id" : 5442,
          "fields" : {
            "title" : {
              "values" : [
                "⁌q⁍Correlation⁌\\q⁍ Imaging through a scattering medium: experiment and comparison with simulations of the biphoton wave function"
              ]
            },
            "abstract" : {
              "values" : [
                "…in the image and the Fourier planes of the crystal. The spatial ⁌q⁍correlations⁌\\q⁍ between ⁌q⁍twin photons⁌\\q⁍ are still detected but no more in the form of a…"
              ]
            },
            "author_name" : {
              "values" : [
                "Soro, Gnatiessoro",
                "Lantz, Eric",
                "…"
              ]
            }
          }
        },
        {
          "id" : 347972,
          "fields" : {
            "title" : {
              "values" : [
                "Disclosing the spatio-temporal structure of PDC entanglement through frequency up-conversion"
              ]
            },
            "abstract" : {
              "values" : [
                "…this work we propose and analyse a scheme where the full spatio-temporal ⁌q⁍correlation⁌\\q⁍ of ⁌q⁍twin photons⁌\\q⁍/beams generated by parametric down-conversion is detected…"
              ]
            },
            "author_name" : {
              "values" : [
                "Brambilla, Enrico",
                "Jedrkiewicz, Ottavia",
                "…"
              ]
            }
          }
        }
      ]
    }
  }
}
This response includes a subset of author_name field values and truncated long strings in the abstract field.
Query highlighting
Query in context is a standard technique of presenting search results by highlighting short fragments of text that directly correspond to the search query issued by the user. For example, for the query "twin photon" correlations we would expect those phrases to be highlighted in the returned set of fields for each document.
Use the queries property of the documentContent stage to specify one or more queries whose matching text regions Lingo4G should highlight. Typically, the queries element contains the same query as the one issued by the user, but it is not limited to just one (or even the same) query.
In the example below, we request two documents matching "twin photon" correlations and configure the queries property to highlight text fragments matching two queries: "twin photon" correlations and interference:
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "\"twin photon\" correlations"
      },
      "limit": 2
    },
    "documentContent": {
      "type": "documentContent",
      "fields": {
        "type": "contentFields:grouped",
        "groups": [
          {
            "fields": ["title"],
            "config": {
              "maxValues": "unlimited",
              "maxValueLength": "unlimited"
            }
          },
          {
            "fields": ["abstract", "author_name"],
            "config": {
              "maxValues": 2,
              "maxValueLength": 160
            }
          }
        ]
      },
      "queries": {
        "q1": {
          "type": "query:string",
          "query": "\"twin photon\" correlations"
        },
        "q2": {
          "type": "query:string",
          "query": "interference"
        }
      }
    }
  },
  "output": {
    "stages": [
      "documentContent"
    ]
  }
}
The queries property used to highlight text regions matching two independent queries. We name the two queries q1 and q2 so that we can identify their match regions in the response.
Note there is no guarantee that all matching text regions will be included in the response (this depends on how the field value limits are configured). Lingo4G will try to return those regions within each document field's value that contain a maximum number of hits. For the query above, the returned response includes marked-up passages as shown below:
{
  "result" : {
    "documentContent" : {
      "documents" : [
        {
          "id" : 5442,
          "fields" : {
            "title" : {
              "values" : [
                "⁌q1⁍Correlation⁌\\q1⁍ Imaging through a scattering medium: experiment and comparison with simulations of the biphoton wave function"
              ]
            },
            "abstract" : {
              "values" : [
                "…in the image and the Fourier planes of the crystal. The spatial ⁌q1⁍correlations⁌\\q1⁍ between ⁌q1⁍twin photons⁌\\q1⁍ are still detected but no more in the form of a…"
              ]
            },
            "author_name" : {
              "values" : [
                "Soro, Gnatiessoro",
                "Lantz, Eric",
                "…"
              ]
            }
          }
        },
        {
          "id" : 347972,
          "fields" : {
            "title" : {
              "values" : [
                "Disclosing the spatio-temporal structure of PDC entanglement through frequency up-conversion"
              ]
            },
            "abstract" : {
              "values" : [
                "…this work we propose and analyse a scheme where the full spatio-temporal ⁌q1⁍correlation⁌\\q1⁍ of ⁌q1⁍twin photons⁌\\q1⁍/beams generated by parametric down-conversion is detected…"
              ]
            },
            "author_name" : {
              "values" : [
                "Brambilla, Enrico",
                "Jedrkiewicz, Ottavia",
                "…"
              ]
            }
          }
        }
      ]
    }
  }
}
The returned, highlighted field values contain the default highlight markers (⁌q1⁍..⁌\q1⁍, ⁌q2⁍..⁌\q2⁍) for each query specified in the queries property.
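Before display, an application usually converts these markers into its own markup. The Python sketch below turns them into HTML mark elements. It assumes the default markers are built from the characters U+204C and U+204D around the query name, with a backslash before the name in the closing marker; adjust the pattern if your configuration uses different delimiters. The function name and the choice of the mark element with a class per query are illustrative only.

```python
import re
from html import escape

# Opening marker: U+204C name U+204D; closing marker: U+204C \name U+204D.
MARKER = re.compile("\u204c(\\\\?)([^\u204c\u204d]+)\u204d")


def highlights_to_html(value: str) -> str:
    """Replace highlight markers with <mark> tags, HTML-escaping the text."""
    parts = []
    position = 0
    for match in MARKER.finditer(value):
        # Escape the plain text between the previous marker and this one.
        parts.append(escape(value[position:match.start()]))
        is_closing = match.group(1) == "\\"
        query_name = match.group(2)
        if is_closing:
            parts.append("</mark>")
        else:
            parts.append(f'<mark class="{escape(query_name)}">')
        position = match.end()
    parts.append(escape(value[position:]))
    return "".join(parts)


fragment = "The spatial \u204cq1\u204dcorrelations\u204c\\q1\u204d between"
html_fragment = highlights_to_html(fragment)
```

Using the query name as the class lets a style sheet color the matches of q1 and q2 differently. The sketch does not check that opening and closing markers are balanced, which matters if truncation can cut a value inside a highlighted region.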
Label retrieval
The documentLabels stage retrieves labels contained in each document of the document set you provide. You can combine it with the documentContent stage to present the content and labels contained in a set of documents.
The following request selects the top 10 documents matching the photon query and retrieves up to the 5 most frequent labels contained in each document.
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 10
    },
    "documentLabels": {
      "type": "documentLabels",
      "maxLabels": 5
    }
  }
}
Retrieving labels for a set of documents matching the photon query.
Run the request in JSON sandbox to see what the label retrieval JSON response looks like. If you switch to the documents list view, you should see a graphical representation of the documents and their labels.


Lingo4G JSON Sandbox app showing document label retrieval analysis request (on the left) and the retrieved labels (on the right).
Label and document retrieval stages are similar and complementary:
- Both stages produce an array that is index-aligned with the input documents array: entries at index n in the documents and the content or labels array refer to the same document.
- Both stages support the start and limit properties for paged retrieval.
Use the documentLabels stage results only for presentation purposes. If you need to collect an aggregate list of labels occurring in a set of documents, use the labels:fromDocuments stage.
Label frequency thresholds
To apply frequency thresholds to the labels the documentLabels stage collects, override properties of the stage's underlying labelCollector:
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 10
    },
    "documentLabels": {
      "type": "documentLabels",
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "minTf": 2
      }
    }
  }
}
Document label retrieval with customized label frequency thresholds.
Label filtering
In its default configuration, the documentLabels stage does not apply any filtering to the list of labels it retrieves (except the label filter default component). One common label retrieval scenario is to collect a list of salient labels from a larger set of documents and then retrieve the occurrences of those labels in individual documents:
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 10
    },
    "labels": {
      "type": "labels:fromDocuments"
    },
    "documentLabels": {
      "type": "documentLabels",
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "labelFilter": {
          "type": "labelFilter:acceptLabels",
          "labels": {
            "type": "labels:reference",
            "use": "labels"
          }
        }
      }
    }
  }
}
Limiting document label retrieval to a closed set of labels.
The above request consists of three stages:
- The documents stage selects the top 10 documents matching the photon query.
- The labels stage collects a set of labels that best describe the documents from the documents stage.
- The documentLabels stage retrieves the occurrences of the salient labels for each document. We achieve this by applying the labelFilter:acceptLabels filter configured to accept only the salient labels from the labels stage.
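Conceptually, the accept filter in the third stage behaves like a set-membership test applied to each document's labels. The Python sketch below only illustrates that idea on plain lists. It is not how Lingo4G implements the filter, it does not use the Lingo4G response format, and the label strings are made-up sample data.

```python
def accept_labels(document_labels, accepted):
    """Keep, for each document, only the labels present in the accepted set."""
    accepted_set = set(accepted)
    return [
        [label for label in labels if label in accepted_set]
        for labels in document_labels
    ]


# Made-up sample data: salient labels for the whole document set
# and the unfiltered labels of two individual documents.
salient = ["photon", "entanglement"]
per_document = [
    ["photon", "experiment", "entanglement"],
    ["crystal", "photon"],
]

filtered = accept_labels(per_document, salient)
```

The output stays index-aligned with the input: the first inner list still belongs to the first document, only with the non-salient labels removed.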