Document content retrieval

Most Lingo4G-based applications will ultimately need to display the contents of some documents from the index. This is where the content and label retrieval stages come in handy.

Content retrieval

The document​Content stage retrieves values of stored fields, such as title, abstract or list of authors, for each document in the document set you provide.

The following request selects the top 10 documents matching the photon query and retrieves their title and abstract fields.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 10
    },
    "documentContent": {
      "type": "documentContent",
      "fields":{
        "type": "contentFields:simple",
        "fields": {
          "title": {},
          "abstract": {
            "maxValueLength": 512
          }
        }
      }
    }
  },
  "output": {
    "stages": [
      "documents",
      "documentContent"
    ]
  }
}

Retrieving the content of the title and abstract fields for a set of documents matching the photon query.

For the above request, Lingo4G produces a result JSON with two arrays:

  • an array of document identifiers and search scores produced by the documents stage
  • an array of document field values produced by the document​Content stage

Following the general principle of the Lingo4G analysis API, the two arrays are index-aligned: entries at index n in both arrays correspond to the same document.

To see a visual representation of the document content, execute the request in the JSON sandbox app and switch to the documents list tab.


Lingo4G JSON Sandbox app showing document content retrieval analysis request (on the left) and the retrieved fields (on the right).

Results paging

You can use the start and limit properties of the document​Content stage to retrieve document content in a paged fashion:

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 100
    },
    "documentContent": {
      "type": "documentContent",
      "fields":{
        "type": "contentFields:simple",
        "fields": {
          "title": {},
          "abstract": {
            "maxValueLength": 512
          }
        }
      },
      "start": 50,
      "limit": 25
    }
  },
  "output": {
    "stages": [
      "documents",
      "documentContent"
    ]
  }
}

Paged retrieval of document content. The request divides the 100 search results into 25-result pages and retrieves field values for page 3 of the results.

Note that when the start property is greater than 0, the documents and the document content arrays are aligned with an offset: the entry at index n in the document content array corresponds to the entry at index n + start in the documents array.
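
The offset alignment described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Lingo4G API: the align_content function and the simplified entry shapes (dictionaries with only an id) are assumptions made for the example. Only the index arithmetic comes from the text.

```python
def align_content(documents, content, start=0):
    """Pair each content entry with its entry in the documents array.

    documents: entries produced by the documents stage (shape assumed here).
    content:   entries produced by the documentContent stage.
    start:     the value of the start property used in the request.
    """
    pairs = []
    for n, content_entry in enumerate(content):
        # Entry n of the content array corresponds to entry n + start
        # of the documents array.
        pairs.append((documents[n + start], content_entry))
    return pairs


# Hypothetical, simplified data: 100 search results, content for entries 50..74.
documents = [{"id": 100 + i} for i in range(100)]
content = [{"id": 150 + i, "fields": {}} for i in range(25)]

pairs = align_content(documents, content, start=50)
print(pairs[0][0]["id"], pairs[0][1]["id"])
```

With start left at 0, the same function covers the plain index-aligned case.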

Heads up, unlimited retrieval by default!

The default value of the limit property is unlimited. Therefore, if you don't provide an explicit lower limit value, Lingo4G will retrieve the content of all the documents on input. Make sure your requests don't accidentally retrieve the content of tens of thousands of documents, as this will be resource-intensive both on the server and on the client side.
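
One way to avoid accidental unlimited retrieval is to build requests through a helper that always sets start and limit explicitly. The sketch below assembles the same request as the paging example above. The content_page_request function, its parameters and the zero-based page convention are assumptions of this example, not part of Lingo4G.

```python
import json


def content_page_request(query, page, page_size, max_results):
    """Build a request that retrieves one page of document content.

    page is zero-based (an assumption of this sketch), so page 2 with a
    page size of 25 maps to start = 50, limit = 25.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {
        "stages": {
            "documents": {
                "type": "documents:byQuery",
                "query": {"type": "query:string", "query": query},
                "limit": max_results,
            },
            "documentContent": {
                "type": "documentContent",
                "fields": {
                    "type": "contentFields:simple",
                    "fields": {
                        "title": {},
                        "abstract": {"maxValueLength": 512},
                    },
                },
                # Always explicit, never the unlimited default.
                "start": page * page_size,
                "limit": page_size,
            },
        },
        "output": {"stages": ["documents", "documentContent"]},
    }


request = content_page_request("photon", page=2, page_size=25, max_results=100)
print(json.dumps(request, indent=2))
```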

Field output configuration

Use the fields property to specify which fields Lingo4G should return for each document and how Lingo4G should format the fields' values.

You can use any of the contentFields:* components to provide this specification. The request below returns the complete, untruncated values of the title and abstract fields for the first two documents matching the "twin photon" correlations query.

{
  "stages": {    
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "\"twin photon\" correlations"
      },
      "limit": 2
    },
    "documentContent": {
      "type": "documentContent",
      "fields":{
        "type": "contentFields:simple",
        "fields": {
          "title": {
            "maxValues": "unlimited",
            "maxValueLength": "unlimited"
          },
          "abstract": {        
            "maxValues": "unlimited",
            "maxValueLength": "unlimited"
          }
        }
      }
    }
  },
  "output": {
    "stages": [
      "documentContent"
    ]
  }
}

Retrieving the full content of title and abstract fields.

The above request returns the following:

{
  "result" : {
    "documentContent" : {
      "documents" : [
        {
          "id" : 5442,
          "fields" : {
            "title" : {
              "values" : [
                "⁌q⁍Correlation⁌\\q⁍ Imaging through a scattering medium: experiment and comparison with simulations of the biphoton wave function"
              ]
            },
            "abstract" : {
              "values" : [
                " We first extend our recent experiments of ⁌q⁍correlation⁌\\q⁍ imaging through scattering media to the case of a thick medium, composed of two phase scatterers placed respectively in the image and the Fourier planes of the crystal. The spatial ⁌q⁍correlations⁌\\q⁍ between ⁌q⁍twin photons⁌\\q⁍ are still detected but no more in the form of a speckle. Second, a numerical simulation of the biphoton wave function is developed and applied to our experimental situation, with a good agreement. "
              ]
            }
          }
        },
        {
          "id" : 347972,
          "fields" : {
            "title" : {
              "values" : [
                "Disclosing the spatio-temporal structure of PDC entanglement through frequency up-conversion"
              ]
            },
            "abstract" : {
              "values" : [
                " In this work we propose and analyse a scheme where the full spatio-temporal ⁌q⁍correlation⁌\\q⁍ of ⁌q⁍twin photons⁌\\q⁍/beams generated by parametric down-conversion is detected by using its inverse process, i.e. sum frequency generation. Our main result is that, by imposing independently a temporal delay Δ t and a transverse spatial shift Δ x between two twin components of PDC light, the up-converted light intensity provides information on the ⁌q⁍correlation⁌\\q⁍ of the PDC light in the full spatio-temporal domain, and should enable the reconstruction of the peculiar X-shaped structure of the ⁌q⁍correlation⁌\\q⁍ predicted in [gatti2009,caspani2010,brambilla2010]. Through both a semi-analytical and a numerical modeling of the proposed optical system, we analyse the feasibility of the experiment and identify the best conditions to implement it. In particular, the tolerance of the phase-sensitive measurement against the presence of dispersive elements, imperfect imaging conditions and possible misalignments of the two crystals is evaluated. "
              ]
            }
          }
        }
      ]
    }
  }
}

The result of retrieving the full content of title and abstract fields.

In most scenarios, you don't need the full content of long fields and a short fragment of limited length is sufficient. In the request below, the title field always returns the full value, while the abstract and author_name fields return at most two values, each truncated to at most 160 characters.

{
  "stages": {    
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "\"twin photon\" correlations"
      },
      "limit": 2
    },
    "documentContent": {
      "type": "documentContent",
      "fields": {
        "type": "contentFields:grouped",
        "groups": [
          {
            "fields": ["title"],
            "config": {
              "maxValues": "unlimited",
              "maxValueLength": "unlimited"
            }
          },
          {
            "fields": ["abstract", "author_name"],
            "config": {
              "maxValues": 2,
              "maxValueLength": 160
            }
          }
        ]
      }
    }
  },
  "output": {
    "stages": [
      "documentContent"
    ]
  }
}

Limiting and truncating the content of selected fields.

Compare the result below to the full content of those fields retrieved in the previous request. Note the ellipsis marks (…) where Lingo4G truncated values.

{
  "result" : {
    "documentContent" : {
      "documents" : [
        {
          "id" : 5442,
          "fields" : {
            "title" : {
              "values" : [
                "⁌q⁍Correlation⁌\\q⁍ Imaging through a scattering medium: experiment and comparison with simulations of the biphoton wave function"
              ]
            },
            "abstract" : {
              "values" : [
                "…in the image and the Fourier planes of the crystal. The spatial ⁌q⁍correlations⁌\\q⁍ between ⁌q⁍twin photons⁌\\q⁍ are still detected but no more in the form of a…"
              ]
            },
            "author_name" : {
              "values" : [
                "Soro, Gnatiessoro",
                "Lantz, Eric",
                "…"
              ]
            }
          }
        },
        {
          "id" : 347972,
          "fields" : {
            "title" : {
              "values" : [
                "Disclosing the spatio-temporal structure of PDC entanglement through frequency up-conversion"
              ]
            },
            "abstract" : {
              "values" : [
                "…this work we propose and analyse a scheme where the full spatio-temporal ⁌q⁍correlation⁌\\q⁍ of ⁌q⁍twin photons⁌\\q⁍/beams generated by parametric down-conversion is detected…"
              ]
            },
            "author_name" : {
              "values" : [
                "Brambilla, Enrico",
                "Jedrkiewicz, Ottavia",
                "…"
              ]
            }
          }
        }
      ]
    }
  }
}

This response includes a subset of author_name field values and truncated long strings in the abstract field.
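
Client code may want to tell the user when values were cut. The sketch below infers this from the ellipsis conventions visible in the sample response above: a trailing "…" entry in a values array when further values were omitted, and a leading or trailing "…" inside a value that was clipped. These conventions are read off the sample only, and the describe_field helper is hypothetical, so treat this as an assumption rather than a documented contract.

```python
ELLIPSIS = "\u2026"  # the "…" character used in the sample response


def describe_field(field):
    """Summarize truncation of a single field entry from the response."""
    values = field.get("values", [])
    # A lone ellipsis as the last entry suggests that more values exist.
    more_values = bool(values) and values[-1] == ELLIPSIS
    shown = values[:-1] if more_values else values
    # A leading or trailing ellipsis suggests the value itself was clipped.
    clipped = [v.startswith(ELLIPSIS) or v.endswith(ELLIPSIS) for v in shown]
    return {"values": shown, "more_values": more_values, "clipped": clipped}


author_name = {"values": ["Soro, Gnatiessoro", "Lantz, Eric", "…"]}
summary = describe_field(author_name)
print(summary)
```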

Query highlighting

Query in context is a standard technique of presenting search results by highlighting short fragments of text that directly correspond to the search query issued by the user. For example, for the query "twin photon" correlations we would expect those phrases to be highlighted in the returned set of fields for each document.

Use the queries property of the documentContent stage to specify one or more queries whose matching text regions Lingo4G should highlight. Typically, the queries element contains the same query as the one issued by the user, but it can contain additional or entirely different queries.

In the example below, we request two documents matching "twin photon" correlations and configure the queries property to highlight text fragments matching two queries: "twin photon" correlations and interference:

{
  "stages": {    
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "\"twin photon\" correlations"
      },
      "limit": 2
    },
    "documentContent": {
      "type": "documentContent",
      "fields":{
        "type": "contentFields:grouped",
        "groups": [
          {
            "fields": ["title"],
            "config": {
              "maxValues": "unlimited",
              "maxValueLength": "unlimited"
            }
          },
          {
            "fields": ["abstract", "author_name"],
            "config": {
              "maxValues": 2,
              "maxValueLength": 160
            }
          }
        ]
      },
      "queries": {
        "q1": {
          "type": "query:string",
          "query": "\"twin photon\" correlations"
        },
        "q2": {
          "type": "query:string",
          "query": "interference"
        }
      }
    }
  },
  "output": {
    "stages": [
      "documentContent"
    ]
  }
}

queries property used to highlight text regions matching two independent queries. We name the two queries q1 and q2 so that we can identify their match regions in the response.

Note there is no guarantee that all matching text regions will be included in the response (this depends on how the field value limits are configured). Lingo4G will try to return those regions within each document field's value that contain a maximum number of hits. For the query above, the returned response includes marked-up passages as shown below:

{
  "result" : {
    "documentContent" : {
      "documents" : [
        {
          "id" : 5442,
          "fields" : {
            "title" : {
              "values" : [
                "⁌q1⁍Correlation⁌\\q1⁍ Imaging through a scattering medium: experiment and comparison with simulations of the biphoton wave function"
              ]
            },
            "abstract" : {
              "values" : [
                "…in the image and the Fourier planes of the crystal. The spatial ⁌q1⁍correlations⁌\\q1⁍ between ⁌q1⁍twin photons⁌\\q1⁍ are still detected but no more in the form of a…"
              ]
            },
            "author_name" : {
              "values" : [
                "Soro, Gnatiessoro",
                "Lantz, Eric",
                "…"
              ]
            }
          }
        },
        {
          "id" : 347972,
          "fields" : {
            "title" : {
              "values" : [
                "Disclosing the spatio-temporal structure of PDC entanglement through frequency up-conversion"
              ]
            },
            "abstract" : {
              "values" : [
                "…this work we propose and analyse a scheme where the full spatio-temporal ⁌q1⁍correlation⁌\\q1⁍ of ⁌q1⁍twin photons⁌\\q1⁍/beams generated by parametric down-conversion is detected…"
              ]
            },
            "author_name" : {
              "values" : [
                "Brambilla, Enrico",
                "Jedrkiewicz, Ottavia",
                "…"
              ]
            }
          }
        }
      ]
    }
  }
}

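The returned field values mark highlighted regions with the default markers for each query named in the queries property (⁌q1⁍..⁌\q1⁍, ⁌q2⁍..⁌\q2⁍). Only q1 markers appear in this response because none of the returned fragments contains a match for the interference query.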
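
To display such values in a web page, the markers need to be converted into markup. The sketch below turns them into HTML mark elements whose class is the query name. It assumes the marker syntax seen in the sample responses (⁌name⁍ opens a region, ⁌\name⁍ closes it) and does not handle customized markers. The highlights_to_html function is a hypothetical helper, not part of Lingo4G.

```python
import html
import re

# Opening marker: U+204C, query name, U+204D.
# Closing marker: the same with a backslash before the name.
MARKER = re.compile(r"\u204c(\\?)([^\u204c\u204d\\]+)\u204d")


def highlights_to_html(value):
    """Escape a field value for HTML and convert highlight markers to <mark>."""
    escaped = html.escape(value)  # leaves the marker characters untouched

    def replace(match):
        closing, name = match.group(1), match.group(2)
        return "</mark>" if closing else f'<mark class="{name}">'

    return MARKER.sub(replace, escaped)


sample = "The spatial ⁌q1⁍correlations⁌\\q1⁍ between ⁌q1⁍twin photons⁌\\q1⁍ are"
print(highlights_to_html(sample))
```

Using the query name as the class lets a stylesheet color the matches of q1 and q2 differently.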

Label retrieval

The document​Labels stage retrieves labels contained in each document of the document set you provide. You can combine it with the document​Content stage to present the content and labels contained in a set of documents.

The following request selects the top 10 documents matching the photon query and retrieves up to 5 most frequent labels contained in each document.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 10
    },
    "documentLabels": {
      "type": "documentLabels",
      "maxLabels": 5
    }
  }
}

Retrieving labels for a set of documents matching the photon query.

Run the request in the JSON sandbox app to see what the label retrieval JSON response looks like. If you switch to the documents list view, you should see a graphical representation of the documents and their labels.


Lingo4G JSON Sandbox app showing document label retrieval analysis request (on the left) and the retrieved labels (on the right).

The label and content retrieval stages are similar and complementary:

  • Both stages produce an array that is index-aligned with the input documents array: entries at index n in the documents and the content or labels array refer to the same document.

  • Both stages support the start and limit properties for paged retrieval.

Use document​Labels stage results only for presentation purposes.

If you need to collect an aggregate list of labels occurring in a set of documents, use the labels:​from​Documents stage.

Label frequency thresholds

To apply frequency thresholds to the labels the documentLabels stage collects, override properties of the stage's underlying labelCollector:

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 10
    },
    "documentLabels": {
      "type": "documentLabels",
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "minTf": 2
      }
    }
  }
}

Document label retrieval with customized label frequency thresholds.

Label filtering

In its default configuration, the document​Labels stage does not apply any filtering to the list of labels it retrieves (except the label filter default component). One common label retrieval scenario is to collect a list of salient labels from a larger set of documents and then retrieve the occurrences of those labels in individual documents:

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "photon"
      },
      "limit": 10
    },
    "labels": {
      "type": "labels:fromDocuments"
    },
    "documentLabels": {
      "type": "documentLabels",
      "labelCollector": {
        "type": "labelCollector:topFromFeatureFields",
        "labelFilter": {
          "type": "labelFilter:acceptLabels",
          "labels": {
            "type": "labels:reference",
            "use": "labels"
          }
        }
      }
    }
  }
}

Limiting document label retrieval to a closed set of labels.

The above request consists of three stages:

  • The documents stage selects the top 10 documents matching the photon query.

  • The labels stage collects a set of labels that best describe the documents from the documents stage.

  • The document​Labels stage retrieves the occurrences of the salient labels for each document. We achieve this by applying the label​Filter:​accept​Labels filter configured to accept only the salient labels from the labels stage.