Document selection

Document selection produces a set of documents you can use at later stages of your analyses, for example for clustering or 2D mapping.

Most of your analysis requests will operate on some subset of documents stored in the Lingo4G index. The documents:​* category of analysis stages groups various ways of specifying which documents to select for processing.

This article is an overview of the available document selector stages and their typical use cases. For in-depth descriptions of specific selectors and their properties, see the document selector API reference.

This article assumes you are familiar with the structure and concepts behind Lingo4G analysis request JSONs.

Common document selectors

The following stages should cover the document selection needs of most typical analysis requests.

documents:​by​Query

The documents:​by​Query stage selects documents that match the query you provide. Coupled with the query:​string component, which parses Lucene-like query syntax, documents:​by​Query is the most likely source of documents in your analyses.

Let's use the documents:​by​Query stage to select the top 100 arXiv abstracts containing the dark energy phrase and created in 2016 or later.

{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "\"dark energy\" AND created:[2016-01-01 TO *]"
      },
      "limit": 100
    }
  }
}

Using the documents:byQuery stage and the query:string component to select the top 100 documents matching a string query.

If you execute the above request in the JSON Sandbox app, you should see a result similar to the following JSON. Note that we present only the top 5 results for brevity. In the real response, the number of elements in the documents array is not greater than the limit value you set in the request.

{
  "result" : {
    "documents" : {
      "matches" : {
        "value" : 1238,
        "relation" : "GREATER_OR_EQUAL"
      },
      "documents" : [
        {
          "id" : 366765,
          "weight" : 10.426125
        },
        {
          "id" : 402100,
          "weight" : 10.3851595
        },
        {
          "id" : 338037,
          "weight" : 10.305012
        },
        {
          "id" : 289242,
          "weight" : 10.225209
        },
        {
          "id" : 316110,
          "weight" : 10.218446
        }
      ]
    }
  }
}

Document selection JSON response.

Document selection JSON output contains the documents array, which holds the internal identifier and weight (importance) of each selected document. The semantics of the document's weight property depend on the specific document selection stage. In the case of the documents:byQuery stage, each document's weight is the search score returned by Apache Lucene, which Lingo4G uses to perform query-based searches.

Some document selection stages may add extra information on top of the list of selected documents. In our case, documents:​by​Query adds the matches section, which shows the total number of documents matching the query. The number of matches may be larger than the document selection limit you provide in the request. See the documents:​by​Query reference documentation for a detailed description of its output JSON.

documents:embeddingNearestNeighbors

The documents:embeddingNearestNeighbors stage selects the documents that are most semantically similar to the multidimensional embedding vector you provide. In contrast to documents:byQuery, which requires certain words to be present in the selected documents, the embedding-based document selection performs a more "fuzzy", semantics-based matching. You may use the embedding-based document selection to discover documents that are hard to find using keyword-based methods.

Heads up, embedding learning required.

To use embedding-based selectors, your index must contain label and document embedding vectors. See the learning embeddings article for detailed instructions.

Document-based selection

Let's select documents that are semantically similar to one specific seed document. We'll break the request down into three stages:

  1. Selecting the seed document using the documents:​by​Query stage.

  2. Retrieving the embedding vector of the seed document using the vector:​document​Embedding stage.

  3. Retrieving the semantically similar documents using the documents:​embedding​Nearest​Neighbors stage.

{
  "name": "Selecting documents semantically-similar to another document",
  "stages": {
    "seedDocuments": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "id:1703.01028"
      }
    },
    "seedVector": {
      "type": "vector:documentEmbedding",
      "documents": {
        "type": "documents:reference",
        "use": "seedDocuments"
      }
    },
    "similarDocuments": {
      "type": "documents:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:reference",
        "use": "seedVector"
      },
      "limit": 20
    }
  },
  "output": {
    "stages": [
      "seedDocuments",
      "similarDocuments"
    ]
  }
}

Selecting documents that are semantically similar to one specific seed document (request).

The request contains three named stages corresponding to the above list: seedDocuments, seedVector and similarDocuments. Note how the request uses stage references to pass results from one stage to another. We use the $.output.stages property to output only the results of the document stages; the output of the vector stage is not relevant.

If you run the above request in the JSON Sandbox app, you should see a response similar to the following JSON.

{
  "result" : {
    "seedDocuments" : {
      "matches" : {
        "value" : 1,
        "relation" : "EXACT"
      },
      "documents" : [
        {
          "id" : 0,
          "weight" : 5.78041
        }
      ]
    },
    "similarDocuments" : {
      "documents" : [
        {
          "id" : 0,
          "weight" : 0.9999999
        },
        {
          "id" : 337110,
          "weight" : 0.84642214
        },
        {
          "id" : 396695,
          "weight" : 0.81967276
        },
        {
          "id" : 43114,
          "weight" : 0.80386704
        },
        {
          "id" : 420151,
          "weight" : 0.80064183
        }
      ]
    }
  }
}

Selecting documents that are semantically similar to one specific seed document (response).

The $.result.seedDocuments section contains the identifier of the seed document, while the $.result.similarDocuments section contains the documents whose embedding vectors lie close to the seed document's vector. The documents:embeddingNearestNeighbors stage computes document weights as the dot product between the search vector and the result document's vector, normalized to the 0...1 range. In our case, the first document on the list is the same as the seed document, hence the weight of 1.0. Also note that the embedding-based document selector does not output the matches section.
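
As a rough mental model (an assumption on our part, not a statement of the exact formula Lingo4G uses): if the embedding vectors have unit length, their dot product equals the cosine similarity in the -1...1 range, and a rescaling such as

weight = (1 + cos θ) / 2

maps it to the 0...1 range, where θ is the angle between the search vector and the document's vector. This reading is consistent with the seed document scoring (nearly) 1.0 against itself.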

Looking at the response alone, it is not possible to tell whether the selected documents are indeed semantically similar to the seed document. You can use the documentContent stage to retrieve the contents, such as the title or abstract, of the documents returned by the document selection stages. See the document content retrieval tutorial for detailed explanations and example requests.
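
For illustration, the following fragment shows one way to fetch that content (a sketch only: the similarDocumentsContent stage name is our own, and the fields configuration mirrors the documentContent usage you will see later in this article). You could add it to the stages section of the previous request:

{
  "similarDocumentsContent": {
    "type": "documentContent",
    "documents": {
      "type": "documents:reference",
      "use": "similarDocuments"
    },
    "fields": {
      "type": "contentFields:grouped",
      "groups": [
        {
          "fields": ["title", "abstract"],
          "config": {
            "maxValueLength": 2048
          }
        }
      ]
    }
  }
}

A sketch of a documentContent stage retrieving the title and abstract of each document selected by the similarDocuments stage.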

Label-based selection

Lingo4G can also learn multidimensional embedding vectors for labels. Therefore, instead of the seed document vector, you can pass a label vector to the documents:​embedding​Nearest​Neighbors stage. In this arrangement, Lingo4G selects documents that are semantically similar to one or more labels you provide.

The following request returns 20 documents whose embedding vectors lie closest to the embedding vector of the LIGO label (which stands for Laser Interferometer Gravitational-Wave Observatory).

{
  "name": "Selecting documents semantically-similar to a label but not containing that label",
  "stages": {
    "similarDocuments": {
      "type": "documents:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:labelEmbedding",
        "labels": {
          "type": "labels:direct",
          "labels": [
            {
              "label": "LIGO"
            }
          ]
        }
      },
      "limit": 100
    }
  }
}

Selecting documents that are semantically similar to the LIGO label.

This request inlines all the necessary dependencies into the documents:embeddingNearestNeighbors stage. The vector property contains the vector:labelEmbedding stage, which in turn uses the labels:direct stage to provide a literal label.
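
Inlining is a matter of style. The same selection can also be written with named stages and explicit references, mirroring the seed-document example above (a sketch; both variants should produce the same result):

{
  "stages": {
    "seedLabels": {
      "type": "labels:direct",
      "labels": [
        {
          "label": "LIGO"
        }
      ]
    },
    "seedVector": {
      "type": "vector:labelEmbedding",
      "labels": {
        "type": "labels:reference",
        "use": "seedLabels"
      }
    },
    "similarDocuments": {
      "type": "documents:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:reference",
        "use": "seedVector"
      },
      "limit": 20
    }
  },
  "output": {
    "stages": [
      "similarDocuments"
    ]
  }
}

The label-based selection rewritten with named stages and references (sketch).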

If you run the above request in the JSON Sandbox app, you should see a list of matching document identifiers, along with the 0...1 similarity scores. Again, to verify that the resulting documents relate to the seed label, you can use a documentContent stage to retrieve the titles and abstracts of the papers. See the Document content retrieval article for a complete tutorial.

For the curious: demonstrating the utility of embedding-based selection.

We can extend the above request to return the embedding-wise similar documents, but only those that do not contain the LIGO keyword. These are the related documents that are hard or impossible to find using traditional keyword-based methods.

{
  "name": "Selecting documents semantically-similar to a label but not containing that label",
  "stages": {
    "similarDocuments": {
      "type": "documents:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:labelEmbedding",
        "labels": {
          "type": "labels:direct",
          "labels": [
            {
              "label": "LIGO"
            }
          ]
        }
      },
      "limit": 100
    },
    "similarDocumentsWithoutKeywords": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:composite",
        "queries": [
          {
            "type": "query:fromDocuments",
            "documents": {
              "type": "documents:reference",
              "use": "similarDocuments"
            }
          },
          {
            "type": "query:complement",
            "query": {
              "type": "query:string",
              "query": "LIGO"
            }
          }
        ],
        "operator": "AND"
      }
    },
    "similarDocumentsWithoutKeywordsContent": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "similarDocumentsWithoutKeywords"
      },
      "fields": {
        "type": "contentFields:grouped",
        "groups": [
          {
            "fields": ["title", "abstract"],
            "config": {
              "maxValueLength": 2048
            }
          }
        ]
      }
    }
  },
  "output": {
    "stages": [
      "similarDocumentsWithoutKeywordsContent",
      "similarDocuments"
    ]
  }
}

Selecting documents that are semantically-similar to the LIGO label but do not contain the LIGO word.

The similarDocuments stage is the same as in the previous request, with the retrieval limit increased to 100.

The similarDocumentsWithoutKeywords stage uses the documents:byQuery stage to remove from similarDocuments those documents that contain the LIGO word. To this end, the request uses the query:composite, query:fromDocuments and query:complement components to intersect the list of all similar documents with those documents that do not contain the LIGO word.

Finally, the similar​Documents​Without​Keywords​Content stage retrieves the titles and abstracts of the selected documents to confirm that they are related to the seed label but do not contain it.

Let's examine the top results Lingo4G returns for this request.

{
  "result" : {
    "similarDocumentsWithoutKeywordsContent" : {
      "documents" : [
        {
          "id" : 54520,
          "fields" : {
            "title" : {
              "values" : [
                "Searching for the full symphony of black hole binary mergers"
              ]
            },
            "abstract" : {
              "values" : [
                " Current searches for the gravitational-wave signature of compact binary mergers rely on matched-filtering data from interferometric observatories with sets of modelled gravitational waveforms. These searches currently use model waveforms that do not include the higher-order mode content of the gravitational-wave signal. Higher-order modes are important for many compact binary mergers and their omission reduces the sensitivity to such sources. In this work we explore the sensitivity loss incurred from omitting higher-order modes. We present a new method for searching for compact binary mergers using waveforms that include higher-order mode effects, and evaluate the sensitivity increase that using our new method would allow. We find that, when evaluating sensitivity at a constant rate-of-false alarm, and when including the fact that signal-consistency tests can reject some signals that include higher-order mode content, we observe a sensitivity increase of up to a factor of 2 in volume for high mass ratio, high total-mass systems. For systems with equal mass, or with total mass ∼ 50 M_(⊙), we see more modest sensitivity increases, < 10%, which indicates that the existing search is already performing well. Our new search method is also directly applicable in searches for generic compact binaries. "
              ]
            }
          }
        },
        {
          "id" : 111245,
          "fields" : {
            "title" : {
              "values" : [
                "Gravitational Wave Detector Sites"
              ]
            },
            "abstract" : {
              "values" : [
                " Locations and orientations of current and proposed laser-interferometric gravitational wave detectors are given in tabular form. "
              ]
            }
          }
        },
        {
          "id" : 120423,
          "fields" : {
            "title" : {
              "values" : [
                "Observation results by the TAMA300 detector on gravitational wave bursts from stellar-core collapses"
              ]
            },
            "abstract" : {
              "values" : [
                " We present data-analysis schemes and results of observations with the TAMA300 gravitational-wave detector, targeting burst signals from stellar-core collapse events. In analyses for burst gravitational waves, the detection and fake-reduction schemes are different from well-investigated ones for a chirp-wave analysis, because precise waveform templates are not available. We used an excess-power filter for the extraction of gravitational-wave candidates, and developed two methods for the reduction of fake events caused by non-stationary noises of the detector. These analysis schemes were applied to real data from the TAMA300 interferometric gravitational wave detector. As a result, fake events were reduced by a factor of about 1000 in the best cases. The resultant event candidates were interpreted from an astronomical viewpoint. We set an upper limit of 2.2x10³ events/sec on the burst gravitational-wave event rate in our Galaxy with a confidence level of 90%. This work sets a milestone and prospects on the search for burst gravitational waves, by establishing an analysis scheme for the observation data from an interferometric gravitational wave detector. "
              ]
            }
          }
        },
        {
          "id" : 133925,
          "fields" : {
            "title" : {
              "values" : [
                "Sidereal time analysis as a toll for the study of the space distribution of sources of gravitational waves"
              ]
            },
            "abstract" : {
              "values" : [
                " Gravitational wave (GW) detectors operating on a long time range can be used for the study of space distribution of sources of GW bursts or to put strong upper limits on the GW signal of a wide class of source candidates. For this purpose we propose here a sidereal time analysis to analyze the output signal of GW detectors. Using the characteristics of some existing detectors, we demonstrate the capability of the sidereal time analysis to give a clear signature of different localizations of GW sources: the Galactic Center, the Galactic Plane, the Supergalactic plane, the Great Attractor. On the contrary, a homogeneous 3D-distribution of GW sources gives a signal without features. In this paper we consider tensor gravitational waves with randomly oriented polarization. We consider GW detectors at fixed positions on the Earth, and a fixed orientation of the antenna. "
              ]
            }
          }
        },
        {
          "id" : 220685,
          "fields" : {
            "title" : {
              "values" : [
                "Optimal statistic for detecting gravitational wave signals from binary inspirals with LISA"
              ]
            },
            "abstract" : {
              "values" : [
                " A binary compact object early in its inspiral phase will be picked up by its nearly monochromatic gravitational radiation by LISA. But even this innocuous appearing candidate poses interesting detection challenges. The data that will be scanned for such sources will be a set of three functions of LISA's twelve data streams obtained through time-delay interferometry, which is necessary to cancel the noise contributions from laser-frequency fluctuations and optical-bench motions to these data streams. We call these three functions pseudo-detectors. The sensitivity of any pseudo-detector to a given sky position is a function of LISA's orbital position. Moreover, at a given point in LISA's orbit, each pseudo-detector has a different sensitivity to the same sky position. In this work, we obtain the optimal statistic for detecting gravitational wave signals, such as from compact binaries early in their inspiral stage, in LISA data. We also present how the sensitivity of LISA, defined by this optimal statistic, varies as a function of sky position and LISA's orbital location. Finally, we show how a real-time search for inspiral signals can be implemented on the LISA data by constructing a bank of templates in the sky positions. "
              ]
            }
          }
        }
      ]
    },
    "similarDocuments" : {
      "documents" : [
        {
          "id" : 363250,
          "weight" : 0.9303515
        },
        {
          "id" : 14350,
          "weight" : 0.92936116
        },
        {
          "id" : 10587,
          "weight" : 0.9269646
        },
        {
          "id" : 75149,
          "weight" : 0.925859
        },
        {
          "id" : 462524,
          "weight" : 0.92563987
        }
      ]
    }
  }
}

Documents that are semantically similar to the LIGO label but do not contain the LIGO word.

Document 54520 talks about searching for the gravitational-wave signature of black hole binary mergers, so it's very much related: detecting such mergers is what LIGO does. Document 111245 does not contain the LIGO word, but it describes laser-interferometric gravitational wave detectors, which is the acronym nearly spelled out. Further documents talk about various aspects of gravitational wave detection, which is again closely related to what LIGO does.

documents:sample

The documents:​sample stage takes a random sample of the documents matching the query you provide. In many cases you can save time and resources by processing a random subset of a large document set instead of the whole set.
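
Stripped to its essentials, documents:sample needs only a query and a samplingRatio. The following minimal sketch (it uses only the properties that also appear in the larger example below) selects a random 10% sample of the documents matching a query:

{
  "stages": {
    "documents": {
      "type": "documents:sample",
      "samplingRatio": 0.1,
      "query": {
        "type": "query:string",
        "query": "\"dark energy\""
      }
    }
  }
}

A minimal documents:sample request (sketch).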

One natural use case for documents:​sample is computing the occurrence statistics for a list of labels. The following request computes the numbers of occurrences of the photon, electron and proton labels across papers published between 2006 and 2008.

{
  "name": "Frequency estimates for a list of labels",
  "components": {
    "scope": {
      "type": "query:string",
      "query": "created:[2006-01-01 TO 2008-12-31]"
    }
  },
  "stages": {
    "labels": {
      "type": "labels:direct",
      "labels": [
        {
          "label": "photon"
        },
        {
          "label": "electron"
        },
        {
          "label": "proton"
        }
      ]
    },
    "tfSample": {
      "type": "labels:scored",
      "scorer": {
        "type": "labelScorer:tf",
        "scope": {
          "type": "documents:sample",
          "samplingRatio": 0.1,
          "query": {
            "type": "query:reference",
            "use": "scope"
          }
        }
      },
      "labels": {
        "type": "labels:reference",
        "use": "labels"
      }
    },
    "tf": {
      "type": "labels:scored",
      "scorer": {
        "type": "labelScorer:tf",
        "scope": {
          "type": "documents:byQuery",
          "query": {
            "type": "query:reference",
            "use": "scope"
          },
          "limit": "unlimited"
        }
      },
      "labels": {
        "type": "labels:reference",
        "use": "labels"
      }
    }
  },
  "output": {
    "stages": [
      "labels",
      "tfSample",
      "tf"
    ]
  }
}

Computing the numbers of occurrences of the photon, electron and proton labels across papers published between 2006 and 2008.

The scope component is the query defining the subset of documents for which to compute the occurrence frequencies. The labels stage uses the labels:direct stage to provide the list of labels for which to compute frequencies. The tfSample stage computes the estimated occurrence counts. Notice how we use the documents:sample stage in the scope property to take a 10% sample of all the documents matched by the scope query. For comparison, the tf stage computes the same statistics using all documents in scope.

If you run the above request in the JSON Sandbox app, you should see a result similar to the following JSON.

{
  "result" : {
    "labels" : {
      "labels" : [
        {
          "label" : "photon"
        },
        {
          "label" : "electron"
        },
        {
          "label" : "proton"
        }
      ]
    },
    "tfSample" : {
      "labels" : [
        {
          "label" : "photon",
          "weight" : 1950.2227
        },
        {
          "label" : "electron",
          "weight" : 3220.3677
        },
        {
          "label" : "proton",
          "weight" : 730.0833
        }
      ]
    },
    "tf" : {
      "labels" : [
        {
          "label" : "photon",
          "weight" : 1777.0
        },
        {
          "label" : "electron",
          "weight" : 3176.0
        },
        {
          "label" : "proton",
          "weight" : 689.0
        }
      ]
    }
  },
  "status" : {
    "status" : "AVAILABLE",
    "elapsedMs" : 1628,
    "tasks" : [
      {
        "name" : "tf",
        "status" : "DONE",
        "progress" : 1.0,
        "startedAt" : 1705010276760,
        "elapsedMs" : 1470,
        "tasks" : [
          {
            "name" : "Computing term frequencies",
            "status" : "DONE",
            "progress" : 1.0,
            "startedAt" : 1705010276777,
            "elapsedMs" : 1454,
            "tasks" : [ ],
            "attributes" : [ ]
          }
        ],
        "attributes" : [ ]
      },
      {
        "name" : "tfSample",
        "status" : "DONE",
        "progress" : 1.0,
        "startedAt" : 1705010278231,
        "elapsedMs" : 157,
        "tasks" : [
          {
            "name" : "Computing term frequencies",
            "status" : "DONE",
            "progress" : 1.0,
            "startedAt" : 1705010278234,
            "elapsedMs" : 154,
            "tasks" : [ ],
            "attributes" : [ ]
          }
        ],
        "attributes" : [ ]
      }
    ]
  }
}

The numbers of occurrences of the photon, electron and proton labels across papers published between 2006 and 2008.

The tfSample section contains the estimated numbers of occurrences (the weight property), while the tf section shows the accurate values computed using all documents in scope. Notice that the estimates can be either larger or smaller than the actual value, sometimes by a noticeable margin, as is the case with the proton label. Also, in most cases the estimates will contain fractional parts due to the scaling Lingo4G applies as part of the sampling process.
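
One way to think about the scaling (our simplification; the exact estimator Lingo4G uses may be more involved): counts observed in the sample are scaled up by roughly the inverse of the sampling ratio,

estimated count ≈ count observed in sample × (1 / samplingRatio)

so with the samplingRatio of 0.1, roughly 195 photon occurrences observed in the sample would scale to an estimate near the reported 1950.22.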

The response also contains the status section, which describes the specific tasks Lingo4G performed to process the request. The elapsedMs property shows the time Lingo4G took to complete each task. Notice that computing the estimated frequencies took 157 ms, roughly nine times faster than the 1470 ms needed to compute the accurate result. For large scopes this may be a reduction from minutes to seconds.

Common use cases

The output of a document selection stage contains very limited information on its own: just a list of internal document identifiers and their weights. Practical requests will usually combine document selection with other stages to obtain results meaningful to end users.

Source of documents for other stages

Typically, the documents:* stages provide input for other types of stages. For example, the documentContent stage can retrieve the content of the selected documents, label scoring stages can compute occurrence statistics over them, and clustering or 2D mapping stages can analyze them further.

Counting documents

You can use the documents:byQuery stage with its limit property set to 0 to count the numbers of documents matching different criteria.

The following request computes the numbers of documents containing the deep learning phrase in arXiv articles published in 2012, 2014, 2016 and 2018.

{
  "name": "Computing the number of papers containing the 'deep learing' phrase in 2012, 2014, 2016 and 2018.",
  "components": {
    "query": {
      "type": "query:string",
      "query": "\"deep learning\""
    }
  },
  "stages": {
    "2012": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:filter",
        "query": {
          "type": "query:reference",
          "use": "query"
        },
        "filter": {
          "type": "query:string",
          "query": "created:2012*"
        }
      },
      "limit": 0
    },
    "2014": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:filter",
        "query": {
          "type": "query:reference",
          "use": "query"
        },
        "filter": {
          "type": "query:string",
          "query": "created:2014*"
        }
      },
      "limit": 0
    },
    "2016": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:filter",
        "query": {
          "type": "query:reference",
          "use": "query"
        },
        "filter": {
          "type": "query:string",
          "query": "created:2016*"
        }
      },
      "limit": 0
    },
    "2018": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:filter",
        "query": {
          "type": "query:reference",
          "use": "query"
        },
        "filter": {
          "type": "query:string",
          "query": "created:2018*"
        }
      },
      "limit": 0
    }
  }
}

Counting the papers containing the deep learning phrase and published in 2012, 2014, 2016 and 2018.

The request defines the search phrase part of the query, which is common to all counting periods, in the components section, so that all stages can reuse it. In the stages section, the request defines four stages corresponding to the annual periods in which we want to count documents. Each such stage uses the query:​filter to intersect the phrase part of the query with the counting period. Each stage sets the limit property to zero, so that Lingo4G only counts the matches, which is usually faster than selecting the identifiers of the matching documents.

If you run the request in the JSON Sandbox app, you should get a response similar to the following JSON.

{
  "result" : {
    "2012" : {
      "matches" : {
        "value" : 0,
        "relation" : "EXACT"
      },
      "documents" : [ ]
    },
    "2014" : {
      "matches" : {
        "value" : 16,
        "relation" : "EXACT"
      },
      "documents" : [ ]
    },
    "2016" : {
      "matches" : {
        "value" : 155,
        "relation" : "EXACT"
      },
      "documents" : [ ]
    },
    "2018" : {
      "matches" : {
        "value" : 791,
        "relation" : "EXACT"
      },
      "documents" : [ ]
    }
  }
}

Numbers of papers containing the deep learning phrase and published in 2012, 2014, 2016 and 2018.

As expected, the number of papers containing the deep learning phrase grows rapidly after 2012.