Stages and components
This reference lists the stage and component types you can use to build Lingo4G analysis requests.
You can use the following types of stages in your analysis requests. Each stage type groups stages returning the same kind of result, such as a list of documents, labels or a matrix.
-
documents:​*
-
Returns a list of documents. Specific stages return documents based on search queries, nearest-neighbor searches in vector space and many other criteria.
A document list is usually an input to other types of stages, such as document content retrieval, label collection or similarity matrix computation, and ultimately clustering and 2D embedding.
-
labels:​*
-
Returns a list of labels. Specific stages return labels collected from documents, arbitrary pieces of text or nearest-neighbor searches in the vector space.
-
matrix:​*
-
Returns a matrix of similarities between documents or labels. Matrices are typically inputs for clustering and 2D embedding stages.
-
embedding2d:​*
-
Lays out entities, such as documents or labels, on a 2D map, putting similar entities close to each other. Requires a similarity matrix on input.
-
clusters:​*
-
Organizes entities, such as documents or labels, into clusters based on a matrix of similarities between those entities.
-
label​Clusters:​*
-
For the document clusters you provide, produces aligned clusters of labels that describe each document cluster.
-
document​Content
-
Retrieves contents of the documents you provide, optionally highlighting the occurrences of a list of search queries.
-
document​Labels
-
For each document on the document list you provide, retrieves the document's labels.
-
document​Pairs:​*
-
Returns a stream of paired documents, such as pairs of near-duplicates or otherwise very similar documents.
-
document​Overlap
-
Returns information about duplicate text regions in the document pairs you provide.
-
vector:​*
-
Retrieves multidimensional vectors corresponding to the documents or labels you provide. You can use those vectors to search for other semantically similar documents or labels.
-
vectors:​*
-
Defines the multidimensional embedding to use for building vector-based similarity matrices.
-
values:​*
-
Returns a list of values of an arbitrary type. For example, for the document list you provide, retrieves an aligned list of the values of one document field.
-
stats:​*
-
Returns statistics about a specific analysis result, such as a document list.
-
debug:​*
-
Provides various utility stages useful for debugging.
The following component types are available for use in analysis request JSONs. Components do not return any results, but rather provide a piece of configuration for a stage or another component.
-
query:​*
-
Defines a query that selects a list of documents. There are specific query components for string-based Lucene queries, queries based on the labels you provide or composite queries.
-
label​Collector:​*
-
Collects labels from a single document. Label collectors play a crucial role when fetching labels from documents or computing similarities between documents.
-
label​Aggregator:​*
-
Aggregates labels collected from multiple individual documents into a single list of labels.
-
label​Filter:​*
-
Accepts or rejects an individual label. You can use label filters to shape the label lists returned by label collectors. Specific filters take into account the number of characters or words in a label, or use a dictionary or a list of automatically discovered meaningless labels.
-
label​List​Filter:​*
-
Accepts or rejects labels based on their relation to other labels on the label list. You can use label list filters to shape the label lists returned by label collectors.
-
label​Scorer:​*
-
Computes a numerical score for a label, such as document or term frequency.
-
feature​Fields:​*
-
Specifies a list of fields from which to collect labels.
-
content​Fields:​*
-
Specifies a list of fields whose content to retrieve for display purposes.
-
fields:​*
-
Specifies a general list of fields, allowing both feature and label fields.
-
feature​Source:​*
-
Converts fields into a stream of comparable and hashable objects. Feature sources play an important role in duplicate document and overlapping region detection.
-
pairwise​Similarity:​*
-
Computes the degree of overlap between a pair of documents for the purposes of duplicate document detection.
-
matrix​Rows:​*
-
Computes individual rows of a similarity matrix. Matrix row sources allow certain stages, such as the document contrast scorer, to perform calculations without materializing a large similarity matrix in the main memory.
-
dictionary:​*
-
Defines ad-hoc dictionaries used, for example, in label filtering.
-
label​Count:​*
-
Defines a label count limit used, for example, when retrieving labels from documents.
-
query​Parser:​*
-
Request-time query parser definitions or references to the project descriptor's query parsers.
-
document​Scorer:​*
-
Computes scores for documents based on different criteria.
-
query​Builder:​*
-
Builds search queries based on dynamically-changing inputs, such as values of document content fields.
-
embedding​Service:​*
-
An external service capable of returning embedding vectors for a snippet of text.
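As a rough illustration of how stages and components fit together, the fragment below sketches a request in which one documents:* stage is configured by a query:* component. Everything specific in it is an assumption made for illustration and is not taken from the lists above: the top-level "stages" key, the "type" property, the stage name "documents", the concrete type names "documents:byQuery" and "query:string", and the query text. Check the reference page of each stage and component for the actual type names and properties.

```json
{
  "stages": {
    "documents": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": "example search terms"
      }
    }
  }
}
```

In this sketch the stage is the part that returns a result (a list of documents), while the nested query component returns nothing on its own and only supplies configuration to the stage.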