Glossary
This chapter provides basic definitions of the terms used throughout the Lingo4G documentation.
- Analysis
-
The process in which Lingo4G performs the text mining tasks you request.
- Analysis request
-
A JSON-based specification of the text mining tasks for Lingo4G to execute.
- Document
-
A basic unit of content processed by Lingo4G, such as a scientific paper, a business or legal document, a blog post or a social media message. Each document consists of one or more fields, which correspond to the natural parts of the document, such as the title, summary, publication date or user-generated tags.
- Embeddings
-
A representation of labels and documents based on multidimensional vectors. Embedding vectors aim to capture semantic relationships between labels and documents, so that Lingo4G can, for example, cluster similar documents even if they don't share common labels.
- Field
-
A natural part of a document. Typically, each document consists of multiple fields, such as title, abstract, body, creation date and human-assigned keywords.
Lingo4G distinguishes two types of fields:
- Indexed field
- A representation of a document field in the Lingo4G index. You can use indexed fields in document selection queries.
- Feature field
- A representation of a document's labels in the Lingo4G index. Lingo4G uses feature fields during analysis to, for example, collect labels that describe a set of documents.
- Index
-
Stores all the data Lingo4G needs to perform analysis: documents, features and additional data structures, such as multidimensional embeddings.
A single project contains exactly one index.
- Indexing
-
The process in which Lingo4G imports your document collection into the internal storage and creates data structures required to analyze the documents.
- Incremental indexing
-
Adding or updating documents in an existing Lingo4G index.
- kNN index
-
k-Nearest-Neighbors index: a data structure Lingo4G uses to quickly find the list of multidimensional embedding vectors that are most similar to a specific vector. Lingo4G builds separate dedicated kNN indices for label and document embedding vectors during embedding learning. Lingo4G stores kNN indices on disk as an integral part of its index.
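To make the idea of a nearest-neighbor lookup concrete, the following minimal sketch finds the vectors most similar to a query vector by brute-force cosine similarity. It only illustrates the question a kNN index answers. It is not Lingo4G's implementation, and a real kNN index avoids scanning every vector. The function names and the three sample label vectors are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat it as dissimilar
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k):
    """Return the names of the k vectors most similar to the query.

    This is an exhaustive scan over all vectors; a kNN index exists
    precisely to avoid this cost on large collections.
    """
    scored = ((cosine_similarity(query, vec), name) for name, vec in vectors.items())
    return [name for _, name in heapq.nlargest(k, scored)]


# Hypothetical 3-dimensional label embeddings, for illustration only.
label_vectors = {
    "neural network": [0.9, 0.1, 0.0],
    "deep learning": [0.8, 0.2, 0.1],
    "clinical trial": [0.0, 0.1, 0.9],
}

print(nearest_neighbors([1.0, 0.0, 0.0], label_vectors, k=2))
```

With these sample vectors, the two labels closest to the query are the ones pointing in nearly the same direction, even though the strings share no words.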
- Label
-
A textual representation of a feature characterizing a document. In typical configurations, Lingo4G chooses labels from the set of non-trivial words and phrases contained in documents.
- Project descriptor
-
Defines all the necessary information to index and analyze one collection of documents. The project descriptor is a JSON file that contains the definition of fields, document source, indexing configuration and defaults for running analyses.
- Re-indexing
-
The process of bringing the document features up to date after adding, updating or deleting documents from an existing Lingo4G index.
- Stop label
-
A label that carries no significant meaning in the context of the project's collection of documents. Stop labels usually include common function words, such as "the" or "for", but also domain-specific phrases. For example, in the context of medical articles these could be phrases such as "studies suggest" or "control group".
Lingo4G tries to detect stop labels automatically during indexing. Later on, during analysis, Lingo4G by default excludes stop labels from processing.
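The effect of excluding stop labels during analysis can be pictured with the short sketch below. It is an illustration under stated assumptions, not Lingo4G's actual mechanism: the function name is hypothetical, the stop label list is hand-written rather than detected automatically, and case-insensitive matching is an assumption.

```python
def exclude_stop_labels(candidate_labels, stop_labels):
    """Return the candidate labels that are not stop labels.

    Matching is case-insensitive here; that is an assumption made
    for this illustration only.
    """
    stop = {label.casefold() for label in stop_labels}
    return [label for label in candidate_labels if label.casefold() not in stop]


# Hand-written stop labels, standing in for automatically detected ones.
stop_labels = ["the", "for", "studies suggest", "control group"]
candidates = ["The", "control group", "gene therapy", "studies suggest"]

print(exclude_stop_labels(candidates, stop_labels))
```

Only labels that carry meaning for the collection, such as "gene therapy" in this sample, remain available for describing documents.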