Indexing

During indexing, Lingo4G imports your document collection into the internal storage and creates data structures required to analyze the documents.

To index your documents, Lingo4G performs a number of steps described in the following sections.

Importing documents and creating inverted indexes

In the initial step of indexing, Lingo4G requests documents from the document source defined in your project descriptor. Lingo4G copies all the documents into its internal storage (a custom Lucene index). Among other things, the internal storage serves as an inverted index that lets you slice and dice your documents using text queries.
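To illustrate what an inverted index does, here is a minimal Python sketch. It is not Lingo4G's or Lucene's implementation; the function names build_inverted_index and search_all, the whitespace tokenization and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Copy the first posting set so intersections do not modify the index.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample collection.
docs = {
    1: "solar panel efficiency",
    2: "solar wind measurements",
    3: "wind turbine efficiency",
}
index = build_inverted_index(docs)
print(sorted(search_all(index, "solar efficiency")))  # prints [1]
```

Because the index maps tokens to documents rather than documents to tokens, a query only needs to intersect a few small sets instead of scanning every document.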

Feature discovery

In this step, Lingo4G tags each document with labels, that is, textual representations of features that characterize the document. Labels based on frequently occurring, non-trivial words and phrases work well for typical analytical tasks. Using the features section of the project descriptor, you can configure the details of the feature extraction process or add a different feature extractor, such as the dictionary extractor.

Lingo4G performs feature discovery automatically during indexing. You can also request Lingo4G to recompute features for an existing index using the reindex command. This is useful when you change the parameters or dictionaries involved in feature extraction.
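The idea of labels based on frequently occurring, non-trivial words and phrases can be sketched in Python as follows. This is a simplified illustration, not Lingo4G's feature extractor; the TRIVIAL word list, the min_df threshold and the function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical, tiny list of trivial words; a real extractor would use richer resources.
TRIVIAL = {"the", "of", "a", "in", "and", "for", "is"}


def candidate_labels(text, max_len=2):
    """Return word n-grams (n <= max_len) that neither start nor end with a trivial word."""
    words = text.lower().split()
    labels = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in TRIVIAL or gram[-1] in TRIVIAL:
                continue
            labels.add(" ".join(gram))
    return labels


def discover_labels(docs, min_df=2):
    """Tag each document with the candidate labels found in at least min_df documents."""
    df = Counter()
    per_doc = {}
    for doc_id, text in docs.items():
        per_doc[doc_id] = candidate_labels(text)
        df.update(per_doc[doc_id])  # each label counts once per document
    frequent = {label for label, count in df.items() if count >= min_df}
    return {doc_id: sorted(labels & frequent) for doc_id, labels in per_doc.items()}


# Hypothetical sample collection.
docs = {
    1: "the efficiency of solar panel arrays",
    2: "solar panel cost in the desert",
    3: "wind turbine cost",
}
tagged = discover_labels(docs)
print(tagged[1])  # prints ['panel', 'solar', 'solar panel']
```

Changing min_df or the trivial word list changes which labels each document receives, which is why features need to be recomputed after such parameters change.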

Stop label extraction

After feature discovery is complete, Lingo4G attempts to identify stop labels specific to your collection of documents. Stop labels are labels that do not differentiate documents very well. For example, when indexing e-mails, the stop labels could include phrases like "kind regards" or "attachment". In medical articles, the stop labels would likely include words and phrases like "indicate", "studies suggest" or "control group".

Stop label extraction has access to all documents and all label statistics after feature extraction, so it can make more intelligent choices than feature extractors alone.

To improve the quality of analysis results, Lingo4G ignores stop labels during analysis.
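One simple way to picture stop label detection is a document frequency heuristic: a label that appears in most documents does little to tell them apart. The Python sketch below assumes this heuristic for illustration only; it is not the algorithm Lingo4G uses, and the max_df_ratio threshold and function names are hypothetical.

```python
def find_stop_labels(doc_labels, max_df_ratio=0.7):
    """Return labels that occur in more than max_df_ratio of all documents."""
    total = len(doc_labels)
    df = {}
    for labels in doc_labels.values():
        for label in set(labels):  # count each label once per document
            df[label] = df.get(label, 0) + 1
    return {label for label, count in df.items() if count / total > max_df_ratio}


def remove_stop_labels(doc_labels, stop_labels):
    """Return a copy of doc_labels with the stop labels filtered out."""
    return {
        doc_id: [label for label in labels if label not in stop_labels]
        for doc_id, labels in doc_labels.items()
    }


# Hypothetical labels assigned to four e-mails.
doc_labels = {
    1: ["kind regards", "invoice", "attachment"],
    2: ["kind regards", "meeting agenda", "attachment"],
    3: ["kind regards", "invoice", "attachment"],
    4: ["kind regards", "server outage"],
}
stop_labels = find_stop_labels(doc_labels)
print(sorted(stop_labels))  # prints ['attachment', 'kind regards']
print(remove_stop_labels(doc_labels, stop_labels)[4])  # prints ['server outage']
```

The heuristic needs statistics from the whole collection, which is why a step like this can only run after all documents have been tagged.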

Learning embeddings

The last step of indexing is to learn multidimensional embedding vectors, which help to capture semantic relationships between labels and documents.

Embedding vectors can improve the traditional keyword-based analysis in a number of ways:

  • Unsupervised discovery of synonyms. Embedding-based search for similar words and phrases may uncover new relationships that domain experts may not be aware of.

  • Improved clustering of documents. Based on multidimensional embeddings, Lingo4G can connect documents that don't share common words or phrases, but do share similar concepts. Embedding-based clustering and 2D mapping of documents produce better-defined and tighter groupings of documents, especially when clustering tens or hundreds of thousands of documents.
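Embedding-based search for similar labels typically comes down to comparing vectors, for example by cosine similarity. The Python sketch below shows the idea on hand-made three-dimensional vectors; the vectors, labels and function names are invented for the example, and real embeddings have many more dimensions and are learned from your documents.

```python
import math


def cosine(u, v):
    """Return the cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(embeddings, label, top_n=2):
    """Return the top_n labels whose vectors are closest to the vector of the given label."""
    query = embeddings[label]
    scored = [
        (other, cosine(query, vector))
        for other, vector in embeddings.items()
        if other != label
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]


# Hypothetical, hand-made embedding vectors.
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "vehicle": [0.7, 0.3, 0.1],
    "banana": [0.0, 0.2, 0.9],
}
print(most_similar(embeddings, "car"))  # prints ['automobile', 'vehicle']
```

Note that "car" and "automobile" share no characters or words, yet their vectors are close. The same principle lets embedding-based clustering connect documents that use different vocabulary for similar concepts.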

Learning embeddings is optional because it can take longer than all other indexing steps combined, and Lingo4G can perform many kinds of analyses without embedding vectors. You can have Lingo4G learn embeddings at a later time using the learn-embeddings command.