Incremental indexing

If you want to add or update documents in an existing Lingo4G index, you don't have to run a time-consuming full indexing cycle. Instead, you can perform incremental indexing, which is fast and efficient for small updates.

Overview

You can add, update and remove documents from an existing Lingo4G index. Your project must meet two requirements for incremental indexing to work:

  • the document source must support incremental indexing,

  • the project descriptor's fields section must declare exactly one field as a unique document identifier by setting the field's id attribute to true.

An incremental document source is able to determine which documents have been changed or added since the last indexing. Only documents that have been updated or added are part of an incremental indexing cycle.

You can remove documents from the index using the delete command. That command can take an arbitrary query: any documents selected by that query will be marked as deleted and eventually removed from the index.

The built-in json-records document source implements incremental processing out of the box. An incremental indexing run only looks at JSON files whose modification timestamps are later than the timestamp recorded by the previous indexing run.
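
The timestamp comparison described above can be modelled in a few lines of Python. This is only a sketch of the idea, not Lingo4G's implementation: the select_modified function, the nanosecond-valued bookmark and the demo helper are hypothetical names introduced for illustration, and how Lingo4G actually stores its timestamp is not shown here.

```python
import os
import tempfile
from pathlib import Path


def select_modified(files, bookmark):
    """Return (files modified after `bookmark`, updated bookmark).

    `bookmark` is a modification time in nanoseconds, or None when there
    was no previous run. Hypothetical helper, not part of Lingo4G.
    """
    stamped = [(Path(f).stat().st_mtime_ns, Path(f)) for f in files]
    selected = [path for mtime, path in stamped
                if bookmark is None or mtime > bookmark]
    # The new bookmark is the newest timestamp seen so far.
    candidates = [mtime for mtime, _ in stamped]
    if bookmark is not None:
        candidates.append(bookmark)
    return selected, max(candidates, default=None)


def demo():
    base = 1_700_000_000  # arbitrary epoch time in whole seconds
    with tempfile.TemporaryDirectory() as tmp:
        files = []
        for i in range(3):
            path = Path(tmp) / f"records-{i:02d}.json"
            path.write_text("[]")
            os.utime(path, (base + 10 * i, base + 10 * i))
            files.append(path)

        # First run: no bookmark yet, so every file is selected.
        first, bookmark = select_modified(files, None)
        # Second run: nothing changed, so nothing is selected.
        second, bookmark = select_modified(files, bookmark)
        # Equivalent of `touch`: move one file's timestamp forward.
        os.utime(files[0], (base + 100, base + 100))
        third, bookmark = select_modified(files, bookmark)
        return len(first), len(second), [p.name for p in third]


print(demo())
```

In this model the first run behaves like a full pass over all files, the second run selects nothing, and touching a single file causes only that file to be picked up again.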

For example, the initial indexing command may look as follows:

l4g index -p datasets/dataset-json-records --incremental

Since there are no existing documents in the index, this run would be equivalent to a full indexing cycle. An additional bookmark file stored within the index keeps track of the most recently modified file's timestamp. A subsequent invocation of the same command will result in no changes to the index:

l4g index -p datasets/dataset-json-records --incremental
...
> Processed 0 documents, the index contains 251 documents.
> Done. Total time: 163ms.

If we modify the timestamp on any of the input files, documents from that file will be updated in the index:

touch datasets/dataset-json-records/records-00.json
l4g index -p datasets/dataset-json-records --incremental
...
> Incremental indexing based on the features created on [...]
1/4 Opening source                                                    done      4ms
2/4 Indexing documents                                                done    267ms
3/4 Index flushing                                                    done    451ms
4/4 Updating features                                                 done    469ms
> Processed 57 documents, the index contains 251 documents.
> Done. Total time: 1s 275ms.

All 57 processed documents are updates of existing documents. The total number of documents has not changed because every one of them had an identifier that was already present in the index.
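
The relationship between the two reported counts follows from keying documents by their unique identifier. The sketch below models the index as a plain Python dictionary, which is an assumption made purely for illustration; apply_batch and the "id" field name are hypothetical and do not correspond to any Lingo4G API.

```python
def apply_batch(index, batch, id_field="id"):
    """Upsert `batch` into `index`, a dict keyed by document identifier.

    Returns (processed, total), analogous to the two counts printed
    at the end of an indexing run.
    """
    for doc in batch:
        # A document with an already-known identifier replaces the old one.
        index[doc[id_field]] = doc
    return len(batch), len(index)


index = {}

# Initial run: 251 new documents.
initial = [{"id": str(i), "title": f"Document {i}"} for i in range(251)]
print(apply_batch(index, initial))    # (251, 251)

# Incremental run: 57 documents whose identifiers already exist.
updates = [{"id": str(i), "title": f"Document {i}, revised"} for i in range(57)]
print(apply_batch(index, updates))    # (57, 251)

# Incremental run: one document with a previously unseen identifier.
additions = [{"id": "new-1", "title": "A new document"}]
print(apply_batch(index, additions))  # (1, 252)
```

Only a document with a previously unseen identifier increases the total. This is why exactly one field has to be declared as the unique identifier.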

Feature drift

Lingo4G does not perform feature discovery automatically after an incremental indexing run. This behavior is intentional. Feature discovery and embedding computation are the most time-consuming parts of the indexing process: adding a few documents to a large index would be prohibitively slow if it entailed a full reconstruction of all features. Instead, Lingo4G takes the set of features computed during the last "full" indexing run and applies these features to any new (or updated) documents. The console output states exactly which set of features is used, for example:

> Incremental indexing based on the features created on [...timestamp...]

When many incremental updates stack up, the information used to compute the features (in the last full indexing run) may no longer reflect the set of updated documents — this is called feature drift. The only way to bring the features up to date is to periodically refresh them using the reindex command. The benefit of using reindex compared to a full indexing cycle (using l4g index --force ...) is that the reindex command only recomputes the features of documents already in the index; any search indexes and other auxiliary data structures remain the same and are reused.
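
To make the idea of feature drift concrete, here is a deliberately simplified toy model. It assumes that a "feature" is just a word occurring in at least two documents, which is not how Lingo4G discovers features; discover_features, apply_features and coverage are hypothetical helpers written for this illustration only.

```python
from collections import Counter


def discover_features(docs, min_docs=2):
    """Toy 'full' discovery: keep words found in at least `min_docs` documents."""
    doc_freq = Counter(word for doc in docs for word in set(doc.lower().split()))
    return {word for word, count in doc_freq.items() if count >= min_docs}


def apply_features(doc, features):
    """Toy incremental step: tag a document using already-known features only."""
    return sorted(set(doc.lower().split()) & features)


def coverage(docs, features):
    """Fraction of documents that received at least one feature."""
    return sum(1 for doc in docs if apply_features(doc, features)) / len(docs)


initial = ["solar panel efficiency", "solar cell design", "panel mounting design"]
features = discover_features(initial)  # frozen after the "full" run

# Documents added incrementally, mostly about a topic unseen in the full run.
added = ["quantum error correction", "quantum annealing hardware", "solar quantum dots"]

print(apply_features(added[0], features))  # no frozen feature matches this document
print(coverage(added, features))           # low: the frozen features have drifted

# Recomputing the features over everything in the index restores coverage.
refreshed = discover_features(initial + added)
print(coverage(added, refreshed))
```

In this toy model, the frozen feature set misses the dominant term of the newly added documents until discovery is rerun over the whole collection. That recomputation is the role the periodic feature refresh plays.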

l4g reindex -p datasets/dataset-json-records
...
17/17 Stop label extraction                                             done    185ms
> Done. Total time: 1s 955ms.

REST server and incremental updates

Both incremental indexing and feature computation can run in parallel with a working HTTP REST server (or command-line analyses). The HTTP server keeps using the index state it was started with until you tell it to advance to any newer updates using the index reload API v2 request.
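
The behavior described above, where a running server keeps serving the state it started with until it is explicitly told to advance, can be pictured with a small in-memory model. Everything in this sketch is hypothetical: IndexStore, Server and their methods are illustrative names only and do not reflect the actual reload request or any Lingo4G class.

```python
class IndexStore:
    """Hypothetical stand-in for an index that accumulates commits."""

    def __init__(self):
        self._commits = [frozenset()]

    def commit(self, doc_ids):
        """Record a new immutable index state (e.g. after an incremental run)."""
        self._commits.append(frozenset(doc_ids))

    def latest(self):
        """Return (version number, document ids) of the newest commit."""
        return len(self._commits) - 1, self._commits[-1]


class Server:
    """Hypothetical server that pins the index state seen at startup."""

    def __init__(self, store):
        self._store = store
        self.version, self._snapshot = store.latest()

    def count(self):
        # Queries always run against the pinned snapshot.
        return len(self._snapshot)

    def reload(self):
        # Explicitly advance to the newest committed state.
        self.version, self._snapshot = self._store.latest()


store = IndexStore()
store.commit({"a", "b"})

server = Server(store)
print(server.count())  # 2

store.commit({"a", "b", "c"})  # indexing finishes while the server is running
print(server.count())  # still 2: the server has not been told to reload

server.reload()
print(server.count())  # 3
```

Pinning a snapshot in this way means that queries see a consistent index state even while an update is being written, at the cost of requiring an explicit reload step.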