If you want to add or update documents in an existing Lingo4G index, you don't have to perform the time-consuming full indexing. Instead, you can perform incremental indexing, which is fast and efficient for small updates.
You can add, update and remove documents from an existing Lingo4G index. Your project must meet two requirements for incremental indexing to work:
the document source must support incremental indexing,
An incremental document source is able to determine which documents have been changed or added since the last indexing. Only documents that have been updated or added are part of an incremental indexing cycle.
You can remove documents from the index using the delete command. That command can take an arbitrary query: any documents selected by that query will be marked as deleted and eventually removed from the index.
The document source used by the dataset-json-records example implements incremental processing out of the box. Any incremental indexing run only looks at JSON files whose modification timestamps are later than the timestamp of the previous indexing run.
For example, the initial indexing command may look as follows:
l4g index -p datasets/dataset-json-records --incremental
Since there are no existing documents in the index, this run would be equivalent to a full indexing cycle. An additional bookmark file stored within the index keeps track of the most recently modified file's timestamp. A subsequent invocation of the same command will result in no changes to the index:
l4g index -p datasets/dataset-json-records --incremental
...
Processed 0 documents, the index contains 251 documents.
Done. Total time: 163ms.
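The timestamp comparison behind this behavior can be imitated with standard POSIX tools. The sketch below is only an illustration of the idea, not Lingo4G's actual implementation: the temporary directory, the file names and the `.bookmark` file are all hypothetical, and the real bookmark is stored inside the index.

```shell
# Illustration only: mimics timestamp-based change detection with POSIX tools.
# The directory layout and the ".bookmark" file name are hypothetical.
workdir=$(mktemp -d)

# Two input files, both last modified before the previous indexing run.
touch -t 202001010000 "$workdir/records-00.json" "$workdir/records-01.json"

# The bookmark's modification time stands in for the previous indexing run.
touch -t 202101010000 "$workdir/.bookmark"

# One input file changes after the previous run.
touch -t 202201010000 "$workdir/records-01.json"

# Only files modified after the bookmark qualify for the next incremental run.
changed=$(find "$workdir" -name '*.json' -newer "$workdir/.bookmark" -exec basename {} \;)
echo "$changed"
```

Only `records-01.json` is selected, because it is the only file with a modification time later than the bookmark.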
If we modify the timestamp on any of the input files, documents from that file will be updated in the index:
touch datasets/dataset-json-records/records-00.json
l4g index -p datasets/dataset-json-records --incremental
...
Incremental indexing based on the features created on [...]
1/4 Opening source done 4ms
2/4 Indexing documents done 267ms
3/4 Index flushing done 451ms
4/4 Updating features done 469ms
Processed 57 documents, the index contains 251 documents.
Done. Total time: 1s 275ms.
The reported 57 processed documents are only updates. The total number of documents has not changed because all these documents had identifiers that were already present in the index.
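The arithmetic can be checked with a small simulation. It assumes, purely for illustration, that document identifiers are the numbers 1 to 251 and that the touched file holds the documents with identifiers 1 to 57; the real identifiers depend on the dataset.

```shell
# Hypothetical identifiers: 251 documents already in the index,
# 57 of which are re-read from the touched file.
index_ids=$(seq 1 251)
batch_ids=$(seq 1 57)

# Documents processed in this run.
processed=$(printf '%s\n' "$batch_ids" | wc -l | tr -d ' ')

# An update replaces the document with the same identifier,
# so the index size is the number of distinct identifiers.
total=$(printf '%s\n%s\n' "$index_ids" "$batch_ids" | sort -u | wc -l | tr -d ' ')

echo "processed=$processed total=$total"
```

Because every identifier in the batch already exists in the index, the number of distinct identifiers, and therefore the index size, stays the same.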
Lingo4G does not perform feature discovery automatically after an incremental indexing run. This behavior is intentional. Feature discovery and embedding computation are the most time-consuming parts of the indexing process, so adding a few documents to a large index would be prohibitively slow if it entailed a full reconstruction of all features. Instead, Lingo4G looks at the set of features computed during the last "full" indexing run and applies these features to any new (or updated) documents. The console output states exactly which set of features is used, for example:
Incremental indexing based on the features created on [...timestamp...]
When many incremental updates stack up, the information used to compute the features (in the last full indexing run) may no longer reflect the set of updated documents; this is called feature drift. The only way to bring the features up to date is to periodically refresh them using the reindex command. The benefit of using reindex compared to a full indexing cycle (using l4g index --force ...) is that the command only recomputes the features of documents already in the index; any search indexes and other auxiliary data structures remain the same and are reused.
l4g reindex -p datasets/dataset-json-records
...
17/17 Stop label extraction done 185ms
Done. Total time: 1s 955ms.
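One way to keep feature drift in check is to interleave incremental runs with an occasional reindex. The sketch below is a dry run that only prints the commands it would execute, so it works without Lingo4G installed. The `plan_runs` helper and the "reindex after every N incremental runs" policy are hypothetical; a suitable interval depends on how quickly your data changes.

```shell
# Dry-run sketch: prints the l4g commands instead of executing them.
# The schedule (one reindex after every N incremental runs) is a hypothetical policy.
project="datasets/dataset-json-records"

plan_runs() {
  total_runs=$1      # number of incremental indexing runs to plan
  reindex_every=$2   # refresh features after this many incremental runs
  i=1
  while [ "$i" -le "$total_runs" ]; do
    echo "l4g index -p $project --incremental"
    if [ $((i % reindex_every)) -eq 0 ]; then
      echo "l4g reindex -p $project"
    fi
    i=$((i + 1))
  done
}

plan=$(plan_runs 10 5)
echo "$plan"
```

With these arguments the plan contains ten incremental runs and two reindex runs, one after the fifth and one after the tenth update.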
Both incremental indexing and feature computation can run in parallel with a working HTTP REST server (or command-line analyses). The HTTP server keeps using the index state it was started with until you tell it to advance to any newer updates using the index reload API v2 request.