Architecture
To efficiently handle millions of documents and gigabytes of text, Lingo4G splits processing into two phases: indexing and analysis.
During indexing, Lingo4G imports documents from an external source into a local persistent index and then extracts and stores text features that best describe the imported documents.
Once Lingo4G completes indexing, you can ask Lingo4G to apply various text processing operations to the indexed documents. We call this process analysis. Analysis operations range from simple query-based document searches, through clustering and 2D mapping, to finding duplicate documents and time series analysis. You can change analysis parameters, such as the subset of documents to analyze or clustering similarity thresholds, without indexing the documents again.
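The split between the two phases can be illustrated with a small, purely conceptual Python sketch. It is not Lingo4G's API or implementation: the names ToyIndex, build_index and analyze, the word-count "features" and the min_count parameter are all hypothetical and chosen only to show that features are extracted once and then reused by analyses run with different parameters.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for a persistent index: documents plus extracted features."""
    documents: dict[int, str] = field(default_factory=dict)
    features: dict[int, Counter] = field(default_factory=dict)


def build_index(docs: list[str]) -> ToyIndex:
    """Phase 1 (indexing): import documents and extract their features once."""
    index = ToyIndex()
    for doc_id, text in enumerate(docs):
        index.documents[doc_id] = text
        # Toy "feature extraction": lowercase word counts (an assumption for illustration).
        index.features[doc_id] = Counter(text.lower().split())
    return index


def analyze(index: ToyIndex, query: str, min_count: int = 1) -> list[int]:
    """Phase 2 (analysis): select documents using only the features stored in the index."""
    term = query.lower()
    return [
        doc_id
        for doc_id, feats in index.features.items()
        if feats[term] >= min_count  # Counter returns 0 for terms it has not seen
    ]


index = build_index([
    "Clustering groups similar documents",
    "Duplicate documents waste storage and documents pile up",
    "Time series analysis tracks trends",
])

# Two analyses with different parameters run against the same index; nothing is re-indexed.
print(analyze(index, "documents"))
print(analyze(index, "documents", min_count=2))
```

The point of the sketch is the shape of the workflow: build_index does the expensive work once, while analyze can be called repeatedly with different parameters at low cost.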
The two-phase operation of Lingo4G is analogous to the workflow of enterprise search platforms, such as Apache Solr or Elasticsearch: documents must be indexed first, and only then can they be searched and retrieved.
With the two-phase processing model, Lingo4G is particularly suited for processing fairly "static" collections of documents where Lingo4G can access all documents for indexing. Therefore, the natural use case for Lingo4G is analyzing large volumes of human-readable text, such as scientific papers, business or legal documents, news articles, blog or social media posts. While Lingo4G offers an incremental indexing workflow where you can add, update or delete documents from an existing index, Lingo4G is not suitable for processing continuous streams of new content.
The entry point to all Lingo4G capabilities is the Java-based l4g command-line application. You can use it to initiate document indexing and start the Lingo4G REST API server, which accepts analysis requests.