Architecture

To efficiently handle millions of documents and gigabytes of text, Lingo4G splits processing into two phases: indexing and analysis.

During indexing, Lingo4G imports documents from an external source into a local persistent index and then extracts and stores text features that best describe the imported documents.

Once Lingo4G completes indexing, you can request Lingo4G to apply various text processing operations to the indexed documents. We call this process analysis. Analysis operations range from simple query-based document searches, through clustering and 2D mapping, to finding duplicate documents and time series analysis. You can change analysis parameters, such as the subset of documents to analyze or clustering similarity thresholds, without indexing the documents again.
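The index-once, analyze-many idea can be illustrated with a toy sketch. This is not Lingo4G code and does not use the Lingo4G API: the functions `build_index` and `analyze`, the `min_count` parameter and the sample documents are all hypothetical, and the "features" here are plain word counts rather than the features Lingo4G actually extracts.

```python
import re
from collections import Counter


def build_index(documents):
    """Phase 1 (indexing): process every document once and store its word counts."""
    index = {}
    for doc_id, text in documents.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        index[doc_id] = Counter(tokens)
    return index


def analyze(index, query_term, min_count=1):
    """Phase 2 (analysis): return ids of documents where query_term occurs
    at least min_count times. Only the stored index is read here."""
    return sorted(
        doc_id for doc_id, counts in index.items()
        if counts[query_term] >= min_count
    )


# Hypothetical sample collection.
documents = {
    "d1": "Clustering groups similar documents. Clustering is unsupervised.",
    "d2": "Search retrieves documents matching a query.",
    "d3": "Duplicate detection finds near-identical documents.",
}

index = build_index(documents)                      # indexing runs once
broad = analyze(index, "documents")                 # first analysis request
narrow = analyze(index, "clustering", min_count=2)  # new parameters, same index
```

Both calls to `analyze` reuse the same `index`. Changing the query term or the threshold does not require processing the source documents again, which is the property the two-phase split is meant to provide.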

[Diagram labels: Project, Index, Document source, Lingo4G, Indexing, Analysis, Source documents, Internal persistent representation of source documents, Analyst, Analysis scope, Themes, clusters]

Lingo4G architecture: the two-phase processing paradigm. During the indexing phase, shown in blue, Lingo4G digests the documents returned by the document source and creates an internal persistent representation of those documents. Once indexing is complete, Lingo4G can accept analysis requests, shown in red, that apply various text processing operations to the indexed documents.

The two-phase operation of Lingo4G is analogous to the workflow of enterprise search platforms, such as Apache Solr or Elasticsearch. Documents must be indexed first; only then can they be searched and retrieved.

With the two-phase processing model, Lingo4G is particularly suited for processing fairly "static" collections of documents where Lingo4G can access all documents for indexing. Therefore, the natural use case for Lingo4G is analyzing large volumes of human-readable text, such as scientific papers, business or legal documents, news articles, blog or social media posts. While Lingo4G offers an incremental indexing workflow where you can add, update or delete documents from an existing index, Lingo4G is not suitable for processing continuous streams of new content.

The entry point to all Lingo4G capabilities is the Java-based l4g command-line application. You can use it to initiate document indexing and to start the Lingo4G REST API server, which accepts analysis requests.