Requirements

Lingo4G can run on almost any hardware. In this chapter we discuss practical considerations that may affect performance.

For data sets up to a few dozen gigabytes of text, any modern computer will be sufficient, even a laptop. Larger data sets will benefit greatly from more RAM and SSD-based storage.

Storage technology

Storage type and capacity are the key factors that influence Lingo4G performance. Lingo4G is designed to take full advantage of CPU parallelism: all CPU cores can read from and write to the index at the same time.

We highly recommend solid-state drives (SSDs) for storing the Lingo4G index and temporary files, especially if the files are too large to fit in the operating system's disk cache. With SSD storage, Lingo4G can use multiple CPU cores effectively and thus significantly reduce processing time.

We do not recommend spinning drives due to severe degradation of performance when multiple threads access the index.

Lingo4G does not support network drives. Lingo4G projects and their work directories should not be stored on network shares because of a limitation of the underlying technology: Lucene file locks do not work on network shares.
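A quick way to spot a network share before placing a project on it is to ask the JVM for the file store type of the target directory. The following is a minimal sketch, not part of Lingo4G: the class name, method name and list of network file system names are assumptions made for illustration, and the type string returned by FileStore.type() is platform-specific, so on some systems it may not reveal that a store is remote.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lingo4G.
final class NetworkShareCheck {
    // Assumed, non-exhaustive list of file store type names that usually indicate a network share.
    private static final Set<String> NETWORK_TYPES =
        Set.of("nfs", "nfs4", "cifs", "smb", "smbfs", "smb3", "afs", "afpfs", "webdav");

    /** Returns true if the file store type name matches a known network file system name. */
    static boolean looksLikeNetworkFileSystem(String type) {
        return type != null && NETWORK_TYPES.contains(type.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws IOException {
        // Directory to check; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String type = Files.getFileStore(dir).type();
        System.out.println("File store type: " + type);
        if (looksLikeNetworkFileSystem(type)) {
            System.out.println("Warning: this looks like a network share.");
        }
    }
}
```

A negative result is only a hint. When in doubt, check how the volume is mounted with the operating system's own tools.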

Storage space

Lingo4G index storage requirements are typically 2x–3x the total size (in bytes) of the text in your collection. The example data sets chapter shows measured Lingo4G index sizes for the data sets included in the distribution.

In addition to the space occupied by the index itself, Lingo4G requires additional disk space for temporary files while indexing. Lingo4G deletes these temporary files after indexing is complete.
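The 2x to 3x rule of thumb can be turned into a rough pre-flight check against the free space on the target volume. The sketch below is illustrative only: the class and method names and the 20 GiB collection size are hypothetical, and the estimate covers the index alone, because the space needed for temporary files is not quantified here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Lingo4G.
final class IndexSizeEstimate {
    // Multipliers taken from the "typically 2x-3x" rule of thumb.
    static final long LOW_FACTOR = 2;
    static final long HIGH_FACTOR = 3;

    /** Returns {low, high}: the estimated index size range in bytes for the given text size. */
    static long[] estimateIndexBytes(long textBytes) {
        if (textBytes < 0) {
            throw new IllegalArgumentException("textBytes must be non-negative");
        }
        return new long[] {
            Math.multiplyExact(textBytes, LOW_FACTOR),
            Math.multiplyExact(textBytes, HIGH_FACTOR)
        };
    }

    public static void main(String[] args) throws IOException {
        long textBytes = 20L * 1024 * 1024 * 1024; // hypothetical collection: 20 GiB of text
        Path target = Paths.get(args.length > 0 ? args[0] : ".");

        long[] range = estimateIndexBytes(textBytes);
        long usable = Files.getFileStore(target).getUsableSpace();

        System.out.printf("Estimated index size: %d to %d bytes%n", range[0], range[1]);
        System.out.printf("Usable space on target volume: %d bytes%n", usable);
        if (usable < range[1]) {
            // Temporary indexing files need extra room on top of this estimate.
            System.out.println("Warning: usable space is below the upper estimate.");
        }
    }
}
```

Treat the result as a lower bound on the space to provision, since temporary files exist alongside the index until indexing completes.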

CPU and memory

CPU: more than 4 hardware threads. Lingo4G can process data in parallel on multiple CPU cores, which can greatly reduce latency. Performance will vary depending on the size of the collection and the type of analysis, but in general the more CPU cores you make available to Lingo4G, the better. Please note that high-end CPUs with dozens of cores are very likely to saturate other parts of the system, such as memory or I/O.

Also note that Lingo4G has a built-in mechanism that dynamically adjusts the number of threads for optimal performance, so CPU usage during indexing or analysis may fluctuate. Such fluctuation is not an indicator of underused resources.

RAM: the more, the better. During document analysis, Lingo4G frequently reads from the persistent index data store created during indexing. For the highest multi-threaded processing performance, the amount of RAM available to the operating system should ideally be large enough for the OS to cache most of the Lingo4G index files, so that the number of disk accesses is minimized. Note that the system memory pool is distinct from the Java heap size, discussed below.

JVM heap size: the default 4 GB should be enough in most scenarios. It should be enough to perform indexing regardless of the size of the input data, and also enough for typical document analysis scenarios.

If you plan to analyze very large subsets of the index or issue multiple concurrent analyses, consider increasing the JVM heap size. Note that needlessly increasing the JVM heap may have an adverse effect on performance, as it may decrease the amount of memory that would otherwise be available for disk caches.

If you run Lingo4G on massively multicore machines (32 cores or more), consider increasing the JVM heap size beyond the default 4 GB during indexing to give more room to each indexing thread. This is not a requirement, however.
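To see what a JVM on a given machine actually has to work with, the standard Runtime API reports the number of hardware threads visible to the JVM and the maximum heap it may use. This is a small sketch with hypothetical class and method names. It reports the values for the JVM running the snippet, not for a running Lingo4G instance, and the processor count may be lower than the physical count in containers or under CPU affinity limits.

```java
// Hypothetical helper, not part of Lingo4G.
final class JvmResources {
    /** Number of processors available to this JVM. */
    static int hardwareThreads() {
        return Runtime.getRuntime().availableProcessors();
    }

    /** Maximum heap size, in bytes, that this JVM will attempt to use. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Applies the "more than 4 hardware threads" recommendation. */
    static boolean meetsThreadRecommendation(int threads) {
        return threads > 4;
    }

    public static void main(String[] args) {
        int threads = hardwareThreads();
        long heapMiB = maxHeapBytes() / (1024 * 1024);

        System.out.println("Hardware threads visible to the JVM: " + threads);
        System.out.println("Maximum heap of this JVM: " + heapMiB + " MiB");
        if (!meetsThreadRecommendation(threads)) {
            System.out.println("Note: 4 or fewer hardware threads; parallel processing will be limited.");
        }
    }
}
```

The heap figure says nothing about the memory left for the operating system's disk cache, which is the part that matters most for index access.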

Java Virtual Machine

Lingo4G requires 64-bit Java 17 or later. Other JVM settings, such as garbage collector options, play a minor role in overall performance compared to disk speed and memory availability.
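The Java requirement can be checked from within a JVM using the standard Runtime.version() API (available since Java 10). The sketch below uses hypothetical class and method names. The sun.arch.data.model system property is specific to HotSpot-based JVMs and may be missing elsewhere, so treat the 64-bit check as a best-effort assumption.

```java
// Hypothetical helper, not part of Lingo4G.
final class JavaRequirementCheck {
    // Minimum Java feature version stated in the requirements.
    static final int MIN_FEATURE_VERSION = 17;

    /** Returns true if the given feature version (for example 17 or 21) is high enough. */
    static boolean meetsMinimum(int featureVersion) {
        return featureVersion >= MIN_FEATURE_VERSION;
    }

    /** Returns true if the data model string reports a 64-bit JVM. */
    static boolean is64Bit(String dataModel) {
        return "64".equals(dataModel);
    }

    public static void main(String[] args) {
        int feature = Runtime.version().feature();
        // HotSpot-specific property; may be null on other JVM implementations.
        String dataModel = System.getProperty("sun.arch.data.model");

        System.out.println("Java feature version: " + feature);
        System.out.println("Data model: " + (dataModel == null ? "unknown" : dataModel + "-bit"));
        System.out.println("Meets Java 17+ requirement: " + meetsMinimum(feature));
        System.out.println("Reports 64-bit JVM: " + is64Bit(dataModel));
    }
}
```

Running java -version on the command line gives the same information without writing any code.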