Generic questions

This chapter answers the most commonly asked general questions about Lingo3G.

What is Carrot2 and how does it relate to Lingo3G?

Carrot2 is an open source project developed by Carrot Search (yes, us!). Lingo3G and Carrot2 live in symbiosis: Lingo3G is a commercial algorithm that lives on top of Carrot2-provided infrastructure and provides capabilities beyond those offered by the open source project.

If you'd like to work only with open-source clustering algorithms, we recommend Carrot2's Lingo algorithm, the predecessor of Lingo3G. It is fast, accurate and free.

Can Lingo3G crawl my website?

No. Lingo3G is not a document indexing service or a search engine. You need to feed it documents (text) yourself. You can integrate Lingo3G with other software (a search engine or a document retrieval system) using its Java or REST APIs, but doing so requires programming skills.
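
As a rough illustration of what such an integration involves, the sketch below takes hits returned by your own search engine and posts their text to a clustering service over HTTP. It is a minimal sketch under stated assumptions, not the actual Lingo3G REST API: the endpoint URL, the "documents", "title" and "body" field names and the shape of the response are all hypothetical and must be checked against the REST API documentation of your installation.

```python
import json
import urllib.request

# Hypothetical endpoint; the real URL depends on your installation.
CLUSTER_URL = "http://localhost:8080/service/cluster"


def build_request(hits):
    """Turn search hits (dicts from your own search engine) into a request body.

    The "documents", "title" and "body" field names are assumptions.
    """
    documents = []
    for hit in hits:
        title = (hit.get("title") or "").strip()
        body = (hit.get("body") or "").strip()
        if title or body:  # skip hits that carry no text at all
            documents.append({"title": title, "body": body})
    return {"documents": documents}


def cluster(hits, url=CLUSTER_URL, timeout=10.0):
    """POST the documents as JSON and return the decoded JSON response."""
    payload = json.dumps(build_request(hits)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The point of the sketch is the division of labor: fetching the documents is your application's job, and the clustering service only ever sees the text you send it.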

How does Lingo3G clustering scale?

There is no hard limit on the number of documents Lingo3G can process. The algorithm is designed to cluster on the fly and in memory (for speed), so available memory (Java heap) is one constraint. The other is processing time, which grows with the overall size of the input (the combined length of all documents), the number of unique documents and the size of the lexicon (the number of unique terms). Whether either constraint matters is very application-specific: some applications require sub-second results, while others can live with offline processing.
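
To get a feel for these quantities before sending data for clustering, you can measure them on your own input. The sketch below is only an approximation: the `input_statistics` helper is hypothetical, and its lowercase word-character tokenization is a crude stand-in, not the way Lingo3G itself splits text into terms.

```python
import re

# Crude tokenizer: runs of word characters. An assumption for illustration only.
TOKEN = re.compile(r"\w+")


def input_statistics(documents):
    """Summarize a list of document texts by the factors that drive processing cost."""
    total_characters = 0
    unique_documents = set()
    lexicon = set()
    for text in documents:
        total_characters += len(text)
        unique_documents.add(text)
        lexicon.update(token.lower() for token in TOKEN.findall(text))
    return {
        "documents": len(documents),
        "unique_documents": len(unique_documents),
        "total_characters": total_characters,
        "unique_terms": len(lexicon),
    }
```

Tracking these numbers alongside measured clustering times on your own data is a more reliable guide than any fixed document limit.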

Can I use Lingo3G with an existing taxonomy?

No. Assigning predefined categories to documents is a different algorithmic problem (called text classification) and Lingo3G was not designed to solve it.

What happened to Lingo3G Workbench?

The Lingo3G Workbench tool has been removed from the Lingo3G distribution package. There are two reasons for this:

  1. When we decided to simplify and clean up the API of Lingo3G, we dropped the concept of document sources. The API is now purely request-response based and the user is in full charge of where the text of input documents comes from. The Workbench relied on document sources as inputs and we couldn't find an easy way to combine these two approaches.
  2. There was no easy upgrade path for the underlying graphical user interface (which was based on Eclipse's RCP project). The technology has moved forward: many operating systems now warn about, or flatly refuse to run, desktop Java applications.