This section explains the basic concepts behind Lingo3G clustering, such as input content requirements, characteristics of clusters, quality tuning options and the available tools and APIs.
This section is optional. Feel free to skip to Installation and come back if you have any doubts.
Lingo3G organizes collections of documents into groups called clusters.
Clusters created by Lingo3G share the following characteristics:
Described by labels. To each document cluster Lingo3G attaches one or more labels that summarize the contents of the cluster. Labels consist of single words or phrases, such as Python or Hybrid Hierarchical Clustering.
Containing related documents. Lingo3G ensures that documents inside the cluster contain at least one of the cluster's labels.
If a document contains a different grammatical form of a cluster's label (for example, Tools and Tool), Lingo3G still puts the document in the cluster.
For phrase labels, such as Hybrid Hierarchical Clustering, each document in the cluster contains all words of the label, but not necessarily in order or close proximity. You can change this behavior if needed.
Overlapping. Lingo3G may put the same document into multiple clusters. For example, Lingo3G may put a document containing the k-means clustering research paper in the k-means and Research clusters.
Hierarchical. Lingo3G tries to divide large clusters into subclusters. For specific inputs, such as when the number of documents is small, Lingo3G may not be able to generate subclusters.
Some documents not clustered. In most cases, Lingo3G will not be able to put all documents into clusters, leaving the unclustered documents in the Other topics group. You can tune Lingo3G to reduce the number of unclustered documents, but bringing the size of Other topics to zero will not be possible in most cases.
Quality may need tuning. Lingo3G should produce good clusters for reasonably noise-free English inputs. In case of other languages or noisy content, you may need to tune the parameters or filtering dictionaries.
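To make the overlapping, hierarchical and "Other topics" characteristics more concrete, here is a minimal Python sketch that walks a cluster tree. The data shape is an assumption made for illustration only: each cluster is a dictionary with hypothetical `labels`, `documents` (indices into the input list) and `clusters` (subclusters) keys. It is not the documented Lingo3G output format.

```python
from collections import Counter

# ASSUMPTION: hypothetical cluster structure, for illustration only.
example_clusters = [
    {"labels": ["k-means"], "documents": [0, 1], "clusters": [
        {"labels": ["Initialization"], "documents": [2], "clusters": []},
    ]},
    {"labels": ["Research"], "documents": [0, 3], "clusters": []},
]


def clustered_documents(clusters):
    """Collect indices of documents found in any cluster or subcluster."""
    seen = set()
    for cluster in clusters:
        seen.update(cluster.get("documents", []))
        seen.update(clustered_documents(cluster.get("clusters", [])))
    return seen


def other_topics(document_count, clusters):
    """Return indices of documents that no cluster contains."""
    return sorted(set(range(document_count)) - clustered_documents(clusters))


def overlapping_documents(clusters):
    """Return indices of documents that appear in more than one top-level cluster."""
    counts = Counter()
    for cluster in clusters:
        # Count each document once per top-level cluster, subclusters included.
        counts.update(clustered_documents([cluster]))
    return sorted(doc for doc, n in counts.items() if n > 1)
```

With five input documents (indices 0 to 4), `other_topics(5, example_clusters)` identifies the documents left unclustered, and `overlapping_documents(example_clusters)` lists the documents shared by several top-level clusters.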
Documents for clustering
The quality and performance of clustering depend on the characteristics of the documents you provide to Lingo3G. Here are some recommendations for preparing ideal input:
Lingo3G works best clustering 100 – 1000 documents at once. If you pass fewer than 20 documents, Lingo3G may fail to produce meaningful clusters.
Minimize "noise" in the input documents. Truncated sentences or random alphanumerical strings may decrease the quality of cluster labels. If your documents consist of several fields and some fields seem "noisier" than the others, try excluding the noisy fields from clustering.
Documents with a maximum of 1000 words work best. For long documents, Lingo3G may produce highly overlapping, less informative clusters. If you plan to cluster very long documents, consider splitting them on chapter or paragraph boundaries.
Clustering more than 10,000 documents will require tuning. Lingo3G performs in-memory clustering to ensure fast, real-time processing. The practical upper limit on the combined length of documents Lingo3G can cluster at once is about 10 MB. If you need to process large inputs, consider Lingo4G – a large-scale clustering engine that can cluster millions of documents.
Don't mix documents written in different languages. For best results, split your documents by language and cluster each language separately. If you need to detect languages automatically, several open source language identification libraries are available.
Submit only natural text for clustering. Lingo3G cannot use numeric or nominal fields for clustering, but these can be useful for display purposes in your application or Lingo3G Clustering Workbench.
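The recommendations above can be turned into a quick pre-flight check before sending documents for clustering. The following Python sketch is not part of Lingo3G. The function names are hypothetical, words are counted by splitting on whitespace, the combined length is measured in UTF-8 bytes, and paragraphs are assumed to be separated by blank lines. All of these are simplifying assumptions.

```python
import re


def check_input(documents):
    """Return a list of warnings for a list of document strings."""
    warnings = []
    count = len(documents)
    if count < 20:
        warnings.append(f"only {count} documents: clusters may not be meaningful")
    elif count > 10_000:
        warnings.append(f"{count} documents: clustering will require tuning")

    # ASSUMPTION: a "word" is any whitespace-separated token.
    long_docs = [i for i, text in enumerate(documents) if len(text.split()) > 1000]
    if long_docs:
        warnings.append(f"{len(long_docs)} documents exceed 1000 words: consider splitting them")

    # ASSUMPTION: "about 10 MB" is approximated as 10,000,000 UTF-8 bytes.
    total_bytes = sum(len(text.encode("utf-8")) for text in documents)
    if total_bytes > 10_000_000:
        warnings.append(f"combined length is {total_bytes} bytes: above the practical limit")
    return warnings


def split_paragraphs(text):
    """Split a long document into paragraphs separated by blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]
```

An empty list from `check_input` means the input stays within the limits mentioned above. It does not say anything about noise or mixed languages, which need separate checks.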
Tools and APIs
Lingo3G comes with a suite of tools and APIs that you can use to quickly try clustering on your own data, tune clustering parameters and integrate Lingo3G with your software.
Document Clustering Server (DCS)
The Document Clustering Server (DCS) is an application you need to run to access the end-user apps and the REST API of Lingo3G.
You will find the DCS in the /dcs directory of Lingo3G
Lingo3G comes with two browser-based apps for end users: Search Results Clustering and Clustering Workbench. We created the apps to help you quickly try Lingo3G clustering on your data. However, if you find the apps useful in your daily text mining work, feel free to use them on a regular basis.
While we plan to add new features to the end-user apps, our priority is the development of the Lingo3G clustering algorithm core.
To access the apps, you need to run the Document Clustering Server.
Search Results Clustering
The Lingo3G Search Results Clustering app helps novice users cluster search results from public search engines. To keep the experience simple, the app only lets the user type a query and browse the clusters in textual and graphical form.
The quickest way to use the Search Results Clustering app is to try the on-line version. For other options, see Trying Lingo3G. See the Search Results Clustering app section for a brief description of the user interface.
If you have some basic IT experience, you can use Clustering Workbench to quickly try Lingo3G clustering on your data. Clustering Workbench may also be suitable as a text mining tool you use on a regular basis.
With Lingo3G Workbench you can accomplish the following:
Cluster results from the following sources:
- web search engine
- PubMed search engine
- Excel, CSV, JSON and XML files
- Apache Solr
Change the parameters of Lingo3G, including the structure of clusters, cluster label characteristics, label filtering dictionaries and more.
Export the clusters and search results as Excel, OpenOffice, CSV or JSON.
Visualize clusters using a pie chart and treemap, including export to JPEG or PNG.
Export the modified Lingo3G parameters as JSON ready for pasting into Lingo3G REST API requests.
The quickest way to use Clustering Workbench is to try the on-line version. For other options, see Trying Lingo3G. See the Clustering Workbench app section for a brief description of the user interface.
Lingo3G comes with two APIs: the REST API for integration with any programming language and the Java API for direct integration with Java software.
Lingo3G REST API exposes clustering as an HTTP service: you make a POST request with the documents to cluster in JSON format and get back a JSON response with a list of clusters.
To access the REST API, you need to run the Document Clustering Server application.
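As a rough illustration of the request flow described above, the following Python sketch builds a JSON POST request using only the standard library. The endpoint URL and the JSON field names (`documents`, `title`, `body`) are placeholders assumed for this example, not the documented Lingo3G REST API. Consult the REST API reference for the actual address and request format.

```python
import json
import urllib.request

# ASSUMPTION: placeholder address; use the host, port and path of your DCS.
DCS_URL = "http://localhost:8080/service/cluster"


def build_request(documents, url=DCS_URL):
    """Build a POST request carrying (title, body) pairs as a JSON body."""
    # ASSUMPTION: the field names below are illustrative only.
    payload = {"documents": [{"title": title, "body": body} for title, body in documents]}
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cluster(documents, url=DCS_URL):
    """Send the request and decode the JSON response (needs a running DCS)."""
    with urllib.request.urlopen(build_request(documents, url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes it possible to inspect or log the JSON body without a running server.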
If you develop a Java application, you can use Lingo3G as a library. See Java API basics for more details.
REST or Java API?
Use Lingo3G REST API in the following cases:
You write your software in a language other than Java.
In particular, you can call Lingo3G REST API directly from a browser-based app, just like Lingo3G end-user apps do.
You write your software in Java, but you'd like to use the decoupled architecture, which has the following advantages:
You can increase the number of Lingo3G REST API servers dynamically without taking down your application.
You can upgrade Lingo3G to a newer version without taking down your application.
If the JVM running Lingo3G DCS crashes, it will not take down your application.
Use Lingo3G Java API in the following cases:
You would like to avoid the JSON serialization and deserialization overhead. The overhead should be negligible for small collections of documents, but may become significant when clustering larger numbers of documents.
Apart from apps and APIs, Lingo3G comes with plugins that perform search results clustering inside two popular document indexing engines.
An alternative to running Lingo3G inside your indexing engine is having your application fetch search results from the engine and push them to Lingo3G REST API for clustering. In addition to the universal advantages of a decoupled architecture, you will save some time configuring and upgrading Lingo3G indexing engine plugins.
Carrot2 is an open source project developed by Carrot Search. Lingo3G and Carrot2 live in symbiosis: Lingo3G is a commercially licensed clustering algorithm built on top of the Carrot2-defined infrastructure. The source code of all Lingo3G tools and APIs is available in the Carrot2 GitHub repository under a permissive license. The only closed-source part is the Lingo3G clustering algorithm.
If you'd like to work only with open source clustering algorithms, we recommend Carrot2's Lingo, the predecessor of Lingo3G. It is fast, fairly accurate and works very well for small document sets.