Getting started

This chapter shows how to start experimenting with the Lingo3G distribution.

This section assumes the distribution archive is unpacked and the required software is installed.

Clustering search results

This is the simplest example to set up. It demonstrates how the Lingo3G HTTP/REST service (the DCS) can be used to cluster the top search results (titles and snippets) for a user query.

Go to the dcs sub-folder of the distribution package and start the DCS locally (all launch scripts have corresponding variants for Windows systems; we omit them here for brevity):

> cd dcs
> ./dcs --port 8080
12:04:27: DCS context initialized [algorithms: [Lingo3G], templates: [frontend-default]]
12:04:27: Service started on port 8080.
12:04:27: The following contexts are available:
  http://localhost:8080/          DCS Root
  http://localhost:8080/doc       Lingo3G Documentation
  http://localhost:8080/frontend  DCS Search frontend
  http://localhost:8080/javadoc   Lingo3G Java API Javadoc
  http://localhost:8080/service   DCS

Once the DCS starts (assuming port 8080 is available), the frontend search demo is available at localhost:8080/frontend. The demo is set up to send queries to either the ETools meta search engine or the PubMed search API service and to cluster the results it gets back.
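
If the DCS does not come up, a busy port is one possible cause. The following is a minimal Python sketch of a pre-launch check. The helper name port_is_free is hypothetical and not part of the Lingo3G distribution, and the check only probes the loopback interface, so treat it as a quick hint rather than a guarantee.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Usually "address already in use": another process holds the port.
            return False
        return True
```

If port_is_free(8080) returns False, pass a different number to the --port option shown above and adjust the URLs accordingly.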

[Figure: Search results from ETools for the input query "covid 19", clustered on the fly and visualized with Carrot Search Circles.]

[Figure: Search results from PubMed for the input query "covid 19", clustered on the fly and visualized with Carrot Search FoamTree.]

Please note that both search services come with per-IP address limits that prevent too many requests from a single address.

An online version of this demo is available at search.carrotsearch.com/ for the extra-lazy.

Clustering data from JSON files

You can try Lingo3G clustering on your own documents by either using the Java API directly or by converting your inputs into a JSON file and using the DCS server to cluster this JSON file. In this section, we'll show the latter.

First off, start the DCS so that the clustering API is working:

> cd dcs
> ./dcs --port 8080
12:04:27: DCS context initialized [algorithms: [Lingo3G], templates: [frontend-default]]
12:04:27: Service started on port 8080.
12:04:27: The following contexts are available:
  http://localhost:8080/          DCS Root
  http://localhost:8080/doc       Lingo3G Documentation
  http://localhost:8080/frontend  DCS Search frontend
  http://localhost:8080/javadoc   Lingo3G Java API Javadoc
  http://localhost:8080/service   DCS

Next, build the examples-dcs project so that we can use it to post the content of our documents to the DCS clustering API. You could do the same with any HTTP utility (such as wget or aria2c), but you would have to assemble a full DCS API request yourself; the example utility makes this slightly easier because request details can be omitted.

> cd examples-dcs
> ./gradlew assemble
...
BUILD SUCCESSFUL in 10s
4 actionable tasks: 4 executed

Make sure everything works (we are interested in the cluster command):

> java -jar build/assembled/lingo3g-dcs-examples.jar
INFO  console: Usage: [command] [command options]

    cluster
    clusterWithParams
    configuration
    dataModels

Finally, assemble a JSON file with the content of the documents to be clustered: an array of objects, each containing pairs of a field name and a string value. Something like the following (the field names and their number do not matter; each object is considered a separate document):

[
  { "title": "...", "content": "..." },
  { "title": "...", "content": "..." },
  { "title": "...", "content": "..." }
]
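
If your documents already live in some other structure, producing such a file takes only a few lines. The sketch below assumes Python and in-memory records given as dictionaries. The function name write_dcs_input is made up for illustration. Values are coerced to strings because the format above uses string values, and empty fields are dropped.

```python
import json
from pathlib import Path


def write_dcs_input(records, path):
    """Write records as a JSON array of objects with string-valued fields."""
    documents = []
    for record in records:
        # Drop empty fields and coerce every remaining value to a string.
        doc = {
            name: str(value)
            for name, value in record.items()
            if value is not None and value != ""
        }
        if doc:  # skip records that end up with no fields at all
            documents.append(doc)
    Path(path).write_text(
        json.dumps(documents, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return documents
```

The resulting file can then be passed to the cluster command in place of the bundled example data.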
    

The distribution package contains an example input file in examples-dcs/data/exampleData.json so we can just cluster that:

> java -jar build/assembled/lingo3g-dcs-examples.jar cluster data/exampleData.json
    

This results in the following output (top lines shown):

> Connecting to the default DCS address at: http://localhost:8080/service/ (use --dcs parameter to change)
> No algorithm or language specified. Using the first available template: frontend-default
> Clusters returned for file data/exampleData.json:
["Knowledge Discovery", docs: 0, score: 1.00]
  ["Software", docs: 4, score: 1.00]
  ["Databases", docs: 3, score: 0.94]
  ["Application Areas", docs: 2, score: 0.89]
  ["Process", docs: 2, score: 0.87]
  ["Using", docs: 2, score: 0.86]
["Data-mining Software", docs: 0, score: 0.98]
  ["Model", docs: 3, score: 1.00]
  ["Data Mining and Knowledge Discovery", docs: 4, score: 0.99]
  ["Data Mining Applications", docs: 2, score: 0.95]
  ["Developing", docs: 2, score: 0.90]
  ["Field", docs: 2, score: 0.90]
  ["Including", docs: 2, score: 0.90]
["Applications", docs: 0, score: 0.94]
  ["Application Areas", docs: 3, score: 1.00]
  ["International", docs: 3, score: 0.99]
  ["Algorithms", docs: 2, score: 0.92]
  ["Data Mining Software", docs: 2, score: 0.91]
  ["Trends", docs: 2, score: 0.90]
  ["Upon", docs: 2, score: 0.90]
  ["Will", docs: 2, score: 0.90]
["Text Mining", docs: 8, score: 0.92]
["Data Mining Tools", docs: 9, score: 0.92]
["Techniques", docs: 10, score: 0.92]
["Predictive", docs: 9, score: 0.90]
[...]
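
For quick experiments it can be handy to load this listing into a small tree structure. The Python sketch below infers the line format purely from the sample output above (a quoted label, a docs count, a score, and two spaces of indentation per nesting level). This is an assumption and not a documented, stable format, and the names Cluster and parse_clusters are hypothetical.

```python
import re
from dataclasses import dataclass, field

# One cluster line, e.g.:   ["Software", docs: 4, score: 1.00]
LINE = re.compile(
    r'^(?P<indent> *)\["(?P<label>.*)", docs: (?P<docs>\d+), '
    r'score: (?P<score>\d+(?:\.\d+)?)\]\s*$'
)


@dataclass
class Cluster:
    label: str
    docs: int
    score: float
    children: list = field(default_factory=list)


def parse_clusters(text, indent_width=2):
    """Parse the utility's text listing into a list of top-level clusters."""
    roots = []
    stack = []  # stack[i] is the most recently seen cluster at depth i
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # status lines ("> ..."), "[...]" and blank lines
        depth = len(match["indent"]) // indent_width
        cluster = Cluster(match["label"], int(match["docs"]), float(match["score"]))
        del stack[depth:]  # forget clusters at this depth or deeper
        (stack[-1].children if stack else roots).append(cluster)
        stack.append(cluster)
    return roots
```

For anything beyond ad hoc inspection, the REST API is probably the better integration point than scraping console output.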
    

The cluster utility is very simple. To experiment with custom algorithm attributes and more complex queries, you'll need to use the full REST API, described in this chapter.
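
As a rough illustration of what assembling a request by hand might involve, here is a Python sketch that uses only the standard library. The service address and the template name come from the output shown earlier. Everything else is an assumption to verify against the REST API reference: the cluster endpoint name, the template query parameter, the documents key in the request body, and a JSON response. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_cluster_request(documents,
                          service_url="http://localhost:8080/service/",
                          template="frontend-default"):
    """Assemble (but do not send) a POST request carrying the documents."""
    # ASSUMPTION: endpoint name "cluster" and query parameter "template".
    query = urllib.parse.urlencode({"template": template})
    url = service_url.rstrip("/") + "/cluster?" + query
    # ASSUMPTION: documents are sent under a "documents" key.
    body = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=30):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the DCS running, send(build_cluster_request(docs)) would post a list of document dictionaries. Building and sending are kept separate so the request can be inspected before anything goes over the wire.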