REST API Basics
This article explains the preparation steps required to run the document clustering server (DCS).
The document clustering server (DCS) is an HTTP server with an HTTP/REST service layer that runs Lingo3G. The DCS uses Eclipse Jetty HTTP components. More specifically, the DCS contains the following:
- HTTP REST service endpoints (/service context) for document clustering, dynamic inspection of service components and an OpenAPI descriptor,
- an example search frontend (/frontend context) for PubMed and a general-purpose meta search engine utilizing the service,
- a few Java examples that make use of model classes and query the service,
- embedded Jetty HTTP server,
- this documentation.
In the examples below we will refer to the DCS and the REST service interchangeably, although the service application context can be separated and deployed on any other web application container, such as Apache Tomcat.
Installation
The DCS is shipped with the Lingo3G distribution under the dcs folder.
The requirements for the service are consistent with Lingo3G environment requirements.
A valid license needs to be installed prior to launching the service. Refer to the license installation section for details.
The DCS will bind to port 8080 by default. An alternative port can be selected by passing the --port option to the launch script, for example:
> dcs --port 8081
13:35:47: DCS context initialized [algorithms: [Bisecting K-Means, Lingo, Lingo3G, STC], templates: [frontend-default]]
13:35:47: Service started on port 8081.
13:35:47: The following contexts are available:
http://localhost:8081/ DCS Root
http://localhost:8081/doc Lingo3G Documentation
http://localhost:8081/frontend DCS Search frontend
http://localhost:8081/javadoc Lingo3G Java API Javadoc
http://localhost:8081/service DCS
Once started, the service is ready to accept requests, by default at the http://localhost:8080/service/ endpoint.
API workflow
The document clustering service is essentially a single, stateless endpoint accepting JSON requests and returning JSON responses.
A full clustering request is a JSON file containing the following elements:
- clustering algorithm identifier,
- language (language components) identifier,
- text fields of documents to be clustered.
Such a request file must be sent using the HTTP POST method to the /cluster service endpoint, which returns either a successful response containing clusters or an error response with some additional diagnostic information.
Note that the request contains elements that may require some a priori knowledge, such as the clustering algorithm's identifier (Lingo3G) and the documents' language (English or any other supported by the algorithm). You can assume that certain components, such as the Lingo3G algorithm or the English language, always exist in your DCS distribution. Alternatively, you can enumerate the available components dynamically using the /list service endpoint.
Clustering
This section will go through a very basic example of a full request-response cycle.
Let's start by assembling the request JSON. We need to know the algorithm to be used for clustering (one that the underlying Carrot2 framework supports) and the language in which our documents are written, so that appropriate preprocessing is applied to the input text before clustering.
In this example we will use hardcoded values for the Lingo3G algorithm and the English language.
Documents for clustering are composed of one or more fields, where each field is a pair consisting of an identifier (name of the field) and value (a string or an array of strings). You should limit input documents to just those fields that should be clustered. In this example we will have three documents, each with one field:
[
  { "field": "foo bar" },
  { "field": "bar" },
  { "field": "baz" }
]
We have everything we need to put together the entire request body:
{
  "language": "English",
  "algorithm": "Lingo3G",
  "parameters": {
    "clusters": {
      "maxClusterSize": 1
    }
  },
  "documents": [
    { "field": "foo bar" },
    { "field": "bar" },
    { "field": "baz" }
  ]
}
You probably noticed that there is one element in the above request we have not discussed yet: the parameters block. This block is used to override the algorithm's default parameters. Our document list is tiny, so we use it to force the algorithm to produce at least one group and see what it looks like in the response.
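As a rough illustration, the request body above can also be assembled programmatically. The sketch below uses only the Java standard library. The class ClusterRequestJson and its methods are made up for this example and are not part of the DCS distribution, and the code assumes every document has a single field named field, as in the example above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the DCS distribution.
class ClusterRequestJson {
  /** Escapes backslashes, double quotes and control characters for a JSON string. */
  public static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      switch (c) {
        case '\\': sb.append("\\\\"); break;
        case '"': sb.append("\\\""); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }

  /** Builds a request body in which each text becomes a document with one field. */
  public static String build(
      String algorithm, String language, double maxClusterSize, List<String> texts) {
    String docs =
        texts.stream()
            .map(t -> "{ \"field\": \"" + escape(t) + "\" }")
            .collect(Collectors.joining(", "));
    return "{ \"language\": \"" + escape(language) + "\", "
        + "\"algorithm\": \"" + escape(algorithm) + "\", "
        + "\"parameters\": { \"clusters\": { \"maxClusterSize\": " + maxClusterSize + " } }, "
        + "\"documents\": [" + docs + "] }";
  }

  public static void main(String[] args) {
    // Same values as the hand-written request in this section.
    System.out.println(build("Lingo3G", "English", 1, List.of("foo bar", "bar", "baz")));
  }
}
```

Hand-rolled string concatenation is used here only to keep the sketch dependency-free. For anything beyond a quick experiment, a JSON library or the model classes shipped with the DCS are likely a better fit.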
Assuming the DCS is running in the background, the clustering service's default endpoint is at http://localhost:8080/service/cluster.
We are ready to send the above JSON for clustering using a command-line tool, such as curl:
curl -X POST --header "Content-Type: text/json" --data-binary @cluster-request.json "http://localhost:8080/service/cluster?indent"
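For comparison, the same call can be sketched in Java with the standard java.net.http client (Java 11 or later). The class ClusterClient and the method clusterRequest are names invented for this example. The URL, the indent query parameter and the Content-Type value simply mirror the curl command above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client sketch, not part of the DCS distribution.
class ClusterClient {
  /** Builds a POST request to the /cluster endpoint with the given JSON body. */
  public static HttpRequest clusterRequest(String serviceUrl, String jsonBody) {
    return HttpRequest.newBuilder(URI.create(serviceUrl + "/cluster?indent"))
        .header("Content-Type", "text/json")
        .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
        .build();
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("usage: ClusterClient <request.json>");
      return;
    }
    // Read the request file and send it to a DCS assumed to run on the default port.
    String body = Files.readString(Path.of(args[0]));
    HttpRequest request = clusterRequest("http://localhost:8080/service", body);
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println("HTTP " + response.statusCode());
    System.out.println(response.body());
  }
}
```

Keeping request construction separate from sending makes it possible to inspect the method, URL and headers without a running server.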
Note that the MIME type for JSON must be properly set (Content-Type: text/json). The response received from the service should look something like this:
{
  "clusters" : [
    {
      "labels" : [
        "Bar"
      ],
      "documents" : [
        0,
        1
      ],
      "clusters" : [ ],
      "score" : 1.0
    }
  ]
}
The response is a potentially recursive hierarchy of document clusters, where each cluster has the following properties:
- labels: Cluster description label or labels.
- documents: An array of references to documents contained in the cluster. Each reference is a 0-based index of the document within the clustering request.
- clusters: An array of subclusters of this cluster (if the algorithm supports hierarchical clustering).
- score: The cluster's quality score. The score is not normalized in any way but represents relative quality of each cluster within this request.
In the response above we see a single cluster of documents 0 and 1, labeled Bar.
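Because clusters can nest, code that consumes a response usually walks the hierarchy recursively. The sketch below shows one way to do that on an in-memory model. The ClusterTree and Cluster classes are hypothetical stand-ins with the four properties listed above, not the model classes shipped with the DCS, and JSON parsing is left out. Whether a parent's documents array already includes the documents of its subclusters is not stated here, so the example takes a set union, which gives the same result in both cases.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of the response, for illustration only.
class ClusterTree {
  /** Mirrors the labels, documents, clusters and score properties of a cluster. */
  public static final class Cluster {
    final List<String> labels;
    final List<Integer> documents;
    final List<Cluster> clusters;
    final double score;

    public Cluster(
        List<String> labels, List<Integer> documents, List<Cluster> clusters, double score) {
      this.labels = labels;
      this.documents = documents;
      this.clusters = clusters;
      this.score = score;
    }
  }

  /** Collects the document indices of a cluster and of all its subclusters. */
  public static Set<Integer> allDocuments(Cluster cluster) {
    Set<Integer> result = new TreeSet<>(cluster.documents);
    for (Cluster sub : cluster.clusters) {
      result.addAll(allDocuments(sub));
    }
    return result;
  }

  /** Renders the hierarchy as one line per cluster, indented by nesting depth. */
  public static String render(List<Cluster> clusters, String indent) {
    StringBuilder sb = new StringBuilder();
    for (Cluster c : clusters) {
      sb.append(indent)
          .append(String.join(", ", c.labels))
          .append(" ")
          .append(c.documents)
          .append("\n");
      sb.append(render(c.clusters, indent + "  "));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The single cluster from the example response.
    Cluster bar = new Cluster(List.of("Bar"), List.of(0, 1), List.of(), 1.0);
    System.out.print(render(List.of(bar), "")); // prints "Bar [0, 1]"
  }
}
```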
Request and response models
While it is perfectly fine to assemble the request JSON by hand, the DCS distribution comes with data model Java classes that can be used to build requests and parse responses. The example shown in the previous section can be expressed in Java code by the following snippet:
// Configure the algorithm instance with the parameter used in the JSON example.
Lingo3GClusteringAlgorithm algorithm = new Lingo3GClusteringAlgorithm();
algorithm.clusters.maxClusterSize.set(1.);

// Assemble the request: algorithm, language, parameters and documents.
ClusterRequest request = new ClusterRequest();
request.algorithm = Lingo3GClusteringAlgorithm.NAME;
request.language = "English";
request.parameters = Attrs.extract(algorithm);
request.documents =
    Stream.of("foo bar", "bar", "baz")
        .map(
            value -> {
              // Each document carries a single field named "field".
              ClusterRequest.Document doc = new ClusterRequest.Document();
              doc.setField("field", value);
              return doc;
            })
        .collect(Collectors.toList());
The request can then be serialized into JSON using the Jackson library. The DCS Java examples contain a few command-line applications that make extensive use of these model classes; please refer to them for details.
Alternatively, the OpenAPI descriptor can be used to generate service binding code for Java and many other languages.
Service configuration
The second service endpoint the DCS exposes is called /list. When invoked with an HTTP GET request (without any parameters), the service returns information on the available algorithms and languages. An example response can look like this:
{
  "algorithms" : {
    "Lingo3G" : [
      "Dutch",
      "English"
    ]
  },
  "templates" : [
    "frontend-default"
  ]
}
Note that each algorithm has an associated list of the languages it supports.
The templates block enumerates preconfigured request templates.
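A client can use this endpoint to check that an algorithm and language pair is available before sending a clustering request. The following is a minimal sketch under a few assumptions: the class ListComponents and its methods are invented for this example, the service URL is passed on the command line, and the algorithms block is assumed to have already been parsed into a map by a JSON library of your choice.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the DCS distribution.
class ListComponents {
  /** Builds a parameterless GET request to the /list endpoint. */
  public static HttpRequest listRequest(String serviceUrl) {
    return HttpRequest.newBuilder(URI.create(serviceUrl + "/list")).GET().build();
  }

  /** Returns true if the parsed "algorithms" block lists the language for the algorithm. */
  public static boolean supports(
      Map<String, List<String>> algorithms, String algorithm, String language) {
    return algorithms.getOrDefault(algorithm, List.of()).contains(language);
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("usage: ListComponents <service-url>");
      return;
    }
    // For example: ListComponents http://localhost:8080/service
    HttpResponse<String> response =
        HttpClient.newHttpClient()
            .send(listRequest(args[0]), HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```

With the example response above parsed into a map, supports(algorithms, "Lingo3G", "English") would return true and supports(algorithms, "Lingo3G", "Polish") would return false.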
OpenAPI service descriptor
The DCS comes with an OpenAPI service specification descriptor, by default accessible at http://localhost:8080/service/openapi/dcs.yaml. This descriptor contains documentation and working examples for all service endpoints and parameters.
The DCS ships with three OpenAPI specification browsers:

RapiDoc's representation of DCS's OpenAPI descriptor.