Carrot Search Lingo4G

clustering engine reference, version 1.2.0

Carrot Search Lingo4G is a next-generation text clustering engine capable of processing tens of gigabytes of text and millions of documents. Lingo4G can process both the whole collection and an arbitrary subset of it in near-real-time. This makes Lingo4G particularly suitable as a component of text document analysis suites.

Quick start

This section is a 7-minute overview of Lingo4G features and a tutorial on how to apply Lingo4G to the questions and answers posted at superuser.com, a QA site for computer enthusiasts. For a more detailed description of Lingo4G architecture and usage, feel free to skip directly to the Introduction or Basic usage chapter.

To process the StackExchange questions with Lingo4G:

  1. Prerequisites. Make sure a Java runtime environment, version 8 or later, is available on your system.

  2. Installation.

    • Download Lingo4G distribution archive and unpack it to some local directory. We will refer to that directory as Lingo4G home directory or L4G_HOME.
    • Copy your license.zip or license.xml file to L4G_HOME/conf.
    • Make sure there is at least 2.5 GB of free space on the drive.
  3. Indexing. Open a command console, change the current directory to the Lingo4G home directory and run:

    l4g index -p datasets/dataset-stackexchange

    Lingo4G will download superuser.com questions from the Internet (about 187 MB) and then prepare them for clustering. The whole process may take a few minutes, depending on the speed of your machine and Internet connection. When indexing completes successfully, you should see a message similar to:

    ...
    > Lingo4G 1.2.0 (Java HotSpot(TM) 64-Bit Server VM, 25.77-b03)
    > Indexing with 1 feature contributor.
     1/21 Execution graph                                        done     69ms
     2/21 Working index setup                                    done     55ms
     3/21 Opening source                                         done       1m
     4/21 Term accounting                                        done      15s
     5/21 Flushing counters                                      done    154ms
     6/21 Term counter RAM release                               done    342ms
     7/21 Term filtering                                         done    248ms
     8/21 Phrase accounting                                      done      17s
     9/21 Flushing counters                                      done    413ms
    10/21 Phrase counter RAM release                             done       1s
    11/21 Candidate label FST                                    done       1s
    12/21 Surface accounting                                     done      30s
    13/21 Flushing counters                                      done    120ms
    14/21 Surface counter RAM release                            done    392ms
    15/21 Surface form selection                                 done    420ms
    16/21 Final label FST                                        done    293ms
    17/21 Lucene indexing                                        done      56s
    18/21 Closing index                                          done      11s
    19/21 Optimizing Index                                       done     52ms
    20/21 Term statistics                                        done       1s
    21/21 Stop label extraction                                  done      15s
    > Indexing using 7 threads (4 data passes in 1m 59s,
      ~4.37Mi chars/sec., 2.03ki docs/sec.).
    > Done. Total time: 3m 46s.
  4. Starting Lingo4G REST API server. In the same console window, run:

    l4g server -p datasets/dataset-stackexchange

    When the REST API starts up successfully, you should see messages similar to:

    > Lingo4G 1.2.0 (Java HotSpot(TM) 64-Bit Server VM, 25.77-b03)
    > Starting Lingo4G server...
    > Lingo4G REST API endpoint at /api/v1, attached to project: [...]\dataset-stackexchange
    > Web server endpoint at /, serving content of: [...]\web
    > Enabling development mode for web server.
    > Lingo4G server started on port 8080.
  5. Exploring the data with Lingo4G Explorer. Open http://localhost:8080/apps/explorer in a modern browser (Chrome, Firefox, Internet Explorer 11+). You can use Lingo4G Explorer to analyze the whole collection or an arbitrary subset of it.

  6. Exploring other data sets. To index and explore another StackExchange site, pass the identifier of the site using the -Dstackexchange.site=<site> option, for example:

    l4g index  -p datasets/dataset-stackexchange -Dstackexchange.site=scifi
    l4g server -p datasets/dataset-stackexchange -Dstackexchange.site=scifi

    The Example data sets section lists other public data sets you can try.

  7. Exploring your own data. The quickest way to index and explore your own data is to modify the example JSON data set project descriptor available in the datasets/dataset-json directory.

  8. Next steps. See the Introduction section for some more information about the architecture and conceptual design of Lingo4G. For more information about the Explorer application, see the Lingo4G Explorer section.

Introduction

Carrot Search Lingo4G is a next-generation text clustering engine capable of processing tens of gigabytes of text and millions of documents.

Lingo4G features include:

  • Topic discovery. Lingo4G can extract and meaningfully describe the topics covered in a set of documents. Related topics can be organized into themes. Lingo4G can retrieve the specific documents matching each identified topic and theme.
  • Document clustering. Lingo4G can organize the provided set of documents into non-overlapping groups.
  • Near real-time processing. On modern hardware Lingo4G can process large sets of documents in a matter of seconds or minutes.
  • Browser-based tuning application. To enable rapid experimentation and tuning of processing results, Lingo4G comes with a browser-based application called Lingo4G Explorer.
  • REST API. All Lingo4G features are exposed through a JSON-based REST API.

Architecture

To efficiently handle millions of documents and gigabytes of text, Lingo4G processing is split into two phases: indexing and analysis (see figure below). Indexing is typically a one-time process in which Lingo4G digests all the documents from your collection. Once indexing is complete, Lingo4G can analyze the whole indexed collection or an arbitrary subset of it to discover topics or cluster documents. Analysis parameters, such as the subset of documents to analyze, topic extraction thresholds or the characteristics of labels, can be varied without the need to index the documents again.

The two-phase operation model of Lingo4G is analogous to the workflow of enterprise search platforms, such as Apache Solr or Elasticsearch. The collection of documents first needs to be indexed and only then can the whole collection or a part of it be searched and retrieved.

Due to the two-phase processing model, Lingo4G is particularly suitable for clustering fairly "static" collections of documents where the text of all documents can be retrieved for indexing. The natural use case for Lingo4G is therefore analyzing large volumes of human-readable text, such as scientific papers, business or legal documents, news articles, blog or social media posts.

Conceptual overview

This chapter describes the fundamental concepts involved in the operation of Lingo4G. Subsequent sections describe various aspects of content indexing and analysis. The glossary section summarizes all important terms used throughout Lingo4G documentation.

Project

A project defines all the information necessary to process one collection of documents in Lingo4G. Among other things, the project defines:

  • default parameter values for the indexing and analysis process,
  • document source to use during indexing,
  • dictionaries of stop words and stop phrases that can be used during indexing and analysis, for example to remove meaningless labels,
  • work and temporary directories: location of the disk directories in which Lingo4G will keep the persistent and temporary files used during indexing and analysis.

Lingo4G stores project information in the JSON format. Please see datasets/​dataset-stackexchange/​stackexchange.project.json for an example project definition and the project file documentation for the list of all available properties.
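
For orientation, a heavily trimmed descriptor might look like the sketch below. The section names follow the JSON data set example discussed later in this manual; the values are illustrative only, so refer to the project file documentation for the complete structure:

{
  // Document source to use during indexing.
  "source": {
    "feed": { ... }
  },

  // How each document field is processed for searching and analysis.
  "fields": {
    "title": { "analyzer": "english" }
  },

  // Feature extractors applied during indexing.
  "features": [
    { "type": "phrases", "key": "phrases" }
  ]
}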

Each Lingo4G command (indexing, analysis, REST server) operates on the content of one project at a time. If enough resources are available, you can simultaneously run multiple instances of Lingo4G processing the content of different projects.

Indexing

Indexing is a preparatory process that must be applied to all documents in the project before they can be analyzed. During indexing, Lingo4G will iterate several times over the documents returned by the document source defined in the project and store an internal persistent representation of them. Additionally, Lingo4G will try to discover collection-specific stop labels.

Source documents

The task of a document source is to define the structure and deliver the text of the source documents to be indexed. Lingo4G comes with a number of example document sources for accessing publicly available collections of documents, such as StackExchange, IMDb or PubMed data. A generic document source for reading data from a JSON file is also available.

Taking StackExchange as an example, each source document would correspond to one question asked on the site. Each such document would consist of a number of source fields corresponding to the natural parts of the document, such as:

  • id — the unique identifier of the question,
  • title — the title of the question,
  • body — the text of the question,
  • answered — true if the question is answered,
  • accepted_answer — the text of the accepted answer, if any,
  • other_answers — the text of other answers,
  • tags — the user-provided tags for the question,
  • created — the date the question was created,
  • score — the StackExchange-assigned score of the question,
  • answers, views, comments, favorites — the number of answers, views, comments and times the question was marked as favorite, respectively.

The document source also needs to define how the contents of each field should be processed for searching and analysis. For instance, the id will likely need to be stored exactly as provided by the document source, while the "natural text" fields, such as title and body, will need to be split into words and have English stemming applied. For a complete list of available source field options, see the fields section of the project descriptor.

Please note that currently Lingo4G can only analyze the "natural text" fields, which in the above example would be the title, body, accepted_answer and other_answers fields. The remaining fields can only be used for display purposes and for building the analysis scope queries.

The document source is configured in the source section of the project descriptor. The source code of the example document sources is available in the src/ directory of Lingo4G distribution. Feel free to use them as a starting point for creating the document source for your custom data.

Indexed documents

During indexing, Lingo4G will convert each source document returned by the document source into an indexed document stored in Lingo4G index. There are two major parts of this process: feature extraction and indexing.

Feature extraction

During feature extraction, Lingo4G will scan the selected "natural text" fields of source documents to collect the labels to be used during analysis. Currently, two feature extractors are available. The phrase extractor will extract frequent words and sequences of words as labels, while the dictionary extractor will use the provided dictionary of predefined labels.

Indexing

During the indexing phase, Lingo4G will create the index, which is a persistent copy of all the documents provided by the document source. Additionally, the index will record the occurrences of all labels returned by the feature extractor across all documents. Lingo4G will access the index during analysis.

Stop label extraction

After indexing is complete, Lingo4G will attempt to discover the collection-specific stop labels, that is, labels that do not differentiate well between documents in the collection. When indexing e-mails, the stop labels could include kind regards or attachment; for medical articles the set of meaningless labels would likely include words and phrases like indicate, studies suggest or control group.

Analysis

During analysis, Lingo4G will return information that helps the analyst gain insight into the contents of the whole indexed collection or the requested part of it. This section discusses various concepts involved in the analysis phase.

Note: Many concepts in this section are illustrated by screen shots of the Lingo4G Explorer application processing data from StackExchange Super User, a question and answer site for computer enthusiasts and power users. While Lingo4G Explorer uses specific user interface metaphors to visualize different Lingo4G analysis facets, your application will likely choose different means to present the same data.

Analysis scope

Analysis scope defines the set of documents to be analyzed. The scope may include only a small subset of the collection, but it can also extend over all indexed documents. The specific definition of the analysis scope is usually based on a search query targeting one or more indexed fields.

Sticking to our StackExchange example, the scope definition queries could look similar to:

  • title:amiga — all questions containing the word amiga in their title
  • title:amiga OR body:amiga OR accepted_answer:amiga OR other_answers:amiga — all questions containing the word amiga in any of the "natural text" fields. To simplify queries spanning all the textual fields, you can define the default list of fields to search. If all the textual fields are on the default search field list, the query could be simplified to amiga.
  • amiga 1200 — all questions containing both the word amiga and the word 1200 in any of their natural text fields. Please note that the interpretation of such a query will depend on the configuration; the configuration may change the operator from the default AND to OR.
  • amiga AND tag:data-transfer — all questions containing the word amiga in any of the text fields and having the data-transfer tag (and possibly other tags).
  • security AND created:2015* — all questions containing the word security that were created in 2015.

Please note that the way specific query words are matched against the actual occurrences of those words in documents depends on the field specification provided by the document source. For instance, if the English analyzer is used, matching will be done in a case- and grammatical-form-insensitive way. In this arrangement, the query term programmer will match all of programmer, programmers and Programmers.

Label list

The fundamental analysis result is the list of labels that best describe the documents in scope. For each label, Lingo4G will provide additional information including the occurrence frequencies (document frequency, term frequency). In a separate request, Lingo4G can retrieve the documents containing the specified label or labels. The list of selected labels is the base input for computing other analysis facets, such as label clusters and document clusters.

Lingo4G offers a broad range of parameters that influence the choice of labels, such as the label exclusions dictionary, the maximum number of labels to select, the minimum relative document frequency, the minimum number of label words or the automatic stop label removal strength. Please see the documentation of the labels section of the project descriptor for more details.

An important property of the selected set of labels is its coverage, that is the percentage of the documents in scope that contain at least one of the selected labels. In most applications, it is desirable for the selected labels to cover as many of the documents in scope as possible.
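
Expressed as a formula (a restatement of the definition above, not a quantity defined elsewhere in this manual):

$$\text{coverage} = \frac{\left|\{\, d \in S : d \text{ contains at least one selected label} \,\}\right|}{|S|}$$

where S denotes the set of documents in scope.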

Label clusters

Lingo4G can organize the flat list of labels into clusters, that is groups of related labels. In such an arrangement, the users can more easily get an overview of the documents in scope and navigate to the content of interest.

Structure of label clusters

Clusters of labels created by Lingo4G have the following properties:

  • Non-overlapping. Each label can be a member of at most one cluster; some labels may remain unclustered.
  • Described by exemplars. Each cluster has one designated label, the exemplar, that serves as the description of the whole cluster. It is important to stress that the relation between member labels and the exemplar is more of the is related to kind rather than the is parent / child of kind. The following figure illustrates this distinction.

  • Connected to other clusters. The exemplar label defining one cluster can itself be a member of another cluster. In the example graph above, the Firefox, Malware, Google Chrome and Html labels, while serving as exemplars for the clusters they define, are also members of the cluster defined by the Browser label. This establishes a relationship between label clusters that is similar in nature to the member–exemplar label relation. Coupled with the fact that this relationship is also of the is related to kind, this can create chains of related clusters, as shown in the following figure.

Presentation of label clusters

The label clusters returned by the Lingo4G REST API preserve the hierarchical structure of label clusters to make it easy for application code to visualize the internal structure. However, in some applications it may be desirable to “flatten” that structure to offer a simplified view. In a flattened arrangement, a cluster hierarchy of arbitrary depth is represented as a two-level structure: each connected group of label clusters gives rise to one “master” label cluster, and individual label clusters become members of the master cluster. With this paradigm, the complete label clustering result can be presented as a flat list of master clusters.

Lingo4G Explorer flattens label clusters for presentation in the textual and treemap views. To emphasize the two-level structure of the view, instead of the generic notion of a label cluster, Lingo4G Explorer uses the notions of theme and topic. In Explorer's terms, a theme is the “master” cluster that groups individual label clusters referred to as topics. The topic whose exemplar label is not a member of any other cluster (the Partition topic in the example below) serves as the description of the whole theme.

Retrieval of label cluster documents

The list of label clusters produced by Lingo4G does not come with documents automatically assigned to the clusters. This gives the specific application the flexibility of choosing which documents to show when the user selects a specific label cluster or cluster member label for inspection. A couple of approaches are possible:

  • Fetching documents matching individual labels. The application fetches documents containing the selected cluster member label, and when a label cluster is selected — documents containing the exemplar label. This approach is simple to understand for the users, but may cause irrelevant documents to be presented. Referring back to the “web browser” label clusters example, if the user selects the Cache label, which is a member of the Browser cluster, the list of documents containing the Cache label will likely include documents not related to web browsers.
  • Limiting the presented documents to the ones matching the exemplar label. With this approach, if the user selects a member label, the application would fetch documents containing both the selected member label and the cluster exemplar label. If the whole cluster is selected, the application could present the documents containing the exemplar label and any of the cluster's label members.

    With this approach, when the user selects the Cache label being part of the Browser cluster, only documents about browser cache would be presented (see the example query after this list). The downside of this method is that it may not be appropriate for certain member–exemplar combinations, such as the Opera member label being part of the Firefox cluster. Also, if the cluster contains irrelevant labels, irrelevant documents will be shown when the user selects the whole cluster.

  • Letting the user decide. In this approach, the application would allow the user to make multiple label selections to indicate which specific combination of labels they are interested in. Even in this scenario, some processing should be applied. For instance, if the user selects two cluster exemplar labels, the application should probably show all the documents containing either of the exemplar labels. However, if the user selects the label exemplar and two member labels of that cluster, it may be desirable to show documents containing the exemplar label and any of the selected member labels.
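
For example, following the second approach for the Cache member of the Browser cluster, the application could fetch the relevant documents with a scope-style query similar to the one below (this assumes both labels occur in the default search fields, as discussed in the Analysis scope section):

browser AND cache

The same pattern extends to whole-cluster selection by replacing the member label with an OR over all of the cluster's member labels.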

Document clusters

Lingo4G can organize the list of documents in scope into a flat list of non-overlapping clusters, that is groups of content-wise similar documents. In a typical use case, document clustering could help the analyst to divide a collection of, for example, research papers, into batches to be assigned to reviewers based on the subjects the reviewers specialize in.

Structure of document clusters

Document clusters created by Lingo4G have the following properties:

  • Non-overlapping. Each document can belong to only one cluster or remain unclustered.

  • Described by exemplar. Each cluster has one designated document, the exemplar, selected as the most characteristic “description” of the other documents in the cluster.

  • Described by labels. For each cluster, Lingo4G will generate a list of labels that most frequently appear in the cluster's documents. Labels on this list are chosen from the set of labels selected for analysis.

Performance considerations

Unlike in label clustering, where the number of labels is usually on the order of a thousand, the number of documents to cluster may reach tens or hundreds of thousands. While Lingo4G is capable of clustering hundreds of thousands of documents and more, the process may take several minutes. For this reason, there is a parameter that limits the maximum number of documents to cluster.

Document retrieval

Typically, apart from internal data, the Lingo4G index will also contain the original text of all the indexed documents. The document retrieval part of the Lingo4G REST API lets a Lingo4G-based application fetch the content of documents based on different criteria. Most commonly, the application will request documents containing a specific label or labels (when the user selects a label or label cluster for inspection) or documents with specific identifiers (when the user selects a document cluster).

Glossary

This section provides basic definitions of the terms used throughout the Lingo4G documentation. Please see the preceding sections of this chapter for more in-depth descriptions.

Analysis

During analysis, Lingo4G will process the documents found in the requested analysis scope and produce the following information:

Label list
A flat list of labels that describe the documents in scope.
Label clusters
A list of clusters that group related labels. A specific user interface based on Lingo4G, such as Lingo4G Explorer, may use alternative, more user-friendly names for a label cluster, such as theme or topic.
Document clusters
A list of clusters, each of which groups related documents.
Analysis scope
Analysis scope defines the set of documents being analyzed. An analysis scope can include just a handful of the documents in the project, but may well cover all of the project's documents. The specific definition of the analysis scope is usually based on a search query targeting one or more indexed fields.
Dictionary
A collection of words and phrases that can be used during indexing or analysis. Typically, dictionaries are used to exclude certain labels from analysis.
Document

A document is the basic unit of content processed by Lingo4G, such as a scientific paper, business or legal document, blog post or social media message. Each document can consist of one or more fields, which correspond to the natural parts of the document, such as the title, summary, publication date or user-generated tags.

Lingo4G distinguishes two types of documents:

Source document
Original document (fields and their text) delivered by the document source.
Indexed document
Representation of a source document stored in the index. Lingo4G will create one indexed document for each source document.
Document source

Document source defines the structure and delivers the text of the source documents to be included in Lingo4G processing. Lingo4G will access this information during indexing. The index will contain a copy of all documents provided by the document source. During analysis, Lingo4G will retrieve the documents from the index and therefore will not access the document source.

Field

A field corresponds to a natural part of a document. Typically, each document will consist of several fields, such as the title, abstract, body, creation date or human-assigned keywords.

Lingo4G distinguishes three types of fields:

Source field
Field of a source document. Definition of the source field can include information on how the contents of the field should be processed for searching and analysis.
Indexed field
Field of an indexed document. In typical configurations, the indexed document will contain one indexed field for each source field of the corresponding document. The indexed fields will usually be referenced in queries defining the analysis scope.
Feature field
Internal field stored in the index containing the specific labels Lingo4G can use for analysis.
Index

Index is the persistent internal representation of the documents in the project. The index contains a copy of each document's text, but also other internal information required to perform document analysis.

Indexing

Indexing is a preparatory process that must be applied to all documents in the project before they can be analyzed. During indexing, Lingo4G will iterate several times over the documents returned by the document source and store an internal persistent representation of them called the index.

Label

A specific human-readable feature that occurs in one or more documents. Labels are the basic bits of information Lingo4G will use to build the results of document analysis.

Currently, Lingo4G supports labels based on sequences of words (phrases). For example, if the label text is Christmas tree, any document containing the Christmas tree text will be deemed to contain that label.

Project

A project defines all the necessary information to index and analyze one collection of documents.

Silhouette

Silhouette coefficient is a property that can be computed for individual labels or documents arranged in clusters. Silhouette indicates how well the entity matches its cluster.

High Silhouette values indicate a good match, which happens when the entity's similarity to other entities in the same cluster is high and the entity's similarity to the closest entity outside of the cluster is low.

Low Silhouette values indicate that the entity may match a different cluster better, that is, its similarity to other cluster members is low while its similarity to the closest non-member of the cluster is high.
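
For reference, the classic distance-based formulation of the Silhouette coefficient for an entity i is shown below. Lingo4G's similarity-based variant follows the same intuition; the exact formula Lingo4G uses is not specified here, so treat this as background only:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where a(i) is the average distance between i and the other members of its cluster, and b(i) is the lowest average distance between i and the members of any other single cluster. Values close to 1 indicate a good match; values close to -1 suggest the entity would fit another cluster better.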

Stop label

A label that carries no significant meaning in the context of the currently processed collection of documents. The set of stop labels will usually contain common function words, such as the or for. Additionally, domain-specific stop labels are also possible: in the context of medical articles, these could include phrases like studies suggest or control group. Lingo4G will automatically detect some meaningless phrases during indexing.

APIs and tools

The following Lingo4G tools and APIs are available in the distribution bundle:

Command-line tool

You can use the l4g command line tool to:

  • index your documents,
  • invoke analysis of your documents and save the results to a JSON, XML or Excel file,
  • start the Lingo4G REST API server,
  • get diagnostic information.
HTTP/REST API
You can use the Lingo4G REST API to invoke the analysis of your documents and retrieve the analysis results. The REST API ships with the Lingo4G Explorer application that lets you tune analysis settings interactively.
Lingo4G Explorer

Lingo4G Explorer is a browser-based application you can use to:

  • run Lingo4G analyses in an interactive fashion,
  • explore analysis results through text- and visualization-based views,
  • tune Lingo4G analysis settings.

Lingo4G Explorer runs on top of the Lingo4G HTTP/REST API and comes with full source code. You can study the code to see how the REST API is used to drive a real-world application. Feel free to reuse parts of Explorer's code in your own code base. You can also fork Lingo4G Explorer into your own custom Lingo4G-based front-end.
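
Explorer drives all of its views through that API. As a rough sketch, an analysis request could look like the call below; the /api/v1 prefix is the endpoint documented in the quick start, but the analysis resource name and the request body shown here are assumptions, so consult the REST API reference for the actual contract:

curl -X POST "http://localhost:8080/api/v1/analysis" \
     -H "Content-Type: application/json" \
     -d '{ "scope": { "type": "byQuery", "query": "amiga" } }'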

Limitations

Lingo4G has the following limitations we plan to address in the future:

  • Lingo4G does not support the ad-hoc processing case very well, such as clustering search results from a public search engine, where the documents to be clustered are not known in advance. Such a case could be handled by creating an ad-hoc throw-away index for each set of results to be clustered, but Lingo4G is not currently optimized for efficient operation in such a mode. Lingo3G was created precisely with this use case in mind.

  • The initial releases of Lingo4G can only cluster text written in English.

Lingo4G or Lingo3G?

Despite similar names, Lingo4G and Lingo3G are fundamentally different with respect to architecture and intended use cases. Therefore, we will maintain and offer them as two separate products. The following comparison contrasts Lingo4G with Lingo3G.

Architecture

Lingo4G: Stateful. Lingo4G first needs to index all your documents and only then can it analyze the whole indexed collection or a subset of it. Including previously unseen documents in an analysis requires re-indexing. The split into two phases allows Lingo4G to perform the expensive operations, such as tokenization of documents, only once, during indexing.

Lingo3G: Stateless. Lingo3G performs processing in one step; all documents provided on input are immediately processed and disposed of. The stateless paradigm makes it possible to process arbitrary, previously unseen documents. The cost is that the time-consuming operations, such as tokenization, are repeated for each set of documents being processed.

Memory requirements

Lingo4G: Dependent on the specific analysis task. Lingo4G does not need to keep the content of the documents being analyzed in memory. For example, the memory requirements for label clustering depend on the number of selected labels, not the number of documents in scope. Therefore, label clustering can be applied to gigabytes of text while keeping memory requirements limited.

Lingo3G: Dependent on the number of documents. Lingo3G keeps the input documents and their internal representation in memory for the duration of processing. Therefore, Lingo3G is not recommended for clustering more than about 50 MB of text at once.

Optimum use case

Lingo4G: Processing of fairly static collections of documents where the text of all documents can be made available for indexing. The practical limit on the number of indexed and analyzed documents depends on the hardware and will usually be near 100 GB of text.

Lingo3G: Processing of small and medium-sized collections of ad-hoc content, such as search results.

Result facets

Lingo4G: A range of analysis facets, including label clusters and document clusters.

Lingo3G: Results conceptually similar to Lingo4G label clustering. Conventional document clustering is not available.

APIs

Lingo4G: HTTP/REST API.

Lingo3G: Java API, C# API, HTTP/REST API.

Requirements

While Lingo4G processing cannot currently be distributed across multiple machines, a high-end workstation with SSD storage should be capable of handling collections of several tens of gigabytes. For most data sets of up to a few gigabytes, any computer with 4 GB of memory and some disk space will be sufficient.

Storage

Storage technology and size are the key factors that influence Lingo4G performance.

Storage technology

Solid-state drives (SSD) are highly recommended for storing the Lingo4G index and temporary files, especially if the files are too large to fit in the operating system's disk cache. With SSD storage, Lingo4G will be able to effectively use multiple CPU cores for processing and thus significantly decrease the processing time.

The following table shows the time required to write and read Lingo4G index for the OHSUMED collection on a server-grade SSD drive and a desktop-grade HDD drive. While the drive technology does not make a significant difference during indexing, when it comes to content analysis (reading the index), SSD drives offer significant speed-ups in multi-threaded configurations. However, once the index data gets cached by the operating system, multi-threaded performance is similar for SSD and HDD. Please note that the latter can only happen when there is enough RAM for the operating system to cache the whole Lingo4G index.

Operation                                   SSD     HDD
Indexing (11 threads)                       147 s   158 s
Index reading (1 thread, empty cache)        22 s    27 s
Index reading (1 thread, cached)             22 s    22 s
Index reading (12 threads, empty cache)       6 s    24 s
Index reading (12 threads, cached)            6 s     7 s
The measurements were taken in the following environment: Intel(R) Xeon(R) CPU E5-1650 v2 (6 cores) @ 3.50GHz, 32 GB RAM, Ubuntu 12.04.4 LTS, Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode). SSD drive: Samsung SSD 840 PRO Series, HDD drive: WD RE4 1TB 7200 rpm. Command used to perform indexing: l4g index -p datasets/dataset-ohsumed. Command used to perform index reading: l4g stats -p datasets/dataset-ohsumed --accuracy 1.0 --analyze-text-fields -t <threads>.

Storage space

Lingo4G persistent storage requirements are typically 2x–3x the total size in bytes of the text in your collection. The following table shows the size of Lingo4G persistent index for the example data sets.

Collection    Total size of raw text    Size of Lingo4G index
IMDb          231 MB                    752 MB
OHSUMED       366 MB                    843 MB
PubMed        31 GB                     66 GB

In addition to the space occupied by the index itself, Lingo4G will require several GB of additional disk space for temporary files while indexing. These temporary files can be deleted after indexing is complete.

CPU and memory

CPU: 4–16 hardware threads. Lingo4G can perform processing in parallel on multiple CPU cores, which can greatly decrease the latency. Depending on the size of the collection and the number of concurrent analysis threads, the reasonable number of CPU hardware threads will be between 4 and 16.

RAM: the more the better. During document analysis, Lingo4G will frequently access the persistent index data store created during indexing. For the highest multi-threaded processing performance, the amount of RAM available to the operating system should ideally be large enough for the OS to cache all Lingo4G index files, so that the number of disk accesses is minimized. This is especially important if the Lingo4G index files are stored on an HDD.

JVM heap size: the default 4 GB should be enough in most scenarios. The default JVM heap size should be enough to perform indexing regardless of the size of the input data set and for the typical document analysis scenarios. When analyzing very large subsets of the data set or handling multiple concurrent analyses, the JVM heap size may need increasing. Also note that needlessly increasing the JVM heap may have an adverse effect on performance as it may decrease the amount of memory that would be otherwise used for disk caches.

Java Virtual Machine

Lingo4G requires Java 8 or later. Ideally, the JVM should be 64-bit to enable multi-gigabyte heap sizes if needed.

Heads up, JVM bugs!

Oracle Java 8 JVMs prior to version 1.8.0_u60 contain several bugs that can cause Lingo4G processing to fail. To avoid runtime errors, use Oracle Java 8 version 1.8.0_u60 or later.

Installation

To install Lingo4G:

  1. Extract Lingo4G ZIP archive to some local directory. We will refer to this directory as Lingo4G home directory or L4G_HOME.
  2. Copy your license file (license.zip or license.xml) to the L4G_HOME/conf directory. Alternatively, you can place the license file in the conf directory under a given project. In that case, the license will be read for commands operating on that project only.

    Any license*.xml file (in a ZIP archive or unpacked) will be loaded as a license key, so you may give your license keys more descriptive names, if needed (license-production.xml, license-development.xml).

  3. You may want to add L4G_HOME to your command search path, so that you can easily run Lingo4G commands in any directory.

The contents of the particular directories inside L4G_HOME are the following:

conf
Configuration files, license file.
datasets
Project files for the example data sets.
doc
Lingo4G manual.
lib
Lingo4G implementation and dependencies.
resources
The default lexical resources, such as stop words and label dictionaries.
src
Example code: calling Lingo4G REST API from Java.
Source code of the IMDb, OHSUMED, PubMed and other example document sources.
web
Static content served by Lingo4G REST API. You can prototype your HTML/JavaScript application based on Lingo4G REST API directly in that directory.
l4g, l4g.cmd
The Lingo4G command script for Linux/Mac and Windows.
README.txt
Basic information about the distribution, pointers to the documentation.

Basic usage

The general workflow with Lingo4G will consist of three phases: creating the project descriptor file for your specific data, indexing your data and finally running analysis.

Creating project descriptor

To start analyzing data, you need to create a project descriptor file that describes how to access the content during indexing and which specific indexing and analysis parameters to use. Only the required properties and non-default values need to be specified in the descriptor; everything else is filled in with defaults. To see a fully resolved descriptor, including all the settings, invoke the l4g show command.
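
For example, to print the fully resolved descriptor of the StackExchange example project (assuming l4g show accepts the -p option in the same way as the other commands shown in this manual):

l4g show -p datasets/dataset-stackexchange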

There are three major ways to get some data into Lingo4G:

  • Use one of the example data sets. Lingo4G ships with a number of example project descriptors for processing publicly available data sets, such as PubMed papers or StackExchange questions. This is the quickest way to try Lingo4G on real-world content.
  • Modify the example JSON data set project descriptor. This is the easiest way to get your own data into Lingo4G.
  • Write custom Java code to bring your data into Lingo4G. While this method is the most demanding, it will allow you, for example, to feed the Lingo4G indexer directly from your data store, such as a Lucene index, SQL database or file share. The easiest way to get started is to set up the source code of the JSON document source in an IDE and modify it to suit your needs.

If Carrot Search provided a template descriptor for your specific data set, you will only need to copy the descriptor, possibly accompanied by some other files, into a new directory.

Example data sets

The L4G_HOME/datasets directory contains a number of project descriptors you can use to index and analyze selected publicly available document sets. With the exception of the PubMed data set, Lingo4G will attempt to download each data set from the Internet automatically. The following table summarizes the available example data sets.

Each entry below gives the project directory, a description of the data set, the number of documents, the required disk space (1) and the indexing time (2).

(1) Disk space required to index the data set; includes the source version of the data downloaded from the Internet and temporary files.

(2) Time required to index the data set, excluding source version download. The times are reported for indexing executed on the following hardware: Intel Core i7-2600K 3.4GHz, 12GB RAM, Windows 10, SSD drive: Samsung 850 PRO Series.

dataset-clinicaltrials

Clinical trials data set from clinicaltrials.gov, a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world.

Documents: 200k · Disk space: 2.5 GB · Indexing time: 8m
dataset-imdb

Movie and TV show descriptions from imdb.com.

Documents: 476k · Disk space: 1 GB · Indexing time: 3.5m
dataset-json

A small sub-sample of the StackExchange data set, converted to a straightforward JSON format. This example (and project descriptor) can be reused to index custom data.

Documents: 251 · Disk space: 1 MB · Indexing time: 3s
dataset-ohsumed

Medical article abstracts from the OHSUMED collection.

Documents: 350k · Disk space: 1 GB · Indexing time: 3m
dataset-pubmed

Open Access subset of the PubMed Central database of medical paper abstracts.

Due to the large size of the original data set, Lingo4G does not download it automatically by default. Please see datasets/dataset-pubmed/README.txt for detailed instructions.

Documents: 1.34M · Disk space: 86 GB · Indexing time: 5h 11m
dataset-research.gov

Summaries of research projects funded by the US National Science Foundation and NASA between 2007 and 2015, as available from research.gov.

Documents: 128k · Disk space: 1.5 GB · Indexing time: 2m
dataset-stackexchange

Content of the selected StackExchange QA site. By default, content of the superuser.com site will be used.

You can pass the -Dstackexchange.site=<site> property to choose a different StackExchange site to process. Depending on your interests, you can try, for example, the scifi or outdoors sites.

You can also see the full list of available sites in XML format (where the TinyName attribute of each record is the value to pass to the stackexchange.site property) or a more human-friendly list of archived site dumps, noting that the document source automatically truncates the stackexchange.com.7z suffix (to fetch outdoors.stackexchange.com.7z you should pass -Dstackexchange.site=outdoors).

Documents: 240k · Disk space: 3 GB · Indexing time: 3m

Indexing JSON data

The JSON document source reads source documents from an array of JSON objects (key-value pairs). An example descriptor based on that document source is located in L4G_HOME/datasets/dataset-json.

To index your custom data using the JSON document source:

  1. Convert your data to a JSON file. The structure of the JSON file must be the following:

    • The top-level element must be an array of objects representing individual source documents.
    • Each document object must be a flat collection of key-value pairs, where each key represents a field name and the value represents the field's value.
    • Field names are arbitrary; you will reference them in various parts of the project descriptor.
    • Field values must be strings, numbers or arrays of those types. An array denotes a multi-value field.

    The remaining part of this section assumes the following JSON file contents:

    [
      {
        "title": "Title of document 1",
        "created": "2009-07-15",
        "score": 195,
        "notes": [
          "multi-valued field value 1",
          "multi-valued field value 2"
        ],
        "tags": [ "tag1", "tag2" ]
      },
    
      {
        "title": "Title of document 2",
        "created": "2010-06-10",
        "score": 20,
        "notes": "single value here",
        "tags": "tag3"
      }
    ]

    A larger example is available in L4G_HOME/datasets/dataset-json/data/sample-input.json.

  2. Modify the project descriptor that comes with the example to reference the document fields present in your JSON file. The following sections list the required changes.

    1. Point at the custom JSON file:

      "source":  {
        "feed":  {
          "type":  "com.carrotsearch.lingo4g.datasets.json.JsonDocumentSourceFragment",
          // Custom JSON file here (path is project-relative).
          "jsonFile":  "data/custom-data.json"
        }
      }
      
    2. Declare how fields of your documents should be processed by Lingo4G. Refer to project descriptor's fields section for a detailed specification of field types.

      // Declare your fields.
      "fields": {
        "title":    { "analyzer": "english" },
        "notes":    { "analyzer": "english" },
      
        // Convert date to a different format on import.
        "created":  { "type": "date", "inputFormat": "yyyy-MM-dd",
                                      "indexFormat": "yyyy/MM/dd" },
      
        "score":    { "type": "integer" },
        "tags":     { "type": "keyword" }
      }
      
    3. Declare feature extractors and fields they should be applied to. Typically, you will include all fields with the english analyzer in both the sourceFields and targetFields arrays below.

      // Declare feature extractors and fields they should be applied to.
      "features": [
        {
          "type": "phrases",
          "key": "phrases",
          "sourceFields": [ "title", "notes" ],
          "targetFields": [ "title", "notes" ],
          "maxTermLength": 200,
          "minTermDf": 10,
          "maxPhraseTermCount": 5,
          "minPhraseDf": 10
        }
      ]
      
    4. Declare additional information for the automatic stop label extractor. If there are any clear overlapping or non-overlapping document categories in your data (defined by such fields as tags, category, division), the extractor can make more intelligent choices. In our case, we'll use the tags field for this purpose.

      // Declare hints for stop label extractor.
      "stopLabelExtractor": {
        "categoryFields": [ "tags" ],
        "featureFields": [ "title$phrases" ],
        "partitionQueryMaxRelativeDf": 0.05,
        "maxPartitionQueries": 500,
        "accuracy": 0.5
      }
      
    5. Modify the settings of the query parser to declare which fields to search when a scope query is typed without an explicit field prefix.

      "queryParsers": [
        {
          "type": "standard",
          "key": "standard",
          // Declare the default set of fields to search
          "defaultFields": [
            "title",
            "notes"
          ]
        }
      ]
      
    6. Finally, tweak the fields used by default for analysis and document content output.

      "source": {
        // Provide fields to analyze (note feature extractor's suffix).
        "fields": [
          { "name": "title$phrases" },
          { "name": "notes$phrases" }
        ]
      }
      
      "documents": {
        "enabled": false,
        "onlyWithLabels": true,
        "content": {
          "enabled": true,
          "fields": [
            // Write back these fields for each document.
            { "name": "title" },
            { "name": "notes" }
          ]
        }
      }
      

Once the project descriptor and JSON data are assembled, the project is ready for indexing and analysis.
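
For the JSON example project, the complete cycle then boils down to the commands already introduced in the quick start:

l4g index  -p datasets/dataset-json
l4g server -p datasets/dataset-json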

Indexing PDF/Word/HTML files

The L4G_HOME/datasets/dataset-autoindex example contains an implementation of a document source that processes files in Microsoft Word, PDF and other file formats and extracts their title and textual content. The default project descriptor declares the following fields:

"fields": {
  "fileName":    { "analyzer": "literal" },
  "contentType": { "analyzer": "literal" },
  "title":       { "analyzer": "english" },
  "content":     { "analyzer": "english" }
}

The fileName field is the last path segment of the indexed file, contentType is the auto-detected MIME content type of the file, and title and content are plain-text fields extracted from the file using Apache Tika.

To quickly start experimenting with Lingo4G and index your files using this document source:

  1. Copy all files that should be indexed to a single folder (or a tree of subfolders). The document source will scan and index all files in the given folder and its subfolders. Note that Apache Tika may not support all types of content (for example encrypted PDFs or ancient Word formats). In general, however, PDFs, Word files, OpenOffice documents, HTML and plain text files are processed just fine.

  2. Index your data. Note the source folder is passed as a system property in the command line below.

    l4g index  -p datasets/dataset-autoindex -Dautoindex.dir=[absolute folder path]

    In case certain files cannot be processed, a warning will be logged to the console.

  3. Start the Explorer.

    l4g server -p datasets/dataset-autoindex
  4. Note about automatic stopword detection

    Because automatic text extraction only recognizes the title and content of a document, the options for automatic discovery of stopwords are limited. Edit the label dictionaries to refine indexing and analysis; expect this to be an iterative improvement process.

Custom document source

For complete control over the way your data is delivered for Lingo4G indexing, you will need to write a custom document source in Java. The easiest route is to take the source code of the JSON document source as a starting point and modify it to suit your needs.

One possible workflow of Lingo4G document source development is the following:

  1. Set up the source code provided in the src folder of the Lingo4G distribution in your Java IDE. The source code uses Maven for dependency management; no major IDE should have problems opening it.
  2. Set up a run configuration in your IDE to contain in its classpath:

    • the JSON document source, contained in the datasets/dataset-json Maven project,
    • the lingo4g-cli-commands-x.x.x.jar.
  3. Modify the source code of the JSON document source to suit your needs. Typically you'll modify the code to fetch data from a different data store (local file in a custom format, Lucene index, SQL database).
  4. Modify the project descriptor to match the fields emitted by your modified document source. See the indexing JSON data section for the typical modifications to make.
  5. Run Lingo4G indexing directly from your IDE to see how your custom document source performs, fix bugs, if any.
  6. Once the code of your custom document source is ready, you can use Maven to build a complete data set package to be installed in your production Lingo4G instance.


Indexing

Before you can analyze your data, you need to index it. To perform the indexing, run the index command providing a path to your project descriptor JSON in the -p parameter:

l4g index -p <project-descriptor-JSON-path>

You can customize certain aspects of indexing by providing additional parameters for the index command and editing the project descriptor file.

Analysis

Once your data is indexed, you can analyze the indexed documents. You can run the analysis in an interactive way using the Lingo4G Explorer application. Alternatively, you can use the analyze command from Lingo4G command-line tool. The following sections show typical clustering invocations.

Running analysis in Lingo4G Explorer

To use Lingo4G Explorer, start Lingo4G REST API:

l4g server -p <project-descriptor-JSON-path>

Once the server starts up, open http://localhost:8080/apps/explorer in a modern browser.

You can use the Query text box to select the documents for analysis. Please see the overview of analysis scope for some example queries. The Lingo4G Explorer section contains a detailed overview of the application.

Running analysis from command line

You can use the l4g analyze command to invoke analysis and save the results to a JSON, XML or Excel file. The following sections show some typical invocations.

Analyzing all indexed documents

To analyze all documents contained in the index, run:

l4g analyze -p <project-descriptor-JSON-path>

By default, the results will be saved in the results directory. You can change the location of the results using the -o option.
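
For example, to analyze the StackExchange example project and write the results to a custom directory (the output path below is illustrative):

l4g analyze -p datasets/dataset-stackexchange -o results/superuser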

Analyzing a subset of indexed documents

You can use the -s option to provide a query that will select a subset of documents for analysis. The query must follow the syntax of the Lucene query parser configured in the project descriptor. The examples below show a number of queries on the StackExchange Super User collection (using the default query parser); the -p parameter is omitted for brevity.

  • Analyzing all documents tagged with the osx label.
    l4g analyze -s "tags:osx"
  • Analyzing all documents whose creation date begins with 2015.
    l4g analyze -s "created:2015*"
  • Analyzing all documents containing Windows 10 or Windows 8 in their titles. Please note that the quotes around each search term need to be escaped (preceded with the \ character).
    l4g analyze -s "title:\"windows 10\" OR title:\"windows 8\""
  • Selecting documents for analysis by identifiers.

    If your documents have identifiers, such as the id field in the StackExchange collection, you can select for analysis a set of documents with the specified identifiers.

    For highest performance of id-based selection, use the following procedure:

    1. Edit scope.type in the analysis section of your project descriptor JSON to change the type to byFieldValues (and remove the other properties of that section):

      "scope": {
        "type": "byFieldValues"
      }
    2. Pass the field name and the list of values to match to the -s option in the following format:

      <field-name>=<value1>,<value2>,...

      For example:

      l4g analyze -s "id=25539,125543,54724,125545"

    Since in most practical cases the list of field values will be too long for the command interpreter to handle, you will need to invoke Lingo4G with all parameter values provided in a file.

    Note for the curious

    The by-document-id selection could also be made using the standard Lucene query syntax:

    l4g analyze -s "id:125539 OR id:125543 OR id:54724 OR id:125545"

    In real-world scenarios, however, the number of documents to select by identifier will easily reach thousands or tens of thousands. In such cases, parsing the standard query syntax shown above may take longer than the actual clustering process. For long lists of field values it is therefore best to use the dedicated byFieldValues scope type outlined above.

Changing analysis parameters

You can change some of the clustering parameters using command line parameters. Fine-tuning of analysis parameters is possible by overriding or editing the project descriptor file.

  • Changing the number of labels. You can change the number of labels Lingo4G will select using the -m command line parameter:
    l4g analyze -m 1000
  • Changing the feature fields used for analysis. By default, Lingo4G will analyze the list of fields defined in the project descriptor's labels.source.fields property. To apply clustering to a different set of feature fields, you can either edit that property in your project descriptor or pass a space-separated list of fields to the --feature-fields option.

    To apply clustering only to the title field of the StackExchange data set you can run:

    l4g analyze --feature-fields title$phrases

    You may have to add quotes around title$phrases on shells where $ is a variable-substitution character.

  • Preventing one-word labels. To prevent one-word labels, you can provide an override project descriptor JSON using the -j parameter:

    l4g analyze -j "{ labels: { surface: { minLabelTokens: 2 } } }"

Saving analysis results in different formats

Currently, Lingo4G can save the analysis results in XML, JSON and Excel XML formats. To change the format of the results, open the project descriptor file and change the format property contained in the output section. The allowed values are xml, json and excel.

Alternatively, you can provide an override project descriptor JSON using the -j parameter and set the desired output format:

l4g analyze -j "{ output: { format: \"excel\" } }"

Finally, Lingo4G Explorer can export the results in the same formats.

Advanced usage

Feature extractors

Feature extractors provide the key ingredient used for analysis in Lingo4G — the features used to describe each document. During indexing, features are stored together with the content of each document and are processed later when analytical queries are issued to the system.

Features are typically computed directly from the content of input documents, so that new, previously unknown features can be discovered automatically. For certain applications, a fixed set of features may be desirable, for example when the set of features must be aligned with a preexisting ontology or fixed vocabulary. Lingo4G comes with feature extractors covering both of these scenarios.

Features

Each occurrence of a feature contains the following elements:

label

Visual, human-friendly representation of the feature. Typically, the label will be a short text: a word or a short phrase. Lingo4G uses feature labels as identifiers, so features with exactly the same label are considered identical.

occurrence context

All occurrences of a feature always point at some fragment of a source document's text. The text the feature points to may contain the exact label of the feature, its synonym or even some other content (for example, an acronym 2HCl for the full label histamine dihydrochloride).

The relationship between features, their labels and where they occur in documents is governed by a particular feature extractor contributing the feature.

Frequent phrase extractor

This feature extractor:

  • automatically discovers and indexes terms and phrases that occur frequently in input documents,
  • can normalize minor differences in the appearance of the surface form of a phrase, picking the most frequent variant as the feature's label, for example: web page, web pages, webpage or web-page would all be normalized into a single feature.

Internally, terms and phrases (n-grams of terms) that occur in input documents are collected and counted. A term or phrase is counted only once per document, regardless of how many times it is repeated within that document. A term is promoted to a feature only if it occurred in more than minTermDf documents. Similarly, a phrase is promoted to a feature only if it occurred in more than minPhraseDf documents.

Note that terms and phrases can overlap or be a subset of one another. The extractor will thus create many redundant features — these are later eliminated by the clustering algorithm. For example, for a sentence like this one:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

all of the following features could be discovered and indexed independently (the whole input is repeated for clarity; in this plain-text rendering, each feature is marked with brackets rather than underlined, and the exact set of features is illustrative):

[Lorem Ipsum] is simply dummy text of the printing and typesetting industry.
Lorem Ipsum is simply [dummy text] of the printing and typesetting industry.
Lorem Ipsum is simply dummy [text] of the printing and typesetting industry.
Lorem Ipsum is simply dummy text of the [printing] and typesetting industry.
Lorem Ipsum is simply dummy text of the printing and [typesetting industry].

Important configuration settings

Cutoff thresholds minTermDf and minPhraseDf should be set with care. Values that are too low may result in a proliferation of noisy phrases that reflect structural properties of the language rather than entities or strong stimuli that could give rise to potential clusters. Values that are too large may quietly omit valuable phrases from the index and, in the end, from clustering.
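
A minimal sketch of where these thresholds could sit in the project descriptor is shown below. The minTermDf and minPhraseDf attribute names come from this section, while the surrounding features.phrases nesting is only inferred from the dotted option paths used elsewhere in this document, so verify the exact structure against the extractor's configuration section:

"features": {
  "phrases": {
    "minTermDf": 10,
    "minPhraseDf": 20
  }
}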

See the extractor's configuration section for more information.

Dictionary extractor

This feature extractor annotates phrases or terms from a fixed, predefined dictionary provided by the user. This can be useful when the set of features (cluster labels) should be limited to a specific vocabulary or ontology of terms. Another practical use case is indexing mentions of geographical locations, places or people.

The dictionary extractor requires a JSON file listing features (and their variants) that should be annotated in the input documents. Multiple such files can be provided via the features.dictionary.labels attribute in the extractor's configuration section.
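
For example, the declaration might look like the following sketch, where animals.json is a hypothetical file name and the exact nesting should be verified against the extractor's configuration section:

"features": {
  "dictionary": {
    "labels": [ "animals.json" ]
  }
}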

An example content of the dictionary file is shown below.

[
  { 
    "label": "Animals", 
    "match": [ 
      "hound", 
      "dog", 
      "fox", 
      "foxy"
    ]
  },
  { 
    "label": "Foxes",
    "match": [ 
      "fox", 
      "foxy", 
      "furry foxy"
    ]
  }
]

Given this dictionary, the english analyzer and the following input:

The quick brown fox jumps over the lazy dog.

The following fragments, marked here with brackets, would be indexed as Animals:

The quick brown [fox] jumps over the lazy [dog].

Additionally, this fragment, marked with brackets, would be indexed as Foxes:

The quick brown [fox] jumps over the lazy dog.

Note that:

  • Each feature must have a non-empty and unique visual label. This label will be used to represent the feature in clustering results.
  • A single feature may contain a number of different matching variants. These variants can be terms or phrases.
  • If two or more features contain the same matching string (as is the case with fox and foxy in the example above), all those features will be indexed at the position the string occurs in the input.

Important

The text of input documents is processed according to the featureAnalyzer specification given in the declaration of indexed fields. When a dictionary extractor is applied to a field, its matching strings are also preprocessed with the same analyzer as the field the extractor is applied to — the resulting sequence of tokens is then matched against the token sequence produced for documents in the input.

Thus, analyzers that normalize case and stem the input will typically not require all spelling or uppercase-lowercase variants of a given label — a single declaration of the base form will be sufficient. For analyzers that preserve letter case and surface forms, all potential spelling variants of a given matching string must be enumerated.

See the extractor's configuration section for more information.

Dictionaries

It often happens that you would like to exclude certain non-informative labels from analysis. This is the typical use case of the dictionary data structure discussed in this section.

The task of a dictionary is to answer the question Does this specific string exist in the dictionary? Details of the string matching algorithm, such as case-sensitivity or allowed wildcard characters, depend on the type of the dictionary. Currently, two dictionary types are implemented in Lingo4G: one based on word matching and another one using regular expression matching.

Depending on its location in the project descriptor, a dictionary will follow one of two life cycles:

static

Dictionaries declared in the dictionaries section are parsed once during the initialization of Lingo4G. Changes to the definition of the static dictionaries are reflected only on the next initialization of Lingo4G, for example after the restart of Lingo4G REST API server.

Once the static dictionaries are declared, you can reference them in the analysis options. Typically, you will use the analysis.surface.exclude option to remove from analysis all labels contained in the provided dictionaries.

Note that you can declare any number of static dictionaries. For example, instead of one large dictionary of stop labels you may have one dictionary of generic meaningless phrases (such as common verbs and prepositions) along with a set of domain-specific stop label dictionaries. In this arrangement, the users will be able to selectively apply static dictionaries at analysis time.

ad-hoc

Dictionaries declared outside of the dictionaries section, for example in the analysis.surface.exclude option, are parsed on-demand. Therefore, any new definitions of the ad-hoc dictionaries, provided, for example, in the REST API request, will be applied only for that specific request.

The typical use case of ad-hoc dictionaries is to allow the users of your Lingo4G-based application to submit their own lists of excluded labels.

See the documentation of the dictionaries section for in-depth description of the available dictionary types. The documentation of the analysis.surface.exclude option shows how to reference static dictionaries and declare ad-hoc dictionaries.
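
The snippet below is a purely illustrative sketch of the ad-hoc case. It assumes that a dictionary declaration can be passed inline inside the analysis.surface.exclude array and that the simple dictionary type accepts an inline list of entries; both layouts are assumptions, so consult the dictionaries and analysis.surface.exclude documentation for the actual attribute names:

l4g analyze -j "{ surface: { exclude: [ { simple: { entries: [ \"click here\", \"read more\" ] } } ] } }"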

FAQ

Licensing

I hold a Lingo3G license, will I receive Lingo4G as an upgrade?

No. Lingo3G and Lingo4G are separate products we intend to offer and maintain independently. Lingo3G will remain an engine for real-time clustering of small and medium collections, while Lingo4G will address clustering of large data sets. Therefore, Lingo4G is not an upgrade to Lingo3G, but a complementary offering.

If you would like to switch from Lingo3G to Lingo4G, we offer a license trade-in option and count the initial Lingo3G license purchase fee towards the Lingo4G license fee. After you trade-in your Lingo3G license, you will not be able to use it any more. If you'd like to trade-in your Lingo3G license, please get in touch for more details.

What kind of limits can my Lingo4G license include?

Depending on your Lingo4G edition, your license file may include two limits:

  • Maximum total size of indexed documents, defined by the max-indexed-content-length attribute of your license file. The limit restricts the maximum total size of the text declared to be analyzed by Lingo4G. Text stored in the index only for literal retrieval is not counted towards the limit.

    In more technical terms:

    • The limit is applied to the content of fields processed by the feature extractors. Subject to limiting are the fields passed in the phrases.targetFields or dictionary.targetFields options. The contents of each field are counted towards the limit only once, even if the field is processed by multiple feature extractors.
    • The length of each field is computed as the number of Unicode code points. Therefore, each character is counted as one unit, even if the encoded representation of the character spans multiple bytes.
    • After the total indexed size limit is exceeded, contents of further documents returned by the document source will be ignored.
  • Maximum number of documents analyzed in one request, defined by the max-documents-in-scope attribute of your license file. The limit restricts the number of documents in analysis scope. If the number of documents matching the scope query exceeds the limit, Lingo4G will ignore the lowest-scoring documents.

The above limits are enforced for each Lingo4G instance / project separately.

Is the total number of documents in the index limited?

No. Regardless of your Lingo4G edition, there will be no license-enforced limits on the total number of documents in Lingo4G index.

How many projects / instances of Lingo4G can I run on the same server?

There are no restrictions on the number of Lingo4G instances running on one physical or virtual server. The only limit may be the capacity of the server, including RAM size, disk space and the number of CPUs.

Indexing

Can I add new documents to an existing Lingo4G index?

Not at the moment. If you'd like to make Lingo4G aware of new documents, you need to make sure the document source returns both the existing and the new documents, and then re-index the whole collection.

Which languages does Lingo4G support?

Currently, Lingo4G can only process English text. If you'd like to apply Lingo4G to content written in a different language, please contact us.

What is the maximum size of the project Lingo4G can handle?

The early adopters of Lingo4G have been successfully using it with collections of millions of documents spanning over 50 GB of text. If your collection is larger than that, please do get in touch for an evaluation license to see if Lingo4G can handle your data.

One important factor to consider is that currently Lingo4G does not offer distributed processing. This means that the maximum reasonable size of the project will be limited by the amount of RAM, disk space and processing power available on a single virtual or physical server.

Analysis tuning

You can influence the process and outcomes of Lingo4G analysis through the parameters in the analysis section. Below are answers to typical analysis tuning questions.

How can I increase the number of labels selected for analysis?

  1. Increase maxLabels to the desired number. If fewer labels than requested are still selected, try the following changes (see the combined example after this list).
  2. Lower minRelativeDf, possibly to 0.
  3. Lower minWordCount and minWordCharacterCountAverage. You may also need to increase preferredWordCountDeviation to allow a wider spectrum of label lengths.
  4. Lower minAbsoluteDf, possibly to 0. Please note though that allowing labels that occur only in one in-scope document may bring in a lot of noise to the result.
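
For example, changes 1, 2 and 4 could be applied together in a single command-line override (the labels.frequencies paths follow the -j examples used elsewhere in this document):

l4g analyze -m 2000 -j "{ labels: { frequencies: { minAbsoluteDf: 0, minRelativeDf: 0 } } }"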

How to prevent meaningless labels from being selected for analysis?

There are two general ways of removing unwanted labels:

  1. Allow Lingo4G to remove a larger portion of the automatically extracted stop labels. To do this, increase autoStopLabelRemovalStrength and possibly decrease autoStopLabelMinCoverage.

    Note that this method will remove large groups of labels, possibly also those that your users may find useful.

  2. Add the specific labels to the label exclusions dictionary.

How to increase the number of documents covered by labels?

  1. Set Lingo4G to select more labels for analysis.

  2. Alternatively, set Lingo4G to prefer higher-frequency labels: lower preferredWordCount, increase maxRelativeDf, increase singleWordLabelWeightMultiplier.

How to increase the number of label clusters?

The easiest way to increase the number of label clusters (and therefore decrease their size) is to change the similarityWeighting to LOVINGER, DICE or BB. Use the Experiments feature of Lingo4G to try out the impact of weighting schemes on the clusters.

How to increase the size of label clusters?

  1. Lower inputPreference, possibly all the way down to -1.
  2. For further cluster size increases, consider setting similarityWeighting to CONTEXT_RR, bearing in mind that this may produce meaningless clusters if many low-frequency labels are selected for the analysis.

You can also use the Experiments feature of Lingo4G to try out the impact of weighting schemes on the size of clusters.

How to increase the number of document clusters?

There are two independent ways to increase the number of document clusters (and therefore decrease their size):

  1. Increase inputPreference, possibly up to 0.
  2. Decrease maxSimilarDocuments.

How to increase the size of document clusters?

There are two independent ways to increase the size of document clusters:

  1. Decrease inputPreference, possibly down to a large negative value, such as -10000. For further increase of document cluster size, see below.
  2. Further increase of cluster size is possible by making the document relationship matrix more dense. You can achieve this by increasing maxSimilarDocuments, bearing in mind that this will significantly increase the processing time.

Lingo4G Explorer

Lingo4G Explorer is a browser-based application you can use to experiment with and tune Lingo4G clustering. Lingo4G Explorer is one of the example applications included in Lingo4G REST API.

Getting started

To launch Lingo4G Explorer:

  1. Start Lingo4G REST API for the project you would like to explore:
    l4g server -p <project-descriptor-JSON-path>
  2. Point your browser to http://localhost:8080/apps/explorer. Lingo4G Explorer requires a modern browser, such as a recent version of Chrome, Firefox, Internet Explorer 11 or Edge.

Once Lingo4G Explorer loads, you will be able to initiate the analysis by pressing the Analyze button. Once the analysis is complete, you will see the main screen of the application.

Parameters view

You can use the parameters view to alter parameters and trigger new analyses:

Analyze
Triggers a new analysis using the current parameter values. If you change the value of any parameter, you must press the Analyze button to apply the changes.
Collapses all expanded parameter sections.
Defaults
Resets all parameter values to their default values.
JSON

Opens a dialog showing all parameters in the JSON format ready for pasting into your project descriptor, command line or REST API invocation.

Only include options different from defaults
If unchecked, all options, including the defaults, will be included in the JSON export.
For pasting into command line
If checked, the JSON will be formatted in one line and escaped, so that it can be pasted directly into a l4g analyze command's -j option.
Copy
Copies the currently visible JSON export directly into clipboard.
Filters

Toggles parameter filters. Currently, parameters can be filtered by free text search over their names.

Analysis result view

The central part of Lingo4G Explorer is the analysis results view, for example with a label clusters treemap active.

The following statistical summaries, shown at the top of the screen, are common across all analysis results facets:

total time

The time spent on performing the analysis. Hover over the statistic to see a detailed breakdown.

docs analyzed
The number of documents in the current analysis scope.
labels
The number of labels selected to represent the documents in scope.
labeled docs
The percentage of documents that contain at least one of the selected labels. Generally, it is advisable to keep the coverage as high as possible, so that the analysis results represent as many of the in-scope documents as possible.

Note: The following sections concentrate on the user interface features of each analysis result facet view. Please see the conceptual overview section for a high-level discussion of Lingo4G analysis.

Labels

The labels list view shows a flat list of labels selected to represent the currently analyzed set of documents.

The number shown to the right of each label is the number of in-scope documents containing the label. Clicking on the label will show those documents in the document content view.

The following tools are available in the document label list view:

Copies the list of labels to the clipboard in CSV format. If the label list comparison view is enabled (see below), the copied list will also contain the comparison status for each label.
Compare

Shows differences between lists of labels belonging to two analysis results. You can use this tool to see, for example, which labels get added or removed as a result of changing label selection parameters.

When the label differences view is enabled, labels contained in the current result will be compared with a reference label list. The reference can either be the previous analysis result or a snapshot result you can capture by clicking the Use current result as snapshot link.

When comparing two lists of labels, labels appearing in both lists will be shown with a yellow icon. Labels appearing only in the current results will receive a green icon, labels appearing only in the reference result will have a red icon. You can click the Venn diagram in the compare tool to toggle the visibility of each of those classes of labels.

The common, added and removed status of each label will be included in the CSV export of the label list.

Configures the label list view. Currently, the maximum number of labels shown in the list can be configured. Please note that the CSV export of the label list will contain all labels regardless of the maximum number of labels configured to show.

Additional options are available in the context menu activated by right-clicking a label.

Add to excluded labels

Use this option to add the label to the dictionary of labels excluded during analysis. Two variants are available: excluding the exact form of a label or excluding all labels containing the selected label as a sub-phrase.

You can also add and edit existing dictionary entries in the Label exclusion patterns text area in the settings panel. For complete syntax of the dictionary entries, see the simple dictionary type documentation.

Note: The list of excluded labels you create in Lingo4G Explorer is remembered in your browser's local storage and sent to Lingo4G with each analysis request. The list is not saved in Lingo4G server, so it will be lost if you clear your browser's local storage. To make your exclusions lists persistent and visible for other Lingo4G users, move the entries to a dedicated static dictionary.

Themes and topics

The topics views show labels organized into larger structures: themes and topics. You can view textual and treemap-based presentations of the same data.

Topic list

The topic list view shows themes and topics in a textual form:

  • The light bulb icon indicates a topic, that is a group of related labels. The label printed in bold next to the icon is the topic's exemplar — the label that aims to describe all labels grouped in that topic.
  • The CAPITALIZED font indicates themes, that is groups of topics.

You can click on individual labels, topics and themes to view the documents associated with them.

The following tools are available in the topic list view:

Toggles the network view of the relationships between topics inside the selected theme.

Use the mouse wheel to zoom in and out; click and drag to pan the zoomed view. Click graph nodes to show documents containing the selected label.

Topic list settings. Use this tool to set the maximum number of topics per theme and the maximum number of labels per topic to display. If the theme or topic contains more members than the specified maximum, you can click the +N more link to show all members.

Tip: separate limits apply when the theme structure network is shown and hidden. When the theme structure view is enabled, the list of themes is presented in one narrow column, hence the individual labels are hidden in this case. You can change that in the settings dialog.

Topic treemap

Lingo4G Explorer can visualize themes and topics as a treemap. Top-level cells represent themes, their child groups represent topics. Children of the topic cell represent individual labels. Size and color of the cells can represent specific properties of themes, topics and labels.

The following tools are available in the topic treemap view:

Export the current treemap as a JPEG/PNG image.

Configuration of various properties of the treemap, such as group sizing and colors.

Cell sizing

Which property of the theme, topic and label to use to compute the size of the corresponding cell.

By similarity
Cell size is determined by the similarity between the label and its topic or the topic and its theme. For a theme, the average similarity of its topics is taken.
By label DF
Cell size is determined by the number of documents in which the associated label appears.
Size theme & topic cells by label count

When enabled, only the size of label cells will follow the value of the property selected in the Cell sizing option, while the size of topic and theme cells will be proportional to the sum of the sizes of their children.

When disabled, treemap cells at all levels will follow the value of the property selected in the Cell sizing option.

Treemap style
Determines the treemap style to use. Note that polygonal layouts may take significantly more time to render, especially when the analysis contains large numbers of labels.
Treemap layout

Determines the treemap layout to use.

Flattened
All treemap levels, that is themes, topics and labels, are visible at the same time. This layout demands more CPU time on the machine running the Lingo4G Explorer app.
Hierarchical
Initially, only themes are visible in the treemap. To browse topics, double-click the theme's cell; to browse labels, double-click the topic cell. This layout puts less stress on the CPU.
Show label DF in cells

When enabled, the number of documents in which the label appears will be shown in the corresponding cell.

The Theme color, Topic color and Label color options control how the color of the corresponding cells is computed. Currently, the color scale is fixed and ranges from blue for the lowest values, through light yellow for medium values, to red for the largest values.

The following cell coloring strategies are available:

none
The cell will be painted in grey.
from parent
The cell will use the same color as its parent. Not available for themes.
by label DF
The number of documents in which the label appears will determine the color.
by label DF (shade)
Same as "by label DF" but the lightness of the parent color will be varied instead of the color itself. Dark shades will represent low values, light shades will represent high values. Not available for themes.
by similarity
Similarity to the parent entity will determine the color. For themes, average similarity of the theme's topics will be used.
by similarity (shade)
Same as "by label similarity" but the lightness of the parent color will be varied instead of the color itself. Not available for themes.
by silhouette
Silhouette coefficient value will determine the color. High values (red) mean that the label very well matches its cluster, low values (blue) may indicate that the label would better match a different cluster.
by silhouette (shade)
Same as "by label similarity" but the lightness of the parent color will be varied instead of the color itself. Not available for themes.

You can use the Show up to inputs to determine how many themes, topics and labels should be shown in the visualization in total. Large numbers of labels in the visualization will make it more CPU-demanding. The statistics bar will indicate if any limits were applied and display the number of themes, topics and labels visible in the treemap.

Document clusters

The document clusters view shows documents organized into related groups. You can view the textual and treemap representation of document clusters.

Document clusters list

The document clusters list view shows document groups in a textual form:

  • The folder icon indicates one document cluster.
  • Each cluster is described by a list of labels that most frequently appear in the documents contained in the cluster. The number in parentheses shows how many of the cluster's documents contained that label.
  • Clicking on the cluster entry will load the cluster's documents in the document content view.

Document clusters treemap

Lingo4G can visualize document clusters as a treemap. Each document cluster is represented by one top-level treemap cell. Lower-level cells represent individual documents contained in the cluster. The landmark icon indicates the cluster's exemplar document. Coloring and sizing of document cells can depend on the configured field of the document.

Clicking on the document cluster cell will load the cluster's documents in the document content view. Clicking on the document cell will load the specific document.

To keep the treemap visualization responsive, the number of individual document cells is limited to the value configured in the view's settings. For example, of about 9k documents clustered, only 3k may have their representation in the treemap, as indicated by the 3k docs shown statistic.

The following tools are available in the document clusters treemap view:

Export the current treemap as a JPEG/PNG image.

Configuration of various properties of the treemap, such as layout or cell number limits.

Treemap style
Determines the treemap style to use. Note that polygonal layouts may take significantly more time to render, especially when the analysis contains large numbers of clustered documents.
Treemap layout

Determines the treemap layout to use.

Flattened
All treemap levels, that is document clusters and individual document cells, are visible at the same time. This layout demands more CPU time on the machine running the Lingo4G Explorer app.
Hierarchical
Initially only document cluster cells are visible in the treemap. To browse individual documents, double-click the cluster's cell. This layout puts less demand on the CPU.
Color by

Determines the document field that Lingo4G Explorer will use to assign colors to document cells. Color configuration consists of three select boxes:

Field choice
Lists all available document fields you can choose for coloring. Two additional choices are: <none> for coloring all cells in grey and <similarity> for coloring based on the document's similarity to the cluster exemplar.
Transforming function
The transformation function to apply to numeric values before computing colors. Such transformation may be useful to "even out" very large or very small outlier values.
Color palette

The color palette to use:

auto
Automatic palette, diverging for numeric and date fields, hash for other types.
sequential
Colors are taken from a yellow-to-red palette, where yellow represents smallest values and red represents largest values.
diverging
Colors are taken from a blue-to-red palette, where blue represents smallest values and red represents largest values.
hash
Colors are computed based on a hash code of the field value. This palette will always generate the same color for the same field value. Hash palette is useful for enumeration type of fields, such as country or division.
Size by

Determines the document field to use to compute the size of document cells. Sizing configuration consists of two select boxes:

Field choice
Lists all available document fields you can choose for sizing. Two additional choices are: <none> for same-size cells and <similarity> for sizing based on the document's similarity to the cluster exemplar.
Transforming function
The transformation function to apply to numeric values before computing sizes. Such transformation may be useful to "even out" very large or very small outlier values.
Hide zero-sized
If checked, groups with zero size will be hidden from the treemap. Zero-sized groups will most often be a result of empty values of the document field used for sizing. Note that the numbers of documents and label occurrences displayed in the label will always refer to the whole cluster, regardless of whether some documents are hidden from view.
Label by

Determines the document field to display in document cells. Apart from document field names the additional choices are: <none> for no text in document cells, <coloring field> to display the coloring field value, <sizing field> to display the sizing field value and <similarity> to display the document's similarity to the cluster exemplar.

Highlight

Enables highlighting of same-color or same-label cells. When enabled, cells with the same color or same label as the selected cell will be highlighted.

You can use the Show up to ... input boxes to limit the number of document clusters and individual documents represented in the visualization. Large numbers of documents in the visualization will make it more CPU-demanding. The statistics bar will indicate if any limits were applied.

Document content view

The document content view shows the text of the top documents matching the currently selected label, theme, topic or document cluster. Along with the contents of the document, Lingo4G Explorer will display which of the labels selected for analysis occur in the document.

The document content view has the following tools:

Fields

Configuration of which document fields to show. For each field, you can choose one of the following display modes:

show as title
Contents of the field will be shown in bold at the top of the document. Use this mode for short document fields, such as the title.
show as subtitle
Contents of the field will be shown in regular font below the title. Use this mode for fields representing additional short information, such as authors of a paper.
show as body
Contents of the field will be shown below subtitle. Use this mode for fields representing the document body.
show as tag
Contents of the field will be shown below document body, prefixed with the field name. Use this mode for short synthetic fields, such as document id, creation date or user-generated tags.
don't show
Contents of the field will not be shown at all.

Additionally, you can determine how much content should be shown:

Show up to N values per field
For multi-value fields, such as user-generated tags, determines the maximum number of values to show.
Show up to M chars per field value
Determines the maximum number of characters to fetch per each field value. This setting prevents displaying the entire contents of very long documents.

You can also choose how to highlight scope query and selected labels:

Highlight scope query
When checked, Lingo4G Explorer will highlight matches of scope query in the documents.
Highlight labels
When checked, Lingo4G Explorer will highlight occurrences of labels selected in the label list, topic list and topic treemap views.

Configures how to load the documents to display:

Load up to N documents
Sets the maximum number of documents to load. Currently, Lingo4G Explorer does not support paging when browsing lists of documents.
Show up to M labels per document
Determines the number of labels to display for each document. Lingo4G Explorer will display the labels in the order of decreasing number of occurrences in the document.
Show only labels with P or more occurrences per document
Only labels with the specified minimum number of occurrences per document will be shown. You can use this option to filter out rarely-occurring labels.

Results export

You can use the analysis results export tool to save the current analysis results as Excel, XML or JSON file. You can also use the tool to get a curl command invocation that would fetch the result directly from Lingo4G REST API. To open the results export tool, click the Export link located at the top-right corner of the application window.

Lingo4G Explorer results export tool

The following export settings are available:

Format
The format of the export file. Currently the Excel, XML and JSON formats are available.
Include themes and topics
Check to generate and include in the export file the list of themes and topics.
Include document clusters
Check to generate and include in the export file the list of document clusters.
Include document content
Check to include the content of selected document fields in the export file. You can configure the list of fields to include using the Choose document fields to include list.
Include document labels
Check to include for each document the list of labels contained in that document.
Include documents without labels
Check to include documents that did not contain any of the labels selected for analysis.

Click the Export button to initiate the export file download. Please note that for large export files it may take several seconds for the download to begin. Click the Copy as curl button to copy to clipboard the curl command invocation that will fetch the configured export content directly from Lingo4G REST API.

Parameter experiments

You can use parameter experiments tool to observe how certain properties of analysis results change depending on parameter values. For example, you can observe how the number of label clusters depends on the input preference and softening parameters.

To run an experiment, use the controls located on the right to configure the independent and dependent variables and press the Run button.

The following experiment configuration options are available:

X axis
Choice of the primary independent variable. Depending on the type of the variable, you will be able to specify the range of values to use during experiments.
X cat
Choice of the secondary independent variable. If some variable is selected, an independent chart will be generated for each value of the secondary independent variable.
Series
Choice of the series variable. For each value of the selected variable a separate series will be computed and presented on the chart.
Threads
The number of parallel REST API invocations to allow when running the experiment.
Run
Click to start the experiment, click again to stop computation. Please note that the experiments tool will take the Cartesian product of the ranges configured on the X axis, X cat and Series. Depending on the configuration, this may lead to a large number of analyses to perform. Please check the hint next to the Run button for the number of analyses that will need to be performed.
Y axis

Choice of the dependent variable. The selected property will be drawn on the chart.

The following results properties are available:

Theme count
The number of top-level themes in the result, excluding the "Unclustered" theme.
Theme size average
The total number of labels assigned to topics divided by the number of top-level themes.
Topic count
The total number of topics, excluding the "Unclustered" topic.
Topic size average
The total number of labels assigned to topics divided by the number of topics.
Topic size sd/avg
The standard deviation of the number of labels per topic divided by the average number of labels per topic. Low values of this property mean that all topics contain similar numbers of labels; higher values mean that the result contains size-imbalanced topics.
Multi-topic theme %
The number of themes containing more than one topic divided by the total number of themes. Indicates how "rich" the structure of themes is.
Topics per theme average
The total number of topics divided by the total number of themes. Indicates how "rich" the internal structure of themes is.
Coverage
The number of labels assigned to topics divided by the total number of labels. Low coverage means many unclustered labels.
Topic label word count average
The average number of words in the labels used for describing topics.
Topic label DF average
The average document frequency of the labels used for describing topics.
Topic label DF sd/avg
The standard deviation of the topic label document frequency divided by the average topic label document frequency.
Topic label stability
How many common topic labels there are compared to the last result active in the main application window. A value of 1.0 means the sets of topic labels are identical, a value of 0.0 means no common labels. Technically, this value is computed as 2 * common-labels / (current-topic-count + main-application-topic-count).
Silhouette average
The average value of the Silhouette coefficient calculated for each label in the result. The Silhouette average shows how well topics are separated from each other. The lower the value, the worse the separation.
Net similarity
The sum of similarities between topic member labels and the corresponding topic description labels. Unclustered labels are excluded from net similarity calculation.
Pruning gain
How much of the original similarity matrix could be pruned without affecting the final label clustering result.
Iterations
The number of iterations the clustering algorithm required for convergence.
Copy results as CSV
Click to copy to clipboard the results of the current experiments in CSV format.

Example usage

To observe, for example, how the number of topics generated by Lingo4G depends on the Input preference and Similarity weighting parameters:

  1. In the X axis drop down, choose Input preference
  2. In the Series drop down, choose Similarity weighting
  3. In the Y axis drop down, choose Topic count
  4. Click the Run button

Once the analyses complete, you will most likely see that negative Input preference values produce fewer clusters and increasing the preference value also increases the number of clusters. To further confirm this, choose Topic size average in the Y axis drop down to see that the number of labels per topic decreases as Input preference gets higher.

To further break down the results by, for example, the Softening parameter values, choose that parameter in the X cat drop down and press the Run button.

Ideas for experiments

Try the following experiments with your data. Note that your results will depend on your specific data set, scope query and other base parameters set in the main application window.

  • What impact does Input preference have on the number of unclustered labels?

    Choose Coverage for the Y axis to see what percentage of labels were assigned to topics.

  • What impact does Softening have on the structure of themes?

    The Topics per theme average property on the Y axis can show how "rich" the structure of themes is. Values larger than 1 will suggest the presence of theme-topic hierarchies, while values close to 1 will indicate flat one-topic themes.

  • Which Similarity weighting creates most size-balanced topics?

    To find out, put on the Y axis the Topic size sd/avg property, which is the standard deviation of the number of labels per topic divided by the average number of labels per topic. Low values of this property mean that all topics contain similar numbers of labels.

  • How stable are the topic labels with respect to different Similarity weighting schemes?

    Choose Similarity weighting on the X axis and Topic label stability for the Y axis. The topic label stability property indicates how many common topic labels there are compared to the last result active in the main application window. A value of 1.0 means the sets of topic labels are identical, a value of 0.0 means no common labels.

  • How to affect the length of labels Lingo4G chooses to describe themes and topics?

    Set Preference initializer scaling on the X axis and choose Preference initializer in the X cat drop down. Putting Topic label word count average on the Y axis will reveal the relationship. Try also graphing Coverage to see the cost of increasing theme and topic description length.

  • How well are topics separated?

    Put Similarity weighting on the X axis, choose Silhouette average on the Y axis. The Silhouette coefficient shows how well topics are separated from each other. The lower the value, the worse the separation. Due to the nature of label clustering, highly-separated clusters are hard to achieve. Increasing Input preference will usually increase separation at the cost of lowered coverage.

  • What impact does Softening have on how quickly the clustering algorithm converges? Choose Iterations on the Y axis to find out.

Tips and notes

  • Experiments limited to label clustering. Currently, the available result properties and independent variables concentrate on label clustering. Further releases will make it possible to experiment also with label selection and document clustering.
  • Base parameter values. Parameter changes defined in this dialog are applied as overrides over the current set of parameters defined in the main application window. Therefore, to change the value of some "base" parameter, such as scope query, close this dialog, modify the parameter in the main application window and invoke the experiments dialog again.
  • Y axis property changes. Changes of the property displayed on the Y axis are immediate; they do not require re-running the experiment.
  • You can click the icon in the top-right corner of the tool to view a help screen that repeats the information contained in this section. Pressing the Run button closes the help text to reveal the results chart.

Commands

The l4g (Linux/Mac OS) and l4g.cmd (Windows) scripts serve as the single entry point to all Lingo4G commands.

Note for Cygwin users

When running Lingo4G in Cygwin, use the l4g script (a Bash script). The Windows-specific l4g.cmd will leave stray processes running in the background when Ctrl-C is pressed in the terminal.

Running Lingo4G under MinGW or any other (non-Cygwin) POSIX shell under Windows is not officially supported.

l4g

Launch script for all Lingo4G commands. Usage:

l4g [options] command [command options]
options

The list of launcher options, optional.

--exit
Call System.exit() at the end of the command.
-h, --help
Display the list of available commands.
command
The command to run, required. See the rest of this chapter for the available commands and their options.
command options
The list of command-specific options, optional.

Tip: reading command parameters from file. If your invocation of the l4g script contains a long list of parameters, such as when selecting documents to cluster by identifier, you may need to put all your parameters in a file, one per line:

analyze
-p
datasets/dataset-ohsumed
-v
-s
id=101416,101417,101418,101419,10142,101420,101421,101422,101423,101424,101425,101426,101427,101428,101429,10143,101430,101431,101432,101433,101434,101435,101436,101437,101438,101439,10144,101440,101441,101442,101443,101444,101445,101446,...

and provide the file path to the l4g launcher script using the @ syntax:

l4g @parameters-file

l4g analyze

Performs analysis of the provided project's data. Usage:

l4g analyze [analysis options]

The following analysis options are supported:

-p, --project
Location of the project descriptor file, required.
-s, --select

A query that selects documents for analysis, optional. The syntax of the query depends on the analysis scope.type defined in the project descriptor.

  • For the byQuery scope type, Lingo4G will analyze all documents matching the provided query. The query must follow the syntax of the Lucene query parser configured in the project descriptor.

  • For the byFieldValues scope type, Lingo4G will select all documents whose specified field is equal to any of the provided values. The syntax in this case must be the following:

    <field-name>=<value1>,<value2>,...

The basic analysis section lists a number of example queries. If this parameter is not provided, the query specified in the project descriptor is used.

-m, --max-labels
The maximum number of labels to select, optional. If not provided, the default maximum number of labels defined in the project descriptor file will be assumed.
-ff, --feature-fields
The space-separated list of feature fields to use for analysis.
-j, --analysis-json-override

The JSON override to apply to the analysis section of the project descriptor. You can use this option to temporarily change certain analysis parameters from their default values. The provided string must be a valid JSON object following the syntax of the analysis section of the project descriptor. The override JSON may contain only those parameters you wish to override. Make sure you properly escape the double quote characters that are part of your JSON override value. An easy way to get the proper override JSON string is to use the Lingo4G Explorer JSON export option.

Some example JSON overrides:

l4g analyze -j "{ labels: { surface: { minLabelTokens: 2 } } }"
l4g analyze -j "{ labels: { frequencies: { minAbsoluteDf: 5 }, scorers: { idfScorerWeight: 0.4 } } }"
l4g analyze -j "{ output: { format: \"excel\" } }"
-o, --output

Target file name (or directory) to which analysis results should be saved, optional.

  • If the option is not provided, the results will be saved in the project's results directory. Previous result file will be overwritten.
  • If a path to an existing directory is provided, the result will be saved to that directory. Previous result file will be overwritten.
  • If a path to a file is provided, the result will be saved to that file, overwriting its previous content if the file exists. All parent directories of the provided file path must exist.
-v, --verbose
Output detailed logs, useful for problem solving.
-q, --quiet
Limit the amount of logging information.
-D

Sets a system property to the provided value. You can refer to such system properties in the project descriptor file.

Use JVM syntax to provide the values: -Dproperty=value, for example -Dwork.dir=/mnt/ssd/collection1/work.

l4g index

Performs indexing of the provided project's data. Usage:

l4g index [indexing options]

The following indexing options are supported:

-p, --project
Location of the project descriptor file, required.
-f, --force
Delete the contents of the existing non-empty index before re-indexing. Lingo4G requires an explicit confirmation before overwriting a non-empty index, so that you do not accidentally delete an index that may have taken hours to generate.
--max-docs N
If present, Lingo4G will index only the provided number of documents. If the document source returns more than N documents, the extra documents will be ignored.
-v, --verbose
Output detailed logs, useful for problem solving.
-q, --quiet
Limit the amount of logging information.
-D

Sets a system property to the provided value. You can refer to such system properties in the project descriptor file.

Use JVM syntax to provide the values: -Dproperty=value, for example -Dwork.dir=/mnt/ssd/collection1/work.
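
For example, the following invocation re-indexes the example data set from scratch, limiting indexing to the first 10000 documents (an arbitrary, illustrative limit):

l4g index -p datasets/dataset-ohsumed -f --max-docs 10000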

l4g server

Starts the Lingo4G REST API server. Usage:

l4g server [server options]

The following options are supported:

-p, --project
Location of the project descriptor file to expose in the REST API, required. Currently, one running instance of the REST API can expose one project.
-r, --port
The port number the server will bind to, 8080 by default. When port number 0 is provided, a free port will be assigned automatically.
-w, --web-server
Enables built-in web server serving content from L4G_HOME/web, enabled by default. Please take security into consideration when leaving this option enabled in production.
-d, --development-mode
Enables development mode, enabled by default. In development mode, Lingo4G REST API server will not lock the files served from the L4G_HOME/web, so that changes made to those files are visible without restarting the server.
--cors origin

Enables serving CORS headers, for the provided origin, disabled by default. If a non-empty origin value is provided, Lingo4G REST API will serve the following headers:

Access-Control-Allow-Origin: origin
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Content-Type, Origin, Accept
Access-Control-Expose-Headers: Location
Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS

Please take security into consideration when enabling this option in production.

--idle-time
Sets the default idle time on socket connections, in milliseconds. If synchronous requests for large results expire before the results are received, increasing the idle time with this option may solve the problem (alternatively, use the asynchronous API).
--so-linger-time
Sets socket lingering to the given number of milliseconds.
-v, --verbose
Output detailed logs, useful for problem solving.
-q, --quiet
Limit the amount of logging information.
-D

Sets a system property to the provided value. You can refer to such system properties in the project descriptor file.

Use JVM syntax to provide the values: -Dproperty=value, for example -Dwork.dir=/mnt/ssd/collection1/work.
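
For example, the following invocation starts the server on an automatically assigned port and serves CORS headers for a hypothetical web application origin:

l4g server -p datasets/dataset-ohsumed -r 0 --cors "http://localhost:3000"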

Heads up, public HTTP server!

Lingo4G's REST API starts and runs on top of an HTTP server. There is no way to configure limited access or HTTP authorization to this server — security should be ensured externally, for example by restricting public access to the HTTP port designated for Lingo4G on the machine.

The above remark is particularly important when l4g server is used together with the -w option, as the entire content of the L4G_HOME/web folder is then made publicly available.

l4g show

Shows the project descriptor JSON with all default and resolved values. You can use this command to

  • verify the syntax of a project descriptor file,
  • check if all variables are correctly resolved,
  • view all option values that apply to the project, including the default ones that were not explicitly defined in the project file.

Usage:

l4g show [show options]

The following options are supported:

-p, --project
Location of the project descriptor file to show, required.
-v, --verbose
Output detailed logs, useful for problem solving.
-q, --quiet
Limit the amount of logging information.
-D

Sets a system property to the provided value. You can refer to such system properties in the project descriptor file.

Use JVM syntax to provide the values: -Dproperty=value, for example -Dwork.dir=/mnt/ssd/collection1/work.

l4g stats

Shows basic statistics of the Lingo4G index associated with the provided project, including the size of the index, histograms of document lengths and term vector sizes, and a histogram of phrase frequencies.

l4g stats [stats options]

The following options are supported:

-p, --project
Location of the project descriptor file to generate the statistics for, required.
-a, --accuracy
Accuracy of document statistics fetching, optional, default: 0.1. You can increase the accuracy for more accurate but slower computation of document length and term vector size histogram estimates. Use the value of 1.0 for an accurate computation.
-tf, --text-fields
The list of fields to use when computing the document length histogram, optional, default: all available text fields. Computation of the document length histogram is disabled by default; use the --analyze-text-fields option to enable it.
--analyze-text-fields
When provided, the histogram of the lengths of raw document text will be computed.
-ff, --feature-fields
The list of feature fields to use when computing phrase frequency histogram, optional, default: all available feature fields.
-t, --threads
The number of threads to use for processing, optional, default: the number of CPU cores available.
-v, --verbose
Output detailed logs, useful for problem solving.
-q, --quiet
Limit the amount of logging information.
-D

Sets a system property to the provided value. You can refer to such system properties in the project descriptor file.

Use JVM syntax to provide the values: -Dproperty=value, for example -Dwork.dir=/mnt/ssd/collection1/work.
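
For example, the following invocation computes fully accurate statistics for the example data set, including the document length histogram:

l4g stats -p datasets/dataset-ohsumed -a 1.0 --analyze-text-fields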

l4g version

Prints Lingo4G version, revision and release date information.

REST API

You can use the Lingo4G REST API to invoke analyses and retrieve their results from your favorite programming language or directly from a browser. You can start the Lingo4G REST API server using the server command.

Overview

Lingo4G REST API follows typical patterns for remote HTTP services:

  • HTTP protocol is used for initiating analyses and retrieving their results.
  • JSON is the main data exchange format. Details of an individual analysis request can be specified by providing a JSON object that corresponds to the analysis section of the project descriptor. The provided JSON object needs to specify only those parameters for which you wish to use a non-default value. Analysis results are available in JSON and XML formats.
  • Asynchronous service pattern is available to handle long-running analysis requests and to monitor their progress.

Currently, Lingo4G REST API has the following limitations:

  • Only analysis capabilities are exposed in the REST API. Indexing can only be performed using the index command.
  • One instance of Lingo4G REST API can handle one project. To expose multiple projects through the REST API, start multiple REST API instances on different ports.
  • Lingo4G REST API does not offer any authentication or authorization layer. If such features are required, you need to build them into the applications and APIs that call Lingo4G, making sure that the Lingo4G REST API is available only to your application.

Example invocations

Lingo4G analysis is initiated by making a POST request at the /api/v1/analysis endpoint. The request body should contain a JSON object corresponding to the analysis section of the project descriptor. Since only non-default values are required, the provided object can be empty, in which case the analysis will be based entirely on the definition loaded from the project descriptor.

The following sections demonstrate how to invoke analysis in a synchronous and asynchronous mode. We omit non-essential headers for brevity. Please refer to the REST API reference for details about all endpoints and their parameters.

Synchronous invocation

POST /api/v1/analysis?async=false HTTP/1.1
Host: localhost:8080
Accept-Encoding: gzip, deflate

{ }

We pass an empty JSON object { } in request body, so the processing will be based entirely on the analysis parameters defined in the project descriptor.
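
An equivalent request could be issued from the command line with curl, for example (assuming the server runs on the default port 8080):

curl -X POST -H "Content-Type: application/json" -d "{ }" "http://localhost:8080/api/v1/analysis?async=false"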

HTTP/1.1 200 OK
Content-Type: application/json
Content-Encoding: gzip

{
  "labels": {
    "list": [ {
        "index": 16,
        "text": "New York"
      }, {
        "index": 253,
        "text": "young man"
      }, ...
    ]
  },

  ...

  "summary": {
    "elapsedMs": 6939,
    "candidateLabels": 8394
  },
  "scope": {
    "selector": "",
    "documentsInScope": 426281
  }
}

The request will block until the analysis is complete. The response will contain the analysis results in the requested format, JSON in this case.

It is not possible to monitor the progress of a synchronous analysis request. To access progress information, use the asynchronous invocation mode.

Asynchronous invocation

The asynchronous invocation sequence consists of three phases: initiating the analysis, optional monitoring of analysis progress and retrieving analysis results.

Initiating the analysis

POST /api/v1/analysis HTTP/1.1
Host: localhost:8080
Accept-Encoding: gzip, deflate

{
  "scope": {
    "type": "byQuery",
    "query": "christmas"
  },
  "labels": {
    "frequencies": {
      "minAbsoluteDf": 5,
      "minRelativeDf": 0
    },
    "scorers": {
      "idfScorerWeight": 0.4
    }
  }
}

In this example, the POST request body will include a number of overrides over the project descriptor's default analysis parameters. Notably, we override the scope section to analyze a subset of the whole collection.

HTTP/1.1 202 Accepted
Location: http://localhost:8080/api/v1/analysis/b045f2fcbcc16c9f
Content-Length: 0

Following the asynchronous service pattern, once the analysis request is accepted, the Location header will point you to the URL from which you will be able to get progress information and analysis results.

Monitoring analysis progress

GET /api/v1/analysis/b045f2fcbcc16c9f HTTP/1.1
Host: localhost:8080
Accept-Encoding: gzip, deflate

To monitor the progress of analysis, make a GET request at the status URL returned in the Location header.

HTTP/1.1 200 OK
Content-Type: application/json
Content-Encoding: gzip

{
  "status": "PROCESSING",
  "progress": [ {
      "step": "Resolving selector query",
      "progress": 1.0
    }, {
      "step": "Fetching candidate labels",
      "progress": 1.0
    }, {
      "step": "Scoring candidate labels",
      "progress": 0.164567753
    }, ...
  ]
}

The response will contain a JSON object with analysis progress information.

GET /api/v1/analysis/b045f2fcbcc16c9f HTTP/1.1
Host: localhost:8080
Accept-Encoding: gzip, deflate

Once processing is complete, the status changes to AVAILABLE:

HTTP/1.1 200 OK
Content-Type: application/json
Content-Encoding: gzip

{
  "status": "AVAILABLE",
  "progress": [ {
      "step": "Resolving selector query",
      "progress": 1.0
    }, {
      "step": "Fetching candidate labels",
      "progress": 1.0
    }, {
      "step": "Scoring candidate labels",
      "progress": 1.0
    }, ..., {
      "step": "Computing coverage",
      "progress": 1.0
    }
  ]
}

You can periodically poll the progress information until the processing is complete.
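As a sketch, such a polling loop might look like the following Java fragment (JAX-RS client and Jackson, as in the bundled examples). The status URL is the hypothetical one from the example above; the response layout follows the example responses in this section:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;

public class ProgressPollingExample {
  public static void main(String[] args) throws Exception {
    // Status URL as returned in the Location header (hypothetical id).
    String statusUrl = "http://localhost:8080/api/v1/analysis/b045f2fcbcc16c9f";

    Client client = ClientBuilder.newClient();
    ObjectMapper mapper = new ObjectMapper();
    try {
      while (true) {
        JsonNode status = mapper.readTree(
            client.target(statusUrl).request().get(String.class));
        if (!"PROCESSING".equals(status.path("status").asText())) {
          break; // AVAILABLE or FAILED
        }
        Thread.sleep(1000); // poll roughly once a second
      }
    } finally {
      client.close();
    }
  }
}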

Fetching analysis results

POST /api/v1/analysis/b045f2fcbcc16c9f/result HTTP/1.1
Host: localhost:8080
Accept-Encoding: gzip, deflate

To retrieve the analysis results, make a POST request at the status URL with the /result suffix.

HTTP/1.1 200 OK
Content-Type: application/json
Content-Encoding: gzip

{
  "labels": {
    "list": [ {
        "index": 7,
        "text": "Christmas Eve"
      }, {
        "index": 196,
        "text": "Santa Claus"
      }, ...
    ]
  },

  ...

  "summary": {
    "elapsedMs": 378,
    "candidateLabels": 3340
  },
  "scope": {
    "selector": "christmas",
    "documentsInScope": 3866
  }
}

The request will block until the analysis results are available. This means you can issue the results fetching request right after you receive the status URL and then concurrently poll for processing progress, while the results fetching request blocks waiting for the results.
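A minimal sketch of this pattern, again assuming the hypothetical status URL from the examples above and a JAX-RS client implementation that allows concurrent requests:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;

public class ConcurrentFetchExample {
  public static void main(String[] args) throws Exception {
    String statusUrl = "http://localhost:8080/api/v1/analysis/b045f2fcbcc16c9f";

    Client client = ClientBuilder.newClient();
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      // Issue the blocking /result request immediately...
      Future<String> result = executor.submit(() ->
          client.target(statusUrl + "/result")
              .request()
              .post(Entity.json("{}"), String.class));

      // ...and poll the status URL for progress while it waits.
      while (!result.isDone()) {
        System.out.println(client.target(statusUrl).request().get(String.class));
        Thread.sleep(1000);
      }
      System.out.println(result.get());
    } finally {
      executor.shutdown();
      client.close();
    }
  }
}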

POST /api/v1/analysis/b045f2fcbcc16c9f/result HTTP/1.1
Host: localhost:8080
Accept-Encoding: gzip, deflate

{
  "format": "xml",
  "labels": {
    "documents": {
      "enabled": true,
      "outputScores": true
    }
  }
}

You can retrieve a different "view" of the same result by making another (or a concurrent) request at the /result URL passing in POST request body a JSON object that overrides the output specification subsection. In this example, we change the response format to XML and request Lingo4G to fetch top-scoring documents for each selected label.

HTTP/1.1 200 OK
Content-Type: application/xml
Content-Encoding: gzip

<result>
  <labels>
    <list>
      <label index="125" text="Christmas Eve">
        <document id="361453" score="16.852114"/>
        <document id="168068" score="15.5833"/>
        ...
      </label>
      <label index="378" text="Santa Claus">
        <document id="148398" score="19.069061"/>
        <document id="353471" score="17.928875"/>
        ...
      </label>
    </list>
  </labels>
  ...
</result>

The response is now in XML format and contains top-scoring documents for each selected label.

Caching

To implement the asynchronous service pattern, Lingo4G REST API needs to cache the results of completed analyses for some time. By default, up to 1024 results will be cached for up to 120 minutes, but you can change those parameters by editing L4G_HOME/conf/server.json.

One consequence of the asynchronous service pattern is that requests for analysis progress or results may complete with the 404 Not Found status code if the analysis they refer to has already been evicted from the cache. In this case, the application needs to initiate a new analysis with the same parameters.
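A sketch of eviction-aware result fetching; the method name is hypothetical and spec stands for the JSON body used to initiate the original analysis:

import javax.ws.rs.client.Client;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.Response;

public class EvictionAwareFetch {
  // Fetches the result for an analysis id; if the cached result was
  // evicted (404), re-initiates the analysis with the same parameters.
  static String fetchOrRestart(Client client, String baseUrl, String id, String spec) {
    Response response = client
        .target(baseUrl + "/api/v1/analysis/" + id + "/result")
        .request()
        .post(Entity.json("{}"));
    if (response.getStatus() == 404) {
      response.close();
      Response restarted = client
          .target(baseUrl + "/api/v1/analysis")
          .request()
          .post(Entity.json(spec));
      String statusUrl = restarted.getHeaderString("Location");
      restarted.close();
      // Blocks until the new analysis completes.
      return client.target(statusUrl + "/result")
          .request()
          .post(Entity.json("{}"), String.class);
    }
    return response.readEntity(String.class);
  }
}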

Application development

Example code. If you are planning to access Lingo4G REST API from Java, the src/lingo4g-examples directory contains some example code that makes API calls using JAX-RS Client API and Jackson JSON parser.

Built-in web server. If you are planning to call Lingo4G REST API directly from client-side JavaScript, you can use the REST API's built-in web server to serve your application. The built-in web server exposes the L4G_HOME/web directory, so you can put your application code in there and access it through your browser.

Reference

The base URL for Lingo4G REST API is http://host:port/api. Entries in the following REST API reference omit this prefix for brevity.

/v1/about

Returns basic information about Lingo4G version and the project served by this instance of the REST API.

Methods
GET
Parameters
none
Response

A JSON object with Lingo4G and project information similar to the following (the build identifier's pattern is given for reference, but it can change at any time):

{
  "product": "Lingo4G",
  "version": "1.2.0",
  "build": "yyyy-MM-dd HH:mm gitrev",
  "projectId": "imdb"
}

/v1/analysis

Initiates a new analysis.

Methods
GET, POST
Request body
A JSON object corresponding to the analysis section of the project descriptor with per-request overrides to the parameter values specified in the project descriptor.
Parameters
async

Chooses the synchronous vs. asynchronous processing mode.

true
(default) The request will be processed in an asynchronous way and will return immediately with the Location header pointing at a URL for fetching analysis progress and results.
false
The request will be processed in a synchronous way and will block until processing is complete. The response will contain the analysis result.
spec

For GET requests, the analysis specification JSON.

Response

For asynchronous invocation: the 202 Accepted status code along with the status URL in the Location header. Use the status URL to get processing progress information, use the results URL to retrieve the analysis results.

For synchronous invocation: results of the analysis.

/v1/analysis/{id}

Returns partial analysis results, including processing progress and selected result statistics. You can call this method at, for example, 1-second intervals to get the latest processing status and statistics.

Methods
GET
Parameters
none
Response

Partial analysis results JSON object following the structure of the complete analysis results JSON. When retrieved using this method, the JSON object will contain processing progress information as well as label and document statistics as soon as they become available.

If certain statistics are yet to be computed, the corresponding fields will be absent from the response. Once a statistic becomes available, its value will not change until the end of processing.

{
  // Label statistics
  "labels": {
    "selected": 1000,
    "candidate": 5553
  },

  // Document statistics
  "documents": {
    "inScope": 9107,
    "labeled": 9084
  },

  // Processing status and progress
  "status": {
    "status": "PROCESSING",
    "elapsedMs": 2650,
    "progress": [ ]
  }
}

The individual properties shown in the example above are described in detail in the result reference below.

Errors
This request will return 404 Not Found if an analysis with the provided identifier does not exist. The analysis may be missing if its entry was evicted from the results cache. In such cases, the application will need to request a new analysis with the same parameters.

/v1/analysis/{id}/result

Returns the analysis result. The request will block until the analysis results are available.

Methods
GET, POST
Request body
(optional) A JSON object corresponding to the output section of the project descriptor with per-request overrides to the analysis output specification.
Parameters
spec

For GET requests, the output specification JSON.

Response

Analysis results in the requested format. While the following documentation is based on the JSON result format, the XML format contains exactly the same data.

The top-level structure of the result JSON output is shown below. The individual properties are described in the sections that follow.

{
  // List of labels and label clusters
  "labels": {
    "selected": 1000,
    "candidate": 5553,

    "list": [ ],
    "arrangement": { }
  },

  // List of documents and document clusters
  "documents": {
    "inScope": 9107,
    "labeled": 9084,

    "list": [ ],
    "arrangement": { }
  },

  // Processing status and progress
  "status": {
    "status": "PROCESSING",
    "elapsedMs": 2650,
    "progress": [ ]
  },

  "spec": {
    "scope": { },
    "labels": { },
    "documents": { },
    ...
  }
}

Labels

The labels section contains all the result artifacts related to the labels selected for analysis. The labels section can contain the following subsections:

selected
The number of labels selected for analysis.
candidate
The number of candidate labels considered when selecting the final list of labels.
list
The list of labels selected for analysis.
arrangement
The list of label clusters.

Label list

The list property contains an array of objects, each of which represents one label:

{
  "labels": {
    "list": [
      {
        "id": 314,
        "text": "Excel",
        "df": 1704,
        "display": "Excel",
        "score": 0.0020637654,
        "documents": [ 219442, 182400, 186036, ... ]
      },
      {
        "id": 1,
        "text": "Microsoft Office",
        "df": 1646,
        "display": "Microsoft Office",
        "score": 0.0023052557,
        "documents": [ 173570, 19411, 109766, ... ]
      },
      ...
    ]
  }
}

Each label object has the following properties:

id
A unique identifier of the label. Other parts of the analysis result, such as label clusters, reference labels using this identifier.
text
The text of the label as stored in the index. Use this text wherever the REST API requires label text, such as in the document retrieval criteria.
display
The text of the label to display in the user interface. The display text depends on the label formatting options, such as labelFormat.
df
The Document Frequency of the label, that is the number of documents that contain at least one occurrence of the label.
score
The internal score computed by Lingo4G for the label. Label scores are only meaningful when compared to scores of other labels. The larger the score, the more valuable the label according to the Lingo4G scoring mechanism.
documents

The list of documents in which the label occurs. Documents are represented by internal integer identifiers you can use in the document retrieval criteria.

The list is returned only if the output.labels.documents.enabled parameter is set to true.

The type of the documents array entries depends on the value of the outputScores parameter:

false

If label-document assignment scores are not requested, the documents array consists of internal identifiers of documents.

{
  "documents": [ 173570, 19411, 109766, ... ]
}
true

If label-document assignment scores are requested, the documents array consists of objects containing the id and score properties.

{
  "documents": [
    {
      "id": 15749,
      "score": 10.0173645
    },
    {
      "id": 228297,
      "score": 9.601537
    },
    ...
  ]
}
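A client therefore needs to handle both shapes when parsing the array. A minimal Jackson sketch (the method name is hypothetical):

import com.fasterxml.jackson.databind.JsonNode;

// Reads one label's "documents" array, which holds plain integer
// identifiers when outputScores is false and {id, score} objects
// when outputScores is true.
static void printDocuments(JsonNode documents) {
  for (JsonNode entry : documents) {
    if (entry.isNumber()) {
      System.out.println("doc " + entry.asLong());
    } else {
      System.out.println("doc " + entry.path("id").asLong()
          + ", score " + entry.path("score").asDouble());
    }
  }
}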

Label clusters

If label clustering was requested (by setting the corresponding parameter to true), the arrangement section will contain the clusters:

{
  "labels": {
    "arrangement": {
      // Top-level clusters
      "clusters": [
        {
          "id": 28,
          "exemplar": 28,
          "similarity": 1.0,
          "silhouette": -0.9497682,

          // Labels assigned to the cluster
          "labels": [
            {
              "id": 301,
              "similarity": 0.25650117,
              "silhouette": -0.8132695
            },
            {
              "id": 22,
              "similarity": 0.1252955,
              "silhouette": -0.6878787
            },
            ...
          ],

          // Clusters related to this cluster, if any
          "clusters": [ ]
        },
        ...
      ],

      // Global properties of the result
      "converged": true,
      "iterations": 162,
      "silhouetteAverage": -0.42054054,
      "netSimilarity": 43.99878,
      "pruningGain": 0.07011032
    }
  }
}

The main part of the clustering result is the clusters property that contains the list of top-level label clusters. Each cluster contains the following properties:

id
Unique identifier of the cluster.
exemplar
Identifier of the label that serves as the exemplar of the cluster.
similarity
Similarity between this cluster's and the parent cluster's exemplars, 1.0 for top-level clusters.
silhouette
The silhouette coefficient computed for the cluster's exemplar.
labels

The list of label members of the cluster. Each label member is represented by an object with the following properties:

id
Identifier of the label.
similarity
Similarity between the member label and the cluster's exemplar label.
silhouette
Silhouette coefficient computed for the member label.

Note: The list of member labels includes only the "ordinary" labels, that is those that are not exemplars of this cluster or the related clusters.

Note, however, that the exemplar labels are legitimate members of the cluster and they should also be presented to the user. The exemplar of this cluster, its similarity and silhouette values are direct properties of the cluster object. Similarly, the exemplars of related clusters are properties of the related clusters, available in the clusters property of the parent cluster.

clusters
The list of clusters related to this cluster. Each object in the list follows the structure of top-level clusters. Please see the conceptual overview of label clustering for more explanations about the nature of cluster relations.

The clustering result contains a number of properties specific to the Affinity Propagation (AP) clustering algorithm. Those properties will be of interest mostly to users familiar with that clustering algorithm.

converged
true if the AP clustering algorithm converged to a stable solution.
iterations
The number of iterations the AP clustering algorithm performed.
silhouetteAverage
The Silhouette coefficient average across all member labels.
netSimilarity
The sum of similarities between labels and their exemplar labels.
pruningGain
The proportion of label relationships that were removed as part of relationships matrix simplification. A value of 0.5 means 50% of the relationships could be removed without affecting the final result.

Documents

The documents section contains all the result artifacts related to the documents being analyzed. This section can contain the following properties:

inScope
The number of documents in scope.
totalMatches
The total number of documents that matched the scope query. The total number of matches will be larger than the number of documents in scope if the scope was limited by the user-provided limit parameter or by the limit encoded in the license file.
scopeLimitedBy

If present, explains the nature of the scope size limit:

USER_LIMIT
Scope was capped at the limit provided in the limit parameter.
LICENSE_LIMIT
Scope was capped at the limit encoded in the license file.
labeled
The number of documents that contained at least one of the labels selected for analysis.
list
The list of documents in scope.
arrangement
The document clusters.

Document list

Documents will be emitted only if the output.documents.enabled parameter is true. The list property contains an array of objects, each of which represents one document:

{
  "documents": {
    "list": [
      {
        "id": 236617,
        "content": [
          {
            "name": "id",
            "values": [ "802569" ]
          },
          {
            "name": "title",
            "values": [ "How to distill / rasterize a PDF in Linux" ]
          },
          ...
        ],
        "labels": [
          { "id": 301, "occurrences": 10 },
          { "id": 637, "occurrences": 4 },
          { "id": 62,  "occurrences": 2 },
          ...
        ]
      },
      ...
    ]
  }
}

Each document object has the following properties:

id
The internal unique identifier of the document. You can use the identifier in the document retrieval and scope selection criteria. Please note that the identifiers are ephemeral — they may change between restarts of Lingo4G REST API and when content is re-indexed.
content

Textual content of the requested fields. For each requested field, the array will contain an object with the following properties:

name
The name of the field.
values
An array of values of the field. For single-valued fields, the array will contain at most one element. For multi-valued fields, the array can contain more elements.

You can configure whether and how to output document content using the parameters in the corresponding output section. If document output is not requested, the content property will be absent from the document object.

labels

The list of labels occurring in the document. The list includes only the labels selected for processing in the analysis whose result you are retrieving.

Each object in the array represents one label. The object has the following properties:

id
Identifier of the label.
occurrences
The number of times the label appeared in the document.

The labels are sorted by decreasing number of occurrences. You can configure whether and how to output labels for each document using the parameters in the corresponding output section. If labels output is not requested, the labels property will be absent from the document object.

Document clusters

If document clustering was requested (by setting the corresponding parameter to true), the arrangement section will contain document clusters.

{
  "documents": {
    "arrangement": {
      // Clusters
      "clusters": [
        {
          "id": 0,
          "exemplar": 188002,

          "documents": [
            { "id": 188002, "similarity": 0.18461633 },
            { "id": 29328,  "similarity": 0.062834464 },
            { "id": 221101, "similarity": 0.06023093 },
            ...
          ],

          "labels": [
            { "occurrences": 7, "text": "automate" },
            { "occurrences": 5, "text": "text" },
            { "occurrences": 5, "text": "office computer" },
            ...
          ]
        },
        ...
      ],

      // Global properties of the result
      "converged": true,
      "iterations": 505,
      "netSimilarity": 834.61414,
      "pruningGain": 0.014
    }
  }
}

The main part of the document clustering result is the clusters property that contains the list of document clusters. Each object in the list represents one cluster and has the following properties:

id
Unique identifier of the cluster.
exemplar
Identifier of the document chosen as the exemplar of the cluster. Equal to -1 for the special "non-clustered documents" cluster that contains documents that could not be clustered.
documents

The list of documents in the cluster. Each object in the list represents one document and has the following properties:

id
Identifier of the member document.
similarity
Similarity of the document to the cluster's exemplar document.
labels

The list of labels that occur most frequently in the cluster's documents. The list will only include labels selected for processing in the analysis to which this result pertains.

Each object in the list represents one label and has the following properties:

text
Text of the label.
occurrences
The number of label's occurrences across all member documents in the cluster.

The labels are sorted by decreasing number of occurrences. You can configure the number of labels to output using the corresponding parameter.

The clustering result also contains a number of properties specific to the Affinity Propagation (AP) clustering algorithm. Those properties will be of interest mostly to users familiar with that clustering algorithm.

converged
true if the AP clustering algorithm converged to a stable solution.
iterations
The number of iterations the AP clustering algorithm performed.
netSimilarity
The sum of similarities between documents and their cluster's exemplar documents.
pruningGain
The proportion of document relationships that were removed as part of relationships matrix simplification. A value of 0.5 means 50% of the relationships could be removed without affecting the final result.

Status

The status section contains some low-level details of the analysis, including the total processing time and the specific tasks performed.

{
  // Processing status
  "status": {
    "status" "AVAILABLE",
    "elapsedMs": 2650,
    "progress": [
      {
        "task": "Resolving selector query",
        "status": "DONE",
        "progress": 1.0,
        "elapsedMs": 12,
        "remainingMs": 0
      },
      {
        "task": "Fetching candidate labels",
        "status": "STARTED",
        "progress": 0.626,
        "elapsedMs": 1204,
        "remainingMs": 893
      },
      {
        "task": "Fetching candidate labels",
        "status": "NEW"
      },
      ...
    ]
  }
}
status

Status of this result:

PROCESSING
The result is being computed. Some result facets may already be available for retrieval using the analysis progress method.
AVAILABLE
The analysis has completed successfully, result is available.
FAILED
The analysis has not completed successfully, result is not available.
elapsedMs
The total time elapsed when performing the analysis, in milliseconds.
progress

An array of entries that summarizes the progress of individual tasks comprising the analysis. All tasks scheduled for execution will be available in this array right from the start of processing. As the analysis progresses, tasks will change their status, progress and other properties.

Each entry in the array is an object with the following properties:

task
Human-readable name of the task.
status

Status of the task:

NEW
Task not started.
STARTED
Task started, not completed.
DONE
Task completed.
SKIPPED
Task not executed. Certain tasks can be skipped if the result they compute was already available in the partial results cache.
progress
Progress of the task, 0 means no work has been done yet, 1.0 means the task is complete. Progress is not defined for tasks with status NEW and SKIPPED.
elapsedMs
Time spent performing the task so far, in milliseconds. Elapsed time is not defined for tasks with status NEW and SKIPPED.
remainingMs
The estimated time required to complete the task, in milliseconds. Estimated remaining time is not defined for tasks with status NEW or SKIPPED, nor for tasks with progress less than 0.2.
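The per-task entries make it easy to derive a single overall progress value, for example for a progress bar. A naive Jackson-based sketch that weighs all tasks equally (Lingo4G does not prescribe any particular weighting):

import com.fasterxml.jackson.databind.JsonNode;

// Rough overall progress in the 0..1 range: the average of per-task
// progress, counting NEW tasks as 0 and DONE/SKIPPED tasks as 1.
static double overallProgress(JsonNode progressArray) {
  double sum = 0;
  int tasks = 0;
  for (JsonNode task : progressArray) {
    String status = task.path("status").asText();
    if ("DONE".equals(status) || "SKIPPED".equals(status)) {
      sum += 1.0;
    } else {
      sum += task.path("progress").asDouble(0.0); // NEW tasks count as 0
    }
    tasks++;
  }
  return tasks == 0 ? 0.0 : sum / tasks;
}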

Analysis parameters specification

The spec property contains the analysis parameters used to produce this result. The descriptor included here contains all analysis parameters, including the ones overridden for the duration of the request and the ones that were not overridden and hence have their default values.

The structure of the spec object is the same as the structure of the analysis section of the project descriptor:

{
  "scope":       { ... },
  "labels":      { ... },
  "documents":   { ... },
  "performance": { ... },
  "output":     { ... },
  "summary":     { ... },
  "debug":       { ... }
}
Errors
This request will return 404 Not Found if an analysis with the provided identifier does not exist. The analysis may be missing if its entry was evicted from the results cache. In such cases, the application will need to request a new analysis with the same parameters.

/v1/analysis/{id}/documents

Retrieves the content of the analyzed documents. You can retrieve documents based on a number of criteria, such as documents containing a specific label. Optionally, Lingo4G can highlight the occurrences of document selection criteria (scope query, labels) in the text of the retrieved document.

Methods
GET, POST
Request body

A JSON object defining which documents and fields to retrieve. The structure of the specification is shown below. The content and labels subsections are exactly the same as the corresponding parts of the analysis output section; see the relevant documentation for details.

{
  // How many documents to retrieve
  "limit": 10,
  "start": 0,

  // Document retrieval criteria
  "criteria": {
    "type": "forLabels",
    "labels": [ "data mining", "KDD" ],
    "operator": "OR"
  },

  // The output of labels found in each document
  "labels": {
    "enabled": false,
    "maxLabelsPerDocument": 20,
    "minLabelOccurrencesPerDocument": 0
  },

  // The output of documents' content
  "content": {
    "enabled": false,
    "fields": [
      {
        "name": "title",
        "maxValues": 3,
        "maxValueLength": 160,
        "highlighting": {
          "criteria": false,
          "scope": false
        }
      }
    ]
  }
}

Properties specific to document retrieval are the following:

limit
The maximum number of documents to retrieve in this request. Default: 10.
start
The document index at which to start retrieval. Default: 0.
criteria

An object that narrows down the set of returned documents. The following criteria are supported:

allInScope

Retrieves all documents in the scope of the analysis. This type of criteria does not define any other properties:

"criteria": {
  "type": "allInScope"
}
forLabels

Retrieves documents containing the specified labels. This type of criteria requires additional properties:

"criteria": {
  "type": "forLabels",
  "labels": [ "data mining", "KDD" ],
  "operator": "OR"
}
labels
An array of label texts to use for document retrieval
operator
If OR, documents containing any of the specified labels will be returned. If AND, only documents that contain all of the specified labels will be returned.
byId

Retrieves all documents matching the provided list of identifiers. This type of criteria requires an additional array of numeric document identifiers, for example:

"criteria": {
  "type": "byId",
  "ids": [ 7, 123, 235, 553 ]
}
ids
A non-empty array of document identifiers referenced in the analysis response.
composite

Composes the base retrieval criteria using the OR and AND operators, for example:

"criteria": {
  "type": "composite",
  "operator": "AND",
  "criteria": [
    {
      "type": "forLabels",
      "labels": [ "email" ]
    },
    {
      "type": "forLabels",
      "operator": "OR",
      "labels": [
        "Thunderbird",
        "Outlook",
        "IMAP"
      ]
    }
  ]
}
criteria
An array of criteria to compose. The array can contain criteria of all types, including the composite type.
operator
The operator to use to combine the individual criteria. The supported operators are OR and AND.

Note: Regardless of the criteria, the returned documents will be limited to those in the scope of the analysis.

Parameters
spec

For GET requests, the output specification JSON.

Response

A JSON object containing the retrieved documents, similar to:

{
  "matches": 120,
  "list": [
    {
      "id": 107288,
      "score": 0.98,
      "content": [
        { "name": "title", "values": [ "Mister Magoo's Christmas Carol" ] },
        { "name": "plot", "values": [ "An animated, magical, musical vers..." ] },
        { "name": "year", "values": [ "1962" ] },
        { "name": "keywords", "values": [ "actor", "based-on-novel", "blind" ] },
        { "name": "director", "values": [ ] }
      ],
      "labels": [
        { "id": 371, "times": 2 },
        { "id": 117, "times": 1 }
      ]
    },
    {
      "id": 218172,
      "score": 0.95,
      "content": [
        { "name": "title", "values": [ "Brer Rabbit's Christmas Carol" ] },
        ...
      ]
    },
    ...
  ]
}
Errors
This request will return 404 Not Found if an analysis with the provided identifier does not exist. The analysis may be missing if its entry was evicted from the results cache. In such cases, the application will need to request a new analysis with the same parameters.
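To put the endpoint to use, the following Java sketch retrieves the titles of documents containing a given label. The analysis identifier is the hypothetical one used earlier in this chapter; only the field name is specified, so the remaining content options keep their defaults:

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;

public class DocumentRetrievalExample {
  public static void main(String[] args) {
    String url = "http://localhost:8080/api/v1/analysis/b045f2fcbcc16c9f/documents";

    // Label texts must match the "text" property of labels
    // returned in the analysis result.
    String spec = "{"
        + "  \"limit\": 5,"
        + "  \"criteria\": { \"type\": \"forLabels\","
        + "    \"labels\": [ \"data mining\" ], \"operator\": \"OR\" },"
        + "  \"content\": { \"enabled\": true,"
        + "    \"fields\": [ { \"name\": \"title\" } ] }"
        + "}";

    Client client = ClientBuilder.newClient();
    try {
      System.out.println(client.target(url)
          .request()
          .post(Entity.json(spec), String.class));
    } finally {
      client.close();
    }
  }
}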

/v1/project/defaults/source/fields

Returns the fields section of the project descriptor for which this instance is running.

Methods
GET
Parameters
none
Response
A JSON object representing the fields section of the project descriptor.

/v1/project/defaults/indexer

Returns the indexer section of the project descriptor for which this instance is running.

Methods
GET
Parameters
none
Response
A JSON object representing the indexer section of the project descriptor.

/v1/project/defaults/analysis

Returns the analysis section of the project descriptor for which this instance is running.

Methods
GET
Parameters
none
Response
A JSON object representing the analysis section of the project descriptor.

/v1/project/defaults/dictionaries

Returns the dictionaries section of the project descriptor for which this instance is running.

Methods
GET
Parameters
none
Response
A JSON object representing the dictionaries section of the project descriptor.

/v1/project/defaults/queryParsers

Returns the queryParsers section of the project descriptor for which this instance is running.

Methods
GET
Parameters
none
Response
A JSON object representing the queryParsers section of the project descriptor.

Environment variables

L4G_HOME

Sets the path to the Lingo4G home directory, which contains Lingo4G binaries and global configuration. In most cases there is no need to explicitly set the L4G_HOME variable; the l4g launch scripts will set it automatically.

L4G_OPTS

Sets the extra JVM options to pass when launching Lingo4G. The most common use case for setting L4G_OPTS is increasing the amount of memory Lingo4G can use (on Windows and Unix-like systems, respectively):

SET L4G_OPTS=-Xmx6g
export L4G_OPTS=-Xmx6g

When not set explicitly, Lingo4G launch scripts will set L4G_OPTS to -Xmx4g.

Project descriptor

The project descriptor is a JSON file that defines all information required to index and analyze a data set. The general structure of the project descriptor is the following:

{
  // Project-level settings
  "id": "project-id",
  "name": "Project Name",

  // Blocks of settings for specific components of Lingo4G
  "dictionaries": [ ... ],
  "analyzers": [ ... ],
  "queryParsers": [ ... ],

  // Document source specification
  "source": { ... },

  // Indexing settings
  "indexer": { ... },

  // Analysis settings
  "analysis": { ... }
}

The following sections describe each block of the project descriptor in more detail. Please note that most of the properties and configuration blocks are optional and do not need to be provided explicitly in the project descriptor. You can use the show command to display the project descriptor with all blanks filled in.

Project settings

id

Project identifier, optional. If not provided, the name of the project descriptor JSON file will be used as the project identifier.

Lingo4G uses the project identifier in a number of places, for example as part of the clustering results file names.

name

Human-readable project name, optional. If not provided, project identifier will be used as the project name.

projectDirectory

Path to the project directory, optional. If not provided, the directory in which the project descriptor file is contained will be assumed to be the project directory.

By default, Lingo4G will store the index and temporary files in the project directory. For convenience, you can also keep other project-specific resources, such as word and label dictionaries, in the project directory. Please note that for large data sets, the size of the index and temporary files may be large, see the storage requirements section for more details.

workDirectory

Path to the work directory, optional. If not provided, the location of the work directory will be projectDirectory/work.

By default, Lingo4G will store the index, temporary files and other internal files in the work directory.

resultsDirectory

Path to the results directory, optional. If not provided, the location will be ${projectDirectory}/results.

Lingo4G will use the results directory to store the output of the analyze command.

indexDirectory

Path to the index directory, optional. If not provided, the location of the index directory will be workDirectory/index.

Lingo4G will use the index directory to store its index, which is the persistent representation of the documents in the data set required for clustering.

tempDirectory

Path to the temporary directory, optional. If not provided, the location of the temporary directory will be workDirectory/temp.

Lingo4G will use the temporary directory to store intermediate data generated during indexing. The temporary files will be deleted as soon as they are not needed.

Dictionaries

The dictionaries section is an array of static dictionary declarations that you can later reference at various stages of Lingo4G processing, for example to exclude labels from processing. Each declaration is an object with the key, type and additional type-dependent properties.

Typically, a dictionary is defined by a set of entries, such as string matching patterns or regular expressions. In such cases, the set of entries can be passed directly in the descriptor or stored in an external file referenced from the descriptor.

The following example shows typical dictionary definitions:

"dictionaries": [
  {
    "key": "common",
    "type": "simple",
    "files": [ "${project.directory}/resources/stoplabels.utf8.txt" ]
  },

  {
    "key": "common-inline",
    "type": "simple",
    "entries": [
      "information about *",
      "overview of *"
    ]
  },

  {
    "key": "extra",
    "type": "regex",
    "entries": [
      "\\d+ mg"
    ]
  }
]

key

A unique key of the dictionary. You can reference a dictionary in other sections of the project descriptor using its key.

type

The type of the dictionary. Syntax of the dictionary declaration, syntax of dictionary entries and the matching algorithm depend on the dictionary type.

The following dictionary types are supported:

simple

Dictionary with simple, word-based matching.

regex
Dictionary with entries defined as Java regular expression patterns.

type=simple

The word-matching dictionary is fast to parse and very fast to apply, offering limited support for wildcard matching. The primary use case of the simple dictionary is case-insensitive matching of literal phrases, as well as "begins with", "ends with" or "contains" types of phrase patterns.

Entry syntax and matching rules

  • Each entry must consist of one or more space-separated tokens.
  • A token can be a sequence of arbitrary characters, such as words, numbers, identifiers.
  • Matching is case-insensitive by default. Letter case normalization is performed based on the ROOT Java locale, which performs language-neutral case conflation according to Unicode rules.
  • The * token matches zero or more words.
  • Using the * wildcard character in combination with other characters, for example programm*, is not supported.
  • A token put in double quotes, for example "Rating***", is taken literally: matching is case-sensitive, and * characters are allowed and matched literally.
  • To include double quotes as part of the token, escape them with the \ character, for example: \"information\".

Example entries

The following example shows a few dictionary entries together with strings they do and do not match.

Entry                  Matching strings             Non-matching strings

more information       more information             more informations
                       More information             more information about
                       MORE INFORMATION             some more information

more information *     more information             more informations
                       More information about       more informations about
                       More information about a     some more information

* information *        information                  informations
                       more information             more informations about
                       information about            some more informations
                       a lot more information on

"Information" *        Information                  information
                       Information about            information about
                       Information ABOUT            Informations about

"Programm*"            Programm*                    Programmer
                                                    Programming

\"information\"        "information"                information
                       "INFOrmation"                "information

programm*              Illegal pattern, combinations of the * wildcard
                       and other characters are not supported.

"information           Illegal pattern, unbalanced double quotes.

*                      Illegal pattern, there must be at least one
                       non-wildcard token.

type=simple.entries

Array of entries of the simple dictionary, provided directly in the project descriptor or in an overriding JSON fragment. For the syntax of the entries, see the simple dictionary type documentation. Please note that double quotes being part of the pattern must be escaped as in the example below.

"dictionaries": [
  {
    "key": "simple-inline",
    "type": "simple",
    "entries": [
      "information about *",
      "\"Overview\""
    ]
  }
]

type=simple.files

Array of files to load simple dictionary entries from. The files must adhere to the following rules:

  • Must be plain-text, UTF-8 encoded, new-line separated.
  • Must contain one simple dictionary entry per line.
  • Lines starting with # are treated as comments.
  • There is no need to escape the double quote characters in dictionary files.

An example simple dictionary file may be similar to:

# Common stop labels
information *
overview of *
* awards

# Domain-specific entries
supplementary table *
subject group

A typical file-based dictionary declaration will be similar to:

"dictionaries": [
  {
    "key": "simple",
    "type": "simple",
    "files": [
      "${project.directory}/resources/stoplabels.utf8.txt"
    ]
  }
]

If multiple dictionary files or extra inline entries are provided, the resulting dictionary will contain the union of patterns from all sources.

type=regex

The regular-expression-based dictionary offers more expressive syntax, but is more expensive to parse and apply.

Use simple dictionary type whenever possible and practical

Dictionaries of the simple type are fast to parse and very fast to apply. This should be the preferred dictionary type, with other dictionary types reserved for entries impossible to express in the simple dictionary syntax.

Each entry in the regular expression dictionary must be a valid Java Regular Expression pattern. A string is considered present in the dictionary if it matches, as a whole, at least one of the patterns defining the dictionary.

Example entries

The following are some example regular expression dictionary entries together with strings they do and do not match.

Entry                      Matching strings           Non-matching strings

more information           more information           More information
                                                      more information about

(?i)more information       more information           more information about
                           More Information

(?i)more information .*    more information about     more information

(?i)more information\b.*   more information           some more information
                           more information about

Year\b\d+                  Year 2000                  Year

.*(low|high|top).*         low coverage               Low coverage
                           nice yellow dress
                           top coder
                           without stopping

Regular expression syntax is very powerful, but it is sometimes hard to write patterns that match exactly the strings the author intended. For instance, the intention of the last example in the table above may have been to match all strings containing the words low, high or top, but the pattern actually matches a much broader class of strings, such as nice yellow dress or without stopping. For more predictable semantics and much faster matching, use the simple dictionary format if possible.

type=regex.entries

Array of entries of the regular expression dictionary, provided directly in the project descriptor or overriding JSON fragment. Please note that double quotes and backslash characters being part of the pattern must be escaped as in the example below.

"dictionaries": [
  {
    "key": "regex-inline",
    "type": "regex",
    "entries": [
      "information about .*",
      "\"Overview\"",
      "overview of\\b.*"
    ]
  }
]

type=regex.files

Array of files to load regular expression dictionary entries from. The files must adhere to the following rules:

  • Must be plain-text, UTF-8 encoded, new-line separated.
  • Must contain one regular expression dictionary entry per line.
  • Lines starting with # are treated as comments.
  • There is no need to escape the double quote and backslash characters in dictionary files.

An example regular expression dictionary file may be similar to:

# Common stop labels
information about .*
"Overview"
overview of\b.*

A typical file-based dictionary declaration will be similar to:

"dictionaries": [
  {
    "key": "regex",
    "type": "regex",
    "files": [
      "${project.directory}/resources/stoplabels.regex.txt"
    ]
  }
]

If multiple dictionary files or extra inline entries are provided, the resulting dictionary will contain the union of patterns from all sources.

Analyzers

The function of an analyzer is to split the stream of input characters into smaller units (words, punctuation) which then undergo further analysis or indexing (phrase detection, matching against an input dictionary). An analyzer should be applied to fields provided by the document source.

Analyzers in Lingo4G are specialized subclasses of Apache Lucene classes. Several analyzers are provided by default as an array under the analyzers block of the project descriptor. A default analyzer's settings can be tweaked by redeclaring its key field, or a new analyzer can be added under a new key. The definition of the analyzers array is typically as follows:

"analyzers": [
  {
    "key": "...",
    "type": "...",
    ... // analyzer-specific fields.
  }
]

The key of an analyzer provides a unique reference used from other places of the project descriptor (for example from the fields declaration section). The type of an analyzer is currently hardcoded to one of the predefined analyzer types, as detailed in sections below.

type=english

The English analyzer is best suited to processing text written in English. It normalizes word forms and applies heuristic stemming to unify various spelling variants of the same word (lemma). The default definition declares the following properties:

{
  "key": "english",
  "type": "english",
  "requireResources": false,
  "useHeuristicStemming": true,
  "stopwords": [ "${l4g.home}/resources/indexing/stopwords.utf8.txt" ],
  "stemmerDictionary": "${l4g.home}/resources/indexing/words.en.dict"
}
requireResources

(default: false) Declares whether resources for the analyzer are required or optional. The default analyzer does not require the resources to be available (but points at their default locations under l4g.home).

useHeuristicStemming

(default: true) If true, the analyzer will additionally apply heuristic stemming (the Porter stemmer) to each term.

stemmerDictionary

The location of a precompiled Morfologik FSA (automaton file) with mappings from inflected forms to base forms and part-of-speech tags. Lingo4G comes with a reasonably sized default dictionary. This dictionary can be decompiled (or recompiled) using the morfologik-stemming library.

stopwords

An array of zero or more locations with stopword files. A stopword file is a plain-text, UTF-8 encoded file with each word on a single line. Stopwords are excluded from indexing and demarcate phrase boundaries. Adding a word to the stopwords file can decrease the size of the index, but adding too many words should be avoided because stopwords by definition cannot occur at the beginning or end of a phrase in automatic feature discovery.

type=whitespace

Whitespace analyzer can be useful to break up a field that consists of whitespace-separated tokens or terms. Any punctuation will remain together with the tokens (or will be returned as tokens). The default definition of this analyzer is as follows:

{
  "key": "whitespace",
  "type": "whitespace",
  "lowercase": true
}
lowercase

(default: true) If true, each token will be lowercased (according to Unicode rules, no localized rules apply).

type=keyword

Keyword analyzer does not perform any token splitting at all, returning the full content of a field for indexing (or feature detection). This analyzer is useful to index identifiers or other non-textual information that shouldn't be split into smaller units.

Lingo4G declares two analyzers of type keyword, differing only in lowercasing rules:

{
  "key": "keyword",
  "type": "keyword",
  "lowercase": true
},
{
  "key": "literal",
  "type": "keyword",
  "lowercase": false
}

Query parsers

The queryParsers section is an array of query parser definitions that can be used to parse Lucene queries selecting documents for clustering.

key

A unique key assigned to the query parser. If more than one query parser is defined, you will use the key value to reference the query parser to use during analysis in the queryParser property.

type

(default: standard) Declares the type of Lucene query parser to use. The following query parsers are currently available:

standard

Corresponds to the (flexible) standard query parser. The query syntax for this parser is identical to the classic query parser.

The internal configuration of standard query parser contains two properties:

defaultFields

An array of field names to be searched if the query does not use an explicit field prefix.

defaultOperator

(default: AND). The default Boolean operator applied to each clause of a parsed query, unless the query explicitly states the operator to use (see the query syntax guide above).

An example configuration declaring the default OR operator and fields title, content and authors for the standard query parser is shown below:

"queryParsers": [
  {
    "type": "standard",
    "key": "standard",
    "defaultFields": [
      "title", 
      "content", 
      "authors"
    ],
    "defaultOperator": "OR"
  }
]
complex

Corresponds to the complex query parser, which is an extension of standard query parser's syntax.

The internal configuration contains two properties:

defaultField

A single field name to be searched if the query does not use an explicit field prefix. Note the difference from the standard query parser: multiple default fields are not allowed. This constraint stems from Lucene's implementation.

defaultOperator

(default: AND). The default Boolean operator applied to each clause of a parsed query, unless the query explicitly states the operator to use (see the query syntax guide above).

An example configuration declaring the default OR operator and field content is shown below:

"queryParsers": [
  {
    "type": "complex",
    "key": "complex",
    "defaultField": "content",
    "defaultOperator": "OR"
  }
]

Document source

The source section defines the source of document data Lingo4G will use during indexing.

feed

Defines the source of document data. The specific configuration options depend on the type of the document source.

Further releases will come with a list of supported document sources.

fields

An object that defines how each document field should be indexed by Lingo4G. Each key of the object is a field name; the corresponding value defines how that field will be indexed and can have the following properties:

type

(default: text) The type of value inside the field. The following types are supported:

text
The default value type denoting free text. Text fields can have associated search and feature analyzers.
date

A combined date and time type. Two additional attributes inputFormat and indexFormat determine how the input string is converted to a point in time and then formatted for actual storage in the index. Both attributes can provide a pattern compatible with Java 8 date API formatting guidelines. The inputFormat additionally accepts a special token <epoch:milli> which represents the input as the number of milliseconds since Java's epoch.

integer, long

Numeric values of the precision given by their corresponding Java type.

float, double

Floating-point numeric values of the precision given by their corresponding Java type.

analyzer

(default: none) Determines how the field's text (value) will be processed for the search-based document selection. The following values are supported (see the analyzers section for more information):

none
Lingo4G will not process this field for clustering or search-based document selection. You can use this analyzer when you only want to store and retrieve the original value of the field from Lingo4G index for display purposes.
literal

Lingo4G will use the literal value of the field during processing. Literal analysis will work best for metadata, such as identifiers, dates or enumeration types.

keyword

Lingo4G will use the lower-case value of the field during processing. Keyword analyzer will work best when it is advisable to lower-case the field value before searching, for example for people or city names.

whitespace

Lingo4G will split the value of the field on white spaces and convert to lower case. Use this analyzer when it is important to preserve all words and their original grammatical form.

english

Lingo4G will apply English-specific tokenization, lower-casing, stemming and stop word elimination to the content of the field. Use this analyzer for natural text written in English.

Further releases of Lingo4G will come with support for other languages.

Please note that Lingo4G is currently most effective when clustering "natural text", such as document title or body. Therefore, you will most likely be applying clustering to fields with english or whitespace analyzers.

featureAnalyzer

Determines how the field's value will be processed for feature extractors and subsequently for clustering. If not provided, the type of processing is determined by the field's analyzer property. The list of accepted values is the same as for the analyzer property.

stored

(default: true) Determines whether Lingo4G will store the raw text of the field, so that it can later be retrieved for display purposes.

A typical fields definition may look like this:

"fields":  {
  // Identifier.
  "id":        { "analyzer": "literal" },

  // Simple values, will be lower-cased for query matching
  "author":    { "analyzer": "keyword" },
  "type":      { "analyzer": "keyword" },

  // English text
  "title":     { "analyzer": "english" },
  "summary":   { "analyzer": "english" },
  
  // Date, converted from incomplete information to full iso timestamp.
  "created":  { "type": "date", "inputFormat": "yyyy-MM-dd HH:mm",
                                "indexFormat": "yyyy-MM-dd'T'HH:mm:ss[X]" },

  "score":    { "type": "integer" }  
}

To lower the size of the index at the cost of not being able to retrieve the original text, you can disable storing of the field's original text by setting stored to false:

"fields":  {
  // English text, only for clustering and search-based document selection,
  // original text will not be available for retrieval.
  "title":     { "analyzer": "english", "stored": false },
  "summary":   { "analyzer": "english", "stored": false }
}

For further savings in index size, you can disable the ability to search the content of the field by setting its analyzer to none. If at the same time you would like to be able to apply clustering to the field, you will need to provide the appropriate analyzer in the featureAnalyzer property:

"fields":  {
  // English text, only for clustering. The field will not be available
  // for retrieval and search-based document selection.
  "title":     { "analyzer": "none", "stored": false, "featureAnalyzer": "english" },
  "summary":   { "analyzer": "none", "stored": false, "featureAnalyzer": "english" }
}

Indexer

The indexer section configures the Lingo4G document indexing process. Indexer parameters are divided into several subsections, described in the sections below.

{
  // Feature extractors
  "features": [ ... ],

  // Automatic stop label discovery
  "stopLabelExtractor": { ... }
}

threads

Declares the concurrency level for the indexer. Faster disk drives (SSD or NVMe) permit higher concurrency levels, while conventional spinning drives typically perform very poorly with multiple threads reading from different disk regions concurrently. There are several ways to express the permitted concurrency level:

auto
The number of threads used for indexing will be automatically and dynamically adjusted to maximize indexing throughput. This is the recommended value.
fixed integer
A fixed number of threads will be used for indexing. For spinning drives, this should be set to 1 (or auto). For SSD drives and NVMe drives, the number of threads should be close to the number of available CPU cores.
range
The number of threads used for indexing will be automatically and dynamically adjusted to maximize indexing throughput, but will be restricted to the provided minimum and maximum number of threads (inclusive). For example 1-4 will result in any number of concurrent threads between 1 and 4. This syntax can be used to decrease system load if automatic throughput management attempts to use all available CPUs.

features[]

An array of sections defining feature extractors. Each section corresponds to one feature extractor whose type is determined by the type property. The specific configuration options depend on the extractor type.

The following example shows an array defining several feature extractors. By their key, they are:

fine-phrases
An automatic key term and phrase extractor using the title and summary fields as the source, applying low document frequency thresholds (phrases occurring in at least 2 documents will be indexed).
coarse-phrases
An automatic key term and phrase extractor that discovers frequent phrases based only on the title field, but applies them to both title and summary fields. The phrases will be more coarse (and very likely less noisy); the minimum number of documents a phrase has to appear in is 10.
people
A dictionary extractor that adds any phrases defined in celebrities.json and saints.json to the title and summary fields.

"features": [
  {
    "type": "phrases",
    "key": "fine-phrases",
    "sourceFields": [ "title", "summary" ],
    "targetFields": [ "title", "summary" ],
    "minTermDf": 2,
    "minPhraseDf": 2
  },

  {
    "type": "phrases",
    "key": "coarse-phrases",
    "sourceFields": [ "title" ],
    "targetFields": [ "title", "summary" ],
    "minTermDf": 10,
    "minPhraseDf": 10
  },

  {
    "type": "dictionary",
    "key": "people",
    "targetFields": [ "title", "summary" ],
    "labels": [
      "celebrities.json",
      "saints.json"
    ]
  }
]

features[].key

The identifier of the specific extractor instance. The identifier will be used to build the name of the feature field produced by the extractor instance.

features[].type

Determines the type of the feature extractor. Currently two types of extractors are available: phrases, which identifies sequences of words that occur frequently in the input documents, and dictionary, which indexes a set of predefined labels and their aliases.

features[].type=phrases

Defines a phrase feature extractor, which attempts to extract meaningful phrases and terms automatically.

An example configuration of this extractor can look as shown below:

{
  // Names of source fields from which phrases/terms are collected
  "sourceFields": [ ... ],

  // Names of fields to which the discovered features should be applied
  "targetFields": [ ... ],

  // Extraction quality-performance trade-off tweaks
  "maxTermLength": ...,
  "minTermDf": ...,
  "minPhraseDf": ...,
  "maxPhraseDfRatio": ...,
  "maxPhraseTermCount": ...,
  "omitLabelsWithNumbers": ...
}

features[].type=phrases.sourceFields

An array of names of source fields from which Lingo4G will extract frequent phrases.

features[].type=phrases.targetFields

An array of names of source fields to which Lingo4G will apply and index the extracted phrases. For each provided field, Lingo4G will create one feature field named <source-field-name>$<extractor-key>. For the example list of feature extractors, Lingo4G will create the following feature fields: title$fine-phrases, summary$fine-phrases, title$coarse-phrases and summary$coarse-phrases. You can apply Lingo4G clustering on any combination of feature fields.

features[].type=phrases.maxTermLength

The maximum length of a single word, in characters, to accept during indexing. Words longer than the specified limit will be ignored.

features[].type=phrases.minTermDf

The minimum number of documents a word must appear in to be accepted during indexing. Words appearing in fewer than the specified number of documents will be ignored.

Raising the minTermDf threshold will help to filter out noisy words, decrease the size of the index and speed-up indexing and clustering. For efficient noise removal, consider raising minPhraseDf as well.

features[].type=phrases.minPhraseDf

The minimum number of documents a phrase must appear in for the phrase to be accepted during indexing. Phrases appearing in fewer than the specified number of documents will be ignored.

Raising the minPhraseDf threshold will help to filter out noisy phrases, decrease the size of the index and significantly speed up indexing and clustering.

features[].type=phrases.maxPhraseDfRatio

If a phrase or term exists in more than this ratio of documents, it will be ignored. A ratio of 0.5 means 50% of documents in the collection, a ratio of 1 means 100% of documents in the collection.

Typically, phrases that occur in more than 30% of all of the documents in a collection are either boilerplate headers or structural elements of the language (not informative) and can be safely dropped from the index. This improves speed and decreases index size.

features[].type=phrases.maxPhraseTermCount

The maximum number of non-stop-words to allow in a phrase. Phrases longer than the specified limit will not be extracted.

Raising maxPhraseTermCount above the default value of 5 will significantly increase the index size, indexing time and clustering time.

features[].type=phrases.omitLabelsWithNumbers

If set to true, any terms or phrases containing numeric tokens will be omitted from the index. While this option drops a significant number of features, it should be used with care, as certain potentially valid features contain numbers (Windows 10, Terminator 2).

features[].type=dictionary

Defines a dictionary feature extractor which indexes features from a predefined dictionary of matching strings.

An example configuration of this extractor can look as shown below:

{
  // Names of fields to which the feature matching rules should be applied
  "targetFields": [ ... ],

  // Resources declaring features to index (labels and their matching rules)
  "labels": [ ... ]
}

features[].type=dictionary.targetFields

An array of names of source fields to which Lingo4G will apply the matching rules specified in dictionaries. For each provided field, Lingo4G will create one feature field named <source-field-name>$<extractor-key>. For the example list of feature extractors, Lingo4G will create the following feature fields based on dictionary labels: title$people, summary$people. You can apply Lingo4G clustering on any combination of feature fields.

features[].type=dictionary.labels

A string or an array of strings naming JSON files that contain feature dictionaries. Files are resolved relative to the project's directory.

Each JSON file should contain an array of features and their matching rules, as previously explained in the overview of the dictionary extractor.

stopLabelExtractor

During indexing, Lingo4G will attempt to discover collection-specific stop labels, that is, labels that poorly characterize documents in the collection. Typically, such stop labels will include generic terms or phrases. For example, for the IMDb data set, the stop labels include phrases such as taking place, soon discovers or starts. For a medical dataset, the set of meaningless labels will likely include words and phrases that are not universally meaningless but occur very frequently within that particular domain, like indicate, studies suggest or control.

Heads up, experimental feature

Automatic stop label discovery is an experimental feature. Details may be altered in future versions of Lingo4G. A manual scan and review of the automatically discovered stop labels is highly encouraged.

An example configuration of stop label extraction is given below.

"stopLabelExtractor": {
  "categoryFields": [ "productClass", "tag" ],
  "featureFields": [ "title$phrases", "description$phrases" ],

  "maxPartitionQueries": 200,
  "partitionQueryMinRelativeDf": 0.001,
  "partitionQueryMaxRelativeDf": 0.15,

  "maxLabelsPerPartitionQuery": 10000,
  "minStopLabelCoverage": 0.2
}

Ideally, the categoryFields should include fields that separate all documents into fairly independent, smaller subsets. Good examples are tags, company divisions, or institution names. If no such fields exist in the collection, or if they don't provide enough information for stop label extraction, featureFields should be used to specify fields contributed by feature extractors (note the $phrases suffix in the example above; this is a particular extractor's key).

All other parameters are expert-level settings and typically will not require tuning. For completeness, the full process of determining which labels are potentially meaningless works as follows:

  1. First, the algorithm attempts to determine terms (at most maxPartitionQueries of them) that slice the collection of documents into potentially independent subsets. These "slicing" terms are first taken from fields declared in the categoryFields attribute, followed by terms from feature fields declared in the featureFields attribute.

    Only terms that cover a fraction of all input documents between partitionQueryMinRelativeDf and partitionQueryMaxRelativeDf will be accepted. So, in the descriptor above, only terms that cover between 0.1% and 15% of the total collection size would be considered acceptable.

  2. For each label in all documents matched by any of the slicing terms above, the algorithm computes which slicing terms the label was relevant to, and the chance of the label being a "frequent", "popular" phrase across all documents that slicing term matched.

  3. The topmost "frequent" labels relevant to at least a ratio of minStopLabelCoverage of all slicing terms are selected as stop labels. For example, a minStopLabelCoverage of 0.2 with maxPartitionQueries of 200 would mean the label must be present in documents matched by at least 40 slicing terms.

The set of automatically discovered stop labels is available for inspection in workDirectory/stop-labels.json. You can edit the file to remove labels incorrectly designated as meaningless. Re-indexing will overwrite the stop labels file without warning, so make sure to back up the edited version to avoid losing the changes.

The application of the stop label set at analysis time can be adjusted by the settings in the labels.probabilities section.

Analysis

The analysis section configures the Lingo4G analysis process. Analysis parameters are divided into several subsections; click the properties to go to the relevant documentation.

{
  // Scope defines the subset of documents to analyze
  "scope": { ... },

  // Label selection criteria
  "labels": {
    "surface": { ... },
    "frequencies": { ... },
    "probabilities": { ... },
    "scorers": { ... },
    "arrangement": { ... }
  },

  // Document analysis
  "documents": {
    "arrangement": { ... }
  },

  // Control of performance-quality trade-offs
  "performance": { ... },

  // Control over the specific elements to include in the output
  "output": { ... },

  // Which result statistics to compute and return
  "summary": { ... },

  // Output of debugging information
  "debug": { ... }
}

scope

The scope section configures which documents Lingo4G should take into account during analysis. See below for some sample scope definitions using the standard query parser (the interpretation of the query will depend on the query parsers declared; a specific query parser can be selected using its key).

Select all documents for analysis
{
  "type": "byQuery",
  "query": ""
}
Select documents containing the word christmas in the default search fields
{
  "type": "byQuery",
  "query": "christmas"
}
Select documents whose year field starts with 19
{
  "type": "byQuery",
  "query": "year:19*"
}
Select documents with the provided medlineId identifiers
{
  "type": "byFieldValue",
  "field": "medlineId",
  "values": [ "1", "5", "28", "65", ... ]
}

scope.type

The type of analysis scope to use, determines the other properties allowed in the scope specification.

scope.type=byQuery

Selects documents for analysis using a Lucene search query. In most cases, this will be the preferred scope type to use.

A typical query-based scope definition will be similar to:

{
  "type": "byQuery",
  "query": "christmas",
  "queryParser": "standard",
  "limit": 10000
}

scope.type=byQuery.query

The search query Lingo4G will run on the index to select the documents for clustering. The query must follow the syntax of the Lucene query parser configured in the project descriptor. You can use all indexed fields in your queries. The basic clustering section lists a number of example queries.

If the query is empty, all indexed documents will be clustered.

scope.type=byQuery.queryParser

The query parser to use when running the query. The query parser determines the syntax of the query, the default operator (AND, OR) and the list of default search fields.

If this option is empty or not provided and there is only one query parser defined in the project descriptor, the only defined query parser will be used.

scope.type=byQuery.limit

If the query matches more documents than the declared limit, the processing scope will be truncated and include only the top scoring documents up to that limit. This option can be used to decrease the amount of resources required for clustering.

An empty value of the limit attribute includes all documents matching the query.

Any additional processing scope size restrictions embedded in the Lingo4G license always take precedence over user-defined limits.

scope.type=byFieldValues

Selects for analysis all documents whose specified field is equal to any of the provided values. The typical use case for this scope type is selecting large numbers (thousands) of documents based on their identifiers. An equivalent selection is also possible with the byQuery scope, but the latter will be orders of magnitude slower in this specific scenario.

A typical definition of a field-value-based scope is the following:

{
  "type": "byFieldValue",
  "field": "medlineId",
  "values": [ "1", "5", "28", "65", ... ]
}

scope.type=byFieldValues.field

The name of the field to compare against the list of values. If the field name is empty, all indexed documents will be clustered.

scope.type=byFieldValues.values

An array of values to compare against the specified field. If a document's field is equal to any of the values from the list, the document will be selected for clustering. Please note that the comparisons are literal and case-sensitive.

If the list of values is empty or not provided, all indexed documents will be clustered.

scope.type=byId

Selects documents for analysis using their internal identifiers:

{
  "type": "byId",
  "ids": [ 154246, 40937, 352364, ... ]
}

scope.type=byId.ids

The array of internal document identifiers to include in the processing scope.

scope.type=complement

Selects documents not present in the provided processing scope. In Boolean terms, this scope type negates the provided processing scope:

{
  "type": "complement",
  "scope": [
    {
      "type": "byId",
      "ids": [ 154246, 40937, 352364, ... ]
    }
  ]
}

Using this scope type in isolation usually makes little sense, but the complement scope type can sometimes be useful as part of a composite scope definition.

scope.type=complement.scope

The processing scope to complement. Scopes of any type can be used here, such as the composite scope.

scope.type=composite

Composes two or more scope definitions using Boolean AND or OR operators:

{
  "type": "composite",
  "operator": "AND",
  "scopes": [
    {
      "type": "byQuery",
      "query": "christmas"
    },
    {
      "type": "complement",
      "scope": [
        {
          "type": "byId",
          "ids": [ 154246, 40937, 352364, ... ]
        }
      ]
    }
  ]
}

The above scope includes all documents matching the christmas query, excluding the documents with ids provided in the array.

scope.type=composite.operator

The operator to use to combine the scopes. Allowed values:

AND
A document must be present in all scopes to be selected.
OR
A document must be present in at least one scope to be selected.
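
For example, an OR composition selecting documents that match either of two queries could look like the following sketch (the queries themselves are illustrative):

{
  "type": "composite",
  "operator": "OR",
  "scopes": [
    { "type": "byQuery", "query": "christmas" },
    { "type": "byQuery", "query": "easter" }
  ]
}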

scope.type=composite.scopes

An array of scopes to compose. Scopes of any type can be used here, including the composite and complement ones.

labels

Parameters in the labels section determine the characteristics of labels Lingo4G will select for analysis. Parameters in this section are divided into a number of subsections; click on the property names to go to the relevant documentation.

{
  "maxLabels": 1000,
  "source":        { ... }, // which fields to load labels from
  "surface":       { ... }, // textual properties of labels
  "frequencies":   { ... }, // frequency constraints for labels
  "probabilities": { ... }, // probability-based boosting and suppression of labels
  "scorers":       { ... }, // label scoring settings
  "arrangement":   { ... }  // label clustering settings
}

labels.maxLabels

Sets the maximum number of labels Lingo4G should select for analysis.

labels.source

Options in the source section determine which feature fields Lingo4G will use as the source of labels for analysis.

labels.source.fields

The array specifying the feature fields to use as the source of labels for analysis. Each element of the array must be a JSON object with the following properties:

name
Feature field name. Names of the feature fields have the form <source-field-name>$<extractor-key>. In most configurations, the extractor key would be phrases, so the typical feature names would be similar to: title$phrases, content$phrases.
weight
Weight of the field; optional, 1.0 if not provided. If the weight is not equal to 1.0, for example 2.0, labels coming from the field will be two times more likely to appear as cluster labels.

A typical fields array declaration would be similar to:

"fields": [
  { "name": "title$phrases", "weight": 2.0 },
  { "name": "summary$phrases" },
  { "name": "description$phrases" }
]

If the fields array is empty or not provided, Lingo4G will use all available feature fields with weight 1.0.

labels.surface

The surface section determines the textual properties of labels Lingo4G will select for analysis, such as the number of words or promotion of capitalized labels.

The surface section contains the following parameters:

{
  "exclude": [],
  "minWordCount": 1,
  "maxWordCount": 8,
  "minCharacterCount": 4,
  "minWordCharacterCountAverage": 2.9,
  "preferredWordCount": 2.5,
  "preferredWordCountDeviation": 2.5,
  "singleWordLabelWeightMultiplier": 0.5,
  "capitalizedLabelWeight": 1.0,
  "acronymLabelWeight": 1.0,
  "uppercaseLabelWeight": 1.0
}

labels.surface.exclude

Labels to exclude from analysis. This option is an array of elements of two types:

  • References to static dictionaries defined in the dictionaries section. Using the reference elements you can decide which of the static dictionaries to apply for the specific analysis request.
  • Ad-hoc dictionaries defined in place. You can use the ad-hoc dictionary element to include some extra entries not present in the statically declared dictionaries.

Each element of the array must be an object with the type property and other type-dependent properties. The following types are supported:

project

A reference to the static dictionary defined in the dictionaries section. The dictionary property must contain the key of the static dictionary you are referencing.

Typical object of this type will be similar to:

"exclude": [
  { "type": "project", "dictionary": "default" },
  { "dictionary": "extensions" }
]

Tip: The default value of the type property is project, so it can be omitted as in the second array element above.

simple

Ad-hoc definition of a simple dictionary. The object must contain the entries property with a list of simple dictionary entries. File-based ad-hoc dictionaries are not allowed.

Typical ad-hoc simple dictionary element will be similar to:

"exclude": [
  {
    "type": "simple",
    "entries": [
      "design narrative",
      "* rationale"
    ]
  }
]

For complete entry syntax specification, see the simple dictionary type documentation.

regex

Ad-hoc definition of a regular expression dictionary. The object must contain the entries property with a list of regular expression dictionary entries. File-based ad-hoc dictionaries are not allowed.

Typical ad-hoc regular expression dictionary element will be similar to:

"exclude": [
  {
    "type": "regex",
    "entries": [
      "(?i)year\\\b\\\d+"
    ]
  }
]

Entries of regular expression dictionaries are expensive to parse and apply, so use the simple dictionary type whenever possible.

In a realistic use case you will likely combine static and ad-hoc dictionaries to exclude both the predefined and user-provided labels from analysis, as shown in the following example.

"exclude": [
  {
    "dictionary": "default"
  },
  {
    "type": "simple",
    "entries": [
      "design narrative",
      "* rationale"
    ]
  }
]

labels.surface.minWordCount

The minimum number of words all labels must have, default: 1.

labels.surface.maxWordCount

The maximum number of words all labels can have, default: 8.

labels.surface.minCharacterCount

The minimum number of characters each label must have, default: 4.

labels.surface.minWordCharacterCountAverage

The minimum average number of characters per word each label must have, default: 2.9.

labels.surface.preferredWordCount

The preferred label length in words, default: 2.5. The strength of the preference is determined by labels.surface.preferredWordCountDeviation.

Fractional preferred word counts are allowed. For example, a preferred word count of 2.5 will result in labels of length 2 and 3 being equally preferred; a value of 2.2 will prefer two-word labels more than three-word ones.

labels.surface.preferredWordCountDeviation

Determines how far Lingo4G is allowed to deviate from labels.surface.preferredWordCount. A value of 0.0 allows no deviation: all labels must have the preferred length. Larger values allow more and more deviation, with a value of, for example, 20.0 meaning almost no preference at all.

When preferredWordCountDeviation is 0.0 and the fractional part of preferredWordCount is 0.5, the only allowed label lengths will be the two integers closest to the preferred value. For example, if preferredWordCountDeviation is 0.0 and preferredWordCount is 2.5, Lingo4G will create only labels consisting of 2 or 3 words. If the fractional part of preferredWordCount is other than 0.5, only the closest integer label length will be preferred.

labels.surface.singleWordLabelWeightMultiplier

Sets the amount of preference Lingo4G should give to one-word labels. The higher the value of this parameter, the more clusters described with single-word labels Lingo4G will produce. A value of 1.0 means no special preference for one-word labels; a value of 0.0 will remove one-word labels entirely.

labels.surface.capitalizedLabelWeight

Sets the amount of preference Lingo4G should give to labels starting with a capital letter and having all other letters in lower case. The higher the value of this parameter, the stronger the preference. A value of 1.0 means no special preference; a value of 0.0 will remove labels starting with a capital letter completely.

labels.surface.acronymLabelWeight

Sets the amount of preference Lingo4G should give to labels containing acronyms. Lingo4G will assume that a label contains an acronym if any of the label's words consists of 50% or more upper-case letters. Non-letter characters are not counted towards the total character count; the acronym must have more than one letter character.

In light of the above definition, the following tokens will be treated as acronyms: mRNA, I.B.M., pH, p-N. The following tokens will not be treated as acronyms: high-Q, 2D.

The higher the value of this parameter, the stronger the preference. A value of 1.0 means no special preference; a value of 0.0 will remove labels containing acronyms completely.

labels.surface.uppercaseLabelWeight

Sets the amount of preference Lingo4G should give to labels containing at least one upper-case letter. The higher the value of this parameter, the stronger the preference. A value of 1.0 means no special preference; a value of 0.0 will completely remove labels containing upper-case letters.

labels.frequencies

The labels.frequencies section determines the document or term frequency constraints that must be met by the labels selected for analysis.

The frequencies section contains the following parameters:

{
  "minAbsoluteDf": 2,
  "minRelativeDf": 0.02,
  "maxRelativeDf": 0.1,
  "truncatedPhraseThreshold": 0.2
}

labels.frequencies.minAbsoluteDf

Sets the absolute minimum number of documents each label should appear in. For example, if minAbsoluteDf is 10, each label selected by Lingo4G for analysis will appear in at least 10 documents.

labels.frequencies.minRelativeDf

Sets the minimum number of documents each label should appear in, relative to the number of documents selected for analysis. For example, if the document selection query matched 20000 documents and minRelativeDf is 0.0005, Lingo4G will not select labels appearing in fewer than 20000 * 0.0005 = 10 documents.

labels.frequencies.maxRelativeDf

Sets the maximum number of documents each label can appear in, relative to the number of documents selected for analysis. For example, if the document selection query matched 20000 documents and maxRelativeDf is 0.2, Lingo4G will not select labels appearing in more than 20000 * 0.2 = 4000 documents.

labels.frequencies.truncatedPhraseThreshold

Controls the removal of truncated labels, default: 0.2. If two phrases sharing a common prefix or suffix, such as Department of Computer and Department of Computer Science, have similar term frequencies, it is likely that the shorter one is a truncated variant and should be suppressed in favor of the longer one. To increase the strength of truncated label elimination (to get fewer truncated labels), increase the threshold.

The truncatedPhraseThreshold determines the relative difference between the term frequencies of the longer and the shorter label beyond which the shorter label will not be removed in favor of the longer one. For the sake of example, let us assume that the label Department of Computer has 1000 occurrences and Department of Computer Science has 900 occurrences, a relative difference of (1000 - 900) / 1000 = 0.1. For truncatedPhraseThreshold values equal to or greater than 0.1, Department of Computer will be removed in favor of the non-truncated longer label. For threshold values lower than 0.1, both phrases will be considered during label selection.

labels.probabilities

The probabilities section controls boosting and suppression of labels based on the occurrence probabilities associated with them. This process consists of two aspects:

  • Scope-to-collection probability ratio scoring. You can use this mechanism to promote labels that are more probable to occur in the subset of documents being analyzed than in the whole collection. By promoting labels with high scope-to-collection probability ratios, you will boost labels that are specific to the documents in scope and unspecific to the documents outside of the analysis scope. If needed, you can also do the opposite: promote labels that are more probable to occur outside of the currently analyzed documents. Parameters controlling probability-ratio-based scoring have the probabilityRatio prefix.
  • Application of collection-specific stop labels. You can use this mechanism to suppress meaningless labels discovered during indexing. Parameters controlling the application of collection-specific stop labels have the autoStopLabel prefix.

The probabilities section contains the following parameters:

{
  "probabilityRatioPreference": "PREFER_SCOPE_SPECIFIC",
  "probabilityRatioThreshold": 1.5,
  "probabilityRatioPreferenceStrength": 0.0,
  "probabilityRatioMaxRelativeScopeSize": 0.5,
  "autoStopLabelRemovalStrength": 0.35,
  "autoStopLabelMinCoverage": 0.4
}

labels.probabilities.probabilityRatioPreference

Determines whether probability ratio scoring will prefer scope-specific or scope-unspecific labels, default: PREFER_SCOPE_SPECIFIC.

PREFER_SCOPE_SPECIFIC
Lingo4G will prefer labels that are more probable in the documents in-scope than in the whole collection.
PREFER_SCOPE_UNSPECIFIC
Lingo4G will prefer labels that are less probable in the documents in-scope than in the whole collection.

labels.probabilities.probabilityRatioThreshold

Sets the probability ratio threshold that Lingo4G will use to discard non-compliant labels, default: 1.5. Behavior of this parameter depends on the value of probabilityRatioPreference:

PREFER_SCOPE_SPECIFIC

The probabilityRatioThreshold parameter determines the minimum value for the scope-to-collection probability that each label must have. Labels with lower probability ratios will be discarded.

For example, if probabilityRatioThreshold is 2.0, only labels that are at least 2 times more probable in the documents in scope than in the whole collection will be preserved.

PREFER_SCOPE_UNSPECIFIC

The probabilityRatioThreshold parameter determines the minimum value for the collection-to-scope probability that each label must have. Labels with lower probability ratios will be discarded.

For example, if probabilityRatioThreshold is 2.0, only labels that are at least 2 times more probable in the whole collection than in the documents in scope will be preserved.

Please note that setting a very high value of this parameter may result in an empty list of labels.

labels.probabilities.probabilityRatioPreferenceStrength

Determines how strongly Lingo4G should boost labels meeting the required probabilityRatioThreshold, default: 0. A value of 0 means that Lingo4G will only discard labels whose probability ratio does not meet the threshold. For values larger than 0, Lingo4G will additionally promote the labels proportionally to their probability ratio. The larger the value of this parameter, the stronger the promotion of high-probability-ratio labels.

Note: When the processing scope is determined by a query applied to feature fields, such as title:christmas, and probabilityRatioPreferenceStrength is larger than 0, Lingo4G will promote labels containing the word christmas. This effect may be undesirable and for this reason the default value of this parameter is 0.

labels.probabilities.probabilityRatioMaxRelativeScopeSize

Determines the maximum number of documents in scope, relative to the total number of documents, for which probability ratio scoring can be applied. Default: 0.5.

Probability ratio scoring works very well for small subsets of the whole collection. However, when analyzing a large subset, such as 50% of all documents, not many labels will be significantly more probable in the subset than in the whole collection. For this reason, Lingo4G will gradually decrease the strength of probability ratio scoring when the number of documents in scope approaches probabilityRatioMaxRelativeScopeSize. For scope sizes larger than this parameter, probability ratio scoring will not be applied at all.

labels.probabilities.autoStopLabelRemovalStrength

Determines the strength of the automatic removal of meaningless labels, default: 0.35. The larger the value, the larger portion of the stop labels file will be applied during analysis. If autoStopLabelRemovalStrength is 0.0, the automatically discovered stop labels will not be applied; if the value is 1.0, all labels found in the stop labels file will be suppressed.

labels.probabilities.autoStopLabelMinCoverage

Defines the minimum confidence value the automatically discovered stop label must have in order to be applied during analysis, default: 0.4. Lowering autoStopLabelMinCoverage to 0.0 will cause Lingo4G to apply all stop labels found in the stop labels file. Setting a fairly high value, such as 0.9, will apply only the most authoritative stop labels.

labels.scorers

The scorers section controls the process of selecting labels for analysis.

The scorers section contains the following parameters:

{
  "randomRatio": 0.1,
  "randomSeed": 0
}

labels.scorers.randomRatio

Controls the fraction of labels chosen at random, relative to maxLabels. If this parameter is 0, all selected labels will be the top-scoring ones. If this parameter is larger than zero, a fraction of 1 - randomRatio of the selected labels will be the top-scoring ones, while the remaining randomRatio fraction will be selected at random, with the probability of selection being proportional to the label's score. For example, with maxLabels of 1000 and randomRatio of 0.1, Lingo4G will select the 900 top-scoring labels plus 100 labels drawn at random.

labels.scorers.randomSeed

The random seed to be used for randomized label selection, 0 by default.

labels.arrangement

This section controls label clustering. Click on the property names to go to the relevant description.

{
  "enabled": false,
  "algorithm": {
    "type": "ap",
    "ap": {
      "softening": 0.9,
      "inputPreference": 0.0,
      "preferenceInitializer": "NONE",
      "preferenceInitializerScaling": 1.0,
      "maxIterations": 2000,
      "minSteadyIterations": 100,
      "damping": 0.9,
      "minPruningGain": 0.3,
      "threads": "auto"
    }
  },
  "relationship": {
    "type": "cooccurrences",
    "similarityWeighting": "INCLUSION",
    "threads": "auto"
  }
}

labels.arrangement.enabled

If true, Lingo4G will attempt to arrange the selected labels into clusters.

labels.arrangement.algorithm

Determines and configures the algorithm used to cluster labels. The algorithm is selected by the type property; currently, the only available algorithm is ap, Affinity Propagation.

labels.arrangement.algorithm.type

Determines the label clustering algorithm to use. Currently, the only supported value is ap, which corresponds to the Affinity Propagation clustering algorithm.

labels.arrangement.algorithm.ap

This section contains parameters specific to the Affinity Propagation label clustering algorithm.

labels.arrangement.algorithm.ap.softening

Determines the amount of internal structure to generate for large label clusters. A value of 0 will keep the internal structure to a minimum; the resulting cluster structure will most of the time consist of flat groups of labels. As softening increases, larger clusters will get split into smaller, connected subclusters. Values close to 1.0 will produce the richest internal structure of clusters.

You can use the Experiments window of Lingo4G Explorer to visualize the impact of softening on various properties of the cluster tree.

labels.arrangement.algorithm.ap.inputPreference

Determines the size of the clusters. Lowering the input preference below the default value of 0 will cause Lingo4G to produce larger clusters. Increasing input preference above 0 will make the clusters smaller. Note that in practice positive values of input preference will rarely be useful as they will increase the number of unclustered labels.

You can use the Experiments window of Lingo4G Explorer to visualize the impact of input preference on the number and size of label clusters.

labels.arrangement.algorithm.ap.preferenceInitializer

Determines how label preference values will be initialized, default: NONE. The higher a label's preference value, the more likely the label is to be chosen as the exemplar of a label cluster.

The following values are available:

NONE
Preference values for all labels will be set to zero.
DF
The label's preference value will be set to the logarithm of the label's document frequency.
WORD_COUNT
The label's preference value will be set to the number of the label's words.

Please also see preferenceInitializerScaling, which can invert the interpretation of label preference values.

labels.arrangement.algorithm.ap.preferenceInitializerScaling

Determines the multiplier to use for the base preference values determined by preferenceInitializer, default: 1.

Negative values of this parameter will invert the preference. For example, if preferenceInitializer is WORD_COUNT, a positive preferenceInitializerScaling will prefer longer labels as label cluster exemplars, while a negative preferenceInitializerScaling will prefer shorter labels.

labels.arrangement.algorithm.ap.maxIterations

The maximum number of Affinity Propagation clustering iterations to perform.

labels.arrangement.algorithm.ap.minSteadyIterations

The minimum number of Affinity Propagation iterations during which the clustering does not change required to assume that the clustering process is complete.

labels.arrangement.algorithm.ap.damping

The value of Affinity Propagation damping factor to use.

labels.arrangement.algorithm.ap.minPruningGain

The minimum estimated relationship pruning gain required to apply relationship matrix pruning before clustering. Pruning may reduce the time of clustering for dense relationship matrices at the cost of a memory usage increase of about 60%.

labels.arrangement.algorithm.ap.threads

The number of concurrent threads to use to compute label clusters. The default value is half of the available CPU cores.

labels.arrangement.relationship

Configures the kind of label-label relationship (similarity measure) to use during clustering.

labels.arrangement.relationship.type

The type of label-label relationship to use. Currently only one value is supported, cooccurrences, which computes similarity between labels based on how frequently they co-occur in the specified co-occurrence window.

labels.arrangement.relationship.type=cooccurrences

Computes similarity between labels based on how frequently they co-occur in the specified co-occurrence window. A number of binary similarity weighting schemes, configured using the similarityWeighting parameter, can be applied to raw co-occurrence counts to arrive at the final similarity values.

"labels.arrangement.relationship.type=cooccurrences.cooccurrenceWindowSize"

Sets the width of the window (in words) in which label co-occurrences will be counted. For example, with the cooccurrenceWindowSize of 32, Lingo4G will record that two labels co-occur if they are found in the input text no farther than 31 words apart.

labels.arrangement.relationship.type=cooccurrences.cooccurrenceCountingAccuracy

Sets the maximum percentage of documents to examine when computing label co-occurrences. The percentage is relative to the total number of documents in the index regardless of the number of documents being actually clustered.

For the sake of example, let us assume that cooccurrenceCountingAccuracy is set to 0.1 and the index has 1 million documents. When clustering the whole index, Lingo4G will examine a sample of 100k documents to compute label co-occurrences. When clustering a subset of the index consisting of 50k documents, Lingo4G will examine all 50k documents when counting co-occurrences.

If your index contains on the order of hundreds of thousands or millions of documents, you can set cooccurrenceCountingAccuracy to some low value, such as 0.05 or 0.02, to speed up clustering. On the other hand, if your index contains a fairly small number of documents (100k or fewer), you may want to increase the co-occurrence counting accuracy to 0.4 or more for more accurate results.
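
Putting these parameters together, a complete relationship section could look like the sketch below; the window size of 32 and the accuracy of 0.1 simply repeat the values used in the examples above:

"relationship": {
  "type": "cooccurrences",
  "similarityWeighting": "INCLUSION",
  "cooccurrenceWindowSize": 32,
  "cooccurrenceCountingAccuracy": 0.1,
  "threads": "auto"
}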

labels.arrangement.relationship.type=cooccurrences.similarityWeighting

Determines the binary similarity weighting to apply to raw label co-occurrence counts to compute the final similarity values. In most cases, the RR, INCLUSION and BB weightings will be most useful.

The CONTEXT_* family of weightings computes similarities between entire rows of the co-occurrence matrix rather than individual labels. As a result, the similarity will reflect "second-order" co-occurrences: labels co-occurring with similar sets of other labels will be deemed similar. Use the CONTEXT_* weightings with care; they may produce meaningless clusters if there are many low-frequency labels selected for the analysis.

The complete list of supported values of this parameter is the following:

RR
Russel-Rao similarity. Similarity values will be proportional to the raw co-occurrence counts. The RR weighting creates rather large clusters and selects frequent labels as cluster label exemplars. Cluster size: large, high variance. Exemplars: high-DF labels.

INCLUSION
Inclusion coefficient similarity, emphasizes connections between labels sharing the same words, for example Mac OS and Mac OS X 10.6. Cluster size: large, high variance. Exemplars: high-DF labels.

LOEVINGER
The inclusion coefficient corrected for chance. Cluster size: medium. Exemplars: medium-DF labels.

BB
Braun-Blanquet similarity. Maximizes similarity between labels having similar numbers of occurrences. Promotes lower-frequency labels as cluster exemplars. Cluster size: rather small, low variance. Exemplars: low-DF labels.

OCHIAI
Ochiai coefficient, binary cosine. Cluster size: small. Exemplars: low-DF labels.

DICE
Dice coefficient. Cluster size: small. Exemplars: low-DF labels.

YULE
Yule coefficient. Cluster size: small, low variance. Exemplars: low-DF labels.

CONTEXT_INNER_PRODUCT
Inner product of the rows of the co-occurrence matrix. Cluster size: medium, high variance. Exemplars: high-DF labels.

CONTEXT_COSINE
Cosine distance between the rows of the co-occurrence matrix. Cluster size: small. Exemplars: low-DF labels.

CONTEXT_PEARSON
Pearson correlation between the rows of the co-occurrence matrix. Cluster size: small. Exemplars: low-DF labels.

CONTEXT_RR
Russel-Rao similarity computed between rows of the co-occurrence matrix. Cluster size: very large. Exemplars: high-DF labels.

CONTEXT_INCLUSION
Inclusion coefficient computed between rows of the co-occurrence matrix. Cluster size: very large. Exemplars: high-DF labels.

CONTEXT_LOEVINGER
Chance-corrected inclusion coefficient computed between rows of the co-occurrence matrix. Cluster size: small. Exemplars: medium-DF labels.

CONTEXT_BB
Braun-Blanquet similarity computed between rows of the co-occurrence matrix. Cluster size: small, low variance. Exemplars: low-DF labels.

CONTEXT_OCHIAI
Binary cosine coefficient computed between rows of the co-occurrence matrix. Cluster size: medium. Exemplars: medium-DF labels.

CONTEXT_DICE
Dice coefficient computed between rows of the co-occurrence matrix. Cluster size: medium. Exemplars: medium-DF labels.

CONTEXT_YULE
Yule similarity coefficient computed between rows of the co-occurrence matrix. Cluster size: small, low variance. Exemplars: medium-DF labels.

You can use the Experiments window of Lingo4G Explorer to visualize the impact of similarity weighting on various properties of the cluster tree.

labels.arrangement.relationship.type=cooccurrences.threads

The number of threads to use to compute the similarity matrix.

documents

Parameters in the documents section configure the processing Lingo4G should apply to the documents in scope. Currently, the only available processing is arranging documents into clusters based on their content. For retrieval of the actual content of documents, please see the output section.

documents.arrangement

Parameters in this section control document clustering. A typical arrangement section is shown below. Click on the property names to go to the relevant documentation.

{
  "enabled": false,
  "maxDocuments": 10000,

  "algorithm": {
    "type": "ap",
    "ap": {
      "inputPreference": 0.0,
      "maxIterations": 2000,
      "minSteadyIterations": 100,
      "damping": 0.9,
      "addSelfSimilarityToPreference": false
    },
    "maxClusterLabels": 3
  },

  "relationship": {
    "type": "mlt",
    "maxSimilarDocuments": 100,
    "maxQueryLabels": 20,
    "minQueryLabelOccurrences": 0,
    "threads": 8
  }
}

documents.arrangement.enabled

If true, Lingo4G will try to arrange the documents in scope into groups.

documents.arrangement.maxDocuments

Lingo4G will attempt to create a document arrangement only if the number of documents in scope is less than or equal to this parameter. Please note that certain document arrangement algorithms may not scale well to millions of documents.

If trimToMaxDocuments is set to true, arrangements will be generated for a maximum of maxDocuments, which might be a subset of all documents in scope.

documents.arrangement.trimToMaxDocuments

If set to true, Lingo4G will create document arrangements regardless of the number of documents in scope, but the number of documents taking part in the arrangement will be at most maxDocuments, which might be less than the total number of in-scope documents.

documents.arrangement.algorithm

This section determines and configures the document clustering algorithm to use.

documents.arrangement.algorithm.type

Determines the document clustering algorithm to use. Currently, the only supported value is ap, which corresponds to the Affinity Propagation clustering algorithm.

documents.arrangement.algorithm.ap

Configures the Affinity Propagation document clustering algorithm.

documents.arrangement.algorithm.ap.inputPreference

Influences the number of clusters Lingo4G will produce. When input preference is 0, the number of clusters will usually be higher than practical. Lower the input preference to a value of -1000 or less to get a smaller set of clusters.

documents.arrangement.algorithm.ap.addSelfSimilarityToPreference

If true, Lingo4G will prefer self-similar documents as cluster seeds, which may increase the quality of clusters. Setting addSelfSimilarityToPreference to true may increase the number of clusters, so you may need to lower inputPreference to keep the previous number of groups.

documents.arrangement.algorithm.ap.maxIterations

The maximum number of Affinity Propagation clustering iterations to perform.

documents.arrangement.algorithm.ap.minSteadyIterations

The minimum number of Affinity Propagation iterations during which the clustering does not change required to assume that the clustering process is complete.

documents.arrangement.algorithm.ap.damping

The value of Affinity Propagation damping factor to use.

documents.arrangement.algorithm.ap.minPruningGain

The minimum estimated relationship pruning gain required to apply relationship matrix pruning before clustering. Pruning may reduce the time of clustering for dense relationship matrices (built using large documents.arrangement.relationship.type=mlt.maxSimilarDocuments values), at the cost of a memory usage increase of about 60%.

documents.arrangement.algorithm.ap.threads

The number of concurrent threads to use to compute document clusters. The default value is half of the available CPU cores.

documents.arrangement.algorithm.maxClusterLabels

The maximum number of labels to use to describe a document cluster.

documents.arrangement.relationship

Configures the kind of document-document relationship (similarity measure) to use during clustering.

documents.arrangement.relationship.type

The type of document-document relationship to use. Currently only one value is supported, mlt, which refers to More-Like-This similarity.

documents.arrangement.relationship.type=mlt

Builds the document-document similarity matrix in the following way: for each document, take a number of labels that occur most frequently in the document and build a search query that is a disjunction (OR) of those labels. The top documents returned by the query are taken as the documents most similar to the document being processed.
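
Conceptually, for a document whose most frequent labels were, say, affinity propagation and damping factor, the generated similarity query would resemble the following Lucene-style disjunction (an illustration only; the actual query is built internally by Lingo4G against the feature fields):

title$phrases:"affinity propagation" OR summary$phrases:"damping factor"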

documents.arrangement.relationship.type=mlt.maxSimilarDocuments

The maximum number of similar documents to retrieve for each document.

documents.arrangement.relationship.type=mlt.maxQueryLabels

The maximum number of labels to use when building the similarity search query.

documents.arrangement.relationship.type=mlt.minQueryLabelOccurrences

The minimum number of occurrences a label must have in a document to be considered when building the similarity search query.

documents.arrangement.relationship.type=mlt.threads

The number of threads to use to execute similarity queries.

performance

The performance section provides settings for adjusting the accuracy vs. performance balance.
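
A minimal performance section could look like the sketch below. The threads and minPerSegmentDf values use the auto setting described below; maxPerSegmentDf of 0.5 is purely illustrative, and maxSubsetSizeForTermVectorScan of 0.1 corresponds to the roughly 10% threshold mentioned in the following sections:

{
  "threads": "auto",
  "minPerSegmentDf": "auto",
  "maxPerSegmentDf": 0.5,
  "maxSubsetSizeForTermVectorScan": 0.1
}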

performance.threads

Sets the number of threads to use for analysis. The default value is auto, which will set the number of threads to the number of CPU cores reported by the operating system. Alternatively, you can explicitly provide the number of threads to use.

If your index is stored on an HDD and is larger than the amount of RAM available for the operating system for disk caching, you may need to set the number of threads to 1 to avoid the performance penalty resulting from highly concurrent disk access. If your index is stored on an SSD drive, you can safely keep the "auto" value. See the storage technology requirements section for more details.

performance.minPerSegmentDf

Sets the per-index-segment minimum absolute document frequency threshold. Lingo4G will ignore all labels occurring fewer times than the specified threshold. In most cases, the recommended value for this parameter is auto, in which case Lingo4G will automatically determine the threshold value based on the clusters.minClusterSize parameter.

When setting minPerSegmentDf to a specific value, please note that the threshold applies to a part of the index, so with a large data set and index split into many segments, setting minPerSegmentDf to, for example, 10 will cause some labels with frequency larger than 10 to be ignored (because in all index segments their frequency was below 10).

This threshold only applies when clustering more than about 10% of the index, see performance.maxSubsetSizeForTermVectorScan. When clustering a small subset of the index, low-frequency labels will not be ignored.

performance.maxPerSegmentDf

Sets the per-index-segment maximum document frequency threshold, relative to the number of documents in the index segment. Lingo4G will ignore labels occurring in more than the provided percentage of the segment's documents. If your index contains a large number of high-frequency terms, lowering maxPerSegmentDf may speed up clustering of large portions of the index.

This threshold only applies when clustering more than about 10% of the index, see performance.maxSubsetSizeForTermVectorScan.

performance.maxSubsetSizeForTermVectorScan

Sets the relative number of clustered documents below which Lingo4G will not apply per-segment frequency thresholds. If the number of analyzed documents in relation to the total number of documents is lower than maxSubsetSizeForTermVectorScan, Lingo4G will not apply per-segment frequency thresholds defined in performance.minPerSegmentDf and performance.maxPerSegmentDf.

For the sake of example let us assume there are 1000 documents in the index and maxSubsetSizeForTermVectorScan is equal to 0.1. If the number of documents in scope is 100 or less, per-segment frequency thresholds will not be applied.

output

The output section configures the format and contents of the clustering results produced by Lingo4G. A typical output section is shown below. Click on the property names to go to the relevant documentation.

{
  "format": "json",

  // What information to output for each label
  "labels": {
    "enabled": true,
    "labelFormat": "LABEL_CAPITALIZED",

    // The output of label's top-scoring documents
    "documents": {
      "enabled": false,
      "maxDocumentsPerLabel": 10,
      "outputScores": false
    }
  },

  // What information to output for each document
  "documents": {
    "enabled": false,
    "onlyWithLabels": true,
    "onlyAssignedToLabels": false,

    // The output of labels found in the document
    "labels": {
      "enabled": false,
      "maxLabelsPerDocument": 20,
      "minLabelOccurrencesPerDocument": 2
    },

    // The output of documents' content
    "content": {
      "enabled": false,
      "fields": [
        {
          "name": "title",
          "maxValues": 3,
          "maxValueLength": 160
        }
      ]
    }
  }
}

output.format

Sets the format of the clustering results. The following formats are currently supported:

xml
Custom Lingo4G XML format.
json
Custom Lingo4G JSON format.
excel
MS Excel XML, also possible to open in LibreOffice and OpenOffice.
custom-name
A custom XSL transform stylesheet that transforms the Lingo4G XML format into the final output. The stylesheet must be present at L4G_HOME/resources/xslt/custom-name.xsl

output.labels

This section controls the output of labels selected by Lingo4G.

output.labels.enabled

Set to true to output the selected labels, default: true.

output.labels.labelFormat

Determines how the final labels should be formatted. The following values are supported:

ORIGINAL
The label will appear exactly as in the input text.
LOWERCASE
The label will be lower-cased.
LABEL_CAPITALIZED
The label will have its first letter capitalized, unless the first word contains other capital letters (such as mRNA).

output.labels.documents

This section controls whether and how to output matching documents for each selected label.

output.labels.documents.enabled

Set to true to output matching documents for each label, default: false.

output.labels.documents.maxDocumentsPerLabel

Controls the maximum number of matching documents to output per label, default: 10. If more than maxDocumentsPerLabel documents match a label, the top-scoring documents will be returned.

output.labels.documents.outputScores

Controls whether to output document-label matching scores for each document, default: false.

output.documents

Controls whether Lingo4G should output the contents of documents being analyzed.

output.documents.enabled

Set to true to output the contents of the analyzed documents, default: false.

output.documents.onlyWithLabels

If set to true, only documents that contain at least one of the selected labels will be output; default: true.

output.documents.onlyAssignedToLabels

If set to true, only top-scoring documents will be output, default: false. If this parameter is true and some document did not score high enough to be included among the output.labels.documents.maxDocumentsPerLabel top-scoring documents for any label, the document will be excluded from the output.

output.documents.labels

This section controls the output of labels contained in individual documents.

output.documents.labels.enabled

If true, each document emitted to the output will also contain a list of those selected labels that are contained in the document; default: false.

output.documents.labels.maxLabelsPerDocument

Sets the maximum number of labels per document to output. By default, Lingo4G will output all of the document's labels. If some lower maxLabelsPerDocument is set, Lingo4G will output up to the provided number of labels, starting with the ones that occur in the document most frequently.

output.documents.labels.minLabelOccurrencesPerDocument

Sets the minimum number of occurrences of a label in a document required for the label to be included next to the document. By default, the limit is 0, which means Lingo4G will output all labels. Set the limit to some higher value, such as 1 or 2, to output only the most frequent labels.

output.documents.content

This section controls the output of the content of each document.

output.documents.content.enabled

If true, the content of each document will be included in the output; default: false.

output.documents.content.fields[]

The array of fields to output. Each entry in the array must be an object with the following properties:

name
The name of the field to include in the output
maxValues
The maximum number of values to return for multi-value fields. Default: 3.
maxValueLength
The maximum number of characters to output for a single value of the field. Default: 160.
highlighting

Context highlighting configuration. If active, the value of the field is filtered to show the text surrounding labels from the current criteria query or terms matching the scope query.

The actual matches (labels or query terms) will be surrounded with a prefix and suffix string configured at the field level.

Highlighting configuration is an object with the following properties.

criteria
Extract the context and highlight labels in the current criteria. Default: false.
scope
Extract the context and highlight terms in the current scope query. Default: false.
truncationMarker
A string prepended or appended to the output if it is truncated (does not start or end at the full content of the field). Default: the horizontal ellipsis mark (Unicode character 0x2026).
startMarker
A string inserted before any highlighted fragment. The string can contain a special substitution sequence %s which is replaced with numbers between 0 and 9, indicating different kinds of highlighted regions (scope, criteria). Default: ⁌%s⁍ (the default pattern uses a pair of rarely used Unicode characters 0x204C and 0x204D).
endMarker
A string inserted after any highlighted fragment. The string can contain a special substitution sequence %s which is replaced with numbers between 0 and 9, indicating different kinds of highlighted regions (scope, criteria). Default: ⁌\%s⁍.

If the criteria and scope are undefined, or if no fragment of the source field triggers a match, the value of the field is returned as if no highlighting was performed.

When highlighting is active, field configuration property maxValues corresponds to the number of fragments to return, while maxValueLength denotes each fragment's context (window size) around the matching terms.

Heads up!

Highlighted regions can nest, overlap or both. To make HTML rendering easier, any overlap conflicts are corrected (tags are closed and reopened) to make the output a proper tree structure.

While it is possible to change the default highlighting markers, it should be done with caution. The Explorer assumes the above default patterns and replaces them with application-specific HTML.

A typical content fields specification may be similar to:

"fields": [
  { "name": "title",
    "highlighting": {
      "criteria": true,
      "scope": true
    }
  },
  { "name": "abstract", 
    "maxValues": 3, 
    "maxValueLength": 160,
    "highlighting": {
      "criteria": true,
      "scope": true
    }
  },
  { "name": "tags",
    "highlighting": {
      "criteria": false,
      "scope": false
    }
  }
]

summary

The summary section contains parameters for enabling the computation of various metrics describing the analysis results.

summary.labeledDocuments

When true, Lingo4G will compute the number of documents in the analysis scope that contain at least one of the selected labels. This metric can be used to determine how many documents were "covered" by the selected labels. Default: false.
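
For example, a summary section enabling this metric would simply be:

{
  "labeledDocuments": true
}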

debug

A number of switches useful for troubleshooting the analysis.

debug.logCandidateLabelPartialScores

When true, partial scores of candidate labels will be logged on the DEBUG level. Default: false.
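
For example, a debug section enabling this switch would be:

{
  "logCandidateLabelPartialScores": true
}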

Release notes

Version 1.2.0

16-01-2017

The 1.2.0 release introduces support for indexing PDF, Word and HTML files, adds automatic handling of concurrency during indexing, improves highlighting of complex queries and adds a number of other smaller improvements and bug fixes.

Compatibility

Reindexing
Recommended. Lingo4G 1.2.0 will work with indexes created with Lingo4G 1.1.0, although certain changes to stop label extraction algorithms may bring some label improvements after reindexing.
Project descriptors

Updates recommended. The indexer.stopLabelExtractor.threads and indexer.stopLabelExtractor.accuracy attributes have been deprecated. They should be removed from project descriptors (their values will be ignored and will trigger a warning on the console).

If used, the indexer.threads attribute should be set to auto (remove any fixed thread count override if you have one). This enables automatic thread management, which adjusts to the hardware automatically (HDDs, SSDs, number of CPU cores).

Custom document sources
Updates may be required if your custom document source used some of the public Lingo4G utility classes (for parallel processing, for example) or Google Guava, which has been updated to a newer version.

New features

PDF/Word document source example

A new document source has been added that automatically extracts text from PDF, Microsoft Word, OpenOffice and other file formats. File format detection and text extraction are performed using the Apache Tika library. Full source code is included, so the document source can be used out of the box to index local files or modified to suit specific needs.

Automatic concurrency in indexing

The indexer will try to automatically maximize throughput, taking into account the number of available CPU cores and the speed of the drive(s) used for indexing. The threads attribute should be set to auto for this feature to work.

When the automatic adjustment isn't a good fit or lower CPU consumption is required, the global system property l4g.concurrency can be set at startup to override the defaults (using the -Dl4g.concurrency=... syntax), or the threads attribute can be modified directly in the project descriptor (this is discouraged).

See the threads attribute's description for the permitted syntax of threading specification.

Cell highlighting in document clusters treemap

Starting with the 1.2.0 release you can have Lingo4G Explorer highlight same-color or same-label cells in the document clusters treemap.

Improvements

Scope highlighting

Scope highlighting has been rewritten from scratch and should now fully support phrase and fuzzy queries. Previously, any term of a phrase query would be highlighted; after this change, only the terms actually involved in the phrase (or a matching term span) receive the highlight.

Lucene upgrade

Apache Lucene has been upgraded to version 6.3.0.

New analysis scope types

The 1.2.0 release adds three scope types that make it possible to use complex criteria for document selection: by-id selection, complementary and composite scope definitions.

Stop label extraction

Improved detection and reporting of nonsensical stop label extraction conditions. Automatically detected stop labels may change as a result of this adjustment.

Phrase normalization feedback

Improved console progress feedback (indexing phase): it now shows a progress bar on larger data sets.

Guava 20.0

The Google Guava dependency has been updated to version 20.0.

Bugs

Exceptions when indexing Unicode

Documents containing non-ASCII Unicode characters could cause unhandled exceptions during indexing. This is a regression affecting versions 1.1.0 and 1.1.1.

Inconsistent label selection between server restarts

In previous versions, labels selected for the same analysis scope could differ between Lingo4G REST API restarts and command-line analysis invocations. The 1.2.0 release fixes the issue, so that label selection results are always the same.

License information

The l4g version command did not display valid license information properly.

Label sorting broken

Labels were not sorted properly in Lingo4G Explorer. This regression bug was introduced in version 1.1.0 and is fixed in the 1.2.0 release.

Stats output broken

The stats command skipped index component statistics.

API changes

Output of labels and documents

The labels part of the response will contain the list property only when the output of labels is requested by setting the output.labels.enabled parameter to true. Otherwise, the list property will not be present.

Similarly, the documents part of the response will contain the list property only when the output of documents is requested by setting the output.documents.enabled parameter to true.
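
For example, a request fragment asking for labels but not documents could look like this (a sketch showing only the relevant parameters):

output: {
  "labels":    { "enabled": true },
  "documents": { "enabled": false }
}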

Version 1.1.1

2016-12-08

Version 1.1.1 fixes a major bug in fetching document content.

Compatibility

Reindexing
Not required. Lingo4G 1.1.1 will work with indexes created with Lingo4G 1.1.0.
Project descriptors
Updates not required. Version 1.1.1 does not change project descriptors.
Custom document sources
Updates not required.

Bug fixes

Content of incorrect documents fetched

In version 1.1.0, when document content was requested with the onlyWithLabels or onlyAssignedToLabels parameter set to true, incorrect content could be fetched for some documents.

Version 1.1.1 fixes this issue.

Version 1.1.0

2016-10-25

Version 1.1.0 improves conflation of different spelling variants of the same label, adds more control over heuristic English stemming, fixes a number of bugs and extends documentation.

Compatibility

Reindexing
Recommended. Lingo4G 1.1.0 will work with indexes created with Lingo4G 1.0.0, but reindexing is strongly recommended because of improvements in automatic label detection.
Project descriptors
Updates not required. Version 1.1.0 does not change project descriptors.
Custom document sources
Updates not required.

New features

Numeric ranges

Consistent support for numerics and numeric ranges in both the standard and complex query parser.

Stemming control

An option called useHeuristicStemming was added to disable heuristic stemming in the English analyzer.

Spelling variants

Phrase feature extraction has been improved to automatically detect and merge spelling variants of labels written with or without dashes, and as a compound word or multi-term phrase. For example, the following spelling variants are now unified:

fast boot, fast-boot, fastboot
web page, webpage, web-page
magical jelly bean, magical jellybean, magical jelly-bean

Relevance score

A query-relevance score attribute was added to each document in the document retrieval API.

Improvements

Documentation

We added documentation of default analyzers and their options.

Dictionaries cleanup

The default dictionaries have been cleaned up and renamed consistently. Example projects use the default dictionaries and, where applicable, additional project-specific dictionaries.

Licensing

Licenses will be reloaded automatically when no active licenses are found. This permits hot-swapping of licenses while the server is running.

Other internal cleanups

A number of other internal issues have been fixed.

Bugs

Small input crashes

A runtime exception could be thrown when analyzing small inputs.

Terminal crash

An exception could be thrown on non-updateable terminals.

Version 1.0.2

2016-10-19

Version 1.0.2 is a maintenance release that addresses minor software bugs.

Bugs

Terminal crash

An exception could be thrown on non-updateable terminals.

Version 1.0.1

2016-09-28

Version 1.0.1 is a maintenance release that addresses minor software bugs and documentation deficiencies.

Compatibility

Reindexing
Not required, version 1.0.1 will work with index created by version 1.0.0.
Project descriptors
Updates not required.
Custom document sources
Updates not required.

Improvements

Hiding zero-sized docs
in cluster treemap

Version 1.0.1 adds an option to hide zero-sized groups in the document cluster treemap.

Version 1.0.0

2016-09-22

Version 1.0.0 is the first official release of Lingo4G. It brings reworked and documented dictionary-based label filtering, improved label selection stability, and minor improvements to Lingo4G Explorer and the documentation.

Compatibility

Reindexing
Required. Lingo4G 1.0.0 updates the index storage format; indices created by the 0.11.x versions will not work with version 1.0.0.
Project descriptors
Updates required. Version 1.0.0 changes the way label dictionaries are defined and applied.
Custom document sources
Updates not required.

New features

Dictionaries

Version 1.0.0 introduces a common definition of label dictionaries that can be used, for example, to exclude specific labels from analysis. This release comes with two dictionary implementations: simple and efficient word-based matching, and more powerful but more expensive regular-expression-based matching. The dictionaries parameter documentation describes how to define your own dictionaries.

Additionally, the newly introduced dictionaries framework allows defining ad-hoc (per analysis request) dictionaries, which you can use to let users tune or add their own label exclusions without restarting the Lingo4G REST API server. Lingo4G Explorer comes with a simple implementation of this idea.

Improvements

Label selection stability improvements

In previous versions of Lingo4G, excluding a single label from analysis could trigger a cascade of changes to the label list, with many unrelated labels being removed and replaced. Version 1.0.0 improves label selection stability to prevent such situations.

Hash-based analysis ids

As of version 1.0.0, the REST API will use 64-bit hash strings as identifiers of asynchronously handled analyses. This will minimize the chances of getting stale analysis results in case Lingo4G REST API is restarted between initiating the analysis and fetching its results.

This change should not require any changes in the code of your application, unless it relies on the structure of the analysis results URL returned by the REST API in the Location header.

Partial results statistics

Version 1.0.0 changes the way the REST API reports processing progress. As of this release, the result of the /v1/analysis/{id} method will follow the structure of the complete analysis result returned by the /v1/analysis/{id}/result method. The difference between the two methods is that the former will only return processing progress information and certain label and document statistics, while the latter will return the complete analysis result.

Analysis status and parameters in output response

As of version 1.0.0, the analysis result response includes the processing status and parameters used to produce the analysis. These two pieces of data are especially useful for debugging the specific analysis result.

Version 0.11.0

2016-08-02

Version 0.11.0 improves the stability of label selection, adds more detailed performance logging and introduces working index versioning.

Compatibility

Reindexing
Required. Starting with version 0.11.0, Lingo4G will automatically determine whether the existing index is compatible with the version of Lingo4G you are running. Re-indexing is required for this feature to work.
Project descriptors
Updates not required.
Custom document sources
Updates required. Version 0.11.0 introduces improved APIs for progress reporting; custom document sources need to be updated to use those APIs.

Improvements

Progress and performance logging improvements

Version 0.11.0 comes with significantly improved reporting and logging of progress information. For each analysis request, logs will now contain a detailed breakdown of the performed tasks.

[Task]                               [Time]    [%]
Resolving selector query              129ms   3.2%
Fetching candidate labels          1s 651ms  40.7%
  TermVectorScan                   1s 619ms  39.9%
   @ Segments: 7
   @ Documents: 8,355
   @ Threads: 8
   @ Labels fetched: 7,994
   @ Speed: 5.16ki docs/s
Scoring candidate labels              214ms   5.3%
 @ Labels scored: 7,994
 @ Labels selected: 1,000
 @ Speed: 38.25ki labels/s
Counting co-occurrences            1s 298ms  32.0%
 @ Threads: 8
 @ Speed: 770 labels/s
Computing label similarities           22ms   0.5%
Clustering labels                     398ms   9.8%
 @ Similarity density: 20.84%
 @ Similarity pruning gain: 1.98%
 @ Similarity pruning time: 77ms
 @ Similarity used: original
 @ Iterations: 155 (7.8% of max)
Computing coverage                    348ms   8.6%
 @ Segments: 7
 @ Labels: 1,000
 @ Threads: 8
 @ Speed: 2.87ki labels/s

Working index versioning

Starting with version 0.11.0, Lingo4G will automatically determine whether the existing index is compatible with the version of Lingo4G you are running. If the index format is too old, you will need to re-index your data before you can run analyses.

Heads up!

When you run the Lingo4G 0.11.0 analyze, server or stats command with a working index created by a previous version, you will see the following message:

The current index is too old, reindex your data.

Please re-index your data to be able to run analyses with version 0.11.0.

Maximum indexed documents option

Since version 0.11.0 you can pass the --max-docs option to the index command to limit the number of documents to index.
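
For example, the following invocation indexes at most 10,000 documents of the example data set (the limit value is illustrative):

l4g index -p datasets/dataset-stackexchange --max-docs 10000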

Bug fixes

Label selection stability improvements

Prior versions might select different labels for the same set of parameters. This release ensures that the same set of labels is selected, regardless of the number of processing threads.

Version 0.10.2

2016-06-24

Version 0.10.2 fixes a critical bug in license validation routines.

Compatibility

Reindexing
Not required, version 0.10.2 will work with the index created by the 0.10.x releases.
Project descriptors
Updates not required.
Custom document sources
Updates not required.

Bug fixes

License validation

A bug has been fixed in license validation routines that could result in valid licenses being omitted.

Version 0.10.1

2016-06-17

Version 0.10.1 fixes a bug in presentation of the document cluster members in treemap view.

Compatibility

Reindexing
Not required, version 0.10.1 will work with the index created by the 0.10.x releases.
Project descriptors
Updates not required.
Custom document sources
Updates not required.

Bug fixes

Incorrect member count in document clusters

In version 0.10.0, Lingo4G Explorer could incorrectly report the number of members of document clusters in the treemap view. Version 0.10.1 fixes the issue.

Version 0.10.0

2016-06-16

Version 0.10.0 introduces highlighting of scope query and selected labels in document texts and more options for the document clusters treemap display in Lingo4G Explorer.

Compatibility

Reindexing
Required. Version 0.10.0 will work with indexes created by the 0.9.x releases, but highlighting will not work. For this reason, we highly recommend reindexing your project from scratch.
Project descriptors
The field content specification has changed: the maxTotalLength property has been removed. The defaults have been slightly adjusted to return shorter snippets.
Custom document sources
Recompilation required due to updated binary dependencies of Lingo4G.

New features

Label highlighting

Version 0.10.0 makes it possible to highlight occurrences of scope query and selected labels in the text of documents retrieved using ad-hoc document retrieval.

[Figures: a document with the Surface Pro and OneNote labels highlighted; configuration of scope and label highlighting]

Document clusters treemap configuration

Version 0.10.0 adds new features to the document clusters treemap, including coloring, sizing and labeling of document cells based on the selected document fields.

[Figure: document clusters treemap settings]

Improvements

Document clustering for subset of in-scope documents

Versions prior to 0.10.0 would refuse to apply document clustering when the scope contained more than maxDocuments documents. As of version 0.10.0, if you set the trimToMaxDocuments parameter to true, Lingo4G will proceed with clustering a subset of the in-scope documents of size maxDocuments.
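
A sketch of the two parameters involved (their exact location in the project descriptor is omitted here; the limit value is illustrative):

"maxDocuments": 10000,
"trimToMaxDocuments": true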

Label selection improvements

Version 0.10.0 simplifies label selection and improves its performance and memory footprint. An important change is the option to introduce a configurable amount of randomness into the label selection process, so that some of the less frequent and lower-scoring labels have a chance to be included in the analysis. The randomized label selection process is controlled by the newly-added randomRatio and randomSeed parameters. Please also see below for the API changes related to this improvement.

API changes

Removed and renamed parameters

As a result of label selection improvements, the following parameters have been removed:

  • analysis.labels.maxLabelsOverhead
  • labels.surface.partOfSpeechFiltering
  • labels.frequencies.minRelativeDfDeviation
  • labels.frequencies.maxRelativeDfDeviation
  • labels.cooccurrences.isolationThreshold
  • labels.cooccurrences.isolationThresholdWidth
  • labels.scorers.isolationRatioScorerWeight
  • labels.cooccurrences.maxOverlap
  • labels.cooccurrences.maxOverlapDeviation
  • labels.scorers.overlapRankScorerWeight
  • labels.scorers.childCountScorerWeight
  • labels.scorers.dfScorerWeight
  • labels.scorers.candidateLabelScorerWeight
  • debug.logBaseLabelPartialScores

The following parameters have been renamed:

Version 0.9.0

2016-03-31

Version 0.9.0 introduces label arrangements, major improvements to document indexing, many new features in Lingo4G Explorer and much improved documentation.

Compatibility

Reindexing
Required. Version 0.9.0 comes with major improvements to indexing that remove noisy labels and decrease the disk size of the index.
Project descriptors
Updates required: certain areas of the descriptor have been reorganized and a number of parameters removed.
Custom document sources
Updates not required.

New features

Label arrangement

Version 0.9.0 makes it possible to arrange related labels into clusters. Label clusters themselves can be organized into higher-level structures.

Apart from treemap-based presentation, Lingo4G Explorer can show label clusters as a textual list and as a graph.

New public data sets

Version 0.9.0 comes with support for two new public data sets:

  • Questions and answers from a StackExchange Q&A site, such as superuser.com.
  • Summaries of research projects funded by the US National Science Foundation and NASA between 2007 and 2015, as available from research.gov.

For more information, see the summary of example data sets.

Documentation updates

Version 0.9.0 comes with significantly more documentation, including conceptual overview of Lingo4G and description of Lingo4G Explorer. Minor documentation additions concern feature extractors and analysis result response syntax.

As of version 0.9.0, all practical examples in the documentation are based on the superuser.com StackExchange data set.

Parameter experiments in Lingo4G Explorer

Version 0.9.0 of Lingo4G Explorer adds the Experiments window, which you can use to investigate the impact of various parameter changes on the properties of the analysis result.

Improvements

Indexing improvements

Version 0.9.0 brings significant improvements in the document indexing phase, including:

  • Keeping numeric tokens in labels, configurable via a dedicated parameter.
  • Improved accounting of compound terms that should eliminate truncated labels, such as high-energy x [rays].
  • Normalization of various kinds of apostrophes.
  • Removal of globally frequent labels, configurable via a dedicated parameter.
  • Decreased disk size of the index.

curl command export

You can now obtain a curl command invocation that will fetch the analysis result data configured in the Lingo4G result export window.

Composite criteria in document retrieval

You can now retrieve the content of documents using composite criteria that allow building complex Boolean queries.

API changes

Document arrangement section reorganized
The document arrangement section of the descriptor has been reorganized to group the algorithm-specific parameters under a dedicated property. Lingo4G currently comes with one document clustering algorithm, Affinity Propagation, whose parameters are now available in a dedicated subsection.
scope section removed from result response
The scope section has been removed from the analysis result response; the documentsInScope property has been moved to the summary section of the output.

Version 0.8.0

2015-11-13

Version 0.8.0 improves the performance of document clustering introduced in version 0.7.0. Additionally, it brings a number of small improvements to Lingo4G Explorer.

Compatibility

Reindexing
Not required, index created by version 0.7.0 will work with version 0.8.0.
Project descriptors
Updates not required, descriptors created for version 0.7.0 will work with version 0.8.0.
Custom document sources
Updates not required.

Improvements

Faster document clustering

Version 0.8.0 adds multi-threaded document clustering. Additionally, in certain cases performance can be further improved by pruning the relationships matrix.

More export options

As of version 0.8.0, you can choose which document fields to output in the Excel/JSON/XML report. Additionally, you can opt to include documents without labels in the output.

[Figure: result export dialog]

Current label view as CSV

You can now copy the contents of the label view, including the added/removed/common status, to clipboard as CSV.

[Figure: copy label list as CSV]

Processing time details and estimates

Version 0.8.0 adds remaining-time estimates for long-running tasks. You can see a detailed breakdown of the processing time by hovering the mouse pointer over the total elapsed time statistic.

[Figure: processing times breakdown]

Version 0.7.0

2015-08-18

Version 0.7.0 is a major new release that adds experimental support for arranging and visualizing documents as flat non-overlapping clusters.

Compatibility

Reindexing
Not required, index created by version 0.6.x will work with version 0.7.0.
Project descriptors
Updates not required, descriptors created for version 0.6.x will work with version 0.7.0.
Custom document sources
Updates not required.
Java 8 required

As of version 0.7.0, Lingo4G requires Java version 8 or later to run.

New features

Document arrangement

Version 0.7.0 makes it possible to arrange documents into flat non-overlapping clusters. Please see the quick start video for an overview and the documents.arrangement configuration section for a brief description of the involved parameters.

[Figure: document clusters view in Lingo4G Explorer]

Version 0.6.0

2015-07-06

Version 0.6.0 is a major new release that brings improvements in document indexing, improves label selection and adds document content retrieval to Lingo4G REST API and Explorer application.

Compatibility

Reindexing
Required, index created by version 0.5.x will not work with version 0.6.0.
Project descriptors
Updates required. The 0.6.0 release removes a number of obsolete parameters. Please see the release notes for more details or contact Carrot Search for an updated project descriptor.
Custom document sources
Updates required. Version 0.6.0 updates a number of third-party dependencies and therefore the 0.5.x custom document sources may not work with version 0.6.0.

New features

Improved label selection

Version 0.6.0 improves the quality of label selection by introducing automatic discovery of collection-specific stop labels, accompanied by collection probability label scoring and significant improvements in document text tokenization.

Document content retrieval API

The 0.6.0 release introduces a REST API method for document content retrieval. Additionally, you can now browse the contents of documents in Lingo4G Explorer.

[Figure: document content view in Lingo4G Explorer]

Result export in Lingo4G Explorer

As of the 0.6.0 release, you can export the analysis result directly from Lingo4G Explorer and save it as an Excel, XML or JSON file.

[Figure: result export in Lingo4G Explorer]

New fields in IMDb and PubMed data sets

Version 0.6.0 parses more fields when indexing the IMDb and PubMed data sets. The new fields for IMDb are: country, rating, keywords, director and genre. The new fields for PubMed are: journal, author, keywords, date, journalName and subject.

Project descriptor changes

New output folder
The output from the analyze command is now saved to a dedicated directory called results (directly under the project's directory). Results were previously saved to the work directory, which is now reserved for internal use by the application.
TF/DF ratio scoring removed
Version 0.6.0 replaces TF/DF ratio scoring with automatic discovery of stop labels and probability ratio scoring. You will need to remove the minTfDfRatio and minTfDfRatioDeviation parameters from your project descriptors.
Original label format is now the default

Version 0.6.0 changes the default value of the labelFormat parameter from LABEL_CAPITALIZED to ORIGINAL to avoid confusion when tuning label surface scoring, such as acronymLabelWeight.

Bug fixes

Label filtering not applied for small scopes

Version 0.6.0 fixes a bug that prevented Lingo4G from applying label filtering (label surface and frequency parameters) when analyzing small subsets of the collection.

Version 0.5.0

2015-05-14

Version 0.5.0 is a major new release that adds an initial implementation of Lingo4G REST API along with a simple browser-based tuning application. Also, in preparation for further development, the 0.5.0 release restructures a number of the basic concepts behind Lingo4G and makes a number of backward-incompatible changes.

Compatibility

Reindexing
Not required, index created by version 0.4.1 will work with 0.5.0.
Project descriptors
Updates required. The 0.5.0 release significantly restructures certain areas of the project descriptor. Please see the release notes for more details or contact Carrot Search for an updated project descriptor.
Custom document sources
Updates not required, custom document source binaries created for version 0.4.x will work with version 0.5.0.

New features

REST API

The 0.5.0 release introduces an initial implementation of Lingo4G REST API. You can use the API to invoke Lingo4G text analysis from your favourite programming language or directly from a browser. You can start the REST API server using the server command.

Lingo4G Explorer

You can use Lingo4G Explorer to interactively tune Lingo4G parameters directly in your browser. See the quick start section for instructions on running Lingo4G Explorer.

Conceptual changes

In preparation for further development, the 0.5.0 release needed to restructure some fundamental concepts behind Lingo4G.

Labels and documents

The two basic entities involved in Lingo4G processing are now labels and documents. A document is the basic unit of information processed by Lingo4G; a label is a specific human-readable feature that occurs in one or more documents. Further releases of Lingo4G will allow arranging both labels and documents into higher-level structures such as clusters or graphs.

Clustering becomes analysis

To accommodate further additions to Lingo4G, such as embedding of labels and documents in 2d spaces, the 0.5.0 release replaces the notion of clustering with the more general analysis. Currently, analysis consists of selecting a set of labels that best describe the subset of documents submitted for analysis.

As a consequence of the clustering to analysis transition, the l4g cluster command has been renamed to analyze and the clustering section of the project descriptor has become the analysis section.

Project descriptor changes

The 0.5.0 release introduces a number of backwards-incompatible changes in the project descriptor.

"analysis" section
The clustering section has been renamed to analysis. Furthermore, to account for the changed definition of the "clustering" concept, parameters found in the clustering subsection have been moved to the labels subsection. The labels subsection has been subdivided into the surface, frequencies and cooccurrences subsections.
"labelSource" section

The labelSource section has been renamed to source and is now a subsection of the labels section. The list of feature fields to fetch labels from is now represented by an array of field descriptors rather than two arrays of field names and field weights.

Renamed parameters

A number of parameters in the former clustering.labels (now in analysis.labels.surface) have been renamed:

Old name                       New name
preferredLabelLength           preferredWordCount
preferredLabelLengthDeviation  preferredWordCountDeviation
minLabelTokens                 minWordCount
maxLabelTokens                 maxWordCount
minLabelCharacters             minCharacterCount
minLabelTokenCharacters        minWordCharacterCountAverage
"output" section
The output section has been restructured to reflect the introduction of labels and documents entities.
Output of label co-occurrences temporarily removed
Release 0.5.0 temporarily removes the option to output label co-occurrences. Further releases will allow outputting generalized relationships between labels and documents, one of which will be label co-occurrences.

Improvements

Minor quality fixes
The 0.5.0 release fixes minor bugs that could deteriorate the quality of label selection.
JSON override in l4g analyze
As of the 0.5.0 release it is possible to override arbitrary analysis parameters when invoking the analyze command using the -j command line parameter. This can be particularly handy when you export the JSON override strings from Lingo4G Explorer.
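
For example, the following invocation could override a single analysis parameter inline (the override shown is purely illustrative; quoting details depend on your shell):

l4g analyze -p datasets/dataset-stackexchange -j '{ "summary": { "labeledDocuments": true } }'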

Version 0.4.1

2015-04-17

Version 0.4.1 comes with an important bug fix in the clustering algorithm.

Compatibility

Reindexing
Not required, index created by version 0.4.0 will work with 0.4.1.
Project descriptors
Updates not required, project descriptor created for version 0.4.0 will work with version 0.4.1.
Custom document sources
Updates not required, custom document source binaries created for version 0.4.0 will work with version 0.4.1.

Bug fixes

Empty cluster set when clustering a subset of the collection

Lingo4G could erroneously create an empty cluster list when processing a subset of the collection and print a misleading message saying No candidate labels found, try lowering the DF cut-offs. The 0.4.1 release fixes this issue.

Pure negative queries supported

Version 0.4.1 adds support for pure negative queries in the cluster command. For example, -s "-summary:foo" would select all documents that do not contain the term foo in the summary field.

Assertion errors in indexer

Previous versions might throw an assertion error when the number of segments to optimize was equal to 1. Version 0.4.1 fixes this issue.

Version 0.4.0

2015-02-13

Version 0.4.0 comes with a major rewrite of the indexing infrastructure, resulting in optimized memory use, better phrase extraction and tuned resource utilization.

Compatibility

Reindexing
Not strictly required (index created by version 0.3.x will work with 0.4.0), but strongly recommended as the resulting output should contain better features.
Removed indexer options

Several options have been removed from the indexer section of the project descriptor. Project descriptors still carrying these attributes will fail to parse properly.

  • The sequential indexer type has been removed. Remove the indexer's type attribute entirely if it is present in your descriptor file.
  • The indexWriter attribute (and all child attributes) has been removed. The index writer, including its buffers and memory allocation, is now adjusted automatically.
  • Phrase feature contributor's minPhraseDfAtPartialMerge and diskCounterMaxBufferSizeMb attributes have been removed.

Query parser
The default query parser's operator has been changed from OR to AND to be more similar to modern search engines.

Improvements

Indexing

The index command has been rewritten to utilize memory and disk more efficiently.

Phrase extraction

A number of improvements to automatic phrase extraction yield better label candidates and, as a result, better clustering output.

Common terms handling

Phrases with leading or trailing common terms could previously be indexed incorrectly and show up as cluster labels; this has been fixed.

Version 0.3.1

2015-01-19

Version 0.3.1 allows specifying the list of fields to cluster on as a parameter of the l4g cluster command and fixes a minor bug in parsing command line arguments.

Compatibility

Reindexing
Not required, index created by version 0.3.0 will work with 0.3.1.
Project descriptors
Updates not required, project descriptor created for version 0.3.0 will work with version 0.3.1.
Custom document sources
Updates not required, custom document source binaries created for version 0.3.0 will work with version 0.3.1.

Improvements

l4g cluster --feature-fields

You can now pass the list of feature fields to use during clustering using the --feature-fields option.
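
For example, assuming a comma-separated list of field names (the exact separator and the field names are illustrative):

l4g cluster --feature-fields title,abstract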

Bug fixes

Incorrect parsing of quoted command line parameters

In earlier versions of Lingo4G it was impossible to pass a command line parameter enclosed in double quotes. For example, the selector query of l4g cluster -s "\"phrase query\"" would be interpreted as phrase query rather than "phrase query". Version 0.3.1 fixes this issue.

Version 0.3.0

2014-12-12

Version 0.3.0 fixes a number of major bugs and introduces two small improvements.

Compatibility

Reindexing
Recommended if the source of the data was a Lucene index; see the bug fixes section for details.
Project descriptors
Project descriptors using custom document sources may require an update. Carrot Search will provide the updated project descriptor if needed.
Custom document sources
Updates required, Carrot Search will provide binaries of your custom document sources that will work with the 0.3.0 version.

Improvements

Improved l4g stats

The l4g stats command has received a number of improvements and changes:

  • Reporting of the size of document term vectors has been added, which may be a useful piece of input for performance tuning.
  • Reporting of the raw text statistics is disabled by default; the term vector statistics are much more useful for performance tuning. You can get the raw text statistics by passing the --analyze-text-fields option.
  • The default accuracy of statistics gathering has been lowered from 1.0 to 0.1. The lowered accuracy is still sufficient to obtain a very good estimate of the statistics and leads to much faster processing on large indices. You can set a different accuracy using the -a option (see the example after this list).
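
For example, the following invocation gathers statistics at accuracy 0.5 and includes the raw text statistics (values illustrative):

l4g stats -p datasets/dataset-stackexchange -a 0.5 --analyze-text-fields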
More flexible l4g cluster -o option
As of Lingo4G 0.3.0 you can also pass a file name to the -o option of l4g cluster to save the clustering results directly to the provided file.
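
For example (the file name is illustrative):

l4g cluster -p datasets/dataset-stackexchange -o clusters.json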

Bug fixes

Some documents sourced from a Lucene index may not get indexed

Earlier versions of Lingo4G could skip some documents during indexing when the source Lucene index consisted of multiple segments or contained deleted documents. Version 0.3.0 fixes this issue.

If the source of documents was a Lucene index, re-indexing is required for Lingo4G to include all the desired documents in its index.

Exception when generating document-cluster assignments
Lingo4G 0.2.0 would throw an exception when the project descriptor had the output.components.assignments.enabled property set to true, which effectively prevented generating document-to-cluster assignments. Version 0.3.0 fixes this issue.
Use 24-hour clock in log file names
Version 0.3.0 switches to the 24-hour clock for log file names, so that sorting by file name produces a chronological order of log files.
Incorrect total time in log files
Version 0.2.0 would always report zero total processing time in log files; version 0.3.0 fixes the issue.

Version 0.2.0

2014-12-04

Compatibility

Reindexing
Recommended. Version 0.2.0 introduces more flexible configuration of document field indexing. It is recommended to re-index your data to keep the index synchronized with the updated project descriptor.
Project descriptors
Update required to convert the 0.1.x-style document field indexing definition to the syntax updated in 0.2.x. Carrot Search will provide the updated project descriptors for early adopters of Lingo4G.
Custom document sources
Updates required, Carrot Search will provide binaries of your custom document sources that will work with the 0.2.0 version.

New features

Improved document field indexing configuration

Version 0.2.0 changes the document field indexing configuration syntax to allow more flexibility. With the new syntax it is possible to reduce the size of the Lingo4G index by not storing the original text of a field and/or its search index, while retaining the possibility of applying clustering to that field. Please see the documentation of the fields section for more details and examples.

Carrot Search will provide the updated project descriptors for early adopters of Lingo4G.

Dedicated by-identifier document selection syntax
Version 0.2.0 adds dedicated syntax for selecting documents for clustering based on their identifiers. You can use this syntax to efficiently select thousands, tens of thousands, or hundreds of thousands of documents by some identifier field value.
Automatic minPerSegmentDf
Version 0.2.0 adds support for the auto value for the performance.minPerSegmentDf parameter, in which case the appropriate value will be computed based on the clusters.minClusterSize parameter. In most cases, the auto setting will improve the clustering performance.

Improvements

Maximum field length in label-document assignment result
Lingo4G will limit the maximum number of characters written for each field in the label-document assignment result, to prevent accidentally writing very large amounts of content to the result file. You can change the default length limit using the output.components.assignments.maxFieldLength option.
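
A sketch of the corresponding descriptor fragment (the limit value is illustrative):

output: {
  "components": {
    "assignments": {
      "maxFieldLength": 10000
    }
  }
}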

Bug fixes

Shell scripts return code 0 for empty cluster lists
Lingo4G 0.1.x l4g shell scripts would return a non-zero code when the list of clusters was empty. To reserve non-zero codes for actual execution errors, as of version 0.2.0 Lingo4G launch scripts return zero also when the execution completes successfully but with an empty cluster list.

Version 0.1.2

2014-11-25

Version 0.1.2 fixes a major bug in cluster label candidate selection present in the 0.1.0 and 0.1.1 releases.

Compatibility

Reindexing
Not required, index created by previous 0.1.x versions will work with 0.1.2.
Project descriptors
Updates not required, project descriptor created for earlier 0.1.x releases will work with 0.1.2.
Custom document sources
Updates not required, custom document source binaries created for previous 0.1.x versions will work with version 0.1.2.

Bug fixes

No clusters when clustering a subset of the index
Versions 0.1.0 and 0.1.1 could occasionally generate an empty cluster list when processing a fairly large subset of the index. In such cases the No candidate labels found, try lowering the DF cut-offs. message would be printed. Version 0.1.2 fixes the issue.

Version 0.1.1

2014-11-24

Version 0.1.1 introduces a number of small improvements, bug fixes and documentation clarifications.

Compatibility

Reindexing
Not required, index created by version 0.1.0 will work with 0.1.1.
Project descriptors
Updates not required, project descriptor created for version 0.1.0 will work with version 0.1.1.
Custom document sources
Updates not required, custom document source binaries created for version 0.1.0 will work with version 0.1.1.

Improvements

Re-indexing into non-empty index requires explicit confirmation
Version 0.1.0 would silently discard the existing index when re-indexing. To avoid accidental deletion of the index, version 0.1.1 will only overwrite an existing non-empty index if the --force option is provided.
Cygwin and Mingw
When running Lingo4G in Cygwin or Mingw, use the l4g.cmd script so that Lingo4G can correctly resolve file paths. As of version 0.1.1 the l4g Bash launch script will refuse to run under Cygwin and Mingw.
Version information
You can now get detailed Lingo4G version information by running l4g version.
Unlimited number of clauses in selection query

The document selection query can now use an unlimited number of clauses, which makes it possible to select large numbers of documents for clustering for example by their identifiers (id:d1 OR id:d5 OR id:d47 ...).

Planned: the performance of selecting thousands of documents using the OR syntax is currently very low. Further releases of Lingo4G will come with a dedicated syntax for by-id selection and much better performance characteristics.

Bug fixes

l4g shell scripts return codes
As of version 0.1.1, the l4g shell scripts correctly return execution status codes.

Version 0.1.0

2014-11-07

Initial alpha release.