Carrot Search Lingo4G Releases

Release notes, versions up to 1.7.1

This document contains Lingo4G release notes including information on new features, changes, bug fixes and upgrade considerations. Please see the Lingo4G reference for the full manual and options reference.

Version 1.7.1

10-09-2018

Version 1.7.1 fixes a number of issues found in earlier releases.

Compatibility

Reindexing: Not required. Lingo4G 1.7.1 will work with indexes created by version 1.7.0.
Project descriptors: Updates not required.
Custom document sources: Document sources compatible with version 1.7.0 will work with version 1.7.1.

Improvements

Query text area: Lingo4G Explorer in version 1.7.1 replaces the single-line query text box with a multi-line text area for easier input of long scope queries.

Bugs

Exception when requesting XML output

Versions 1.6.0, 1.6.1 and 1.7.0 would throw an exception when analysis output in XML format was requested. Version 1.7.1 fixes the issue.

Field length limits ignored in Export dialog

Explorer's document results export dialog would ignore the descriptor-defined field output settings. For example, a maximum of 160 characters per field would be output, regardless of the limit defined in the descriptor.

Version 1.7.1 uses the descriptor-defined field output settings when exporting analysis results from Explorer.

CygWin and Java 1.8

The launcher script l4g did not detect Java 1.8 properly under CygWin and failed to start the Java virtual machine.

Version 1.7.0

05-09-2018

The 1.7.0 release is about scaling Lingo4G to terabyte-sized data sets. The new version significantly speeds up indexing and analysis of such large collections. You can test the new capabilities on the newly added US Patent and Trademark Office data set, which contains almost 500 GB of text.

Release 1.7.0 also improves the process of fetching labels describing in-scope documents to better eliminate boiler-plate labels and increase performance when analyzing small subsets of very large indices.

Compatibility

Reindexing: Recommended. Lingo4G 1.7.0 removes redundant information from the feature index, which may lower its size by 20–30%. Further index size improvements can be achieved by changing the newly introduced maxPhrases and maxPhrasesPerField indexing parameters.
Project descriptors: Updates required. The update of the label fetching algorithm removed a number of parameters. See the Project descriptor changes section for a detailed list of changes to apply. If you have problems upgrading your project descriptor, get in touch.
Custom document sources: Document sources compatible with 1.6.x releases will work with version 1.7.0.

New features

US patents data set

This release comes with support for the patent grant and application data available from the United States Patent and Trademark Office. Please note that the whole data set contains documents spanning nearly 500 GB of indexable text and will need a high-end machine to handle.

Scalability of indexing

Version 1.7.0 introduces a number of indexing parameters to control performance and index size when working with multi-gigabyte or terabyte-sized collections.

The maxPhrases parameter can help to keep the total number of indexed phrases within a reasonable limit and thus keep the index small and noise-free.

You can use the samplingRatio parameter to run label extraction based on a sample of the indexed documents. When indexing collections of millions of documents, sampling can significantly speed up indexing with negligible loss of accuracy.

Finally, the maxPhrasesPerField parameter makes it possible to index only a number of top-frequency labels for each document. When indexing very long documents, indexing top-frequency labels can lower index size and speed up analysis.

Label fetching rewritten

Version 1.7.0 comes with significant updates to the process of fetching labels describing in-scope documents. The new algorithm should better eliminate boiler-plate labels, such as control group for medical papers, and offer stable high performance when processing small subsets of very large indices. The algorithm is controlled by the newly-added maxLabelsPerDocument parameter.

Due to changes in label scoring, the default values of the idfScorerWeight and tfScorerWeight have been changed to 1.0. To make the most of the new label selection algorithm, make sure your project descriptor and Lingo4G Explorer settings do not override these parameters.

Finally, as part of the label fetching algorithm update, the following parameters have been removed: minPerSegmentDf, maxPerSegmentDf, maxSubsetSizeForTermVectorScan, randomRatio and randomSeed.

Document deletion

A new command to delete documents from the index has been added: l4g delete. Documents to delete can be selected using a regular Lucene query.

Improvements

Document similarity computation speed-up: Version 1.7.0 significantly speeds up computation of document-to-document similarities, which is required to perform document clustering and embedding. On very large indices with millions of documents the speed-up can reach 600%.
Indexing performance bottleneck removed: An indexing performance bottleneck affecting collections with a large number of very short documents has been identified and addressed.
Feature index size reduced: We removed some redundant information from the feature index, which should result in space savings of 20% to 30% (exact numbers may vary). Reindexing is highly recommended.
Sorting map legend entries in Explorer: You can now sort the entries in the document map legend by the number of occurrences or alphabetically by the label. You can change the sorting order by clicking the icon in the top-right corner.
Complement criteria type: A new complement criteria type has been added to the document retrieval API.
Count of field values: A new property valueCount can be returned for the selected document fields. This setting needs to be enabled by setting valueCount property for a corresponding field to true in the request or the descriptor.
Security updates: Jetty, the built-in HTTP server, has been upgraded to version 9.3.24.v20180605 to address security vulnerabilities.

Bugs

Trimming marker and maxValues: When maxValues is used on a multi-valued field to limit the number of returned values, the truncationMarker is appended as the "extra" value when the limit is exceeded. This change makes it possible to skip adding this extra value if the trimming marker is an empty string.

API changes

Document criteria cleanup

The document retrieval section of the REST API has been cleaned up to be consistent with the scope.selector element. Existing code will still work in this version, but will emit deprecation warnings on the server.

There are three changes that require adjustments to existing code.

The criteria element on an analysis document retrieval JSON becomes a selector element, for example:

{
  "limit": 10,
  "selector": {
    "type": "forLabels",
    "labels": [ "data mining", "KDD" ],
    "operator": "OR"
  }
}

The composite criteria definition is now consistent with the composite scope selector. For example, what previously could read:

{
  "limit": 10,
  "criteria": {
    "type": "composite",
    "operator": "AND",
    "criteria": [
      { "type": "forLabels", "labels": [ "email" ] },
      { "type": "forLabels", "operator": "OR", "labels": [ "Thunderbird", "Outlook" ] }
    ]
  }
}

would now read:

{
  "limit": 10,
  "selector": {
    "type": "composite",
    "operator": "AND",
    "selectors": [
      { "type": "forLabels", "labels": [ "email" ] },
      { "type": "forLabels", "operator": "OR", "labels": [ "Thunderbird", "Outlook" ] }
    ]
  }
}

The complement criteria definition is now consistent with the complement scope selector. This means that the nested criteria array element becomes a single selector (multiple selectors can be combined with a composite). For example, what previously could read:

{
  "limit": 10,
  "criteria": {
    "type": "composite",
    "operator": "AND",
    "criteria": [
      { "type": "forLabels", "labels": [ "email" ] },
      { "type": "complement",
        "criteria": [{ "type": "forLabels", "operator": "OR", "labels": [ "Thunderbird", "Outlook" ] }]
      }
    ]
  }
}

would now become:

{
  "limit": 10,
  "selector": {
    "type": "composite",
    "operator": "AND",
    "selectors": [
      { "type": "forLabels", "labels": [ "email" ] },
      { "type": "complement",
        "selector": { "type": "forLabels", "operator": "OR", "labels": [ "Thunderbird", "Outlook" ] }
      }
    ]
  }
}
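
The three adjustments above are mechanical, so existing request payloads can be converted programmatically. The following is a minimal sketch, not part of Lingo4G: the function name migrate_criteria is hypothetical, and the code assumes the request has already been parsed into Python dictionaries. Because the text does not say which operator should join several selectors nested under a complement, the sketch refuses to guess and raises an error in that case.

```python
import copy


def migrate_criteria(node):
    """Return a copy of a pre-1.7.0 request or criteria object in selector form.

    Hypothetical helper, assuming the JSON shapes shown in the examples above.
    """
    node = copy.deepcopy(node)
    if "criteria" not in node:
        return node  # leaf criteria such as forLabels need no changes
    nested = node.pop("criteria")
    if isinstance(nested, dict):
        # Top-level request: "criteria" object becomes "selector".
        node["selector"] = migrate_criteria(nested)
    elif node.get("type") == "complement":
        # Complement: one-element "criteria" array becomes a single "selector".
        if len(nested) != 1:
            raise ValueError(
                "complement with several criteria: wrap them in a composite first"
            )
        node["selector"] = migrate_criteria(nested[0])
    else:
        # Composite: nested "criteria" array becomes "selectors".
        node["selectors"] = [migrate_criteria(child) for child in nested]
    return node
```

Calling migrate_criteria on the parsed "previous" request returns a new dictionary and leaves the input untouched, so the old and new payloads can be compared side by side before switching over.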

Project descriptor changes

Removed properties

Due to the update of the label fetching algorithm, a number of descriptor properties have been removed. If your descriptor contains any of those properties, remove them to make your descriptor compatible with version 1.7.0.

Removed properties:

`labels.performance.minPerSegmentDf` `labels.performance.maxPerSegmentDf` `labels.performance.maxSubsetSizeForTermVectorScan`	Configuration of old label fetching algorithm removed in version 1.7.0.
`labels.scorers.randomRatio` `labels.scorers.randomSeed`	Configuration of randomized label selection removed in version 1.7.0.
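
Removing the listed properties by hand is easy to get wrong in large descriptors, so a small script can help. The sketch below is not part of Lingo4G; the helper name remove_properties is hypothetical, and it assumes that each dotted name corresponds to nested JSON objects below whichever object you pass in (the enclosing element of the descriptor is not stated in these notes).

```python
REMOVED_IN_1_7_0 = [
    "labels.performance.minPerSegmentDf",
    "labels.performance.maxPerSegmentDf",
    "labels.performance.maxSubsetSizeForTermVectorScan",
    "labels.scorers.randomRatio",
    "labels.scorers.randomSeed",
]


def remove_properties(root, dotted_paths):
    """Delete each dotted path from the nested dict in place.

    Returns the list of paths that were actually present and removed.
    Assumption: dotted names map to nested objects under `root`.
    """
    removed = []
    for path in dotted_paths:
        *parents, leaf = path.split(".")
        node = root
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break  # path does not exist in this descriptor
        if isinstance(node, dict) and leaf in node:
            del node[leaf]
            removed.append(path)
    return removed
```

The returned list shows which removed properties the descriptor actually contained, which is useful for reviewing the change before saving the file.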

Version 1.6.1

11-05-2018

Version 1.6.1 provides a number of small improvements.

Compatibility

Reindexing: Not required. Lingo4G 1.6.1 will work with indexes created by version 1.6.0.
Project descriptors: Updates not required.
Custom document sources: Document sources compatible with version 1.6.0 will work with version 1.6.1.

When upgrading from version 1.5.1 or earlier, please see version 1.6.0 compatibility notes for the required changes.

Improvements

REST reload response

The reload REST endpoint has been improved to minimize the risk of an unchecked exception when there is an active reindexing or indexing process running in the background and modifying the index. If this is detected, an HTTP 503 Service Unavailable response will be returned to the client.

Regardless of those checks, we advise issuing the reload call sequentially, after any index-manipulating commands have completed.
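
A client that cannot fully serialize its calls may still want to handle the 503 response gracefully. The following is a minimal sketch under stated assumptions: send_reload is a hypothetical callable supplied by you that performs the actual HTTP request to the reload endpoint and returns the response status code (the endpoint URL and HTTP method are deliberately not hard-coded here, since they are not given in these notes).

```python
import time


def reload_when_ready(send_reload, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call send_reload() until it stops returning HTTP 503 or attempts run out.

    send_reload: hypothetical callable returning the HTTP status code.
    Returns the last status code received.
    """
    status = None
    for attempt in range(attempts):
        status = send_reload()
        if status != 503:
            return status  # success or a non-retryable error
        if attempt < attempts - 1:
            sleep(delay_seconds)  # index is busy, wait before retrying
    return status
```

The retry count and delay are arbitrary illustration values; waiting for the indexing command to finish before calling reload remains the preferred approach.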

Improved label coverage computation performance

Version 1.6.1 modifies the algorithm used to compute the number of labeled documents to improve its performance on both SSD and HDD drives.

Less noisy log files

We have removed some and altered other log entries to make them less noisy.

Version 1.6.0

16-04-2018

Version 1.6.0, the largest upgrade to date, adds two major new features: support for incremental indexing of documents and 2d map-like visualization of sets of documents.

Additionally, it comes with numerous Lingo4G Explorer improvements, an increase in indexing speed and many API clean-ups.

Compatibility

Reindexing

Required. Lingo4G 1.6.0 will not work with indexes created by any previous version. Full reindexing is required because of structural changes to how the index is stored and maintained.

Any previous indexes need to be manually removed from disk; the l4g index --force option will not work.

Project descriptors

Updates required. A number of changes have been made to clean up project descriptors and provide cleaner separation of concerns. See the Project descriptor changes section for a detailed list of changes to apply. If you have problems upgrading your project descriptor, get in touch.

Custom document sources

Updates required. Minor updates, such as method signature changes, will be required to update custom document sources to compile with version 1.6.0.

New features

Document map visualization

Version 1.6.0 introduces the document embedding feature, which places documents in 2d space in such a way that textually-similar documents are close to each other.

Based on document embedding, Lingo4G Explorer introduces the document map view, a tool for interactive visualization and exploration of document collections.

**Document map view.** Each document in scope is represented by one point (marker). Analysis labels are also placed on the map to describe spatial groupings of documents. Documents belonging to the same cluster set have the same color. The panel on the left lists the cluster sets along with the color assigned to them.

Incremental indexing

Starting with the 1.6.0 release, documents can be added or updated in the index without re-processing the whole collection. Please refer to the manual to see how incremental indexing works, which document sources are supported and what the caveats are with regard to running the REST server on the index that can change on the fly.

Lingo4G Explorer improvements

Document summary view

Lingo4G Explorer introduces the document summary view, which displays themes and topics discovered in the currently selected documents.

Analyzing currently selected documents

As of version 1.6.0, you can press the Analyze link in the document content or summary views to analyze the currently selected set of documents.

Query syntax and parameter help

Lingo4G Explorer now comes with quick help for analysis parameters and scope query syntax explanation and examples.

Hold your mouse pointer over the question mark icon for quick parameter help. Click the question mark icon in the query text box to toggle query syntax help.

Processing scope limit escalation

Hold the mouse pointer over the scope size information to alter the currently applied scope size limit.

To prevent unintended long-running analyses, starting with the 1.6.0 release, the default limit on the number of analyzed documents is 10k. The limit can be increased or lifted entirely by the user.

If the processing scope contains more documents than the current limit, Lingo4G Explorer will display an icon next to the scope size information. If you hold your mouse pointer over that area, you will see buttons for escalating the scope limit, along with rough estimates of how long the analysis might take.

Parameter panel collapsing

You can now click the icon to collapse the parameters panel to make more space for interaction with the analysis results.

Advanced parameters filter

Lingo4G Explorer now shows only the basic analysis parameters by default. To show all available analysis parameters, click the icon.

Document map view is now the default

As of the 1.6.0 release, the default analysis result view in Lingo4G Explorer is the document map view. The order of tabs has also been changed to put document clusters first, followed by label clusters and label list.

Other improvements

byQuery ad-hoc retrieval criteria added: As of version 1.6.0 you can use the byQuery criteria for ad-hoc document retrieval. This makes it possible to narrow down the set of in-scope documents to those matching an arbitrary user query.
More precise document relationships: Version 1.6.0 adds the minMatchingQueryLabels parameter separately for document clustering and embedding. This makes it possible to filter out weak relationships based on, for example, only one common label shared by documents.
More precise retrieval with forLabels criteria: It is now possible to restrict the set of documents retrieved using the forLabels criteria to those containing not fewer than the specified number of the requested labels. The minimum number of labels can be passed using the minOrMatches criteria parameter.
White space normalization: When queries entered by users contained invisible white space (special Unicode character sequences), query parsers typically returned no matching documents, confusing users. We added white space normalization to all query parsers, controlled by the sanitizeSpaces parameter. By default, all Unicode white space characters are normalized to a single plain white space.
Dependency upgrades: All software dependencies have been updated to their latest stable versions. This includes migration of indexes to use Lucene 7.2.1.

API changes

The 1.6.0 release fundamentally changes the way Lingo4G indices are created and stored. While most of the changes are transparent to the end user, we took this opportunity to clean up certain aspects of the public APIs and project descriptors.

This section lists the general functional changes. For a list of the required project descriptor updates, see the project descriptor changes section.

Processing scope limits apply for all analysis facets

Versions prior to 1.6.0 applied processing scope limits only during document clustering, through the maxDocuments and trimToMaxDocuments parameters.

Version 1.6.0 applies the scope size limit across all analysis artifacts, including label list, label clustering, document clustering and embedding. As a result, the document clustering-specific limit parameters have been removed and replaced with the scope.limit parameter.

Additionally, to avoid unintended long-running analyses, the default scope limit is set to 10000 documents. You can override the limit by editing your project descriptor or by providing a new per-analysis limit value. To remove the limit entirely, pass the unlimited string as the scope limit.
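
To make the per-analysis override concrete, here is a minimal sketch of a helper that sets the limit on a request before it is sent. It is not part of Lingo4G: the name with_scope_limit is hypothetical, and the code assumes that scope.limit is expressed as a limit property nested in the request's scope object. Rejecting non-positive numbers is a choice made for this example, not a documented rule.

```python
def with_scope_limit(request, limit):
    """Return a copy of an analysis request with scope.limit set.

    limit: a positive integer, or the string "unlimited" to lift the limit.
    Assumption: scope.limit lives at request["scope"]["limit"].
    """
    is_count = isinstance(limit, int) and not isinstance(limit, bool) and limit > 0
    if limit != "unlimited" and not is_count:
        raise ValueError('limit must be a positive integer or "unlimited"')
    updated = dict(request)
    scope = dict(updated.get("scope", {}))  # copy so the input stays untouched
    scope["limit"] = limit
    updated["scope"] = scope
    return updated
```

Keeping the default of 10000 documents in place and raising the limit only for specific requests preserves the protection against unintended long-running analyses.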

Probability ratio-based label scoring removed

The 1.6.0 release removes the component of label score computation based on scope-to-collection occurrence probability ratios. While that scoring component was effective for narrowly-focused topic-specific scopes, it would significantly lower the quality of labels for scopes being all-topics samples of the full collection.

As a result, all probability ratio-based parameters have been removed from the descriptor.

Output of raw label and document relationships removed

Version 1.6.0 removes the possibility to output raw similarity matrices for labels and documents. As a result, the related parts of the descriptor and output JSON have been removed.

Index format changes

Version 1.6.0 changes the internal layout of the index. Documents and features are now stored separately. As a result, indexing requires only one iteration over the documents provided by the document source; any remaining passes required for feature discovery are executed on the contents of the index.

A side-effect of this is that labels can now be recomputed based on the current content of the index. This can save processing time when new features need to be generated after indexing parameters or dictionaries change.

Caching of source documents removed

Due to the new index structure, there is no need for explicit caching of source documents. Therefore, the cacheSource parameter has been removed from project descriptor and the l4g.cacheSource system property controlling this feature is no longer recognized.

Incremental indexing interfaces

An initial version of the incremental indexing API has been added (IIncremental interface), with two example implementations in the examples (JsonDocumentSource and JsonRecordsDocumentSource).

The incremental indexing API is still experimental and may change in the future.

Project descriptor changes

Removed properties

A number of descriptor properties have been removed, usually as a consequence of refactorings and removal of Lingo4G features. See the API changes section for a detailed description of the related functional changes.

Removed properties:

`name`	Human-readable project name.
`documents.arrangement.maxDocuments` `documents.arrangement.trimToMaxDocuments`	Document-clustering specific scope limit, now replaced with a general scope limit.
`labels.relationships` `documents.relationships`	Configuration of the output of label-label and document-document relationships.
`labels.probabilities.probabilityRatioPreference` `labels.probabilities.probabilityRatioThreshold` `labels.probabilities.probabilityRatioPreferenceStrength` `labels.probabilities.probabilityRatioMaxRelativeScopeSize` `labels.scorers.probabilityRatioScorerWeight`	Parameters of probability ratio-based label scoring.
`cacheSource`	Configuration of source document caching.

Scope specification refactored

The parameter specifying scope size limit has been moved from scope.byQuery.limit to scope.limit.

Additionally, scope selector definition has been moved to the dedicated scope.selector element to signify that the scope limit applies to all types of selectors. See scope element documentation for the current structure of scope definition.

Arrays replaced with objects

indexer.features element is an object, not an array. Previously the declaration of feature extractors required an array of objects, each with a key attribute to identify the feature extractor. The features section is now an object, with each key denoting a unique extractor and the associated value containing its definition. So, a previous JSON of:

"indexer": {
  "features": [
    {
      "key": "coarse-phrases",
      "type": "phrases",
      ...
    }
  ]
}

now becomes:

"indexer": {
  "features": {
    "coarse-phrases": {
      "type": "phrases",
      ...
    }
  }
}

Top-level queryParsers element is an object, not an array. Previously the declaration of query parsers required an array of objects, each with a key attribute to identify the query parser. The queryParsers section is now an object, with each key denoting a unique query parser and the associated value containing its definition. So, a previous JSON of:

"queryParsers": [
  {
    "key": "standard",
    "type" : "standard",
    ...
  }
]

now becomes:

"queryParsers": {
  "standard": {
    "type": "standard",
    ...
  }
}

Top-level dictionaries element is an object, not an array. Previously the declaration of dictionaries required an array of objects, each with a key attribute to identify the dictionary. The dictionaries section is now an object, with each key denoting a unique dictionary and the associated value containing its definition. So, a previous JSON of:

"dictionaries": [
  {
    "key": "default",
    "type" : "simple",
    ...
  }
]

now becomes:

"dictionaries": {
  "default": {
    "type": "simple",
    ...
  }
}
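
All three conversions above follow the same pattern: each array entry's key attribute becomes a property name, and the rest of the entry becomes the value. The following is a minimal sketch of that transformation, not a tool shipped with Lingo4G; both function names are hypothetical, and the descriptor is assumed to be already parsed into Python dictionaries.

```python
def keyed_array_to_object(entries):
    """Turn [{"key": k, ...rest}, ...] into {k: {...rest}, ...}."""
    result = {}
    for entry in entries:
        definition = dict(entry)  # copy so the input is left untouched
        key = definition.pop("key")
        if key in result:
            raise ValueError("duplicate key: " + key)
        result[key] = definition
    return result


def migrate_keyed_sections(descriptor):
    """Convert indexer.features, queryParsers and dictionaries to objects."""
    migrated = dict(descriptor)
    for name in ("queryParsers", "dictionaries"):
        if isinstance(migrated.get(name), list):
            migrated[name] = keyed_array_to_object(migrated[name])
    indexer = migrated.get("indexer")
    if isinstance(indexer, dict) and isinstance(indexer.get("features"), list):
        migrated["indexer"] = dict(
            indexer, features=keyed_array_to_object(indexer["features"])
        )
    return migrated
```

Sections that are already objects are left as they are, so running the helper on a descriptor that has been partly migrated by hand should be harmless.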

fields moved to top level

The declaration of fields, previously:

{
  "source": {
    "fields": {
     ...
    }
  }
}

is now moved to top-level:

{
  "fields": {
   ...
  },
  "source": {
   ...
  }
}
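
For descriptors maintained by script, the move can be automated. This is a minimal sketch with a hypothetical helper name, assuming the descriptor is a parsed JSON object shaped as in the two snippets above.

```python
def move_fields_to_top_level(descriptor):
    """Return a copy of the descriptor with source.fields moved to fields."""
    migrated = dict(descriptor)
    source = dict(migrated.get("source", {}))
    if "fields" in source:
        if "fields" in migrated:
            # Both locations present: merging is ambiguous, so stop here.
            raise ValueError("descriptor already has a top-level fields element")
        migrated["fields"] = source.pop("fields")
        migrated["source"] = source
    return migrated
```

The helper does not alter the field definitions themselves; any per-field attribute changes required by version 1.6.0 still need to be applied separately.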

More strict fields specification

The field specification is now more strict with respect to which attributes are permitted for each type of field. Only text fields can declare analyzers. The stored attribute has been removed without replacement: all fields are now stored in the index to allow incremental indexing and recomputation of features.

Document identifiers

A Boolean id attribute can be defined on exactly one field to indicate the identifier of a document required for document updates in incremental indexing.

classpath element moved

Top-level classpath element has been moved under source.classpath.

Project directory configuration consolidated

Top-level projectDirectory, workDirectory, resultsDirectory, indexDirectory elements have been removed. There is a new top-level element directories declaring key project directories, but changes or overrides to this element at the project-descriptor level are discouraged.

Substitution variables renamed

Substitution variable project.directory, pointing at the directory of the project descriptor, has been renamed to l4g.project.dir. The only other substitution variable available is l4g.home pointing at the installation directory of Lingo4G. No other project locations are exposed as substitution variables.

Document source classes renamed

Names of document source classes shipped with Lingo4G have changed. If your project descriptor makes use of any Lingo4G document source (from the examples) then the naming convention has changed from ...DocumentSourceFragment to ...DocumentSourceModule, reflecting the fact that these classes implement Guice modules that provide an implementation of a document source.

Bugs

Incorrect sorting of document cluster sets: Previous versions of Lingo4G Explorer would incorrectly sort document cluster sets. Now, cluster sets are sorted by the decreasing number of documents in the cluster set.

Version 1.5.2

12-12-2017

Version 1.5.2 fixes a significant issue with label exclusion dictionaries. Update is strongly recommended.

Compatibility

Reindexing: Not required. This release will work with indexes created by any prior 1.5.x version.
Project descriptors: Updates not required.
Custom document sources: Document sources are compatible with any 1.5.x-compatible version.

Bugs

Label exclusion dictionaries

A significant bug was present in the cache fingerprint calculation routine of simple label exclusion dictionaries. This could manifest itself in hash collisions between dictionaries that had different exclusion rules, but an identical set of terms. For example, the following two dictionaries would be considered identical:

* foo bar
foo

foo
bar

While these are obviously different exclusion rules (that would apply to different patterns), Lingo4G would compile the first dictionary it sees and reuse it for any subsequent requests. This could manifest itself in hard-to-reproduce odd behavior at runtime.

Version 1.5.1

08-08-2017

Version 1.5.1 provides a number of minor improvements and bug fixes.

Compatibility

Reindexing: Not required. Lingo4G 1.5.1 will work with indexes created by version 1.5.0.
Project descriptors: Updates not required.
Custom document sources: Document sources compatible with version 1.5.0 will work with version 1.5.1.

Bugs

Incorrect document cluster selection highlighting: Lingo4G Explorer 1.5.0 might highlight a cluster different than the one the user clicked to select.

Improvements

IE compatibility view: A special meta tag has been added to Lingo4G Explorer's front page to allow it to bypass Internet Explorer's intranet "compatibility mode" policy.

Version 1.5.0

30-06-2017

Version 1.5.0 introduces hierarchical clustering of documents and implements a number of minor improvements and bug fixes.

Compatibility

Reindexing: Recommended. Lingo4G 1.5.0 enables document length normalization, which will take effect only after re-indexing.
Project descriptors: Updates required. A number of parameters have been moved, so descriptors need to be updated to account for those changes.
Custom document sources: Document sources compatible with version 1.4.0 will work with version 1.5.0.

New features

Hierarchical document clustering

Version 1.5.0 adds support for hierarchical clustering of documents. Similarly to label clustering, relationships between document clusters are established by allowing cluster exemplars to be themselves members of other clusters.

As with label clustering, Lingo4G Explorer presents document arrangements as a flattened two-level structure of cluster sets and clusters. Hierarchical clustering of documents can be controlled by the softening parameter.

Improvements

Analysis scope size limit in Explorer

Version 1.5.0 adds the possibility to limit the number of analyzed documents in Lingo4G Explorer. To apply the limit, check the check box and choose the maximum number of documents to process.

The same limit can be applied when calling Lingo4G REST API by setting the limit parameter.

Document length norms

Version 1.5.0 enables the storage of document length normalization factors for feature fields. This will help to compute more consistent document similarity values when the index contains length-imbalanced documents.

Document normalization factors are computed and stored during indexing. Therefore, for normalization to take effect, data needs to be re-indexed.

minDocumentLabels parameter added

The minDocumentLabels parameter has been added to control which documents are included in document relationship computation and clustering.

Output of label similarities

You can now set labels.relationships.enabled to true to retrieve the matrix of similarities between labels. The output of the relationships matrix can also be enabled in Lingo4G Explorer's Export window. Note that this feature was removed in version 1.6.0.

Multivalued fields and trimming

If a field with multiple values is trimmed because of the limit set in maxValues, it will be returned with an additional value equal to the truncation marker (from the highlighting configuration). This can be used to make it more explicit that the returned set of values is not a complete value of the field.

Bugs

Dates and numbers in queries causing exceptions: Date or numeric field types will no longer result in a runtime exception when used in a query using qualified field notation (field:value).

API changes

Moved parameters

Parameters controlling computation of relationships between labels and documents have been moved under a dedicated section of the project descriptor.

Parameters of co-occurrence based relationship computations have been moved to their dedicated cooccurrences section:

1.4.0 and earlier:

"arrangement": {
  "relationship": {
    "type": "cooccurrences",
    "similarityWeighting": "RR",
    "threads": "auto"
  }
}

1.5.0:

"arrangement": {
  "relationship": {
    "type": "cooccurrences",
    "cooccurrences": {
      "similarityWeighting": "RR",
      "threads": "auto"
    }
  }
}

Similarly, parameters of document relationship computations have been moved to a dedicated mlt section:

1.4.0 and earlier:

"arrangement": {
  "relationship": {
    "type": "mlt",
    "maxSimilarDocuments": 10,
    "maxQueryLabels": 20,
    ...
  }
}

1.5.0:

"arrangement": {
  "relationship": {
    "type": "mlt",
    "mlt": {
      "maxSimilarDocuments": 10,
      "maxQueryLabels": 20,
      ...
    }
  }
}

If the project descriptor contains non-default values of the above parameters, it must be updated to move the parameters to their dedicated sections as above.
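
Both moves follow one rule: every parameter next to type is relocated into a nested object named after the relationship type. The following is a minimal sketch of that rule with a hypothetical function name, assuming the relationship element has been parsed into a Python dictionary. Only the two types named above are handled; any other value raises an error, since these notes do not describe other types.

```python
def nest_relationship_parameters(relationship):
    """Move 1.4.0-style parameters under the section named after the type."""
    kind = relationship["type"]
    if kind not in ("cooccurrences", "mlt"):
        raise ValueError("unsupported relationship type: " + str(kind))
    if isinstance(relationship.get(kind), dict):
        return dict(relationship)  # already in the 1.5.0 layout
    # Everything except "type" is a parameter of the chosen relationship.
    parameters = {k: v for k, v in relationship.items() if k != "type"}
    return {"type": kind, kind: parameters}
```

Apply the helper to each arrangement.relationship element in the descriptor; elements that already use the 1.5.0 layout are returned unchanged.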

Document relationship matrix format change

Version 1.5.0 switches the format of the document similarity matrix to use row / column indices in the matrix rather than document identifiers. This lowers the size of the responses containing similarity matrices. Note that the output of similarity matrices was removed in version 1.6.0.

Version 1.4.0

09-05-2017

Version 1.4.0 adds the possibility to cancel in-progress analysis requests, introduces caching of source documents and provides a number of minor improvements and bug fixes.

Compatibility

Reindexing: Not required. Lingo4G 1.4.0 will work with indexes created with Lingo4G 1.3.0.
Project descriptors: Updates not required.
Custom document sources: Document sources compatible with version 1.3.0 will work with version 1.4.0. Please, however, note the source document caching feature introduced in version 1.4.0.

New features

Deleting or cancelling analyses

The analysis endpoint of the REST API now supports the HTTP DELETE method, which you can use to cancel in-progress analyses or delete the completed ones from server caches.

Lingo4G Explorer's user interface has been modified to allow users to cancel analyses in progress.

Source document caching

As of version 1.4.0, Lingo4G indexing will fetch documents from the document source only once, storing all the required fields in a local, compressed cache file. The cache file will then be reused for all subsequent indexing phases and removed once indexing completes. Note that source document caching was removed in version 1.6.0.

This feature will be useful for document sources where data fetching is costly (decompression, network access, file parsing) or can result in loading much overhead data, not referenced from any feature extractors or data fields later stored in Lingo4G indexes.

Size of the cache file will typically be 50–70% of the total size of text returned by the document source for indexing.

While caching is enabled by default, it can be switched off by passing the -Dl4g.cacheSource=false option to the index command or by permanently changing the cacheSource option in the project descriptor.

License file information

Version 1.4.0 adds license validity and limits information to the /about REST API endpoint. The same information can be displayed in Lingo4G Explorer.

Improvements

Optimized number of source scans

The indexer's source document scanning has been optimized to minimize the number of required full passes over the input.

This improvement applies to projects with more than one feature extractor; for a single feature extractor, the behavior has not changed.

Lucene 6.5.0

The underlying Lucene search engine has been upgraded to version 6.5.0. Existing indexes are compatible and do not require reindexing.

Jetty 9.3.19

The built-in Jetty server has been upgraded to version 9.3.19.v20170502.

Bugs

sourceFields was ignored in phrase extractor: The sourceFields property was ignored in the phrase feature extractor. Instead, all targetFields were used as the source of features. This bug affected only those setups where sourceFields and targetFields were different.
--max-docs console output: When --max-docs was used during indexing, the console progress output was not updated properly (it remained stuck at partial progress).
Document content output in Lingo4G Explorer: Previous versions would always export the default set of fields even if some of those fields were deselected in the dialog.

Version 1.3.0

15-03-2017

The 1.3.0 release adds the Wikipedia and NIH research projects datasets, adjusts indexing concurrency to account for drive speed, fixes a bug that prevented Lingo4G Explorer from loading in IE 11 and Firefox and provides a number of other minor improvements and bug fixes.

Compatibility

Reindexing

Not required. Lingo4G 1.3.0 will work with indexes created with Lingo4G 1.2.1.

Project descriptors

Updates required. Version 1.2.1 deprecated the jsonFile property of the JSON document source in favor of the inputs property. The jsonFile property has been removed in version 1.3.0.

Custom document sources and logging configuration

Updates may be required due to the upgrade of the default logging system to log4j2. We always recommend using slf4j logging facade so that such changes are application-transparent.

Log4j2 has a slightly different syntax of configuration files, so any customizations or changes have to be applied to these new files.

New features

Indexing improvements: Lucene index optimization has been rewritten to automatically adjust to drive capabilities. The number of index merging threads now follows the general specification given in the threads attribute.
Wikipedia dataset: Version 1.3.0 comes with the Wikipedia dataset that you can use to index and analyze the contents of Wikipedia.
NIH research projects dataset: Version 1.3.0 comes with the NIH research projects dataset which indexes titles and abstracts of research projects funded by the US National Institutes of Health.

Improvements

Indexing speedups

Fewer phrase normalization passes are now required, which should result in minor speedups during indexing.

Phrase normalization and numerics

Phrase normalization will no longer create aliases for phrases consisting solely of numeric tokens. For example, if the token sequences 10 124, 10,124 and 10124 occurred frequently, they were previously treated as aliases of the same number. This is no longer the case, as it could lead to confusing results.

Note that for mixed alphanumeric tokens, such as: μ ct 40, μct40, μ ct40 such aliases are still created (this is frequently the case with measurement units, acronyms and proper names).
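
The rule above can be pictured with a small sketch. This is not Lingo4G's normalization code: the function names and the whitespace-collapsing key are hypothetical and serve only to show that numeric-only phrases are excluded from aliasing while mixed alphanumeric phrases still share a key.

```python
import re


def is_numeric_only(phrase: str) -> bool:
    """True when every token consists solely of digits and digit separators."""
    tokens = phrase.split()
    return bool(tokens) and all(re.fullmatch(r"[\d.,]+", t) for t in tokens)


def alias_key(phrase: str):
    """Return a key shared by aliased phrases, or None when no alias is created."""
    if is_numeric_only(phrase):
        return None  # purely numeric phrases are not aliased
    # Hypothetical normalization: ignore whitespace and case for mixed tokens.
    return "".join(phrase.split()).lower()


print(alias_key("10 124"))                          # None: numeric-only, no alias
print(alias_key("μ ct 40") == alias_key("μct40"))   # True: mixed tokens share a key
```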

Output of document similarities

You can now set documents.relationships.enabled to true to retrieve the matrix of similarities between documents in scope. The output of the relationships matrix can also be enabled in Lingo4G Explorer's Export window.

Bugs

Errors in IE11: Lingo4G Explorer would not load in Internet Explorer 11 and older versions of Firefox. Version 1.3.0 fixes the problem.

API changes

JSON document source: The deprecated jsonFile property of the JSON document source has been removed in favor of the inputs property.

Version 1.2.1

31-01-2017

Version 1.2.1 adds support for indexing multiple JSON files, fixes a major bug in label clustering and provides a number of other minor bug fixes and improvements.

Compatibility

Reindexing

Not required. Lingo4G 1.2.1 will work with indexes created with Lingo4G 1.2.0.

Project descriptors

Updates recommended. Version 1.2.1 deprecates the jsonFile property of the JSON document source in favor of the inputs property that allows specifying multiple JSON files. The jsonFile property will be removed in version 1.3.0.

Custom document sources

Updates may be required due to commons-compress and xz dependency upgrades.

Minor dependency upgrades

commons-compress has been upgraded to version 1.13, xz has been upgraded to version 1.6.

Lingo4G now ships with Apache Xerces, which changes the default XML parser from Java's default.

These changes should not affect clustering results, but may require changes to custom document sources.

Improvements

Firewall helpers: A FAQ entry and unpack utility have been added for those behind corporate firewalls (where automatic download of data files didn't work).
AutoIndex file formats: Documented supported autoindex document source file formats.
Multi-file support in JSON example: The JSON example has been modified to support multiple JSON files on input. The old jsonFile descriptor attribute is deprecated and will be removed in version 1.3.0. You should update your project descriptors to use the new inputs attribute (see the updated JSON example section for syntax).

Bugs

Incorrect co-occurrence counting: Versions prior to 1.2.1 would incorrectly take the cooccurrenceCountingAccuracy threshold relative to scope size rather than relative to the total number of documents in the index. As a result, when processing small documents subsets with low values of cooccurrenceCountingAccuracy, co-occurrence counts would be sparse and inaccurate, which might lead to many unclustered labels. Version 1.2.1 fixes the issue.
Missing progress reporting: Label fetching phase could omit job progress information. This is a regression that was introduced in version 1.2.0. Version 1.2.1 fixes the issue.
Wildcard queries not highlighted: Wildcard queries were not properly highlighted after changes introduced in version 1.2.0. Version 1.2.1 fixes the problem.
Fixed runtime exception in Progress updater: A runtime exception could be printed during indexing (while closing the index). The exception didn't affect indexing results. The issue is fixed in the 1.2.1 release.

Version 1.2.0

16-01-2017

The 1.2.0 release introduces support for indexing PDF, Word and HTML files, adds automatic handling of concurrency during indexing, improves highlighting of complex queries and adds a number of other smaller improvements and bug fixes.

Compatibility

Reindexing

Recommended. Lingo4G 1.2.0 will work with indexes created with Lingo4G 1.1.0, although certain changes to stop label extraction algorithms may bring some label improvements after reindexing.

Project descriptors

Updates recommended. The indexer.stopLabelExtractor.threads and indexer.stopLabelExtractor.accuracy attributes are deprecated and should be removed from project descriptors (their values will be ignored and will trigger a warning on the console).

If used, indexer.threads attribute should be set to auto (remove any fixed number of threads override if you have it). This enables automatic thread management which adjusts to hardware automatically (HDDs, SSDs, number of CPU cores).

Custom document sources

Updates may be required if your custom document source used some of the public Lingo4G utility classes (for parallel processing, for example) or Google Guava, which has been updated to a newer version.

New features

PDF/Word document source example

A new document source has been added that automatically extracts text from PDF, Microsoft Word, OpenOffice and other file formats. File format detection and text extraction are done using the Apache Tika library. Full source code is included, so you can modify the document source or use it out of the box to index local files.

Automatic concurrency in indexing

The indexer will try to automatically maximize throughput taking into account the number of available CPU cores and the speed of the drive(s) used for indexing. The threads attribute should be set to auto for this feature to work.

When the automatic adjustment isn't a good fit or lower CPU consumption is required, a global system property l4g.concurrency can be set at startup to override the defaults (using -Dl4g.concurrency=... syntax) or the threads attribute can be modified directly in the project descriptor (this is discouraged).

See the threads attribute's description for the permitted syntax of threading specification.

Cell highlighting in document clusters treemap

Starting with the 1.2.0 release, you can have Lingo4G Explorer highlight same-color or same-label cells in the document clusters treemap.

Improvements

Scope highlighting: Scope highlighting has been rewritten from scratch and should now fully support phrase and fuzzy queries. Previously, any term of a phrase query would be highlighted; after this change, only the terms actually involved in the phrase (or a matching term span) receive the highlight.
Lucene upgrade: Apache Lucene has been upgraded to version 6.3.0.
New analysis scope types: The 1.2.0 release adds three scope types that make it possible to use complex criteria for document selection: by-id selection, complementary and composite scope definitions.
Stop label extraction: Better detection of nonsensical stop label extraction conditions and reporting. Automatically detected stop labels may change as a result of this adjustment.
Phrase normalization feedback: Improved console progress feedback (indexing phase): it now shows a progress bar on larger data sets.
Guava 20.0: The Google Guava dependency has been updated to version 20.0.

Bugs

Exceptions when indexing Unicode: Documents containing non-ASCII Unicode characters could result in unhandled exceptions thrown during indexing. This is a regression bug affecting versions 1.1.0 and 1.1.1.
Inconsistent label selection between server restarts: In previous versions, labels selected for the same analysis scope might differ between Lingo4G REST API restarts and command line analysis invocations. Version 1.2.0 fixes the issue, so that label selection results are always the same.
License information: l4g version command wasn't able to display valid license information properly.
Label sorting broken: Labels were not sorted properly in Lingo4G Explorer. This regression bug was introduced in version 1.1.0 and is fixed in the 1.2.0 release.
Stats output broken: The stats command skipped index component statistics.

API changes

Output of labels and documents

The labels part of the response will contain the list property only when the output of labels is requested by setting the output.labels.enabled parameter to true. Otherwise, the list property will not be present.

Similarly, the documents part of the response will contain the list property only when the output of documents is requested by setting the output.documents.enabled parameter to true.
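
Client code that reads analysis responses should therefore treat the list property as optional. A minimal Python sketch of defensive access follows; the helper names and the sample response dictionaries are made up for illustration and do not reflect the full response structure.

```python
def labels_from_response(response: dict) -> list:
    """Return the label list, or an empty list when label output was not requested."""
    return response.get("labels", {}).get("list", [])


def documents_from_response(response: dict) -> list:
    """Return the document list, or an empty list when document output was not requested."""
    return response.get("documents", {}).get("list", [])


# Hypothetical, heavily simplified responses used only to exercise the helpers.
with_labels = {"labels": {"list": [{"text": "data mining"}]}, "documents": {}}
without_output = {"labels": {}, "documents": {}}

print(len(labels_from_response(with_labels)))     # 1
print(len(labels_from_response(without_output)))  # 0
```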

Version 1.1.1

08-12-2016

Version 1.1.1 fixes a major bug in fetching of the contents of documents.

Compatibility

Reindexing: Not required. Lingo4G 1.1.1 will work with indexes created with Lingo4G 1.1.0.
Project descriptors: Updates not required. Version 1.1.1 does not change project descriptors.
Custom document sources: Updates not required.

Bug fixes

Content of incorrect documents fetched

In version 1.1.0, when document content was requested simultaneously with the onlyWithLabels or onlyAssignedToLabels parameters set to true, for some documents incorrect content could be fetched.

Version 1.1.1 fixes this issue.

Version 1.1.0

25-10-2016

Version 1.1.0 improves conflation of different spelling variants of the same label, adds more control over heuristic English stemming, fixes a number of bugs and extends documentation.

Compatibility

Reindexing: Recommended. Lingo4G 1.1.0 will work with indexes created with Lingo4G 1.0.0, but reindexing is strongly recommended because of improvements in automatic label detection.
Project descriptors: Updates not required. Version 1.1.0 does not change project descriptors.
Custom document sources: Updates not required.

New features

Numeric ranges

Consistent support for numerics and numeric ranges in both the standard and complex query parser.

Stemming control

An option called useHeuristicStemming was added to disable heuristic stemming in the English analyzer.

Spelling variants

Phrase feature extraction has been improved to automatically detect and merge spelling variants of labels written with or without dashes and as a compound word or multi-term phrase. For example, the following spelling variants would be unified now:

fast boot, fast-boot, fastboot
web page, webpage, web-page
magical jelly bean, magical jellybean, magical jelly-bean
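
One simple way to picture this kind of merging is to group labels under a key that ignores spaces and dashes. The sketch below is a hypothetical illustration in Python, not the algorithm used by the phrase feature extractor, which is not described in these notes.

```python
from collections import defaultdict


def variant_key(label: str) -> str:
    """Collapse spaces and dashes so that spelling variants share one key."""
    return label.lower().replace("-", "").replace(" ", "")


def merge_variants(labels):
    """Group labels with matching keys; returns a dict of key -> list of variants."""
    groups = defaultdict(list)
    for label in labels:
        groups[variant_key(label)].append(label)
    return dict(groups)


groups = merge_variants(["fast boot", "fast-boot", "fastboot", "web page", "webpage", "web-page"])
for key, variants in groups.items():
    print(key, "<-", variants)
```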

Relevance score

The query-relevance score attribute was added to each document in document retrieval API.

Improvements

Documentation: We added documentation of default analyzers and their options.
Dictionaries cleanup: The default dictionaries have been cleaned up and renamed consistently. Example projects make use of the default dictionaries and additionally project-specific dictionaries, where applicable.
Licensing: Licenses will be reloaded automatically when no active licenses are found. This permits hot-swapping of licenses while the server is running.

Other internal cleanups: A number of other internal issues have been fixed.

Bugs

Small input crashes: There was a possibility of a runtime exception being hit on analysis of small inputs.
Terminal crash: There was a possibility of an exception being thrown on non-updateable terminals.

Version 1.0.2

19-10-2016

Version 1.0.2 is a maintenance release that addresses minor software bugs.

Bugs

Terminal crash: There was a possibility of an exception being thrown on non-updateable terminals.

Version 1.0.1

28-09-2016

Version 1.0.1 is a maintenance release that addresses minor software bugs and documentation deficiencies.

Compatibility

Reindexing: Not required. Version 1.0.1 will work with indexes created by version 1.0.0.
Project descriptors: Updates not required.
Custom document sources: Updates not required.

Improvements

Hiding zero-sized docs in cluster treemap: Version 1.0.1 adds the possibility to hide zero-sized groups in the document cluster treemap.

Version 1.0.0

22-09-2016

Version 1.0.0 is the first official release of Lingo4G. Version 1.0.0 comes with dictionary-based filtering of labels reworked and documented, improved label selection stability and minor improvements to Lingo4G Explorer and documentation.

Compatibility

Reindexing: Required. Lingo4G 1.0.0 updates the index storage format; indices created by the 0.11.x versions will not work with version 1.0.0.
Project descriptors: Updates required. Version 1.0.0 changes the way label dictionaries are defined and applied.
Custom document sources: Updates not required.

New features

Dictionaries

Version 1.0.0 introduces a common definition of label dictionaries that can be used, for example, to exclude specific labels from analysis. This release comes with two dictionary implementations: the simple and efficient word-based matching and more powerful but expensive to apply regular expression based matching. The dictionaries parameter documentation describes how to define your own dictionaries.

Additionally, the newly introduced dictionaries framework allows defining ad-hoc (per analysis request) dictionaries, which you can use to let the users tune or add their own label exclusions without restarting Lingo4G REST API server. Lingo4G Explorer comes with a simple implementation of this idea.

Improvements

Label selection stability improvements: In previous versions of Lingo4G, excluding a single label from analysis could trigger a cascade of other changes to the label list, with many unrelated labels being removed and replaced. Version 1.0.0 improves label selection stability to prevent such situations.

Hash-based analysis ids

As of version 1.0.0, the REST API will use 64-bit hash strings as identifiers of asynchronously handled analyses. This will minimize the chances of getting stale analysis results in case Lingo4G REST API is restarted between initiating the analysis and fetching its results.

This change should not require any changes in the code of your application, unless it relies on the structure of the analysis results URL returned by the REST API in the Location header.

Partial results statistics: Version 1.0.0 changes the way the REST API reports processing progress. As of this release, the result of the /v1/analysis/{id} method will follow the structure of the complete analysis result returned by the /v1/analysis/{id}/result. The difference between the two methods is that the former will only return processing progress information and certain labels and document statistics, while the latter will return the complete analysis result.

Analysis status and parameters in output response: As of version 1.0.0, the analysis result response includes the processing status and parameters used to produce the analysis. These two pieces of data are especially useful for debugging the specific analysis result.

Version 0.11.0

02-08-2016

Version 0.11.0 improves the stability of label selection, adds more detailed performance logging and introduces working index versioning.

Compatibility

Reindexing: Required. Starting with version 0.11.0, Lingo4G will automatically determine whether the existing index is compatible with the version of Lingo4G you are running. Re-indexing is required for this feature to work.
Project descriptors: Updates not required.
Custom document sources: Updates required. Version 0.11.0 introduces improved APIs for progress reporting; custom document sources need to be updated to use those APIs.

Improvements

Progress and performance logging improvements

Version 0.11.0 comes with significantly improved reporting and logging of progress information. For each analysis request, logs will now contain a detailed breakdown of the performed tasks.

[Task]                               [Time]    [%]
Resolving selector query              129ms   3.2%
Fetching candidate labels          1s 651ms  40.7%
  TermVectorScan                   1s 619ms  39.9%
   @ Segments: 7
   @ Documents: 8,355
   @ Threads: 8
   @ Labels fetched: 7,994
   @ Speed: 5.16ki docs/s
Scoring candidate labels              214ms   5.3%
 @ Labels scored: 7,994
 @ Labels selected: 1,000
 @ Speed: 38.25ki labels/s
Counting co-occurrences            1s 298ms  32.0%
 @ Threads: 8
 @ Speed: 770 labels/s
Computing label similarities           22ms   0.5%
Clustering labels                     398ms   9.8%
 @ Similarity density: 20.84%
 @ Similarity pruning gain: 1.98%
 @ Similarity pruning time: 77ms
 @ Similarity used: original
 @ Iterations: 155 (7.8% of max)
Computing coverage                    348ms   8.6%
 @ Segments: 7
 @ Labels: 1,000
 @ Threads: 8
 @ Speed: 2.87ki labels/s

Working index versioning

Starting with version 0.11.0, Lingo4G will automatically determine whether the existing index is compatible with the version of Lingo4G you are running. If index format is too old, you will need to re-index your data before you can run analyses.

Heads up!

When you run the analyze, server or stats command of Lingo4G 0.11.0 with a working index created by a previous version, you will see the following message:

The current index is too old, reindex your data.

Please re-index your data to be able to run analyses with version 0.11.0.

Maximum indexed documents option

Since version 0.11.0 you can pass the --max-docs option to the index command to limit the number of documents to index.

Bug fixes

Label selection stability improvements: Prior versions might select different labels for the same set of parameters. This release ensures that the same set of labels is selected, also for different numbers of processing threads.

Version 0.10.2

24-06-2016

Version 0.10.2 fixes a critical bug in license validation routines.

Compatibility

Reindexing: Not required, version 0.10.2 will work with the index created by the 0.10.x releases.
Project descriptors: Updates not required.
Custom document sources: Updates not required.

Bug fixes

License validation: A bug has been fixed in license validation routines that could result in valid licenses being omitted.

Version 0.10.1

17-06-2016

Version 0.10.1 fixes a bug in presentation of the document cluster members in treemap view.

Compatibility

Reindexing: Not required, version 0.10.1 will work with the index created by the 0.10.x releases.
Project descriptors: Updates not required.
Custom document sources: Updates not required.

Bug fixes

Incorrect member count in document clusters: In version 0.10.0 Lingo4G Explorer may incorrectly report the number of members of document clusters in the treemap view. Version 0.10.1 fixes the issue.

Version 0.10.0

16-06-2016

Version 0.10.0 introduces highlighting of scope query and selected labels in document texts and more options for the document clusters treemap display in Lingo4G Explorer.

Compatibility

Reindexing: Required. Version 0.10.0 will work with indexes created by the 0.9.x releases, but highlighting will be off. For this reason, we highly recommend reindexing your project from scratch.
Project descriptors: The field content specification has changed and the maxTotalLength property has been removed. The defaults have been slightly adjusted to return shorter snippets.
Custom document sources: Recompilation required due to updated binary dependencies of Lingo4G.

New features

Label highlighting: Version 0.10.0 makes it possible to highlight occurrences of scope query and selected labels in the text of documents retrieved using ad-hoc document retrieval.

Document with the Surface Pro and OneNote labels highlighted; Configuration of scope and label highlighting.
Document clusters treemap configuration: Version 0.10.0 adds new features to the document clusters treemap, including coloring, sizing and labeling of document cells based on the selected document fields.

Improvements

Document clustering for subset of in-scope documents: Versions prior to 0.10.0 would refuse to apply document clustering when the scope contained more than maxDocuments documents. As of version 0.10.0, if you set the trimToMaxDocuments parameter to true, Lingo4G will proceed with clustering a subset of the in-scope documents of size maxDocuments.
Label selection improvements: Version 0.10.0 simplifies label selection and improves its performance and memory footprint. An important change is the option to introduce a configurable amount of randomness to the label selection process, so that some of the less frequent and lower-scoring labels have a chance to be included in the analysis. The randomized label selection process is controlled by the following newly added parameters: randomRatio, randomSeed. Please also see below for the API changes related to this improvement.
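
To make the idea of randomized selection concrete, here is a hypothetical Python sketch in which a random_ratio share of the selected labels is drawn at random from lower-scoring candidates, using a seeded generator so that results are reproducible. The exact semantics of randomRatio and randomSeed in Lingo4G are not described in these notes, so treat this purely as one possible interpretation.

```python
import random


def select_labels(scored_labels, max_labels, random_ratio=0.0, random_seed=0):
    """Pick up to max_labels labels, a random_ratio share of them at random.

    scored_labels is a list of (label, score) pairs.
    """
    ranked = sorted(scored_labels, key=lambda pair: pair[1], reverse=True)
    max_labels = min(max_labels, len(ranked))
    random_count = round(max_labels * random_ratio)
    top_count = max_labels - random_count

    # The highest-scoring labels are always kept.
    selected = [label for label, _ in ranked[:top_count]]
    remainder = [label for label, _ in ranked[top_count:]]

    # A fixed seed keeps the random part of the selection reproducible.
    rng = random.Random(random_seed)
    selected += rng.sample(remainder, min(random_count, len(remainder)))
    return selected
```

With random_ratio set to 0 the sketch reduces to plain top-N selection; with the same seed, repeated calls return the same labels.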

API changes

Removed and renamed parameters

As a result of label selection improvements, the following parameters have been removed:

analysis.labels.maxLabelsOverhead
labels.surface.partOfSpeechFiltering
labels.frequencies.minRelativeDfDeviation
labels.frequencies.maxRelativeDfDeviation
labels.cooccurrences.isolationThreshold
labels.cooccurrences.isolationThresholdWidth
labels.scorers.isolationRatioScorerWeight
labels.cooccurrences.maxOverlap
labels.cooccurrences.maxOverlapDeviation
labels.scorers.overlapRankScorerWeight
labels.scorers.childCountScorerWeight
labels.scorers.dfScorerWeight
labels.scorers.candidateLabelScorerWeight
debug.logBaseLabelPartialScores

The following parameters have been renamed:

analysis.labels.cooccurrences.cooccurrenceWindowSize renamed to analysis.labels.arrangement.relationship.cooccurrenceWindowSize
analysis.performance.cooccurrenceCountingAccuracy renamed to analysis.labels.arrangement.relationship.cooccurrenceCountingAccuracy
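
If you keep analysis parameters as a flat mapping of dotted names, the removals and renames listed above could be applied with a small migration helper such as the Python sketch below. The flat-dictionary representation and the function name are assumptions; real project descriptors may be nested and would need to be flattened first.

```python
# Parameter names copied from the lists above.
REMOVED = {
    "analysis.labels.maxLabelsOverhead",
    "labels.surface.partOfSpeechFiltering",
    "labels.frequencies.minRelativeDfDeviation",
    "labels.frequencies.maxRelativeDfDeviation",
    "labels.cooccurrences.isolationThreshold",
    "labels.cooccurrences.isolationThresholdWidth",
    "labels.scorers.isolationRatioScorerWeight",
    "labels.cooccurrences.maxOverlap",
    "labels.cooccurrences.maxOverlapDeviation",
    "labels.scorers.overlapRankScorerWeight",
    "labels.scorers.childCountScorerWeight",
    "labels.scorers.dfScorerWeight",
    "labels.scorers.candidateLabelScorerWeight",
    "debug.logBaseLabelPartialScores",
}

RENAMED = {
    "analysis.labels.cooccurrences.cooccurrenceWindowSize":
        "analysis.labels.arrangement.relationship.cooccurrenceWindowSize",
    "analysis.performance.cooccurrenceCountingAccuracy":
        "analysis.labels.arrangement.relationship.cooccurrenceCountingAccuracy",
}


def migrate_parameters(params: dict) -> dict:
    """Drop removed parameters and rename the renamed ones; keep everything else."""
    migrated = {}
    for name, value in params.items():
        if name in REMOVED:
            continue
        migrated[RENAMED.get(name, name)] = value
    return migrated
```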

Version 0.9.0

31-03-2016

Version 0.9.0 introduces label arrangements, major improvements to document indexing, many new features in Lingo4G Explorer and much improved documentation.

Compatibility

Reindexing: Required. Version 0.9.0 comes with major improvements to indexing that remove noisy labels and decrease the disk size of the index.
Project descriptors: Updates required. Certain areas of the descriptor have been reorganized and a number of parameters have been removed.
Custom document sources: Updates not required.

New features

Label arrangement

Version 0.9.0 makes it possible to arrange related labels into clusters. Label clusters themselves can be organized into higher-level structures.

Result export dialog: Labels for query office organized into clusters.

Apart from treemap-based presentation, Lingo4G Explorer can show label clusters as a textual list and as a graph.

New public data sets

Version 0.9.0 comes with support for two new public data sets:

Questions and answers from a StackExchange Q&A site, such as superuser.com.
Summaries of research projects funded by the US National Science Foundation and NASA between 2007 and 2015, as available from research.gov.

For more information, see the summary of example data sets.

Documentation updates

Version 0.9.0 comes with significantly more documentation, including conceptual overview of Lingo4G and description of Lingo4G Explorer. Minor documentation additions concern feature extractors and analysis result response syntax.

As of version 0.9.0, all practical examples in the documentation are based on the superuser.com StackExchange data set.

Parameter experiments in Lingo4G Explorer

In version 0.9.0, Lingo4G Explorer adds the Experiments window, which you can use to investigate the impact of various parameter changes on the properties of the analysis result.

Result experiments tool showing how the number of topics depends on the value of the input preference parameter.

Improvements

Indexing improvements

Version 0.9.0 brings significant improvements in the document indexing phase, including:

Keeping numeric tokens in labels, configured by the parameter.
Improved accounting of compound terms that should eliminate truncated labels, such as high-energy x [rays].
Normalization of various kinds of apostrophes.
Removal of globally frequent labels, configured by the parameter.
Decreased disk size of the index.

curl command export

You can now obtain a curl command invocation that will fetch the analysis result data configured in the Lingo4G result export window.

Composite criteria in document retrieval

You can now retrieve the content of documents using composite criteria that allow building complex Boolean queries.

API changes

Document arrangement section reorganized: The document arrangement section of the descriptor has been reorganized to group the algorithm-specific parameters under a unique property. Lingo4G currently comes with one document clustering algorithm, Affinity Propagation, whose parameters are now available in the section.
scope section removed from result response: The scope section has been removed from the analysis result response output, the documentsInScope property has been moved to the summary section of the output.

Version 0.8.0

2015-11-13

Version 0.8.0 improves the performance of document clustering introduced in version 0.7.0. Additionally, it brings a number of small improvements to Lingo4G Explorer.

Compatibility

Reindexing: Not required. Indexes created by version 0.7.0 will work with version 0.8.0.
Project descriptors: Updates not required. Descriptors created for version 0.7.0 will work with version 0.8.0.
Custom document sources: Updates not required.

Improvements

Faster document clustering: Version 0.8.0 adds multi-threaded document clustering. Additionally, in certain cases performance can be further improved by pruning the relationships matrix.
More export options: As of version 0.8.0, you can choose which document fields to output in the Excel/JSON/XML report. Additionally, you can opt to include documents without labels in the output.
Current label view as CSV: You can now copy the contents of the label view, including the added/removed/common status, to clipboard as CSV.
Processing time details and estimates: Version 0.8.0 adds remaining time estimates for long-running tasks. You can see the detailed breakdown of the processing time by hovering the mouse pointer over the total elapsed time statistic.

Version 0.7.0

2015-08-18

Version 0.7.0 is a major new release that adds experimental support for arranging and visualizing documents as flat non-overlapping clusters.

Compatibility

Reindexing: Not required. Indexes created by version 0.6.x will work with version 0.7.0.
Project descriptors: Updates not required. Descriptors created for version 0.6.x will work with version 0.7.0.
Custom document sources: Updates not required.
Java 8 required: As of version 0.7.0, Lingo4G requires Java version 8 or later to run.

New features

Document arrangement: Version 0.7.0 makes it possible to arrange documents into flat non-overlapping clusters. Please see the quick start video for an overview and the documents.arrangement configuration section for a brief description of the involved parameters.

Version 0.6.0

released on 2015-07-06

Version 0.6.0 is a major new release that brings improvements in document indexing, improves label selection and adds document content retrieval to Lingo4G REST API and Explorer application.

Compatibility

Reindexing: Required. Indexes created by version 0.5.x will not work with version 0.6.0.
Project descriptors: Updates required. The 0.6.0 release removes a number of obsolete parameters. Please see the release notes for more details or contact Carrot Search for an updated project descriptor.
Custom document sources: Updates required. Version 0.6.0 updates a number of third-party dependencies and therefore the 0.5.x custom document sources may not work with version 0.6.0.

New features

Improved label selection: Version 0.6.0 improves the quality of label selection by introducing automatic discovery of collection-specific stop labels, accompanied by collection probability label scoring and significant improvements in document text tokenization.
Document content retrieval API: The 0.6.0 release introduces a REST API method for document content retrieval. Additionally, you can now browse the contents of documents in Lingo4G Explorer.
Result export in Lingo4G Explorer: As of the 0.6.0 release, you can export the analysis result directly from Lingo4G Explorer and save it as an Excel, XML or JSON file.
New fields in IMDb and PubMed data sets: Version 0.6.0 parses more fields when indexing the IMDb and PubMed data sets. The new fields for IMDb are: country, rating, keywords, director and genre. The new fields for PubMed are: journal, author, keywords, date, journalName and subject.

Project descriptor changes

New output folder: The output from the analyze command is now saved to a dedicated directory called results (directly under the project's directory). Results were previously saved to the work directory, which is now reserved for internal use by the application.
TF/DF ratio scoring removed: Version 0.6.0 replaces TF/DF ratio scoring with automatic discovery of stop labels and probability ratio scoring. You will need to remove the minTfDfRatio and minTfDfRatioDeviation parameters from your project descriptors.
Original label format is now the default: Version 0.6.0 changes the default value of the labelFormat parameter from LABEL_CAPITALIZED to ORIGINAL to avoid confusion when tuning label surface scoring, such as acronymLabelWeight.

Bug fixes

Label filtering not applied for small scopes: Version 0.6.0 fixes a bug that prevented Lingo4G from applying label filtering (label surface and frequency parameters) when analyzing small subsets of the collection.

Version 0.5.0

released on 2015-05-14

Version 0.5.0 is a major new release that adds an initial implementation of Lingo4G REST API along with a simple browser-based tuning application. Also, in preparation for further development, the 0.5.0 release restructures a number of the basic concepts behind Lingo4G and makes a number of backward-incompatible changes.

Compatibility

Reindexing: Not required, index created by version 0.4.1 will work with 0.5.0.
Project descriptors: Updates required. The 0.5.0 release significantly restructures certain areas of the project descriptor. Please see the release notes for more details or contact Carrot Search for an updated project descriptor.
Custom document sources: Updates not required, custom document source binaries created for version 0.4.x will work with version 0.5.0.

New features

REST API: The 0.5.0 release introduces an initial implementation of Lingo4G REST API. You can use the API to invoke Lingo4G text analysis from your favourite programming language or directly from a browser. You can start the REST API server using the server command.
Lingo4G Explorer: You can use Lingo4G Explorer to interactively tune Lingo4G parameters directly in your browser. See the quick start section for instructions on running Lingo4G Explorer.

Conceptual changes

In preparation for further development, the 0.5.0 release needed to restructure some fundamental concepts behind Lingo4G.

Labels and documents

The two basic entities involved in Lingo4G processing are now labels and documents. A document is the basic unit of information processed by Lingo4G; a label is a specific human-readable feature that occurs in one or more documents. Further releases of Lingo4G will allow arranging both labels and documents into higher-level structures such as clusters or graphs.

Clustering becomes analysis

To accommodate further additions to Lingo4G, such as embedding of labels and documents in 2D spaces, the 0.5.0 release replaces the notion of clustering with the more general notion of analysis. Currently, analysis consists of selecting a set of labels that best describe the subset of documents submitted for analysis.

As a consequence of the clustering to analysis transition, the l4g cluster command has been renamed to analyze and the clustering section of the project descriptor has become the analysis section.

Project descriptor changes

The 0.5.0 release introduces a number of backward-incompatible changes in the project descriptor.

"analysis" section

The clustering section has been renamed to analysis. Furthermore, to account for the changed definition of the "clustering" concept, parameters found in the clustering subsection have been moved to the labels subsection. The labels subsection has been subdivided into the surface, frequencies and cooccurrences subsections.

"labelSource" section

The labelSource section has been renamed to source and put as a subsection of the labels section. The list of feature fields to fetch labels from is now represented by an array of field descriptors rather than two arrays of field names and field weights.

Renamed parameters

A number of parameters in the former clustering.labels (now in analysis.labels.surface) have been renamed:

Old name	New name
`preferredLabelLength`	`preferredWordCount`
`preferredLabelLengthDeviation`	`preferredWordCountDeviation`
`minLabelTokens`	`minWordCount`
`maxLabelTokens`	`maxWordCount`
`minLabelCharacters`	`minCharacterCount`
`minLabelTokenCharacters`	`minWordCharacterCountAverage`
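The renames in the table can be applied mechanically. Below is a minimal sketch, assuming the descriptor is JSON parsed into nested objects. The function name and the sample fragment are hypothetical, and the sketch only renames keys: moving parameters from the former clustering.labels section to analysis.labels.surface is not handled.

```python
# Old-to-new parameter names, copied from the table above.
RENAMED_PARAMETERS = {
    "preferredLabelLength": "preferredWordCount",
    "preferredLabelLengthDeviation": "preferredWordCountDeviation",
    "minLabelTokens": "minWordCount",
    "maxLabelTokens": "maxWordCount",
    "minLabelCharacters": "minCharacterCount",
    "minLabelTokenCharacters": "minWordCharacterCountAverage",
}


def rename_parameters(node):
    """Return a copy of a parsed JSON structure with old keys renamed."""
    if isinstance(node, dict):
        return {
            RENAMED_PARAMETERS.get(key, key): rename_parameters(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_parameters(item) for item in node]
    return node


# Hypothetical pre-0.5.0 fragment; values are placeholders.
old_fragment = {"labels": {"minLabelTokens": 1, "maxLabelTokens": 8}}
new_fragment = rename_parameters(old_fragment)
print(new_fragment)
```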

"output" section

The output section has been restructured to reflect the introduction of labels and documents entities.

Output of label co-occurrences temporarily removed

Release 0.5.0 temporarily removes the option to output label co-occurrences. Further releases will allow outputting generalized relationships between labels and documents, one of which will be label co-occurrences.

Improvements

Minor quality fixes: The 0.5.0 release fixes minor bugs that could degrade the quality of label selection.
JSON override in l4g analyze: As of the 0.5.0 release it is possible to override arbitrary analysis parameters when invoking the analyze command using the -j command line parameter. This can be particularly handy when you export the JSON override strings from Lingo4G Explorer.
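When the override is assembled in a script rather than copied from Lingo4G Explorer, the main pitfall is shell quoting of the JSON string. The following is a minimal sketch of building such a command line. Only the analyze command and the -j parameter come from the notes above; the nesting of the override object and its value are assumptions, and any other arguments the command needs are omitted.

```python
import json
import shlex

# Hypothetical override: the nesting and the value are assumptions,
# only the parameter name minWordCount appears in these release notes.
override = {"labels": {"surface": {"minWordCount": 2}}}

# Serialize the override and pass it as a single argument to -j.
command = ["l4g", "analyze", "-j", json.dumps(override)]

# shlex.join quotes the JSON so that a POSIX shell passes it on unchanged.
print(shlex.join(command))
```

Passing the argument list directly to a process launcher avoids shell quoting altogether; the quoted form is only needed when the command is pasted into a terminal or a shell script.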

Version 0.4.1

released on 2015-04-17

Version 0.4.1 comes with an important bug fix in the clustering algorithm.

Compatibility

Reindexing: Not required, index created by version 0.4.0 will work with 0.4.1.
Project descriptors: Updates not required, project descriptor created for version 0.4.0 will work with version 0.4.1.
Custom document sources: Updates not required, custom document source binaries created for version 0.4.0 will work with version 0.4.1.

Bug fixes

Empty cluster set when clustering a subset of the collection: Lingo4G could erroneously create an empty cluster list when processing a subset of the collection and print a misleading message saying "No candidate labels found, try lowering the DF cut-offs." The 0.4.1 release fixes this issue.
Pure negative queries supported: Version 0.4.1 adds support for pure negative queries in the cluster command. For example, -s "-summary:foo" would select all documents that do not contain the term foo in the summary field.
Assertion errors in indexer: The previous version could throw an assertion error when the number of segments to optimize was equal to 1. Version 0.4.1 fixes this issue.

Version 0.4.0

released on 2015-02-13

Version 0.4.0 comes with a major rewrite of the indexing infrastructure, resulting in optimized memory use, better phrase extraction and tuned resource utilization.

Compatibility

Reindexing

Not strictly required (index created by version 0.3.x will work with 0.4.0), but strongly recommended as the resulting output should contain better features.

Removed indexer options

Several options have been removed from the indexer section of the project descriptor. Project descriptors still carrying these attributes will fail to parse properly.

Indexer type sequential has been removed. Remove the type of the indexer entirely, if it is present in your descriptor file.
indexWriter attribute (and all children attributes) has been removed. The index writer, its buffers and memory allocation, is now adjusted automatically.
Phrase feature contributor's minPhraseDfAtPartialMerge and diskCounterMaxBufferSizeMb attributes have been removed.

Query parser

The default query parser's operator has been changed from OR to AND to be more similar to modern search engines.
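The practical effect of the new default is that a multi-term query without explicit operators now selects fewer documents. The toy matcher below illustrates the difference. It is not Lingo4G's query parser; the documents, the terms and the function name are all made up for the illustration.

```python
def matches(document_terms, query_terms, operator):
    """Check a bag-of-terms document against a query under AND or OR semantics."""
    if operator == "AND":
        return all(term in document_terms for term in query_terms)
    if operator == "OR":
        return any(term in document_terms for term in query_terms)
    raise ValueError(f"unsupported operator: {operator}")


# Made-up documents, each reduced to the set of terms it contains.
documents = {
    "d1": {"text", "mining"},
    "d2": {"data", "mining"},
    "d3": {"data", "storage"},
}

# A two-term query with no explicit operator between the terms.
query = ["data", "mining"]

for operator in ("OR", "AND"):
    hits = sorted(
        doc_id
        for doc_id, terms in documents.items()
        if matches(terms, query, operator)
    )
    print(operator, hits)
```

Under OR all three documents match, while under AND only d2 does. Queries that relied on the old default need an explicit OR between terms to keep their previous scope.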

Improvements

Indexing: The index command has been rewritten to utilize memory and disk more efficiently.
Phrase extraction: A number of improvements to automatic phrase extraction yield better label candidates and, as a result, better clustering output.
Common terms handling: In earlier versions, phrases with leading or trailing common terms could be incorrectly indexed and show up as cluster labels. Version 0.4.0 corrects the handling of such phrases.

Version 0.3.1

released on 2015-01-19

Version 0.3.1 allows specifying the list of fields to cluster on as a parameter of the l4g cluster command and fixes a minor bug in parsing command line arguments.

Compatibility

Reindexing: Not required, index created by version 0.3.0 will work with 0.3.1.
Project descriptors: Updates not required, project descriptor created for version 0.3.0 will work with version 0.3.1.
Custom document sources: Updates not required, custom document source binaries created for version 0.3.0 will work with version 0.3.1.

Improvements

l4g cluster --feature-fields: You can now pass the list of feature fields to use during clustering using the --feature-fields option.

Bug fixes

Incorrect parsing of quoted command line parameters: In earlier versions of Lingo4G it was impossible to pass a command line parameter enclosed in double quotes. For example, the selector query of l4g cluster -s "\"phrase query\"" would be interpreted as phrase query rather than "phrase query". Version 0.3.1 fixes this issue.
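To see what a POSIX shell actually hands to the l4g script for the command above, the tokenization can be reproduced with Python's shlex module. This sketch only shows the shell side of the problem and makes no claim about how Lingo4G parses its arguments internally.

```python
import shlex

# The command line from the example above, exactly as typed in a POSIX shell.
typed = r'l4g cluster -s "\"phrase query\""'

# shlex.split follows POSIX shell quoting rules: the outer double quotes
# group the argument, the escaped inner quotes become part of its value.
arguments = shlex.split(typed)

for argument in arguments:
    print(repr(argument))
```

The last argument keeps its inner double quotes, so the selector the program receives is the phrase query "phrase query" and not two separate terms.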

Version 0.3.0

released on 2014-12-12

Version 0.3.0 fixes a number of major bugs and introduces two small improvements.

Compatibility

Reindexing: Recommended if the source of the data was a Lucene index, see the bug fixes section for details.
Project descriptors: Project descriptors using custom document sources may require an update. Carrot Search will provide the updated project descriptor if needed.
Custom document sources: Updates required, Carrot Search will provide binaries of your custom document sources that will work with the 0.3.0 version.

Improvements

Improved l4g stats

The l4g stats command has received a number of improvements and changes:

Reporting of the size of document term vectors has been added, which may be a useful piece of input for performance tuning.
Reporting of the raw text statistics is disabled by default because the term vector statistics are much more useful for performance tuning. You can get the raw text statistics by passing the --analyze-text-fields option.
The default accuracy of statistics gathering has been lowered from 1.0 to 0.1. The lowered accuracy is still high enough to get a very good estimate of the statistics and leads to much faster processing in the case of large indices. You can set a different accuracy using the -a option.

More flexible l4g cluster -o option

As of Lingo4G 0.3.0 you can also pass a file name to the -o option of l4g cluster to save the clustering results directly to the provided file.

Bug fixes

Some documents sourced from a Lucene index may not get indexed

Earlier versions of Lingo4G could skip some documents during indexing when the source Lucene index consisted of multiple segments or contained deleted documents. Version 0.3.0 fixes this issue.

If the source of documents was a Lucene index, re-indexing is required for Lingo4G to include all the desired documents in its index.

Exception when generating document-cluster assignments

Lingo4G 0.2.0 would throw an exception when the project descriptor had the output.components.assignments.enabled property set to true, which effectively prevented generating document-to-cluster assignments. Version 0.3.0 fixes this issue.

Use 24-hour clock in log file names

Version 0.3.0 switches to the 24-hour clock for log file names, so that sorting by file name produces a chronological order of log files.
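The effect of the clock format on sort order can be checked with a few timestamps. This is a minimal sketch: the file name pattern is hypothetical and is not the pattern Lingo4G uses, only the 12-hour versus 24-hour distinction matters here.

```python
from datetime import datetime

# Three runs on the same day, listed in chronological order.
runs = [
    datetime(2014, 12, 12, 9, 30),
    datetime(2014, 12, 12, 13, 5),
    datetime(2014, 12, 12, 23, 45),
]


def name_12h(timestamp):
    # %I is the zero-padded hour on a 12-hour clock (hypothetical pattern).
    return timestamp.strftime("l4g-%Y-%m-%d_%I-%M-%S.log")


def name_24h(timestamp):
    # %H is the zero-padded hour on a 24-hour clock (hypothetical pattern).
    return timestamp.strftime("l4g-%Y-%m-%d_%H-%M-%S.log")


print(sorted(name_12h(run) for run in runs))
print(sorted(name_24h(run) for run in runs))
```

With the 12-hour pattern the 13:05 run is named with hour 01 and sorts before the 09:30 run; with the 24-hour pattern the sorted names follow the order in which the runs happened.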

Incorrect total time in log files

Version 0.2.0 would always report zero total processing time in log files. Version 0.3.0 fixes the issue.

Version 0.2.0

released on 2014-12-04

Compatibility

Reindexing: Recommended. Version 0.2.0 introduces more flexible configuration of document field indexing. It is recommended to re-index your data to keep the index synchronized with the updated project descriptor.
Project descriptors: Update required to convert the 0.1.x-style document field indexing definition to the syntax updated in 0.2.x. Carrot Search will provide the updated project descriptors for early adopters of Lingo4G.
Custom document sources: Updates required, Carrot Search will provide binaries of your custom document sources that will work with the 0.2.0 version.

New features

Improved document field indexing configuration

Version 0.2.0 changes the document field indexing configuration syntax to allow more flexibility. With the new syntax it is possible to reduce the size of the Lingo4G index by not storing the original text of the field and/or its search index while retaining the possibility to apply clustering to that field. Please see the documentation of the fields section for more details and examples.

Carrot Search will provide the updated project descriptors for early adopters of Lingo4G.

Dedicated by-identifier document selection syntax: Version 0.2.0 adds dedicated syntax for selecting documents for clustering based on their identifiers. You can use this syntax to efficiently select thousands, or even tens or hundreds of thousands, of documents by the value of an identifier field.

Automatic minPerSegmentDf: Version 0.2.0 adds support for the auto value for the performance.minPerSegmentDf parameter, in which case the appropriate value will be computed based on the clusters.minClusterSize parameter. In most cases, the auto setting will improve the clustering performance.

Improvements

Maximum field length in label-document assignment result: Lingo4G now limits the maximum number of characters written for each field in the label-document assignment result to prevent accidentally writing very large amounts of content to the result file. You can change the default length limit using the output.components.assignments.maxFieldLength option.

Bug fixes

Shell scripts return code 0 for empty cluster lists: Lingo4G 0.1.x l4g shell scripts would return a non-zero code when the list of clusters was empty. To reserve the non-zero codes for actual execution errors, as of version 0.2.0 Lingo4G launch scripts will return zero also when the execution completes successfully but with an empty cluster list.
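With non-zero codes reserved for execution errors, a calling script can treat the exit status as a plain success flag. Below is a minimal sketch of such a check. The helper name is hypothetical, and the Python interpreter is used as a stand-in for the l4g launch script so that the sketch runs without Lingo4G installed.

```python
import subprocess
import sys


def finished_without_error(command):
    """Run a command and report whether its exit status was zero."""
    completed = subprocess.run(command)
    return completed.returncode == 0


# Stand-ins for a successful run and a failed run of a launch script.
successful_run = [sys.executable, "-c", "raise SystemExit(0)"]
failed_run = [sys.executable, "-c", "raise SystemExit(3)"]

print(finished_without_error(successful_run))
print(finished_without_error(failed_run))
```

A caller that also needs to know whether the cluster list was empty has to inspect the output, since an empty list no longer changes the exit status.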

Version 0.1.2

released on 2014-11-25

Version 0.1.2 fixes a major bug in cluster label candidate selection present in the 0.1.0 and 0.1.1 releases.

Compatibility

Reindexing: Not required, index created by previous 0.1.x versions will work with 0.1.2.
Project descriptors: Updates not required, project descriptor created for earlier 0.1.x releases will work with 0.1.2.
Custom document sources: Updates not required, custom document source binaries created for previous 0.1.x versions will work with version 0.1.2.

Bug fixes

No clusters when clustering a subset of the index: Versions 0.1.0 and 0.1.1 could occasionally generate an empty cluster list when processing a fairly large subset of the index. In such cases the message "No candidate labels found, try lowering the DF cut-offs." would be printed. Version 0.1.2 fixes the issue.

Version 0.1.1

released on 2014-11-24

Version 0.1.1 introduces a number of small improvements, bug fixes and documentation clarifications.

Compatibility

Reindexing: Not required, index created by version 0.1.0 will work with 0.1.1.
Project descriptors: Updates not required, project descriptor created for version 0.1.0 will work with version 0.1.1.
Custom document sources: Updates not required, custom document source binaries created for version 0.1.0 will work with version 0.1.1.

Improvements

Re-indexing into non-empty index requires explicit confirmation

Version 0.1.0 would silently discard the existing index when re-indexing. To avoid accidental deletion of the index, version 0.1.1 will only overwrite the existing non-empty index if the --force option is provided.

Cygwin and Mingw

When running Lingo4G in Cygwin or Mingw, use the l4g.cmd script so that Lingo4G can correctly resolve file paths. As of version 0.1.1 the l4g Bash launch script will refuse to run under Cygwin and Mingw.

Version information

You can now get detailed Lingo4G version information by running

l4g version

Unlimited number of clauses in selection query

The document selection query can now use an unlimited number of clauses, which makes it possible to select large numbers of documents for clustering, for example by their identifiers (id:d1 OR id:d5 OR id:d47 ...).

Planned: The performance of selecting thousands of documents using the OR syntax is currently very low. Further releases of Lingo4G will come with a dedicated syntax for by-id selection and much better performance characteristics.
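A selection query of the id:d1 OR id:d5 OR id:d47 form is easy to generate from a list of identifiers. The following is a minimal sketch; the helper name is hypothetical, the field name id is taken from the example above, and the identifiers are assumed to be simple tokens that need no escaping in the query syntax.

```python
def identifiers_to_query(identifiers, field="id"):
    """Join identifiers into a single OR query over the given field."""
    return " OR ".join(f"{field}:{identifier}" for identifier in identifiers)


# Identifiers from the example above.
query = identifiers_to_query(["d1", "d5", "d47"])
print(query)
```

The resulting string can be passed as the selection query; for thousands of identifiers the performance caveat above applies.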

Bug fixes

l4g shell scripts return codes: As of version 0.1.1, the l4g shell scripts correctly return execution status codes.

Version 0.1.0

released on 2014-11-07

Initial alpha release.