Carrot Search Lingo4G Releases
Release notes, versions up to 1.7.1
This document contains Lingo4G release notes including information on new features, changes, bug fixes and upgrade considerations. Please see the Lingo4G reference for the full manual and options reference.
Version 1.7.1
10-09-2018
The 1.7.1 release fixes a number of issues found in earlier releases.
Compatibility
- Reindexing
- Not required. Lingo4G 1.7.1 will work with indexes created by version 1.7.0.
- Project descriptors
- Updates not required.
- Custom document sources
-
Document sources compatible with version 1.7.0 will work with version 1.7.1.
Improvements
- Query text area
-
Lingo4G Explorer in version 1.7.1 replaces the single-line query text box with a multi-line text area for easier input of long scope queries.
Bugs
- Exception when requesting XML output
-
Versions 1.6.0, 1.6.1 and 1.7.0 would throw an exception when analysis output in XML format was requested. Version 1.7.1 fixes the issue.
- Field length limits ignored in Export dialog
-
Explorer's document results export dialog would ignore the descriptor-defined field output settings. For example, a maximum of 160 characters per field would be output, regardless of the limit defined in the descriptor.
Version 1.7.1 uses the descriptor-defined field output settings when exporting analysis results from Explorer.
- CygWin and Java 1.8
-
The launcher script l4g did not detect Java 1.8 properly under CygWin and failed to start the Java virtual machine.
Version 1.7.0
05-09-2018
The 1.7.0 release is about scaling Lingo4G to terabyte-sized data sets. The new version significantly speeds up indexing and analysis of such large collections. You can test the new capabilities on the newly added US Patent and Trademark Office data set, which contains almost 500 GB of text.
Release 1.7.0 also improves the process of fetching labels describing in-scope documents to better eliminate boiler-plate labels and increase performance when analyzing small subsets of very large indices.
Compatibility
- Reindexing
-
Recommended. Lingo4G 1.7.0 removes redundant information from the feature index, which may lower its size by 20–30%. Further index size improvements can be achieved by changing the newly introduced maxPhrases and maxPhrasesPerField indexing parameters.
- Project descriptors
- Updates required. The update of the label fetching algorithm removed a number of parameters; see the Project descriptor changes section for a detailed list of changes to apply. If you have problems upgrading your project descriptor, get in touch.
- Custom document sources
-
Document sources compatible with 1.6.x releases will work with version 1.7.0.
New features
- US patents data set
-
This release comes with support for the patent grant and application data available from the United States Patent and Trademark Office. Please note that the whole data set contains nearly 500 GB of indexable text and needs a high-end machine to process.
- Scalability of indexing
-
Version 1.7.0 introduces a number of indexing parameters to control performance and index size when working with multi-gigabyte or terabyte-sized collections.
The maxPhrases parameter can help to keep the total number of indexed phrases within a reasonable limit and thus keep the index small and noise-free.
You can use the samplingRatio parameter to run label extraction based on a sample of the indexed documents. When indexing collections of millions of documents, sampling can significantly speed up indexing with negligible loss of accuracy.
Finally, the maxPhrasesPerField parameter makes it possible to index only a number of top-frequency labels for each document. When indexing very long documents, indexing top-frequency labels can lower index size and speed up analysis.
- Label fetching rewritten
-
Version 1.7.0 comes with significant updates to the process of fetching labels describing in-scope documents. The new algorithm should better eliminate boiler-plate labels, such as control group for medical papers, and offer stable high performance when processing small subsets of very large indices. The algorithm is controlled by the newly-added maxLabelsPerDocument parameter.
Due to changes in label scoring, the default values of the idfScorerWeight and tfScorerWeight have been changed to 1.0. To make the most of the new label selection algorithm, make sure your project descriptor and Lingo4G Explorer settings do not override these parameters.
Finally, as part of the label fetching algorithm update, the following parameters have been removed: minPerSegmentDf, maxPerSegmentDf, maxSubsetSizeForTermVectorScan, randomRatio and randomSeed.
- Document deletion
-
A new command to delete documents from the index has been added: l4g delete. Documents to delete can be selected using a regular Lucene query.
Improvements
- Document similarity computation speed-up
-
Version 1.7.0 significantly speeds up computation of document-to-document similarities, which is required to perform document clustering and embedding. On very large indices with millions of documents the speed-up can reach 600%.
- Indexing performance bottleneck removed
-
An indexing performance bottleneck affecting collections with a large number of very short documents has been identified and addressed.
- Feature index size reduced
-
We removed some redundant information from the feature index, which should result in space savings of 20% to 30%; exact numbers may vary. Reindexing is highly recommended.
- Sorting map legend entries in Explorer
-
You can now sort the entries in the document map legend by the number of occurrences or alphabetically by the label. You can change the sorting order by clicking the icon in the top-right corner.
- Complement criteria type
-
A new complement criteria type has been added to the document retrieval API.
- Count of field values
-
A new property valueCount can be returned for the selected document fields. This setting needs to be enabled by setting the valueCount property for a corresponding field to true in the request or the descriptor.
- Security updates
-
Jetty, the built-in HTTP server, has been upgraded to version 9.3.24.v20180605 to address security vulnerabilities.
Bugs
- Trimming marker and maxValues
-
When maxValues is used on a multi-valued field to limit the number of returned values, the truncationMarker is appended as the "extra" value when the limit is exceeded. This change makes it possible to skip adding this extra value if the trimming marker is an empty string.
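The trimming behavior described above can be sketched in a few lines of Python. This is only an illustration of the described rule, not the Lingo4G implementation; the function name `trim_values` and its arguments are hypothetical.

```python
def trim_values(values, max_values, truncation_marker):
    """Return at most max_values values, plus the marker if anything was cut.

    Hypothetical helper mirroring the described behavior; not Lingo4G code.
    """
    if len(values) <= max_values:
        # Nothing trimmed, so no marker is added.
        return list(values)
    trimmed = list(values[:max_values])
    # An empty marker suppresses the "extra" value, as described for 1.7.0.
    if truncation_marker:
        trimmed.append(truncation_marker)
    return trimmed
```

With a non-empty marker the caller can tell that the list is incomplete; with an empty marker the result contains only genuine field values.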
API changes
- Document criteria cleanup
-
The document retrieval section of the REST API has been cleaned up to be consistent with the scope.selector element. Existing code will still work in this version, but will emit deprecation warnings on the server.
There are three changes that require adjustments to existing code.
-
The criteria element on an analysis document retrieval JSON becomes a selector element, for example:
{ "limit": 10, "selector": { "type": "forLabels", "labels": [ "data mining", "KDD" ], "operator": "OR" } }
-
The composite criteria definition is now consistent with the composite scope selector. For example, what previously could read:
{ "limit": 10, "criteria": { "type": "composite", "operator": "AND", "criteria": [ { "type": "forLabels", "labels": [ "email" ] }, { "type": "forLabels", "operator": "OR", "labels": [ "Thunderbird", "Outlook" ] } ] } }
would now read:
{ "limit": 10, "selector": { "type": "composite", "operator": "AND", "selectors": [ { "type": "forLabels", "labels": [ "email" ] }, { "type": "forLabels", "operator": "OR", "labels": [ "Thunderbird", "Outlook" ] } ] } }
-
The complement criteria definition is now consistent with the complement scope selector. This means that the nested criteria array element becomes a single selector (multiple selectors can be combined with a composite). For example, what previously could read:
{ "limit": 10, "criteria": { "type": "composite", "operator": "AND", "criteria": [ { "type": "forLabels", "labels": [ "email" ] }, { "type": "complement", "criteria": [ { "type": "forLabels", "operator": "OR", "labels": [ "Thunderbird", "Outlook" ] } ] } ] } }
would now become:
{ "limit": 10, "selector": { "type": "composite", "operator": "AND", "selectors": [ { "type": "forLabels", "labels": [ "email" ] }, { "type": "complement", "selector": { "type": "forLabels", "operator": "OR", "labels": [ "Thunderbird", "Outlook" ] } } ] } }
-
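The three renames above are mechanical, so existing request payloads could be migrated with a small script. The following Python sketch is one possible approach, assuming requests are held as plain dictionaries; the function names are hypothetical and not part of any Lingo4G API.

```python
def criteria_to_selector(node):
    """Recursively rewrite an old-style criteria definition as a selector."""
    converted = {k: v for k, v in node.items() if k != "criteria"}
    nested = node.get("criteria")
    if nested is None:
        # Leaf definition such as forLabels: nothing to rename.
        return converted
    children = [criteria_to_selector(child) for child in nested]
    if node.get("type") == "complement":
        # complement now takes exactly one selector; combining several
        # requires a composite whose operator must be chosen by hand.
        if len(children) != 1:
            raise ValueError("wrap multiple complement criteria in a composite")
        converted["selector"] = children[0]
    else:
        # composite: the nested "criteria" array becomes "selectors".
        converted["selectors"] = children
    return converted


def migrate_retrieval_request(request):
    """Rename the top-level criteria element to selector."""
    migrated = {k: v for k, v in request.items() if k != "criteria"}
    if "criteria" in request:
        migrated["selector"] = criteria_to_selector(request["criteria"])
    return migrated
```

The script deliberately refuses to guess how several nested complement criteria should be combined, because the operator of the wrapping composite changes the meaning of the request.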
Project descriptor changes
- Removed properties
-
Due to the update of the label fetching algorithm, a number of descriptor properties have been removed. If your descriptor contains any of those properties, remove them to make your descriptor compatible with version 1.7.0.
Removed properties:
labels.performance.minPerSegmentDf
labels.performance.maxPerSegmentDf
labels.performance.maxSubsetSizeForTermVectorScan
Configuration of old label fetching algorithm removed in version 1.7.0.
labels.scorers.randomRatio
labels.scorers.randomSeed
Configuration of randomized label selection removed in version 1.7.0.
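Removing the listed properties could be automated along the following lines. This is a sketch that assumes the descriptor has been loaded as a plain JSON object and that each dotted property name maps to nested objects; the helper names are hypothetical.

```python
# Property paths listed as removed in version 1.7.0.
REMOVED_IN_1_7_0 = [
    "labels.performance.minPerSegmentDf",
    "labels.performance.maxPerSegmentDf",
    "labels.performance.maxSubsetSizeForTermVectorScan",
    "labels.scorers.randomRatio",
    "labels.scorers.randomSeed",
]


def remove_property(descriptor, dotted_path):
    """Delete a nested key in place; return True if it was present."""
    *parents, leaf = dotted_path.split(".")
    node = descriptor
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return False  # Path does not exist in this descriptor.
    if leaf in node:
        del node[leaf]
        return True
    return False


def strip_removed_properties(descriptor):
    """Remove all 1.7.0-removed properties; return the paths actually removed."""
    return [p for p in REMOVED_IN_1_7_0 if remove_property(descriptor, p)]
```

The returned list makes it easy to log what was changed before writing the descriptor back to disk.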
Version 1.6.1
11-05-2018
Version 1.6.1 provides a number of small improvements.
Compatibility
- Reindexing
- Not required. Lingo4G 1.6.1 will work with indexes created by version 1.6.0.
- Project descriptors
- Updates not required.
- Custom document sources
-
Document sources compatible with version 1.6.0 will work with version 1.6.1.
When upgrading from version 1.5.1 or earlier, please see version 1.6.0 compatibility notes for the required changes.
Improvements
- REST reload response
-
The reload REST endpoint has been improved to minimize the risk of an unchecked exception when there is an active reindexing or indexing process running in the background and modifying the index. If this is detected, an HTTP 503 Service Unavailable response will be returned to the client.
Regardless of those checks, we advise issuing the reload call only after any index-manipulating commands have completed.
- Improved label coverage computation performance
-
Version 1.6.1 modifies the algorithm used to compute the number of labeled documents to improve its performance on both SSD and HDD drives.
- Less noisy log files
-
We have removed some log entries and altered others to make the log files less noisy.
Version 1.6.0
16-04-2018
Version 1.6.0, the largest upgrade to date, adds two major new features: support for incremental indexing of documents and 2d map-like visualization of sets of documents.
Additionally, it comes with numerous Lingo4G Explorer improvements, an increase in indexing speed and many API clean-ups.
Compatibility
- Reindexing
-
Required. Lingo4G 1.6.0 will not work with indexes created by any previous version. Full reindexing is required because of structural changes to how the index is stored and maintained.
Any previous indexes need to be manually removed from disk; the l4g index --force option will not work.
- Project descriptors
-
Updates required. A number of changes have been made to clean up project descriptors and provide cleaner separation of concerns. See the Project descriptor changes section for a detailed list of changes to apply. If you have problems upgrading your project descriptor, get in touch.
- Custom document sources
-
Updates required. Minor updates, such as method signature changes, will be required to update custom document sources to compile with version 1.6.0.
New features
- Document map visualization
-
Version 1.6.0 introduces the document embedding feature, which places documents in 2d space in such a way that textually-similar documents are close to each other.
Based on document embedding, Lingo4G Explorer introduces the document map view, a tool for interactive visualization and exploration of document collections.
- Incremental indexing
-
Starting with the 1.6.0 release, documents can be added or updated in the index without re-processing the whole collection. Please refer to the manual to see how incremental indexing works, which document sources are supported and what the caveats are with regard to running the REST server on the index that can change on the fly.
Lingo4G Explorer improvements
- Document summary view
-
Lingo4G Explorer introduces the document summary view, which displays themes and topics discovered in the currently selected documents.
- Analyzing currently selected documents
-
As of version 1.6.0, you can press the Analyze link in the document content or summary views to analyze the currently selected set of documents.
- Query syntax and parameter help
-
Lingo4G Explorer now comes with quick help for analysis parameters and scope query syntax explanation and examples.
- Processing scope limit escalation
-
To prevent unintended long-running analyses, starting with the 1.6.0 release, the default limit on the number of analyzed documents is 10k. The limit can be increased or lifted entirely by the user.
If the processing scope contains more documents than the current limit, Lingo4G Explorer will display an icon next to the scope size information. If you hold your mouse pointer over that area, you will see buttons for escalating the scope limit, along with rough estimates of how long the analysis might take.
- Parameter panel collapsing
-
You can now click the icon to collapse the parameters panel to make more space for interaction with the analysis results.
- Advanced parameters filter
-
Lingo4G Explorer now shows only the basic analysis parameters by default. To show all available analysis parameters, click the icon.
- Document map view is now the default
-
As of the 1.6.0 release, the default analysis result view in Lingo4G Explorer is the document map view. The order of tabs has also been changed to put document clusters first, followed by label clusters and label list.
Other improvements
- byQuery ad-hoc retrieval criteria added
-
As of version 1.6.0 you can use the byQuery criteria for ad-hoc document retrieval. This makes it possible to narrow down the set of in-scope documents to those matching an arbitrary user query.
- More precise document relationships
-
Version 1.6.0 adds the minMatchingQueryLabels parameter separately for document clustering and embedding. This makes it possible to filter out weak relationships based on, for example, only one common label shared by documents.
- More precise retrieval with forLabels criteria
-
It is now possible to restrict the set of documents retrieved using the forLabels criteria to those containing not fewer than the specified number of the requested labels. The minimum number of labels can be passed using the minOrMatches criteria parameter.
- White space normalization
-
When queries entered by users contained invisible white space (special Unicode character sequences), query parsers typically returned no matching documents, confusing users. We added white space normalization to all query parsers, controlled by the sanitizeSpaces parameter. By default, all Unicode white space characters are normalized to a single plain white space.
- Dependency upgrades
-
All software dependencies have been updated to their latest stable versions. This includes migration of indexes to use Lucene 7.2.1.
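To make the white space normalization item above more concrete, here is a minimal Python sketch of the idea. It assumes that each Unicode white space character is replaced with one plain space (U+0020); whether Lingo4G also collapses runs of spaces is not stated in these notes, and the function name `sanitize_spaces` is hypothetical.

```python
def sanitize_spaces(query):
    """Replace every Unicode white space character with a plain space.

    Illustrative only; relies on Python's str.isspace() definition of
    white space, which may differ from the one used by Lingo4G.
    """
    return "".join(" " if ch.isspace() else ch for ch in query)


# A no-break space (U+00A0) pasted from a web page looks like a normal
# space but would otherwise be treated as part of the term.
print(sanitize_spaces("data\u00a0mining"))
```

After normalization the query is split into the two terms the user intended, rather than one term containing an invisible character.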
API changes
The 1.6.0 release fundamentally changes the way Lingo4G indices are created and stored. While most of the changes are transparent to the end user, we took this opportunity to clean up certain aspects of the public APIs and project descriptors.
This section lists the general functional changes. For a list of the required project descriptor updates, see the project descriptor changes section.
- Processing scope limits apply for all analysis facets
-
Versions before 1.6.0 applied processing scope limits only during document clustering through the maxDocuments and trimToMaxDocuments parameters.
Version 1.6.0 applies the scope size limit across all analysis artifacts, including label list, label clustering, document clustering and embedding. As a result, the document clustering-specific limit parameters have been removed and replaced with the scope.limit parameter.
Additionally, to avoid unintended long-running analyses, the default scope limit is set to 10000 documents. You can override the limit by editing your project descriptor or by providing a new per-analysis limit value. To remove the limit entirely, pass the unlimited string as the scope limit.
- Probability ratio-based label scoring removed
-
The 1.6.0 release removes the component of label score computation based on scope-to-collection occurrence probability ratios. While that scoring component was effective for narrowly-focused topic-specific scopes, it would significantly lower the quality of labels for scopes being all-topics samples of the full collection.
As a result, all probability ratio-based parameters have been removed from the descriptor.
- Output of raw label and document relationships removed
-
Version 1.6.0 removes the possibility to output raw similarity matrices for labels and documents. As a result, the related parts of the descriptor and output JSON have been removed.
- Index format changes
-
Version 1.6.0 changes the internal layout of the index. Documents and features are now stored separately. As a result, indexing requires only one iteration over the documents provided by the document source; any remaining passes required for feature discovery are executed on the contents of the index.
A side-effect of this is that labels can now be recomputed based on the current content of the index. This can save processing time when new features need to be generated after indexing parameters or dictionaries change.
- Caching of source documents removed
-
Due to the new index structure, there is no need for explicit caching of source documents. Therefore, the cacheSource parameter has been removed from the project descriptor and the l4g.cacheSource system property controlling this feature is no longer recognized.
- Incremental indexing interfaces
-
An initial version of the incremental indexing API has been added (IIncremental interface), with two example implementations in the examples (JsonDocumentSource and JsonRecordsDocumentSource).
The incremental indexing API is still experimental and may change in the future.
Project descriptor changes
- Removed properties
-
A number of descriptor properties have been removed, usually as a consequence of refactorings and removal of Lingo4G features. See the API changes section for a detailed description of the related functional changes.
Removed properties:
name
Human-readable project name.
documents.arrangement.maxDocuments
documents.arrangement.trimToMaxDocuments
Document-clustering specific scope limit, now replaced with a general scope limit.
labels.relationships
documents.relationships
Configuration of the output of label-label and document-document relationships.
labels.probabilities.probabilityRatioPreference
labels.probabilities.probabilityRatioThreshold
labels.probabilities.probabilityRatioPreferenceStrength
labels.probabilities.probabilityRatioMaxRelativeScopeSize
labels.scorers.probabilityRatioScorerWeight
Parameters of probability ratio-based label scoring.
cacheSource
Configuration of source document caching.
- Scope specification refactored
-
The parameter specifying scope size limit has been moved from scope.byQuery.limit to scope.limit.
Additionally, scope selector definition has been moved to the dedicated scope.selector element to signify that the scope limit applies to all types of selectors. See scope element documentation for the current structure of scope definition.
- Arrays replaced with objects
-
The indexer.features element is an object, not an array. Previously the declaration of feature extractors required an array of objects, each with a key attribute to identify the feature extractor. The features section is now an object, with each key denoting a unique extractor and the associated value containing its definition. So, a previous JSON of:
"indexer": { "features": [ { "key": "coarse-phrases", "type": "phrases", ... } ] }
now becomes:
"indexer": { "features": { "coarse-phrases": { "type": "phrases", ... } } }
The top-level queryParsers element is an object, not an array. Previously the declaration of query parsers required an array of objects, each with a key attribute to identify the query parser. The queryParsers section is now an object, with each key denoting a unique query parser and the associated value containing its definition. So, a previous JSON of:
"queryParsers": [ { "key": "standard", "type": "standard", ... } ]
now becomes:
"queryParsers": { "standard": { "type": "standard", ... } }
The top-level dictionaries element is an object, not an array. Previously the declaration of dictionaries required an array of objects, each with a key attribute to identify the dictionary. The dictionaries section is now an object, with each key denoting a unique dictionary and the associated value containing its definition. So, a previous JSON of:
"dictionaries": [ { "key": "default", "type": "simple", ... } ]
now becomes:
"dictionaries": { "default": { "type": "simple", ... } }
- fields moved to top level
-
The declaration of fields, previously:
{ "source": { "fields": { ... } } }
is now moved to top-level:
{ "fields": { ... }, "source": { ... } }
- More strict fields specification
-
The field specification is now more strict with respect to which attributes are permitted for each type of field: only text fields can declare analyzers, and the stored attribute has been removed without replacement (all fields are now stored in the index to allow incremental indexing and recomputation of features).
- Document identifiers
-
A Boolean id attribute can be defined on exactly one field to indicate the identifier of a document required for document updates in incremental indexing.
- classpath element moved
-
The top-level classpath element has been moved under source.classpath.
- Project directory configuration consolidated
-
The top-level projectDirectory, workDirectory, resultsDirectory and indexDirectory elements have been removed. There is a new top-level element directories declaring key project directories, but changes or overrides to this element at the project-descriptor level are discouraged.
- Substitution variables renamed
-
Substitution variable project.directory, pointing at the directory of the project descriptor, has been renamed to l4g.project.dir. The only other substitution variable available is l4g.home, pointing at the installation directory of Lingo4G. No other project locations are exposed as substitution variables.
- Document source classes renamed
-
Names of document source classes shipped with Lingo4G have changed. If your project descriptor makes use of any Lingo4G document source (from the examples), note that the naming convention has changed from ...DocumentSourceFragment to ...DocumentSourceModule, reflecting the fact that these classes implement Guice modules that provide an implementation of a document source.
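Several of the structural descriptor changes listed above (keyed arrays becoming objects, fields moving to the top level, classpath moving under source) are mechanical and could be scripted. The Python sketch below shows one way to do so, assuming the descriptor is loaded as a plain JSON object; the function names are hypothetical, and the remaining changes, such as removed properties and renamed classes, still need to be applied by hand.

```python
import copy


def keyed_array_to_object(items):
    """Turn [{"key": "a", ...}, ...] into {"a": {...}, ...}."""
    result = {}
    for item in items:
        definition = dict(item)
        key = definition.pop("key")
        result[key] = definition
    return result


def migrate_descriptor_structure(old):
    """Apply the array-to-object, fields and classpath moves to a copy."""
    new = copy.deepcopy(old)
    # indexer.features: array of keyed objects becomes an object.
    features = new.get("indexer", {}).get("features")
    if isinstance(features, list):
        new["indexer"]["features"] = keyed_array_to_object(features)
    # Top-level queryParsers and dictionaries follow the same pattern.
    for name in ("queryParsers", "dictionaries"):
        if isinstance(new.get(name), list):
            new[name] = keyed_array_to_object(new[name])
    # source.fields moves to the top level.
    source = new.get("source")
    if isinstance(source, dict) and "fields" in source:
        new["fields"] = source.pop("fields")
    # Top-level classpath moves under source.
    if "classpath" in new:
        new.setdefault("source", {})["classpath"] = new.pop("classpath")
    return new
```

The function works on a deep copy, so the original descriptor object can be kept for comparison or backup.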
Bugs
- Incorrect sorting of document cluster sets
-
Previous versions of Lingo4G Explorer would incorrectly sort document cluster sets. Now, cluster sets are sorted by the decreasing number of documents in the cluster set.
Version 1.5.2
12-12-2017
Version 1.5.2 fixes a significant issue with label exclusion dictionaries. Update is strongly recommended.
Compatibility
- Reindexing
- Not required. This release will work with indexes created by any prior 1.5.x version.
- Project descriptors
- Updates not required.
- Custom document sources
-
Document sources are compatible with any 1.5.x-compatible version.
Bugs
- Label exclusion dictionaries
-
A significant bug was present in the cache fingerprint calculation routine of simple label exclusion dictionaries. This could manifest itself in hash collisions between dictionaries that had different exclusion rules, but an identical set of terms. For example, the following two dictionaries would be considered identical:
* foo bar foo
foo bar
While these are obviously different exclusion rules (that would apply to different patterns), Lingo4G would compile the first dictionary it sees and reuse it for any subsequent requests. This could manifest itself in hard-to-reproduce odd behavior at runtime.
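The class of bug described above can be illustrated with a short Python sketch. It is not the actual Lingo4G fingerprint routine: both functions are hypothetical, and the sketch assumes that the flawed fingerprint looked only at the set of distinct non-wildcard terms, while a sound one covers the full text of every rule.

```python
import hashlib


def fingerprint_terms_only(rules):
    """Flawed: hashes only the set of distinct terms, ignoring rule structure."""
    terms = sorted({t for rule in rules for t in rule.split() if t != "*"})
    return hashlib.sha256(" ".join(terms).encode("utf-8")).hexdigest()


def fingerprint_full_rules(rules):
    """Sound: hashes the complete rule text, so different rules differ."""
    return hashlib.sha256("\n".join(rules).encode("utf-8")).hexdigest()


first = ["* foo bar foo"]   # Example rules taken from the text above.
second = ["foo bar"]

# Same set of terms, so the flawed fingerprint collides and a cache keyed
# on it would reuse the first compiled dictionary for the second one.
print(fingerprint_terms_only(first) == fingerprint_terms_only(second))
print(fingerprint_full_rules(first) == fingerprint_full_rules(second))
```

A cache keyed on the first fingerprint cannot distinguish the two dictionaries, which matches the hard-to-reproduce behavior described in this entry.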
Version 1.5.1
08-08-2017
Version 1.5.1 provides a number of minor improvements and bug fixes.
Compatibility
- Reindexing
- Not required. Lingo4G 1.5.1 will work with indexes created by version 1.5.0.
- Project descriptors
- Updates not required.
- Custom document sources
-
Document sources compatible with version 1.5.0 will work with version 1.5.1.
Bugs
- Incorrect document cluster selection highlighting
-
Lingo4G Explorer 1.5.0 might highlight a cluster different than the one the user clicked to select.
Improvements
- IE compatibility view
-
A special meta tag has been added to Lingo4G Explorer's front page to allow it to bypass Internet Explorer's intranet "compatibility mode" policy.
Version 1.5.0
30-06-2017
Version 1.5.0 introduces hierarchical clustering of documents and implements a number of minor improvements and bug fixes.
Compatibility
- Reindexing
- Recommended. Lingo4G 1.5.0 enables document length normalization, which will take effect only after re-indexing.
- Project descriptors
- Updates required. A number of parameters have been moved, descriptors need to be updated to account for those changes.
- Custom document sources
-
Document sources compatible with version 1.4.0 will work with version 1.5.0.
New features
- Hierarchical document clustering
-
Version 1.5.0 adds support for hierarchical clustering of documents. Similarly to label clustering, relationships between document clusters are established by allowing cluster exemplars to be themselves members of other clusters.
As with label clustering, Lingo4G Explorer presents document arrangements as a flattened two-level structure of cluster sets and clusters. Hierarchical clustering of documents can be controlled by the softening parameter.
Improvements
- Analysis scope size limit in Explorer
-
Version 1.5.0 adds the possibility to limit the number of analyzed documents in Lingo4G Explorer. To apply the limit, check the check box and choose the maximum number of documents to process.
The same limit can be applied when calling Lingo4G REST API by setting the limit parameter.
- Document length norms
-
Version 1.5.0 enabled the storage of document length normalization factors for feature fields. This will help to compute more consistent document similarity values when the index contains length-imbalanced documents.
Document normalization factors are computed and stored during indexing. Therefore, for normalization to take effect, data needs to be re-indexed.
- minDocumentLabels parameter added
-
The minDocumentLabels parameter has been added to control which documents are included in document relationship computation and clustering.
- Output of label similarities
-
(Removed in 1.6.0.) You can now set labels.relationships.enabled to true to retrieve the matrix of similarities between labels. The output of the relationships matrix can also be enabled in Lingo4G Explorer's Export window.
- Multivalued fields and trimming
-
If a field with multiple values is trimmed because of the limit set in maxValues, it will be returned with an additional value equal to the truncation marker (from the highlighting configuration). This can be used to make it more explicit that the returned set of values is not the complete value of the field.
Bugs
- Dates and numbers in queries causing exceptions
-
Date or numeric field types will no longer result in a runtime exception when used in a query using qualified field notation (field:value).
API changes
- Moved parameters
-
Parameters controlling computation of relationships between labels and documents have been moved under a dedicated section of the project descriptor.
Parameters of co-occurrence based relationship computations have been moved to their dedicated cooccurrences section:
1.4.0 and earlier:
"arrangement": { "relationship": { "type": "cooccurrences", "similarityWeighting": "RR", "threads": "auto" } }
1.5.0:
"arrangement": { "relationship": { "type": "cooccurrences", "cooccurrences": { "similarityWeighting": "RR", "threads": "auto" } } }
Similarly, parameters of document relationship computations have been moved to a dedicated mlt section:
1.4.0 and earlier:
"arrangement": { "relationship": { "type": "mlt", "maxSimilarDocuments": 10, "maxQueryLabels": 20, ... } }
1.5.0:
"arrangement": { "relationship": { "type": "mlt", "mlt": { "maxSimilarDocuments": 10, "maxQueryLabels": 20, ... } } }
If the project descriptor contains non-default values of the above parameters, it must be updated to move the parameters to their dedicated sections as above.
- Document relationship matrix format change
-
(Removed in 1.6.0.) Version 1.5.0 switches the format of the document similarity matrix to use row / column indices in the matrix rather than document identifiers. This lowers the size of the responses containing similarity matrices.
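The parameter move described under Moved parameters above follows a single pattern: everything except type is nested under a section named after the relationship type. A minimal Python sketch of that transformation, assuming the relationship element is available as a plain dictionary and using a hypothetical function name, could look like this:

```python
def nest_relationship_parameters(relationship):
    """Move relationship parameters into a section named after its type.

    For example, {"type": "mlt", "maxQueryLabels": 20} becomes
    {"type": "mlt", "mlt": {"maxQueryLabels": 20}}.
    """
    rel_type = relationship["type"]
    # Parameters still sitting at the old (flat) location.
    flat = {k: v for k, v in relationship.items() if k not in ("type", rel_type)}
    # Keep anything that is already nested, so the function is idempotent.
    nested = dict(relationship.get(rel_type, {}))
    nested.update(flat)
    return {"type": rel_type, rel_type: nested}
```

Because already-nested parameters are preserved, running the function on a descriptor fragment that has been migrated before leaves it unchanged.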
Version 1.4.0
09-05-2017
Version 1.4.0 adds the possibility to cancel in-progress analysis requests, introduces caching of source documents and provides a number of minor improvements and bug fixes.
Compatibility
- Reindexing
- Not required. Lingo4G 1.4.0 will work with indexes created with Lingo4G 1.3.0.
- Project descriptors
- Updates not required.
- Custom document sources
-
Document sources compatible with version 1.3.0 will work with version 1.4.0. Please, however, note the source document caching feature introduced in version 1.4.0.
New features
- Deleting or cancelling analyses
-
The analysis endpoint of the REST API now supports the HTTP DELETE method, which you can use to cancel in-progress analyses or delete the completed ones from server caches.
Lingo4G Explorer's user interface has been modified to allow users to cancel analyses in progress.
- Source document caching
-
(Removed in 1.6.0.) As of version 1.4.0, Lingo4G indexing will fetch documents from the document source only once, storing all the required fields in a local, compressed cache file. The cache file will then be reused for all subsequent indexing phases and removed once indexing completes.
This feature will be useful for document sources where data fetching is costly (decompression, network access, file parsing) or can result in loading much overhead data, not referenced from any feature extractors or data fields later stored in Lingo4G indexes.
Size of the cache file will typically be 50–70% of the total size of text returned by the document source for indexing.
While caching is enabled by default, it can be switched off by passing the -Dl4g.cacheSource=false option to the index command or by permanently changing the cacheSource option in the project descriptor.
- License file information
- Version 1.4.0 adds license validity and limits information to the /about REST API endpoint. The same information can be displayed in Lingo4G Explorer.
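As an illustration of the HTTP DELETE support on the analysis endpoint noted above, a client could build the cancellation request as sketched below using Python's standard library. The URL of the analysis resource is deliberately left to the caller because it is not given in these notes; the address in the usage comment is a made-up placeholder, and the function name is hypothetical.

```python
import urllib.request


def build_cancel_request(analysis_url):
    """Build an HTTP DELETE request for the given analysis resource URL.

    The caller supplies the full URL; consult the Lingo4G REST API
    reference for the actual address of an analysis.
    """
    return urllib.request.Request(analysis_url, method="DELETE")


# Usage sketch (placeholder URL, request not sent here):
# request = build_cancel_request("http://localhost:8080/<analysis-resource>")
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

The same request serves both purposes described above: cancelling an analysis that is still running and removing a completed one from the server caches.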
Improvements
- Optimized number of source scans
-
The indexer's source document scanning has been optimized to minimize the number of required full passes over the input.
This improvement is applicable for projects with more than one feature extractor; for a single feature extractor the behavior has not changed.
- Lucene 6.5.0
-
The underlying Lucene search engine has been upgraded to version 6.5.0. Existing indexes are compatible and do not require reindexing.
- Jetty 9.3.19
-
Built-in Jetty server has been upgraded to version 9.3.19.v20170502.
Bugs
- sourceFields was ignored in phrase extractor
-
The sourceFields property was ignored in the phrase feature extractor. Instead, all targetFields were used as the source of features. This bug affected only those setups where sourceFields and targetFields were different.
- --max-docs console output
-
When --max-docs was used with indexing, the output progress was not updated properly (it was stuck at partial progress forever).
- Document content output in Lingo4G Explorer
- Previous versions would always export the default set of fields even if some of those fields were deselected in the dialog.
Version 1.3.0
15-03-2017
The 1.3.0 release adds the Wikipedia and NIH research projects datasets, adjusts indexing concurrency to account for drive speed, fixes a bug that prevented Lingo4G Explorer from loading in IE 11 and Firefox and provides a number of other minor improvements and bug fixes.
Compatibility
- Reindexing
- Not required. Lingo4G 1.3.0 will work with indexes created with Lingo4G 1.2.1.
- Project descriptors
-
Updates required. Version 1.2.1 deprecated the jsonFile property of the JSON document source in favor of the inputs property. The jsonFile property has been removed in version 1.3.0.
- Custom document sources
and logging configuration -
Updates may be required due to the upgrade of the default logging system to log4j2. We always recommend using the slf4j logging facade so that such changes are application-transparent.
Log4j2 uses a slightly different configuration file syntax, so any customizations have to be reapplied to the new configuration files.
New features
- Indexing improvements
-
Lucene index optimization has been rewritten to automatically adjust to drive capabilities. The number of index merging threads now follows the general specification given in the threads attribute.
- Wikipedia dataset
- Version 1.3.0 comes with the Wikipedia dataset that you can use to index and analyze the contents of Wikipedia.
- NIH research
projects dataset - Version 1.3.0 comes with the NIH research projects dataset which indexes titles and abstracts of research projects funded by the US National Institutes of Health.
Improvements
- Indexing speedups
-
Indexing now requires fewer phrase normalization passes, which should result in minor speedups.
- Phrase normalization
and numerics -
Phrase normalization will no longer create aliases for phrases consisting solely of numeric tokens. For example, previously if the token sequences 10 124, 10,124 and 10124 occurred frequently, they were considered aliases of the same number. This is no longer the case as it could lead to confusing results.
Note that for mixed alphanumeric tokens, such as μ ct 40, μct40 and μ ct40, such aliases are still created (this is frequently the case with measurement units, acronyms and proper names).
- Output of document
similarities -
1.6.0 You can now set documents.relationships.enabled to true to retrieve the matrix of similarities between documents in scope. The output of the relationships matrix can also be enabled in Lingo4G Explorer's Export window.
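The numeric rule for phrase normalization described above can be pictured with a small predicate. This is only a sketch of the stated rule, not Lingo4G's implementation: the tokenization (splitting on non-word characters) and the is_numeric_phrase helper are assumptions made for illustration.

```python
import re


def is_numeric_phrase(phrase: str) -> bool:
    """Return True when every token of the phrase consists of digits only."""
    # Assumed tokenization: runs of word characters, so "10,124" yields
    # the tokens "10" and "124".
    tokens = re.findall(r"\w+", phrase)
    return bool(tokens) and all(token.isdigit() for token in tokens)


if __name__ == "__main__":
    for phrase in ["10 124", "10,124", "10124", "μ ct 40", "μct40", "μ ct40"]:
        verdict = "no aliases" if is_numeric_phrase(phrase) else "aliases still created"
        print(f"{phrase!r}: {verdict}")
```

Under these assumptions the three purely numeric spellings are kept apart, while the mixed alphanumeric spellings remain candidates for aliasing.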
Bugs
- Errors in IE11
-
Lingo4G Explorer would not load in Internet Explorer 11 and older versions of Firefox. Version 1.3.0 fixes the problem.
API changes
- JSON document source
-
The deprecated jsonFile property of the JSON document source has been removed in favor of the inputs property.
Version 1.2.1
31-01-2017
Version 1.2.1 adds support for indexing multiple JSON files, fixes a major bug in label clustering and provides a number of other minor bug fixes and improvements.
Compatibility
- Reindexing
- Not required. Lingo4G 1.2.1 will work with indexes created with Lingo4G 1.2.0.
- Project descriptors
-
Updates recommended. Version 1.2.1 deprecates the jsonFile property of the JSON document source in favor of the inputs property that allows specifying multiple JSON files. The jsonFile property will be removed in version 1.3.0.
- Custom document sources
-
Updates may be required due to commons-compress and xz dependency upgrades.
- Minor dependency
upgrades -
commons-compress has been upgraded to version 1.13, xz has been upgraded to version 1.6.
Lingo4G now ships with Apache Xerces, which changes the default XML parser from Java's default.
These changes should not affect clustering results, but may require changes to custom document sources.
Improvements
- Firewall helpers
-
A FAQ entry and unpack utility have been added for those behind corporate firewalls (where automatic download of data files didn't work).
- AutoIndex file formats
-
The file formats supported by the autoindex document source are now documented.
- Multi-file support
in JSON example -
The JSON example has been modified to support multiple JSON files on input. The old jsonFile descriptor attribute is deprecated and will be removed in version 1.3.0. You should update your project descriptors to use the new inputs attribute (see the updated JSON example section for syntax).
Bugs
- Incorrect co-occurrence
counting -
Versions prior to 1.2.1 would incorrectly take the cooccurrenceCountingAccuracy threshold relative to scope size rather than relative to the total number of documents in the index. As a result, when processing small document subsets with low values of cooccurrenceCountingAccuracy, co-occurrence counts would be sparse and inaccurate, which might lead to many unclustered labels. Version 1.2.1 fixes the issue.
- Missing progress reporting
-
Label fetching phase could omit job progress information. This is a regression that was introduced in version 1.2.0. Version 1.2.1 fixes the issue.
- Wildcard queries not
highlighted -
Wildcard queries were not properly highlighted after changes introduced in version 1.2.0. Version 1.2.1 fixes the problem.
- Fixed runtime exception in
Progress updater -
A runtime exception could be printed during indexing (while closing the index). The exception didn't affect indexing results. The issue is fixed in the 1.2.1 release.
Version 1.2.0
16-01-2017
The 1.2.0 release introduces support for indexing PDF, Word and HTML files, adds automatic handling of concurrency during indexing, improves highlighting of complex queries and adds a number of other smaller improvements and bug fixes.
Compatibility
- Reindexing
- Recommended. Lingo4G 1.2.0 will work with indexes created with Lingo4G 1.1.0, although certain changes to stop label extraction algorithms may bring some label improvements after reindexing.
- Project descriptors
-
Updates recommended. The indexer.stopLabelExtractor.threads and indexer.stopLabelExtractor.accuracy attributes are deprecated. They should be removed from project descriptors (their values will be ignored and will trigger a warning on the console).
If used, the indexer.threads attribute should be set to auto (remove any fixed number of threads override if you have it). This enables automatic thread management, which adjusts to the hardware (HDDs, SSDs, number of CPU cores).
- Custom document sources
- Updates may be required if your custom document source used some of the public Lingo4G utility classes (for parallel processing, for example) or Google Guava, which has been updated to a newer version.
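The project descriptor recommendations above lend themselves to a quick automated check. The sketch below assumes that the descriptor is available as a nested dictionary (for example, parsed from JSON) and that dotted attribute names correspond to nested objects. The check_descriptor helper and its messages are hypothetical and are not part of Lingo4G.

```python
from typing import Any, Dict, List

# Attributes deprecated in version 1.2.0, as listed in the notes above.
DEPRECATED = (
    "indexer.stopLabelExtractor.threads",
    "indexer.stopLabelExtractor.accuracy",
)

_MISSING = object()


def lookup(descriptor: Dict[str, Any], dotted_name: str) -> Any:
    """Return the value at a dotted path, or _MISSING when it is absent."""
    node: Any = descriptor
    for key in dotted_name.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def check_descriptor(descriptor: Dict[str, Any]) -> List[str]:
    """Collect warnings about attributes affected by the 1.2.0 release."""
    warnings = []
    for name in DEPRECATED:
        if lookup(descriptor, name) is not _MISSING:
            warnings.append(f"{name} is deprecated and should be removed")
    threads = lookup(descriptor, "indexer.threads")
    if threads is not _MISSING and threads != "auto":
        warnings.append('indexer.threads should be set to "auto"')
    return warnings


if __name__ == "__main__":
    sample = {"indexer": {"threads": 4, "stopLabelExtractor": {"threads": 2}}}
    for warning in check_descriptor(sample):
        print(warning)
```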
New features
- PDF/Word document
source example -
A new document source has been added that automatically extracts text from PDFs, Microsoft Word, OpenOffice and other file formats. The discovery of file format and text extraction are done using the Apache Tika library. Full source code is included to allow modifications or for use out of the box to index local files.
- Automatic concurrency
in indexing -
The indexer will try to automatically maximize throughput taking into account the number of available CPU cores and the speed of the drive(s) used for indexing. The threads attribute should be set to auto for this feature to work.
When the automatic adjustment isn't a good fit or lower CPU consumption is required, a global system property l4g.concurrency can be set at startup to override the defaults (using the -Dl4g.concurrency=... syntax) or the threads attribute can be modified directly in the project descriptor (this is discouraged).
See the threads attribute's description for the permitted syntax of threading specification.
- Cell highlighting in
document clusters treemap -
Starting with the 1.2.0 release you can have Lingo4G Explorer highlight same-color or same-label cells in the document clusters treemap.
Improvements
- Scope highlighting
-
Scope highlighting has been rewritten from scratch and should now fully support phrase and fuzzy queries. Previously, any term of a phrase query would be highlighted; after this change, only the terms actually involved in the phrase (or a matching term span) will receive the highlight.
- Lucene upgrade
-
Apache Lucene has been upgraded to version 6.3.0.
- New analysis
scope types -
The 1.2.0 release adds three scope types that make it possible to use complex criteria for document selection: by-id selection, complementary and composite scope definitions.
- Stop label
extraction -
Better detection of nonsensical stop label extraction conditions and reporting. Automatically detected stop labels may change as a result of this adjustment.
- Phrase normalization
feedback -
Improved console progress feedback (indexing phase): it now shows a progress bar on larger data sets.
- Guava 20.0
-
The Google Guava dependency has been updated to version 20.0.
Bugs
- Exceptions when indexing
Unicode -
Documents containing non-ASCII Unicode characters could result in unhandled exceptions thrown during indexing. This is a regression bug affecting versions 1.1.0 and 1.1.1.
- Inconsistent label selection
between server restarts -
In previous versions, labels selected for the same analysis scope might differ between Lingo4G REST API restarts and command line analysis invocations. Version 1.2.0 fixes the issue, so that label selection results are always the same.
- License information
-
The l4g version command wasn't able to display valid license information properly.
- Label sorting broken
-
Labels were not sorted properly in Lingo4G Explorer. This regression bug was introduced in version 1.1.0 and is fixed in the 1.2.0 release.
- Stats output broken
-
The stats command skipped index component statistics.
API changes
- Output of labels
and documents -
The labels part of the response will contain the list property only when the output of labels is requested by setting the output.labels.enabled parameter to true. Otherwise, the list property will not be present.
Similarly, the documents part of the response will contain the list property only when the output of documents is requested by setting the output.documents.enabled parameter to true.
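Client code that reads analysis responses may need to tolerate the missing list property described above. A minimal sketch, assuming the response has already been parsed into a dictionary with labels and documents sections; the section_list helper is hypothetical, and the structure of individual list entries is deliberately left open.

```python
from typing import Any, Dict, List


def section_list(response: Dict[str, Any], section: str) -> List[Any]:
    """Return the `list` property of a response section, or [] when absent."""
    # The list property is present only when the corresponding output
    # was requested, so fall back to an empty list.
    return response.get(section, {}).get("list", [])


if __name__ == "__main__":
    response = {"labels": {"list": ["label entry"]}, "documents": {}}
    print(len(section_list(response, "labels")))
    print(len(section_list(response, "documents")))
```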
Version 1.1.1
08-12-2016
Version 1.1.1 fixes a major bug in fetching of the contents of documents.
Compatibility
- Reindexing
- Not required. Lingo4G 1.1.1 will work with indexes created with Lingo4G 1.1.0.
- Project descriptors
- Updates not required. Version 1.1.1 does not change project descriptors.
- Custom document sources
- Updates not required.
Bug fixes
- Content of incorrect
documents fetched -
In version 1.1.0, when document content was requested simultaneously with the onlyWithLabels or onlyAssignedToLabels parameters set to true, for some documents incorrect content could be fetched.
Version 1.1.1 fixes this issue.
Version 1.1.0
25-10-2016
Version 1.1.0 improves conflation of different spelling variants of the same label, adds more control over heuristic English stemming, fixes a number of bugs and extends documentation.
Compatibility
- Reindexing
- Recommended. Lingo4G 1.1.0 will work with indexes created with Lingo4G 1.0.0, but reindexing is strongly recommended because of improvements in automatic label detection.
- Project descriptors
- Updates not required. Version 1.1.0 does not change project descriptors.
- Custom document sources
- Updates not required.
New features
- Numeric ranges
-
Consistent support for numerics and numeric ranges in both the standard and complex query parser.
- Stemming control
-
An option called useHeuristicStemming was added to disable heuristic stemming in the English analyzer.
- Spelling variants
-
Phrase feature extraction has been improved to automatically detect and merge spelling variants of labels written with or without dashes and as a compound word or multi-term phrase. For example, the following spelling variants would be unified now:
fast boot, fast-boot, fastboot
web page, webpage, web-page
magical jelly bean, magical jellybean, magical jelly-bean
- Relevance score
-
The query-relevance score attribute was added to each document in the document retrieval API.
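To give an intuition for the spelling variant merging described in this section, the sketch below groups labels by a naive conflation key that ignores case, spaces and dashes. This is an illustration only: the actual detection in Lingo4G is not described in these notes and is likely more selective. The variant_key and group_variants helpers are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def variant_key(label: str) -> str:
    """Lowercase the label and drop whitespace and dashes."""
    return re.sub(r"[\s\-]+", "", label.lower())


def group_variants(labels: Iterable[str]) -> Dict[str, List[str]]:
    """Group labels that share the same conflation key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for label in labels:
        groups[variant_key(label)].append(label)
    return dict(groups)


if __name__ == "__main__":
    labels = ["fast boot", "fast-boot", "fastboot", "web page", "webpage", "web-page"]
    for key, variants in group_variants(labels).items():
        print(key, "<-", ", ".join(variants))
```

A key this simple would also merge unrelated labels that happen to share letters, which is one reason a real implementation would need additional evidence, such as how often each variant occurs.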
Improvements
- Documentation
-
We added documentation of default analyzers and their options.
- Dictionaries cleanup
-
The default dictionaries have been cleaned up and renamed consistently. Example projects make use of the default dictionaries and additionally project-specific dictionaries, where applicable.
- Licensing
-
Licenses will be reloaded automatically when no active licenses are found. This permits hot-swapping of licenses while the server is running.
- Other internal cleanups
-
A number of other internal issues have been fixed.
Bugs
- Small input crashes
-
There was a possibility of a runtime exception being hit on analysis of small inputs.
- Terminal crash
-
There was a possibility of an exception being thrown on non-updateable terminals.
Version 1.0.2
19-10-2016
Version 1.0.2 is a maintenance release that addresses minor software bugs.
Bugs
- Terminal crash
-
There was a possibility of an exception being thrown on non-updateable terminals.
Version 1.0.1
28-09-2016
Version 1.0.1 is a maintenance release that addresses minor software bugs and documentation deficiencies.
Compatibility
- Reindexing
- Not required, version 1.0.1 will work with index created by version 1.0.0.
- Project descriptors
- Updates not required.
- Custom document sources
- Updates not required.
Improvements
- Hiding zero-sized docs
in cluster treemap -
Version 1.0.1 adds the possibility to hide zero-sized groups in the document cluster treemap.
Version 1.0.0
22-09-2016
Version 1.0.0 is the first official release of Lingo4G. Version 1.0.0 comes with dictionary-based filtering of labels reworked and documented, improved label selection stability and minor improvements to Lingo4G Explorer and documentation.
Compatibility
- Reindexing
- Required. Lingo4G 1.0.0 updates the index storage format; indexes created by the 0.11.x versions will not work with version 1.0.0.
- Project descriptors
- Updates required. Version 1.0.0 changes the way label dictionaries are defined and applied.
- Custom document sources
- Updates not required.
New features
- Dictionaries
-
Version 1.0.0 introduces a common definition of label dictionaries that can be used, for example, to exclude specific labels from analysis. This release comes with two dictionary implementations: the simple and efficient word-based matching and more powerful but expensive to apply regular expression based matching. The dictionaries parameter documentation describes how to define your own dictionaries.
Additionally, the newly introduced dictionaries framework allows defining ad-hoc (per analysis request) dictionaries, which you can use to let the users tune or add their own label exclusions without restarting Lingo4G REST API server. Lingo4G Explorer comes with a simple implementation of this idea.
Improvements
- Label selection
stability improvements -
In previous versions of Lingo4G, excluding a single label from analysis could trigger a cascade of other changes to the label list with many other unrelated labels being removed and replaced. Version 1.0.0 improves label selection stability to prevent such situations.
- Hash-based analysis ids
-
As of version 1.0.0, the REST API will use 64-bit hash strings as identifiers of asynchronously handled analyses. This will minimize the chances of getting stale analysis results in case Lingo4G REST API is restarted between initiating the analysis and fetching its results.
This change should not require any changes in the code of your application, unless it relies on the structure of the analysis results URL returned by the REST API in the Location header.
- Partial results statistics
-
Version 1.0.0 changes the way the REST API reports processing progress. As of this release, the result of the /v1/analysis/{id} method will follow the structure of the complete analysis result returned by the /v1/analysis/{id}/result method. The difference between the two methods is that the former will only return processing progress information and certain label and document statistics, while the latter will return the complete analysis result.
- Analysis status
and parameters
in output response -
As of version 1.0.0, the analysis result response includes the processing status and parameters used to produce the analysis. These two pieces of data are especially useful for debugging the specific analysis result.
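The two REST API methods mentioned above, /v1/analysis/{id} and /v1/analysis/{id}/result, can be addressed with a small helper like the one sketched here. The base URL in the usage example is a placeholder, not a documented default, and the analysis_urls helper is hypothetical. When the Location header returned by the REST API is available, using it as-is avoids depending on the URL structure.

```python
from typing import Dict
from urllib.parse import quote


def analysis_urls(base_url: str, analysis_id: str) -> Dict[str, str]:
    """Build the progress and result URLs for an asynchronous analysis."""
    root = base_url.rstrip("/")
    # Treat the identifier as an opaque string and escape it for use in a path.
    encoded_id = quote(analysis_id, safe="")
    status = f"{root}/v1/analysis/{encoded_id}"
    return {"status": status, "result": f"{status}/result"}


if __name__ == "__main__":
    # Placeholder base URL and identifier, for illustration only.
    urls = analysis_urls("http://localhost:8080/api/", "abc123")
    print(urls["status"])
    print(urls["result"])
```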
Version 0.11.0
02-08-2016
Version 0.11.0 improves the stability of label selection, adds more detailed performance logging and introduces working index versioning.
Compatibility
- Reindexing
- Required. Starting with version 0.11.0, Lingo4G will automatically determine whether the existing index is compatible with the version of Lingo4G you are running. Re-indexing is required for this feature to work.
- Project descriptors
- Updates not required.
- Custom document sources
- Updates required. Version 0.11.0 introduces improved APIs for progress reporting; custom document sources need to be updated to use those APIs.
Improvements
- Progress and performance
logging improvements -
Version 0.11.0 comes with significantly improved reporting and logging of progress information. For each analysis request, logs will now contain a detailed breakdown of the performed tasks.
[Task]                            [Time]      [%]
Resolving selector query           129ms     3.2%
Fetching candidate labels       1s 651ms    40.7%
TermVectorScan                  1s 619ms    39.9%
  @ Segments: 7
  @ Documents: 8,355
  @ Threads: 8
  @ Labels fetched: 7,994
  @ Speed: 5.16ki docs/s
Scoring candidate labels           214ms     5.3%
  @ Labels scored: 7,994
  @ Labels selected: 1,000
  @ Speed: 38.25ki labels/s
Counting co-occurrences         1s 298ms    32.0%
  @ Threads: 8
  @ Speed: 770 labels/s
Computing label similarities        22ms     0.5%
Clustering labels                  398ms     9.8%
  @ Similarity density: 20.84%
  @ Similarity pruning gain: 1.98%
  @ Similarity pruning time: 77ms
  @ Similarity used: original
  @ Iterations: 155 (7.8% of max)
Computing coverage                 348ms     8.6%
  @ Segments: 7
  @ Labels: 1,000
  @ Threads: 8
  @ Speed: 2.87ki labels/s
- Working index versioning
-
Starting with version 0.11.0, Lingo4G will automatically determine whether the existing index is compatible with the version of Lingo4G you are running. If index format is too old, you will need to re-index your data before you can run analyses.
Heads up!
When you run the Lingo4G 0.11.0 analyze, server or stats command with a working index created by a previous version, you will see the following message:
The current index is too old, reindex your data.
Please re-index your data to be able to run analyses with version 0.11.0.
- Maximum indexed
documents option -
Since version 0.11.0 you can pass the --max-docs option to the index command to limit the number of documents to index.
Bug fixes
- Label selection
stability improvements -
Prior versions might select different labels for the same set of parameters. This release ensures that the same set of labels is selected, also for different numbers of processing threads.
Version 0.10.2
24-06-2016
Version 0.10.2 fixes a critical bug in license validation routines.
Compatibility
- Reindexing
- Not required, version 0.10.2 will work with the index created by the 0.10.x releases.
- Project descriptors
- Updates not required.
- Custom document sources
- Updates not required.
Bug fixes
- License validation
-
A bug has been fixed in license validation routines that could result in valid licenses being omitted.
Version 0.10.1
17-06-2016
Version 0.10.1 fixes a bug in presentation of the document cluster members in treemap view.
Compatibility
- Reindexing
- Not required, version 0.10.1 will work with the index created by the 0.10.x releases.
- Project descriptors
- Updates not required.
- Custom document sources
- Updates not required.
Bug fixes
- Incorrect member count
in document clusters -
In version 0.10.0 Lingo4G Explorer may incorrectly report the number of members of document clusters in the treemap view. Version 0.10.1 fixes the issue.
Version 0.10.0
16-06-2016
Version 0.10.0 introduces highlighting of scope query and selected labels in document texts and more options for the document clusters treemap display in Lingo4G Explorer.
Compatibility
- Reindexing
- Required. Version 0.10.0 will work with indexes created by the 0.9.x releases, but highlighting will be off. For this reason, we highly recommend reindexing your project from scratch.
- Project descriptors
-
Field content specification has changed: the maxTotalLength property has been removed. The defaults have been slightly adjusted to return shorter snippets.
- Custom document sources
- Recompilation required due to updated binary dependencies of Lingo4G.
New features
- Label highlighting
-
Version 0.10.0 makes it possible to highlight occurrences of scope query and selected labels in the text of documents retrieved using ad-hoc document retrieval.
- Document clusters
treemap configuration -
Version 0.10.0 adds new features to the document clusters treemap, including coloring, sizing and labeling of document cells based on the selected document fields.
Improvements
- Document clustering for
subset of in-scope
documents -
Versions prior to 0.10.0 would refuse to apply document clustering when the scope contained more than maxDocuments documents. As of version 0.10.0, if you set the trimToMaxDocuments parameter to true, Lingo4G will proceed with clustering a subset of the in-scope documents of size maxDocuments.
- Label selection
improvements -
Version 0.10.0 simplifies and improves the performance and memory footprint of label selection. An important change is the option to introduce a configurable amount of randomness to the label selection process, so that some of the less frequent and lower-scoring labels have a chance to be included in the analysis. The randomized label selection process is controlled by the following newly-added parameters: randomRatio, randomSeed. Please also see below for the API changes related to this improvement.
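The interaction between maxDocuments and trimToMaxDocuments described in this section can be summarized with a short sketch. It mirrors the stated behavior only. How Lingo4G chooses the subset is not specified in these notes, so taking the first maxDocuments entries is an assumption of the example, and the documents_to_cluster helper is hypothetical.

```python
from typing import List, Sequence


def documents_to_cluster(
    scope: Sequence[str],
    max_documents: int,
    trim_to_max_documents: bool = False,
) -> List[str]:
    """Return the in-scope documents that would be passed to clustering."""
    if len(scope) <= max_documents:
        return list(scope)
    if not trim_to_max_documents:
        # Behavior of versions prior to 0.10.0: clustering is refused.
        raise ValueError(
            f"scope has {len(scope)} documents, more than maxDocuments={max_documents}"
        )
    # Assumption: keep the first max_documents entries. The notes only
    # say that a subset of that size is clustered.
    return list(scope[:max_documents])


if __name__ == "__main__":
    print(documents_to_cluster(["d1", "d2", "d3"], 2, trim_to_max_documents=True))
```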
API changes
- Removed and renamed
parameters -
As a result of label selection improvements, the following parameters have been removed:
analysis.labels.maxLabelsOverhead
labels.surface.partOfSpeechFiltering
labels.frequencies.minRelativeDfDeviation
labels.frequencies.maxRelativeDfDeviation
labels.cooccurrences.isolationThreshold
labels.cooccurrences.isolationThresholdWidth
labels.scorers.isolationRatioScorerWeight
labels.cooccurrences.maxOverlap
labels.cooccurrences.maxOverlapDeviation
labels.scorers.overlapRankScorerWeight
labels.scorers.childCountScorerWeight
labels.scorers.dfScorerWeight
labels.scorers.candidateLabelScorerWeight
debug.logBaseLabelPartialScores
The following parameters have been renamed:
- analysis.labels.cooccurrences.cooccurrenceWindowSize renamed to analysis.labels.arrangement.relationship.cooccurrenceWindowSize
- analysis.performance.cooccurrenceCountingAccuracy renamed to analysis.labels.arrangement.relationship.cooccurrenceCountingAccuracy
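If you keep analysis parameters in JSON files, the removals and renames listed above could be applied with a small migration script along these lines. The sketch assumes that dotted parameter names map onto nested objects and takes the paths exactly as written in the lists above. The migrate helper is hypothetical and is not part of Lingo4G.

```python
import copy
from typing import Any, Dict

# Parameters removed in version 0.10.0, copied from the list above.
REMOVED = [
    "analysis.labels.maxLabelsOverhead",
    "labels.surface.partOfSpeechFiltering",
    "labels.frequencies.minRelativeDfDeviation",
    "labels.frequencies.maxRelativeDfDeviation",
    "labels.cooccurrences.isolationThreshold",
    "labels.cooccurrences.isolationThresholdWidth",
    "labels.scorers.isolationRatioScorerWeight",
    "labels.cooccurrences.maxOverlap",
    "labels.cooccurrences.maxOverlapDeviation",
    "labels.scorers.overlapRankScorerWeight",
    "labels.scorers.childCountScorerWeight",
    "labels.scorers.dfScorerWeight",
    "labels.scorers.candidateLabelScorerWeight",
    "debug.logBaseLabelPartialScores",
]

# Parameters renamed in version 0.10.0 (old path -> new path).
RENAMED = {
    "analysis.labels.cooccurrences.cooccurrenceWindowSize":
        "analysis.labels.arrangement.relationship.cooccurrenceWindowSize",
    "analysis.performance.cooccurrenceCountingAccuracy":
        "analysis.labels.arrangement.relationship.cooccurrenceCountingAccuracy",
}

_MISSING = object()


def pop_path(tree: Dict[str, Any], dotted: str) -> Any:
    """Remove and return the value at a dotted path, or _MISSING if absent."""
    *parents, leaf = dotted.split(".")
    node: Any = tree
    for key in parents:
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    if not isinstance(node, dict):
        return _MISSING
    return node.pop(leaf, _MISSING)


def set_path(tree: Dict[str, Any], dotted: str, value: Any) -> None:
    """Set the value at a dotted path, creating intermediate objects."""
    *parents, leaf = dotted.split(".")
    node = tree
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def migrate(parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with removed parameters dropped and renamed ones moved."""
    migrated = copy.deepcopy(parameters)
    for name in REMOVED:
        pop_path(migrated, name)
    for old_name, new_name in RENAMED.items():
        value = pop_path(migrated, old_name)
        if value is not _MISSING:
            set_path(migrated, new_name, value)
    return migrated
```

The input is left untouched and a migrated copy is returned. Objects that become empty after a removal are kept, so you may want to prune them afterwards.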
Version 0.9.0
31-03-2016
Version 0.9.0 introduces label arrangements, major improvements to document indexing, many new features in Lingo4G Explorer and much improved documentation.
Compatibility
- Reindexing
- Required. Version 0.9.0 comes with major improvements to indexing that remove noisy labels and decrease the disk size of the index.
- Project descriptors
- Updates required. Certain areas of the descriptor have been reorganized and a number of parameters have been removed.
- Custom document sources
- Updates not required.
New features
- Label arrangement
-
Version 0.9.0 makes it possible to arrange related labels into clusters. Label clusters themselves can be organized into higher-level structures.
Apart from treemap-based presentation, Lingo4G Explorer can show label clusters as a textual list and as a graph.
- New public data sets
-
Version 0.9.0 comes with support for two new public data sets:
- Questions and answers from a StackExchange Q&A site, such as superuser.com.
- Summaries of research projects funded by the US National Science Foundation and NASA between 2007 and 2015, as available from research.gov.
For more information, see the summary of example data sets.
- Documentation updates
-
Version 0.9.0 comes with significantly more documentation, including conceptual overview of Lingo4G and description of Lingo4G Explorer. Minor documentation additions concern feature extractors and analysis result response syntax.
As of version 0.9.0, all practical examples in the documentation are based on the superuser.com StackExchange data set.
- Parameter experiments
in Lingo4G Explorer -
In version 0.9.0, Lingo4G Explorer adds the Experiments window, which you can use to investigate the impact of various parameter changes on the properties of the analysis result.
Improvements
- Indexing improvements
-
Version 0.9.0 brings significant improvements in the document indexing phase, including:
- Keeping numeric tokens in labels, configured by the parameter.
- Improved accounting of compound terms that should eliminate truncated labels, such as high-energy x [rays].
- Normalization of various kinds of apostrophes.
- Removal of globally frequent labels, configured by the parameter.
- Decreased disk size of the index.
- curl command export
-
You can now obtain a curl command invocation that will fetch the analysis result data configured in the Lingo4G result export window.
- Composite criteria
in document retrieval -
You can now retrieve the content of documents using composite criteria that allow building complex Boolean queries.
API changes
- Document arrangement
section reorganized - The document arrangement section of the descriptor has been reorganized to group the algorithm-specific parameters under a unique property. Lingo4G currently comes with one document clustering algorithm, Affinity Propagation, whose parameters are now available in the section.
- scope section removed
from result response -
The scope section has been removed from the analysis result response output; the documentsInScope property has been moved to the summary section of the output.
Version 0.8.0
2015-11-13
Version 0.8.0 improves the performance of document clustering introduced in version 0.7.0. Additionally, it brings a number of small improvements to Lingo4G Explorer.
Compatibility
- Reindexing
- Not required, index created by version 0.7.0 will work with version 0.8.0.
- Project descriptors
- Updates not required, descriptors created for version 0.7.0 will work with version 0.8.0.
- Custom document sources
- Updates not required.
Improvements
- Faster document clustering
-
Version 0.8.0 adds multi-threaded document clustering. Additionally, in certain cases performance can be further improved by pruning of relationships matrix.
- More export options
-
As of version 0.8.0 you can choose which document fields to output in the Excel/JSON/XML report. Additionally, you can opt to include documents without labels in the output.
- Current label view as CSV
-
You can now copy the contents of the label view, including the added/removed/common status, to clipboard as CSV.
- Processing time details
and estimates -
Version 0.8.0 adds remaining time estimates for long-running tasks. You can see the detailed breakdown of the processing time by hovering the mouse pointer over the total elapsed time statistic.
Version 0.7.0
2015-08-18
Version 0.7.0 is a major new release that adds experimental support for arranging and visualizing documents as flat non-overlapping clusters.
Compatibility
- Reindexing
- Not required, index created by version 0.6.x will work with version 0.7.0.
- Project descriptors
- Updates not required, descriptors created for version 0.6.x will work with version 0.7.0.
- Custom document sources
- Updates not required.
- Java 8 required
-
As of version 0.7.0, Lingo4G requires Java version 8 or later to run.
New features
- Document arrangement
-
Version 0.7.0 makes it possible to arrange documents into flat non-overlapping clusters. Please see the quick start video for an overview and the
documents.arrangement
configuration section for a brief description of the involved parameters.
Version 0.6.0
released on 2015-07-06
Version 0.6.0 is a major new release that brings improvements in document indexing, improves label selection and adds document content retrieval to Lingo4G REST API and Explorer application.
Compatibility
- Reindexing
- Required, index created by version 0.5.x will not work with version 0.6.0.
- Project descriptors
- Updates required. The 0.6.0 release removes a number of obsolete parameters. Please see the release notes for more details or contact Carrot Search for an updated project descriptor.
- Custom document sources
- Updates required. Version 0.6.0 updates a number of third-party dependencies and therefore the 0.5.x custom document sources may not work with version 0.6.0.
New features
- Improved label selection
-
Version 0.6.0 improves the quality of label selection by introducing automatic discovery of collection-specific stop labels, accompanied by collection probability label scoring and significant improvements in document text tokenization.
- Document content
retrieval API -
The 0.6.0 release introduces a REST API method for document content retrieval. Additionally, you can now browse the contents of documents in Lingo4G Explorer.
- Result export
in Lingo4G Explorer -
As of the 0.6.0 release, you can export the analysis result directly from Lingo4G Explorer and save it as an Excel, XML or JSON file.
- New fields in IMDb
and PubMed data sets -
Version 0.6.0 parses more fields when indexing the IMDb and PubMed data sets. The new fields for IMDb are: country, rating, keywords, director and genre. The new fields for PubMed are: journal, author, keywords, date, journalName and subject.
Project descriptor changes
- New output folder
-
The output from the analyze command is now saved to a dedicated directory called results (directly under the project's directory). Results were previously saved to the work directory, which is now reserved for internal use by the application.
- TF/DF ratio scoring
removed -
Version 0.6.0 replaces TF/DF ratio scoring with automatic discovery of stop labels and probability ratio scoring. You will need to remove the minTfDfRatio and minTfDfRatioDeviation parameters from your project descriptors.
- Original label format
is now the default -
Version 0.6.0 changes the default value of the labelFormat parameter from LABEL_CAPITALIZED to ORIGINAL to avoid confusion when tuning label surface scoring, such as acronymLabelWeight.
Bug fixes
- Label filtering not applied
for small scopes -
Version 0.6.0 fixes a bug that prevented Lingo4G from applying label filtering (label surface and frequency parameters) when analyzing small subsets of the collection.
Version 0.5.0
released on 2015-05-14
Version 0.5.0 is a major new release that adds an initial implementation of Lingo4G REST API along with a simple browser-based tuning application. Also, in preparation for further development, the 0.5.0 release restructures a number of the basic concepts behind Lingo4G and makes a number of backward-incompatible changes.
Compatibility
- Reindexing
- Not required, index created by version 0.4.1 will work with 0.5.0.
- Project descriptors
- Updates required. The 0.5.0 release significantly restructures certain areas of the project descriptor. Please see the release notes for more details or contact Carrot Search for an updated project descriptor.
- Custom document sources
- Updates not required, custom document source binaries created for version 0.4.x will work with version 0.5.0.
New features
- REST API
-
The 0.5.0 release introduces an initial implementation of Lingo4G REST API. You can use the API to invoke Lingo4G text analysis from your favourite programming language or directly from a browser. You can start the REST API server using the server command.
- Lingo4G Explorer
-
You can use Lingo4G Explorer to interactively tune Lingo4G parameters directly in your browser. See the quick start section for instructions on running Lingo4G Explorer.
Conceptual changes
In preparation for further development, the 0.5.0 release needed to restructure some fundamental concepts behind Lingo4G.
- Labels and documents
-
The two basic entities involved in Lingo4G processing are now labels and documents. A document is the basic unit of information processed by Lingo4G; a label is a specific human-readable feature that occurs in one or more documents. Further releases of Lingo4G will allow arranging both labels and documents into higher-level structures such as clusters or graphs.
- Clustering becomes analysis
-
To accommodate further additions to Lingo4G, such as embedding of labels and documents in 2d spaces, the 0.5.0 release replaces the notion of clustering with the more general analysis. Currently, analysis consists of selecting a set of labels that best describe the subset of documents submitted for analysis.
As a consequence of the clustering to analysis transition, the l4g cluster command has been renamed to analyze and the clustering section of the project descriptor has become the analysis section.
Project descriptor changes
The 0.5.0 release introduces a number of backward-incompatible changes in the project descriptor.
- "analysis" section
-
The clustering section has been renamed to analysis. Furthermore, to account for the changed definition of the "clustering" concept, parameters found in the clustering subsection have been moved to the labels subsection. The labels subsection has been subdivided into the surface, frequencies and cooccurrences subsections.
- "labelSource" section
-
The labelSource section has been renamed to source and put as a subsection of the labels section. The list of feature fields to fetch labels from is now represented by an array of field descriptors rather than two arrays of field names and field weights.
- Renamed parameters
-
A number of parameters in the former clustering.labels (now in analysis.labels.surface) have been renamed:

Old name                         New name
preferredLabelLength             preferredWordCount
preferredLabelLengthDeviation    preferredWordCountDeviation
minLabelTokens                   minWordCount
maxLabelTokens                   maxWordCount
minLabelCharacters               minCharacterCount
minLabelTokenCharacters          minWordCharacterCountAverage
- "output" section
-
The output section has been restructured to reflect the introduction of the label and document entities.
- Output of label co-occurrences temporarily removed
- Release 0.5.0 temporarily removes the option to output label co-occurrences. Further releases will allow outputting generalized relationships between labels and documents, one of which will be label co-occurrences.
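The renames listed under "Renamed parameters" can be applied mechanically when updating a descriptor by hand. The following is a minimal sketch, assuming the relevant descriptor block has been loaded as a plain JSON object; the function name rename_label_parameters and the sample values are hypothetical, and only the name mapping itself comes from the table above.

```python
import json

# Old -> new parameter names, copied from the "Renamed parameters" table.
RENAMED = {
    "preferredLabelLength": "preferredWordCount",
    "preferredLabelLengthDeviation": "preferredWordCountDeviation",
    "minLabelTokens": "minWordCount",
    "maxLabelTokens": "maxWordCount",
    "minLabelCharacters": "minCharacterCount",
    "minLabelTokenCharacters": "minWordCharacterCountAverage",
}


def rename_label_parameters(old_labels: dict) -> dict:
    """Return a copy of a 0.4.x clustering.labels block using the 0.5.0 names.

    Keys that are not in the rename table are copied unchanged.
    """
    return {RENAMED.get(name, name): value for name, value in old_labels.items()}


# Hypothetical 0.4.x values, for illustration only.
old_block = {"minLabelTokens": 1, "maxLabelTokens": 8, "preferredLabelLength": 2.5}
new_block = rename_label_parameters(old_block)

# The renamed parameters now live under analysis.labels.surface.
print(json.dumps({"analysis": {"labels": {"surface": new_block}}}, indent=2))
```

The sketch covers only the renamed parameters; the other structural changes described in this section (for example the field descriptor array in labels.source) still need to be made separately.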
Improvements
- Minor quality fixes
- The 0.5.0 release fixes minor bugs that could degrade the quality of label selection.
- JSON override in l4g analyze
-
As of the 0.5.0 release it is possible to override arbitrary analysis parameters when invoking the analyze command using the -j command line parameter. This can be particularly handy when you export the JSON override strings from Lingo4G Explorer.
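As a rough illustration of the JSON override mechanism, the sketch below builds an override string and a shell-safe command line in Python. The -j option and the minWordCount and maxWordCount parameter names come from these notes; the nesting of the keys inside the override object is an assumption made for the example, not a documented schema, so prefer the strings exported by Lingo4G Explorer for real use.

```python
import json
import shlex

# Hypothetical override: the key nesting is an assumption for illustration.
override = {"labels": {"surface": {"minWordCount": 2, "maxWordCount": 5}}}

# Compact JSON keeps the command line short.
override_json = json.dumps(override, separators=(",", ":"))

# Build the argument vector; shlex.join quotes the JSON for a POSIX shell.
command = ["l4g", "analyze", "-j", override_json]
print(shlex.join(command))
```

Passing the argument vector directly to a process launcher (rather than pasting the printed string into a shell) avoids quoting problems altogether.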
Version 0.4.1
released on 2015-04-17
Version 0.4.1 comes with an important bug fix in the clustering algorithm.
Compatibility
- Reindexing
- Not required, index created by version 0.4.0 will work with 0.4.1.
- Project descriptors
- Updates not required, project descriptor created for version 0.4.0 will work with version 0.4.1.
- Custom document sources
- Updates not required, custom document source binaries created for version 0.4.0 will work with version 0.4.1.
Bug fixes
- Empty cluster set when clustering a subset of the collection
-
Lingo4G could erroneously create an empty cluster list when processing a subset of the collection and print a misleading message: "No candidate labels found, try lowering the DF cut-offs." The 0.4.1 release fixes this issue.
- Pure negative queries supported
-
Version 0.4.1 adds support for pure negative queries in the cluster command. For example, -s "-summary:foo" would select all documents that do not contain the term foo in the summary field.
- Assertion errors in indexer
-
Previous versions could throw an assertion error when the number of segments to optimize was equal to 1. Version 0.4.1 fixes this issue.
Version 0.4.0
released on 2015-02-13
Version 0.4.0 comes with a major rewrite of the indexing infrastructure, resulting in optimized memory use, better phrase extraction and tuned resource utilization.
Compatibility
- Reindexing
- Not strictly required (index created by version 0.3.x will work with 0.4.0), but strongly recommended as the resulting output should contain better features.
- Removed indexer options
-
Several options have been removed from the indexer section of the project descriptor. Project descriptors still carrying these attributes will fail to parse properly.
  - Indexer type sequential has been removed. Remove the type of the indexer entirely, if it is present in your descriptor file.
  - indexWriter attribute (and all children attributes) has been removed. The index writer, its buffers and memory allocation, is now adjusted automatically.
  - Phrase feature contributor's minPhraseDfAtPartialMerge and diskCounterMaxBufferSizeMb attributes have been removed.
- Query parser
- The default query parser's operator has been changed from OR to AND to be more similar to modern search engines.
Improvements
- Indexing
-
The index command has been rewritten to utilize memory and disk more efficiently.
- Phrase extraction
-
A number of improvements to automatic phrase extraction yield better label candidates and, as a result, better clustering output.
- Common terms handling
-
In earlier versions, phrases with leading or trailing common terms could be incorrectly indexed and show up as cluster labels. Version 0.4.0 corrects this.
Version 0.3.1
released on 2015-01-19
Version 0.3.1 allows specifying the list of fields to cluster on as a parameter of the l4g cluster command and fixes a minor bug in parsing command line arguments.
Compatibility
- Reindexing
- Not required, index created by version 0.3.0 will work with 0.3.1.
- Project descriptors
- Updates not required, project descriptor created for version 0.3.0 will work with version 0.3.1.
- Custom document sources
- Updates not required, custom document source binaries created for version 0.3.0 will work with version 0.3.1.
Improvements
- l4g cluster --feature-fields
-
You can now pass the list of feature fields to use during clustering using the --feature-fields option.
Bug fixes
- Incorrect parsing of quoted command line parameters
-
In earlier versions of Lingo4G it was impossible to pass a command line parameter enclosed in double quotes. For example, the selector query of l4g cluster -s "\"phrase query\"" would be interpreted as phrase query rather than "phrase query". Version 0.3.1 fixes this issue.
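To see what the selector argument in the example above should look like once the shell has processed it, the command line can be tokenized with Python's shlex module, which follows POSIX shell quoting rules. This is only an illustration of the expected argument value; it does not reproduce how Lingo4G itself parses its arguments.

```python
import shlex

# The command line from the example above, as typed in a POSIX shell.
command_line = r'l4g cluster -s "\"phrase query\""'

# shlex.split applies POSIX quoting rules: the outer double quotes are
# removed and each \" becomes a literal double quote character.
argv = shlex.split(command_line)
selector = argv[3]

print(argv)
print(selector)
```

The selector keeps its inner double quotes, so the query is a phrase query rather than two separate terms.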
Version 0.3.0
released on 2014-12-12
Version 0.3.0 fixes a number of major bugs and introduces two small improvements.
Compatibility
- Reindexing
- Recommended if the source of the data was a Lucene index, see the bug fixes section for details.
- Project descriptors
- Project descriptors using custom document sources may require an update. Carrot Search will provide the updated project descriptor if needed.
- Custom document sources
- Updates required, Carrot Search will provide binaries of your custom document sources that will work with the 0.3.0 version.
Improvements
- Improved l4g stats
-
The l4g stats command has received a number of improvements and changes:
  - Reporting of the size of document term vectors has been added, which may be a useful piece of input for performance tuning.
  - Reporting the raw text statistics is disabled by default, because the term vector statistics are much more useful for performance tuning. You can get the raw text statistics by passing the --analyze-text-fields option.
  - The default accuracy of statistics gathering has been lowered from 1.0 to 0.1. The lowered accuracy is still high enough to get a very good estimate of the statistics and leads to much faster processing in case of large indices. You can set a different accuracy using the -a option.
- More flexible l4g cluster -o option
-
As of Lingo4G 0.3.0 you can also pass a file name to the -o option of l4g cluster to save the clustering results directly to the provided file.
Bug fixes
- Some documents sourced from a Lucene index may not get indexed
-
Earlier versions of Lingo4G could skip some documents during indexing when the source Lucene index consisted of multiple segments or contained deleted documents. Version 0.3.0 fixes this issue.
If the source of documents was a Lucene index, re-indexing is required for Lingo4G to include all the desired documents in its index.
- Exception when generating document-cluster assignments
-
Lingo4G 0.2.0 would throw an exception when the project descriptor had the output.components.assignments.enabled property set to true, which effectively prevented generating document-to-cluster assignments. Version 0.3.0 fixes this issue.
- Use 24-hour clock in log file names
- Version 0.3.0 switches to the 24-hour clock for log file names, so that sorting by file name produces a chronological order of log files.
- Incorrect total time in log files
- Version 0.2.0 would always report zero total processing time in log files. Version 0.3.0 fixes the issue.
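The reason the 24-hour clock matters for sorting can be shown with a short sketch. The file name patterns below are hypothetical and are not the names Lingo4G actually writes; the point is only that zero-padded 24-hour timestamps sort lexicographically in chronological order, while 12-hour timestamps do not.

```python
from datetime import datetime

# Three log creation times on the same day, in chronological order.
stamps = [
    datetime(2014, 12, 12, 9, 30),
    datetime(2014, 12, 12, 13, 5),
    datetime(2014, 12, 12, 23, 45),
]


def name_12h(t: datetime) -> str:
    """Hypothetical log file name using a 12-hour clock with AM/PM suffix."""
    hour = t.hour % 12 or 12
    suffix = "AM" if t.hour < 12 else "PM"
    return f"l4g-{t:%Y-%m-%d}-{hour:02d}-{t:%M}-{suffix}.log"


def name_24h(t: datetime) -> str:
    """Hypothetical log file name using a zero-padded 24-hour clock."""
    return f"l4g-{t:%Y-%m-%d-%H-%M}.log"


names_12h = [name_12h(t) for t in stamps]
names_24h = [name_24h(t) for t in stamps]

# Sorting by name preserves chronological order only for the 24-hour names.
print(sorted(names_12h) == names_12h)
print(sorted(names_24h) == names_24h)
```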
Version 0.2.0
released on 2014-12-04
Compatibility
- Reindexing
- Recommended. Version 0.2.0 introduces more flexible configuration of document field indexing. It is recommended to re-index your data to keep the index synchronized with the updated project descriptor.
- Project descriptors
- Update required to convert the 0.1.x-style document field indexing definition to the syntax updated in 0.2.x. Carrot Search will provide the updated project descriptors for early adopters of Lingo4G.
- Custom document sources
- Updates required, Carrot Search will provide binaries of your custom document sources that will work with the 0.2.0 version.
New features
- Improved document field indexing configuration
-
Version 0.2.0 changes the document field indexing configuration syntax to allow more flexibility. With the new syntax it will be possible to reduce the size of the Lingo4G index by not storing the original text of the field and/or its search index while retaining the possibility to apply clustering to that field. Please see the documentation of the fields section for more details and examples.
Carrot Search will provide the updated project descriptors for early adopters of Lingo4G.
- Dedicated by-identifier document selection syntax
- Version 0.2.0 adds dedicated syntax for selecting documents for clustering based on their identifiers. You can use this syntax to efficiently select thousands, tens of thousands or hundreds of thousands of documents by some identifier field value.
- Automatic minPerSegmentDf
-
Version 0.2.0 adds support for the auto value for the performance.minPerSegmentDf parameter, in which case the appropriate value will be computed based on the clusters.minClusterSize parameter. In most cases, the auto setting will improve the clustering performance.
Improvements
- Maximum field length in label-document assignment result
-
Lingo4G will limit the maximum number of characters written for each field in the label-document assignment result to prevent accidentally writing very large amounts of content to the result file. You can change the default length limit using the output.components.assignments.maxFieldLength option.
Bug fixes
- Shell scripts return code 0 for empty cluster lists
-
Lingo4G 0.1.x l4g shell scripts would return a non-zero code when the list of clusters was empty. To reserve the non-zero codes for actual execution errors, as of version 0.2.0 Lingo4G launch scripts will return zero also when the execution completes successfully but with an empty cluster list.
Version 0.1.2
released on 2014-11-25
Version 0.1.2 fixes a major bug in cluster label candidate selection present in the 0.1.0 and 0.1.1 releases.
Compatibility
- Reindexing
- Not required, index created by previous 0.1.x versions will work with 0.1.2.
- Project descriptors
- Updates not required, project descriptor created for earlier 0.1.x releases will work with 0.1.2.
- Custom document sources
- Updates not required, custom document source binaries created for previous 0.1.x versions will work with version 0.1.2.
Bug fixes
- No clusters when clustering a subset of the index
- Versions 0.1.0 and 0.1.1 could occasionally generate an empty cluster list when processing a fairly large subset of the index. In such cases the message "No candidate labels found, try lowering the DF cut-offs." would be printed. Version 0.1.2 fixes the issue.
Version 0.1.1
released on 2014-11-24
Version 0.1.1 introduces a number of small improvements, bug fixes and documentation clarifications.
Compatibility
- Reindexing
- Not required, index created by version 0.1.0 will work with 0.1.1.
- Project descriptors
- Updates not required, project descriptor created for version 0.1.0 will work with version 0.1.1.
- Custom document sources
- Updates not required, custom document source binaries created for version 0.1.0 will work with version 0.1.1.
Improvements
- Re-indexing into non-empty index requires explicit confirmation
- Version 0.1.0 would silently discard the existing index when re-indexing. To avoid accidental deletion of the index, version 0.1.1 will only overwrite the existing non-empty index if the --force option is provided.
- Cygwin and Mingw
-
When running Lingo4G in Cygwin or Mingw, use the l4g.cmd script so that Lingo4G can correctly resolve file paths. As of version 0.1.1 the l4g Bash launch script will refuse to run under Cygwin and Mingw.
- Version information
-
You can now get detailed Lingo4G version information by running l4g version.
- Unlimited number of clauses in selection query
-
The document selection query can now use an unlimited number of clauses, which makes it possible to select large numbers of documents for clustering, for example by their identifiers (id:d1 OR id:d5 OR id:d47 ...).
Planned: The performance of selecting thousands of documents using the OR syntax is currently very low. Further releases of Lingo4G will come with a dedicated syntax for by-id selection and much better performance characteristics.
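A by-identifier selection query of the form shown above can be generated programmatically. The following is a minimal sketch: the by_id_query helper is hypothetical, the id field name is taken from the example in these notes, and identifiers are assumed to be plain tokens that need no escaping in the query syntax.

```python
def by_id_query(ids, field="id"):
    """Join identifiers into an OR query such as 'id:d1 OR id:d5'.

    Assumes identifiers contain no whitespace or query syntax characters.
    """
    return " OR ".join(f"{field}:{doc_id}" for doc_id in ids)


query = by_id_query(["d1", "d5", "d47"])
print(query)  # id:d1 OR id:d5 OR id:d47
```

The resulting string can be passed as the selection query; given the performance note above, this approach is best kept to modest numbers of identifiers.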
Bug fixes
- l4g shell scripts return codes
-
As of version 0.1.1, the l4g shell scripts correctly return execution status codes.
Version 0.1.0
released on 2014-11-07
Initial alpha release.