Analysis

During analysis, Lingo4G executes the text mining tasks you declare in the analysis request.

With Lingo4G's highly-modular text analysis architecture, you can assemble simple components into pipelines to perform text processing tasks ranging from simple query-based document searches to time series analysis. The following is a far from exhaustive list of what is currently possible with Lingo4G:

  • Query-based document search. Using Lingo4G's powerful Lucene-based query syntax, you can make queries ranging from simple word and phrase containment up to complex queries involving word order, positions and proximity.

  • Document text retrieval with query match highlighting. You can request Lingo4G to return the text of specific documents. You can also have Lingo4G highlight text regions matching the query you provide.

  • Collecting labels from documents. Lingo4G can extract salient labels, such as words or phrases, describing one or more documents of your choice. Using powerful label filtering mechanisms, you can shape the lists of labels to your liking based on the label's length in words or occurrence count statistics.

  • Finding semantically-similar labels. Using multidimensional embedding vectors, Lingo4G can find labels that are semantically similar to one or more words or phrases you provide. The similarity is based on the label occurrence patterns found in the document collection you indexed in Lingo4G.

  • Finding semantically-related documents. By searching multidimensional embedding vectors of documents, Lingo4G can find documents that are similar to the example document you provide. Lingo4G can identify similar documents even if they don't share any common words or phrases.

  • Clustering labels or documents. Lingo4G can group labels or documents into clusters in such a way that similar labels or documents end up in the same cluster.

  • 2d mapping of labels or documents. Lingo4G can organize labels and/or documents into 2d maps in such a way that semantically-similar labels and documents concentrate in the same area of the map. Likewise, Lingo4G puts dissimilar labels or documents far apart on the map.

  • Duplicate content detection. You can use Lingo4G to identify pairs of documents with overlapping content. The degree of overlap can range from entire documents (exact duplicates), through almost all the content (near duplicates), to just partial overlap (sentences, paragraphs). Lingo4G can also highlight the overlapping areas of documents for easier inspection of the results.

  • Time-series analysis. By applying the same basic text processing task, such as label collection or 2d mapping, to documents falling within a sequence of date-based windows, you can get insight into time-based trends.
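
To make the idea of date-based windows from the last item more concrete, here is a minimal sketch in plain Python. It is not Lingo4G code and does not use the Lingo4G API: the `documents` list, the one-year window size and the `label_counts_by_year` helper are all assumptions made for illustration. It applies the same task (counting labels) to each window and compares the results across windows.

```python
from collections import Counter
from datetime import date

# Hypothetical toy data: (publication date, labels found in the document).
documents = [
    (date(2021, 3, 14), ["neural network", "image recognition"]),
    (date(2021, 11, 2), ["neural network"]),
    (date(2022, 5, 30), ["language model", "neural network"]),
    (date(2022, 8, 9), ["language model"]),
]


def label_counts_by_year(docs):
    """Group documents into one-year windows and count labels per window."""
    windows = {}
    for published, labels in docs:
        # The same task (label counting) is applied to every window.
        windows.setdefault(published.year, Counter()).update(labels)
    return windows


trends = label_counts_by_year(documents)
for year in sorted(trends):
    print(year, trends[year].most_common())
```

Comparing the per-window counts shows which labels gain or lose prominence over time. In this toy data, "language model" appears only in the 2022 window.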

You can combine Lingo4G text processing components in numerous ways to perform a wide variety of text processing tasks. See the Analysis JSON tutorial for a systematic explanation of how to build Lingo4G analysis requests.