Creating a project from scratch
To start analyzing your data, you'll need to set up a Lingo4G project. This section shows you how to do it.
Process overview
To set up a typical Lingo4G project, you will need to perform the following general steps:
- Identification of the data source, format conversions and adaptations.
- Identification of data fields and feature fields.
- Project directory setup, writing the project descriptor to specify:
  - data source,
  - fields,
  - analyzers and query parsers,
  - indexing parameters,
  - shared analytical components, predefined requests.
- Quality-tuning and feedback loop (performed repeatedly until the results are of satisfactory quality):
  - indexing or reindexing input documents, learning embeddings,
  - running analyses, identifying problems,
  - tuning of indexing parameters, dictionary resources and requests.
Data source
Lingo4G project setup starts with identifying where your data comes from and how Lingo4G can access it. Lingo4G cannot directly operate on remote databases or document repositories: before analyzing any data, Lingo4G needs to copy the data to its internal storage and build the required data structures.
The easiest way to get your data into Lingo4G is to convert or export the data to the JSON format and use Lingo4G's built-in JSON document source. The alternative approach is implementing a custom Lingo4G document source in Java. In most scenarios, however, the JSON format should be just fine.
For the needs of this chapter, we picked a database of American movies extracted from Wikipedia. It is conveniently available as JSON already, so no additional conversion steps are needed. You can download the raw JSON database using curl, aria2 or wget:
$ wget https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json
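or, equivalently, with curl (the -O flag saves the file under its remote name):
$ curl -O https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json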
Here is a fragment of the data file (// ... indicates omitted fragments). The record below is an illustrative reconstruction of one entry; the field names match the actual file:
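[
  {
    "title": "Alien",
    "year": 1979,
    "cast": [
      "Tom Skerritt",
      "Sigourney Weaver"
      // ...
    ],
    "genres": [
      "Horror",
      "Science Fiction"
    ],
    "href": "Alien_(film)",
    "extract": "Alien is a 1979 science fiction horror film directed by Ridley Scott ..."
  }
  // ...
]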
The input is a JSON array in which each document is an object with a set of fields. We will be interested in a subset of those fields: the title, summary, list of actors and perhaps the movie's year and genre. We now have all the information needed to set up the project folder and descriptor.
Project descriptor
Create a folder structure for the project and its resources:
$ mkdir dataset-movies
$ cd dataset-movies
$ touch movies.project.json
then open movies.project.json with vi or your favorite editor.
The project descriptor you have just created is a JSON file that resides at the top of each Lingo4G project and declares where Lingo4G should read the documents from, what fields the documents contain, and how Lingo4G should store and process those fields. You can write the project descriptor from scratch or copy and modify one of the example data sets. In this walk-through, we will write a complete project descriptor; by the end of this chapter, you should end up with a descriptor file identical to the movies data set example included in the Lingo4G distribution.
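To give you a map of where we're heading, here is a sketch of the descriptor's overall shape. Each top-level section listed below is covered in the remainder of this chapter; the comments are explanatory placeholders, not part of the file:

{
  "source":       { /* document source: where the data comes from */ },
  "fields":       { /* field types and analyzers */ },
  "analyzers":    { /* custom analyzer definitions */ },
  "indexer":      { /* feature extraction, stop labels, embeddings */ },
  "queryParsers": { /* default query parser configuration */ },
  "dictionaries": { /* label exclusion dictionaries */ },
  "analysis_v2":  { /* default analysis request components */ }
}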
Document source
We previously identified the document source and the content we're interested in. This is how you express this information as a json-records feed that imports movie records from a JSON file and maps their content to a flat list of fields:
{
  "source": {
    "feed": {
      "type": "json-records",
      "input": {
        "dir": "${input.dir:data}",
        "match": "*.json",
        "onMissing": [
          [
            "https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json"
          ]
        ]
      },
      "fieldMapping": {
        "title": ".title",
        "summary": ".extract",
        "genre": ".genres.*",
        "cast": ".cast.*",
        "year": ".year",
        "href": ".href"
      }
    }
  }
}
There are a few bits to explain in the above. The input block tells the json-records document source to look for files matching the provided pattern (*.json) under the directory defined by the "${input.dir:data}" expression. By default, Lingo4G resolves this expression to the data folder inside your project directory; you can point at a different directory by providing the input.dir system property at runtime. You can create the data folder and download the source file yourself, or Lingo4G can do this for you: the onMissing property provides an external URL to download from if no input files are present.
Now take a look at the fieldMapping element. The input JSON files may be structured differently than the flat field-value list Lingo4G expects, and we are only interested in a subset of all the information the movies database provides. The descriptor uses the fieldMapping element to select a subset of data in the input JSON and to map field names to values deep in the JSON hierarchy (using JSON paths). A side effect of our declaration above is that we can rename parts of the input to more convenient names (like extract to summary).
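For instance, applied to the illustrative record shown earlier, the mapping above would emit a flat list of fields along these lines (the .genres.* and .cast.* paths expand each array element into a separate value of the same field):

title   = Alien
summary = Alien is a 1979 science fiction horror film ...
genre   = Horror
genre   = Science Fiction
cast    = Tom Skerritt
cast    = Sigourney Weaver
year    = 1979
href    = Alien_(film)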
By now you have a document source that parses JSON files and emits a set of documents, each containing a flat list of string field names and values. We now need to add type information for those fields.
Field definitions
Lingo4G requires additional information on how to store and process (tokenize) fields acquired from the document source. The fields definition block of the project descriptor serves this purpose.
This is what the fields block looks like for our movie database example (remember to add a comma after the previous source block to keep the JSON valid):
"fields": {
"title": {
"analyzer": "english"
},
"summary": {
"analyzer": "english"
},
"genre": {
"analyzer": "keyword"
},
"cast": {
"analyzer": "person"
},
"year": {
"type": "date",
"inputFormat": "yyyy",
"indexFormat": "yyyy"
},
"href": {
"id": true,
"analyzer": "literal"
}
}
All document fields in Lingo4G can be divided into two main categories: fields used for scope filtering and document selection (like identifiers, categories, types, names) and fields from which Lingo4G extracts features for use in analytical requests (typically "natural" text, such as movie title or summary). The latter must declare an analyzer to divide the text into smaller pieces, called tokens.
Most document selection fields will typically be of a primitive type (an integer or a date). In our example, year is a date field and genre is an (implicit) text field with the keyword analyzer, which indexes entire case-insensitive values. Finally, the href field is declared as each document's unique identifier: a text field with the literal analyzer.
The title and summary fields are text fields that use the predefined english analyzer, tuned to process texts in English. We will use these two fields as the source of features in analytical requests.
Occasionally, the predefined set of analyzers is not sufficient. The cast field in our movie database contains actor names. A text field with the english analyzer may seem like a good pick for this field. Unfortunately, the english analyzer may omit certain frequently occurring words (called stop words) or conflate different word forms into a single token (for example, by removing the trailing s from plurals). We don't want any of these transformations to take place, so we declare a custom analyzer named person. Lingo4G does not know anything about such an analyzer, so we need to provide its definition in the analyzers section of the project descriptor. Here is what it looks like:
"analyzers": {
"person": {
"type": "english",
"requireResources": true,
"stopwords": [],
"stemmerDictionary": null,
"useHeuristicStemming": false
}
}
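To see why this matters, compare how the two analyzers would treat the same cast value. The tokens below are an illustrative sketch of stop word removal and plural stemming, not the analyzers' literal output:

"The Marx Brothers"

english analyzer:  [marx, brother]         ("the" dropped as a stop word, plural stemmed)
person analyzer:   [the, marx, brothers]   (all tokens preserved)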
This completes field definitions. You are now ready to add indexer and feature extraction configuration settings.
Indexer
Indexing imports document fields into Lingo4G and creates all the data structures required for searching documents and analyzing them. The indexer section has a lot of tuning knobs, but it's best to start simple, look at the results and tune according to what's needed. The aim of indexer tuning is typically to reduce the size of the underlying data structures rather than optimize the quality of extracted features. You can tune the quality of the results at analysis time without recomputing the full index.
Add the indexer section for our movie database project to the project descriptor:
"indexer": {
"features": {
"phrases": {
"type": "phrases",
"sourceFields": [
"title",
"summary"
],
"targetFields": [
"title",
"summary"
],
"minTermDf": 5,
"minPhraseDf": 10
}
},
"stopLabelExtractor": {
"categoryFields": [
"genre"
],
"featureFields": [
"title$phrases"
]
},
"embedding": {
"labels": {
"enabled": "true"
},
"documents": {
"enabled": "true"
}
}
}
There are three main configuration blocks, which we will discuss separately.
features
- The features block configures feature extractors. Features are small units of text, such as words or phrases, that characterize the content of a document. Lingo4G uses features to perform most analytical processing, including clustering, 2d embedding or finding similar documents. Features and feature extractors are a broad subject; feel free to follow up on it later in the dedicated chapter.

  For the movie database, we have only one feature extractor, called phrases. This feature extractor looks at terms and phrases that occur frequently in a set of input fields (across all documents), then decides which ones are significant enough to apply as document labels, finally marking any of their occurrences. We will use two fields as both the source and target for the discovered labels: title and summary.

  Feature extractors of type phrases are almost entirely automatic, but they do benefit from minor hints regarding the minimum frequency of terms and phrases to consider as labels. The minTermDf and minPhraseDf parameters set the minimum term and phrase document frequency of each label. Setting them too low will cause many insignificant labels to be indexed; setting them too high will leave only the most common (and perhaps obvious) labels.

stopLabelExtractor
- The stopLabelExtractor block configures automatic detection of common boilerplate text (labels). These common phrases are often domain-specific and would require significant effort to include manually in label dictionaries (which we will discuss later).

  In this example, we have configured automatic stop label extraction to use the genre field as the category that separates documents which should (in theory) talk about different subjects but share common (uninteresting) vocabulary. The title$phrases name denotes a feature field: the result of applying the phrases feature extractor to the title field. The stop label extractor looks at that field to detect additional document separation axes.

embedding
- Enables computation of label and document embeddings along with indexing. You could use the separate learn-embeddings command instead, but enabling embeddings at indexing time is simpler (see the sketch below).
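If you prefer to compute embeddings in a separate step, you would invoke the learn-embeddings command mentioned above. The line below is a sketch that assumes the command accepts the same -p project option as the other l4g commands in this chapter:

$ ./l4g learn-embeddings -p path/to/movies.project.json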
Query parsers
Many components in Lingo4G use text queries to slice and dice the indexed documents into smaller subsets. Lingo4G needs to convert these keyword-based string queries into a format understood by the underlying search engine (Lucene), and it is the job of a query parser to perform this conversion.
It is good to specify at least one configuration of the default query parser, enumerating a set of default fields to which unqualified search terms will be applied. Let's configure the enhanced query parser to search in the title or summary fields and to use the conjunction of query clauses by default. Add this configuration block to the project descriptor you're editing:
"queryParsers": {
"enhanced": {
"type": "enhanced",
"defaultOperator": "AND",
"defaultFields": [
"title",
"summary"
]
}
}
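With this configuration in place, the parser applies each unqualified term to both default fields and joins the query clauses with AND. Conceptually, the rewriting looks like this (an illustrative sketch, not the parser's literal output):

alien space   ->   (title:alien OR summary:alien) AND (title:space OR summary:space)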
This ends the required part of the project descriptor, but before we run indexing, let's configure a few optional descriptor elements: label exclusion dictionaries and default components for the analysis API v2.
Dictionaries
Label exclusion dictionaries are a way of globally excluding certain undesired features (labels) that slip through Lingo4G's automated feature extraction algorithm. You will typically build the exclusions dictionary in a quality feedback loop of indexing, running analysis requests and tweaking the exclusions to permanently remove certain labels. For the needs of this example, we will add a few dictionary entries just to show you how to do it.
First, while in your project directory, create a subfolder named resources and a dictionary file called stoplabels.utf8.txt:
$ mkdir resources
$ touch resources/stoplabels.utf8.txt
Then copy and paste the following wildcard expressions into the dictionary file:
#
# Project-local label exclusion dictionary.
#
* american
starring *
stars *
title role
leading role
These rules use the wildcard expression syntax of the glob dictionary type. Other dictionary types are available, but this one is relatively fast and intuitive. For example, the starring * rule rejects all labels with the leading term starring, and * american rejects all labels with the trailing term american.
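A few made-up labels illustrate how these rules match:

starring *     rejects "starring sigourney weaver"  (leading term plus anything)
* american     rejects "african american"           (anything plus trailing term)
title role     rejects exactly the label "title role"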
Next, we need to add the dictionary component to the project and reference the rules file. Add the following dictionaries block to the project descriptor.
"dictionaries": {
"default": {
"type": "glob",
"files": [
"${l4g.project.dir}/resources/stoplabels.utf8.txt",
"${l4g.home}/resources/analysis/stoplabels.utf8.txt"
]
}
}
Note that we load the dictionary with two input files: one from the project (the one you just created) and one from the Lingo4G distribution (common rules for texts in English).
Default request components
This final step is entirely optional, but we include it to complete the picture of setting up a full project.
Lingo4G analysis requests consist of stages and components. Many of these components are repetitive, so you can declare them once and then reuse them in multiple requests. You can use the analysis_v2 configuration block of the project descriptor to configure such defaults across the entire project.
Copy and paste the following JSON fragment into your project descriptor. These components declare the default implementation of the contentFields:* type, the featureFields:* type, and an autoStopLabels label filter.
"analysis_v2": {
"components": {
"contentFields": {
"type": "contentFields:simple",
"fields": {
"title": {},
"summary": {}
}
},
"featureFields": {
"type": "featureFields:simple",
"fields": [
"title$phrases",
"summary$phrases"
]
},
"labelFilter": {
"type": "labelFilter:composite",
"labelFilters": {
"auto": {
"type": "labelFilter:autoStopLabels"
},
"project": {
"type": "labelFilter:dictionary",
"exclude": [
{
"type": "dictionary:all"
}
]
}
}
}
}
}
Indexing
Run the index command and point it at your project directory (or descriptor):
$ ./l4g index -p path/to/movies.project.json
Wait a few moments until the indexing completes; you should see a message similar to this:
...
> Processed 36,273 documents, the index contains 34,548 documents.
> Done. Total time: 18s.
The project is now ready to run analysis requests.
Running analyses
The movies data set we used in this chapter is relatively small (and the content fields are short), but we can run two small requests to show that the setup works. The first request searches for movies similar to Alien using vector embeddings. Let's edit and run this request without starting the server first.
In the project directory, create the following file:
$ mkdir -p web/requests
$ touch web/requests/01-similar-to-alien.json
Open the file you just created with a text editor and paste the following analysis API v2 request:
{
  "name": "Search similar documents using content embedding vector",
  "comment": "Retrieves the content of documents that are most similar to the top query document, based on multidimensional document embedding similarity.",
  "variables": {
    "searchQuery": {
      "name": "Search query",
      "comment": "Source document search",
      "value": "title:alien"
    }
  },
  "stages": {
    "queryDocuments": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": {
          "@var": "searchQuery"
        }
      }
    },
    "similarDocuments": {
      "type": "documents:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:documentEmbedding",
        "documents": {
          "type": "documents:reference",
          "use": "queryDocuments"
        }
      },
      "limit": 10
    },
    "queryDocumentsContents": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "queryDocuments"
      }
    },
    "similarDocumentsContents": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "similarDocuments"
      }
    }
  },
  "output": {
    "stages": [
      "queryDocumentsContents",
      "similarDocumentsContents",
      "queryDocuments",
      "similarDocuments"
    ]
  }
}
Now execute the request using the run-request command:
$ ./l4g run-request -p path/to/movies.project.json path/to/web/requests/01-similar-to-alien.json
Lingo4G saves the response next to the source request; you can display it in the terminal using:
$ cat path/to/web/requests/01-similar-to-alien.result
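If you have jq installed, you can also pretty-print the result file for easier reading:

$ jq . path/to/web/requests/01-similar-to-alien.result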
The raw JSON API response is quite unappealing. Luckily, Lingo4G comes with an API sandbox web application which can render the results of certain stages in a web browser. Start the Lingo4G server and open the JSON Sandbox application at http://localhost:8080/apps/explorer/v2/#/code:
$ ./l4g server -p path/to/movies.project.json
You should see the analysis JSON request editor, into which you could copy and paste the request above. However, when we asked you to place the source request under web/requests, we took advantage of the built-in project requests repository: the sandbox app should display this request in the project request list in the left side panel. When you click it and hit the execute button, you should see the list of similar titles on the right side.
The second example request, perhaps a bit more entertaining, tries to answer this question: which movies (with completely different titles) had the most actors in common? If you copy and paste the request below (or copy 02-most-actors-in-common.json from the dataset-movies project provided in the distribution), you'll see the answer.
{
  "name": "Most actors in common in movies that have different words in the title",
  "components": {
    "actorsInCommon": {
      "type": "featureSource:values",
      "fields": {
        "type": "fields:simple",
        "fields": [
          "cast"
        ]
      }
    }
  },
  "stages": {
    "duplicates": {
      "type": "documentPairs:duplicates",
      "query": {
        "type": "query:all"
      },
      "hashGrouping": {
        "features": {
          "type": "featureSource:reference",
          "use": "actorsInCommon"
        },
        "pairing": {
          "maxHashBitsDifferent": 0,
          "maxHashGroupSize": 200
        }
      },
      "validationFilters": [
        {
          "pairwiseSimilarity": {
            "type": "pairwiseSimilarity:featureIntersectionSize",
            "features": {
              "type": "featureSource:flatten",
              "source": {
                "type": "featureSource:words",
                "fields": {
                  "type": "fields:simple",
                  "fields": [
                    "title"
                  ]
                }
              }
            }
          },
          "min": 0,
          "max": 0
        }
      ],
      "validation": {
        "pairwiseSimilarity": {
          "type": "pairwiseSimilarity:featureIntersectionSize",
          "features": {
            "type": "featureSource:reference",
            "use": "actorsInCommon"
          }
        },
        "min": 6
      }
    },
    "documents": {
      "type": "documentContent",
      "documents": {
        "type": "documents:fromDocumentPairs"
      },
      "fields": {
        "type": "contentFields:simple",
        "fields": {
          "title": {},
          "summary": {},
          "year": {},
          "cast": {
            "maxValues": 25
          }
        }
      }
    }
  }
}