Creating a project from scratch

To start analyzing your data, you'll need to set up a Lingo4G project. This section shows you how to do it.

Process overview

To set up a typical Lingo4G project, you will need to perform the following general steps:

  1. Identify the data source and perform any necessary format conversions and adaptations.

  2. Identify data fields and feature fields.

  3. Set up the project directory and write the project descriptor to specify:

    • the data source,
    • fields,
    • analyzers and query parsers,
    • indexing parameters,
    • shared analytical components and predefined requests.

  4. Iterate in a quality-tuning feedback loop (repeated until the results are of satisfactory quality):

    • index or reindex the input documents, learn embeddings,
    • run analyses, identify problems,
    • tune indexing parameters, dictionary resources and requests.
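
The steps above translate into a project layout that we will assemble over the course of this chapter (each file is created in one of the sections below):

dataset-movies/
├── movies.project.json          # project descriptor
├── data/                        # input JSON, downloaded on first indexing run
├── resources/
│   └── stoplabels.utf8.txt      # label exclusion dictionary
└── web/
    └── requests/
        ├── 01-similar-to-alien.json
        └── 02-most-actors-in-common.json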

Data source

Lingo4G project setup starts with identifying where your data comes from and how Lingo4G can access it. Lingo4G cannot directly operate on remote databases or document repositories. Before analyzing any data, Lingo4G needs to copy the data to its internal storage and build the required data structures.

The easiest way to get your data into Lingo4G is to convert or export the data to the JSON format and use Lingo4G's built-in JSON document source. The alternative approach is implementing a custom Lingo4G document source in Java. In most scenarios, however, the JSON format should be just fine.

For this chapter, we picked a database of American movies extracted from Wikipedia. It is conveniently available as JSON already, so no additional conversion steps are needed. You can download the raw JSON database using curl, aria2 or wget:

$ wget https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json
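
or, equivalently, using curl:

$ curl -LO https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json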

Here is a fragment of the data file (// ... indicates omitted fragments):

[
  // ...
  {
    "title": "Death at a Funeral",
    "year": 2010,
    "cast": [
      "Peter Dinklage",
      "Martin Lawrence",
      "James Marsden",
      "Tracy Morgan",
      // ...
    ],
    "genres": [
      "Comedy"
    ],
    "href": "Death_at_a_Funeral_(2010_film)",
    "extract": "Death at a Funeral is a 2010 American black comedy film directed by Neil LaBute with a screenplay by Dean Craig. It is a remake of the 2007 British film of the same name that Craig wrote. The film features an ensemble cast including Chris Rock, Martin Lawrence, Danny Glover, Regina Hall, Peter Dinklage, James Marsden, Tracy Morgan, Loretta Devine, Zoë Saldaña, Columbus Short, Luke Wilson, Keith David, Ron Glass and Kevin Hart; Dinklage is the only actor to appear in both films. The film was released in the United States on April 16, 2010.",
    "thumbnail": "https://upload.wikimedia.org/wikipedia/en/d/d7/Death_at_a_Funeral_2010_Poster.jpg",
    "thumbnail_width": 259,
    "thumbnail_height": 385
  },
  // ...
]

An overview of the structure and an example record from the American movie database. (If you haven't seen the British original, you should.)

The input is a JSON array and each document is an object with a set of fields. We will be interested in a subset of those fields: the title, summary, list of actors and perhaps the movie's year and genre. We now have all the information needed to set up the project folder and descriptor.

Project descriptor

Create a folder structure for the project and its resources:

$ mkdir dataset-movies
$ cd dataset-movies
$ touch movies.project.json

then open movies.project.json in your favorite editor.

The project descriptor you have just created is a JSON file that resides at the top of each Lingo4G project and declares where Lingo4G should read the documents from, what fields the documents contain, and how Lingo4G should store and process those fields. You can write the project descriptor from scratch or copy and modify one of the example data sets. In this walk-through, we will write a complete project descriptor. By the end of this chapter, you should end up with a descriptor file identical to the movies dataset example included in the Lingo4G distribution.
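
For orientation, here is the top-level skeleton of the descriptor we will assemble; { ... } marks the blocks filled in by the sections that follow:

{
  "source": { ... },        // where the documents come from
  "fields": { ... },        // how fields are stored and tokenized
  "analyzers": { ... },     // custom analyzer definitions
  "indexer": { ... },       // feature extraction and embeddings
  "queryParsers": { ... },  // query syntax configuration
  "dictionaries": { ... },  // label exclusion dictionaries
  "analysis_v2": { ... }    // shared analysis components
}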

Document source

We previously identified the document source and the content we're interested in. This is how you express this information as a json-records feed to import movie records from a JSON file and to map their content to a flat list of fields:

{
  "source": {
    "feed": {
      "type": "json-records",
      "input": {
        "dir": "${input.dir:data}",
        "match": "*.json",
        "onMissing": [
          [
            "https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json"
          ]
        ]
      },
      "fieldMapping": {
        "title": ".title",
        "summary": ".extract",
        "genre": ".genres.*",
        "cast": ".cast.*",
        "year": ".year",
        "href": ".href"
      }
    }
  }
}

There are a few bits to explain in the above. The input block tells the json-records document source to look for files matching the provided pattern (*.json) under the directory defined by the "${input.dir:data}" expression. By default, Lingo4G resolves the expression to the data folder inside your project directory; you can select a different directory by providing the input.dir system property at runtime. You can create the data folder and download the source file yourself, or Lingo4G can do this for you: the onMissing property provides an external URL to download from if no input files are present.
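
For example, assuming the l4g launcher passes Java-style -D system properties through to Lingo4G, you could point the document source at a different input directory like this:

$ ./l4g index -p path/to/movies.project.json -Dinput.dir=/path/to/json-files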

Now take a look at the fieldMapping element. The input JSON files may be structured differently than the flat field-value list Lingo4G expects, and we are only interested in a subset of all the information the movies database provides. The descriptor uses the fieldMapping element to select that subset of the input JSON and to map field names to values deep in the JSON hierarchy (using JSON paths). A side effect of our declaration above is that we can rename parts of the input to more convenient names (like extract to summary).
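
As an illustration, the "Death at a Funeral" record shown earlier would map to roughly the following flat list of fields and values (cast and genre are multi-valued, one value per input array element):

title:    Death at a Funeral
summary:  Death at a Funeral is a 2010 American black comedy film directed by ...
genre:    Comedy
cast:     Peter Dinklage; Martin Lawrence; James Marsden; Tracy Morgan; ...
year:     2010
href:     Death_at_a_Funeral_(2010_film)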

By now you have a document source that parses JSON files and emits a set of documents, each containing a flat list of string field names and values. We now need to add type information for those fields.

Field definitions

Lingo4G requires additional information on how to store and process (tokenize) fields acquired from the document source. The fields definition block of the project descriptor serves this purpose.

This is what the fields block looks like for our movie database example (remember to add a comma after the previous source block to keep the JSON valid).

"fields": {
  "title": {
    "analyzer": "english"
  },
  "summary": {
    "analyzer": "english"
  },
  "genre": {
    "analyzer": "keyword"
  },
  "cast": {
    "analyzer": "person"
  },
  "year": {
    "type": "date",
    "inputFormat": "yyyy",
    "indexFormat": "yyyy"
  },
  "href": {
    "id": true,
    "analyzer": "literal"
  }
}

All document fields in Lingo4G can be divided into two main categories: fields used for scope filtering and document selection (like identifiers, categories, types, names) and fields from which Lingo4G extracts features for use in analytical requests (typically "natural" text, such as movie title or summary). The latter must declare an analyzer to divide the text into smaller pieces, called tokens.

Most document selection fields will typically be of a primitive type (an integer or a date). In our example, year is a date field and genre is an (implicit) text field with the keyword analyzer, which indexes entire values case-insensitively. Finally, the href field is declared as each document's unique identifier: a text field with the literal analyzer.

The title and summary fields are text fields that use the predefined english analyzer tuned to process texts in English. We will use these two fields as the source of features in analytical requests.

Occasionally, the predefined set of analyzers is not sufficient. The cast field in our movie database contains actor names. A text field with the english analyzer may seem like a good pick for this field. Unfortunately, the english analyzer may omit certain frequently occurring words (called stop words) or conflate different word forms into a single token (for example, by removing the trailing s from plurals). We don't want any of these transformations to take place, so we declare a custom analyzer named person. Lingo4G knows nothing about such an analyzer, so we need to provide its definition in the analyzers section of the project descriptor. Here is what it looks like:

"analyzers": {
  "person": {
    "type": "english",
    "requireResources": true,
    "stopwords": [],
    "stemmerDictionary": null,
    "useHeuristicStemming": false
  }
}
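
To see the difference on a hypothetical cast value (token output is approximate), compare what the two analyzers would emit: the english analyzer drops stop words and applies heuristic stemming, while person preserves every token:

input:    Keith David and Kevin Hart
english:  keith, david, kevin, hart         (stop word "and" removed)
person:   keith, david, and, kevin, hart    (no stop words, no stemming)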

This completes field definitions. You are now ready to add indexer and feature extraction configuration settings.

Indexer

Indexing imports document fields into Lingo4G and creates all the data structures required for searching documents and analyzing them. The indexer section has a lot of tuning knobs, but it's best to start simple, look at the results, and tune as needed. The aim of indexer tuning is typically to reduce the size of the underlying data structures rather than to optimize the quality of extracted features. You can tune the quality of the results at analysis time without recomputing the full index.

Add the indexer section for our movie database project to the project descriptor:

"indexer": {
  "features": {
    "phrases": {
      "type": "phrases",
      "sourceFields": [
        "title",
        "summary"
      ],
      "targetFields": [
        "title",
        "summary"
      ],
      "minTermDf": 5,
      "minPhraseDf": 10
    }
  },
  "stopLabelExtractor": {
    "categoryFields": [
      "genre"
    ],
    "featureFields": [
      "title$phrases"
    ]
  },
  "embedding": {
    "labels": {
      "enabled": "true"
    },
    "documents": {
      "enabled": "true"
    }
  }
}

There are three main configuration blocks, which we will discuss separately.

features

The features block configures feature extractors. Features are small units of text, such as words or phrases, that characterize the content of a document. Lingo4G uses features to perform most analytical processing, including clustering, 2D embedding and finding similar documents. Features and feature extractors are a broad subject; feel free to follow up on them later in the dedicated chapter.

For the movie database, we have only one feature extractor called phrases. This feature extractor looks at terms and phrases that occur frequently in a set of input fields (across all documents), decides which ones are significant enough to serve as document labels, and finally marks all of their occurrences. We will use two fields as both the source and target for the discovered labels: title and summary.

Feature extractors of type phrases are almost entirely automatic, but they do benefit from minor hints regarding the minimum frequency of terms and phrases to consider as labels. The minTermDf and minPhraseDf parameters set the minimum document frequency of terms and phrases, respectively. Setting these thresholds too low will cause many insignificant labels to be indexed; setting them too high will leave only the most common (and perhaps obvious) labels.

stopLabelExtractor

The stopLabelExtractor block configures automatic detection of common boilerplate labels. These common phrases are often domain-specific and would require significant effort to include manually in label dictionaries (which we will discuss later).

In this example, we have configured automatic stop label extraction to use the genre field as the category that separates documents which should (in theory) talk about different subjects but share common (uninteresting) vocabulary. title$phrases denotes a feature field: the result of applying the phrases feature extractor to the title field. The stop label extractor looks at that feature field to detect labels that cut across these document separation axes.

embedding

The embedding block enables computation of label and document embeddings as part of indexing. You could compute the embeddings later using the separate learn-embeddings command, but enabling them here is simpler.
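
If you'd rather compute the embeddings in a separate step, the invocation would be analogous to the other l4g commands in this chapter (a sketch; consult the command's documentation for the exact options):

$ ./l4g learn-embeddings -p path/to/movies.project.json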

Query parsers

Many components in Lingo4G use text queries to slice and dice all indexed documents into smaller subsets. Lingo4G needs to convert these keyword-based string queries into a format understood by the underlying search engine (Lucene); performing this conversion is the job of a query parser.

It is good to specify at least one configuration of the default query parser, enumerating a set of default fields to which unqualified search terms will be applied. Let's configure the enhanced query parser to search in the title or summary fields and use the conjunction of query clauses by default. Add this configuration block to the project descriptor you're editing:

"queryParsers": {
  "enhanced": {
    "type": "enhanced",
    "defaultOperator": "AND",
    "defaultFields": [
      "title",
      "summary"
    ]
  }
}
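
With this configuration in place, a few example queries would behave as follows (approximate semantics):

alien               matches "alien" in the title or summary field
space alien         requires both terms to match (defaultOperator is AND)
genre:Comedy dog    an explicit field clause combined with a default-field term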

This ends the required part of the project descriptor, but before we run indexing, let's configure a few optional descriptor elements: label exclusion dictionaries and default components for the analysis API v2.

Dictionaries

Label exclusion dictionaries are a way of globally excluding certain undesired features (labels) that slip through Lingo4G's automated feature extraction algorithm. You will typically build the exclusions dictionary in a quality feedback loop: indexing, running analysis requests, and tweaking the exclusions to permanently remove certain labels. For this example, we will add a few dictionary entries just to show you how it's done.

First, while in your project directory, create a subfolder named resources and, inside it, a dictionary file called stoplabels.utf8.txt:

$ mkdir resources
$ touch resources/stoplabels.utf8.txt

Then copy and paste the following wildcard expressions into the dictionary file:

#
# Project-local label exclusion dictionary.
#
* american
starring *
stars *
title role
leading role

These rules use the wildcard expression syntax of the glob dictionary type. Other dictionary types are available, but this one is relatively fast and intuitive. For example, the rule starring * rejects all labels that begin with the term starring, while * american rejects labels that end with american.
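
To make the semantics concrete, here is a sketch of how these rules would apply, assuming * matches any sequence of words:

starring *      rejects "starring meryl streep", "starring the voices of"
* american      rejects "native american", "typical american"
title role      rejects the exact label "title role"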

Next, we need to add the dictionary component to the project and reference the rules file. Add the following dictionaries block to the project descriptor.

"dictionaries": {
  "default": {
    "type": "glob",
    "files": [
      "${l4g.project.dir}/resources/stoplabels.utf8.txt",
      "${l4g.home}/resources/analysis/stoplabels.utf8.txt"
    ]
  }
}

Note that we load the dictionary from two input files: one from the project (the one you just created) and one from the Lingo4G distribution (common rules for texts in English).

Default request components

This final step is entirely optional, but we include it to complete the picture of setting up a full project.

Lingo4G analysis requests consist of stages and components. Many of these components are repetitive, so you can declare them once and then reuse them in multiple requests. You can use the analysis_v2 configuration block of the project descriptor to configure such defaults across the entire project.

Copy and paste the following JSON fragment into your project descriptor. These components declare the default implementations of the contentFields:* and featureFields:* types, and a composite label filter that combines the autoStopLabels filter with the exclusion dictionary.

"analysis_v2": {
  "components": {
    "contentFields": {
      "type": "contentFields:simple",
      "fields": {
        "title": {},
        "summary": {}
      }
    },
    "featureFields": {
      "type": "featureFields:simple",
      "fields": [
        "title$phrases",
        "summary$phrases"
      ]
    },
    "labelFilter": {
      "type": "labelFilter:composite",
      "labelFilters": {
        "auto": {
          "type": "labelFilter:autoStopLabels"
        },
        "project": {
          "type": "labelFilter:dictionary",
          "exclude": [
            {
              "type": "dictionary:all"
            }
          ]
        }
      }
    }
  }
}

Indexing

Run the index command, pointing it at your project directory (or descriptor):

$ ./l4g index -p path/to/movies.project.json

Wait a few moments until the indexing completes; you should see a message similar to this:

...
> Processed 36,273 documents, the index contains 34,548 documents.
> Done. Total time: 18s.

The project is now ready to run analysis requests.

Running analyses

The movies data set we used in this chapter is relatively small (and the content fields are short), but we can run two small requests to show that the setup works. The first request searches for movies similar to Alien using vector embeddings. Let's write and run this request without starting the server first.

In the project directory, create the following directory and file:

$ mkdir -p web/requests
$ touch web/requests/01-similar-to-alien.json

Open the file you just created with a text editor and paste the following analysis API v2 request:

{
  "name": "Search similar documents using content embedding vector",
  "comment": "Retrieves the content of documents that are most similar to the top query document, based on multidimensional document embedding similarity.",
  "variables": {
    "searchQuery": {
      "name": "Search query",
      "comment": "Source document search",
      "value": "title:alien"
    }
  },
  "stages": {
    "queryDocuments": {
      "type": "documents:byQuery",
      "query": {
        "type": "query:string",
        "query": {
          "@var": "searchQuery"
        }
      }
    },
    "similarDocuments": {
      "type": "documents:embeddingNearestNeighbors",
      "vector": {
        "type": "vector:documentEmbedding",
        "documents": {
          "type": "documents:reference",
          "use": "queryDocuments"
        }
      },
      "limit": 10
    },
    "queryDocumentsContents": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "queryDocuments"
      }
    },
    "similarDocumentsContents": {
      "type": "documentContent",
      "documents": {
        "type": "documents:reference",
        "use": "similarDocuments"
      }
    }
  },
  "output": {
    "stages": [
      "queryDocumentsContents",
      "similarDocumentsContents",
      "queryDocuments",
      "similarDocuments"
    ]
  }
}

Now execute the request using the run-request command:

$ ./l4g run-request -p path/to/movies.project.json path/to/web/requests/01-similar-to-alien.json

Lingo4G saves the response next to the source request; you can display it in the terminal using:

$ cat path/to/web/requests/01-similar-to-alien.result

The raw JSON API response is quite unappealing. Luckily, Lingo4G comes with an API sandbox web application that can render the results of certain stages in a web browser. Start the Lingo4G server and open the JSON Sandbox application at http://localhost:8080/apps/explorer/v2/#/code:

$ ./l4g server -p path/to/movies.project.json

You should see the analysis JSON request editor, into which you could copy and paste the request above. However, when we asked you to place the source request under web/requests, we took advantage of the built-in project requests repository: the sandbox app should display this request in the project request list in the left side panel. When you click it and hit the execute button, you should see the list of similar titles on the right side:

Lingo4G JSON sandbox app showing movies similar to 'Alien'.

The second example request, perhaps a bit more entertaining, tries to answer this question: which movies (with completely different titles) have the most actors in common? If you copy and paste the request below (or copy 02-most-actors-in-common.json from the dataset-movies project provided in the distribution), you'll see the answer.

{
  "name": "Most actors in common in movies that have different words in the title",
  "components": {
    "actorsInCommon": {
      "type": "featureSource:values",
      "fields": {
        "type": "fields:simple",
        "fields": [
          "cast"
        ]
      }
    }
  },
  "stages": {
    "duplicates": {
      "type": "documentPairs:duplicates",
      "query": {
        "type": "query:all"
      },
      "hashGrouping": {
        "features": {
          "type": "featureSource:reference",
          "use": "actorsInCommon"
        },
        "pairing": {
          "maxHashBitsDifferent": 0,
          "maxHashGroupSize": 200
        }
      },
      "validationFilters": [
        {
          "pairwiseSimilarity": {
            "type": "pairwiseSimilarity:featureIntersectionSize",
            "features": {
              "type": "featureSource:flatten",
              "source": {
                "type": "featureSource:words",
                "fields": {
                  "type": "fields:simple",
                  "fields": [
                    "title"
                  ]
                }
              }
            }
          },
          "min": 0,
          "max": 0
        }
      ],
      "validation": {
        "pairwiseSimilarity":{
          "type": "pairwiseSimilarity:featureIntersectionSize",
          "features": {
            "type": "featureSource:reference",
            "use": "actorsInCommon"
          }
        },
        "min": 6
      }
    },
    "documents":  {
      "type": "documentContent",
      "documents": {
        "type": "documents:fromDocumentPairs"
      },
       "fields": {
         "type": "contentFields:simple",
         "fields": {
           "title": {},
           "summary": {},
           "year": {},
           "cast": {
             "maxValues": 25
           }
         }
       }
    }
  }
}
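
As before, you can execute this request with the run-request command (or from the sandbox app's project request list):

$ ./l4g run-request -p path/to/movies.project.json path/to/web/requests/02-most-actors-in-common.json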