Example data sets
These example projects should help you get started with Lingo4G and serve as a starting point for rolling out your own projects: copy and modify the project descriptor that most closely approximates your data.
We highly recommend starting with a smaller data set and progressing towards larger ones. The following table lists each example's name, the approximate number of documents it contains, the required disk space, and the indexing time.
| Project | Number of docs | Disk space¹ | Indexing time² | Embedding time³ |
|---------|----------------|-------------|----------------|-----------------|
|         | 4.5M           | 136 GB      | 1h 24m         | 1h 30m          |
1 Disk space taken by the final index. Does not include the source data or temporary files created during indexing.
2 Time required to index the data set once downloaded (excludes download time). Times were measured on the following hardware: AMD Ryzen Threadripper 3970X, 32 cores (64 logical cores), 3693 MHz, 128 GB RAM, SSD drive.
3 The label embedding learning timeout set in the project descriptor. A machine with a large number of CPU cores (8 or more) will likely complete learning before the timeout is reached.
4 Unlike the other data sets, the USPTO indexing time was measured on the following hardware: Intel Core i9-7960X (16 cores), 64 GB RAM, Samsung SSD 850 Evo 4 TB.
The following example data set projects are provided.
Overview of examples
The dataset-arxiv example explores the open arXiv.org data set of research publications (abstracts, titles, authors). The input descriptor is set up to consume a JSON records data dump file using the built-in document source.
This document source is a good start if you have a large collection of documents that cannot be easily converted to a more structured format like JSON.
Explores (anonymized) patient records from clinical trials, available from the clinicaltrials.gov website.
Contains a database of movie and TV show descriptions from imdb.com.
json and json-records files
The dataset-json and dataset-json-records projects contain project descriptors set up to ingest data in a structured JSON format. You can use these projects as a starting point for indexing your own data. See the Creating a project page for a walk-through.
The dataset-json project contains a small sample of the StackExchange data set, converted to a straightforward key-value JSON format.
The dataset-json-records project is a bit more complex because it shows how to parse JSON "record" files, a common format for various database dumps. A JSON record file is a sequence of independent JSON objects lined up contiguously in one or more physical files. This format is used by, for example, elasticsearch-dump. This project also demonstrates how to extract key-value pairs from JSON records.
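Lingo4G parses such files natively; purely to illustrate the format, here is a minimal, standalone Python sketch (not part of Lingo4G) that splits a concatenated-records string into individual objects using only the standard library:

```python
import json

# A JSON "record" file is a sequence of independent JSON objects laid out
# one after another, not wrapped in an enclosing array. The standard json
# module can consume such a stream with JSONDecoder.raw_decode.
def iter_json_records(text):
    """Yield each top-level JSON object from a concatenated-records string."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace (including newlines) between records.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Example input: two records lined up contiguously, as produced by tools
# such as elasticsearch-dump.
dump = '{"id": 1, "title": "First"}\n{"id": 2, "title": "Second"}\n'
records = list(iter_json_records(dump))
```

A real dump would be read incrementally from disk rather than loaded as one string, but the decoding step is the same.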
The dataset-movies example is what you should end up with after completing the tutorial demonstrating how to write a project descriptor from scratch. The source data is a JSON file with English movie titles, cast, and a short overview.
NIH research projects
This project explores summaries of research projects funded by the US National Institutes of Health, as available from NIH ExPORTER.
This project demonstrates the use of document sampling to speed up feature indexing.
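Sampling itself happens inside Lingo4G during indexing; the following standalone Python sketch (all names made up for illustration) only shows the general idea of discovering features on a random subset of documents instead of the whole collection:

```python
import random

# Illustrative sketch only: Lingo4G performs document sampling internally
# during feature indexing. Here we just draw a random subset of documents;
# feature discovery would then run over the subset, trading a little
# accuracy for a large reduction in work.
def sample_documents(docs, fraction, seed=0):
    """Return a random subset containing roughly `fraction` of `docs`."""
    rng = random.Random(seed)
    return [doc for doc in docs if rng.random() < fraction]

docs = [f"doc-{i}" for i in range(10_000)]
subset = sample_documents(docs, fraction=0.1)
# Feature discovery would then process `subset` rather than all of `docs`.
```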
Explores the research projects funded by the US National Science Foundation since circa 2007, as available from nsf.gov.
This project contains medical article abstracts from the OHSUMED collection. The data set is rather small, but historically relevant in information retrieval.
This project allows you to index the "open access" subset of the PubMed Central database of medical paper abstracts.
PubMed data is large, so this project makes use of various techniques for handling large document collections, such as document sampling to speed up feature indexing.
Due to the large size of the data set, Lingo4G does not download it automatically by default. Please see datasets/dataset-pubmed/README.txt for detailed instructions.
Explores the content of selected categories of the StackExchange Q&A network.
By default, the project processes the data of a single StackExchange site. You can pass the -Dstackexchange.site=<site> property to choose a different StackExchange category to process.
Depending on your interests, try one or more of the following:
boardgames: Board & Card Games Q&A, boardgames.stackexchange.com
cooking: Cooking Q&A, cooking.stackexchange.com
diy: Home Improvement Q&A, diy.stackexchange.com
rpg: Role-playing Games Q&A, rpg.stackexchange.com
scifi: Science Fiction & Fantasy Q&A, scifi.stackexchange.com
travel: Travel Q&A, travel.stackexchange.com
ux: User Experience Q&A, ux.stackexchange.com
You can also see the full list of available sites in XML format (the TinyName attribute of each record is the value to pass in the stackexchange.site property) or a more human-friendly list of archived site dumps. Note that the document source automatically truncates the stackexchange.com.7z suffix, so to fetch outdoors.stackexchange.com.7z you should pass outdoors.
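The suffix truncation applied to archive file names can be illustrated with a small standalone Python sketch (the helper name is made up for illustration; Lingo4G performs this mapping internally):

```python
# Hypothetical helper (not part of Lingo4G) illustrating the naming rule:
# archived dump files are named <site>.stackexchange.com.7z, and the value
# passed in the stackexchange.site property is the file name with that
# suffix removed.
def site_key(dump_file_name):
    suffix = ".stackexchange.com.7z"
    if dump_file_name.endswith(suffix):
        return dump_file_name[: -len(suffix)]
    return dump_file_name

print(site_key("outdoors.stackexchange.com.7z"))  # prints "outdoors"
```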
Provides access to patent grants and applications available from the US Patent and Trademark Office. Lingo4G supports parsing files from the "Patent Grant Full Text Data (No Images)" and "Patent Application Full Text Data (No Images)" sections.
This project makes use of several techniques to speed up indexing. For example, it indexes only the top 160 most frequent labels for each patent field, which limits the index size and speeds up analysis with a negligible loss of result accuracy.
Due to the large size of the original data set, Lingo4G does not download it automatically by default. Please see the data set's README.txt for detailed instructions on how to download the data manually.
Downloads and indexes Wikipedia data dumps.
This project makes use of document sampling to speed up indexing.
Due to the large size of Wikipedia data dumps, Lingo4G does not download them automatically by default. Please see datasets/dataset-wikipedia/README.txt for detailed instructions.