Example data sets

The L4G_HOME/datasets directory contains several example projects set up to download, index and analyze publicly available document sets, such as arXiv or PubMed.

These example projects should help you get started with Lingo4G. They are also a good starting point for rolling out your own projects: copy and modify the project descriptor that most closely approximates your data.

We highly recommend starting with a smaller data set and progressing towards larger ones. The following table lists the name of each example project, the approximate number of documents it contains, the disk space the index requires, and the indexing and label embedding times.

Project                  Number of docs   Disk space [1]   Indexing time [2]   Embedding time [3]
dataset-arxiv            2.5M             5.2GB            12m                 12m
dataset-autoindex        7                9kB              1s                  — [5]
dataset-clinicaltrials   200k             2GB              3m                  8m
dataset-imdb             570k             830MB            2m                  6m
dataset-json             251              1MB              3s                  — [5]
dataset-json-records     251              1MB              2s                  — [5]
dataset-nih.gov          2.6M             15GB             36m                 35m
dataset-nsf.gov          270k             1.1GB            3m                  5m
dataset-ohsumed          350k             700MB            2m                  5m
dataset-pubmed           4.5M             136GB            1h 24m              1h 30m
dataset-stackexchange    298k             800MB            3m                  7m
dataset-uspto            7.86M            474GB            4h 1m [4]           6h
dataset-wikipedia        6.5M             74GB             1h                  2h

[1] Disk space taken by the final index. Does not include the source data or temporary files created during indexing.

[2] Time required to index the data set once downloaded (excludes download time). Times were measured on the following hardware: AMD Ryzen Threadripper 3970X, 32 cores (64 logical cores), 3693 MHz, 128 GB RAM, SSD drive.

[3] The label embedding learning timeout set in the project descriptor. A machine with a large number of CPU cores (8 or more) will likely complete learning before the timeout is reached.

[4] Unlike the other data sets, the USPTO indexing time was measured on the following hardware: Intel Core i9-7960X (16 cores), 64 GB RAM, Samsung SSD 850 Evo 4TB.

[5] The dataset-autoindex, dataset-json and dataset-json-records projects ship with very small amounts of data by default, not enough to compute meaningful label embeddings.

The following example projects are provided.

Overview of examples

arXiv

The dataset-arxiv example explores the open arXiv.org data set of research publications (abstracts, titles, authors). The input descriptor is set up to consume a JSON records data dump file using the built-in json-records document source.

autoindex

The dataset-autoindex project uses a document source that extracts text content from local HTML, PDF and other documents using Apache Tika. See indexing text documents for more information.

This document source is a good starting point if you have a large collection of documents that cannot easily be converted to a more structured format, such as JSON.
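
Assuming the l4g launch script that ships with the Lingo4G distribution, indexing this example could look like the sketch below (the exact command-line options are described in the Lingo4G documentation):

    # Index the sample documents bundled with the autoindex example project.
    l4g index -p datasets/dataset-autoindex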

clinical trials

Explores (anonymized) patient records from clinical trials data available at the clinicaltrials.gov website.

IMDb

Contains a database of movie and TV show descriptions from imdb.com.

json and json-records files

The dataset-json and dataset-json-records projects contain project descriptors set up to ingest data in a structured JSON format. You can use these projects as a starting point for indexing your own data. See the Creating a project pages for a walk-through.

The dataset-json project contains a small sample of the StackExchange data set, converted to a straightforward key-value JSON format.
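
For illustration, a single document in such a key-value format could look like the following sketch (the field names and values here are made up, not the exact schema used by the project):

    {
      "title": "How do I disable hibernation?",
      "body": "My laptop refuses to shut down completely when...",
      "tags": "windows hibernation power-management"
    }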

The dataset-json-records project is a bit more complex: it shows how to parse JSON "record" files, a common format for various database dumps. A JSON record file is a sequence of independent JSON objects lined up contiguously in one or more physical files. This format is used by, for example, Apache Drill and elasticsearch-dump. The project also demonstrates the use of JSON paths to extract key-value pairs from JSON records.
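
For example, a minimal JSON record file could look like this (two independent objects with made-up fields, concatenated without any enclosing array or separators):

    {"id": "1", "title": "First document", "body": "Text of the first document."}
    {"id": "2", "title": "Second document", "body": "Text of the second document."}

A JSON path expression such as $.title would then select the value of the title key from each record.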

Movies

The dataset-movies example is what you should end up with after completing the tutorial demonstrating how to write a project descriptor from scratch. The source data is a JSON file with English movie titles, cast and a short overview, sourced from the wikipedia-movie-data repository.
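
A single entry in that file looks roughly like the sketch below (abbreviated and approximate; see the wikipedia-movie-data repository for the exact schema):

    {
      "title": "The Matrix",
      "year": 1999,
      "cast": ["Keanu Reeves", "Laurence Fishburne"],
      "extract": "The Matrix is a 1999 science fiction action film..."
    }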

NIH research projects

This project explores summaries of research projects funded by the US National Institutes of Health, as available from NIH ExPORTER.

This project demonstrates the use of document sampling to speed up feature indexing.

NSF grants

Explores the research projects funded by the US National Science Foundation since circa 2007, as available from nsf.gov.

OHSUMED

This project contains medical article abstracts from the OHSUMED collection. The data set is rather small, but historically relevant in information retrieval.

PubMed

This project allows you to index the "open access" subset of the PubMed Central database of medical paper abstracts.

PubMed data is large, so this project makes use of various techniques for handling large document collections, for example document sampling to speed up feature indexing.

Data requires manual download

Due to the large size of the data set, Lingo4G does not download it automatically by default. Please see datasets/dataset-pubmed/README.txt for detailed instructions.

stack exchange

Explores the content of selected sites of the StackExchange Q&A network.

By default, the project uses the superuser.com data. You can pass the -Dstackexchange.site=<site> property to choose a different StackExchange site to process.

You can also browse the full list of available sites in XML format (the TinyName attribute of each record is the value to pass to the stackexchange.site property) or a more human-friendly list of archived site dumps. Note that the document source automatically strips the .stackexchange.com.7z suffix, so to fetch outdoors.stackexchange.com.7z you should pass -Dstackexchange.site=outdoors.
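
Putting this together, and again assuming the l4g launch script from the Lingo4G distribution, re-indexing the project against a different site dump could look like this sketch:

    # Download and index the outdoors.stackexchange.com dump.
    l4g index -p datasets/dataset-stackexchange -Dstackexchange.site=outdoors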

US patents

Provides access to patent grants and applications available from the US Patent and Trademark Office. Lingo4G supports parsing files from the "Patent Grant Full Text Data (No Images)" and "Patent Application Full Text Data (No Images)" sections.

This project makes use of document sampling to speed up indexing. Additionally, it sets the maxPhrasesPerField parameter to use only the top 160 most frequent labels for each patent field, which limits the index size and speeds up analysis with a negligible loss of result accuracy.
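
In the project descriptor, this could look like the fragment below. The placement of maxPhrasesPerField shown here is an assumption; consult the actual dataset-uspto descriptor for the authoritative location of the parameter:

    {
      "indexer": {
        "features": [
          {
            // Assumed placement: keep only the 160 most frequent labels per field.
            "type": "phrases",
            "maxPhrasesPerField": 160
          }
        ]
      }
    }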

Data requires manual download

Due to the large size of the original data set, Lingo4G does not download it automatically by default. Please see datasets/dataset-uspto/README.txt for detailed instructions on how to download the data manually.

Wikipedia

Downloads and indexes Wikipedia data dumps.

This project makes use of document sampling to speed up indexing.

Data requires manual download

Due to the large size of Wikipedia data dumps, Lingo4G does not download them automatically by default. Please see datasets/dataset-wikipedia/README.txt for detailed instructions.