Example data sets

The L4G_HOME/datasets directory contains several example projects set up to download, index and analyze publicly available document sets, such as arXiv or PubMed.

These example projects should help you get started with Lingo4G. They can also serve as a starting point for your own projects: copy the project descriptor that most closely resembles your data and modify it from there.

We highly recommend starting with a smaller data set and progressing towards larger ones. The following table lists each example project, the approximate number of documents it contains, the required disk space, and the indexing and embedding times.

Project                  Number of docs   Disk space¹   Indexing time²   Embedding time³
dataset-arxiv            2.4M             6.0 GB        4m 09s           2m 04s
dataset-autoindex        7                9 kB          1s               —⁴
dataset-clinicaltrials   200k             1.6 GB        1m 58s           1m 20s
dataset-imdb             475k             810 MB        1m               1m 30s
dataset-json             251              1 MB          3s               —⁴
dataset-json-records     251              1 MB          2s               —⁴
dataset-nih.gov          2.7M             17 GB         14m              10m
dataset-nsf.gov          514k             1.9 GB        3m               2m 15s
dataset-ohsumed          350k             820 MB        53s              55s
dataset-pubmed           4.93M            109 GB        51m              1h 8m
dataset-stackexchange    298k             778 MB        59s              25s
dataset-uspto            11M              469 GB        3h               3h 31m
dataset-wikipedia        6.76M            81 GB         47m              50m

¹ Disk space taken by the final index. Does not include the source data or temporary files created during indexing.

² Time required to index the data set once it has been downloaded (excludes download time). Times were measured on the following hardware: Intel i9-13900K, 24 cores (32 threads), 3.0 GHz, 128 GB RAM, SSD drive.

³ Time required to compute label and document embeddings.

⁴ The dataset-autoindex, dataset-json and dataset-json-records projects ship with very small amounts of data by default, not enough to compute meaningful label embeddings.
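
To get started, you can download and index one of the smaller data sets using the l4g script shipped in the Lingo4G distribution directory. A minimal sketch (the exact command-line syntax may differ between Lingo4G versions, so consult the documentation of your release):

  l4g index -p datasets/dataset-ohsumed

Other l4g commands, such as l4g server, accept the same -p option to point at a project.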

The following example data set projects are provided.

Overview of examples

arXiv

The dataset-arxiv example explores the open arXiv.org data set of research publications (abstracts, titles, authors). The input descriptor is set up to consume a JSON records data dump file using the built-in json-records document source.

autoindex

The dataset-autoindex example extracts text content from local HTML, PDF and other documents using Apache Tika. See Indexing text documents for more information.

This document source is a good starting point if you have a large collection of documents that cannot easily be converted to a more structured format such as JSON.

clinical trials

Explores (anonymized) records of clinical trials available at the clinicaltrials.gov web site.

CSV importer

This example contains a document source that can import document fields from Excel CSV files.

CSV files must be comma-separated, UTF-8 encoded text files. The first row must contain column names that correspond to document fields; you still need to declare these fields in the fields section of the project descriptor.
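
For illustration, a minimal CSV file for a project whose descriptor declares id, title and summary document fields (hypothetical field names, chosen only for this example) could look like this:

  id,title,summary
  1,First document,A short summary of the first document.
  2,Second document,A short summary of the second document.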

IMDb

Contains a database of movie and TV show descriptions from imdb.com.

json and json-records files

The dataset-json and dataset-json-records projects contain project descriptors set up to ingest data in a structured JSON format. You can use these projects as a starting point for indexing your own data. See the Creating a project section for a walk-through.

The dataset-json project contains a small sample of the StackExchange data set, converted to a straightforward key-value JSON format.

The dataset-json-records project is a bit more complex: it shows how to parse JSON "record" files, a common format for database dumps. A JSON record file is a sequence of independent JSON objects lined up contiguously in one or more physical files. This format is used by, for example, Apache Drill and elasticsearch-dump. The project also demonstrates the use of JSON paths to extract key-value pairs from JSON records.
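
For example, a JSON record file with two records could look like this (field names are made up for illustration):

  { "id": "1", "title": "First document" }
  { "id": "2", "title": "Second document" }

Note that the file as a whole is not a single valid JSON value: the objects simply follow one another and each record is parsed independently.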

Movies

The dataset-movies example is the project you should end up with after completing the tutorial that demonstrates how to write a project descriptor from scratch. The source data is a JSON file with English movie titles, cast and short overviews, sourced from the wikipedia-movie-data repository.

NIH research projects

This project explores summaries of research projects funded by the US National Institutes of Health, as available from NIH ExPORTER.

This project demonstrates the use of document sampling to speed up feature indexing.

NSF grants

Explores research projects funded by the US National Science Foundation since around 2007, as available from nsf.gov.

OHSUMED

This project contains medical article abstracts from the OHSUMED collection. The data set is rather small, but historically relevant in information retrieval.

PubMed

This project allows you to index the "open access" subset of the PubMed Central database of medical paper abstracts.

PubMed data is large. This project makes use of various techniques for handling large document collections, for example document sampling to speed up feature indexing.

Data requires manual download

Due to the large size of the data set, Lingo4G does not download it automatically by default. Please see datasets/dataset-pubmed/README.txt for detailed instructions.

stack exchange

Explores the content of selected sites of the StackExchange Q&A network.

By default, the project uses the superuser.com data. You can pass the -Dstackexchange.site=<site> property to choose a different StackExchange site to process.

The full list of available sites is published in XML format (the TinyName attribute of each record is the value to pass to the stackexchange.site property); a more human-friendly list of archived site dumps is also available. Note that the document source automatically truncates the stackexchange.com.7z suffix, so to fetch outdoors.stackexchange.com.7z you should pass -Dstackexchange.site=outdoors.
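
For example, to download and index questions and answers from the outdoors site (assuming the l4g command syntax shown earlier):

  l4g index -p datasets/dataset-stackexchange -Dstackexchange.site=outdoors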

US patents

Provides access to patent grants and applications available from the US Patent and Trademark Office. Lingo4G supports parsing files from the "Patent Grant Full Text Data (No Images)" and "Patent Application Full Text Data (No Images)" sections.

This project makes use of document sampling to speed up indexing. Additionally, it sets the maxPhrasesPerField parameter to use only the top 160 most frequent labels for each patent field, which limits the index size and speeds up analysis with a negligible loss of accuracy.
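
In the project descriptor this boils down to a single JSON property, along the lines of the following sketch (the exact location of the property within the descriptor may vary between Lingo4G versions; treat the descriptor shipped in datasets/dataset-uspto as the authoritative reference):

  "indexer": {
    "maxPhrasesPerField": 160
  }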

Data requires manual download

Due to the large size of the original data set, Lingo4G does not download it automatically by default. Please see datasets/dataset-uspto/README.txt for detailed instructions on how to download the data manually.

Wikipedia

Downloads and indexes Wikipedia data dumps.

This project makes use of document sampling to speed up indexing.

Data requires manual download

Due to the large size of Wikipedia data dumps, Lingo4G does not download them automatically by default. Please see datasets/dataset-wikipedia/README.txt for detailed instructions.