Example data sets
The L4G_HOME/datasets directory contains several example projects set up to download, index and analyze publicly available document sets, such as arXiv or PubMed.
These example projects should be helpful in getting started with Lingo4G, as well as a starting point for rolling out your own projects (copy the project descriptor that most closely matches your data and modify it).
We highly recommend starting with a smaller data set and progressing towards larger ones. The following table lists each example project, the approximate number of documents it contains, the disk space the index requires, and the indexing and embedding times.
Project | Number of docs | Disk space¹ | Indexing time² | Embedding time³
---|---|---|---|---
dataset-arxiv | 2.4M | 6.0GB | 4m 09s | 2m 04s
dataset-autoindex | 7 | 9kB | 1s | —⁴
dataset-clinicaltrials | 200k | 1.6GB | 1m 58s | 1m 20s
dataset-imdb | 475k | 810MB | 1m | 1m 30s
dataset-json | 251 | 1MB | 3s | —⁴
dataset-json-records | 251 | 1MB | 2s | —⁴
dataset-nih.gov | 2.7M | 17GB | 14m | 10m
dataset-nsf.gov | 514k | 1.9GB | 3m | 2m 15s
dataset-ohsumed | 350k | 820MB | 53s | 55s
dataset-pubmed | 4.93M | 109GB | 51m | 1h 8m
dataset-stackexchange | 298k | 778MB | 59s | 25s
dataset-uspto | 11M | 469GB | 3h | 3h 31m
dataset-wikipedia | 6.76M | 81GB | 47m | 50m
¹ Disk space taken by the final index. Does not include the source data or temporary files created during indexing.
² Time required to index the data set once it has been downloaded (excludes download time). The times are reported for indexing executed on the following hardware: Intel i9-13900K, 24 cores (32 threads), 3.0 GHz, 128 GB RAM, SSD drive.
³ Time required to compute label and document embeddings.
⁴ The data set is too small for embeddings to be computed.
The following example data set projects are provided.
Overview of examples
arXiv
The dataset-arxiv example explores the open arXiv.org data set of research publications (abstracts, titles, authors). The input descriptor is set up to consume a JSON records data dump file using the built-in json-records document source.
autoindex
The dataset-autoindex project uses a document source that extracts text content from local HTML, PDF and other documents using Apache Tika. See indexing text documents for more information.
This document source is a good starting point if you have a large collection of documents that cannot be easily converted to a more structured format such as JSON.
clinical trials
Explores (anonymized) patient records from clinical trials data available at the clinicaltrials.gov web site.
CSV importer
This example contains a document source that can import document fields from Excel CSV files.
CSV files must be comma-separated, UTF-8 encoded text files. The first row must contain column names that correspond to document fields; these fields still have to be declared in the fields section of the project descriptor, as shown in the sketch below.
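For illustration, here is a minimal sketch of a compatible CSV file; the column names (id, title, summary) are hypothetical and simply name the document fields you would declare in the descriptor:

```csv
id,title,summary
1,"Example document one","Short summary of the first document."
2,"Example document two","Short summary of the second document."
```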
IMDb
Contains a database of movie and TV show descriptions from imdb.com.
json and json-records files
The dataset-json and dataset-json-records projects contain project descriptors set up to ingest data in a structured JSON format. You can use these projects as a starting point for indexing your own data. See the Creating a project pages for a walk-through.
The dataset-json project contains a small sample of the StackExchange data set, converted to a straightforward key-value JSON format.
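For illustration only, a single document in such a key-value format could look like the sketch below; the field names are hypothetical, not the project's exact schema:

```json
{
  "id": "1042",
  "title": "How do I recover a deleted partition?",
  "body": "Full text of the question and its answers...",
  "tags": "partitioning data-recovery"
}
```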
The dataset-json-records project is a bit more complex because it shows how to parse JSON "record" files, a common format for various database dumps. A JSON record file is a sequence of independent JSON objects lined up contiguously in one or more physical files. This format is used by, for example, Apache Drill and elasticsearch-dump. The project also demonstrates the use of JSON paths to extract key-value pairs from JSON records.
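As a sketch, a JSON record file is simply independent objects written back to back (field names hypothetical):

```json
{"id": "1", "title": "First record", "body": "..."}
{"id": "2", "title": "Second record", "body": "..."}
{"id": "3", "title": "Third record", "body": "..."}
```

A JSON path expression such as $.title would then select the title value of each consecutive record.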
Movies
The dataset-movies example is what you should end up with after completing the tutorial demonstrating how to write a project descriptor from scratch. The source data is a JSON file with English movie titles, cast and a short overview, sourced from the wikipedia-movie-data repository.
NIH research projects
This project explores summaries of research projects funded by the US National Institutes of Health, as available from NIH ExPORTER.
This project demonstrates the use of document sampling to speed up feature indexing.
NSF grants
Explores the research projects funded by the US National Science Foundation since circa 2007, as available from nsf.gov.
OHSUMED
This project contains medical article abstracts from the OHSUMED collection. The data set is rather small, but historically relevant in information retrieval.
PubMed
This project allows you to index the "open access" subset of the PubMed Central database of medical paper abstracts.
PubMed data is large, so this project makes use of various techniques for handling large document collections, such as document sampling to speed up feature indexing.
Due to the large size of the data set, Lingo4G does not download it automatically by default. Please see datasets/dataset-pubmed/README.txt for detailed instructions.
stack exchange
Explores the content of selected categories of the StackExchange Q&A network.
By default, the project uses the superuser.com data. You can pass the -Dstackexchange.site=<site> property to choose a different StackExchange category to process.
Depending on your interests, try one or more of the following:

- boardgames: Board & Card Games Q&A, boardgames.stackexchange.com
- cooking: Cooking Q&A, cooking.stackexchange.com
- diy: Home Improvement Q&A, diy.stackexchange.com
- rpg: Role-playing Games Q&A, rpg.stackexchange.com
- scifi: Science Fiction & Fantasy Q&A, scifi.stackexchange.com
- travel: Travel Q&A, travel.stackexchange.com
- ux: User Experience Q&A, ux.stackexchange.com
You can also see the full list of available sites in XML format (where the TinyName attribute of each record is the value to pass to the stackexchange.site property) or a more human-friendly list of archived site dumps. Note that the document source automatically truncates the stackexchange.com.7z suffix, so to fetch outdoors.stackexchange.com.7z you should pass -Dstackexchange.site=outdoors.
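For example, to index the cooking site instead of the default superuser.com, an invocation along these lines should work (a sketch assuming the standard l4g launcher script; adjust the project path to your layout):

```
# Assumes the Lingo4G distribution directory (L4G_HOME) is the working directory.
l4g index -p datasets/dataset-stackexchange -Dstackexchange.site=cooking
```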
US patents
Provides access to patent grants and applications available from the US Patent and Trademark Office. Lingo4G supports parsing files from the "Patent Grant Full Text Data (No Images)" and "Patent Application Full Text Data (No Images)" sections.
This project makes use of document sampling to speed up indexing. Additionally, it sets the maxPhrasesPerField parameter to use only the top 160 most frequent labels per patent field, which limits the index size and speeds up analysis with a negligible loss of result accuracy.
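As a rough sketch of what this setting looks like, the fragment below shows the parameter with the value this project uses; the exact nesting within the descriptor is an assumption here, so consult the dataset-uspto project descriptor for the authoritative version:

```json
{
  "indexer": {
    // Hypothetical nesting; see dataset-uspto's descriptor for the exact location.
    "maxPhrasesPerField": 160
  }
}
```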
Due to the large size of the original data set, Lingo4G does not download it automatically by default. Please see datasets/dataset-uspto/README.txt for detailed instructions on how to download the data manually.
Wikipedia
Downloads and indexes Wikipedia data dumps.
This project makes use of document sampling to speed up indexing.
Due to the large size of Wikipedia data dumps, Lingo4G does not download them automatically by default. Please see datasets/dataset-wikipedia/README.txt for detailed instructions.