Indexing text documents

Lingo4G can index text from documents in typical formats, such as Word, PDF, OpenOffice or HTML.

To index your collection of files, perform the following steps:

  1. Copy all files you'd like to index to a single folder.

    Lingo4G will scan and index all files under that folder, including subfolders. Note that Lingo4G may not support all types of content, for example encrypted PDFs or very old Microsoft Word formats. See the list of supported formats for more details.

  2. In the command console, run the following command:

    l4g index -p datasets/dataset-autoindex -Dinput.dir=[absolute folder path]

    replacing [absolute folder path] with the path to the folder containing your documents.

    If Lingo4G cannot parse a file, it will print a warning to the console.

  3. Run the following command to start the Lingo4G server:

    l4g server -p datasets/dataset-autoindex

  4. Point your browser at localhost:8080 and choose a demo application to explore your data.

    Lingo4G makes the following document fields available for your analyses and document selection queries:

    fileName

    the last path segment of the input file

    contentType

    the auto-detected MIME type of the file

    title

    document title, optional

    content

    contents of the document
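
Steps 2 and 3 above can also be scripted. The following is a minimal Python sketch, not part of Lingo4G: the helper names index_command and server_command are hypothetical, and the sketch assumes that the l4g launcher is on the PATH and that the script runs from the Lingo4G home folder. It only assembles the two command lines shown in the steps, resolving the input folder to an absolute path first.

```python
from pathlib import Path

PROJECT = "datasets/dataset-autoindex"  # project path used in the steps above


def index_command(folder):
    """Build the 'l4g index' command line for the given input folder."""
    absolute = Path(folder).resolve()  # the steps ask for an absolute folder path
    return ["l4g", "index", "-p", PROJECT, f"-Dinput.dir={absolute}"]


def server_command():
    """Build the 'l4g server' command line for the same project."""
    return ["l4g", "server", "-p", PROJECT]


if __name__ == "__main__":
    # Print the commands instead of running them; to execute one, pass the
    # list to subprocess.run(..., check=True).
    print(" ".join(index_command("my-documents")))
    print(" ".join(server_command()))
```

Building the command as a list rather than a single string avoids quoting problems when the folder path contains spaces.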

Supported formats

dataset-autoindex implements a document source that converts common document formats to plain text, which Lingo4G requires as input for indexing. Lingo4G uses a subset of the Apache Tika library to perform this conversion and can index the following file formats.

File type (typical file extensions) and description

PDF (*.pdf)

Adobe PDF files. PDF files may contain remapped fonts or outline glyphs, in which case text extraction is impossible without OCR techniques. Text extraction from secured or signed PDFs may not be possible.

plain text (*.txt)

Plain text files. Tika autodetects the encoding, but the heuristic may make mistakes for encodings with similar byte distributions.

HTML files (*.html, *.htm)

Hypertext documents. Tika does not attempt to render the page; it only sanitizes the markup and extracts content from HTML tags.

Open Document formats (*.odt, *.odf)

OpenOffice, LibreOffice and other Open Document format documents.

Rich Text Format (*.rtf)

Rich text format documents.

Microsoft Office (*.doc, *.docx)

Microsoft Office documents, including MS Office 9x and later.

Other files (*.*)

Tika will try to auto-detect the format of each input file, so AutoIndex can parse and import other file formats supported by Tika. However, to keep the Lingo4G distribution small, we removed several Tika dependencies. If you need to support a less common file format, add the required dependencies manually to the data source's lib folder.
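
Before indexing, it can be useful to see how many files in the input folder fall under each row of the table above. The following Python sketch is a hypothetical helper, not part of Lingo4G. It classifies files by extension only, which is an approximation: Tika detects the format from the file itself, so the actual outcome may differ.

```python
from collections import Counter
from pathlib import Path

# Extensions taken from the table above; everything else is "Other files".
FILE_TYPES = {
    ".pdf": "PDF",
    ".txt": "plain text",
    ".html": "HTML files",
    ".htm": "HTML files",
    ".odt": "Open Document formats",
    ".odf": "Open Document formats",
    ".rtf": "Rich Text Format",
    ".doc": "Microsoft Office",
    ".docx": "Microsoft Office",
}


def count_file_types(folder):
    """Count files under folder (including subfolders) per table row."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            counts[FILE_TYPES.get(path.suffix.lower(), "Other files")] += 1
    return counts


if __name__ == "__main__":
    # Report on the current folder, most frequent file type first.
    for file_type, count in count_file_types(".").most_common():
        print(f"{count:6d}  {file_type}")
```

A large "Other files" count suggests checking whether those formats need additional Tika dependencies.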

Limitations

Limited quality of text extraction

Tika can use heuristics to extract plain text from files with an unknown or undefined character encoding. In such cases, the quality of text extraction may be unsatisfactory.
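
To illustrate why encoding detection can fail, the Python sketch below decodes the same bytes with two single-byte encodings. It does not use Tika or reproduce its heuristic; the function name decode_both and the sample word are chosen for illustration only. Both decodings succeed without an error, so a detector has nothing but byte statistics to choose between them.

```python
def decode_both(raw):
    """Decode the same bytes as Windows-1250 and as Windows-1252."""
    return raw.decode("cp1250"), raw.decode("cp1252")


# The Polish word "żółć", stored in the Windows-1250 encoding.
RAW = "żółć".encode("cp1250")

if __name__ == "__main__":
    as_cp1250, as_cp1252 = decode_both(RAW)
    print(as_cp1250)  # żółć
    print(as_cp1252)  # ¿ó³æ
```

If the encoding of your plain text files is known, converting them to UTF-8 before indexing is one way to reduce the room for a wrong guess.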

Automatic stopword detection

Automatic text extraction using Tika provides only the title and content of a document, so the options for automatic discovery of stopwords are very limited. Edit label exclusion dictionaries and reindex to improve the quality of analysis results.
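
One way to find candidates for the label exclusion dictionaries is to look for terms that occur in nearly every document, such as boilerplate from headers or footers. The Python sketch below is a hypothetical helper, not a Lingo4G feature: it assumes you already have the extracted text of a sample of documents, and it neither reads nor writes dictionary files. Review the resulting terms manually before excluding any of them.

```python
import re
from collections import Counter


def frequent_terms(documents, min_share=0.8):
    """Return terms that occur in at least min_share of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(re.findall(r"\w+", text.lower())))
    threshold = min_share * len(documents)
    return sorted(term for term, n in doc_freq.items() if n >= threshold)


if __name__ == "__main__":
    docs = [
        "Confidential draft: quarterly sales report",
        "Confidential draft: hiring plan",
        "Confidential draft: server migration notes",
    ]
    print(frequent_terms(docs))  # ['confidential', 'draft']
```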