Indexing text documents
Lingo4G can index text from documents in typical formats, such as Word, PDF, OpenOffice or HTML.
To index your collection of files, perform the following steps:
Copy all files you'd like to index to a single folder.
Lingo4G will scan and index all files under that folder, including subfolders. Note that Lingo4G may not support all types of content, for example encrypted PDFs or very old Microsoft Word formats. See the list of supported formats for more details.
In the command console, run the following command:
l4g index -p datasets/dataset-autoindex -Dinput.dir=[absolute folder path]
Replace [absolute folder path] with the path to the folder containing your documents.
If Lingo4G cannot parse a file, it will print a warning to the console.
Run the following command to start the Lingo4G server:
l4g server -p datasets/dataset-autoindex
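The steps above can be sketched as a small shell helper. This is only an illustration: the function names `abs_dir` and `index_and_serve` and the `DRY_RUN` switch are hypothetical, and the sketch assumes that `l4g` is on your `PATH` and that the current directory is one from which the relative path `datasets/dataset-autoindex` resolves. The two `l4g` invocations themselves are the ones shown in the steps.

```shell
# Turn a possibly relative folder into the absolute path that -Dinput.dir expects.
abs_dir() {
  (cd "$1" 2>/dev/null && pwd)
}

# Index a folder with the dataset-autoindex project, then start the server.
# Set DRY_RUN=1 to print the commands instead of running them (hypothetical switch).
index_and_serve() {
  input_dir="$(abs_dir "$1")" || { echo "No such folder: $1" >&2; return 1; }
  run=""
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"
  # The server is started only if indexing finished successfully.
  $run l4g index -p datasets/dataset-autoindex "-Dinput.dir=$input_dir" &&
    $run l4g server -p datasets/dataset-autoindex
}
```

For example, `DRY_RUN=1 index_and_serve ./my-documents` prints both commands with the folder resolved to an absolute path, which is a cheap way to check the path before starting a long indexing run.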
Lingo4G makes the following document fields available for your analyses and document selection queries:
the last path segment of the input file
the auto-detected MIME type of the file
document title, optional
contents of the document
AutoIndex implements a document source that converts common document formats to plain text, which Lingo4G requires on
input for indexing. Lingo4G uses a subset of the Apache Tika library to
perform this conversion and allows indexing of the following file formats.
|File type||Typical file extensions||Description|
|PDF||||Adobe PDF files. PDF files may contain remapped fonts or outline glyphs, in which case text extraction (without applying OCR techniques) is impossible. Text extraction from secured or signed PDFs may not be possible either.|
|Plain text||||Plain text files. Tika autodetects the encoding; the heuristic may make mistakes for encodings with similar byte distributions.|
|HTML||||Hypertext documents. Tika doesn't attempt to render the page; it only sanitizes and extracts content from HTML tags.|
|Open Document formats||||OpenOffice, LibreOffice and other Open Document format documents.|
|Rich Text Format||||Rich Text Format documents.|
|Microsoft Office||||Microsoft Office documents, including MS Office 9x and later.|
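Because not every file type is supported, it can help to take stock of what the input folder contains before indexing. The following is a minimal sketch using standard POSIX tools; the function name `count_by_extension` is hypothetical. Keep in mind that an extension is only a hint, since the format is detected from the file itself.

```shell
# Count files per lowercased extension under a folder (subfolders included),
# most frequent extension first. Files without a dot in their name are skipped.
count_by_extension() {
  find "$1" -type f -name '*.*' |
    sed 's/.*\.//' |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -rn
}
```

Running `count_by_extension /path/to/documents` lists each extension with its file count, which makes unexpected or rarely used formats easy to spot.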
Tika will try to auto-detect the format of each input file, so AutoIndex can parse and import
other file formats supported by Tika. However, to
keep the Lingo4G distribution smaller, we trimmed down several Tika dependencies. If you need to support
some exotic file format, add the required dependencies manually to the data source's
Tika can use heuristics to extract plain text from files with an unknown or undefined character encoding. In such cases, the quality of text extraction may be unsatisfactory.
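If you already know the encoding of your plain text files, one possible way to reduce reliance on encoding heuristics is to convert the files to UTF-8 before indexing. The sketch below assumes the `iconv` utility is installed; the function name `to_utf8`, the folder arguments and the source encoding are hypothetical, and whether this improves extraction for your collection is something to verify on a sample.

```shell
# Convert every .txt file directly inside SRC_DIR from a known encoding to UTF-8,
# writing the results to DEST_DIR. Subfolders are not processed in this sketch.
# Usage: to_utf8 SRC_DIR DEST_DIR FROM_ENCODING
to_utf8() {
  src="$1"; dest="$2"; from="$3"
  mkdir -p "$dest" || return 1
  for f in "$src"/*.txt; do
    [ -e "$f" ] || continue   # the folder contains no .txt files
    iconv -f "$from" -t UTF-8 "$f" > "$dest/$(basename "$f")" || return 1
  done
}
```

For example, `to_utf8 ./raw ./converted ISO-8859-1` writes UTF-8 copies to `./converted`, which can then serve as the input folder for indexing. The original files are left untouched.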