Feature indexing
The indexer section configures feature indexing and feature extractors.
The indexer section of the project descriptor has the following structure:
{
"embedding": {},
"features": {},
"indexCompression": "lz4",
"maxCacheableFst": "500M",
"samplingRatio": 1,
"stopLabelExtractor": {},
"threads": "auto"
}
The most important part of the indexer configuration section is the definition of feature extractors, and thus feature fields, in the features field. If you wish to use feature embeddings, refer to the embedding section of the configuration. Large data sets may benefit from using document sampling to extract features.
Any changes you make to the indexer section will most likely require running a full reindexing cycle.
embedding
Configures learning aspects of multidimensional vector representations of labels and documents.
{
"documents": {},
"labels": {}
}
documents
Configures learning of document embeddings.
{
"enabled": false,
"index": {},
"input": {}
}
enabled
If true, Lingo4G learns document embeddings as part of indexing or reindexing.
If false, indexing or reindexing does not include the document embedding learning step. At a later point, you can add document embeddings to an existing index by invoking the l4g learn-embeddings command with the --recompute-document-embeddings option.
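For example, to have Lingo4G learn document embeddings during indexing, you can enable them in the embedding section of the indexer configuration (a minimal sketch; all other properties keep their default values):
{
  "embedding": {
    "documents": {
      "enabled": true
    }
  }
}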
index
Configures the k-nearest-neighbors (kNN) index for document embeddings.
{
"maxNeighborsPerNode": 24
}
The kNN index enables Lingo4G to quickly find the most similar embedding vectors given one input vector.
maxNeighborsPerNode
Determines the maximum degree of the index graph nodes.
The default value is adequate in most scenarios. Increasing maxNeighborsPerNode may improve the approximate search accuracy, at the cost of a larger on-disk and in-memory size of the kNN index and increased search times.
input
Configures the input for document embedding learning.
{
"fields": null
}
fields
The feature fields to use for document embedding learning.
Lingo4G builds the embedding vector for a document by combining embedding vectors of labels present in that document. The fields parameter determines from which feature fields Lingo4G reads document labels when learning the document embedding.
By default, Lingo4G uses all available feature fields to compute document embeddings. To learn more "focused" embeddings, you can set the fields array to an explicit list of fields that, for example, skips the document body field and only keeps the document title and abstract.
Note that document embedding learning time is largely independent of the number of fields Lingo4G uses during the process.
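For example, to learn document embeddings only from title and abstract features, you could configure the documents section as follows (a sketch; the title$phrases and abstract$phrases feature field names are hypothetical and depend on your feature extractor setup):
{
  "documents": {
    "enabled": true,
    "input": {
      "fields": [ "title$phrases", "abstract$phrases" ]
    }
  }
}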
labels
Configures learning of label embeddings.
{
"enabled": false,
"index": {},
"input": {},
"model": {}
}
enabled
If true, Lingo4G learns label embeddings as part of indexing or reindexing.
If false, indexing or reindexing does not include the label embedding learning step. At a later point, you can add label embeddings to an existing index by invoking the l4g learn-embeddings command with the --recompute-label-embeddings option.
index
Configures the k-nearest-neighbors (kNN) index for label embeddings.
{
"maxNeighborsPerNode": 24
}
The kNN index enables Lingo4G to quickly find the most similar embedding vectors given one input vector.
maxNeighborsPerNode
Determines the maximum degree of the index graph nodes.
The default value is adequate in most scenarios. If your index contains more than 4M labels, you may consider increasing maxNeighborsPerNode to 32 for increased nearest neighbor search accuracy, at the cost of a larger on-disk and in-memory size of the kNN index and increased search times.
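For example, for an index with more than 4M labels, you could raise the limit as follows (a sketch of the labels section):
{
  "labels": {
    "index": {
      "maxNeighborsPerNode": 32
    }
  }
}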
input
Determines the subset of labels for which to learn embeddings.
{
"fields": [],
"maxDocs": null,
"maxLabelsForDirectLearning": 200000,
"minLabelTfForDirectLearning": 200,
"minLabelTfForEstimatedLearning": 40
}
To improve the quality and performance of embedding learning, Lingo4G uses two methods to compute embedding vectors for labels: direct learning and estimation. First, Lingo4G directly learns embedding vectors for the high-frequency labels. Next, Lingo4G estimates the embedding vectors for the remaining lower-frequency labels based on the directly-learnt embeddings of the high-frequency labels.
The input section configures how Lingo4G chooses the set of labels for direct learning and estimation.
Lingo4G prepares labels for embedding learning in the following way:
- First, Lingo4G collects labels for direct embedding learning. To be eligible for direct learning, a label must occur at least minLabelTfForDirectLearning times across all documents in the index. To improve performance, Lingo4G directly learns embedding vectors only for up to maxLabelsForDirectLearning of the eligible labels, moving the remaining least-frequent eligible labels to the estimated learning set.
- Then, Lingo4G identifies labels for which to estimate the embedding vectors. To be eligible for embedding vector estimation, a label must occur at least minLabelTfForEstimatedLearning times across all documents in the index. Because estimation is much faster than direct learning, Lingo4G computes embedding vectors for all labels that meet the frequency threshold.
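As an illustration of the two-step selection described above, the following sketch of the input section lowers both frequency thresholds and allows more labels into the direct learning set; the specific values are examples only, not recommendations:
{
  "input": {
    "maxLabelsForDirectLearning": 500000,
    "minLabelTfForDirectLearning": 100,
    "minLabelTfForEstimatedLearning": 20
  }
}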
fields
The feature fields from which Lingo4G selects labels for embedding learning.
If you don't provide this property, Lingo4G uses all feature fields available in the project.
maxDocs
The maximum number of documents to scan when collecting labels for embedding vector learning.
If null, Lingo4G collects labels from all documents in the index.
In most cases, the default value of null is optimal. A non-null value is useful for quick experimental runs of embedding learning applied to very large collections.
maxLabelsForDirectLearning
The maximum number of labels for which to directly learn embeddings.
If the number of labels in the index is larger than maxLabelsForDirectLearning, Lingo4G learns embeddings directly for the maxLabelsForDirectLearning most frequent labels that also meet the minLabelTfForDirectLearning threshold. For the remaining labels that meet the minLabelTfForEstimatedLearning threshold, Lingo4G estimates embedding vectors based on the directly-learnt ones.
minLabelTfForDirectLearning
The minimum number of times a label must occur across all documents in the index to be selected for direct learning of its embedding vector.
minLabelTfForEstimatedLearning
The minimum number of times a label must occur across all documents in the index to be selected for estimation of its embedding vector.
Lingo4G does not compute embedding vectors for labels that appear fewer than minLabelTfForEstimatedLearning times across all documents in the index.
model
Configures the parameters of label embedding vectors, such as vector size.
{
"contextSize": 20,
"contextSizeSampling": true,
"frequencySampling": 0.0001,
"maxIterations": 5,
"minIterations": 1,
"minUsableVectorsPercent": 99,
"model": "COMPOSITE",
"negativeSamples": 5,
"timeout": "6h",
"vectorSize": 96
}
contextSize
Size of the left and right context of the label to use for learning.
For example, with a value of 20, Lingo4G uses 20 labels to the left and 20 labels to the right of the focus label when learning label embeddings. Increasing context size may improve the quality of embeddings at the cost of longer learning times.
contextSizeSampling
Enables context size sampling.
If true, for each focus label Lingo4G uses a context size equal to a uniformly distributed random number in the [1...contextSize] range. This significantly reduces learning time with a negligible loss of embedding quality.
frequencySampling
Determines the amount of sampling to apply to high-frequency labels.
Lingo4G can significantly reduce the label embedding learning time by processing only a fraction of high-frequency labels. For example, a value of 1e-4 for this parameter results in moderate sampling. Lower values, such as 1e-5, result in less sampling, longer learning time and possibly increased embedding quality. Larger values, such as 1e-3, result in heavier sampling, faster learning and lowered embedding quality for high-frequency terms. A reasonable value range for this parameter is [1e-5...1e-3].
maxIterations
The maximum number of label embedding learning iterations to perform.
The larger the number of iterations, the higher the quality of embedding and the longer the learning time.
For collections with very short, tweet-sized documents, or with fewer than 100k documents, you may need to increase the number of iterations to 10 or 20 to learn decent-quality embeddings. Conversely, for large collections of long documents, one iteration may be enough to learn good-quality embeddings.
Note that this parameter accepts floating-point values, so you can have Lingo4G perform 2.5 iterations, for example. In that case, Lingo4G will process certain documents 2 times and others 3 times.
Also note that depending on the values of the timeout and minUsableVectorsPercent parameters, Lingo4G may not perform the number of iterations you request.
minIterations
The minimum number of label embedding learning iterations to perform.
Forces Lingo4G to perform at least the specified number of iterations over all documents in the index when learning label embeddings, even if the learning reaches the minUsableVectorsPercent threshold. You can use this property to ensure that Lingo4G processes each document at least once when learning label embeddings.
minUsableVectorsPercent
The percentage of high-quality label embedding vectors beyond which Lingo4G can stop the learning process.
For example, a value of 98 means that once 98% of the embedding vectors achieve acceptable quality, Lingo4G can stop embedding learning, even if it has not yet reached maxIterations.
It is usually impractical to generate accurate embeddings for 100% of the labels. Lingo4G discards embedding vectors that did not achieve the required quality level to ensure they don't degrade the quality of subsequent analytical requests.
For very large collections, it is usually beneficial to lower minUsableVectorsPercent to 85 or less. This can significantly lower the learning time at the cost of Lingo4G discarding embeddings for some low-frequency labels.
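For example, for a very large collection you could trade embedding coverage of some low-frequency labels for shorter learning time (a sketch of the model section):
{
  "model": {
    "minUsableVectorsPercent": 85
  }
}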
model
The embedding model to use for learning label embeddings.
You can use the following models:
- CBOW - Very fast to learn and produces accurate embeddings for high-frequency labels, but low-frequency labels (with document frequency less than 1000) usually get inaccurate, low-quality embeddings. Use this model only for learning embeddings for high-frequency labels.
- SKIP_GRAM - Produces accurate embeddings for labels of all frequencies, but is slow to learn.
- COMPOSITE - A combined model that learns CBOW-like embeddings for high-frequency labels and SKIP_GRAM-like embeddings for low-frequency labels. This model is faster to train than SKIP_GRAM and is a good default choice in most scenarios.
negativeSamples
The number of negative context samples to take when learning label embedding.
Embedding learning time is linear in the number of negative context samples. That is, increasing the number of negative context samples by a factor of 2 also increases the learning time by a factor of 2.
The default number of negative context samples is adequate in most scenarios. Increasing the number may improve the quality of embeddings, at the cost of increased learning time.
timeout
The maximum time allocated for learning label embeddings.
To avoid spending too much time learning embeddings, you can specify the maximum time the process can take.
The format of this parameter is HHhMMmSSs, where HH is the number of hours (use values larger than 24 for days), MM is the number of minutes and SS is the number of seconds.
vectorSize
Size of the label embedding vector.
Embedding learning time is linear in the vector size. That is, increasing the vector size by a factor of 2 increases learning time by a factor of 2.
The default vector size is sufficient for most small projects with no more than 500k labels selected for embedding. For projects with more than 500k labels, a vector size of 128 may increase the quality of embeddings. For the largest projects, with more than 1M label embeddings, a vector size of 160 may further increase the quality of embeddings, at the cost of longer learning time and a larger in-memory size of the embedding.
features
Defines feature extractors and the fields they generate features for (feature fields).
The features block must be an object with string keys and values of the following types:
- dictionary - Dictionary-based feature extraction (requires an external source of features).
- phrases - Automatically identifies frequent word- and phrase-based features in the text of documents stored in the index.
dictionary
This feature extractor adds feature fields that index phrases or terms sourced from a fixed, predefined dictionary. This can be useful when the set of features (labels) should be limited to a specific vocabulary or ontology. Another practical use case is indexing geographical locations or references to objects or people.
{
"type": "dictionary",
"intervalueGap": 64,
"labels": [],
"maxPhrasesPerField": 0,
"targetFields": null
}
The dictionary extractor requires one or more feature dictionaries. A feature dictionary file is a JSON file listing all known features (and their spelling variants) which should be annotated in indexed documents. Feature dictionaries are provided to the dictionary extractor using the features.dictionary.labels property.
An example content of the feature dictionary file can look as shown below.
[
{
"label": "Animals",
"match": [
"hound",
"dog",
"fox",
"foxy"
]
},
{
"label": "Foxes",
"match": [
"fox",
"foxy",
"furry foxy"
]
}
]
Each feature is an object, with a string description (the label property) and a set of strings with different variants of the feature's appearance. Note that:
- Each dictionary feature must have a non-empty and unique visual description (a label). This label will be used to represent the feature (it will be the feature's label).
- A single feature may contain a number of different string variants. These variants can be terms or phrases.
- If two or more features contain the same matching string (as is the case with fox and foxy in the example above), all those features will be indexed at the positions those matching strings occur at.
For example, given the above dictionary, an input text field with the english analyzer, and an input document with the following text field value:
The quick brown fox jumps over the lazy dog.
the text spans fox and dog would be indexed as occurrences of the feature Animals, and the text span fox would also be indexed as an occurrence of the feature Foxes.
The text of a document's field is split into tokens according to the featureAnalyzer specification provided for that field. When the dictionary extractor is applied to a field, its matching strings are also tokenized using the same analyzer. The dictionary extractor looks for identical sequences of tokens and creates a match wherever a feature's token sequence occurs in the document's field.
Analyzers that normalize the input text in some way (for example, convert it to lower case) will therefore require only one spelling or case-insensitive variant of a given label. Analyzers that preserve letter case and surface token forms need all potential spelling variants of the given feature.
intervalueGap
A synthetic feature position padding added between values of multi-valued fields.
This is an expert setting. The inter-value gap only matters for proximity queries executed on feature fields (very uncommon).
labels
A string or an array of strings naming JSON files that contain feature dictionaries. Paths are relative to the project's directory.
Each JSON file must contain an array of features and their matching rules, as explained in the overview of the dictionary extractor.
maxPhrasesPerField
This property limits the number of features (labels) indexed for each field to the provided maximum number of most-frequent labels. This option can be used to reduce the number of indexed features while keeping the most frequent features in a document. About a hundred most-frequent features are typically enough to achieve analysis results similar to those obtained with a full feature set.
If the value of this option is zero, all discovered labels will be indexed for the specified set of target fields.
targetFields
An array of field names to which this extractor will be applied. For each provided field, Lingo4G will create a corresponding feature field named <source-field-name>$<extractor-key>.
All fields must have a defined feature analyzer.
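Putting the above together, a complete dictionary extractor entry in the features section could look as follows; the extractor key entities, the dictionary file path and the field names are hypothetical:
{
  "features": {
    "entities": {
      "type": "dictionary",
      "labels": [ "dictionaries/animals.json" ],
      "targetFields": [ "title", "content" ]
    }
  }
}
With this configuration, Lingo4G would create the feature fields title$entities and content$entities.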
phrases
This feature extractor can be used when features should be discovered in the text automatically, without any prior knowledge.
{
"type": "phrases",
"allowStopwordsInsideLabels": false,
"intervalueGap": 64,
"maxPhraseDfRatio": 0.33,
"maxPhraseTermCount": 5,
"maxPhraseTf": "unlimited",
"maxPhrases": 0,
"maxPhrasesPerField": 0,
"maxTermLength": 240,
"minPhraseDf": 4,
"minPhraseTf": 1,
"minTermDf": 4,
"omitLabelsWithNumbers": false,
"omitTruncatedLabels": false,
"skipSubphrases": true,
"sourceFields": null,
"targetFields": null
}
The frequent phrase extractor has the following characteristics:
- it automatically discovers and indexes terms and phrases that occur frequently in input documents,
- it attempts to filter out common structural parts of the language (like stop words or other frequent boilerplate phrases),
- it can normalize minor differences in the appearance of the surface form of a phrase, choosing the most frequent variant as the feature's label; for example, web page, web pages, webpage and web-page would all be normalized into a single feature labelled with the most frequent of these surface forms.
The phrase feature extractor works by collecting and counting all terms and phrases (phrases are n-grams of terms). A term or phrase is counted only once per document, regardless of how many times it is repeated within that document. Once the counting of candidate terms and phrases completes, the feature selection proceeds by:
- selecting terms that occurred in more than minTermDf documents (and fulfill other filtering criteria),
- selecting phrases that occurred in more than minPhraseDf documents (and fulfill other filtering criteria).
Note that the discovered features (terms and phrases) can overlap or be a subset of one another. For example, in a sentence like this one:
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
overlapping features such as printing and typesetting industry and its shorter sub-phrases (for example, typesetting industry) could all be discovered and indexed independently.
Noisy features
The phrase feature extractor may create many redundant or noisy features. Such features should be naturally eliminated later by specific analysis requests (like clustering requests).
allowStopwordsInsideLabels
If true, phrase features discovered by the extractor may include terms marked as stop words by the source field's feature analyzer.
This setting allows discovery of phrase features that include stop words (like Bank of England, where of is a stop word) at the cost of increasing index size and potentially discovering many meaningless features that are accidental word co-occurrences.
intervalueGap
A synthetic feature position padding added between values of multi-valued fields.
This is an expert setting. The inter-value gap only matters for proximity queries executed on feature fields (very uncommon).
maxPhraseDfRatio
If a phrase or term exists in more than this ratio of documents, it will be ignored. A ratio of 0.5 means 50% of documents in the index, a ratio of 1 means 100% of documents in the index.
Typically, phrases that occur in more than 30% of all the documents in a collection are either boilerplate headers or structural elements of the language (not informative) and can be safely dropped from the index. This improves speed and decreases index size.
maxPhraseTermCount
The maximum number of non-stop-words to allow in a phrase. Phrases longer than the specified limit will not be extracted.
Raising maxPhraseTermCount above the default value of 5 will significantly increase the index size, indexing time and clustering time.
maxPhraseTf
The maximum allowed number of occurrences of a phrase within a single document. Phrases that reach or exceed this threshold are ignored within that document.
maxPhrases
This attribute limits the total number of features allowed in the index to the top-N most frequent features detected in the entire input. For many types of analytical requests there is very little difference in quality if the feature set is limited to one or two million features (many larger data sets would otherwise discover far larger feature sets). The benefit of limiting the number of features to the most common ones is accelerated feature indexing and reduced index size.
The default value of this attribute (0) means all features passing other criteria are allowed.
maxPhrasesPerField
This property limits the number of features (labels) indexed for each field to the provided maximum number of most-frequent labels. This option can be used to reduce the number of indexed features while keeping the most frequent features in a document. About a hundred most-frequent features are typically enough to achieve analysis results similar to those obtained with a full feature set.
If the value of this option is zero, all discovered labels will be indexed for the specified set of target fields.
maxTermLength
The maximum length of single-term features, in characters. Words longer than the specified limit will be ignored.
minPhraseDf
The minimum required number of occurrences of a phrase. Phrases appearing in fewer than the specified number of documents will be ignored.
Increasing the minPhraseDf threshold will filter out noisy phrases, decrease the size of the index and significantly speed up indexing and clustering.
minPhraseTf
Minimum allowed number of occurrences of a phrase within a single document. Phrases that occur less frequently than this threshold are ignored (within that document).
This parameter can be set to 2 to filter out phrases that occur once in a document. Such phrases can still become labels if they occur more frequently in at least one other document.
minTermDf
The minimum required number of occurrences of a single-term feature. Words appearing in fewer than the specified number of documents will be ignored.
Increasing the minTermDf threshold will help to filter out noisy words, decrease the size of the index and speed up indexing and clustering. For efficient noise removal on large data sets, consider bumping up the minPhraseDf threshold as well.
omitLabelsWithNumbers
If set to true, any terms or phrases containing numeric tokens will be omitted from the index.
While this option drops a significant number of features, it should be used with care, as certain valid features contain numbers (Windows 10, Terminator 2).
omitTruncatedLabels
If set to true, labels that are sub-sequences of other labels and have an identical number of occurrences within the document will be omitted (not indexed). This setting can be used to clean up shorter, "truncated" phrases, for example Computing Machinery in Association for Computing Machinery.
This option may improve per-document labels but may also affect the results in subtle ways. For example, a scope selector for documents containing the Computing Machinery label will not return a document containing only the full Association for Computing Machinery.
skipSubphrases
Setting this option to true causes the extractor to emit only the longest possible phrases. For example, if the document only contains full mentions of Association for Computing Machinery, that entire phrase would be contributed, without shorter subparts such as Computing Machinery.
We don't recommend setting this option to false, as it may lead to longer processing times and lower-quality results.
source​Fields
An array of field names used as the source of text in which frequent phrases and terms are discovered.
All fields must have a defined feature analyzer.
targetFields
An array of field names to which this extractor will be applied. For each provided field, Lingo4G will create a corresponding feature field named <source-field-name>$<extractor-key>.
All fields must have a defined feature analyzer.
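For example, a phrases extractor entry that raises the frequency thresholds to reduce noise in a larger collection could look as follows; the extractor key phrases and the field names are hypothetical, and the threshold values are examples only:
{
  "features": {
    "phrases": {
      "type": "phrases",
      "sourceFields": [ "title", "content" ],
      "targetFields": [ "title", "content" ],
      "minTermDf": 10,
      "minPhraseDf": 10
    }
  }
}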
indexCompression
This property controls how the document index is compressed. Better compression typically requires more processing resources during indexing but results in smaller indexes on disk (and these can be more efficiently cached by operating system I/O caches).
The indexCompression property supports the following values:
- lz4 - Uses lz4 compression. Favors indexing and document retrieval speed over disk size.
- zip - Uses zip (deflate) compression. May increase indexing time slightly (by 10%) but should reduce document index size by ~25% (depending on how well documents compress).
maxCacheableFst
This is a very low-level setting that only affects indexing performance in a minor way. Leave it at the default value.
Provides the maximum size of the finite state automaton used for feature matching that can undergo further data structure optimizations (arc-hashing optimization). Optimized automata are slightly faster to apply during document indexing.
The default value of this attribute is 500M bytes.
samplingRatio
Declares the sampling ratio over all the input documents that feature extractors use to extract features. This is useful to limit the time required to extract features from large data sets or data sets with a very large set of features.
The value of samplingRatio must be a number between 0 (exclusive) and 1 (inclusive) and indicates the probability with which each document is processed in each required document scan. For example, a samplingRatio of 0.25 used together with the phrase extractor will result in terms and phrases discovered from a random subset of 25% of the original documents.
The default value of this attribute is 1 (all documents are processed in each scan).
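For example, to discover features from a random 25% sample of all documents, set the following in the indexer section:
{
  "samplingRatio": 0.25
}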
stopLabelExtractor
Configures the automatic discovery of collection-specific stop labels.
{
"categoryFields": [],
"featureFields": [],
"maxLabelsPerPartitionQuery": 10000,
"maxPartitionQueries": 100,
"minStopLabelCoverage": 0.7,
"partitionQueryMaxRelativeDf": 0.15,
"partitionQueryMinRelativeDf": 0.001,
"threads": "auto"
}
During indexing, Lingo4G attempts to discover collection-specific stop labels, that is, labels that poorly characterize documents in the collection. Typically, such stop labels include generic terms or phrases that are not good indicators of document content and occur randomly across all documents, for example taking place, soon discovers or starts. While such phrases could be collected once, automated detection during data indexing has the advantage of being dynamic and adjusting to the data set. For example, a medical research data set could include domain-specific stop labels that are not universally meaningless but occur very frequently within that particular domain, like indicate, studies suggest or control.
Ideally, the categoryFields should include fields that separate all documents into fairly independent, smaller subsets. Good examples are tags, company divisions, or institution names. If no such fields exist in the collection, or if they don't provide enough information for stop label extraction, featureFields should be used to specify fields contributed by feature extractors.
All other parameters are expert-level settings and typically will not require tuning.
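For example, a collection with a tags field and phrase features extracted from the document content might configure stop label extraction as follows; the field names are hypothetical:
{
  "stopLabelExtractor": {
    "categoryFields": [ "tags" ],
    "featureFields": [ "content$phrases" ]
  }
}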
The full process of computing stop labels works as detailed below. Please note that the details of the algorithm are an implementation detail and may change without notice.
- First, the algorithm attempts to determine terms (at most maxPartitionQueries of them) that slice the indexed collection of documents into potentially independent subsets. These "slicing" terms are first taken from fields declared in the categoryFields attribute, followed by terms from feature fields declared in the featureFields attribute. Only terms that cover a fraction of all input documents between partitionQueryMinRelativeDf and partitionQueryMaxRelativeDf will be accepted as partitioning queries.
- Then, a maximum of maxLabelsPerPartitionQuery labels for each partitioning query are collected, and the algorithm computes which terms each of those labels was relevant to, as well as the chance of that term being a "frequent" or "popular" phrase across all documents.
- The topmost "frequent" labels relevant to at least a ratio of minStopLabelCoverage of all slicing terms are selected as stop labels. For example, a minStopLabelCoverage of 0.2 and maxPartitionQueries of 200 would mean the label was present in documents matched by at least 40 slicing terms.
categoryFields
This property should be used to provide an array of field names whose values separate all documents into fairly topic-independent, smaller subsets. Leave it empty if no such fields exist.
featureFields
This property contains additional feature fields from which document-partitioning terms are collected if there are insufficient terms in categoryFields.
Note that feature field names are a concatenation of the target field they were applied to, the $ character and the key of the feature extractor that produced them.
maxLabelsPerPartitionQuery
The maximum number of labels collected per document partition.
maxPartitionQueries
The maximum number of partitioning queries.
minStopLabelCoverage
The minimum frequency a stop label must reach with respect to all slicing terms. For example, a minStopLabelCoverage of 0.2 and maxPartitionQueries of 200 would mean the stop label must be present in documents matched by at least 40 partitioning queries.
partitionQueryMaxRelativeDf
Limits partitioning term queries to only those that cover a fraction of all input documents between partitionQueryMinRelativeDf and partitionQueryMaxRelativeDf.
partitionQueryMinRelativeDf
Limits partitioning term queries to only those that cover a fraction of all input documents between partitionQueryMinRelativeDf and partitionQueryMaxRelativeDf.
threads
Specifies the number of threads that should be used for stop label discovery.
threads
Determines the number of threads Lingo4G uses to perform indexing.
Faster disk drives (SSD or NVMe) permit higher concurrency levels, while conventional spinning drives typically perform very poorly with multiple threads reading from different disk regions concurrently. There are several ways to express the permitted concurrency level:
- auto - The number of threads used for indexing will be automatically and dynamically adjusted to maximize indexing throughput.
- n - A fixed number of n threads will be used for indexing. For spinning drives, this should be set to 1 (or auto). For SSD and NVMe drives, the number of threads should be close to the number of available CPU cores.
- n-m - The number of threads will be automatically adjusted in the range between n and m to maximize indexing throughput. For example, 1-4 will result in any number of concurrent threads between 1 and 4. This syntax can be used to decrease system load if automatic throughput management attempts to use all available CPUs.
The default (and strongly recommended) value of this attribute is auto.