Feature indexing

The indexer section configures feature indexing and feature extractors.

The indexer section of the project descriptor has the following structure:

{
  "embedding": {},
  "features": {},
  "indexCompression": "lz4",
  "maxCacheableFst": "500M",
  "samplingRatio": 1,
  "stopLabelExtractor": {},
  "threads": "auto"
}

The most important part of the indexer section is the features block, which defines feature extractors and, through them, feature fields. If you wish to use label or document embeddings, refer to the embedding block. Large data sets may benefit from document sampling (samplingRatio) during feature extraction.

Any changes you make to the indexer section will most likely require running a full reindexing cycle.

embedding

Configures learning aspects of multidimensional vector representations of labels and documents.

{
  "documents": {},
  "labels": {}
}

documents

Configures learning of document embeddings.

{
  "enabled": false,
  "index": {},
  "input": {}
}

enabled

Type
boolean
Default
false
Required
no

If true, Lingo4G learns document embeddings as part of indexing or reindexing.

If false, indexing or reindexing does not include the document embedding learning step. At a later point, you can add document embeddings to an existing index by invoking the l4g learn-embeddings command with the --recompute-document-embeddings option.

index

Configures the k-nearest-neighbors (kNN) index for document embeddings.

{
  "constructionNeighborhoodSize": 256,
  "maxNeighborsPerNode": 24
}

The kNN index enables Lingo4G to quickly find the most similar embedding vectors given one input vector.

construction​Neighborhood​Size
Type
integer
Default
256
Constraints
value >= 1
Required
no

Determines the accuracy of the kNN index building process.

The default value is adequate for small and medium-sized indices with fewer than 1M document embeddings. In scenarios with more than 1M documents, consider increasing the value of this parameter to 384 or 512, which should increase the accuracy of the index at the cost of longer index building time.

max​Neighbors​Per​Node
Type
integer
Default
24
Constraints
value >= 1
Required
no

Determines the maximum degree of the index graph nodes.

The default value is adequate in most scenarios. If your index contains more than 4 million documents, you may consider increasing max​Neighbors​Per​Node to 32 for increased nearest neighbor search performance, at the cost of larger on-disk and in-memory size of the kNN index.

input

Configures the input for document embedding learning.

{
  "fields": null
}

fields
Type
object or null
Default
null
Required
no

The feature fields to use for document embedding learning.

Lingo4G builds the embedding vector for a document by combining embedding vectors of labels present in that document. The fields parameter determines from which feature fields Lingo4G reads document labels when learning the document embedding.

By default, Lingo4G uses all available feature fields to compute document embeddings. To learn more "focused" embeddings, you can set the fields array to an explicit list of fields that, for example, skips the document body field and only leaves document title and abstract.

Note that document embedding learning time is largely independent of the number of fields Lingo4G uses during the process.

labels

Configures learning of label embeddings.

{
  "enabled": false,
  "index": {},
  "input": {},
  "model": {}
}

enabled

Type
boolean
Default
false
Required
no

If true, Lingo4G learns label embeddings as part of indexing or reindexing.

If false, indexing or reindexing does not include the label embedding learning step. At a later point, you can add label embeddings to an existing index by invoking the l4g learn-embeddings command with the --recompute-label-embeddings option.

index

Configures the k-nearest-neighbors (kNN) index for label embeddings.

{
  "constructionNeighborhoodSize": 256,
  "maxNeighborsPerNode": 24
}

The kNN index enables Lingo4G to quickly find the most similar embedding vectors given one input vector.

construction​Neighborhood​Size
Type
integer
Default
256
Constraints
value >= 1
Required
no

Determines the accuracy of the kNN index building process.

The default value is adequate for small and medium-sized indices with fewer than 1M label embeddings. For very large indices with more than 1M labels, consider increasing the value of this parameter to 384 or 512, which should increase the accuracy of the index at the cost of longer index building time.

max​Neighbors​Per​Node
Type
integer
Default
24
Constraints
value >= 1
Required
no

Determines the maximum degree of the index graph nodes.

The default value is adequate in most scenarios. If your index contains more than 4M labels, you may consider increasing max​Neighbors​Per​Node to 32 for increased nearest neighbor search performance, at the cost of larger on-disk and in-memory size of the kNN index.

input

Determines the subset of labels for which to learn embeddings.

{
  "fields": [],
  "maxDocs": null,
  "maxLabelsForDirectLearning": 200000,
  "minLabelTfForDirectLearning": 200,
  "minLabelTfForEstimatedLearning": 40
}

To improve the quality and performance of embedding learning, Lingo4G uses two methods to compute embedding vectors for labels: direct learning and estimation. First, Lingo4G directly learns embedding vectors for the high-frequency labels. Next, Lingo4G estimates the embedding vectors for the remaining lower-frequency labels based on the directly-learnt embeddings of the high-frequency labels.

The input section configures how Lingo4G chooses the set of labels for direct learning and estimation.

Lingo4G prepares labels for embedding learning in the following way:

  1. First, Lingo4G collects labels for direct embedding learning.

    To be eligible for direct learning, a label must occur at least minLabelTfForDirectLearning times across all documents in the index. To improve performance, Lingo4G directly learns embedding vectors for at most maxLabelsForDirectLearning of the eligible labels and moves the remaining, less frequent eligible labels to the estimated learning set.

  2. Then, Lingo4G identifies labels for which to estimate the embedding vectors.

    To be eligible for embedding vector estimation, a label must occur at least min​Label​Tf​For​Estimated​Learning times across all documents in the index. Because estimation is much faster than direct learning, Lingo4G computes embedding vectors for all labels that meet the frequency threshold.
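The two-step selection above can be sketched in a few lines of Python. This is only an illustration of the thresholds described in this section, not Lingo4G's implementation: the function name split_labels, the label_tf dictionary (label to total occurrence count) and the tie-breaking rule are assumptions made for the example.

```python
def split_labels(label_tf, max_direct=200_000, min_tf_direct=200, min_tf_estimated=40):
    """Split labels into a direct-learning list and an estimation list."""
    # Most frequent labels first; ties broken by label text (an assumption,
    # used here only to keep the result deterministic).
    ranked = sorted(label_tf.items(), key=lambda kv: (-kv[1], kv[0]))

    # Step 1: labels frequent enough for direct learning, capped at max_direct.
    eligible = [label for label, tf in ranked if tf >= min_tf_direct]
    direct = eligible[:max_direct]
    direct_set = set(direct)

    # Step 2: every other label above the lower threshold gets an estimated
    # vector, including eligible labels that did not fit under the cap.
    estimated = [label for label, tf in ranked
                 if tf >= min_tf_estimated and label not in direct_set]
    return direct, estimated


label_tf = {"neural network": 950, "gradient": 400, "backprop": 210,
            "dropout rate": 60, "typo": 3}
direct, estimated = split_labels(label_tf, max_direct=2)
print(direct)     # ['neural network', 'gradient']
print(estimated)  # ['backprop', 'dropout rate']
```

In this toy run, backprop meets the direct-learning threshold but falls outside the cap of 2, so it moves to the estimation set, while typo occurs too rarely to receive any embedding.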

fields
Type
array of string
Default
[]
Required
no

The feature fields from which Lingo4G selects labels for embedding learning.

If you omit this property or leave the array empty (the default), Lingo4G uses all feature fields available in the project.

max​Docs
Type
integer or null
Default
null
Required
no

The maximum number of documents to scan when collecting labels for embedding vector learning.

If null, Lingo4G collects labels from all documents in the index.

In most cases, the default value of null is optimal. A non-null value is useful for quick experimental runs of embedding learning applied to very large collections.

max​Labels​For​Direct​Learning
Type
integer
Default
200000
Constraints
value >= 1
Required
no

The maximum number of labels for which to directly learn embeddings.

If the number of labels in the index is larger than maxLabelsForDirectLearning, Lingo4G learns embeddings directly for the maxLabelsForDirectLearning most frequent labels that also meet the minLabelTfForDirectLearning threshold. For the remaining labels, Lingo4G estimates the embedding vectors, provided the labels meet the minLabelTfForEstimatedLearning threshold.

min​Label​Tf​For​Direct​Learning
Type
number
Default
200
Constraints
value >= 0
Required
no

The minimum number of times a label must occur across all documents in the index to be selected for direct learning of its embedding vector.

min​Label​Tf​For​Estimated​Learning
Type
number
Default
40
Constraints
value >= 0
Required
no

The minimum number of times a label must occur across all documents in the index to be selected for estimation of its embedding vector.

Lingo4G does not compute embedding vectors for labels that appear fewer than min​Label​Tf​For​Estimated​Learning times across all documents in the index.

model

Configures the parameters of label embedding vectors, such as vector size.

{
  "contextSize": 20,
  "contextSizeSampling": true,
  "frequencySampling": 0.0001,
  "maxIterations": 5,
  "minIterations": 1,
  "minUsableVectorsPercent": 99,
  "model": "COMPOSITE",
  "negativeSamples": 5,
  "timeout": "6h",
  "vectorSize": 96
}

contextSize
Type
integer
Default
20
Constraints
value >= 0
Required
no

Size of the left and right context of the label to use for learning.

For example, with a value of 20, Lingo4G uses 20 labels to the left and 20 labels to the right of the focus label when learning label embeddings. Increasing context size may improve the quality of embeddings at the cost of longer learning times.

context​Size​Sampling
Type
boolean
Default
true
Required
no

Enables context size sampling.

If true, for each focus label Lingo4G uses a context size equal to a uniformly distributed random number in the [1...context​Size] range. This significantly reduces learning time with a negligible loss of embedding quality.
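As a rough illustration of why sampling shortens learning, the sketch below draws context sizes uniformly from the inclusive range 1 to contextSize. The helper name sample_context_size is made up for this example, and the sketch says nothing about how Lingo4G draws its random numbers. With contextSize set to 20, the average sampled window is about (1 + 20) / 2 = 10.5 labels per side, roughly half the full context.

```python
import random


def sample_context_size(context_size, rng):
    """Pick a per-focus-label context size, uniform over 1..context_size."""
    # randint includes both end points.
    return rng.randint(1, context_size)


rng = random.Random(42)  # fixed seed so repeated runs give the same samples
sizes = [sample_context_size(20, rng) for _ in range(10_000)]
mean = sum(sizes) / len(sizes)

# The smallest and largest sampled sizes stay within 1..20 and the mean
# lands close to 10.5.
print(min(sizes), max(sizes), round(mean, 1))
```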

frequency​Sampling
Type
number
Default
0.0001
Constraints
value >= 0
Required
no

Determines the amount of sampling to apply to high-frequency labels.

Lingo4G can significantly reduce the label embedding learning time by processing only a fraction of high-frequency labels. For example, a value of 1e-4 results in moderate sampling. Lower values, such as 1e-5, result in less sampling, longer learning time and possibly higher embedding quality. Larger values, such as 1e-3, result in heavier sampling, faster learning and lower embedding quality for high-frequency terms. A reasonable range for this parameter is [1e-5...1e-3].

max​Iterations
Type
number
Default
5
Constraints
value >= 0
Required
no

The maximum number of label embedding learning iterations to perform.

The larger the number of iterations, the higher the quality of embedding and the longer the learning time.

For collections of very short, tweet-sized documents or collections with fewer than 100k documents, you may need to increase the number of iterations to 10 or 20 to learn decent-quality embeddings. Conversely, for large collections of long documents, one iteration may be enough to learn good-quality embeddings.

Note that this parameter accepts floating point values, so you can have Lingo4G perform 2.5 iterations, for example. In that case, Lingo4G will process certain documents 2 times and others 3 times.

Also note that depending on the value of the timeout and min​Usable​Vectors​Percent parameters, Lingo4G may not perform the number of iterations you request.

min​Iterations
Type
number
Default
1
Constraints
value >= 0
Required
no

The minimum number of label embedding learning iterations to perform.

Forces Lingo4G to perform at least the specified number of iterations over all documents in the index when learning label embeddings, even if the learning reaches the minUsableVectorsPercent threshold. You can use this property to ensure that Lingo4G processes each document at least once when learning label embeddings.

min​Usable​Vectors​Percent
Type
number
Default
99
Constraints
value >= 0 and value <= 100
Required
no

The percentage of high-quality label embedding vectors beyond which Lingo4G can stop the learning process.

For example, the value of 98 means that once 98% of the embedding vectors achieve acceptable quality, Lingo4G can stop embedding learning, even if it has not yet reached maxIterations.

It is usually impractical to generate accurate embeddings for 100% of the labels. Lingo4G discards embedding vectors that did not achieve the required quality level to ensure they don't degrade the quality of subsequent analytical requests.

For very large collections, it is usually beneficial to lower min​Usable​Vectors​Percent to 85 or less. This can significantly lower the learning time at the cost of Lingo4G discarding embeddings for some low-frequency labels.
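The interplay of minIterations, maxIterations, minUsableVectorsPercent and timeout can be summarized as a stopping rule. The sketch below is one reading of the descriptions in this section, not Lingo4G's code. In particular, it assumes that the timeout takes precedence over minIterations, which this reference does not state. The function and argument names are hypothetical.

```python
def should_stop(iterations_done, usable_percent, elapsed_seconds, *,
                min_iterations=1, max_iterations=5,
                min_usable_vectors_percent=99, timeout_seconds=6 * 3600):
    """Return True if embedding learning would stop at this point."""
    if elapsed_seconds >= timeout_seconds:
        return True   # ASSUMPTION: the timeout overrides minIterations
    if iterations_done >= max_iterations:
        return True   # never exceed maxIterations
    if iterations_done < min_iterations:
        return False  # keep going even if enough vectors are usable
    # Between the two iteration limits, stop once enough vectors are usable.
    return usable_percent >= min_usable_vectors_percent


print(should_stop(0.5, 99.5, 600))       # False
print(should_stop(1.0, 99.5, 600))       # True
print(should_stop(3.0, 97.0, 600))       # False
print(should_stop(5.0, 97.0, 600))       # True
print(should_stop(2.0, 90.0, 7 * 3600))  # True
```

Iteration counts are floats because maxIterations and minIterations accept fractional values.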

model
Type
string
Default
"COMPOSITE"
Constraints
one of [CBOW, SKIP_GRAM, COMPOSITE]
Required
no

The embedding model to use for learning label embeddings.

You can use the following models:

CBOW

Very fast to learn, produces accurate embeddings for high-frequency labels, but low-frequency labels (with document frequency less than 1000) usually get inaccurate, low-quality embeddings.

Use this model only for learning embeddings for high-frequency labels.

SKIP_GRAM

Produces accurate embeddings for labels of all frequencies, slow to learn.

COMPOSITE

A combined model that learns CBOW-like embeddings for high-frequency labels and SKIP_GRAM-like embeddings for low-frequency labels. This model is faster to train than SKIP_GRAM and is a good default choice in most scenarios.

negative​Samples
Type
integer
Default
5
Constraints
value >= 0
Required
no

The number of negative context samples to take when learning label embeddings.

Embedding learning time is linear in the number of negative context samples. That is, increasing the number of negative context samples by a factor of 2 also increases the learning time by a factor of 2.

The default number of negative context samples is adequate in most scenarios. Increasing the number may improve the quality of embeddings, at the cost of increased learning time.

timeout
Type
string
Default
"6h"
Required
no

The maximum time allocated for learning label embeddings.

To avoid spending too much time learning embeddings, you can specify the maximum time the process can take. The format of this parameter is HHhMMmSSs, where HH is the number of hours (use values larger than 24 for days), MM is the number of minutes and SS is the number of seconds.
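To make the duration format concrete, here is a small Python parser that converts such strings to seconds. It is a sketch for checking values before you put them in the project descriptor, not the parser Lingo4G uses. It assumes that each of the three components is optional (the default of 6h omits minutes and seconds) and that components appear in hours, minutes, seconds order. The exact grammar Lingo4G accepts may differ.

```python
import re

# Optional hours, minutes and seconds, in that order (assumed grammar).
_DURATION = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def parse_timeout(text):
    """Convert a string such as '6h' or '1h30m15s' to a number of seconds."""
    match = _DURATION.match(text)
    if not text or match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


print(parse_timeout("6h"))        # 21600
print(parse_timeout("1h30m15s"))  # 5415
print(parse_timeout("48h"))       # 172800, that is two days
```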

vector​Size
Type
integer
Default
96
Constraints
value >= 1
Required
no

Size of the label embedding vector.

Embedding learning time is linear in the vector size. That is, increasing vector size by a factor of 2 increases learning time by a factor of 2.

The default vector size is sufficient for most small projects with no more than 500k labels selected for embedding. For projects with more than 500k labels, a vector size of 128 may increase the quality of embeddings. For the largest projects, with more than 1M label embeddings, a vector size of 160 may further increase the quality of embeddings, at the cost of longer learning time and larger in-memory size of the embedding.

features

Defines feature extractors and fields they generate features for (feature fields).

The features block must be an object with string keys and values of the following types:

dictionary

Dictionary-based feature extraction (requires an external source of features).

phrases

Automatically identifies frequent word- and phrase-based features in the text of documents stored in the index.

dictionary

This feature extractor adds feature fields that index phrases or terms sourced from a fixed, predefined dictionary. This can be useful when the set of features (labels) should be limited to a specific vocabulary or ontology. Another practical use case is indexing geographical locations, references of objects or people.

{
  "type": "dictionary",
  "intervalueGap": 64,
  "labels": [],
  "maxPhrasesPerField": 0,
  "targetFields": null
}

The dictionary extractor requires one or more feature dictionaries. A feature dictionary file is a JSON file listing all known features (and their spelling variants) which should be annotated in indexed documents. Feature dictionaries are provided to the dictionary extractor using the features.dictionary.labels property.

An example content of the feature dictionary file can look as shown below.

[
  {
    "label": "Animals",
    "match": [
      "hound",
      "dog",
      "fox",
      "foxy"
    ]
  },
  {
    "label": "Foxes",
    "match": [
      "fox",
      "foxy",
      "furry foxy"
    ]
  }
]

Each feature is an object with a string description (the label property) and a set of strings (the match property) listing the variants in which the feature can appear in text. Note that:

  • Each dictionary feature must have a non-empty and unique visual description (a label). This label will be used to represent the feature (it will be the feature's label).

  • A single feature may contain a number of different string variants. These variants can be terms or phrases.

  • If two or more features contain the same matching string (as is the case with fox and foxy in the example above), all those features will be indexed at the positions where those matching strings occur.

For example, given the above dictionary, an input text field with the english analyzer, and an input document with the following text field value:

The quick brown fox jumps over the lazy dog.

In the lines below, square brackets mark the matched text spans. The following spans would be indexed as the feature Animals:

The quick brown [fox] jumps over the lazy [dog].

And the following text span would be indexed as an occurrence of the feature Foxes:

The quick brown [fox] jumps over the lazy dog.

Analyzers, tokenization and feature matching

The text of a document's field is split into tokens according to the featureAnalyzer specification provided for that field. When the dictionary extractor is applied to a field, its matching strings are also tokenized using the same analyzer. The dictionary extractor looks for identical sequences of tokens and creates a match when a feature's token sequence exists in the document's field somewhere.

Analyzers that normalize the input text (for example, by converting it to lower case) therefore require only one spelling variant of a given feature, regardless of letter case. Analyzers that preserve letter case and surface token forms need all potential spelling variants of the feature.
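The token-sequence matching described above can be sketched as follows. The analyze function is a deliberately naive stand-in for a real feature analyzer (it only lower-cases and splits on non-letters; the actual english analyzer may behave differently), and match_features is a hypothetical helper, not part of Lingo4G. The dictionary and the input sentence come from the example earlier in this section.

```python
import re


def analyze(text):
    # Stand-in analyzer: lower-case the text and keep runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def match_features(dictionary, text):
    """Return {label: [(token_position, matched_tokens), ...]} for one field value."""
    tokens = analyze(text)
    hits = {}
    for feature in dictionary:
        for variant in feature["match"]:
            needle = analyze(variant)  # variants go through the same analyzer
            if not needle:
                continue
            # Slide over the document tokens looking for an identical sequence.
            for i in range(len(tokens) - len(needle) + 1):
                if tokens[i:i + len(needle)] == needle:
                    hits.setdefault(feature["label"], []).append((i, " ".join(needle)))
    return {label: sorted(spans) for label, spans in hits.items()}


dictionary = [
    {"label": "Animals", "match": ["hound", "dog", "fox", "foxy"]},
    {"label": "Foxes", "match": ["fox", "foxy", "furry foxy"]},
]
hits = match_features(dictionary, "The quick brown fox jumps over the lazy dog.")
print(hits)
# {'Animals': [(3, 'fox'), (8, 'dog')], 'Foxes': [(3, 'fox')]}
```

The token fox at position 3 is reported for both features, which mirrors the rule that a matching string shared by several features indexes all of them at the same position.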

intervalue​Gap

Type
integer
Default
64
Constraints
value >= 1
Required
no

A synthetic feature position padding added between values of multi-valued fields.

This is an expert setting. The inter-value gap only matters for proximity queries executed on feature fields (very uncommon).

labels

Type
array of string
Default
[]
Required
no

A string or an array of strings with paths to JSON files containing feature dictionaries. Paths are relative to the project's directory.

Each JSON file must contain an array of features and their matching rules, as explained in the overview of the dictionary extractor.

max​Phrases​Per​Field

Type
integer
Default
0
Constraints
value >= 0
Required
no

This property limits the number of features (labels) indexed for each field to the provided maximum number of most-frequent labels. This option can be used to reduce the number of indexed features while keeping the most frequent features in a document. About a hundred most-frequent features are typically enough to achieve analysis results similar to those obtained with a full feature set.

If the value of this option is zero, all discovered labels will be indexed for the specified set of target fields.

target​Fields

Type
array of string
Default
undefined
Required
no

An array of field names to which this extractor will be applied. For each provided field, Lingo4G will create a corresponding feature field named <source-field-name>$<extractor-key>.

All fields must have a defined feature analyzer.

phrases

This feature extractor can be used when features should be discovered in the text automatically, without any prior knowledge.

{
  "type": "phrases",
  "allowStopwordsInsideLabels": false,
  "intervalueGap": 64,
  "maxPhraseDfRatio": 0.33,
  "maxPhraseTermCount": 5,
  "maxPhraseTf": "unlimited",
  "maxPhrases": 0,
  "maxPhrasesPerField": 0,
  "maxTermLength": 240,
  "minPhraseDf": 4,
  "minPhraseTf": 1,
  "minTermDf": 4,
  "omitLabelsWithNumbers": false,
  "omitTruncatedLabels": false,
  "skipSubphrases": true,
  "sourceFields": null,
  "targetFields": null
}

The frequent phrase extractor has the following characteristics:

  • it automatically discovers and indexes terms and phrases that occur frequently in input documents,

  • it attempts to filter out common structural parts of the language (like stop words or other frequent boilerplate phrases),

  • it can normalize minor differences in the appearance of the surface form of a phrase, choosing the most frequent variant as the feature's label, for example: web page, web pages, webpage or web-page would all be normalized into a single feature labelled with the most frequent of these surface forms.

The phrase feature extractor works by collecting and counting all terms and phrases (phrases are n-grams of terms). A term or phrase is counted only once per document, regardless of how many times it is repeated within that document. Once the counting of candidate terms and phrases completes, the feature selection proceeds by:

  • selecting terms that occurred in at least minTermDf documents (and fulfill other filtering criteria),

  • selecting phrases that occurred in at least minPhraseDf documents (and fulfill other filtering criteria).
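The counting and selection steps can be illustrated with a short Python sketch. It is a simplification under stated assumptions: documents are already tokenized, stop word filtering and surface-form normalization are left out, and a candidate is kept when its document frequency is at least the threshold, following the minTermDf and minPhraseDf descriptions. The names candidate_ngrams and select_features are made up for this example.

```python
from collections import Counter


def candidate_ngrams(tokens, max_len):
    """Yield all terms (1-grams) and phrases (n-grams up to max_len tokens)."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def select_features(docs, min_term_df=4, min_phrase_df=4, max_len=5):
    df = Counter()
    for tokens in docs:
        # set(): a candidate counts once per document, however often it repeats.
        df.update(set(candidate_ngrams(tokens, max_len)))
    selected = {}
    for gram, count in df.items():
        threshold = min_term_df if len(gram) == 1 else min_phrase_df
        if count >= threshold:
            selected[" ".join(gram)] = count
    return dict(sorted(selected.items()))


docs = [
    "web page design".split(),
    "web page web page".split(),
    "static web page".split(),
]
print(select_features(docs, min_term_df=3, min_phrase_df=3, max_len=2))
# {'page': 3, 'web': 3, 'web page': 3}
```

The second document repeats web page twice, yet the phrase still has a document frequency of 3, not 4.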

Note that the discovered features (terms and phrases) can overlap or be a subset of one another. For example, in a sentence like this one:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

several overlapping features could be discovered and indexed independently, such as the term text, the phrase dummy text, and the phrase typesetting industry.

Noisy features: the phrase feature extractor may create many redundant or noisy features. Specific analysis requests (such as clustering requests) should naturally eliminate such features later.

allow​Stopwords​Inside​Labels

Type
boolean
Default
false
Required
no

If true, phrase features discovered by the extractor may include terms marked as stop words by the source field's feature analyzer.

This setting allows discovery of phrase features that include stop words (like Bank of England, where of is a stop word) at the cost of increasing index size and potentially discovering many meaningless features that are accidental word co-occurrences.

intervalue​Gap

Type
integer
Default
64
Constraints
value >= 1
Required
no

A synthetic feature position padding added between values of multi-valued fields.

This is an expert setting. The inter-value gap only matters for proximity queries executed on feature fields (very uncommon).

max​Phrase​Df​Ratio

Type
number
Default
0.33
Constraints
value >= 0 and value <= 1
Required
no

If a phrase or term exists in more than this ratio of documents, it will be ignored. A ratio of 0.5 means 50% of documents in the index, a ratio of 1 means 100% of documents in the index.

Typically, phrases that occur in more than 30% of all the documents in a collection are either boilerplate headers or structural elements of the language (not informative) and can be safely dropped from the index. This improves speed and decreases index size.

max​Phrase​Term​Count

Type
integer
Default
5
Constraints
value >= 1
Required
no

The maximum number of non-stop-words to allow in a phrase. Phrases longer than the specified limit will not be extracted.

Raising max​Phrase​Term​Count above the default value of 5 will significantly increase the index size, indexing and clustering time.

max​Phrase​Tf

Type
string or integer
Default
"unlimited"
Required
no

The maximum allowed number of occurrences of a phrase within a single document. Phrases that reach or exceed this threshold are ignored within that document.

max​Phrases

Type
integer
Default
0
Constraints
value >= 0
Required
no

This attribute limits the total number of features allowed in the index to top-N most frequent features detected in the entire input. For many types of analytical requests there is very little difference in quality if the feature set is limited to a million or two million features (many larger data sets will discover feature sets much larger than this). The benefit of limiting the number of features to the most common ones is in accelerated feature indexing and reduced index size.

The default value of this attribute (0) means all features passing other criteria are allowed.

max​Phrases​Per​Field

Type
integer
Default
0
Constraints
value >= 0
Required
no

This property limits the number of features (labels) indexed for each field to the provided maximum number of most-frequent labels. This option can be used to reduce the number of indexed features while keeping the most frequent features in a document. About a hundred most-frequent features are typically enough to achieve analysis results similar to those obtained with a full feature set.

If the value of this option is zero, all discovered labels will be indexed for the specified set of target fields.

max​Term​Length

Type
integer
Default
240
Constraints
value >= 1
Required
no

The maximum length of single-term features, in characters. Words longer than the specified limit will be ignored.

min​Phrase​Df

Type
integer
Default
4
Constraints
value >= 1
Required
no

The minimum number of documents in which a phrase must occur. Phrases appearing in fewer than the specified number of documents will be ignored.

Increasing the minPhraseDf threshold will filter out noisy phrases, decrease the size of the index and significantly speed up indexing and clustering.

min​Phrase​Tf

Type
integer
Default
1
Constraints
value >= 1
Required
no

Minimum allowed number of occurrences of a phrase within a single document. Phrases that occur less frequently than this threshold are ignored (within that document).

This parameter can be set to 2 to filter out phrases that occur once in a document. Such phrases can still become labels if they occur more frequently in at least one other document.

min​Term​Df

Type
integer
Default
4
Constraints
value >= 0
Required
no

The minimum number of documents in which a single-term feature must occur. Words appearing in fewer than the specified number of documents will be ignored.

Increasing the minTermDf threshold will help to filter out noisy words, decrease the size of the index and speed up indexing and clustering. For efficient noise removal on large data sets, consider raising the minPhraseDf threshold as well.

omit​Labels​With​Numbers

Type
boolean
Default
false
Required
no

If set to true, any terms or phrases containing numeric tokens will be omitted from the index. While this option drops a significant number of features, use it with care, as certain potentially valid features contain numbers (Windows 10, Terminator 2).

omit​Truncated​Labels

Type
boolean
Default
false
Required
no

If set to true, labels that are sub-sequences of other labels and have an identical number of occurrences within the document will be omitted (not indexed). This setting can be used to clean up shorter, "truncated" phrases, for example Computing Machinery in Association for Computing Machinery.

This option may improve per-document labels but may also affect the results in subtle ways. For example, a scope selector for documents containing the Computing Machinery label will not return a document containing only the full Association for Computing Machinery.
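The rule can be sketched for a single document as follows. The sketch assumes that a sub-sequence means a contiguous run of tokens inside a longer label and that occurrence counts are compared per document. Both are readings of the description above, and the helper names are hypothetical.

```python
def is_subphrase(short, long):
    """True if the tokens of `short` form a contiguous run inside `long`."""
    s, l = short.split(), long.split()
    return len(s) < len(l) and any(
        l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1)
    )


def omit_truncated(label_counts):
    """Drop labels contained in a longer label with the same per-document count."""
    return {
        label: count
        for label, count in label_counts.items()
        if not any(
            count == other_count and is_subphrase(label, other)
            for other, other_count in label_counts.items()
        )
    }


doc_labels = {
    "Association for Computing Machinery": 2,
    "Computing Machinery": 2,  # always inside the longer label: dropped
    "Machinery": 3,            # also occurs on its own: kept
}
print(omit_truncated(doc_labels))
# {'Association for Computing Machinery': 2, 'Machinery': 3}
```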

skip​Subphrases

Type
boolean
Default
true
Required
no

Setting this option to true causes the extractor to emit only the longest possible phrases. For example, if the document only contains full mentions of Association for Computing Machinery, the extractor would contribute that entire phrase, without shorter subparts such as Computing Machinery.

We don't recommend setting this option to false, as it may lead to longer processing times and lower-quality results.

source​Fields

Type
array of string
Default
undefined
Required
yes

An array of field names used as the source of text in which frequent phrases and terms are discovered.

All fields must have a defined feature analyzer.

target​Fields

Type
array of string
Default
undefined
Required
yes

An array of field names to which this extractor will be applied. For each provided field, Lingo4G will create a corresponding feature field named <source-field-name>$<extractor-key>.

All fields must have a defined feature analyzer.

index​Compression

Type
string
Default
"lz4"
Constraints
one of [lz4, zip]
Required
no

This property controls how the document index is compressed. Better compression typically requires more processing resources during indexing but results in smaller indexes on disk (and these can be more efficiently cached by operating system I/O caches).

The index​Compression property supports the following values:

lz4

Uses lz4 compression. Favors indexing and document retrieval speed over disk size.

zip

Uses zip (deflate) compression. May increase indexing time slightly (by 10%) but should reduce document index size by ~25% (depends on how well documents compress).

max​Cacheable​Fst

Type
string
Default
"500M"
Required
no

This is a very low-level setting that affects indexing performance only in a minor way. Leave it at the default value.

Sets the maximum size of the finite state automaton used for feature matching that is still eligible for further data structure optimizations (arc-hashing optimization). Optimized automata are slightly faster to apply during document indexing.

The default value of this attribute is 500M bytes.

sampling​Ratio

Type
number
Default
1
Constraints
value >= 0 and value <= 1
Required
no

Declares the sampling ratio over all the input documents that feature extractors use to extract features. This is useful to limit the time required to extract features from large data sets or data sets with a very large set of features.

The value of sampling​Ratio must be a number between 0 (exclusive) and 1 (inclusive) and indicates the probability with which each document is processed in each required document scan. For example, a sampling​Ratio of 0.25 used together with the phrase extractor will result in terms and phrases discovered from a random subset of 25% of the original documents.
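The per-document probability can be pictured with the sketch below, which draws an independent random number for every document in every scan. This illustrates the description above under the assumption of independent draws. The function name sample_documents and the use of Python's random module are choices made for this example, not details of Lingo4G.

```python
import random


def sample_documents(doc_ids, sampling_ratio, rng):
    """Keep each document with probability sampling_ratio."""
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be in (0, 1]")
    # rng.random() is in [0, 1), so a ratio of 1 keeps every document.
    return [doc for doc in doc_ids if rng.random() < sampling_ratio]


rng = random.Random(7)
first_scan = sample_documents(range(100_000), 0.25, rng)
second_scan = sample_documents(range(100_000), 0.25, rng)

# Both scans see roughly 25,000 documents, but not the same ones.
print(len(first_scan), len(second_scan))
```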

The default value of this attribute is 1 (all documents are processed in each scan).

stop​Label​Extractor

Configures the automatic discovery of collection-specific stop labels.

{
  "categoryFields": [],
  "featureFields": [],
  "maxLabelsPerPartitionQuery": 10000,
  "maxPartitionQueries": 100,
  "minStopLabelCoverage": 0.7,
  "partitionQueryMaxRelativeDf": 0.15,
  "partitionQueryMinRelativeDf": 0.001,
  "threads": "auto"
}

During indexing, Lingo4G attempts to discover collection-specific stop labels, that is, labels that poorly characterize documents in the collection. Typically, such stop labels include generic terms or phrases that are not good indicators of the document content and occur randomly across all documents, such as taking place, soon discovers or starts. While such phrases could be collected once, automated detection during indexing has the advantage of adjusting to the data set. For example, a medical research data set could include domain-specific stop labels that are not universally meaningless but occur very frequently within that particular domain, like indicate, studies suggest or control.

Ideally, the category​Fields should include fields that separate all documents into fairly independent, smaller subsets. Good examples are tags, company divisions, or institution names. If no such fields exist in the collection, or if they don't provide enough information for stop label extraction, feature​Fields should be used to specify fields contributed by feature extractors.

All other parameters are expert-level settings and typically will not require tuning.

Stop label detection algorithm

The process of computing stop labels works as described below. Please note that the algorithm is an implementation detail and may change without notice.

  1. First, the algorithm attempts to determine terms (at most maxPartitionQueries of them) that slice the indexed collection of documents into potentially independent subsets. These "slicing" terms are first taken from fields declared in the categoryFields attribute, followed by terms from feature fields declared in the featureFields attribute.

    Only terms that cover a fraction of all input documents between partition​Query​Min​Relative​Df and partition​Query​Max​Relative​Df will be accepted as partitioning queries.

  2. Then, at most maxLabelsPerPartitionQuery labels are collected for each partitioning query. For each of those labels, the algorithm computes which slicing terms the label was relevant to, as well as the chance of that label being a "frequent" or "popular" phrase across all documents.

  3. The topmost "frequent" labels relevant to at least a ratio of min​Stop​Label​Coverage of all slicing terms are selected as stop labels. For example, min​Stop​Label​Coverage of 0.2 and max​Partition​Queries of 200 would mean the label was present in documents matched by at least 40 slicing terms.
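The coverage check in step 3 can be sketched in Python as follows. This is a simplified illustration, not Lingo4G's implementation: the function select_stop_labels and the label_to_terms mapping are hypothetical, the ranking of labels by frequency is omitted, and the example assumes that all 200 partitioning queries were actually found.

```python
def select_stop_labels(label_to_terms, slicing_term_count, min_stop_label_coverage):
    """Return labels relevant to at least the required fraction of slicing terms.

    label_to_terms maps each label to the set of slicing terms whose
    documents the label was relevant to (hypothetical input structure).
    """
    selected = []
    for label, terms in label_to_terms.items():
        coverage = len(terms) / slicing_term_count
        if coverage >= min_stop_label_coverage:
            selected.append(label)
    return sorted(selected)


slicing_terms = [f"term-{i}" for i in range(200)]
label_to_terms = {
    "studies suggest": set(slicing_terms[:40]),  # covers 40 of 200 slicing terms
    "gene therapy": set(slicing_terms[:39]),     # covers 39 of 200 slicing terms
}

print(select_stop_labels(label_to_terms, 200, 0.2))  # ['studies suggest']
```

With a coverage of 0.2 and 200 slicing terms, the label matched by 40 slicing terms passes the threshold, while the one matched by 39 does not.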

category​Fields

Type
array of string
Default
[]
Required
no

This property should be used to provide an array of field names whose values separate all documents into fairly topic-independent, smaller subsets. Leave empty if no such fields exist.

feature​Fields

Type
array of string
Default
[]
Required
no

This property contains additional feature fields from which document-partitioning terms are collected if there are insufficient terms in category​Fields.

Note that a feature field name is a concatenation of the name of the target field the feature extractor was applied to, the $ character and the name of the feature extractor that produced the feature field.
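As an illustration of the naming rule, the sketch below builds a feature field name and places it in a stopLabelExtractor fragment. The field name title, the extractor name phrases and the category field tags are hypothetical; substitute the names defined in your own project descriptor.

```python
import json


def feature_field_name(target_field, extractor_name):
    """Join the target field and the extractor name with the $ character."""
    return f"{target_field}${extractor_name}"


# Hypothetical names: a "phrases" extractor applied to a "title" field.
stop_label_extractor = {
    "categoryFields": ["tags"],
    "featureFields": [feature_field_name("title", "phrases")],
}

print(json.dumps(stop_label_extractor, indent=2))
```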

max​Labels​Per​Partition​Query

Type
integer
Default
10000
Constraints
value >= 1
Required
no

The maximum number of labels collected per document partition.

max​Partition​Queries

Type
integer
Default
100
Constraints
value >= 1
Required
no

The maximum number of partitioning queries.

min​Stop​Label​Coverage

Type
number
Default
0.7
Constraints
value >= 0 and value <= 1
Required
no

The minimum fraction of all slicing terms (partitioning queries) that a label must be relevant to in order to be selected as a stop label. For example, minStopLabelCoverage of 0.2 and maxPartitionQueries of 200 would mean the stop label must be present in documents matched by at least 40 partitioning queries.

partition​Query​Max​Relative​Df

Type
number
Default
0.15
Constraints
value >= 0 and value <= 1
Required
no

The maximum fraction of all input documents that a term may cover to be accepted as a partitioning query. Only terms that cover a fraction of all input documents between partitionQueryMinRelativeDf and partitionQueryMaxRelativeDf are accepted.

partition​Query​Min​Relative​Df

Type
number
Default
0.001
Constraints
value >= 0 and value <= 1
Required
no

The minimum fraction of all input documents that a term must cover to be accepted as a partitioning query. Only terms that cover a fraction of all input documents between partitionQueryMinRelativeDf and partitionQueryMaxRelativeDf are accepted.
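The effect of the two relative document frequency limits can be sketched in Python. This is an illustration only: the function accepted_partition_terms and the sample counts are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def accepted_partition_terms(term_doc_counts, total_docs,
                             min_relative_df=0.001, max_relative_df=0.15):
    """Keep terms whose share of all documents lies within the given range.

    term_doc_counts maps a term to the number of documents it occurs in.
    Inclusive bounds are an assumption made for this sketch.
    """
    accepted = []
    for term, doc_count in term_doc_counts.items():
        relative_df = doc_count / total_docs
        if min_relative_df <= relative_df <= max_relative_df:
            accepted.append(term)
    return accepted


# Hypothetical document counts in a collection of 100,000 documents.
counts = {
    "cardiology": 8_000,   # 8% of documents: accepted
    "research": 60_000,    # 60% of documents: too broad
    "rare-tag": 50,        # 0.05% of documents: too rare
}

print(accepted_partition_terms(counts, 100_000))  # ['cardiology']
```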

threads

Type
string or integer
Default
auto
Required
no

Specifies the number of threads that should be used for stop label discovery.

threads

Type
string or integer
Default
auto
Required
no

Determines the number of threads Lingo4G uses to perform indexing.

Faster disk drives (SSD or NVMe) permit higher concurrency levels, while conventional spinning drives typically perform very poorly with multiple threads reading from different disk regions concurrently. There are several ways to express the permitted concurrency level:

auto
The number of threads used for indexing will be automatically and dynamically adjusted to maximize indexing throughput.
n
A fixed number of n threads will be used for indexing. For spinning drives, this should be set to 1 (or auto). For SSD drives and NVMe drives, the number of threads should be close to the number of available CPU cores.
n–m
The number of threads will be automatically adjusted in the range between n and m to maximize indexing throughput. For example, 1–4 will result in any number of concurrent threads between 1 and 4. This syntax can be used to decrease system load if automatic throughput management attempts to use all available CPUs.

The default (and strongly recommended) value of this attribute is auto.
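The three forms of the threads value can be summarized with a small Python sketch that interprets a value as a range of permitted thread counts. This is not Lingo4G's parser; the function parse_threads is hypothetical, and accepting both a plain hyphen and an en dash as the range separator is an assumption.

```python
import re


def parse_threads(spec):
    """Return (min_threads, max_threads); (None, None) means fully automatic."""
    text = str(spec).strip()
    if text == "auto":
        return (None, None)
    if re.fullmatch(r"[0-9]+", text):
        n = int(text)
        if n >= 1:
            return (n, n)  # fixed number of threads
    # Assumption: the range separator may be a hyphen or an en dash.
    match = re.fullmatch(r"([0-9]+)\s*[-\u2013]\s*([0-9]+)", text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        if 1 <= low <= high:
            return (low, high)
    raise ValueError(f"unsupported threads value: {spec!r}")


print(parse_threads("auto"))  # (None, None)
print(parse_threads(4))       # (4, 4)
print(parse_threads("1-4"))   # (1, 4)
```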