Query parsers

The query​Parsers section of the project descriptor declares parsers that Lingo4G uses to prepare Lucene queries from text.

Properties in this object must be of the following types:

enhanced

Implements an enhanced syntax of Lucene's (flexible) Standard​Query​Parser. The extensions include support for interval queries.

A typical query parsers section of the project descriptor looks like this:

{
  "queryParsers": {
    "enhanced": {
      "type": "enhanced",
      "defaultFields": [
        "title",
        "abstract"
      ],
      "defaultOperator": "OR"
    }
  }
}

The query​Parsers property of the project descriptor must be an object with keys corresponding to query parser identifiers. Use that identifier to choose the query parser at analysis time, for example in the query​Parser property of the query:​string component.

Property values of the query​Parsers object represent configurations of the specific query parser. Each configuration object must contain the type property defining the kind of query parser to use.

Important

You must declare at least one query parser configuration in your project descriptor.
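The rules above can be checked with a few lines of Python before you pass a descriptor to Lingo4G. This is a minimal sketch: the check_query_parsers helper and the second parser identifier (strict) are hypothetical names introduced for illustration, and Lingo4G performs its own validation independently of this code.

```python
import json

# Descriptor fragment with two parser configurations. The "strict"
# identifier is made up for this example.
DESCRIPTOR = json.loads("""
{
  "queryParsers": {
    "enhanced": {
      "type": "enhanced",
      "defaultFields": ["title", "abstract"],
      "defaultOperator": "OR"
    },
    "strict": {
      "type": "enhanced",
      "defaultFields": ["title"],
      "defaultOperator": "AND"
    }
  }
}
""")


def check_query_parsers(descriptor):
    """Return the parser identifiers, or raise ValueError on a broken section."""
    parsers = descriptor.get("queryParsers")
    if not isinstance(parsers, dict) or not parsers:
        raise ValueError("declare at least one query parser")
    for parser_id, config in parsers.items():
        if not isinstance(config, dict) or "type" not in config:
            raise ValueError(f"parser '{parser_id}' has no 'type' property")
    return sorted(parsers)


print(check_query_parsers(DESCRIPTOR))  # prints ['enhanced', 'strict']
```

The identifiers returned here are the values you would reference at analysis time.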

You can use the following query parser types in your project descriptors:

enhanced
A parser with syntax based on Lucene's standard query parser.

enhanced

This query parser implements an enhanced version of the syntax of Lucene's (flexible) Standard​Query​Parser. The extensions include support for interval queries.

{
  "type": "enhanced",
  "defaultFields": [],
  "defaultOperator": "AND",
  "sanitizeSpaces": "(?U)\\p{Blank}+",
  "validateFields": true
}

Query syntax

The text query must contain one or more clauses, optionally combined with Boolean operators AND or OR. If you don't provide any explicit operators in the query, Lingo4G uses the default​Operator to combine clauses.

The following sections describe each type of clause and its modifiers.

Term queries

A simple term query selects documents that contain matching terms in any of the default search fields. The following list shows a few examples of different term queries.

  • test — selects documents containing the word test.

  • "test equipment" — phrase search; selects documents containing adjacent terms test equipment.

  • "test failure"~4 — proximity search; selects documents containing the words test and failure within 4 words (positions) from each other. The provided "proximity" is technically translated into "edit distance" (maximum number of atomic word-moving operations required to transform the document's phrase into the query phrase). For a more intuitive notion of proximity, use the ordered interval searches with a maximum position range constraint.

  • tes* — prefix wildcard matching; selects documents containing words starting with tes, such as: test, testing or testable.

  • /.est(s|ing)/ — regular expression matching; selects documents containing words that match the regular expression you provide. Here, documents containing resting or nests would both match, along with any other term made of a single character followed by ests or esting.

  • nest~2 — fuzzy term matching; documents containing words within 2-edits distance (2 additions, removals or replacements of a letter) from nest, such as test, net or rests.
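The prefix, regular expression and fuzzy examples above can be reproduced on a small term list. This is only a sketch: Python's fnmatch and re modules stand in for Lucene's wildcard and regular expression engines (their syntax differs in general), the term list is made up, and the distance function is plain Levenshtein distance. Lucene's fuzzy matching may additionally count a swap of two adjacent letters as a single edit.

```python
import re
from fnmatch import fnmatchcase


def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


# Hypothetical set of indexed terms.
terms = ["test", "testing", "nests", "resting", "net", "rests", "toast"]

prefix = [t for t in terms if fnmatchcase(t, "tes*")]
regex = [t for t in terms if re.fullmatch(r".est(s|ing)", t)]
fuzzy = [t for t in terms if edit_distance("nest", t) <= 2]

print(prefix)  # prints ['test', 'testing']
print(regex)   # prints ['testing', 'nests', 'resting', 'rests']
print(fuzzy)   # prints ['test', 'nests', 'net', 'rests']
```

Note that the regular expression must match the whole term, which is why test (no s or ing suffix) is not selected.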

Fields

An unqualified term query is applied to all default search fields specified in your project descriptor. To search for terms in a specific field, prefix the term with the field name followed by a colon, for example:

  • title:​test — documents containing test in the title field.

It is also possible to group several clauses and apply them to a single field using parentheses:

  • title:​(dandelion ​O​R daisy) — documents containing dandelion or daisy in the title field.

Boolean operators

You can combine simple terms and other clauses using the AND, OR and NOT Boolean operators, for example:

  • test ​A​N​D results — selects documents containing both the word test and the word results in any of the default search fields.

  • test ​O​R suite ​O​R results — selects documents with at least one of test, suite or results in any of the default search fields.

  • title:​test ​A​N​D ​N​O​T title:​complete — selects documents containing test and not containing complete in the title field.

  • title:​test ​A​N​D (pass* ​O​R fail*) — grouping; use parentheses to specify the precedence of terms in a Boolean clause. Query will match documents containing test in the title field and a word starting with pass or fail in the default search fields.

  • title:​(pass fail skip) — uses the default operator to combine three term queries. If the default operator is an O​R then the query selects documents containing at least one of pass, fail or skip in the title field. If the default operator is an A​N​D then the query selects documents containing all of those terms in the title field.

  • title:(+test +"result unknown") — shorthand AND notation; documents containing both test and result unknown in the title field.

Note that the operators must be written in all-capital letters.

Range operators

To search for ranges of textual or numeric values, use square or curly brackets, for example:

  • name:​[​Jones ​T​O ​Smith] — inclusive range; selects documents whose name field has any value between Jones and Smith, including boundaries.

  • score:​{2.5 ​T​O 7.3} — exclusive range; selects documents whose score field is between 2.5 and 7.3, excluding boundaries.

  • score:​{2.5 ​T​O *] — one-sided range; selects documents whose score field is larger than 2.5.
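The bracket semantics can be summarized in a small helper. This is a sketch under stated assumptions: in_range is a hypothetical function, None stands for the open bound (*), and Python's built-in comparison is used for both strings and numbers, whereas the actual comparison in the index depends on how the field was indexed.

```python
def in_range(value, low, high, include_low=True, include_high=True):
    """Check a value against a range; None stands for the open bound '*'."""
    if low is not None:
        if value < low or (value == low and not include_low):
            return False
    if high is not None:
        if value > high or (value == high and not include_high):
            return False
    return True


# name:[Jones TO Smith], both boundaries included
print(in_range("Smith", "Jones", "Smith"))                 # prints True
# score:{2.5 TO 7.3}, both boundaries excluded
print(in_range(2.5, 2.5, 7.3, include_low=False, include_high=False))  # prints False
# score:{2.5 TO *], one-sided range
print(in_range(9.0, 2.5, None, include_low=False))         # prints True
```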

Term boosting

You can attach a floating-point boost value to terms, quoted terms, term range expressions and grouped clauses to increase their score relative to other clauses. For example:

  • jones^2 ​O​R smith^0.5 — prioritizes documents with jones term over matches on the smith term.

  • field:​(a ​O​R b ​N​O​T c)^2.5 ​O​R field:​d — applies the boost to a sub-query.

Special character escaping

You can put most search terms in double quotes to make special-character escaping unnecessary. If a search term contains the quote character (or cannot be quoted for some reason), use backslash to escape any character. For example:

  • \:​\(quoted\+term\)\: — a single search term :​(quoted+term): with escape sequences. An alternative quoted form would be simpler: ":​(quoted+term):​".
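If you build queries programmatically, a helper along these lines can apply the backslash escaping for you. This is a sketch, not part of Lingo4G: escape_term is a hypothetical name, and the character set is the one reserved by Lucene's classic query syntax. The enhanced parser may reserve additional characters (for example @), so treat the set as an assumption.

```python
# Characters with special meaning in Lucene's classic query syntax.
# Assumption: the enhanced parser may reserve more (for example "@").
SPECIAL = set('\\+-!():^[]"{}~*?|&/')


def escape_term(term):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in term)


print(escape_term(":(quoted+term):"))  # prints \:\(quoted\+term\)\:
```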

Another case when quoting may be required is to escape leading forward slashes, which are parsed as regular expressions. For example, this query will not parse correctly without quotes:

  • title:"/daisy" — a full quote is needed here to prevent the leading forward slash character from being recognized as an (invalid) regular expression term query.

Handling of quoted expressions

The conversion from a quoted expression to a Lucene query depends on the analyzer specified for the field the quoted expression applies to. Term queries are parsed and divided into a stream of individual tokens using the same analyzer used to index the field's content. The result is a phrase query for a stream of tokens or a simple term query for a single token.

Minimum-should-match constraint

You can apply the minimum-should-match operator to a disjunction Boolean query (a query with only "OR"-subclauses), forcing the query to match documents containing at least the provided number of subclauses. For example:

  • (blue ​O​R crab ​O​R fish)@2 — matches all documents with at least two terms from the [blue, crab, fish] set (in any order).

  • ((yellow ​A​N​D blue) ​O​R crab ​O​R fish)@2 — sub-clauses of the top-level disjunction query can themselves be complex queries; here the min-should-match selects documents that match at least two of: yellow ​A​N​D blue, crab, fish.
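The counting rule behind the @ operator can be sketched as follows. The helper names (term, all_of, min_should_match) and the sample document are hypothetical; the model treats a document as a plain set of words and ignores fields and analysis.

```python
def term(word):
    """Subclause matching documents that contain the word."""
    return lambda doc: word in doc


def all_of(*clauses):
    """Subclause matching documents that satisfy every nested clause (AND)."""
    return lambda doc: all(clause(doc) for clause in clauses)


def min_should_match(doc, subclauses, minimum):
    """True if at least `minimum` subclauses match the document."""
    return sum(1 for clause in subclauses if clause(doc)) >= minimum


doc = {"blue", "fish", "tank"}

# (blue OR crab OR fish)@2
simple = [term("blue"), term("crab"), term("fish")]
# ((yellow AND blue) OR crab OR fish)@2
nested = [all_of(term("yellow"), term("blue")), term("crab"), term("fish")]

print(min_should_match(doc, simple, 2))  # prints True (blue and fish match)
print(min_should_match(doc, nested, 2))  # prints False (only fish matches)
```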

Interval queries and functions

Interval functions are a very powerful mechanism for selecting documents based on the presence and proximity of specific regions of text. Before we explain how interval functions work, we need to show how Lingo4G and Lucene index text data. When indexing, Lucene splits the text of each field in each document into tokens. Each token has an associated position in the token stream. For example, the following sentence:

The quick brown fox jumps over the lazy dog

could be transformed into the following token stream. Each indexed token is followed by its position in square brackets. Some token positions are "blank" (marked with a dash); these positions reflect tokens omitted from the index, typically stop words.

The[-] quick[2] brown[3] fox[4] jumps[5] over[6] the[-] lazy[8] dog[9]

Intervals are contiguous spans between two token positions in a document. For example, consider this interval query for intervals between an ordered sequence of terms brown and dog: fn:ordered(brown dog). The query covers the following interval, marked with square brackets:

The quick [brown fox jumps over the lazy dog]

The result of this function is the entire span of terms between brown and dog. This type of function can be called an interval selector. The second class of interval functions works on top of other intervals and provides filters (interval restrictions).

In the above example, the matching interval can be of any length — if the word brown occurred at the beginning of the document and the word dog at the very end, the interval would be very long, covering the entire document. You can restrict the matching intervals to, for example, only those with at most 3 positions between the search terms: fn:​maxgaps(3 fn:​ordered(brown dog)). There are five tokens in between the terms dog and brown (and therefore five "gaps" between the matching interval's positions) and the above query no longer matches the input document at all.
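The gap count can be reproduced with a toy index. This sketch assumes 1-based positions, lowercased tokens and a one-word stop list. The index_positions and ordered_gaps helpers are hypothetical and only handle two single-term sources, unlike the actual interval functions.

```python
STOP_WORDS = {"the"}  # assumption: a minimal stop word list


def index_positions(text):
    """Map each indexed token to its 1-based positions; stop words leave holes."""
    positions = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        if token not in STOP_WORDS:
            positions.setdefault(token, []).append(pos)
    return positions


def ordered_gaps(positions, first, second):
    """Gaps inside the shortest `first ... second` interval, or None."""
    best = None
    for start in positions.get(first, []):
        for end in positions.get(second, []):
            if end > start and (best is None or end - start < best[1] - best[0]):
                best = (start, end)
    return None if best is None else best[1] - best[0] - 1


index = index_positions("The quick brown fox jumps over the lazy dog")
gaps = ordered_gaps(index, "brown", "dog")
print(gaps, gaps <= 3)  # prints: 5 False
```

The omitted stop word still occupies a position, so it counts as one of the five gaps.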

Interval filtering functions allow expressing a variety of conditions ordinary Lucene queries can't easily cover. For example, consider this interval query that searches for words lazy or quick but only if they are in the neighborhood of 1 position from the words dog or fox:

fn:​within(fn:​or(lazy quick) 1 fn:​or(dog fox))

The result of this query is shown below, with the match in square brackets. Only the word lazy matches the query: the word quick is 2 positions away from fox, so it falls outside the 1-position neighborhood, and the word dog is not part of the match because it is only the interval's filtering condition.

The quick brown fox jumps over the [lazy] dog
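The filtering behavior of this query can be imitated with a toy model. The sketch assumes 1-based positions counted over all tokens of the sentence and handles single-term sources and references only; the within helper is a hypothetical stand-in, not the parser's implementation.

```python
SENTENCE = "The quick brown fox jumps over the lazy dog".split()
POSITION = {token: pos for pos, token in enumerate(SENTENCE, start=1)}


def within(sources, distance, references):
    """Keep source terms lying at most `distance` positions from a reference."""
    return [s for s in sources
            if any(abs(POSITION[s] - POSITION[r]) <= distance for r in references)]


# fn:within(fn:or(lazy quick) 1 fn:or(dog fox))
print(within(["lazy", "quick"], 1, ["dog", "fox"]))  # prints ['lazy']
```

Only source terms are returned; the reference terms act purely as a filter.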

Interval functions

The enhanced query parser supports the following interval functions, grouped by similar functionality:

term queries
alternatives
length restrictions
context filtering
ordering
containment

Examples

All examples in the following description of interval functions assume a single document with the following content (tokens):

The quick brown fox jumps over the lazy dog

term literals

Quoted or unquoted character sequences are converted into an interval expression based on the sequence (or graph) of tokens returned by the field's analyzer. In most cases, the interval expression will be a contiguous sequence of tokens equivalent to that returned by the field's analysis chain.

Another way to express a contiguous sequence of terms is to use the fn:​phrase function.

Examples
  • fn:​or(quick "fox")

    The quick brown fox jumps over the lazy dog

  • "quick fox" (The document does not match — no adjacent terms quick fox exist.)

    The quick brown fox jumps over the lazy dog

  • fn:​phrase(quick brown fox)

    The quick brown fox jumps over the lazy dog

fn:​wildcard

Matches the disjunction of all terms that match a wildcard glob.

Heads up, clause count limit.

The expanded wildcard can cover a lot of terms. By default, Lingo4G limits the maximum number of such "expansions" to 128. You can override the default limit, but this can lead to excessive memory use or slow query execution.

Arguments

fn:​wildcard(glob max​Expansions)

glob
term glob to expand based on the contents of the index.
max​Expansions
maximum acceptable number of term expansions before the function fails. This is an optional parameter.
Examples
  • fn:​wildcard(jump*)

    The quick brown fox jumps over the lazy dog

  • fn:​wildcard(br*n)

    The quick brown fox jumps over the lazy dog
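The expansion limit can be pictured with the following sketch. It uses Python's fnmatch as a stand-in for Lucene's wildcard matching, and expand_wildcard is a hypothetical helper. Whether the real function fails at exactly the limit or only above it is an assumption here.

```python
from fnmatch import fnmatchcase

DEFAULT_MAX_EXPANSIONS = 128  # default limit quoted in the text above


def expand_wildcard(glob, index_terms, max_expansions=DEFAULT_MAX_EXPANSIONS):
    """Return index terms matching the glob; fail if there are too many."""
    matches = sorted(t for t in index_terms if fnmatchcase(t, glob))
    if len(matches) > max_expansions:
        raise ValueError(f"{glob!r} expands to {len(matches)} terms, "
                         f"limit is {max_expansions}")
    return matches


TERMS = {"quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
print(expand_wildcard("jump*", TERMS))  # prints ['jumps']
print(expand_wildcard("br*n", TERMS))   # prints ['brown']
```

A very broad glob such as * would exceed a small limit and raise an error instead of silently matching every term.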

fn:​fuzzy​Term

Matches the disjunction of all terms that are within the given edit distance from the provided base.

Heads up, clause count limit.

The expanded set can cover a lot of terms. By default, Lingo4G limits the maximum number of such "expansions" to 128. You can override the default limit, but this can lead to excessive memory use or slow query execution.

Arguments

fn:​fuzzy​Term(glob max​Edits max​Expansions)

glob
the baseline term.
max​Edits
maximum number of edit operations for the transformed term to be considered equal (1 or 2).
max​Expansions
maximum acceptable number of term expansions before the function fails. This is an optional parameter.
Examples
  • fn:​fuzzy​Term(box)

    The quick brown fox jumps over the lazy dog

fn:​or

Matches the disjunction of nested intervals.

Arguments

fn:​or(sources...)

sources
sub-intervals (terms or other functions)
Examples
  • fn:​or(dog fox)

    The quick brown fox jumps over the lazy dog

fn:​at​Least

Matches documents that contain at least the provided number of source intervals.

Arguments

fn:​at​Least(min sources...)

min
an integer specifying minimum number of sub-interval arguments that must match.
sources
sub-intervals (terms or other functions)
Examples
  • fn:​at​Least(2 quick fox "furry dog")

    The quick brown fox jumps over the lazy dog

  • fn:​at​Least(2 fn:​unordered(furry dog) fn:​unordered(brown dog) lazy quick) (This query results in multiple overlapping intervals.)

    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog
    The quick brown fox jumps over the lazy dog

fn:​maxgaps

Accepts the source interval if it has at most the given number of position gaps.

Arguments

fn:​maxgaps(gaps source)

gaps
an integer specifying maximum number of source's position gaps.
source
source sub-interval.
Examples
  • fn:​maxgaps(0 fn:​ordered(fn:​or(quick lazy) fn:​or(fox dog)))

    The quick brown fox jumps over the lazy dog

  • fn:​maxgaps(1 fn:​ordered(fn:​or(quick lazy) fn:​or(fox dog)))

    The quick brown fox jumps over the lazy dog

fn:​maxwidth

Accepts source interval if it has at most the given width (position span).

Arguments

fn:​maxwidth(max source)

max
an integer specifying maximum width of source's position span.
source
source sub-interval.
Examples
  • fn:​maxwidth(2 fn:​ordered(fn:​or(quick lazy) fn:​or(fox dog)))

    The quick brown fox jumps over the lazy dog

  • fn:​maxwidth(3 fn:​ordered(fn:​or(quick lazy) fn:​or(fox dog)))

    The quick brown fox jumps over the lazy dog
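The difference between fn:maxgaps and fn:maxwidth in the examples above can be worked out with a toy model. The sketch assumes 1-based positions counted over all tokens of the sentence and an ordered pair of single-term sources; all helper names are hypothetical.

```python
SENTENCE = "The quick brown fox jumps over the lazy dog".split()
POSITION = {token: pos for pos, token in enumerate(SENTENCE, start=1)}


def ordered_pairs(firsts, seconds):
    """Shortest (start, end) interval from each first term to a later second term."""
    intervals = []
    for first in firsts:
        later = [POSITION[s] for s in seconds if POSITION[s] > POSITION[first]]
        if later:
            intervals.append((POSITION[first], min(later)))
    return intervals


def maxgaps(limit, intervals):
    # Two single-token sources: gaps are the positions strictly between them.
    return [(s, e) for s, e in intervals if e - s - 1 <= limit]


def maxwidth(limit, intervals):
    # Width counts every position from start to end, inclusive.
    return [(s, e) for s, e in intervals if e - s + 1 <= limit]


def text(interval):
    start, end = interval
    return " ".join(SENTENCE[start - 1:end])


# fn:ordered(fn:or(quick lazy) fn:or(fox dog))
candidates = ordered_pairs(["quick", "lazy"], ["fox", "dog"])

print([text(i) for i in maxgaps(0, candidates)])   # prints ['lazy dog']
print([text(i) for i in maxgaps(1, candidates)])   # prints ['quick brown fox', 'lazy dog']
print([text(i) for i in maxwidth(2, candidates)])  # prints ['lazy dog']
print([text(i) for i in maxwidth(3, candidates)])  # prints ['quick brown fox', 'lazy dog']
```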

fn:​phrase

Matches an ordered, gapless sequence of source intervals.

Arguments

fn:​phrase(sources...)

sources
sub-intervals (terms or other functions)
Examples
  • fn:​phrase(quick brown fox)

    The quick brown fox jumps over the lazy dog

  • fn:​phrase(fn:​ordered(quick fox) jumps)

    The quick brown fox jumps over the lazy dog

fn:​ordered

Matches an ordered span containing all source intervals, possibly with gaps in between their respective source interval positions. Source intervals must not overlap.

Arguments

fn:​ordered(sources...)

sources
sub-intervals (terms or other functions)
Examples
  • fn:​ordered(quick jumps dog)

    The quick brown fox jumps over the lazy dog

  • fn:​ordered(quick fn:​or(fox dog)) (Note only the shorter match out of the two alternatives is included in the result; the algorithm is not required to return or highlight all matching interval alternatives).

    The quick brown fox jumps over the lazy dog

  • fn:​ordered(quick jumps fn:​or(fox dog))

    The quick brown fox jumps over the lazy dog

  • fn:​ordered(fn:​phrase(brown fox) fn:​phrase(fox jumps)) (Sources overlap, no matches.)

    The quick brown fox jumps over the lazy dog

fn:​unordered

Matches an unordered span containing all source intervals, possibly with gaps in between their respective source interval positions. Source intervals may overlap.

Arguments

fn:​unordered(sources...)

sources
sub-intervals (terms or other functions)
Examples
  • fn:​unordered(dog jumps quick)

    The quick brown fox jumps over the lazy dog

  • fn:​unordered(fn:​or(fox dog) quick) (Note only the shorter match out of the two alternatives is included in the result; the algorithm is not required to return or highlight all matching interval alternatives).

    The quick brown fox jumps over the lazy dog

  • fn:​unordered(fn:​phrase(brown fox) fn:​phrase(fox jumps))

    The quick brown fox jumps over the lazy dog

fn:​unordered​No​Overlaps

Matches an unordered span containing two source intervals, possibly with gaps in between their respective source interval positions. Source intervals must not overlap.

Note that, unlike fn:​unordered, this function takes a fixed number of arguments (two).

Arguments

fn:​unordered​No​Overlaps(source1 source2)

source1
sub-interval (term or other function)
source2
sub-interval (term or other function)
Examples
  • fn:​unordered​No​Overlaps(fn:​phrase(fox jumps) brown)

    The quick brown fox jumps over the lazy dog

  • fn:​unordered​No​Overlaps(fn:​phrase(brown fox) fn:​phrase(fox jumps)) (Sources overlap, no matches.)

    The quick brown fox jumps over the lazy dog

fn:​before

Matches intervals from the source that appear before intervals from the reference.

This is a filtering function: reference intervals will not be part of the match.

Arguments

fn:​before(source reference)

source
source sub-interval (term or other function)
reference
reference sub-interval (term or other function)
Examples
  • fn:​before(fn:​or(brown lazy) fox)

    The quick brown fox jumps over the lazy dog

  • fn:​before(fn:​or(brown lazy) fn:​or(dog fox))

    The quick brown fox jumps over the lazy dog

fn:​after

Matches intervals from the source that appear after intervals from the reference.

This is a filtering function: reference intervals will not be part of the match.

Arguments

fn:​after(source reference)

source
source sub-interval (term or other function)
reference
reference sub-interval (term or other function)
Examples
  • fn:​after(fn:​or(brown lazy) fox)

    The quick brown fox jumps over the lazy dog

  • fn:​after(fn:​or(brown lazy) fn:​or(dog fox))

    The quick brown fox jumps over the lazy dog
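The fn:before and fn:after examples can be traced with a toy model restricted to single-term sources and references. Positions are assumed to be 1-based and counted over all tokens of the sentence; the before and after helpers are hypothetical stand-ins.

```python
SENTENCE = "The quick brown fox jumps over the lazy dog".split()
POSITION = {token: pos for pos, token in enumerate(SENTENCE, start=1)}


def before(sources, references):
    """Source terms followed by at least one reference term."""
    return [s for s in sources
            if any(POSITION[s] < POSITION[r] for r in references)]


def after(sources, references):
    """Source terms preceded by at least one reference term."""
    return [s for s in sources
            if any(POSITION[s] > POSITION[r] for r in references)]


print(before(["brown", "lazy"], ["fox"]))         # prints ['brown']
print(before(["brown", "lazy"], ["dog", "fox"]))  # prints ['brown', 'lazy']
print(after(["brown", "lazy"], ["fox"]))          # prints ['lazy']
print(after(["brown", "lazy"], ["dog", "fox"]))   # prints ['lazy']
```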

fn:​extend

Matches an interval around another source, extending its span by a number of positions before and after.

This is an advanced function that allows extending the left and right "context" of another interval.

Arguments

fn:​extend(source before after)

source
source sub-interval (term or other function)
before
an integer number of positions to extend to the left of the source
after
an integer number of positions to extend to the right of the source
Examples
  • fn:​extend(fox 1 2)

    The quick brown fox jumps over the lazy dog

  • fn:​extend(fn:​or(dog fox) 2 0)

    The quick brown fox jumps over the lazy dog

fn:​within

Matches intervals of the source that appear within the provided number of positions from the intervals of the reference.

Arguments

fn:​within(source positions reference)

source
source sub-interval (term or other function)
positions
an integer number of maximum positions between source and reference
reference
reference sub-interval (term or other function)
Examples
  • fn:​within(fn:​or(fox dog) 1 fn:​or(quick lazy))

    The quick brown fox jumps over the lazy dog

  • fn:​within(fn:​or(fox dog) 2 fn:​or(quick lazy))

    The quick brown fox jumps over the lazy dog

fn:​not​Within

Matches intervals of the source that do not appear within the provided number of positions from the intervals of the reference.

Arguments

fn:​not​Within(source positions reference)

source
source sub-interval (term or other function)
positions
an integer number of maximum positions between source and reference
reference
reference sub-interval (term or other function)
Examples
  • fn:​not​Within(fn:​or(fox dog) 1 fn:​or(quick lazy))

    The quick brown fox jumps over the lazy dog

fn:​contained​By

Matches intervals of the source that are contained by intervals of the reference.

Arguments

fn:​contained​By(source reference)

source
source sub-interval (term or other function)
reference
reference sub-interval (term or other function)
Examples
  • fn:​contained​By(fn:​or(fox dog) fn:​ordered(quick lazy))

    The quick brown fox jumps over the lazy dog

  • fn:​contained​By(fn:​or(fox dog) fn:​extend(lazy 3 3))

    The quick brown fox jumps over the lazy dog
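The interplay of fn:extend and fn:containedBy in the examples above can be followed with intervals written as (start, end) pairs. This is a toy model: positions are assumed to be 1-based and counted over all tokens of the sentence, the helper names are hypothetical, and clamping the extended interval at position 1 is a choice made for this sketch.

```python
SENTENCE = "The quick brown fox jumps over the lazy dog".split()
POSITION = {token: pos for pos, token in enumerate(SENTENCE, start=1)}


def term(word):
    """Single-position interval (start, end) of a word."""
    return (POSITION[word], POSITION[word])


def extend(interval, before, after):
    """Widen an interval, clamping at the first position."""
    start, end = interval
    return (max(1, start - before), end + after)


def contained_by(sources, references):
    """Source intervals lying entirely inside some reference interval."""
    return [s for s in sources
            if any(r[0] <= s[0] and s[1] <= r[1] for r in references)]


fox, dog, lazy, quick = term("fox"), term("dog"), term("lazy"), term("quick")

# fn:containedBy(fn:or(fox dog) fn:ordered(quick lazy))
quick_to_lazy = (quick[0], lazy[1])
print(contained_by([fox, dog], [quick_to_lazy]) == [fox])  # prints True

# fn:containedBy(fn:or(fox dog) fn:extend(lazy 3 3))
window = extend(lazy, 3, 3)
print(contained_by([fox, dog], [window]) == [dog])  # prints True
```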

fn:​not​Contained​By

Matches intervals of the source that are not contained by intervals of the reference.

Arguments

fn:​not​Contained​By(source reference)

source
source sub-interval (term or other function)
reference
reference sub-interval (term or other function)
Examples
  • fn:​not​Contained​By(fn:​or(fox dog) fn:​ordered(quick lazy))

    The quick brown fox jumps over the lazy dog

  • fn:​not​Contained​By(fn:​or(fox dog) fn:​extend(lazy 3 3))

    The quick brown fox jumps over the lazy dog

fn:​containing

Matches intervals of the source that contain at least one interval of the reference.

Arguments

fn:​containing(source reference)

source
source sub-interval (term or other function)
reference
reference sub-interval (term or other function)
Examples
  • fn:​containing(fn:​extend(fn:​or(lazy brown) 1 1) fn:​or(fox dog))

    The quick brown fox jumps over the lazy dog

  • fn:​containing(fn:​at​Least(2 quick fox dog) jumps)

    The quick brown fox jumps over the lazy dog

fn:​not​Containing

Matches intervals of the source that do not contain any intervals of the reference.

Arguments

fn:​not​Containing(source reference)

source
source sub-interval (term or other function)
reference
reference sub-interval (term or other function)
Examples
  • fn:​not​Containing(fn:​extend(fn:​or(fox dog) 1 0) fn:​or(brown yellow))

    The quick brown fox jumps over the lazy dog

  • fn:​not​Containing(fn:​ordered(fn:​or(the ​The) fn:​or(fox dog)) brown)

    The quick brown fox jumps over the lazy dog

fn:​overlapping

Matches intervals of the source that overlap with at least one interval of the reference.

Arguments

fn:​overlapping(source reference)

source
source sub-interval (term or other function)
reference
reference sub-interval (term or other function)
Examples
  • fn:​overlapping(fn:​phrase(brown fox) fn:​phrase(fox jumps))

    The quick brown fox jumps over the lazy dog

  • fn:​overlapping(fn:​or(fox dog) fn:​extend(lazy 2 2))

    The quick brown fox jumps over the lazy dog

fn:​non​Overlapping

Matches intervals of the source that do not overlap with any intervals of the reference.

Arguments

fn:​non​Overlapping(source reference)

source
source sub-interval (term or other function)
reference
reference sub-interval (term or other function)
Examples
  • fn:​non​Overlapping(fn:​phrase(brown fox) fn:​phrase(lazy dog))

    The quick brown fox jumps over the lazy dog

  • fn:​non​Overlapping(fn:​or(fox dog) fn:​extend(lazy 2 2))

    The quick brown fox jumps over the lazy dog

default​Fields

Type
array of string
Default
[]
Required
no

An array of field names to search for query terms without an explicit field name qualifier. For example, the data title:mining query contains one unqualified term: data and one with an explicit field qualifier: title:mining. If defaultFields was equal to ["title", "abstract"], Lingo4G would rewrite the query to (title:data OR abstract:data) title:mining.

If you do not provide default​Fields or set it to an empty array, Lingo4G will raise errors for queries containing terms without explicit field qualifiers.
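The rewriting described above can be sketched for the simplest case of whitespace-separated clauses. The apply_default_fields helper is hypothetical and ignores phrases, groups and operators; it only illustrates how unqualified terms fan out over the default fields and why an empty list leads to an error.

```python
def apply_default_fields(query, default_fields):
    """Expand unqualified terms into a disjunction over the default fields."""
    rewritten = []
    for clause in query.split():
        if ":" in clause:
            rewritten.append(clause)  # already has a field qualifier
        elif not default_fields:
            raise ValueError(f"term '{clause}' has no field qualifier "
                             "and no default fields are set")
        else:
            alternatives = " OR ".join(f"{field}:{clause}" for field in default_fields)
            rewritten.append(f"({alternatives})")
    return " ".join(rewritten)


print(apply_default_fields("data title:mining", ["title", "abstract"]))
# prints: (title:data OR abstract:data) title:mining
```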

default​Operator

Type
string
Default
"AND"
Constraints
one of [AND, OR]
Required
no

The default Boolean operator Lingo4G applies to each clause of the query, unless you explicitly provide the operator to use.

For example, with the default​Operator equal to A​N​D, Lingo4G rewrites the data mining query to data ​A​N​D mining.
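For bare, whitespace-separated clauses, the effect of the default operator can be sketched as follows. The apply_default_operator helper is hypothetical and does not understand phrases, groups or field-scoped sub-queries; it only shows that explicit operators are left alone.

```python
OPERATORS = {"AND", "OR", "NOT"}


def apply_default_operator(query, default_operator="AND"):
    """Insert the default operator between adjacent clauses lacking one."""
    out = []
    for token in query.split():
        if out and token not in OPERATORS and out[-1] not in OPERATORS:
            out.append(default_operator)
        out.append(token)
    return " ".join(out)


print(apply_default_operator("data mining"))               # prints: data AND mining
print(apply_default_operator("data mining", "OR"))         # prints: data OR mining
print(apply_default_operator("data OR mining tools"))      # prints: data OR mining AND tools
```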

The default​Operator property supports the following values:

A​N​D

Conjunction operator.

O​R

Disjunction operator.

sanitize​Spaces

Type
string
Default
"(?U)\\p{Blank}+"
Required
no

Before parsing the query, Lingo4G replaces each occurrence of the regular expression pattern you provide in the sanitizeSpaces property with a single space character. The default pattern normalizes any sequence of Unicode blank characters (spaces, tabs and similar horizontal white space) into one plain space. To disable the replacement, set sanitizeSpaces to an empty string.
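The replacement step can be approximated in a few lines. This is a sketch under stated assumptions: Python's re module has no \p{...} classes, so the pattern below is a hand-written stand-in for the default one (Unicode white space characters except line and paragraph separators), and sanitize_spaces is a hypothetical helper.

```python
import re

# Assumption: hand-written approximation of the default (?U)\p{Blank}+ pattern,
# listing tab, space, no-break space and other Unicode space separators.
BLANKS_PATTERN = "[\t \u00a0\u1680\u2000-\u200a\u202f\u205f\u3000]+"


def sanitize_spaces(query, pattern=BLANKS_PATTERN):
    """Replace every match with one space; an empty pattern disables the step."""
    if not pattern:
        return query
    return re.sub(pattern, " ", query)


print(sanitize_spaces("data\u00a0\t mining"))  # prints: data mining
```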

validate​Fields

Type
boolean
Default
true
Required
no

If true, Lingo4G raises an error if the query contains a field name qualifier referring to a field that does not exist in the index. Field name validation ensures that accidental typos in field names result in errors rather than empty search results.
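The benefit of field validation can be shown with a naive check. The sketch assumes a fixed set of index fields and finds qualifiers with a simple regular expression, skipping the fn prefix of interval functions; validate_fields is a hypothetical helper and does not parse the query the way Lingo4G does.

```python
import re

INDEX_FIELDS = {"title", "abstract"}  # assumption: fields present in the index


def validate_fields(query, index_fields=INDEX_FIELDS):
    """Raise ValueError if the query references a field missing from the index."""
    # Naive scan: an identifier directly followed by a colon is a qualifier.
    for field in re.findall(r"\b(\w+):", query):
        if field == "fn":
            continue  # interval function prefix, not a field
        if field not in index_fields:
            raise ValueError(f"unknown field: {field}")
    return True


print(validate_fields("title:test AND abstract:mining"))  # prints True

try:
    validate_fields("titel:test")  # typo in the field name
except ValueError as error:
    print(error)  # prints: unknown field: titel
```

Without such a check, the misspelled query would simply return no documents.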