Query parsers
The queryParsers section of the project descriptor declares parsers that Lingo4G uses to prepare Lucene queries from text.
Properties in this object must be of the following types:
- enhanced
  Implements an enhanced syntax of Lucene's (flexible) StandardQueryParser. The extensions include support for interval queries.
A typical query parsers section of the project descriptor looks like this:
{
"queryParsers": {
"enhanced": {
"type": "enhanced",
"defaultFields": [
"title",
"abstract"
],
"defaultOperator": "OR"
}
}
}
The queryParsers property of the project descriptor must be an object with keys corresponding to query parser identifiers. Use that identifier to choose the query parser at analysis time, for example in the queryParser property of the query:string component.
Property values of the queryParsers object represent configurations of the specific query parser. Each configuration object must contain the type property defining the kind of query parser to use.
You must declare at least one query parser configuration in your project descriptor.
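The structural rules above can be checked mechanically. The following Python sketch is an illustration only and is not part of Lingo4G: the function name check_query_parsers and the set of known types are assumptions made for this example. It verifies that the section is a non-empty object and that each configuration declares a known type.

```python
import json

# Parser types described on this page; the set is an assumption of this sketch.
KNOWN_TYPES = {"enhanced"}


def check_query_parsers(descriptor: dict) -> list:
    """Return a list of problems found in the queryParsers section."""
    parsers = descriptor.get("queryParsers")
    if not isinstance(parsers, dict) or not parsers:
        return ["queryParsers must be an object with at least one parser"]
    problems = []
    for parser_id, config in parsers.items():
        if not isinstance(config, dict) or "type" not in config:
            problems.append(f"{parser_id}: missing the type property")
        elif config["type"] not in KNOWN_TYPES:
            problems.append(f"{parser_id}: unknown type {config['type']!r}")
    return problems


descriptor = json.loads("""
{
  "queryParsers": {
    "enhanced": {
      "type": "enhanced",
      "defaultFields": ["title", "abstract"],
      "defaultOperator": "OR"
    }
  }
}
""")
print(check_query_parsers(descriptor))  # prints []
```

A check like this is only a convenience for catching typos early; Lingo4G performs its own validation when it loads the project descriptor.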
You can use the following query parser types in your project descriptors:
enhanced
This query parser implements an enhanced version of the syntax of Lucene's (flexible) StandardQueryParser. The extensions include support for interval queries.
{
"type": "enhanced",
"defaultFields": [],
"defaultOperator": "AND",
"sanitizeSpaces": "(?U)\\p{Blank}+",
"validateFields": true
}
Query syntax
The text query must contain one or more clauses, optionally combined with Boolean operators AND or OR. If you don't provide any explicit operators in the query, Lingo4G uses the defaultOperator to combine clauses.
The following sections review all clause types and their modifiers.
Term queries
A simple term query selects documents that contain matching terms in any of the default search fields. The following list shows a few examples of different term queries.
- test
  Selects documents containing the word test.
- "test equipment"
  Phrase search; selects documents containing the adjacent terms test equipment.
- "test failure"~4
  Proximity search; selects documents containing the words test and failure within 4 words (positions) of each other. The provided "proximity" is technically translated into "edit distance" (the maximum number of atomic word-moving operations required to transform the document's phrase into the query phrase). For a more intuitive notion of proximity, use the ordered interval searches with a maximum position range constraint.
- tes*
  Prefix wildcard matching; selects documents containing words starting with tes, such as test, testing or testable.
- /.est(s|ing)/
  Regular expression matching; selects documents containing words that match the regular expression you provide. Here, documents containing resting or nests would both match, along with any other term made of a single character followed by ests or esting.
- nest~2
  Fuzzy term matching; selects documents containing words within an edit distance of 2 (2 additions, removals or replacements of a letter) from nest, such as test, net or rests.
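To make the fuzzy and regular expression examples concrete, here is a small Python sketch. It is an illustration under simplifying assumptions, not Lucene's implementation: it uses plain Levenshtein distance (the actual implementation may differ, for example in how it treats transposed letters), Python's re module in place of Lucene's regular expression syntax, and a made-up list of terms.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # delete char_a
                current[j - 1] + 1,       # insert char_b
                previous[j - 1] + cost,   # substitute (or keep) the letter
            ))
        previous = current
    return previous[-1]


# Hypothetical index terms, chosen for this example only.
terms = ["test", "net", "rests", "nests", "resting", "harvests"]

# nest~2: terms within two edits of "nest".
fuzzy = [t for t in terms if edit_distance("nest", t) <= 2]

# /.est(s|ing)/: the whole term must match the expression.
regex = [t for t in terms if re.fullmatch(r".est(s|ing)", t)]

print(fuzzy)  # prints ['test', 'net', 'rests', 'nests']
print(regex)  # prints ['rests', 'nests', 'resting']
```

Note that harvests ends in ests but is not selected by the regular expression, because the expression has to match the entire term.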
Fields
An unqualified term query is applied to all default search fields specified in your project descriptor. To search for terms in a specific field, prefix the term with the field name followed by a colon, for example:
- title:test
  Selects documents containing test in the title field.
It is also possible to group several clauses and apply them to a single field using parentheses:
- title:(dandelion OR daisy)
  Selects documents containing dandelion or daisy in the title field.
Boolean operators
You can combine simple terms and other clauses using the AND, OR and NOT Boolean operators, for example:
- test AND results
  Selects documents containing both the word test and the word results in any of the default search fields.
- test OR suite OR results
  Selects documents with at least one of test, suite or results in any of the default search fields.
- title:test AND NOT title:complete
  Selects documents containing test and not containing complete in the title field.
- title:test AND (pass* OR fail*)
  Grouping; use parentheses to specify the precedence of terms in a Boolean clause. The query matches documents containing test in the title field and a word starting with pass or fail in the default search fields.
- title:(pass fail skip)
  Uses the default operator to combine three term queries. If the default operator is OR, the query selects documents containing at least one of pass, fail or skip in the title field. If the default operator is AND, the query selects documents containing all of those terms in the title field.
- title:(+test +"result unknown")
  Shorthand AND notation; selects documents containing both test and result unknown in the title field.
Note the operators must be written in all-capital letters.
Range operators
To search for ranges of textual or numeric values, use square or curly brackets, for example:
- name:[Jones TO Smith]
  Inclusive range; selects documents whose name field has any value between Jones and Smith, including boundaries.
- score:{2.5 TO 7.3}
  Exclusive range; selects documents whose score field is between 2.5 and 7.3, excluding boundaries.
- score:{2.5 TO *]
  One-sided range; selects documents whose score field is larger than 2.5.
Term boosting
You can attach a floating point boost value to quoted terms, term range expressions and grouped clauses to increase their score relative to other clauses. For example:
- jones^2 OR smith^0.5
  Prioritizes documents with the jones term over matches on the smith term.
- field:(a OR b NOT c)^2.5 OR field:d
  Applies the boost to a sub-query.
Special character escaping
You can put most search terms in double quotes to make special-character escaping unnecessary. If a search term contains the quote character (or cannot be quoted for some reason), use backslash to escape any character. For example:
- \:\(quoted\+term\)\:
  A single search term :(quoted+term): written with escape sequences. An alternative quoted form would be simpler: ":(quoted+term):".
Another case when quoting may be required is to escape leading forward slashes, which are parsed as regular expressions. For example, this query will not parse correctly without quotes:
- title:"/daisy"
  A full quote is needed here to prevent the leading forward slash character from being recognized as an (invalid) regular expression term query.
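When queries are assembled programmatically, escaping is usually delegated to a small helper. The Python sketch below shows one possible approach; the helper names are hypothetical and the set of special characters is an assumption based on Lucene's classic query syntax, so verify it against the parser you actually use.

```python
# Characters with a special meaning in Lucene's classic query syntax.
# The exact set is an assumption of this sketch.
SPECIAL_CHARACTERS = set('\\+-!():^[]"{}~*?|&/')


def escape_term(term: str) -> str:
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch for ch in term)


def quote_term(term: str) -> str:
    """Wrap the term in double quotes, escaping quotes and backslashes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


print(escape_term(":(quoted+term):"))  # prints \:\(quoted\+term\)\:
print(quote_term("/daisy"))            # prints "/daisy"
```

Quoting is usually the more readable option; backslash escaping remains useful when the term cannot be quoted.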
The conversion from a quoted expression to a Lucene query depends on the analyzer specified for the field the quoted expression applies to. Term queries are parsed and divided into a stream of individual tokens using the same analyzer used to index the field's content. The result is a phrase query for a stream of tokens or a simple term query for a single token.
Minimum-should-match constraint
You can apply the minimum-should-match operator to a disjunction Boolean query (a query with only "OR"-subclauses), forcing the query to match documents containing at least the provided number of subclauses. For example:
- (blue OR crab OR fish)@2
  Matches all documents with at least two terms from the [blue, crab, fish] set (in any order).
- ((yellow AND blue) OR crab OR fish)@2
  Sub-clauses of the top-level disjunction query can themselves be complex queries; here the min-should-match selects documents that match at least two of: yellow AND blue, crab, fish.
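The counting behind the minimum-should-match operator can be sketched in a few lines of Python. This is a simplified model for illustration: a document is reduced to a set of terms, and the helper names term, all_of and min_should_match are made up for this example.

```python
from typing import Callable, Iterable, Set

Clause = Callable[[Set[str]], bool]


def term(word: str) -> Clause:
    """Clause that matches when the document contains the word."""
    return lambda doc: word in doc


def all_of(*clauses: Clause) -> Clause:
    """Conjunction (AND) of the given clauses."""
    return lambda doc: all(clause(doc) for clause in clauses)


def min_should_match(minimum: int, clauses: Iterable[Clause]) -> Clause:
    """Disjunction that requires at least `minimum` matching subclauses."""
    clauses = list(clauses)
    return lambda doc: sum(clause(doc) for clause in clauses) >= minimum


# ((yellow AND blue) OR crab OR fish)@2
query = min_should_match(
    2, [all_of(term("yellow"), term("blue")), term("crab"), term("fish")]
)

print(query({"blue", "crab"}))            # prints False: only crab matches
print(query({"yellow", "blue", "fish"}))  # prints True: two subclauses match
```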
Interval queries and functions
Interval functions are a very powerful mechanism for selecting documents based on the presence and proximity of specific regions of text. Before we explain how interval functions work, we need to show how Lingo4G and Lucene index text data. When indexing, Lucene splits the text of each field in each document into tokens. Each token has an associated position in the token stream. For example, the following sentence:
The quick brown fox jumps over the lazy dog
could be transformed into the following token stream. Note that some token positions are "blank", these positions reflect tokens omitted from the index, typically stop words.
The(-) quick(2) brown(3) fox(4) jumps(5) over(6) the(-) lazy(8) dog(9)
Intervals are contiguous spans between two token positions in a document. For example, consider this interval query for intervals between an ordered sequence of terms brown and dog: fn:ordered(brown dog). The query covers the following interval:
The quick brown fox jumps over the lazy dog
The result of this function is the entire span of terms between brown and dog. This type of function can be called an interval selector. The second class of interval functions works on top of other intervals and provides filters (interval restrictions).
In the above example, the matching interval can be of any length: if the word brown occurred at the beginning of the document and the word dog at the very end, the interval would be very long, covering the entire document. You can restrict the matching intervals to, for example, only those with at most 3 positions between the search terms: fn:maxgaps(3 fn:ordered(brown dog)). There are five tokens between the terms brown and dog (and therefore five "gaps" between the matching interval's positions), so the above query no longer matches the input document at all.
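The gap arithmetic can be reproduced with a short Python sketch. It is a simplified model rather than Lucene's interval algorithm: it assumes every word, including stop words omitted from the index, occupies one position, and the function names are chosen for this example only.

```python
def positions(text: str) -> dict:
    """Map each lowercased token to its 1-based positions in the text."""
    index = {}
    for position, token in enumerate(text.lower().split(), start=1):
        index.setdefault(token, []).append(position)
    return index


def ordered(index: dict, first: str, second: str) -> list:
    """Pair each occurrence of `second` with the closest preceding `first`."""
    intervals = []
    for end in index.get(second, []):
        starts = [p for p in index.get(first, []) if p < end]
        if starts:
            intervals.append((max(starts), end))
    return intervals


def gaps(interval: tuple) -> int:
    """Positions strictly between the two terms of the interval."""
    start, end = interval
    return end - start - 1


def maxgaps(limit: int, intervals: list) -> list:
    """Keep only the intervals with at most `limit` gaps."""
    return [i for i in intervals if gaps(i) <= limit]


index = positions("The quick brown fox jumps over the lazy dog")
matches = ordered(index, "brown", "dog")   # fn:ordered(brown dog)

print([gaps(i) for i in matches])  # prints [5]
print(maxgaps(3, matches))         # prints []: fn:maxgaps(3 ...) rejects it
print(len(maxgaps(5, matches)))    # prints 1: a limit of 5 would accept it
```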
Interval filtering functions allow expressing a variety of conditions ordinary Lucene queries can't easily cover. For example, consider this interval query that searches for words lazy or quick but only if they are in the neighborhood of 1 position from the words dog or fox:
fn:within(fn:or(lazy quick) 1 fn:or(dog fox))
The result of this query is shown below: only the word lazy matches the query. The word quick is 2 positions away from fox, so it does not match, and the word dog is not part of the match because it is only the interval's filtering condition.
The quick brown fox jumps over the lazy dog
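The filtering behavior of fn:within can be modeled for single-word intervals with the Python sketch below. It is a deliberately reduced model (each interval is one position wide, every word advances the position by one) and the helper names are hypothetical; real interval sources can span several positions.

```python
def term_positions(text: str, *terms: str) -> list:
    """1-based positions of any of the given terms (a stand-in for fn:or)."""
    wanted = set(terms)
    tokens = text.lower().split()
    return [p for p, token in enumerate(tokens, start=1) if token in wanted]


def within(source: list, distance: int, reference: list) -> list:
    """Keep source positions at most `distance` positions from a reference."""
    return [s for s in source if any(abs(s - r) <= distance for r in reference)]


text = "The quick brown fox jumps over the lazy dog"
tokens = text.lower().split()

# fn:within(fn:or(lazy quick) 1 fn:or(dog fox))
matches = within(
    term_positions(text, "lazy", "quick"), 1, term_positions(text, "dog", "fox")
)
print([tokens[p - 1] for p in matches])  # prints ['lazy']
```

Only source positions are returned; the reference positions act purely as a filter, which is why dog does not appear in the output.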
Interval functions
The enhanced query parser supports the following interval functions, grouped by similar functionality:
- term queries
- alternatives
- length restrictions
- context filtering
- ordering
- containment
All examples in the following description of interval functions assume a single document with the following content (tokens):
The quick brown fox jumps over the lazy dog
term literals
Quoted or unquoted character sequences are converted into an interval expression based on the sequence (or graph) of tokens returned by the field's analyzer. In most cases, the interval expression will be a contiguous sequence of tokens equivalent to that returned by the field's analysis chain.
Another way to express a contiguous sequence of terms is to use the fn:phrase function.
- Examples
-
-
fn:​or(quick "fox")
The quick brown fox jumps over the lazy dog
-
"quick fox"
(The document does not match: no adjacent terms quick fox exist.)
The quick brown fox jumps over the lazy dog
-
fn:​phrase(quick brown fox)
The quick brown fox jumps over the lazy dog
-
fn:​wildcard
Matches the disjunction of all terms that match a wildcard glob.
The expanded wildcard can cover a lot of terms. By default, Lingo4G limits the maximum number of such "expansions" to 128. You can override the default limit, but this can lead to excessive memory use or slow query execution.
- Arguments
-
fn:​wildcard(glob max​Expansions)
glob
- term glob to expand based on the contents of the index.
max​Expansions
- maximum acceptable number of term expansions before the function fails. This is an optional parameter.
- Examples
-
-
fn:​wildcard(jump*)
The quick brown fox jumps over the lazy dog
-
fn:​wildcard(br*n)
The quick brown fox jumps over the lazy dog
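The role of the maxExpansions limit can be illustrated with a Python sketch that expands a glob against a list of index terms. The function expand_wildcard and the term list are assumptions of this example, and Python's fnmatch is only an approximation of the wildcard syntax (it also understands character sets in square brackets).

```python
from fnmatch import fnmatchcase


def expand_wildcard(glob: str, index_terms, max_expansions: int = 128) -> list:
    """Expand a glob against the index terms, failing if too many terms match."""
    expansions = sorted(t for t in set(index_terms) if fnmatchcase(t, glob))
    if len(expansions) > max_expansions:
        raise ValueError(
            f"{glob!r} expands to {len(expansions)} terms, limit is {max_expansions}"
        )
    return expansions


# Hypothetical term dictionary of the example document's field.
index_terms = ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

print(expand_wildcard("jump*", index_terms))  # prints ['jumps']
print(expand_wildcard("br*n", index_terms))   # prints ['brown']
```

Failing early, rather than silently truncating the expansion, keeps a query from returning partial results when the glob is too broad.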
-
fn:​fuzzy​Term
Matches the disjunction of all terms that are within the given edit distance from the provided base.
The expanded set of terms can cover a lot of terms. By default, Lingo4G limits the maximum number of such "expansions" to 128. You can override the default limit, but this can lead to excessive memory use or slow query execution.
- Arguments
-
fn:​fuzzy​Term(glob max​Edits max​Expansions)
glob
- the baseline term.
max​Edits
- maximum number of edit operations for the transformed term to be considered equal (1 or 2).
max​Expansions
- maximum acceptable number of term expansions before the function fails. This is an optional parameter.
- Examples
-
-
fn:​fuzzy​Term(box)
The quick brown fox jumps over the lazy dog
-
fn:​or
Matches the disjunction of nested intervals.
- Arguments
-
fn:​or(sources...)
sources
- sub-intervals (terms or other functions)
- Examples
-
-
fn:​or(dog fox)
The quick brown fox jumps over the lazy dog
-
fn:​at​Least
Matches documents that contain at least the provided number of source intervals.
- Arguments
-
fn:​at​Least(min sources...)
min
- an integer specifying minimum number of sub-interval arguments that must match.
sources
- sub-intervals (terms or other functions)
- Examples
-
-
fn:​at​Least(2 quick fox "furry dog")
The quick brown fox jumps over the lazy dog
-
fn:​at​Least(2 fn:​unordered(furry dog) fn:​unordered(brown dog) lazy quick)
(This query results in multiple overlapping intervals.)
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
-
fn:​maxgaps
Accepts the source interval if it has at most the given number of position gaps.
- Arguments
-
fn:​maxgaps(gaps source)
gaps
- an integer specifying maximum number of source's position gaps.
source
- source sub-interval.
- Examples
-
-
fn:​maxgaps(0 fn:​ordered(fn:​or(quick lazy) fn:​or(fox dog)))
The quick brown fox jumps over the lazy dog
-
fn:​maxgaps(1 fn:​ordered(fn:​or(quick lazy) fn:​or(fox dog)))
The quick brown fox jumps over the lazy dog
-
fn:​maxwidth
Accepts the source interval if it has at most the given width (position span).
- Arguments
-
fn:​maxwidth(max source)
max
- an integer specifying maximum width of source's position span.
source
- source sub-interval.
- Examples
-
-
fn:​maxwidth(2 fn:​ordered(fn:​or(quick lazy) fn:​or(fox dog)))
The quick brown fox jumps over the lazy dog
-
fn:​maxwidth(3 fn:​ordered(fn:​or(quick lazy) fn:​or(fox dog)))
The quick brown fox jumps over the lazy dog
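The difference between width and gaps is easy to confuse, so the Python sketch below computes both for the fn:maxgaps and fn:maxwidth examples. It assumes that fn:ordered(fn:or(quick lazy) fn:or(fox dog)) yields two candidate intervals on the example document (quick to fox and lazy to dog) and that every word occupies one position; both are assumptions of this illustration.

```python
def width(interval):
    """Number of positions the interval spans, both ends included."""
    start, end = interval
    return end - start + 1


def gaps(interval, matched_positions=2):
    """Positions in the interval not occupied by the matched terms."""
    return width(interval) - matched_positions


tokens = "The quick brown fox jumps over the lazy dog".lower().split()
position = {token: i for i, token in enumerate(tokens, start=1)}

# Candidate intervals assumed for fn:ordered(fn:or(quick lazy) fn:or(fox dog)).
candidates = {
    "quick..fox": (position["quick"], position["fox"]),
    "lazy..dog": (position["lazy"], position["dog"]),
}

for limit in (2, 3):
    accepted = [name for name, i in candidates.items() if width(i) <= limit]
    print(f"maxwidth {limit}: {accepted}")

for limit in (0, 1):
    accepted = [name for name, i in candidates.items() if gaps(i) <= limit]
    print(f"maxgaps {limit}: {accepted}")
```

In this model the lazy to dog interval has width 2 and no gaps, while quick to fox has width 3 and one gap, so the stricter limit in each pair of examples accepts only the former.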
-
fn:​phrase
Matches an ordered, gapless sequence of source intervals.
- Arguments
-
fn:​phrase(sources...)
sources
- sub-intervals (terms or other functions)
- Examples
-
-
fn:​phrase(quick brown fox)
The quick brown fox jumps over the lazy dog
-
fn:​phrase(fn:​ordered(quick fox) jumps)
The quick brown fox jumps over the lazy dog
-
fn:​ordered
Matches an ordered span containing all source intervals, possibly with gaps in between their respective source interval positions. Source intervals must not overlap.
- Arguments
-
fn:​ordered(sources...)
sources
- sub-intervals (terms or other functions)
- Examples
-
-
fn:​ordered(quick jumps dog)
The quick brown fox jumps over the lazy dog
-
fn:​ordered(quick fn:​or(fox dog))
(Note that only the shorter match out of the two alternatives is included in the result; the algorithm is not required to return or highlight all matching interval alternatives.)
The quick brown fox jumps over the lazy dog
-
fn:​ordered(quick jumps fn:​or(fox dog))
The quick brown fox jumps over the lazy dog
-
fn:​ordered(fn:​phrase(brown fox) fn:​phrase(fox jumps))
(Sources overlap, no matches.)
The quick brown fox jumps over the lazy dog
-
fn:​unordered
Matches an unordered span containing all source intervals, possibly with gaps in between their respective source interval positions. Source intervals may overlap.
- Arguments
-
fn:​unordered(sources...)
sources
- sub-intervals (terms or other functions)
- Examples
-
-
fn:​unordered(dog jumps quick)
The quick brown fox jumps over the lazy dog
-
fn:​unordered(fn:​or(fox dog) quick)
(Note that only the shorter match out of the two alternatives is included in the result; the algorithm is not required to return or highlight all matching interval alternatives.)
The quick brown fox jumps over the lazy dog
-
fn:​unordered(fn:​phrase(brown fox) fn:​phrase(fox jumps))
The quick brown fox jumps over the lazy dog
-
fn:​unordered​No​Overlaps
Matches an unordered span containing two source intervals, possibly with gaps in between their respective source interval positions. Source intervals must not overlap.
Note that, unlike fn:unordered, this function takes a fixed number of arguments (two).
- Arguments
-
fn:​unordered​No​Overlaps(source1 source2)
source1
- sub-interval (term or other function)
source2
- sub-interval (term or other function)
- Examples
-
-
fn:​unordered​No​Overlaps(fn:​phrase(fox jumps) brown)
The quick brown fox jumps over the lazy dog
-
fn:​unordered​No​Overlaps(fn:​phrase(brown fox) fn:​phrase(fox jumps))
(Sources overlap, no matches.)
The quick brown fox jumps over the lazy dog
-
fn:​before
Matches intervals from the source that appear before intervals from the reference.
This is a filtering function; reference intervals will not be part of the match.
- Arguments
-
fn:​before(source reference)
source
- source sub-interval (term or other function)
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​before(fn:​or(brown lazy) fox)
The quick brown fox jumps over the lazy dog
-
fn:​before(fn:​or(brown lazy) fn:​or(dog fox))
The quick brown fox jumps over the lazy dog
-
fn:​after
Matches intervals from the source that appear after intervals from the reference.
This is a filtering function; reference intervals will not be part of the match.
- Arguments
-
fn:​after(source reference)
source
- source sub-interval (term or other function)
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​after(fn:​or(brown lazy) fox)
The quick brown fox jumps over the lazy dog
-
fn:​after(fn:​or(brown lazy) fn:​or(dog fox))
The quick brown fox jumps over the lazy dog
-
fn:​extend
Matches an interval around another source, extending its span by a number of positions before and after.
This is an advanced function that allows extending the left and right "context" of another interval.
- Arguments
-
fn:​extend(source before after)
source
- source sub-interval (term or other function)
before
- an integer number of positions to extend to the left of the source
after
- an integer number of positions to extend to the right of the source
- Examples
-
-
fn:​extend(fox 1 2)
The quick brown fox jumps over the lazy dog
-
fn:​extend(fn:​or(dog fox) 2 0)
The quick brown fox jumps over the lazy dog
-
fn:​within
Matches intervals of the source that appear within the provided number of positions from the intervals of the reference.
- Arguments
-
fn:​within(source positions reference)
source
- source sub-interval (term or other function)
positions
- an integer specifying the maximum number of positions between source and reference
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​within(fn:​or(fox dog) 1 fn:​or(quick lazy))
The quick brown fox jumps over the lazy dog
-
fn:​within(fn:​or(fox dog) 2 fn:​or(quick lazy))
The quick brown fox jumps over the lazy dog
-
fn:​not​Within
Matches intervals of the source that do not appear within the provided number of positions from the intervals of the reference.
- Arguments
-
fn:​not​Within(source positions reference)
source
- source sub-interval (term or other function)
positions
- an integer specifying the maximum number of positions between source and reference
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​not​Within(fn:​or(fox dog) 1 fn:​or(quick lazy))
The quick brown fox jumps over the lazy dog
-
fn:​contained​By
Matches intervals of the source that are contained by intervals of the reference.
- Arguments
-
fn:​contained​By(source reference)
source
- source sub-interval (term or other function)
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​contained​By(fn:​or(fox dog) fn:​ordered(quick lazy))
The quick brown fox jumps over the lazy dog
-
fn:​contained​By(fn:​or(fox dog) fn:​extend(lazy 3 3))
The quick brown fox jumps over the lazy dog
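The interplay of fn:extend and fn:containedBy can be sketched in Python as follows. The model is simplified for illustration: intervals are (start, end) pairs of 1-based positions, every word occupies one position, the left edge is clamped at the first position, and all helper names are made up for this example.

```python
def extend(interval, before, after):
    """Grow an interval by `before` positions on the left and `after` on the right."""
    start, end = interval
    return (max(1, start - before), end + after)  # clamped at position 1 here


def contained_by(sources, references):
    """Keep source intervals lying entirely inside some reference interval."""
    return [
        (s_start, s_end)
        for s_start, s_end in sources
        if any(r_start <= s_start and s_end <= r_end for r_start, r_end in references)
    ]


tokens = "The quick brown fox jumps over the lazy dog".lower().split()


def term(word):
    """Single-position intervals for every occurrence of a word."""
    return [(p, p) for p, token in enumerate(tokens, start=1) if token == word]


def show(interval):
    """Text covered by an interval."""
    start, end = interval
    return " ".join(tokens[start - 1:end])


# fn:extend(fox 1 2)
print([show(extend(i, 1, 2)) for i in term("fox")])  # prints ['brown fox jumps over']

# fn:containedBy(fn:or(fox dog) fn:extend(lazy 3 3))
context = [extend(i, 3, 3) for i in term("lazy")]
print([show(i) for i in contained_by(term("fox") + term("dog"), context)])  # prints ['dog']
```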
-
fn:​not​Contained​By
Matches intervals of the source that are not contained by intervals of the reference.
- Arguments
-
fn:​not​Contained​By(source reference)
source
- source sub-interval (term or other function)
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​not​Contained​By(fn:​or(fox dog) fn:​ordered(quick lazy))
The quick brown fox jumps over the lazy dog
-
fn:​not​Contained​By(fn:​or(fox dog) fn:​extend(lazy 3 3))
The quick brown fox jumps over the lazy dog
-
fn:​containing
Matches intervals of the source that contain at least one interval of the reference.
- Arguments
-
fn:​containing(source reference)
source
- source sub-interval (term or other function)
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​containing(fn:​extend(fn:​or(lazy brown) 1 1) fn:​or(fox dog))
The quick brown fox jumps over the lazy dog
-
fn:​containing(fn:​at​Least(2 quick fox dog) jumps)
The quick brown fox jumps over the lazy dog
-
fn:​not​Containing
Matches intervals of the source that do not contain any intervals of the reference.
- Arguments
-
fn:​not​Containing(source reference)
source
- source sub-interval (term or other function)
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​not​Containing(fn:​extend(fn:​or(fox dog) 1 0) fn:​or(brown yellow))
The quick brown fox jumps over the lazy dog
-
fn:​not​Containing(fn:​ordered(fn:​or(the ​The) fn:​or(fox dog)) brown)
The quick brown fox jumps over the lazy dog
-
fn:​overlapping
Matches intervals of the source that overlap with at least one interval of the reference.
- Arguments
-
fn:​overlapping(source reference)
source
- source sub-interval (term or other function)
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​overlapping(fn:​phrase(brown fox) fn:​phrase(fox jumps))
The quick brown fox jumps over the lazy dog
-
fn:​overlapping(fn:​or(fox dog) fn:​extend(lazy 2 2))
The quick brown fox jumps over the lazy dog
-
fn:​non​Overlapping
Matches intervals of the source that do not overlap with any intervals of the reference.
- Arguments
-
fn:​non​Overlapping(source reference)
source
- source sub-interval (term or other function)
reference
- reference sub-interval (term or other function)
- Examples
-
-
fn:​non​Overlapping(fn:​phrase(brown fox) fn:​phrase(lazy dog))
The quick brown fox jumps over the lazy dog
-
fn:​non​Overlapping(fn:​or(fox dog) fn:​extend(lazy 2 2))
The quick brown fox jumps over the lazy dog
-
default​Fields
An array of field names to search for query terms without an explicit field name qualifier. For example, the data title:mining query contains one unqualified term, data, and one with an explicit field qualifier, title:mining. If defaultFields was equal to ["title", "abstract"], Lingo4G would rewrite the query to (title:data OR abstract:data) title:mining.
If you do not provide defaultFields or set it to an empty array, Lingo4G will raise errors for queries containing terms without explicit field qualifiers.
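The rewriting described above can be sketched in Python. This is a toy illustration, not how Lingo4G parses queries: it splits the query on white space only (so it ignores quotes, parentheses and operators), and the function name apply_default_fields is hypothetical.

```python
def apply_default_fields(query: str, default_fields: list) -> str:
    """Expand unqualified terms over the default fields (whitespace-split only)."""
    rewritten = []
    for clause in query.split():
        if ":" in clause:
            rewritten.append(clause)  # already has a field qualifier
        elif not default_fields:
            raise ValueError(f"no default fields for unqualified term {clause!r}")
        else:
            alternatives = " OR ".join(f"{field}:{clause}" for field in default_fields)
            rewritten.append(f"({alternatives})")
    return " ".join(rewritten)


print(apply_default_fields("data title:mining", ["title", "abstract"]))
# prints (title:data OR abstract:data) title:mining
```

With an empty list of default fields the sketch raises an error for the unqualified term, mirroring the behavior described above.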
default​Operator
The default Boolean operator Lingo4G applies to each clause of the query, unless you explicitly provide the operator to use.
For example, with the defaultOperator equal to AND, Lingo4G rewrites the data mining query to data AND mining.
The default​Operator
property supports the following values:
A​N​D
-
Conjunction operator.
O​R
-
Disjunction operator.
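As a rough illustration of how the default operator fills in missing operators, consider the Python sketch below. It is a toy model that splits the query on white space and ignores quotes and parentheses; the function name apply_default_operator is an assumption of this example.

```python
OPERATORS = {"AND", "OR", "NOT"}


def apply_default_operator(query: str, default_operator: str = "AND") -> str:
    """Insert the default operator between adjacent clauses that have none."""
    if default_operator not in ("AND", "OR"):
        raise ValueError("defaultOperator must be AND or OR")
    result = []
    for token in query.split():
        # Two clauses in a row without an operator between them.
        if result and token not in OPERATORS and result[-1] not in OPERATORS:
            result.append(default_operator)
        result.append(token)
    return " ".join(result)


print(apply_default_operator("data mining"))                # prints data AND mining
print(apply_default_operator("data OR text mining", "OR"))  # prints data OR text OR mining
```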
sanitize​Spaces
Before parsing the query, Lingo4G replaces each occurrence of the regular expression pattern you provide in the sanitizeSpaces property with a single space character. The default pattern normalizes any sequence of Unicode white space characters into one plain space. To disable the replacement, set sanitizeSpaces to an empty string.
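The effect of the replacement can be approximated in Python. Python's re module does not support the Java-specific (?U) flag or the \p{Blank} class used by the default pattern, so the character class below (Unicode white space other than line terminators) is an assumption of this sketch, as is the function name sanitize_spaces.

```python
import re

# Approximation of the default Java pattern (?U)\p{Blank}+ for Python's re:
# any run of Unicode white space that is not a line terminator.
DEFAULT_PATTERN = r"[^\S\n\x0b\x0c\r\x85\u2028\u2029]+"


def sanitize_spaces(query: str, pattern: str = DEFAULT_PATTERN) -> str:
    """Replace every match of the pattern with one space; an empty pattern disables it."""
    if pattern == "":
        return query
    return re.sub(pattern, " ", query)


# A no-break space, a tab and two plain spaces collapse into one space.
print(sanitize_spaces("data\u00a0\t  mining"))  # prints data mining
```

Note that in the JSON project descriptor the backslash of the pattern has to be doubled, as in the "(?U)\\p{Blank}+" default shown earlier.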
validate​Fields
If true, Lingo4G raises an error if the query contains a field name qualifier referring to a field that does not exist in the index. Field name validation ensures that accidental typos in field names result in errors rather than empty search results.