OpenSearch Queries: Query DSL and Beyond

0
52
Lines of code on a black screen

OpenSearch has evolved rapidly since its fork from the source code of the last truly open source version of Elasticsearch.

So far, the community’s work has focused on removing proprietary code from Elastic, including a number of things that were never purely open source themselves. These include some aspects of the querying languages and capabilities of Elasticsearch. But given OpenSearch’s commitment to being, well, open, there has been effort to replace those elements of the ELK Stack with something new.

As a result, OpenSearch queries now look both very familiar and also fairly unique. There are new elements to talk about in addition to carryover from Lucene and Query DSL. For more information on these, look at our introduction to Elasticsearch queries.

For starters, the primary language for OpenSearch queries is Query DSL. However, there are also two plugins that allow you to use two other querying languages: SQL and PPL (Piped Processing Language).

PPL was created by AWS for Open Distro for Elasticsearch. As Open Distro rolls over now into OpenSearch, PPL’s utility is inherently intertwined with OpenSearch applications.

Data Types and Equivalents: Query DSL, PPL, SQL

OpenSearch includes the Query Workbench (within OpenSearch Dashboards), where you can switch to SQL or PPL via plugins, translate SQL to REST, as well as query data in the OpenSearch Dashboards UI. You can also download data in CSV, JSON, text, Java Database Connectivity (JDBC), and Open Database Connectivity (ODBC) formats.

The Query Workbench is actually a part of the SQL plugin, which we will discuss below in greater detail.

The following table compares the three main querying language types in OpenSearch:

[Sourced  from https://opensearch.org/docs/search-plugins/ppl/datatypes/ ]

PPL Type OpenSearch Type SQL Type
boolean boolean BOOLEAN
byte byte TINYINT
byte short SMALLINT
integer integer INTEGER
long long BIGINT
float float REAL
float half_float FLOAT
float scaled_float DOUBLE
double double DOUBLE
string keyword VARCHAR
text text VARCHAR
timestamp date TIMESTAMP
ip ip VARCHAR
timestamp date TIMESTAMP
binary binary VARBINARY
struct object STRUCT
array nested STRUCT

Query DSL in OpenSearch

OpenSearch docs break down Query DSL into three categories: full-text queries, term-level queries, and boolean queries. We covered Query DSL extensively in our Elasticsearch queries introduction. Here, it suffices to say that boolean queries cover must, must_not, should, and filter requests. Let’s go into slightly more detail on the other two categories:

Term-Level Queries

Term-level queries are made up of the following list of options: term, terms, IDs, range, prefix, exists, wildcard, and regexp. You can find examples for Query DSL term-level searches in our Elasticsearch queries tutorial. Here is an example of prefix, though:

Prefix

This query looks for any hits with a specified prefix:

GET shakespeare/_search
{
  "query":Hello {
    "prefix": {
      "speaker": "KING"
    }
  }
}

Full-Text Queries

OpenSearch full-text queries are a little more expansive and cover a lot of options. The list includes match, multi_match, match_phrase, match_phrase_prefix, match_bool_prefix, common, query_string, simple_query_string, match_all, and match_none.

There are also a number of optional query parameters for full-text OpenSearch Queries:

allow_leading_wildcard
analyze_wildcard
analyzer
auto_generate_synonyms_phrase_query
boost
cutoff_frequency
flags
fuzziness
fuzzy_transpositions
fields
enable_position_increments
low_freq_operator
max_determined_states
max_expansions
minimum_should_match
operator (and or or)
phrase_slop
prefix_length
quote_field_suffix
rewrite
slop
tie_breaker
time_zone
type
zero_terms_query

Take for example fuzziness.

Fuzziness instructs the query on how flexible it should be in searching for values. It instructs the limit on edits to a result in order to get it back to the original query. For example, it would take 1 character change to get from “desert” to “dessert.” Those changes can be deletions, insertions, or replacements (so you would count a letter switch as a single edit, not 2 with a deletion and an insertion). Settings can include values AUTO, 1, or any other positive value.

GET _search
{
  "query": {
    "match": {
      "title": {
        "query": "desert",
        "fuzziness": "AUTO",
      }
    }
  }
}

This dictates the number of character edits (insert, delete, substitute) that it takes to change one word to another when determining whether a term matches a certain value. For example, the distance between wined and wind is 1. The default, AUTO, chooses a value based on the length of each term and is a good choice for most use cases.

Piped Processing Language (PPL) Queries and Syntax

OpenSearch carries over Piped Processing Language (PPL) from OpenDistro for Elasticsearch . Syntax uses the pipe marker ( | ) to separate query commands. The platform release comes with a PPL CLI built-in as well as a way to save JSON or text results.

There are also only a few commands to use with the language.

There are also only a few commands to use with the language.

Search

Syntax starts with the search command, followed by the source and any other expressions you need to add:

search source=<<index>> [some bool-expression]
search command 1 | command 2 | command 3

All the following queries are placed after the first clause and pipe: search source=<<index>> |

Head

This is a pretty important one: specify the number of results you want:

search source=voters | fields state | head 5

Dedup

This critical feature – data deduplication – removes duplicate documents from results (when they come up for different fields).

dedup [int] <<field>> [keepempty=<<true/false>>] [consecutive=<<true/false>>]

The keepempty parameter tells the query to keep results with empty fields.

You can also input multiple fields at a time:

search source=voters | dedup 3 age keepempty=true | fields age, state, phone number;
Age State Phone Number
87 NJ 555-1235
87 PA 555-2347
87 CT 555-3467
86 NJ 555-7654
86 NY 555-9876
86 NC 555-8642

Eval

The eval query makes quick math of your results. Input an expression within the query based on the results it returns. The final results will include the answer to the expression.

In this example, assume you’re looking at data on an election from six months ago (.5 years). For some reason, you want to update the ages in the results to be contemporary:

search source=voters | eval age = age + .5 | fields age;

Fields

Add or remove fields using + or - symbols:

search source=voters | fields age, state, city | fields - state;

Rename

To display a certain result with an alternate label, use rename to change how a field’s results display. In this example, the field was accidentally named Republican instead of Party (Republican is one of the values). To correct the error:

search source=candidates | rename republican as party | fields party;

Sort

This will sort results in a defined field either in ascending order (default) or descending order (marked by -)

search source=candidates | sort age | fields age, party, state, city, office;
Age Party State City Office
35 Republican NJ New Brunswick Mayor
47 Democrat NJ New Brunswick Mayor
52 Independent NJ New Brunswick Mayor
67 Green NJ New Brunswick Mayor
search source=candidates | sort - age | fields age, party, state, city, office;
Age Party State City Office
67 Green NJ New Brunswick Mayor
52 Independent NJ New Brunswick Mayor
47 Democrat NJ New Brunswick Mayor
35 Republican NJ New Brunswick Mayor

Also use sort for two different fields simultaneously:

search source=candidates | sort + age - sex | fields age, sex, party, state, city, office;
Age Sex Party State City Office
43 M Blue NJ Highland Park Mayor
44 M Codependent Party NJ Highland Park Mayor
47 F Demublican NJ Highland Park Mayor
75 F Repubricat NJ Highland Park Mayor

Stats

Calculate stats using the five optional functions: AVG, MAX, MIN, SUM, COUNT. The following example will give you the highest five average voter ages among the 50 US states:

search source=voters | stats avg(age) by state; | head 5;
State avg(age)
Florida 55.70
California 53.33
Arizona 52.89
New York 49.25
Utah 48.00

Where

Limit results by very specific values:

search source=candidates | where party=republican and gender =”F” | fields party, gender;

Rare

Find the least common values from a given list of fields with the rare query.

search source=candidates | rare age by party | fields age, party;

Result:

Age Party
17 Republican
104 Democrat

Top                                   ,

Finds the most common fields in a list, with the top result having the most hits:

top N <<field-list>> <<by-clause>>
search source=voters | top 5 age by state

Result:

Age State
85 Florida
84 Florida
84 California
83 Texas
83 Florida

Plugin: OpenSearch SQL

The OpenSearch SQL plugin supports complex SQL queries that include things like JOINs, subqueries and more. It supports the Query Workbench in OpenSearch Dashboards, an SQL CLI, and other functions.

The order of operations follows this (complete) syntax layout:

SELECT [DISTINCT] (* | expression) [[AS] alias] [, ...]
FROM index_or_table_name
[WHERE predicates]
[GROUP BY <<column_name>> [, ...]
 [HAVING predicates]]
[ORDER BY <<column_name>> [IS [NOT] NULL] [ASC | DESC] [, ...]]
[LIMIT [offset, ] size];

Summary

This post encompasses a brief overview of OpenSearch queries, highlighting some of the differences with Elasticsearch (which relies on Query DSL), as well as providing some additional info on Piped Processing Language whose Open Distro origins make it a critical component in the AWS-spearheaded OpenSearch project.

___

This post was originally published on the Logz.io blog.

Logz.io logo. Blue open block with Logz.io in blue text, on orange background