Review: Elasticsearch 7 soars with SQL, search optimizations
- 02 October, 2019 20:00
Elasticsearch started life as a document database sitting atop the Lucene text search engine library. It was soon joined by related applications, and the preferred acronym for the Elasticsearch family of products was ELK: Elasticsearch; Logstash, the data pipelining tool, principally used to hoover data from logging into an Elasticsearch database; and Kibana, the data visualization construction kit.
The ELK trio has since been joined by a small platoon of “data shipper” utilities: the Beats products. Similar to Logstash, the Beats products move data from an outside source into an Elasticsearch database. They differ in the source of the shipped data. Filebeat is designed to read and forward the contents of log files (like Logstash, but without Logstash’s transformation and aggregation capabilities). Metricbeat reads system metric data gathered from Windows, Mac, or Linux hosts. Metricbeat can also gather enterprise application metrics from Microsoft SQL Server, MySQL, PostgreSQL, and other sources.
The Beats group is a lengthy list of sibling products; you can find the full family of Beats on the Elastic website. Similarly, the features and product updates that have appeared with the roll-out of Elastic Stack 7.x make up a lengthy list, one that could occupy several articles. While there is much to be said about all of the updates to the various components in the Elastic Stack 7.x release, this article will focus principally on the enhancements and improvements made to the stack’s cornerstone: Elasticsearch itself.
Sooner or later, this had to happen, right? The Elasticsearch engineers had to take a stab at grafting SQL onto Elasticsearch. There are just too many database technologists, daily speaking SQL, to ignore. An SQL query engine would pave a familiar path into the Elasticsearch realm for RDBMS users. Ironically, this idea of a friendly path to Elasticsearch is echoed in a particular feature of the product’s SQL query engine: Enter the SQL, and the query engine will show you the equivalent Elasticsearch query into which that SQL is translated.
To be clear, it is not the case that Elasticsearch supports all of any particular SQL standard (of which there are several). Mostly, you can search, meaning you can issue SELECT and DESCRIBE commands. The latter, DESCRIBE, displays the schema of a particular index.
Pretty quickly, the warps in this new reality begin to show: There is no schema in a schema-less Elasticsearch database. So, how does a SELECT work? And how is it that the Elasticsearch SQL online documentation includes words otherwise unheard of when one speaks of Elasticsearch, words like “table” and “column”?
Necessarily, Elasticsearch queries dance on both sides of the event horizon that separates the relational database slice of the universe from the NoSQL database slice. So, when SQL comes to Elasticsearch, an index becomes a table and a field becomes a column. This renaming is necessary so that anyone familiar with, say, the .schema command in SQLite will feel right at home. And the output of a DESCRIBE command (an ASCII table showing columns, SQL type, and corresponding Elasticsearch data type) will be comfortably familiar. It looks like a schema.
(Of course, a field can’t simply slip into the role of a column. Elasticsearch allows object nesting in a document—fields within fields. Luckily, the Elasticsearch SQL output of an index schema handles this the same way that Elasticsearch handles the situation under the hood: dot notation is used to “flatten” nested fields.)
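The flattening described above can be pictured with a short Python sketch. It is only an illustration of the dot-notation idea, not Elasticsearch's own code, and the sample document and its field names are made up for the example.

```python
from typing import Any, Dict


def flatten_fields(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested objects into dotted names, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # A nested object contributes one "column" per leaf field.
            flat.update(flatten_fields(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical document with fields within fields.
employee = {"name": "Ada", "address": {"city": "London", "geo": {"lat": 51.5, "lon": -0.12}}}
columns = flatten_fields(employee)
```

Here the nested address object surfaces as the columns address.city, address.geo.lat, and address.geo.lon, which is the shape a DESCRIBE listing would present.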
As already implied, you cannot DELETE. Nor can you perform UNIONs or nested SELECTs. But you can employ Elasticsearch’s aggregation and statistical operators, and you can retrieve query scores. So, while Elasticsearch’s SQL might appear anemic at first glance, it’s actually quite robust, particularly in that it fulfills the goal of giving RDBMS engineers a much easier transition into Elasticsearch.
Finally, while SQL queries can be submitted via Elasticsearch’s REST API, both ODBC and JDBC drivers are now available for enabling Elasticsearch SQL queries in Windows and Java applications.
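As a rough sketch of the REST route, the Python below builds (but does not send) two requests: one that runs a SQL statement through the _sql endpoint, and one that asks the _sql/translate endpoint for the equivalent Elasticsearch query. The host, the "library" index, and its column names are assumptions chosen for illustration.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured development node


def sql_request(query: str, translate: bool = False) -> Request:
    """Build (but do not send) a POST request for the Elasticsearch SQL REST API."""
    # format=txt asks for the ASCII-table rendering of the result set.
    path = "/_sql/translate" if translate else "/_sql?format=txt"
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(ES_URL + path, data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


# "library" is a hypothetical index; in SQL terms it plays the role of a table.
run = sql_request("SELECT author, title FROM library ORDER BY page_count DESC LIMIT 5")
explain = sql_request("SELECT author FROM library WHERE page_count > 500", translate=True)
```

Passing either object to urllib.request.urlopen would send it to a running node; the translate variant should return the query DSL rather than rows.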
Replica selection in Elasticsearch
Some improvements in Elasticsearch 7.x are what one might call “infrastructure upgrades.” Adaptive replica selection was experimental in Elasticsearch 6.x; it comes to maturity in Elasticsearch 7. Its purpose is to combat the unfortunate fact that, when a search was being carried out, the round-robin selection of a replica node might land on a node that was busy performing a garbage collection or suffering some other performance degradation like high disk I/O or network saturation. It might even be that the selected node was simply running on less-capable hardware (as compared to its sibling replica nodes).
The node that receives a query from a user application is called the “coordinator” node. The coordinator disperses the query to all the relevant data nodes, which perform the actual database search. Adaptive replica selection provides performance stats of the data nodes as “seen” by the coordinator node. The coordinator is able to use these stats to select optimal nodes in a given replica set, thereby maximizing the likelihood that the portion of a query bound for a particular replica set receives the fastest response possible.
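To make the idea concrete, here is a deliberately simplified Python sketch of a coordinator that remembers how each data node has been behaving and prefers the healthier copy. The scoring rule (smoothed response time multiplied by outstanding requests) is an assumption made for the example; the formula Elasticsearch actually uses is more involved.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class NodeStats:
    """What a coordinator has observed about one data node (illustrative fields)."""
    ewma_response_ms: float = 0.0   # smoothed response time seen by the coordinator
    outstanding: int = 0            # searches sent to the node and not yet answered


class Coordinator:
    def __init__(self, alpha: float = 0.3) -> None:
        self.alpha = alpha          # weight given to the newest observation
        self.stats: Dict[str, NodeStats] = {}

    def record_response(self, node: str, elapsed_ms: float) -> None:
        s = self.stats.setdefault(node, NodeStats())
        if s.ewma_response_ms == 0.0:
            s.ewma_response_ms = elapsed_ms
        else:
            s.ewma_response_ms = self.alpha * elapsed_ms + (1 - self.alpha) * s.ewma_response_ms

    def pick_replica(self, replicas: Iterable[str]) -> str:
        """Choose the copy with the lowest score; unseen nodes score 0 and are tried first."""
        def score(node: str) -> float:
            s = self.stats.get(node, NodeStats())
            return s.ewma_response_ms * (1 + s.outstanding)
        return min(replicas, key=score)


coord = Coordinator()
for ms in (12, 15, 11):
    coord.record_response("node-a", ms)
for ms in (14, 250, 300):           # node-b hits a long garbage-collection pause
    coord.record_response("node-b", ms)
choice = coord.pick_replica(["node-a", "node-b"])
```

With plain round-robin, half of the searches would still land on the struggling node-b; with the observed stats in hand, the coordinator steers them to node-a.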
(Side note: A shard is a partition of an index, a subset of the index elements. The index is divided into shards so that it can be distributed across cluster nodes. For fault tolerance, a given shard is replicated on more than one node. The set of nodes that host a copy of a given shard is a replica set, and its members are called replicas.)
Cluster coordination is another area where change has been significant and ongoing (in fact, there have been changes since the first Elasticsearch 7.x release). “Discovery” is the process that a cluster uses to identify all of its member nodes, as well as to determine which master-eligible node is cluster master. Discovery must be quick and efficient; an incomplete node membership runs the risk of creating a “split brain,” i.e. two clusters that should be a single cluster. (The longer they are separated, the farther apart their content grows.) And determining a cluster’s master node quickly is of prime importance. Not only is the master the source of truth for the cluster’s current configuration and state, but the master is the only node that can change cluster state. The master node is responsible for pinging all cluster members to verify cluster health.
In the past, Elasticsearch used the Zen discovery mechanism, which supports multiple forms of discovery (unicast, file-based, etc.). Over time, Elasticsearch engineers learned that it was critical to properly configure an Elasticsearch cluster’s discovery parameters. They witnessed many misconfigurations that ultimately blossomed into instabilities. As a result, Elasticsearch has migrated to a single discovery implementation that removes the instability risk. Cluster administrators need no longer tweak configuration parameters and hope that the latest tweaking won’t run cluster performance into a ditch. (There are, however, “discovery plug-ins” that can be used to enhance the discovery process, depending on a given instance’s environment. A list of discovery plug-ins can be found at the Elastic website.)
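A common guard against split brain in distributed systems is to let a group of nodes elect a master only if it can see a strict majority of the master-eligible nodes. The Python sketch below illustrates that arithmetic in general terms; it is not presented as Elasticsearch's coordination algorithm, and the node names are invented.

```python
from typing import Set


def has_quorum(reachable: Set[str], master_eligible: Set[str]) -> bool:
    """True when a partition can see a strict majority of the master-eligible nodes."""
    votes = len(reachable & master_eligible)
    return votes > len(master_eligible) // 2


eligible = {"m1", "m2", "m3"}
side_a = {"m1", "m2", "data-1"}    # one side of a network partition
side_b = {"m3", "data-2"}          # the other side

can_elect_a = has_quorum(side_a, eligible)
can_elect_b = has_quorum(side_b, eligible)
```

Because two disjoint groups cannot both hold a majority, at most one side of a partition can end up with a master.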
Cross-cluster search in Elasticsearch
In previous releases, if you wanted a search to span multiple clusters, you elected specific nodes in participating clusters to be “tribe” nodes. In effect, a tribe node was the member of more than one cluster (think dual or multiple citizenship). So, a query sent to a tribe node—which would become the coordinating node for that query—would extend across all the clusters for which that tribe node was a member.
One weakness in this approach appeared when two clusters to which the tribe node belonged each had an index with the same name: The tribe node simply chose one of the two when fulfilling a query request. Now, cross-cluster search allows you to elect specific nodes as “gateway” nodes, which can receive query requests originating outside the cluster; no more dual citizenship. In addition, when you want to query the index of a remote cluster, you prefix the index with the cluster name, something along the lines of <cluster_name>:<index>. Thus the need for a tribe node to arbitrarily choose among identically named indexes has been eliminated.
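A small Python sketch shows how the prefix keeps identically named indexes apart in a single search request. The cluster aliases "europe" and "asia" and the index name "logs" are hypothetical.

```python
from typing import List, Optional
from urllib.parse import quote


def qualified_index(index: str, cluster: Optional[str] = None) -> str:
    """Return "cluster:index" for a remote index, or the bare name for a local one."""
    return f"{cluster}:{index}" if cluster else index


def search_path(targets: List[str]) -> str:
    """Build the path of a _search request spanning several (possibly remote) indexes."""
    return "/" + quote(",".join(targets), safe=",:") + "/_search"


targets = [
    qualified_index("logs"),                      # index on the local cluster
    qualified_index("logs", cluster="europe"),    # same name on a remote cluster
    qualified_index("logs", cluster="asia"),
]
path = search_path(targets)
```

The resulting path, /logs,europe:logs,asia:logs/_search, names three distinct indexes even though all of them are called logs.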
Effectively, an Elasticsearch gateway node acts just like a coordinator node. In another improvement, designed to minimize CPU and network pressure (and thereby reduce round-trip query time), an Elasticsearch cluster responding to a remote query first performs its own query results reduction phase before returning those results to the caller.
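The benefit of reducing results before returning them can be sketched in Python as follows. This is a toy model of the idea, with made-up scores and document ids, not a description of the actual wire protocol.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document id)


def reduce_locally(shard_hits: List[List[Hit]], k: int) -> List[Hit]:
    """Reduction run inside one cluster: keep only its k best hits across all shards."""
    return heapq.nlargest(k, (hit for shard in shard_hits for hit in shard))


def merge_clusters(per_cluster: Dict[str, List[Hit]], k: int) -> List[Hit]:
    """Final merge on the caller's side; ids are prefixed with the cluster name."""
    tagged = ((score, f"{cluster}:{doc}") for cluster, hits in per_cluster.items()
              for score, doc in hits)
    return heapq.nlargest(k, tagged)


k = 2
europe = reduce_locally([[(0.9, "e1"), (0.2, "e2")], [(0.7, "e3"), (0.1, "e4")]], k)
asia = reduce_locally([[(0.8, "a1")], [(0.3, "a2"), (0.6, "a3")]], k)
top = merge_clusters({"europe": europe, "asia": asia}, k)
```

Each remote cluster ships only k hits rather than every shard's candidates, so the caller has far less data to receive and merge.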
Elasticsearch query improvements
Some new features expand Elasticsearch’s already extensive query capabilities. For example, the new rank_feature and rank_features field types, plus the new “rank feature” query, provide relevance tuning, an activity that is otherwise difficult to do without a large training set and the manpower to comb through the results of the training. These new field types, which appear to be ordinary float value data fields but are engineered so that they can be added to a document’s score, let you tweak document relevance within specific queries. You might, for example, want to capture a value for a document’s popularity in a rank_feature field and have that incorporated into the document’s score for its query relevancy.
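The popularity example might be expressed along these lines. The Python below only assembles the JSON bodies for a mapping and a search request; the "articles" index and its field names are hypothetical.

```python
import json

# Mapping for a hypothetical "articles" index with a rank_feature field.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "popularity": {"type": "rank_feature"},
        }
    }
}

# The match clause finds documents; the rank_feature clause under "should"
# adds a popularity-based contribution to each matching document's score.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "elasticsearch"}}],
            "should": [{"rank_feature": {"field": "popularity"}}],
        }
    }
}

mapping_body = json.dumps(mapping)
query_body = json.dumps(query)
```

The mapping body would be sent when the index is created and the query body to the index's _search endpoint.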
In addition, these fields can be employed in search optimizations referred to as “top-k retrieval optimizations,” which are also new in Elasticsearch 7. In a nutshell, these optimizations allow Elasticsearch to fetch the requested subset of a query’s most relevant documents (hence, “top-k”) without having to crawl through all applicable search results. The explanation of these optimizations is dense, but interested readers are encouraged to read the Elastic blog on the topic.
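The core intuition can be shown with a toy Python sketch: if an upper bound on a document's score is known in advance, any document whose bound cannot beat the current k-th best hit can be skipped without being scored. This is a heavy simplification written for illustration, not the Lucene implementation, and the scores and bounds are invented.

```python
import heapq
from typing import Callable, List, Tuple


def top_k(docs: List[Tuple[str, float]], score: Callable[[str], float], k: int):
    """docs holds (doc_id, upper_bound) pairs; upper_bound >= score(doc_id) is assumed."""
    heap: List[Tuple[float, str]] = []   # min-heap of the k best (score, doc_id) so far
    scored = 0
    for doc_id, upper_bound in docs:
        # Skip the expensive scoring step when the document cannot enter the top k.
        if len(heap) == k and upper_bound <= heap[0][0]:
            continue
        s = score(doc_id)
        scored += 1
        if len(heap) < k:
            heapq.heappush(heap, (s, doc_id))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc_id))
    return sorted(heap, reverse=True), scored


true_scores = {"d1": 5.0, "d2": 9.0, "d3": 1.0, "d4": 2.5, "d5": 8.0, "d6": 0.5}
bounds = [(d, s + 1.0) for d, s in true_scores.items()]   # toy upper bounds
best, scored = top_k(bounds, true_scores.__getitem__, k=2)
```

In this run only three of the six documents are actually scored, yet the two best hits (d2 and d5) are still found.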
Another new query type that benefits from the “top-k retrieval optimizations” is the new “distance feature” query. As its name implies, this query type lets you score results based on proximity to an origin in space or time. Field types that can participate in a distance query are date, date_nanos, and geo_point types. So, the size of a distance between a document’s space or time-type field and an origin can increase (or reduce) that document’s query score.
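A brief Python sketch of both flavors follows. The query bodies use hypothetical field names; the helper reproduces the scoring rule documented for this query type, boost * pivot / (pivot + distance), where the pivot is the distance at which the contribution falls to half of the boost.

```python
def distance_feature_score(distance: float, pivot: float, boost: float = 1.0) -> float:
    """Score contribution: boost * pivot / (pivot + distance)."""
    return boost * pivot / (pivot + distance)


# Favor recently published documents (field names are hypothetical).
recent_first = {
    "query": {"bool": {
        "must": {"match": {"title": "elasticsearch"}},
        "should": {"distance_feature": {"field": "published", "origin": "now", "pivot": "7d"}},
    }}
}

# Favor documents located near a given point.
nearby_first = {
    "query": {"bool": {
        "must": {"match": {"name": "coffee"}},
        "should": {"distance_feature": {"field": "location", "origin": [-71.3, 41.15], "pivot": "1000m"}},
    }}
}

at_origin = distance_feature_score(distance=0, pivot=7)   # document dated exactly "now"
at_pivot = distance_feature_score(distance=7, pivot=7)    # document seven days old
```

A document at the origin receives the full boost, while one exactly a pivot away receives half of it.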
High Level REST Client
On the development side of the ledger, while the Java High Level REST Client appeared in an Elasticsearch 6.0 beta release, it was finally declared complete in the Elasticsearch 7.0 release. In this case, being complete means that the High Level REST Client can now be used instead of the Transport client (which it is meant to replace).
The Java High Level REST Client sits atop the Java Low Level REST Client, which can be used to communicate with an Elasticsearch cluster via HTTP. However, the Low Level REST Client leaves request marshalling and response unmarshalling to the user. Meanwhile, the High Level REST Client represents requests and responses as objects. This allows the High Level REST Client to expose specific APIs, rather than requiring callers to interact with generic HTTP method calls. And the High Level REST Client deals with object marshalling and unmarshalling so developers don’t have to.
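The difference between the two styles can be pictured with a Python analogy. Everything here is hypothetical: the classes are not the Java client's API, and fake_transport merely stands in for an HTTP round trip by returning a canned response.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


def fake_transport(method: str, path: str, body: str) -> str:
    """Stand-in for an HTTP round trip; returns a canned JSON search response."""
    return json.dumps({"hits": {"total": {"value": 1},
                                "hits": [{"_id": "1", "_score": 1.3,
                                          "_source": {"title": "ES 7"}}]}})


# Low-level style: the caller marshals the request and unmarshals the response.
raw = fake_transport("POST", "/articles/_search",
                     json.dumps({"query": {"match": {"title": "es"}}}))
first_id_low = json.loads(raw)["hits"]["hits"][0]["_id"]


# High-level style: requests and responses are objects; marshalling is hidden.
@dataclass
class SearchRequest:
    index: str
    query: Dict[str, Any]


@dataclass
class SearchResponse:
    total: int
    ids: List[str]


def search(request: SearchRequest) -> SearchResponse:
    body = json.dumps({"query": request.query})
    parsed = json.loads(fake_transport("POST", f"/{request.index}/_search", body))["hits"]
    return SearchResponse(total=parsed["total"]["value"],
                          ids=[hit["_id"] for hit in parsed["hits"]])


response = search(SearchRequest(index="articles", query={"match": {"title": "es"}}))
```

In the second style the caller never touches a JSON string, which is the convenience the High Level REST Client aims to provide.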
Finally, it is worthwhile to pause a moment and discuss Kibana, because it has grown from being just a data visualization tool to a kind of Elasticsearch uber-dashboard. Kibana was originally a platform for creating and displaying real-time data visualizations – line graphs, bar charts, pie charts, etc. – drawn live from an Elasticsearch database. Though still a visualization builder, Kibana now provides consoles for management, development, machine learning, data exploration, and much more. For example:
- From Kibana’s index management console, you can display stats such as field mapping (index schema), index summary metrics (number of documents, on-disk storage consumed, etc.), index default parameters (number of shards, nested field limits, etc.), and more.
- The Canvas workpad area lets you create content for Canvas. A complementary visualization tool, Canvas can create Kibana-style data graphics, arrange them in a form suitable for presentations, and export them to file formats for import into presentation or slide-deck software.
- The Maps console lets you process and display geographical data (the console opens to a world map). Data is displayed as layers veneered atop and bound to the mapping structure below, so you can do things like set the color of a region (state, country, etc.) based on, say, the number of documents in a data set whose geospatial coordinates are in that region.
- The development tools dashboard provides a console for entering and executing queries against a database. (This is what used to be available as the widely popular Sense plug-in for the Chrome browser.) The dev tools console can also provide the performance of various internal database components.
These are just a sampling. The easiest way to witness all of Kibana’s features is to install your own Elasticsearch and Kibana combo and do some exploring.
While Elasticsearch is evolving rapidly, its evolution is not simply engineers ladling out a stew of new features. They are also upgrading infrastructure in ways that both improve cluster performance and simplify an otherwise complex cluster configuration process.
This is not to say that properly configuring an enterprise-scale Elasticsearch cluster has suddenly been made straightforward; there are numerous considerations to be accounted for. But it is true that installing and running a working Elasticsearch database that’s suitable for development is surprisingly easy. And the new features that have appeared with Elasticsearch 7 make the trip from that initial installation to a distributed, enterprise-level, multi-node Elasticsearch cluster a much swifter journey.