HBase is massively scalable -- and hugely complex

Apache HBase describes itself as "the Hadoop database," which can be a bit confusing, as Hadoop is typically understood to refer to the popular MapReduce processing framework. But Hadoop is really an umbrella name for an entire ecosystem of technologies, some of which HBase uses to create a distributed, column-oriented database built on the same principles as Google's Bigtable. HBase does not use Hadoop's MapReduce capabilities directly, though HBase can integrate with Hadoop to serve as a source or destination of MapReduce jobs.

The hallmarks of HBase are extreme scalability, high reliability, and the schema flexibility you get from a column-oriented database. While tables and column families must be defined in advance, you can add new columns on the fly. HBase also offers strong row-level consistency, built-in versioning, and "coprocessors" that provide the equivalents of triggers and stored procedures.

Designed to support queries over massive data sets, HBase is optimized for read performance. For writes, HBase seeks to maintain consistency. In contrast to the "eventually consistent" Cassandra, HBase does not offer tunable consistency levels (such as acknowledging a write after one node, or a quorum of nodes, has recorded it). The price of HBase's strong consistency is that writes can be slower.

HDFS -- the Hadoop Distributed File System -- is the Hadoop ecosystem's foundation, and it's the file system atop which HBase resides. Designed to run on commodity hardware and tolerate member node failures, HDFS works best for batch processing systems that prefer streamed access to large data sets. This seems to make it inappropriate for the random access one would expect in database systems like HBase. But HBase takes steps to compensate for HDFS's otherwise incongruous behavior.

Zookeeper, another Hadoop technology (though no longer used by current versions of the Hadoop MapReduce engine), is a distributed communication and coordination service. Zookeeper maintains a synchronized, in-memory data structure that can be accessed by multiple clients. The data structure is organized like a file system, though the structure's components (znodes) can be data containers, as well as elements in a hierarchical tree. Imagine a file system whose files can also be directories.

HBase uses Zookeeper to coordinate cluster activities and monitor the health of member nodes. When you run an HBase cluster, you must also run Zookeeper in parallel. HBase will run and manage Zookeeper by default, though you can configure HBase to use a separately managed Zookeeper setup. You can even run the Zookeeper server processes on the same hardware as the other HBase processes, but that's not recommended, particularly for a high-volume HBase cluster.
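
If you choose to manage Zookeeper yourself, you point HBase at the ensemble in hbase-site.xml and tell HBase not to start its own Zookeeper (set HBASE_MANAGES_ZK=false in hbase-env.sh). A minimal fragment might look like the following; the host names are placeholders, and exact property names can vary by HBase version:

```xml
<!-- hbase-site.xml: use an externally managed Zookeeper ensemble -->
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```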

How HBase works

The HBase data model will seem familiar at first. A table consists of rows. Each row -- fundamentally, just a blob of bytes -- is uniquely identified by a row key. The choice of row key, made when the table is created, is important because HBase uses row keys to guide data sharding -- that is, the way in which data is distributed throughout the cluster. Row keys also determine the sort order of a table's rows.
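
Because rows are kept in lexicographic (byte-wise) order of their row keys, key design directly shapes both scan order and how rows spread across the cluster. A toy sketch of the sorting rule in Python (HBase's native client API is Java; the rule itself is language-agnostic):

```python
# Rows in HBase sort by the raw bytes of their row keys. Byte-wise order
# means "row10" lands between "row1" and "row2" -- a common surprise when
# numeric IDs are written as plain strings.
keys = [b"row1", b"row10", b"row2"]
print(sorted(keys))  # [b'row1', b'row10', b'row2']

# Zero-padding the numeric part restores the intended order.
padded = [b"row0001", b"row0010", b"row0002"]
print(sorted(padded))  # [b'row0001', b'row0002', b'row0010']
```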

More precisely, a row is a collection of key/value pairs, the key being a column identifier and the value being the content of the cell that exists at the intersection of a specific row and column. However, because HBase is a column-oriented database, no two rows in a table need have the same columns. To complicate matters further, data is versioned in HBase. The actual coordinates of a value (cell) are the tuple {row key, column key, timestamp}. In addition, columns can be grouped into column families, which give a database designer further control over access characteristics, as all columns within a column family will be stored in close proximity to one another.
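
The three-part cell coordinate can be modeled with an ordinary dictionary. This is only an illustration of the data model, not HBase code:

```python
# Model a versioned cell store: each value lives at the coordinate
# {row key, column key, timestamp}. Reads return the newest version by default.
cells = {}

def put(row, column, value, ts):
    cells[(row, column, ts)] = value

def get_latest(row, column):
    versions = [(ts, v) for (r, c, ts), v in cells.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None  # newest timestamp wins

put("user42", "info:email", "old@example.com", ts=100)
put("user42", "info:email", "new@example.com", ts=200)
print(get_latest("user42", "info:email"))  # new@example.com
```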

A write operation in HBase first records the data to a commit log (a "write-ahead log"), then to an internal memory structure called a MemStore. When the MemStore fills, it is flushed to disk as an entity called an HFile. HFiles are stored as a sequence of data blocks, with an index appended to the file's end. Another index, kept in memory, speeds searches for data in HFiles.
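
The write path above can be sketched as a toy model (Python; the names and threshold are illustrative, not HBase internals):

```python
# Toy write path: append to a write-ahead log first, then update the in-memory
# MemStore; when the MemStore exceeds a threshold, flush it as a sorted,
# immutable "HFile" (here just a sorted list of key/value pairs).
FLUSH_THRESHOLD = 3

wal, memstore, hfiles = [], {}, []

def write(key, value):
    wal.append((key, value))          # 1. durability: commit log first
    memstore[key] = value             # 2. fast in-memory write
    if len(memstore) >= FLUSH_THRESHOLD:
        hfiles.append(sorted(memstore.items()))  # 3. flush a sorted run to "disk"
        memstore.clear()

for k, v in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    write(k, v)

print(hfiles)    # [[('a', 1), ('b', 2), ('c', 3)]]
print(memstore)  # {'d': 4}
```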

HFiles are immutable once written. If a key is deleted, HBase records a special "tombstone" marker to mark the deletion. Tombstones are removed (as is the deleted data) when HFiles are periodically compacted.
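
The tombstone mechanics can be sketched in miniature (a toy Python model with assumed names, not HBase code):

```python
# Toy compaction: deletes are recorded as tombstone markers in newer files;
# merging files drops both the tombstone and the data it shadows.
TOMBSTONE = object()

older = {"a": 1, "b": 2, "c": 3}     # an older immutable "HFile"
newer = {"b": TOMBSTONE, "d": 4}     # a newer file: key 'b' was deleted

def compact(old, new):
    merged = dict(old)
    merged.update(new)               # newer entries win
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}

print(compact(older, newer))  # {'a': 1, 'c': 3, 'd': 4}
```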

HBase attempts to satisfy read operations first through the MemStore. Failing that, HBase checks yet another in-memory structure, the BlockCache, which is a read cache designed to deliver frequently read data from memory, rather than from the disk-based HFiles.
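
The layered read path can be sketched like so (a toy Python model; the structures and names are illustrative):

```python
# Toy read path: check the MemStore first, then the in-memory read cache,
# and only then fall back to the (sorted, indexed) files on disk.
memstore = {"k1": "in-memory"}
read_cache = {"k2": "cached"}
hfiles = [{"k3": "on-disk"}]

def read(key):
    if key in memstore:
        return memstore[key]
    if key in read_cache:
        return read_cache[key]
    for hfile in hfiles:             # disk lookup; populate the cache
        if key in hfile:
            read_cache[key] = hfile[key]
            return hfile[key]
    return None

print(read("k1"), read("k2"), read("k3"))  # in-memory cached on-disk
print("k3" in read_cache)                  # True -- cached after first read
```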

HBase shards rows by regions, which are defined by a range of row keys. Every region in an HBase cluster is managed by a RegionServer process. Typically, there is a single RegionServer process per HBase node. As the amount of data grows, HBase splits regions and migrates the associated data to different nodes in the cluster for balancing purposes.
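
Region lookup by row key range amounts to a binary search over sorted region start keys, which can be sketched as follows (assumed region boundaries, not HBase code):

```python
import bisect

# Toy region map: each region covers [start_key, next_start_key).
region_starts = [b"", b"g", b"n", b"t"]       # four regions
region_servers = ["rs1", "rs2", "rs3", "rs4"] # one hosting server per region

def region_for(row_key):
    idx = bisect.bisect_right(region_starts, row_key) - 1
    return region_servers[idx]

print(region_for(b"apple"))  # rs1  (falls in ["", "g"))
print(region_for(b"kiwi"))   # rs2  ("g" <= "kiwi" < "n")
print(region_for(b"zebra"))  # rs4  ("t" <= "zebra")
```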

HBase's cluster architecture is not completely symmetrical. For example, every cluster must have a single, active master node. Multiple nodes can (and should) be designated as master nodes, but when the cluster boots, the candidate masters coordinate so that only one is the acting master. It's the master's responsibility to monitor region servers, handle region server failover, and coordinate region splits.

Should the master node crash, the cluster can still operate in a steady-state mode -- managing read and write requests -- but cannot execute any of the operations that require the master's coordination (such as rebalancing). This is why it's a good idea to specify multiple master nodes; if and when the reigning master should fail, it will be quickly replaced.

You can run HBase atop a native file system for development purposes, but a deployed HBase cluster runs on HDFS, which -- as mentioned earlier -- seems like a poor playground for HBase. Despite the streaming-oriented underlying file system, HBase achieves fast random I/O. It accomplishes this magic by a combination of batching writes in memory and persisting data to disk using log-structured merge trees. As a result, all random writes are performed in memory, and when data is flushed to disk, the data is first sorted, then written sequentially with an accompanying index. Random reads are first attempted in memory, as mentioned above. If the requested data is not in memory, the subsequent disk search is speedy because the data is sorted and indexed.

Working with HBase

HDFS was designed on the principle that it is easier to move computation (as in a MapReduce operation) close to the data being processed than it is to move the data close to the computation. As a result, it is not in HDFS's nature to ensure that related pieces of data (say, rows in a database) are co-located. This means it's possible that a block whose data is managed by a particular RegionServer will not be stored on the same physical host as that RegionServer. However, HDFS provides mechanisms that advertise block location and -- more important -- perform block relocation upon request. HBase uses these mechanisms to move blocks so that they are local to their owning RegionServer.

While HBase does not support transactions, neither is it eventually consistent; rather, HBase supports strong consistency, at least at the level of a single row. HBase has no sense of data types; everything is stored as an array of bytes. However, HBase does define a special "counter" datatype, which provides for an atomic increment operation -- useful for counting views of a Web page, for example. You can increment any number of counters within a single row via a single call, and without having to lock the row. Note that counters will be synchronized for write operations (multiple writes will always execute consistent increments) but not necessarily for read operations.
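
The point of the counter type is that the read-modify-write happens atomically on the server, so concurrent clients never lose updates. A sketch of that guarantee using a lock (an analogy in Python, not HBase's implementation):

```python
import threading

# Toy atomic counter: like HBase's counter increment, the read-modify-write
# happens under mutual exclusion, so concurrent increments are never lost.
counters = {}
lock = threading.Lock()

def increment(row, column, amount=1):
    with lock:
        key = (row, column)
        counters[key] = counters.get(key, 0) + amount
        return counters[key]

threads = [threading.Thread(target=lambda: [increment("page1", "hits:total")
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counters[("page1", "hits:total")])  # 4000 -- no lost updates
```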

The HBase shell is actually a modified, interactive Ruby shell running in JRuby, with Ruby executing in a Java VM. Anything you can do in the interactive Ruby shell you can do in the HBase shell, which means the HBase shell can be a powerful scripting environment.

The latest version of the shell provides a sort of object-oriented interface for manipulating HBase tables. You can, for example, assign a table to a JRuby variable and invoke methods on the table object using standard dot notation. If you've defined a table and assigned it to the myTable variable, you could write (put) data to the table with something like:

myTable.put '<row>', '<col>', '<v>'

This would write the value <v> into the row <row> at column <col>.

There are some third-party management GUIs for HBase, such as hbase-explorer. HBase itself includes some built-in Web-based monitoring tools. An HBase master node serves a Web interface on port 60010. Browse to it, and you'll find information about the master node itself, including start time, the current Zookeeper port, a list of region servers, the average number of regions per region server, and so on. A list of tables is also provided. Click on a table and you're shown information such as the region servers that are hosting the table's components. This page also provides controls for initiating a compaction on the table or splitting the table's regions.

In addition, each region server node runs a monitoring Web interface at port 60030. Here you'll find lots of metrics: read and write latencies, for example, broken down into various percentiles. You can also see information about the regions managed by this region server, and you can generate a dump of the active threads on the server.

The HBase reference guide includes a Getting Started guide and an FAQ. It's a live document, so you'll find user community comments attached to each entry. The HBase website also provides links to the HBase Java API, as well as to videos and off-site sources of HBase information. More information can be found in the HBase wiki. While good, the HBase documentation is not quite on par with documentation I've seen on other database product sites, such as Cassandra and MongoDB. Nevertheless, there's plenty of material around the Internet, and the HBase community is large and active enough that any HBase questions won't go unanswered for long.

One of HBase's more interesting recent additions is support for "coprocessors" -- user code that executes as part of the HBase RegionServer and Master processes. There are roughly two kinds of coprocessors: observers and endpoints. An observer is a user-written Java class that defines methods to be invoked when certain HBase events occur. Think of an observer as the HBase counterpart to the RDBMS trigger. One observer, called a RegionObserver, can hook specific points in the flow of control of data manipulation operations like get, put, and delete.

The HBase endpoint coprocessor works much like a stored procedure. When loaded it can be invoked from an observer, for example, and thereby permits adding new features to HBase dynamically. There are various ways to load coprocessors into an HBase cluster, including via the HBase shell.
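
Observers are Java classes that override hook methods (prePut, postGet, and so on). The general callback pattern can be sketched like this -- an analogy in Python, not the actual coprocessor API:

```python
# Toy observer: callbacks registered for a hook run before the operation is
# applied, the way a RegionObserver's prePut runs before a put takes effect.
hooks = {"pre_put": []}
table = {}

def register(hook, fn):
    hooks[hook].append(fn)

def put(row, column, value):
    for fn in hooks["pre_put"]:      # trigger-like interception point
        value = fn(row, column, value)
    table[(row, column)] = value

register("pre_put", lambda r, c, v: v.strip().lower())  # normalize input
put("user1", "info:email", "  Alice@Example.COM ")
print(table[("user1", "info:email")])  # alice@example.com
```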

Configuring a large HBase cluster can be difficult. An HBase cluster includes master nodes, RegionServer processes, HDFS processes, and an entire Zookeeper cluster running side by side. Clearly, troubleshooting a failure can be a complex undertaking, as there are numerous moving parts to be examined.

HBase is very much a developer-centric database. Its online reference guide is heavily linked into HBase's Java API docs. If you want to understand the role played by a particular HBase entity -- say, a Filter -- be prepared to be handed off to the Java API's documentation of the Filter class for a full explanation.

Given that access is by row and that rows are indexed by row keys, it follows that careful design of row key structure is critical for good performance. Ironically, programmers in the good old days of ISAM (Indexed Sequential Access Method) databases knew this well: Database access was all about the components -- and the ordering of those components -- in compound-key indexes.
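
A classic instance of such compound-key design: prefix the key with an entity ID and append a reversed timestamp, so that a scan within that entity's key range returns the newest entries first. A Python sketch of the byte layout (the key format is an example, not a prescribed convention):

```python
import struct

MAX_LONG = 2**63 - 1

# Compound row key: <user id>|<reversed timestamp>. Because rows sort
# lexicographically, reversing the timestamp makes newer events sort first
# within each user's key range.
def row_key(user_id, timestamp_ms):
    return user_id.encode() + b"|" + struct.pack(">q", MAX_LONG - timestamp_ms)

def ts_of(key):
    return MAX_LONG - struct.unpack(">q", key.split(b"|", 1)[1])[0]

keys = [row_key("user42", 1000), row_key("user42", 3000), row_key("user42", 2000)]
print([ts_of(k) for k in sorted(keys)])  # [3000, 2000, 1000] -- newest first
```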

HBase employs a collection of battle-tested technologies from the Hadoop world, and it's well worth consideration when building a large, scalable, highly available, distributed database, particularly for those applications where strong consistency is important.
