Apache HBase
Apache HBase is an open-source, distributed wide-column store modeled on Google’s Bigtable and built on top of HDFS as its storage layer. It became a Hadoop subproject in 2008 and a top-level Apache project in 2010. HBase provides random, strongly consistent read/write access to very large, sparse datasets: petabyte-scale tables with billions of rows and millions of columns. HBase remains widely deployed at large enterprises that already run Hadoop, but greenfield projects today often choose Cassandra, ScyllaDB, or a managed wide-column service instead.
Key Features:
- Bigtable Data Model. A sparse, multidimensional sorted map: (row key, column family, column qualifier, timestamp) → value. Columns within a family are stored together, and different rows can have entirely different columns.
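- Strong Consistency. Each region is served by exactly one region server at a time, so reads and writes for a given row are strongly consistent and single-row mutations are atomic. Cassandra, by contrast, offers tunable consistency and is eventually consistent at its weaker settings.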
- HDFS Storage. Data files (HFiles, HBase’s SSTable-style format) live on HDFS, inheriting its durability, replication factor, and locality model.
- Region Servers. Tables are split into contiguous row-key ranges (regions), each assigned to a region server. A region splits automatically when it exceeds a configured size threshold.
- HBase Shell + REST + Thrift. Multiple client surfaces; the Java client API is the native one.
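- Coprocessors. Server-side hooks that run inside the region server: observers behave like triggers, and endpoints support server-side computation such as aggregation. They can be used to maintain derived data on the server, a role that Cassandra addresses with materialized views.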
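As an illustration of the data model listed above, the following is a minimal in-memory sketch of a sparse, sorted, versioned map. It is not the HBase client API: the class name TinyTable, its method names, and the use of plain strings for row keys are assumptions made for the example (HBase itself stores keys and values as byte arrays).

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a Bigtable-style table (hypothetical, not the HBase API)."""

    def __init__(self):
        # row key -> (family, qualifier) -> {timestamp: value}
        # Rows only hold the columns that were actually written (sparse).
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        """Store one cell version under (row, family, qualifier, ts)."""
        self._rows[row][(family, qualifier)][ts] = value

    def get(self, row, family, qualifier, ts=None):
        """Return the newest value at or before ts (newest overall if ts is None)."""
        versions = self._rows.get(row, {}).get((family, qualifier), {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, stop_row):
        """Yield (row, columns) in sorted key order for start_row <= row < stop_row."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, sorted(self._rows[row])


t = TinyTable()
t.put("user#001", "info", "name", "Ada", ts=1)
t.put("user#001", "info", "name", "Ada L.", ts=2)   # newer version of the same cell
t.put("user#002", "metrics", "logins", 7, ts=1)     # different row, different columns

print(t.get("user#001", "info", "name"))        # Ada L.
print(t.get("user#001", "info", "name", ts=1))  # Ada
print(list(t.scan("user#001", "user#002")))     # [('user#001', [('info', 'name')])]
```

The sketch sorts row keys at scan time for brevity; a real store keeps data sorted on disk so that range scans over adjacent row keys are sequential reads.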
HBase vs. Cassandra:
- HBase. Strong consistency, one serving region server per region, HDFS-backed. Best when you already run Hadoop and need strongly consistent row access.
- Cassandra. Tunable consistency (eventual at its weaker settings), masterless, native multi-datacenter replication. Best for write-heavy and multi-DC workloads.
- Both implement Bigtable-style wide-column data models, though the client interfaces differ: HBase exposes Get/Put/Scan operations keyed by row, while Cassandra is queried through CQL.
Use Cases:
- Operational data stores on top of an existing Hadoop cluster.
- Massive time-series tables (OpenTSDB is built on HBase).
- Real-time random access to data also queried by MapReduce / Spark batch jobs.
- Historical production deployments include Facebook Messages, Yahoo, AdRoll, and many financial-services workloads.
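Because HBase keeps rows sorted by key, time-series workloads depend heavily on row-key design. The sketch below shows one common pattern under stated assumptions: a salt byte derived from the metric name, followed by the metric and an hour-aligned big-endian timestamp, with the offset inside the hour used as the column qualifier. The function names, NUM_BUCKETS, and the exact byte layout are hypothetical and are not OpenTSDB’s actual schema.

```python
import hashlib

NUM_BUCKETS = 8  # hypothetical number of salt buckets


def make_row_key(metric: str, epoch_seconds: int) -> bytes:
    """Build salt | metric | 0x00 | hour-aligned timestamp as a byte string."""
    # The salt depends only on the metric, so one metric's rows stay contiguous
    # while different metrics spread across key ranges.
    salt = hashlib.md5(metric.encode()).digest()[0] % NUM_BUCKETS
    base_hour = epoch_seconds - (epoch_seconds % 3600)
    # Big-endian encoding makes lexicographic byte order match time order.
    return bytes([salt]) + metric.encode() + b"\x00" + base_hour.to_bytes(4, "big")


def qualifier_for(epoch_seconds: int) -> bytes:
    """Offset in seconds within the hour, used as the column qualifier."""
    return (epoch_seconds % 3600).to_bytes(2, "big")


k1 = make_row_key("cpu.load", 1_700_000_000)
k2 = make_row_key("cpu.load", 1_700_003_600)  # one hour later
print(k1 < k2)  # True: later hours sort after earlier ones for the same metric
```

With this layout, all samples for one metric and hour land in a single row, and a scan over a time range for that metric reads a contiguous block of keys.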
Notes:
HBase’s master/region-server topology, which also depends on ZooKeeper and on the HDFS NameNode and DataNodes, carries more operational complexity than Cassandra’s peer-to-peer ring. For new deployments today the question usually becomes “HBase or Cassandra/Scylla?” HBase wins when strong consistency is required and Hadoop infrastructure is already in place; otherwise Cassandra or a managed service such as Bigtable or DynamoDB is typically simpler.
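To make the region-server side of this topology concrete, here is a minimal sketch of range-based routing: each region is identified by its start key, and a row key belongs to the region with the greatest start key that is less than or equal to it. The RegionMap class, its methods, and the server names are assumptions for illustration. A real HBase client resolves regions through the hbase:meta table, and the master decides when regions move between servers.

```python
import bisect


class RegionMap:
    """Toy routing table: sorted region start keys -> region server name."""

    def __init__(self):
        self._starts = [""]       # the first region starts at the empty key
        self._servers = ["rs1"]   # hypothetical server names

    def locate(self, row_key):
        """Return (region start key, server) for the region holding row_key."""
        idx = bisect.bisect_right(self._starts, row_key) - 1
        return self._starts[idx], self._servers[idx]

    def split(self, split_key):
        """Split the region containing split_key into two at that key."""
        idx = bisect.bisect_right(self._starts, split_key)
        if self._starts[idx - 1] == split_key:
            raise ValueError("a region already starts at this key")
        self._starts.insert(idx, split_key)
        # In this sketch the new upper region stays on the parent's server.
        self._servers.insert(idx, self._servers[idx - 1])

    def move(self, start_key, server):
        """Reassign the region that starts at start_key to another server."""
        self._servers[self._starts.index(start_key)] = server


regions = RegionMap()
regions.split("m")        # one region becomes ["", "m") and ["m", end)
regions.move("m", "rs2")  # rebalance the upper region onto a second server
print(regions.locate("apple"))  # ('', 'rs1')
print(regions.locate("zebra"))  # ('m', 'rs2')
```

Every row key maps to exactly one region and therefore one serving region server at a time, which is what allows single-row reads and writes to be strongly consistent, at the cost of the extra coordination components noted above.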