RocksDB is a high-performance embedded key-value database, originally forked from Google's LevelDB by Facebook (Meta) in 2012. It keeps LevelDB's LSM-tree on-disk format but adds the features LevelDB lacks for production server workloads: column families, multi-threaded compaction, transactions, bloom filters, multiple compression algorithms, TTLs, backups, and a much larger tuning surface. RocksDB powers MyRocks (MySQL), CockroachDB, TiKV, Kafka Streams' state store, Apache Flink's keyed state, Yugabyte, ScyllaDB's metadata, and Meta's social-graph storage tier.
RocksDB is an embedded library, not a server. Like LevelDB, it stores ordered (key, value) byte-string pairs and uses a Log-Structured Merge tree on disk. Unlike LevelDB, the codebase is deeply optimized for modern hardware (SSDs, NVMe, multi-core CPUs) and includes a large set of production features.
LSM-tree write path (same as LevelDB, with refinements): a write is appended to the write-ahead log (WAL), then inserted into an in-memory MemTable; when the MemTable fills, it becomes immutable and is flushed to an L0 SSTable; background compaction merges SSTables down the levels. Reads check the MemTable first, then SSTables from newest to oldest.
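That write path can be sketched in a few lines of plain Python. This is a toy model, not RocksDB code; the class, its flush threshold, and its linear SSTable scan are illustrative only.

```python
# Toy LSM write path: writes land in a MemTable (a dict); a full MemTable
# is flushed to an immutable sorted "SSTable"; reads check the MemTable,
# then SSTables newest-first. Real RocksDB also writes a WAL first and
# compacts SSTables across levels in the background.

class ToyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.sstables = []              # sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest SSTable wins
            for k, v in table:
                if k == key:
                    return v
        return None

db = ToyLSM()
for i in range(7):
    db.put(f"k{i}", f"v{i}")
db.put("k0", "v0-updated")    # newer write shadows the flushed one
print(db.get("k0"))           # v0-updated
print(len(db.sstables))       # 2
```

The "newest wins" read order is why an LSM tree never updates in place: a later write simply shadows an earlier one until compaction discards the stale copy.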
Why RocksDB exists: LevelDB's compaction is single-threaded, and it has no bloom filters by default, no column families, no transactions, and only a small tuning surface. RocksDB rewrote almost every layer to remove those limits while remaining (mostly) compatible with LevelDB's on-disk file format.
Key concepts beyond LevelDB: column families (independent keyspaces with atomic cross-CF writes), pessimistic and optimistic transactions, merge operators (read-free deferred updates), multiple compaction styles (leveled, universal, FIFO), per-SSTable bloom filters, TTLs, and incremental backups and checkpoints.
# macOS (Homebrew)
brew install rocksdb
# Python binding (C++ bridge)
pip install python-rocksdb # classic binding, builds against system librocksdb
# or
pip install rocksdict # newer, prebuilt wheels, simpler API
# Debian/Ubuntu
sudo apt-get update
sudo apt-get install -y librocksdb-dev libsnappy-dev liblz4-dev libzstd-dev libbz2-dev
pip install python-rocksdb
# or
pip install rocksdict
# Or build from source
git clone https://github.com/facebook/rocksdb.git
cd rocksdb
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release \
-DWITH_SNAPPY=ON -DWITH_LZ4=ON -DWITH_ZSTD=ON \
-DUSE_RTTI=1 ..
make -j"$(nproc)"
sudo make install
Smoke test (rocksdict):

from rocksdict import Rdict
db = Rdict('/tmp/rocks-smoketest')
db[b'hello'] = b'world'
print(db[b'hello']) # b'world'
db.close()
Two binding choices — rocksdict (modern, dict-like, prebuilt wheels) and python-rocksdb (classic, more control, requires the C++ library on the system). Examples below use rocksdict for ergonomics.
import json
from rocksdict import Rdict, Options
# Open with explicit options
opts = Options()
opts.create_if_missing(True)
opts.set_compression_type('zstd')
db = Rdict('/tmp/users.rdb', options=opts)
# --- Insert ---
def put_user(user_id: int, record: dict) -> None:
key = f"user:{user_id:010d}".encode() # zero-padded for stable ordering
db[key] = json.dumps(record).encode()
put_user(1001, {"first": "Alice", "last": "Smith", "city": "Seattle"})
put_user(1002, {"first": "Bob", "last": "Johnson", "city": "Portland"})
put_user(1003, {"first": "Charlie", "last": "Davis", "city": "San Francisco"})
# --- Retrieve ---
raw = db[b'user:0000001002']
print(json.loads(raw)) # {'first': 'Bob', ...}
# --- Existence check ---
print(b'user:0000001002' in db) # True
# --- Delete ---
del db[b'user:0000001003']
# --- Missing keys ---
try:
_ = db[b'user:0000009999']
except KeyError:
print("not found")
db.close()
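A side note on the zero-padded keys used in put_user: RocksDB orders keys as raw bytes, so unpadded decimal IDs sort lexicographically rather than numerically. A quick stdlib-only illustration:

```python
# Why the zero-padding matters: byte-wise ordering of unpadded decimal
# IDs puts "2" after "10" and "100", which breaks range scans by ID.

unpadded = sorted(f"user:{i}".encode() for i in (2, 10, 100))
padded   = sorted(f"user:{i:010d}".encode() for i in (2, 10, 100))

print(unpadded)  # [b'user:10', b'user:100', b'user:2'], so 2 sorts last
print(padded)    # [b'user:0000000002', b'user:0000000010', b'user:0000000100']
```

Fixed-width padding (or a big-endian binary encoding) keeps the byte order and the numeric order in agreement.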
Classic python-rocksdb API (more verbose, mirrors C++):
import rocksdb
opts = rocksdb.Options()
opts.create_if_missing = True
opts.compression = rocksdb.CompressionType.lz4_compression
opts.write_buffer_size = 64 * 1024 * 1024
opts.max_write_buffer_number = 3
db = rocksdb.DB('/tmp/users.rdb', opts)
db.put(b'user:0000001001', b'{"first":"Alice"}')
print(db.get(b'user:0000001001')) # b'{"first":"Alice"}'
db.delete(b'user:0000001001')
The native API. Same shape as LevelDB plus column families, transactions, and many more options.
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>
#include <rocksdb/filter_policy.h>
#include <cassert>
#include <iostream>
#include <memory>
int main() {
rocksdb::Options options;
options.create_if_missing = true;
options.compression = rocksdb::kZSTD;
// Bloom filter for fast missing-key lookups
rocksdb::BlockBasedTableOptions table_opts;
table_opts.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
table_opts.block_cache = rocksdb::NewLRUCache(512 * 1024 * 1024); // 512 MiB
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
// Multi-threaded compaction
options.IncreaseParallelism(8);
options.OptimizeLevelStyleCompaction();
rocksdb::DB* db = nullptr;
rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/users.rdb", &db);
assert(s.ok());
db->Put(rocksdb::WriteOptions(), "user:1001",
R"({"first":"Alice","last":"Smith"})");
std::string value;
s = db->Get(rocksdb::ReadOptions(), "user:1001", &value);
if (s.ok()) std::cout << value << std::endl;
db->Delete(rocksdb::WriteOptions(), "user:1001");
delete db;
return 0;
}
Compile:
g++ -std=c++17 rocks_demo.cpp -o rocks_demo \
-lrocksdb -lpthread -lsnappy -llz4 -lzstd -lbz2
./rocks_demo
Column Families (CFs) are independent keyspaces inside one RocksDB database. Each CF has its own MemTable, SSTables, compression settings, and bloom filters — but writes across CFs commit atomically. Think of them as named namespaces or lightweight tables.
from rocksdict import Rdict, Options, ColumnFamily
# Open existing CFs (must list them all if any non-default exist)
db = Rdict('/tmp/multi.rdb')
# Create new CFs
db.create_column_family('users')
db.create_column_family('orders')
db.create_column_family('events')
users = db.get_column_family('users')
orders = db.get_column_family('orders')
users[b'1001'] = b'{"first":"Alice"}'
orders[b'A100'] = b'{"user":1001,"total":42.50}'
# Atomic write across CFs (via WriteBatch — see §6)
print(users[b'1001']) # b'{"first":"Alice"}'
print(orders[b'A100'])
db.close()
Why use CFs?
Per-CF tuning: e.g. a users CF sized for point reads (small block size, large bloom filter) and an events CF for sequential append (universal compaction, large write buffer). And atomicity: writes spanning several CFs commit together through a single WriteBatch.

RocksDB supports both optimistic and pessimistic transactions with snapshot isolation. Use the TransactionDB (pessimistic) or OptimisticTransactionDB (optimistic) variants.
#include <rocksdb/utilities/transaction.h>
#include <rocksdb/utilities/transaction_db.h>
rocksdb::TransactionDBOptions txn_opts;
rocksdb::TransactionDB* tdb = nullptr;
rocksdb::Status s = rocksdb::TransactionDB::Open(
options, txn_opts, "/tmp/txn.rdb", &tdb);
// Begin a transaction
rocksdb::Transaction* txn = tdb->BeginTransaction(rocksdb::WriteOptions());
std::string current;
txn->GetForUpdate(rocksdb::ReadOptions(), "balance:alice", &current); // locks the key
int64_t alice_bal = std::stoll(current) - 100;
txn->GetForUpdate(rocksdb::ReadOptions(), "balance:bob", &current);   // locks the key
int64_t bob_bal = std::stoll(current) + 100;
txn->Put("balance:alice", std::to_string(alice_bal));
txn->Put("balance:bob", std::to_string(bob_bal));
s = txn->Commit(); // or txn->Rollback()
delete txn;
WriteBatch (no locking, group commit):

from rocksdict import Rdict, WriteBatch
db = Rdict('/tmp/users.rdb')
batch = WriteBatch()
batch.put(b'user:1001', b'{"first":"Alice"}')
batch.put(b'user:1002', b'{"first":"Bob"}')
batch.delete(b'user:1003')
# Commit atomically — all writes apply together or none do
db.write(batch)
db.close()
Optimistic vs. pessimistic. Pessimistic locks early and waits, good for high contention on the same keys. Optimistic doesn't lock; conflicts are detected at commit time and the loser must retry — better when contention is rare.
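The optimistic protocol can be sketched with a toy version-checking store. This is illustrative Python, not the RocksDB implementation; the class and method names are made up.

```python
# Optimistic concurrency in miniature: each key carries a version; a
# transaction records the versions it read, and commit fails if any
# read key was modified since. The loser retries.

class OptimisticStore:
    def __init__(self):
        self.data = {}                  # key -> (value, version)

    def begin(self):
        return {"reads": {}, "writes": {}}

    def get(self, txn, key):
        value, version = self.data.get(key, (None, 0))
        txn["reads"][key] = version     # remember what we saw
        return value

    def put(self, txn, key, value):
        txn["writes"][key] = value      # buffered until commit

    def commit(self, txn):
        for key, seen in txn["reads"].items():
            _, current = self.data.get(key, (None, 0))
            if current != seen:
                return False            # conflict: caller must retry
        for key, value in txn["writes"].items():
            _, version = self.data.get(key, (None, 0))
            self.data[key] = (value, version + 1)
        return True

store = OptimisticStore()
t1, t2 = store.begin(), store.begin()
store.get(t1, "balance"); store.put(t1, "balance", 100)
store.get(t2, "balance"); store.put(t2, "balance", 200)
print(store.commit(t1))   # True, first committer wins
print(store.commit(t2))   # False, t2's read is stale; retry
```

This also shows why optimistic transactions degrade under contention: every conflict turns into a full retry of the losing transaction.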
RocksDB iterators are similar to LevelDB's but support prefix bloom filters: if you tell RocksDB the prefix-extraction function up front, prefix scans skip entire SSTables that can't match.
from rocksdict import Rdict
db = Rdict('/tmp/users.rdb')
# Forward scan, all keys
it = db.iter()
it.seek_to_first()
while it.valid():
print(it.key(), it.value())
it.next()
# Range: keys in [start, stop)
it = db.iter()
it.seek(b'user:0000001001')
while it.valid() and it.key() < b'user:0000002000':
print(it.key())
it.next()
# Reverse scan
it = db.iter()
it.seek_to_last()
while it.valid():
print(it.key())
it.prev()
db.close()
Prefix seek requires configuring the prefix extractor in the options:
from rocksdict import Options, SliceTransform
opts = Options()
opts.create_if_missing(True)
# First 5 bytes are the prefix (e.g., "user:" or "ordr:")
opts.set_prefix_extractor(SliceTransform.create_fixed_prefix(5))
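What a prefix scan does can be sketched without RocksDB at all: the iterator seeks to the first key carrying the prefix and walks while keys still match, and the prefix bloom filter's only job is to prune SSTables that cannot contain the prefix. A stdlib sketch, with prefix_scan as a hypothetical helper:

```python
# Conceptual prefix scan over a sorted keyspace: bisect stands in for
# the iterator's seek(); the while loop is the forward scan.
import bisect

keys = sorted([b'ordr:A100', b'user:0000001001', b'user:0000001002',
               b'evnt:20260425', b'user:0000001003'])

def prefix_scan(sorted_keys, prefix):
    i = bisect.bisect_left(sorted_keys, prefix)   # seek(prefix)
    out = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        out.append(sorted_keys[i])
        i += 1
    return out

print(prefix_scan(keys, b'user:'))
# [b'user:0000001001', b'user:0000001002', b'user:0000001003']
```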
Two ways to capture a consistent on-disk copy of a live database:
Checkpoint: O(1) creation via hard links to existing SSTables — same disk, instant.
#include <rocksdb/utilities/checkpoint.h>
rocksdb::Checkpoint* cp = nullptr;
rocksdb::Checkpoint::Create(db, &cp);
cp->CreateCheckpoint("/backups/checkpoint-2026-04-25");
delete cp;
BackupEngine: designed for cross-disk and cross-host backup. Reuses unchanged SSTables across backups.
#include <rocksdb/utilities/backup_engine.h>
rocksdb::BackupEngineOptions bopts("/backups/rocksdb");
rocksdb::BackupEngine* engine = nullptr;
rocksdb::BackupEngine::Open(rocksdb::Env::Default(), bopts, &engine);
engine->CreateNewBackup(db); // incremental — only new SSTables copied
// List backups
std::vector<rocksdb::BackupInfo> info;
engine->GetBackupInfo(&info);
// Restore
engine->RestoreDBFromLatestBackup("/data/restored", "/data/restored");
delete engine;
RocksDB has a famously large tuning surface — the official tuning guide is a long read. The most-touched knobs:
- write_buffer_size — MemTable size before flush. Default 64 MiB; raise to 256–512 MiB for write-heavy workloads.
- max_write_buffer_number — how many MemTables can exist (active + immutable + flushing). Default 2; bump to 4–6 if flushes can't keep up.
- level0_file_num_compaction_trigger — how many L0 files trigger compaction. Default 4; lower means more aggressive compaction.
- max_background_jobs — total compaction + flush threads. Set near CPU count for write-heavy workloads.
- compaction_style — kCompactionStyleLevel (read-friendly), kCompactionStyleUniversal (write-friendly), kCompactionStyleFIFO (TTL log).
- compression_per_level — e.g. no compression on L0–L2, ZSTD on L3+ to balance hot-data CPU vs. cold-data space.
- block_cache — LRU cache for uncompressed blocks. Size to 25–50% of available RAM.
- NewBloomFilterPolicy(10, false) — 10 bits/key, ~1% false-positive rate. Massively cuts disk I/O on missing-key lookups.
- OptimizeLevelStyleCompaction(memtable_size) — one-call preset for OLTP-style workloads.
- OptimizeForPointLookup(cache_size_mb) — preset for hash-table-like usage.

A tuned configuration in C++:

rocksdb::Options options;
options.create_if_missing = true;
options.IncreaseParallelism(16); // 16 background threads
options.OptimizeLevelStyleCompaction(512L*1024*1024);
options.compression = rocksdb::kZSTD;
options.write_buffer_size = 256 * 1024 * 1024; // 256 MiB
options.max_write_buffer_number = 4;
options.level0_file_num_compaction_trigger = 8;
options.target_file_size_base = 256 * 1024 * 1024;
rocksdb::BlockBasedTableOptions table_opts;
table_opts.block_cache = rocksdb::NewLRUCache(8L * 1024 * 1024 * 1024); // 8 GiB
table_opts.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
table_opts.cache_index_and_filter_blocks = true;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
A merge operator lets you queue a deferred update against a key without first reading its current value. RocksDB stores the merge operands and applies them on read or during compaction. Useful for counters, append-only sets, and CRDT-like structures.
// Counter merge operator: increments stored as ASCII integers
class CounterMergeOperator : public rocksdb::AssociativeMergeOperator {
public:
bool Merge(const rocksdb::Slice& key,
const rocksdb::Slice* existing_value,
const rocksdb::Slice& value,
std::string* new_value,
rocksdb::Logger* logger) const override {
int64_t cur = 0;
if (existing_value) cur = std::stoll(existing_value->ToString());
int64_t inc = std::stoll(value.ToString());
*new_value = std::to_string(cur + inc);
return true;
}
const char* Name() const override { return "CounterMergeOperator"; }
};
options.merge_operator.reset(new CounterMergeOperator());
db->Merge(rocksdb::WriteOptions(), "page_views:home", "1");
db->Merge(rocksdb::WriteOptions(), "page_views:home", "1");
db->Merge(rocksdb::WriteOptions(), "page_views:home", "1");
std::string value;
db->Get(rocksdb::ReadOptions(), "page_views:home", &value);
// value == "3" — operands collapsed on read
Merges are O(1) at write time (no read), and operands are coalesced lazily during compaction or at read time. Without a merge operator, the same workload is a read-modify-write that costs an extra disk read per increment.
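The same mechanism in miniature, as illustrative Python. MergeStore is a made-up class mirroring the semantics, not a RocksDB API; it folds operands on read the way the CounterMergeOperator above does.

```python
# Merge-operator semantics: writes append operands without reading;
# reads fold pending operands onto the base value with an associative
# function, then cache the folded result (as compaction would).

class MergeStore:
    def __init__(self, merge_fn, default=0):
        self.pending = {}       # key -> list of unmerged operands
        self.base = {}          # key -> folded base value
        self.merge_fn = merge_fn
        self.default = default

    def merge(self, key, operand):
        self.pending.setdefault(key, []).append(operand)   # O(1), no read

    def get(self, key):
        value = self.base.get(key, self.default)
        for op in self.pending.pop(key, []):               # fold on read
            value = self.merge_fn(value, op)
        self.base[key] = value
        return value

counters = MergeStore(lambda cur, inc: cur + inc)
counters.merge("page_views:home", 1)
counters.merge("page_views:home", 1)
counters.merge("page_views:home", 1)
print(counters.get("page_views:home"))   # 3, operands collapsed on read
```

The fold only works because addition is associative; that is exactly the constraint RocksDB places on AssociativeMergeOperator implementations.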
| Capability | LevelDB | RocksDB |
|---|---|---|
| Column families | No | Yes |
| Compaction threads | 1 | N (configurable) |
| Compaction styles | Leveled only | Leveled, Universal, FIFO |
| Bloom filters | Manual / off by default | First-class, per-CF |
| Compression | Snappy | Snappy, LZ4, ZSTD, ZLIB, BZIP2 |
| Transactions | No | Optimistic + pessimistic |
| Merge operators | No | Yes |
| TTL | No | Yes (TtlDB) |
| Backup / checkpoint | No | Yes, incremental |
| Statistics & metrics | Minimal | Rich; integrates with Prometheus / OTel |
| Secondary indexes (built-in) | No | No (compose via prefixes) |
| Code size | ~25k LOC | ~400k LOC |
| Tuning surface | Small | Vast |
| Target use | Embedded simplicity | Production server-side |
Strong fits:
Poor fits:
RocksDB's additions over LevelDB, in one list: multi-threaded compaction, column families, bloom filters per SSTable, optimistic and pessimistic transactions, merge operators, multiple compression algorithms (LZ4/ZSTD), per-key TTL, incremental backups via BackupEngine, hard-linked checkpoints, and a much richer statistics/metrics surface. Each one removes a real production limitation of LevelDB: single-threaded compaction caps throughput on multi-core hosts, missing bloom filters punish point reads on missing keys, no transactions makes correct multi-key updates impossible, and no checkpoints makes online backup hard.
Universal compaction merges all SSTables of similar size into one, producing fewer, larger files. It writes less in total (lower write amplification) than Leveled, which is great for write-heavy workloads or workloads that don't need fast reads. The trade-off is higher space amplification (you may have 2x your dataset on disk during compaction) and slower point reads, since a key may live in a wider range of files. Pick Universal for ingest-heavy time-series or log workloads; pick Leveled (the default) for OLTP-style mixed read/write.
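A back-of-envelope model of the write-amplification difference, using the standard LSM analysis rather than any RocksDB measurement. T = 10 mirrors the default max_bytes_for_level_multiplier; the levels helper and the round numbers are illustrative.

```python
# Toy write-amplification model: with fanout T and L levels, leveled
# compaction rewrites a byte roughly T times per level (overlap with
# the next level), while tiered/universal rewrites it roughly once
# per level. Real numbers depend on workload and overlap.
import math

def levels(total_bytes, memtable_bytes, fanout):
    # levels needed to hold total_bytes when each level is `fanout` larger
    return max(1, math.ceil(math.log(total_bytes / memtable_bytes, fanout)))

T = 10                                    # size ratio between levels
L = levels(1 << 40, 64 << 20, T)          # 1 TiB of data, 64 MiB memtable

print("levels:", L)                       # levels: 5
print("leveled write amp ~", T * L)       # byte rewritten ~T times per level
print("universal write amp ~", L)         # byte rewritten ~once per level
```

Under this model, leveled compaction trades roughly T times more write I/O for tighter space usage and fewer files to check per read, which is the trade the paragraph above describes.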
For a key that doesn't exist, a naive LSM read may have to consult every SSTable on every level to confirm absence — potentially many disk reads. A bloom filter is a compact in-memory probabilistic structure that, given a key, returns "definitely not in this SSTable" or "maybe in this SSTable". With ~10 bits/key the false-positive rate is ~1%, so 99% of missing-key reads avoid disk entirely. RocksDB stores a bloom filter per SSTable (or per partitioned block); the filter blocks themselves are usually pinned in the block cache.
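A minimal bloom filter makes those numbers concrete. This is illustrative Python, not RocksDB's implementation; the SHA-256 hashing scheme and one-byte-per-bit layout are chosen for clarity, not speed.

```python
# Minimal bloom filter: m bits, k hash functions. At ~10 bits/key and
# k = 7, the false-positive rate lands near 1%, which is what
# NewBloomFilterPolicy(10, false) targets.
import hashlib

class Bloom:
    def __init__(self, n_keys, bits_per_key=10, k=7):
        self.m = n_keys * bits_per_key
        self.k = k
        self.bits = bytearray(self.m)     # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(4, 'big') + key).digest()
            yield int.from_bytes(h[:8], 'big') % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def maybe_contains(self, key):
        # False means "definitely absent"; True means "maybe present"
        return all(self.bits[p] for p in self._positions(key))

bf = Bloom(n_keys=1000)
for i in range(1000):
    bf.add(b'user:%d' % i)

false_pos = sum(bf.maybe_contains(b'missing:%d' % i) for i in range(10000))
print(false_pos / 10000)   # roughly 0.01: most absent keys skip disk
```

Every inserted key always answers True (no false negatives), which is why a negative answer lets RocksDB skip the SSTable entirely.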
Column families share the same WAL and the same instance, so writes across CFs commit atomically via a single WriteBatch — you can't get that across separate databases without a 2PC layer. CFs also share threads and the block cache, so resource accounting is unified. You'd open separate databases instead when you want hard isolation or different on-disk locations / lifecycles.
The application opens an OptimisticTransactionDB, begins a transaction (which captures a snapshot sequence number), reads and writes through the transaction object without taking locks, and calls Commit(). At commit, RocksDB scans the read-set and verifies that no concurrent committed write modified any of those keys after the snapshot. If a conflict is found, Commit() returns Status::Busy, the transaction is aborted, and the application retries (typically with backoff). This is best for low-contention workloads — under heavy contention, pessimistic locking via TransactionDB usually wins because retries dominate.
A merge operator lets you record a deferred update (e.g. "+1") against a key without first reading the current value. RocksDB persists the merge operand and applies it lazily — either at read time, when all operands are folded into the base value, or during compaction, when adjacent operands are coalesced. This turns a read-modify-write counter into an O(1) write, which is decisive for high-throughput counters, sets, and CRDT-like data structures. The trade-off is that merge logic must be associative and embedded in the operator implementation.