LevelDB is a fast, embedded key-value store written by Sanjay Ghemawat and Jeff Dean at Google. It is a single-process library (no server, no network protocol) that maps arbitrary byte-string keys to arbitrary byte-string values, ordered by key. The on-disk layout is an LSM (Log-Structured Merge) tree, which makes writes cheap and range scans efficient. LevelDB is the technical ancestor of RocksDB (Facebook's fork) and the storage layer behind Bitcoin Core, Chrome's IndexedDB, and many other systems.
LevelDB stores data as sorted (key, value) pairs. Keys and values are arbitrary byte strings — no schema, no types, no SQL. The library is embedded directly into your application process, so reads and writes are function calls (no IPC, no network round-trip).
Storage model: an LSM tree. Writes land in an in-memory MemTable backed by a write-ahead log; flushed MemTables become immutable SSTables that background compaction merges level by level (details in the internals section below).
Properties: embedded and in-process, schema-free byte-string keys and values, keys kept in sorted order, cheap sequential writes, efficient range scans.
# macOS (Homebrew)
brew install leveldb
# Python binding
pip install plyvel
# Debian/Ubuntu
sudo apt-get update
sudo apt-get install -y libleveldb-dev libsnappy-dev
# Python binding
pip install plyvel
# Build from source
git clone --recurse-submodules https://github.com/google/leveldb.git
cd leveldb
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
sudo cmake --install .
import plyvel
db = plyvel.DB('/tmp/leveldb-smoketest', create_if_missing=True)
db.put(b'hello', b'world')
print(db.get(b'hello')) # b'world'
db.close()
The standard Python binding is plyvel, a thin Cython wrapper around the C++ library. Keys and values are bytes, never str.
import json
import plyvel
# Open (or create) a database directory
db = plyvel.DB('/tmp/users.ldb', create_if_missing=True)
# --- Insert ---
def put_user(user_id: int, record: dict) -> None:
key = f"user:{user_id}".encode()
value = json.dumps(record).encode()
db.put(key, value)
put_user(1001, {"first": "Alice", "last": "Smith", "city": "Seattle"})
put_user(1002, {"first": "Bob", "last": "Johnson", "city": "Portland"})
put_user(1003, {"first": "Charlie", "last": "Davis", "city": "San Francisco"})
# --- Retrieve a single key ---
raw = db.get(b'user:1002')
if raw is not None:
print(json.loads(raw)) # {'first': 'Bob', 'last': 'Johnson', 'city': 'Portland'}
# --- Delete ---
db.delete(b'user:1003')
# --- Check missing keys ---
print(db.get(b'user:9999')) # None
db.close()
Key-design tip. LevelDB stores keys in lexicographic byte order, so encode multi-part keys with sortable separators (e.g. b"user:0000001002" rather than b"user:1002") to keep numeric ranges contiguous.
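A minimal sketch of that encoding (the path and padding width are illustrative):

import plyvel

db = plyvel.DB('/tmp/users-padded.ldb', create_if_missing=True)

def user_key(user_id: int) -> bytes:
    # Zero-pad so byte order matches numeric order:
    # b'user:0000001002' < b'user:0000010000', but the unpadded
    # b'user:1002' sorts *after* b'user:10000'.
    return f'user:{user_id:010d}'.encode()

db.put(user_key(1002), b'...')
db.put(user_key(10000), b'...')

# The numeric range [1000, 20000) is now one contiguous byte range
for key, _ in db.iterator(start=user_key(1000), stop=user_key(20000)):
    print(key)
db.close()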
The original LevelDB API is C++. The shape is DB::Open → Put/Get/Delete → delete db.
#include <leveldb/db.h>
#include <cassert>
#include <iostream>
#include <string>
int main() {
leveldb::DB* db = nullptr;
leveldb::Options options;
options.create_if_missing = true;
leveldb::Status status = leveldb::DB::Open(options, "/tmp/users.ldb", &db);
assert(status.ok());
// Put
db->Put(leveldb::WriteOptions(), "user:1001",
R"({"first":"Alice","last":"Smith"})");
// Get
std::string value;
status = db->Get(leveldb::ReadOptions(), "user:1001", &value);
if (status.ok()) std::cout << value << std::endl;
// Delete
db->Delete(leveldb::WriteOptions(), "user:1001");
delete db;
return 0;
}
Compile with the system library:
g++ -std=c++17 leveldb_demo.cpp -o leveldb_demo -lleveldb -lpthread -lsnappy
./leveldb_demo
Because keys are stored in sorted order, iterators are the canonical way to read multiple records. Range scans are O(log n) seek + O(k) sequential.
import plyvel
db = plyvel.DB('/tmp/users.ldb')
# Full forward scan
for key, value in db:
print(key, value)
# Range scan: keys in [start, stop)
for key, value in db.iterator(start=b'user:1001', stop=b'user:2000'):
print(key, value)
# Prefix scan
for key, value in db.iterator(prefix=b'user:'):
print(key, value)
# Reverse scan
for key, value in db.iterator(reverse=True):
print(key, value)
# Keys-only / values-only (skip the half you don't need)
for key in db.iterator(include_value=False):
print(key)
Composite indexes via key prefixes. A common pattern is to use prefix-based "tables" inside one LevelDB: b"user:...", b"order:...", b"order_by_user:<user_id>:<order_id>". Prefix scans then act as lookups by secondary key.
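As a sketch of that pattern (the order records are hypothetical), a prefix scan over the index keys acts as a lookup by user:

import plyvel

db = plyvel.DB('/tmp/shop.ldb', create_if_missing=True)

# Primary records plus a secondary "table" in the
# b"order_by_user:<user_id>:<order_id>" shape described above
db.put(b'order:5001', b'{"user":1001,"total":30}')
db.put(b'order:5002', b'{"user":1001,"total":12}')
db.put(b'order_by_user:1001:5001', b'')
db.put(b'order_by_user:1001:5002', b'')

# All orders for user 1001 via the index
for key, _ in db.iterator(prefix=b'order_by_user:1001:'):
    order_id = key.rsplit(b':', 1)[1]
    print(db.get(b'order:' + order_id))
db.close()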
LevelDB has no multi-key transactions, but a WriteBatch commits a group of writes atomically — either all of them apply or none of them do.
import plyvel
db = plyvel.DB('/tmp/users.ldb')
with db.write_batch() as wb:
wb.put(b'user:1001', b'{"first":"Alice"}')
wb.put(b'user:1002', b'{"first":"Bob"}')
wb.delete(b'user:1003')
# The batch is committed atomically when the `with` block exits.
# If an exception is raised, nothing is written.
# Synchronous (fsync) batch — slower but durable across power loss
with db.write_batch(sync=True) as wb:
wb.put(b'config:version', b'42')
Sync vs. async writes. The default write does not fsync; data is durable on process crash but may be lost on power loss. Pass sync=True when correctness depends on the write surviving an OS crash.
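A rough way to feel the cost (the path and counts are arbitrary; absolute numbers depend entirely on the disk and filesystem):

import time
import plyvel

db = plyvel.DB('/tmp/sync-demo.ldb', create_if_missing=True)

def bench(sync: bool, n: int = 500) -> float:
    start = time.perf_counter()
    for i in range(n):
        db.put(f'k:{i}'.encode(), b'v', sync=sync)
    return time.perf_counter() - start

print(f'async: {bench(False):.3f}s')  # buffered writes, no fsync
print(f'sync : {bench(True):.3f}s')   # one fsync per write
db.close()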
A snapshot pins a logical view of the database at a point in time. Subsequent reads against that snapshot see the data exactly as it was, even as concurrent writes commit.
import plyvel
db = plyvel.DB('/tmp/users.ldb')
# Capture a snapshot
snap = db.snapshot()
# Concurrent writer modifies the database
db.put(b'user:1001', b'{"first":"Alice","city":"NYC"}')
# Reads against the snapshot still see the *old* value
print(snap.get(b'user:1001')) # original value, not "NYC"
print(db.get(b'user:1001')) # current value
snap.close()
Snapshots are cheap (a single sequence number) and are the standard tool for consistent multi-key reads, point-in-time iteration, and producing backups while writes continue.
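For example, point-in-time iteration (a sketch; the extra write just demonstrates the isolation):

import plyvel

db = plyvel.DB('/tmp/users.ldb', create_if_missing=True)

snap = db.snapshot()
db.put(b'user:2000', b'{"first":"Zoe"}')  # committed after the snapshot

# Iterates the frozen view: b'user:2000' does not appear
for key, value in snap.iterator(prefix=b'user:'):
    print(key, value)

snap.close()
db.close()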
The most-touched options for production deployments:
- write_buffer_size: size of the in-memory MemTable before it flushes to L0. Larger means fewer flushes but longer recovery. Default 4 MiB; raise to 64–128 MiB for write-heavy workloads.
- block_size: size of an SSTable block. Larger blocks give better compression and scan throughput; smaller blocks give lower point-read latency. Default 4 KiB.
- block_cache: LRU cache for uncompressed blocks. Size it to a meaningful fraction of working-set bytes for read-heavy workloads.
- max_open_files: cap on simultaneously held SSTable file handles. Raise it on large databases.
- compression: snappy (default) or none. Snappy is almost free in CPU; turn it off only when storing already-compressed payloads.
- paranoid_checks: verify checksums on every read. Costs CPU; turns silent disk corruption into errors.

import plyvel
db = plyvel.DB(
'/tmp/big.ldb',
create_if_missing=True,
write_buffer_size=128 * 1024 * 1024, # 128 MiB MemTable
max_open_files=2000,
lru_cache_size=512 * 1024 * 1024, # 512 MiB block cache
block_size=8 * 1024,
compression='snappy',
)
RocksDB is Facebook's hard fork of LevelDB. The on-disk format is largely compatible, but RocksDB adds substantially more functionality. Pick LevelDB for embedded simplicity; pick RocksDB for production scale.
| Capability | LevelDB | RocksDB |
|---|---|---|
| Column families (logical namespaces) | No | Yes |
| Multi-threaded compaction | Single-threaded | Multi-threaded |
| Bloom filters | Manual | Built-in, per-CF |
| Backup & checkpoint API | No | Yes |
| TTL on keys | No | Yes |
| Transactions (optimistic / pessimistic) | No | Yes |
| Compression algorithms | Snappy | Snappy, LZ4, Zstd, ZLIB |
| Tuning surface | Small | Large |
| Code size | ~25k LOC | ~400k LOC |
Good fits: embedded single-process storage (desktop apps, CLI tools, local caches), data with sorted keys and range scans, write-heavy workloads that benefit from sequential I/O.
Poor fits: multi-process or networked access, workloads that need column families, transactions, TTL, or backup APIs (use RocksDB), and anything relational (use a SQL database).
A write is appended to the WAL on disk, then inserted into the in-memory MemTable (a skiplist). When the MemTable hits write_buffer_size, it becomes immutable and is flushed to a Level-0 SSTable. Background compaction later merges L0 into L1, L1 into L2, etc., maintaining sorted, non-overlapping files at each level past L0.
Writes only touch the MemTable and the WAL, both of which are sequential. Reads, however, may have to consult the MemTable, an immutable MemTable being flushed, every L0 file (they can overlap), and then one file per deeper level. Bloom filters and the block cache mitigate this, but a write-heavy workload that hasn't compacted yet can show high read amplification.
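LevelDB exposes this structure through built-in properties, reachable from plyvel via get_property (the write loop is only there to force a few flushes and compactions):

import plyvel

db = plyvel.DB('/tmp/lsm-demo.ldb', create_if_missing=True)

# Write enough to flush several MemTables
for i in range(200_000):
    db.put(f'key:{i:08d}'.encode(), b'x' * 100)

# File counts per level: L0 files may overlap, deeper levels never do
for level in range(7):
    n = db.get_property(f'leveldb.num-files-at-level{level}'.encode())
    print(f'L{level}: {n.decode()} files')

print(db.get_property(b'leveldb.stats').decode())
db.close()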
WriteBatch groups multiple put/delete operations into a single sequence-numbered commit. Either all writes in the batch are visible after the WAL fsync or none are. There is no isolation across batches and no rollback — if you need cross-key MVCC, use RocksDB transactions or a database that supports them.
The database directory holds an exclusive file lock (LOCK file). LevelDB has no client/server protocol or distributed coordination, so two processes writing simultaneously would corrupt the LSM invariants. For multi-process access, run a server in front of it (e.g. a custom HTTP wrapper) or use a database designed for that — FoundationDB, Redis, Postgres.
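The lock is easy to observe: a second open of the same directory fails, even from within the same process (a quick sketch; the path is arbitrary):

import plyvel

db1 = plyvel.DB('/tmp/lock-demo.ldb', create_if_missing=True)
try:
    db2 = plyvel.DB('/tmp/lock-demo.ldb')
except plyvel.IOError as exc:
    print('second open refused:', exc)  # the LOCK file is already held
db1.close()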
Encode the secondary key in the key prefix and store the primary key (or full record) as the value. For example, to index users by city: write b"user:<id>" for the canonical record and b"by_city:<city>:<id>" with an empty value for the index. A prefix scan on b"by_city:Seattle:" yields all matching IDs. Maintain index and primary entries inside one WriteBatch for atomicity.
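A sketch of that pattern (the city values are illustrative):

import plyvel

db = plyvel.DB('/tmp/users-indexed.ldb', create_if_missing=True)

def put_user(user_id: bytes, record: bytes, city: bytes) -> None:
    # Primary record and secondary-index entry commit atomically
    with db.write_batch() as wb:
        wb.put(b'user:' + user_id, record)
        wb.put(b'by_city:' + city + b':' + user_id, b'')

put_user(b'1001', b'{"first":"Alice"}', b'Seattle')
put_user(b'1004', b'{"first":"Dana"}', b'Seattle')

# All Seattle user IDs via the index
for key, _ in db.iterator(prefix=b'by_city:Seattle:'):
    print(key.rsplit(b':', 1)[1])
db.close()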
When you want a small, audited, embedded library with a tiny tuning surface — e.g. embedding into an application, shipping a CLI tool, or as a teaching example. RocksDB is the better default for any production server-side workload because it has multi-threaded compaction, bloom filters per column family, transactions, backup/checkpoint, and a much richer set of compression and TTL options. LevelDB stays in the picture mainly because of code-size, simplicity, and the systems that already depend on it.