Apache Paimon
Apache Paimon (formerly Flink Table Store) is a streaming-first open table format that originated in the Apache Flink community and graduated to a top-level Apache project in 2024. Where Hudi, Iceberg, and Delta evolved from batch-first roots, Paimon was designed from day one for high-frequency CDC and real-time ingest, using an LSM-tree storage layer rather than the snapshot-of-Parquet-files model.
Key Features:
- LSM-Tree Storage. Sorted runs and background compaction give cheap, frequent updates — the right shape for CDC streams that change a small fraction of rows per second.
- Streaming Source & Sink. First-class Flink integration; tables work as streaming sources with sub-minute latency and as sinks for exactly-once writes.
- Changelog Production. Reading a table as a changelog (insert / update / delete events) is a primitive operation, not a derived one.
- Primary Key Tables. Natural UPSERT and DELETE semantics by primary key, like a database.
- Multi-Engine Support. Flink, Spark, Trino, StarRocks, Doris, and Hive can all read Paimon tables.
- Hive Metastore Compatible. Works with existing Hive / AWS Glue catalogs, easing adoption.
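To make the LSM-tree and primary-key bullets concrete, here is a minimal sketch (not the Paimon API, just the storage idea): writes land in a memtable, flushes produce immutable sorted runs, reads merge runs newest-first so the latest value per key wins, and compaction folds runs together and drops delete tombstones. All names here are hypothetical.

```python
import bisect

TOMBSTONE = object()  # marks a deleted key inside a run

class LsmTable:
    """Toy LSM-style primary-key table: memtable + sorted runs + compaction."""

    def __init__(self, flush_threshold=4):
        self.memtable = {}       # key -> value (or TOMBSTONE)
        self.runs = []           # newest run first; each run is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def upsert(self, key, value):
        self.memtable[key] = value
        self._maybe_flush()

    def delete(self, key):
        self.memtable[key] = TOMBSTONE   # deletes are cheap: just a marker
        self._maybe_flush()

    def _maybe_flush(self):
        # Flushing is an append of one sorted file; no existing data is rewritten,
        # which is why frequent small commits stay cheap.
        if len(self.memtable) >= self.flush_threshold:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            v = self.memtable[key]
            return None if v is TOMBSTONE else v
        for run in self.runs:            # newest run shadows older ones
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                v = run[i][1]
                return None if v is TOMBSTONE else v
        return None

    def compact(self):
        # Background full compaction: merge all runs, keep the newest value
        # per key, and physically drop tombstoned rows.
        merged = {}
        for run in reversed(self.runs):  # oldest first, so newer values overwrite
            for k, v in run:
                merged[k] = v
        self.runs = [sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)]
```

Usage: two small commits produce two sorted runs, reads still see the latest row per key, and compaction collapses everything into one run.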
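The changelog bullet can also be sketched in a few lines. Flink tags each row with a RowKind: `+I` (insert), `-U`/`+U` (update before/after), `-D` (delete); in Paimon, emitting these is configured per table (the `changelog-producer` option). The function below is a hypothetical simulation against in-memory state, not Paimon's implementation.

```python
def changelog(events, state=None):
    """Derive RowKind-tagged change events from primary-key upserts/deletes.

    events: iterable of ('upsert' | 'delete', key, value) tuples.
    """
    state = {} if state is None else state
    out = []
    for op, key, value in events:
        if op == "upsert":
            if key not in state:
                out.append(("+I", key, value))       # first write: insert
            else:
                out.append(("-U", key, state[key]))  # retract the old value
                out.append(("+U", key, value))       # emit the new value
            state[key] = value
        elif op == "delete" and key in state:
            out.append(("-D", key, state.pop(key)))  # retract the whole row
    return out
```

Downstream streaming jobs (aggregations, joins) can consume this event stream directly instead of re-deriving diffs between snapshots.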
Paimon vs. Hudi vs. Iceberg vs. Delta:
- Paimon — LSM-tree, streaming-native, primary-key UPSERT first.
- Hudi — Originally streaming-friendly via merge-on-read; a broader ecosystem of table services (compaction, clustering, indexing).
- Iceberg — Snapshot-of-Parquet, batch-first, becoming streaming-capable with V2 deletes.
- Delta — Snapshot-of-Parquet plus transaction log; strong batch + Spark Streaming.
Use Cases:
- CDC ingest from operational databases at high update rates (thousands of upserts/sec per table).
- Real-time materialized views in a Flink-centric lakehouse.
- Streaming joins between fact and dimension tables with sub-minute freshness.
- Workloads where Hudi’s merge-on-read isn’t fast enough on tiny commits.