Amazon Data Firehose (formerly Kinesis Data Firehose)
Amazon Data Firehose is the serverless delivery service for streaming data. It captures, buffers, optionally transforms, and loads streams into S3, Redshift, OpenSearch, Splunk, Snowflake, HTTP endpoints, and Iceberg tables — with no cluster to manage and automatic scaling.
Firehose vs. Kinesis Data Streams:
- Data Streams is a durable, replayable stream consumers pull from — low latency (~200ms), custom processing, 24-hour to 365-day retention, shard-based scaling.
- Firehose is a push-and-forget delivery pipe: near-real-time rather than real-time (records sit in a buffer until a size or time threshold is reached), no replay, fully serverless, minimal code.
- Choose Firehose when you just need streaming data landed into S3/Redshift/OpenSearch; choose Data Streams when multiple consumers or custom processing need the raw feed.
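The two also compose: a common pattern keeps a Data Stream as the durable, replayable feed and attaches a Firehose stream as one of its consumers to land everything in S3. A minimal sketch of that wiring is below; the stream, bucket, and role ARNs are placeholders, not values from this guide:
import boto3
firehose = boto3.client("firehose", region_name="us-west-2")
# Hypothetical ARNs; substitute your own stream, bucket, and IAM roles.
firehose.create_delivery_stream(
    DeliveryStreamName="clicks-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",  # read from an existing Kinesis Data Stream
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-west-2:123456789012:stream/clicks",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-role",
        "BucketARN": "arn:aws:s3:::my-data-lake",
        "Prefix": "clicks/",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
    },
)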
Key Features:
- Managed Delivery to Many Sinks: S3 (Parquet/JSON/raw), Redshift via COPY, OpenSearch, Splunk, Datadog, New Relic, Snowflake, Iceberg tables, generic HTTP.
- Format Conversion: On-the-fly JSON to Parquet/ORC with Glue Data Catalog schema.
- Lambda Transforms: Invoke a Lambda function per batch to enrich, mask, or reshape records before delivery.
- Dynamic Partitioning: Route records into partitioned S3 prefixes derived from record content (e.g., year=2026/month=04/tenant=acme/); a configuration sketch follows this list.
- Buffering Controls: Tune buffer size (1–128 MiB) and interval (60–900 s) to trade latency for file size and cost.
- Error Handling: Failed records delivered to a configurable S3 error prefix for reprocessing.
- Server-Side Encryption: SSE-KMS at rest; TLS in transit; VPC endpoints supported for private delivery.
- Source Choices: Direct PUT API, a Kinesis Data Stream, MSK topic, CloudWatch Logs subscription, or AWS WAF logs.
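Most of these features are switches on the stream's destination configuration. The sketch below shows, under assumed names (bucket, IAM role, Glue database and table, a tenant field used as the partition key), what a Direct PUT stream with Parquet conversion, dynamic partitioning, tuned buffering, and an error prefix might look like; treat the specific values as illustrative rather than recommended:
import boto3
firehose = boto3.client("firehose", region_name="us-west-2")
# Hypothetical ARNs and Glue catalog names; replace with your own.
firehose.create_delivery_stream(
    DeliveryStreamName="events-to-lake",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-data-lake",
        # Dynamic-partitioning keys are referenced by name in the prefix expression.
        "Prefix": "events/tenant=!{partitionKeyFromQuery:tenant}/dt=!{timestamp:yyyy-MM-dd}/",
        "ErrorOutputPrefix": "errors/events/!{firehose:error-output-type}/",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "DynamicPartitioningConfiguration": {"Enabled": True},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                # Extract the partition key from each JSON record with a JQ expression.
                "Type": "MetadataExtraction",
                "Parameters": [
                    {"ParameterName": "MetadataExtractionQuery", "ParameterValue": "{tenant: .tenant}"},
                    {"ParameterName": "JsonParsingEngine", "ParameterValue": "JQ-1.6"},
                ],
            }],
        },
        # JSON -> Parquet conversion against a Glue table schema.
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "SchemaConfiguration": {
                "DatabaseName": "analytics",
                "TableName": "events",
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            },
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
        },
    },
)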
Common Use Cases:
- Clickstream Ingestion: Land web/mobile events into an S3 data lake in Parquet, partitioned by date.
- Centralized Logging: Fan-out application logs to OpenSearch for search and S3 for cold archive.
- CDC Landing: Deliver DMS or database CDC streams into Iceberg tables.
- SIEM Feeds: Stream security events to Splunk, Datadog, or a custom SIEM.
- Real-Time Analytics Prep: Buffer IoT telemetry into Parquet for Athena queries.
Service Limits & Quotas:
- Throughput per stream with Direct PUT (US East (N. Virginia), US West (Oregon), Europe (Ireland)): default soft limit of 5 MiB/s, 500,000 records/s, or 2,000 requests/s, whichever is reached first. Other regions have lower defaults.
- Record size: up to 1,000 KiB per record (before base64 encoding).
- Buffer size: 1–128 MiB.
- Buffer interval: 60–900 s.
- Max Firehose streams per region: default soft limit 5,000 (can be raised via a quota increase).
- Lambda transform payload: up to 6 MiB per invocation; transformed records up to 6 MiB.
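One related quota not listed above: a single PutRecordBatch call accepts at most 500 records and 4 MiB in total, so producers typically chunk their batches. A rough helper sketch, where the record shape and sizes are assumptions for illustration:
import boto3, json
firehose = boto3.client("firehose", region_name="us-west-2")
records = [{"Data": (json.dumps({"i": i}) + "\n").encode()} for i in range(1200)]
def chunk_for_put_record_batch(records, max_records=500, max_bytes=4 * 1024 * 1024):
    # Yield sub-batches that stay under the documented per-call PutRecordBatch limits.
    batch, batch_bytes = [], 0
    for rec in records:
        size = len(rec["Data"])
        if batch and (len(batch) >= max_records or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(rec)
        batch_bytes += size
    if batch:
        yield batch
for batch in chunk_for_put_record_batch(records):
    firehose.put_record_batch(DeliveryStreamName="events-to-lake", Records=batch)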
Pricing Model:
- Ingestion: billed per GB ingested (rounded up to the nearest 5 KB per record — small records are surprisingly expensive).
- Format conversion: additional per-GB charge for Parquet/ORC conversion.
- Dynamic partitioning: per-GB and per-object surcharge applies.
- VPC delivery: hourly per-AZ charge plus per-GB processed.
- Downstream sinks: S3 storage, Redshift cluster, OpenSearch domain billed separately.
- Common cost surprises: the 5 KB rounding multiplies cost for tiny JSON events (see the worked example below); high-cardinality dynamic partitioning; CloudWatch Logs subscriptions that feed Firehose a stream of very small records.
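To see how the 5 KB rounding plays out, here is a back-of-the-envelope calculation with made-up volumes and a placeholder per-GB rate (not a quoted price):
# Hypothetical numbers: 0.5 KB JSON events, a placeholder ingestion price of $0.03/GB.
events_per_month = 1_000_000_000
actual_kb = 0.5
billed_kb = 5          # each record is rounded up to the next 5 KB increment
price_per_gb = 0.03    # placeholder rate; check current regional pricing
actual_gb = events_per_month * actual_kb / (1024 * 1024)
billed_gb = events_per_month * billed_kb / (1024 * 1024)
print(f"actual ~{actual_gb:,.0f} GB, billed ~{billed_gb:,.0f} GB "
      f"(~{billed_gb / actual_gb:.0f}x), ~${billed_gb * price_per_gb:,.0f}/month")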
Code Example:
import boto3, json
firehose = boto3.client("firehose", region_name="us-west-2")
# Send a batch of events to a Firehose stream that delivers to S3 as Parquet
records = [
    {"Data": (json.dumps({"user_id": uid, "event": "click", "ts": 1714000000}) + "\n").encode()}
    for uid in (101, 102, 103)
]
resp = firehose.put_record_batch(
    DeliveryStreamName="events-to-lake",
    Records=records,
)
print(f"Failed: {resp['FailedPutCount']}/{len(records)}")
A matching Lambda transform receives base64-encoded records and returns enriched output:
import base64, json
def lambda_handler(event, _ctx):
    out = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["ingested_at"] = "2026-04-25T00:00:00Z"
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(payload) + "\n").encode()).decode(),
        })
    return {"records": out}
Common Interview Questions:
When would you choose Firehose over Kinesis Data Streams?
Choose Firehose when you only need to land data in a managed sink (S3, Redshift, OpenSearch, Splunk, Snowflake, Iceberg) and you don't need replay or multiple custom consumers. Choose Data Streams when you need ordered replay, multiple independent consumers, or sub-second latency.
How does Firehose batch records, and how does that affect downstream files?
Firehose buffers until either the buffer size (1–128 MiB) or interval (60–900 s) threshold is reached, then writes one file per delivery. Tuning these values controls file size in S3 — too small and you create the small-file problem in Athena; too large and you delay availability.
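A quick way to reason about resulting file sizes is to check which threshold fires first. A small calculation with hypothetical ingest rates against assumed buffer hints of 64 MiB / 300 s:
# Hypothetical ingest rates against assumed buffer hints of 64 MiB / 300 s.
buffer_mib, interval_s = 64, 300
for ingest_mib_per_s in (0.1, 0.5, 2.0):
    seconds_to_fill = buffer_mib / ingest_mib_per_s
    if seconds_to_fill < interval_s:
        print(f"{ingest_mib_per_s} MiB/s -> size threshold first: ~{buffer_mib} MiB files every ~{seconds_to_fill:.0f} s")
    else:
        print(f"{ingest_mib_per_s} MiB/s -> interval first: ~{ingest_mib_per_s * interval_s:.0f} MiB files every {interval_s} s")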
How do you transform records mid-stream?
Attach a Lambda transform. Firehose invokes it per batch with base64-encoded records and expects each record back with a result of Ok, Dropped, or ProcessingFailed. Failed records are sent to the error S3 prefix.
How does dynamic partitioning work and what does it cost?
Dynamic partitioning extracts values from each record (via JQ expressions or Lambda) and uses them to build the S3 prefix at write time, e.g. tenant=acme/year=2026/month=04/. It adds per-GB and per-object surcharges and works best when partition cardinality is bounded.
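For illustration, the extracted keys are referenced by name in the S3 prefix expression; the key names below are assumptions, not values from this guide:
# Keys extracted by the built-in JQ (MetadataExtraction) processor:
prefix_jq = "tenant=!{partitionKeyFromQuery:tenant}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/"
# Keys returned by a Lambda transform in each record's metadata.partitionKeys:
prefix_lambda = "tenant=!{partitionKeyFromLambda:tenant}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/"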
How is JSON to Parquet conversion configured?
Enable record format conversion on the delivery stream and reference a Glue table with the target schema. Firehose deserializes JSON, casts to the schema, and writes Parquet files — billed per GB processed. Mismatched schemas produce conversion errors that go to the error prefix.
What happens to records that fail delivery?
Firehose retries with exponential backoff for the configured retry duration. After that, records are written to the configured S3 error prefix (under a sub-prefix per failure type: processing-failed, format-conversion-failed, delivery-failed) so you can reprocess them.
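Reprocessing typically starts by listing what landed under the error prefix. A minimal sketch, assuming the bucket and error prefix from the configuration sketch earlier in this section:
import boto3
s3 = boto3.client("s3")
# List failed-record objects under the assumed error prefix; the exact key layout
# depends on your ErrorOutputPrefix expression.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-data-lake", Prefix="errors/events/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])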