Amazon Athena
Amazon Athena is a serverless, interactive query service that runs standard SQL directly against data in Amazon S3 — no cluster to provision, no data to load. It's powered by Trino (for SQL) and Apache Spark (for notebook workloads) and integrates with the AWS Glue Data Catalog for schema and table metadata.
Key Features:
- Serverless SQL over S3: Issue SELECT queries against Parquet, ORC, Avro, JSON, CSV, Iceberg, and Hudi tables without provisioning compute.
- Two Pricing Models: Per-TB scanned on the default engine, or per-DPU-hour via provisioned capacity (Workgroups) for predictable-cost workloads.
- Glue Data Catalog Integration: Tables defined once in Glue are queryable from Athena, Redshift Spectrum, EMR, and SageMaker.
- Federated Queries: Connectors query RDS, DynamoDB, Redshift, DocumentDB, HBase, and custom sources — join S3 data with operational stores in one query.
- Iceberg & Hudi Support: ACID tables with time travel, schema evolution, row-level updates/deletes.
- Athena for Spark: Notebook-style PySpark on the same data lake, managed without an EMR cluster.
- CTAS & INSERT: Create partitioned, columnar tables from query results to materialize aggregates and accelerate downstream queries.
- Lake Formation Governance: Fine-grained row-, column-, and cell-level access control.
Common Use Cases:
- Ad-hoc Data Lake Exploration: Analysts run SQL against raw and curated S3 zones without copying data into a warehouse.
- Log Analytics: Query CloudTrail, VPC Flow Logs, ALB access logs, and application logs landed in S3 by Firehose.
- Cost-Effective Reporting: Backing store for QuickSight and Grafana dashboards over historical data.
- Federated Analytics: Join lake data with live RDS/DynamoDB tables for blended operational + historical reporting.
- Pipeline Validation: Use CTAS to materialize and verify ETL outputs before promotion.
Cost & Performance Best Practices:
- Partition tables on common filter columns (date, region) to prune scanned data.
- Use columnar formats (Parquet, ORC) with Snappy/ZSTD compression — often 5–10x fewer bytes scanned than JSON/CSV.
- Project only needed columns in SELECT — Athena bills on bytes scanned, not rows returned.
- Avoid SELECT * on large tables; prefer SELECT col_a, col_b.
- Compact small files periodically — Athena performs poorly over millions of tiny files.
Service Limits & Quotas:
- Concurrent queries: default soft limit of roughly 20–25 active DML queries per account, per Region (not per workgroup); raise via Service Quotas.
- Query timeout: 30 minutes for DML by default.
- Query string size: up to 262,144 bytes.
- Result size: no hard cap — results stream to your S3 output location.
- Partitions per table: millions are supported but query planning slows past ~10k partitions per query — use partition projection or Iceberg for large partition counts.
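Partition projection, mentioned above, is configured entirely through table properties. The property keys below are Athena's real partition-projection settings; the table, bucket, and date range are hypothetical, so adjust them to your layout:

```python
# Partition-projection TBLPROPERTIES for a hypothetical table partitioned by a
# dt=YYYY-MM-dd key. With these set, Athena derives partition locations from
# query predicates instead of listing partitions in the Glue catalog.
projection_props = {
    "projection.enabled": "true",
    "projection.dt.type": "date",
    "projection.dt.range": "2024-01-01,NOW",
    "projection.dt.format": "yyyy-MM-dd",
    # ${dt} is substituted with each projected value (bucket name assumed).
    "storage.location.template": "s3://my-data-lake/app_events/dt=${dt}/",
}

# Render the properties as an ALTER TABLE statement to run in Athena.
props_sql = ", ".join(f"'{k}' = '{v}'" for k, v in projection_props.items())
ddl = f"ALTER TABLE logs.app_events SET TBLPROPERTIES ({props_sql})"
print(ddl)
```

Once projection is enabled, Athena stops consulting Glue for that table's partition list, so partitions outside the projected range become invisible to queries.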
- Provisioned capacity: sold in DPU bundles (starting at 24 DPU); minimum hourly commitment applies.
Pricing Model:
- On-demand: charged per TB of data scanned, rounded up to the nearest megabyte with a 10 MB minimum per query.
- Provisioned capacity: charged per DPU-hour for reserved compute — predictable cost, isolation from noisy neighbours.
- S3 charges for source data and query results are billed separately.
- No charge for failed queries (except those caused by Lambda federated connector invocations).
- Common cost surprises: SELECT * on uncompressed JSON, missing partition predicates, large CTAS rewrites, and Lambda federated connectors that bill separately.
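The on-demand billing rules above can be turned into a quick back-of-the-envelope estimator. The rounding behavior follows the model described here; the $5/TB rate, the 8x Parquet compression ratio, and the 3-of-40 column projection are illustrative assumptions (rates vary by Region, ratios by dataset):

```python
import math

PRICE_PER_TB = 5.00  # assumed on-demand rate; check current pricing for your Region

def billed_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB) -> float:
    """On-demand charge for one query: bytes scanned rounded up to the
    nearest megabyte, with a 10 MB minimum per query."""
    mb = max(math.ceil(bytes_scanned / 1024**2), 10)
    return mb / 1024**2 * price_per_tb  # MB -> TB, then dollars

tiny = billed_cost(3 * 1024**2)         # 3 MB scanned still bills the 10 MB minimum
full_json = billed_cost(500 * 1024**3)  # full scan of 500 GiB of raw JSON
# Assumed ~8x size reduction from compressed Parquet, then 3 of 40 columns read.
parquet = billed_cost(int(500 * 1024**3 / 8 * 3 / 40))
print(f"minimum: ${tiny:.5f}  json: ${full_json:.2f}  parquet+pruning: ${parquet:.2f}")
```

The gap between the last two numbers is the practical argument for columnar formats plus column projection: the same logical question, two orders of magnitude apart in scan cost.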
Code Example:
SELECT
  date_trunc('day', event_time) AS day,
  region,
  count_if(status = 'error') AS errors,
  count(*) AS total
FROM logs.app_events
WHERE year = '2026' AND month = '04'
GROUP BY 1, 2
ORDER BY day, region;
Run a query programmatically via boto3:
import boto3, time

athena = boto3.client("athena", region_name="us-west-2")

resp = athena.start_query_execution(
    QueryString="SELECT count(*) FROM logs.app_events WHERE year='2026'",
    QueryExecutionContext={"Database": "logs"},
    WorkGroup="analytics",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = resp["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Row 0 of the result set is the header; row 1 holds the first data row.
rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
print(rows[1]["Data"][0]["VarCharValue"])
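In real pipelines the polling loop above is worth wrapping in a reusable helper, since boto3 ships no waiter for Athena queries. This is a sketch with an assumed timeout default; `athena` can be any object exposing get_query_execution (a boto3 Athena client in practice):

```python
import time

def wait_for_query(athena, query_execution_id, poll_seconds=1.0, timeout_seconds=1800):
    """Poll get_query_execution until the query reaches a terminal state.

    Returns the final state string ("SUCCEEDED", "FAILED", or "CANCELLED");
    raises TimeoutError if the query is still running at the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = athena.get_query_execution(QueryExecutionId=query_execution_id)[
            "QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return status["State"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_execution_id} still running after {timeout_seconds}s")
```

On a FAILED result, the same Status structure carries a StateChangeReason field, which is worth surfacing in error messages rather than just the state name.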
Athena vs. Redshift Spectrum vs. EMR:
- Athena: Best for ad-hoc, interactive SQL over S3 — serverless and pay-per-query.
- Redshift Spectrum: Queries S3 from within Redshift — pick it when you already have a Redshift warehouse and want to join warehouse tables with lake data.
- EMR / EMR Serverless: Full Spark/Hive/Presto workloads — pick it for heavy, long-running ETL or when you need engine-level control.
Common Interview Questions:
How does Athena bill queries, and what are the fastest ways to lower the bill?
You pay per TB scanned (10 MB minimum per query). The biggest reductions come from columnar formats (Parquet/ORC), partitioning on common predicates, projecting only required columns, and compacting small files. Provisioned capacity converts the per-TB charge into a fixed DPU-hour cost for predictable workloads.
What is the difference between Athena and Redshift Spectrum?
Both query S3 via the Glue catalog. Athena is fully serverless and ideal for ad-hoc SQL. Spectrum runs from inside a Redshift cluster, so it makes sense when you already pay for Redshift and want to join warehouse tables with external data.
How do you handle small-file problems in Athena?
Compact small files periodically (Glue jobs, Iceberg OPTIMIZE, or CTAS into a new partitioned table). Aim for files in the 128 MB to 1 GB range. Iceberg tables also support automatic compaction.
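The CTAS route from the answer above can be sketched as a small query builder. The table names and S3 path are hypothetical; the WITH properties (format, write_compression, external_location) are standard Athena CTAS options:

```python
def compaction_ctas(source_table: str, target_table: str, target_location: str) -> str:
    """Build a CTAS statement that rewrites a table's data into larger
    Snappy-compressed Parquet files (hypothetical names; pattern only)."""
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'PARQUET', "
        f"write_compression = 'SNAPPY', "
        f"external_location = '{target_location}') "
        f"AS SELECT * FROM {source_table}"
    )

sql = compaction_ctas(
    "logs.app_events",
    "logs.app_events_compacted",
    "s3://my-data-lake/app_events_compacted/",
)
print(sql)
```

After validating the new table, point readers at it and drop the old one. For Iceberg tables, OPTIMIZE ... REWRITE DATA USING BIN_PACK compacts in place, with no table swap needed.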
When would you choose Athena Spark over the SQL engine?
Choose Spark notebooks for iterative data engineering, ML feature work, or pipelines that need PySpark transformations beyond what SQL expresses. The SQL engine remains the right choice for declarative analytics.
How do partition projection and Iceberg help with very large partition counts?
Partition projection lets Athena compute partition values from query predicates instead of listing the Glue catalog (avoiding planning slowdowns past ~10k partitions). Iceberg uses hidden partitioning and metadata files, so the planner skips Glue's per-partition lookups entirely.
What is a federated query and what does it cost extra?
A federated query uses an Athena Data Source Connector (a Lambda function) to read from non-S3 sources like RDS, DynamoDB, or HBase. You pay for Lambda invocation and duration in addition to the standard Athena per-TB scan charge.
Athena is the everyday entry point to the S3-based data lake — it lets analysts and engineers query raw and curated data with SQL without standing up any infrastructure.