Amazon Athena

Amazon Athena is a serverless, interactive query service that runs standard SQL directly against data in Amazon S3 — no cluster to provision, no data to load. It's powered by Trino (for SQL) and Apache Spark (for notebook workloads) and integrates with the AWS Glue Data Catalog for schema and table metadata.


Key Features:


Common Use Cases:


Cost & Performance Best Practices:


Service Limits & Quotas:


Pricing Model:


Code Example:


SELECT
    date_trunc('day', event_time)                  AS day,
    region,
    count_if(status = 'error')                     AS errors,
    count(*)                                       AS total
FROM   logs.app_events
WHERE  year = '2026' AND month = '04'
GROUP BY 1, 2
ORDER BY day, region;
  

Run the same query programmatically via boto3:


import boto3, time

athena = boto3.client("athena", region_name="us-west-2")

resp = athena.start_query_execution(
    QueryString="SELECT count(*) FROM logs.app_events WHERE year='2026'",
    QueryExecutionContext={"Database": "logs"},
    WorkGroup="analytics",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = resp["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
print(rows[1]["Data"][0]["VarCharValue"])
  


Athena vs. Redshift Spectrum vs. EMR:


Common Interview Questions:

How does Athena bill queries, and what are the fastest ways to lower the bill?

You pay per TB scanned (10 MB minimum per query). The biggest reductions come from columnar formats (Parquet/ORC), partitioning on common predicates, projecting only required columns, and compacting small files. Provisioned capacity converts the per-TB charge into a fixed DPU-hour cost for predictable workloads.

What is the difference between Athena and Redshift Spectrum?

Both query S3 via the Glue catalog. Athena is fully serverless and ideal for ad-hoc SQL. Spectrum runs from inside a Redshift cluster, so it makes sense when you already pay for Redshift and want to join warehouse tables with external data.

How do you handle small-file problems in Athena?

Compact small files periodically (Glue jobs, Iceberg OPTIMIZE, or CTAS into a new partitioned table). Aim for files in the 128 MB to 1 GB range. Iceberg tables also support automatic compaction.

When would you choose Athena Spark over the SQL engine?

Choose Spark notebooks for iterative data engineering, ML feature work, or pipelines that need PySpark transformations beyond what SQL expresses. The SQL engine remains the right choice for declarative analytics.

How do partition projection and Iceberg help with very large partition counts?

Partition projection lets Athena compute partition values from query predicates instead of listing the Glue catalog (avoiding planning slowdowns past ~10k partitions). Iceberg uses hidden partitioning and metadata files, so the planner skips Glue's per-partition lookups entirely.

What is a federated query and what does it cost extra?

A federated query uses an Athena Data Source Connector (a Lambda function) to read from non-S3 sources like RDS, DynamoDB, or HBase. You pay for Lambda invocation and duration in addition to the standard Athena per-TB scan charge.


Athena is the everyday entry point to the S3-based data lake — it lets analysts and engineers query raw and curated data with SQL without standing up any infrastructure.