AWS Glue ETL Service

AWS Glue is a fully managed, serverless ETL service that runs Apache Spark (and Python shell) jobs to extract, transform, and load data across AWS services. It pairs with the Glue Data Catalog (metadata), Glue Crawlers (schema discovery), and Glue Workflows (orchestration) to form an end-to-end data integration platform — without you provisioning Spark clusters.


Key Features:

- Serverless Spark and Python shell runtimes (no clusters to provision or patch)
- Job bookmarks for incremental processing of new files and partitions
- Tight integration with the Glue Data Catalog and Crawlers for schema discovery
- Glue Studio visual authoring and Glue Workflows for orchestration
- Auto-scaling workers and Flex execution for cost control
- Streaming jobs over Kinesis and MSK via Spark Structured Streaming
- Support for open table formats (Apache Iceberg, Delta Lake, Apache Hudi) in recent runtimes


Common Use Cases:

- Batch ETL from raw S3 data into curated, partitioned Parquet
- Loading and transforming data into Redshift or other warehouses
- Incremental ingestion pipelines driven by job bookmarks
- Near-real-time lake hydration from Kinesis or MSK streams
- Populating and maintaining the Data Catalog with crawlers


Service Limits & Quotas:

- Most limits (concurrent job runs, DPUs per account, crawlers) are per-Region soft quotas, adjustable via Service Quotas
- Spark jobs need a minimum of 2 workers; streaming jobs a minimum of 2 DPUs
- Batch jobs have a default timeout of 48 hours; streaming jobs have none and run until stopped


Pricing Model:

- Billed per DPU-hour, metered per second with a 1-minute minimum (Glue 2.0+)
- Spark jobs: roughly $0.44 per DPU-hour in us-east-1; a G.1X worker is 1 DPU, a G.2X worker is 2
- Python shell jobs can run on fractional capacity (0.0625 or 1 DPU)
- Flex execution is about 35% cheaper in exchange for no start-time SLA
- Crawlers and Data Catalog storage/requests are billed separately


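The bill is simple multiplication. A sketch of the math, assuming the common $0.44/DPU-hour Spark rate (rates vary by Region and Glue version):

```python
def glue_spark_cost(dpus: int, runtime_minutes: float,
                    rate_per_dpu_hour: float = 0.44) -> float:
    """Estimate the cost of one job run: per-second billing, 1-minute minimum."""
    billed_minutes = max(runtime_minutes, 1.0)
    return dpus * (billed_minutes / 60.0) * rate_per_dpu_hour

# 10 G.1X workers (10 DPUs) running for 30 minutes:
print(round(glue_spark_cost(10, 30), 2))  # → 2.2
```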
Code Example:

A minimal batch job: read a table from the Data Catalog, filter rows, and write partitioned Parquet back to S3.


import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from a catalog table; bookmarks let reruns skip already-processed files
src = glueContext.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="user_events",
    transformation_ctx="src",
)

adults = Filter.apply(frame=src, f=lambda r: r["age"] is not None and r["age"] >= 30)

# Write Parquet output partitioned by event_date
glueContext.write_dynamic_frame.from_options(
    frame=adults,
    connection_type="s3",
    connection_options={
        "path": "s3://my-lake/curated/user_events/",
        "partitionKeys": ["event_date"],
    },
    format="parquet",
    transformation_ctx="sink",
)

job.commit()


DynamicFrame vs. DataFrame:

A DynamicFrame is Glue's schema-flexible wrapper around a Spark DataFrame. Each record is self-describing, so one frame can carry a field whose type varies across rows (a "choice" type) and resolve the ambiguity later with ResolveChoice. A DataFrame needs a single fixed schema up front but exposes the full Spark SQL API and Catalyst optimizations. Convert between them with toDF() and fromDF(): a common pattern is to read and clean with DynamicFrames, then drop to a DataFrame for complex joins and aggregations.


Common Interview Questions:

What is a job bookmark and when does it help?

A bookmark stores state about which source files or partitions a job has already processed. On the next run, Glue skips them. It saves time and money on incremental ingest jobs but must be reset (or disabled) when reprocessing historical data.
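The semantics can be sketched in plain Python, with a hypothetical `run_job` helper and a set standing in for the state that Glue actually persists server-side per job and transformation_ctx:

```python
# Illustrative sketch of job-bookmark semantics only; Glue stores and
# restores this state for you between runs.
def run_job(source_files, bookmark):
    """Process only files the bookmark has not seen; return the new state."""
    new_files = [f for f in source_files if f not in bookmark]
    for f in new_files:
        pass  # extract/transform/load f
    return bookmark | set(new_files)  # job.commit() persists this

# First run processes everything; the rerun skips already-seen files.
state = run_job(["s3://my-lake/day=1/a.json", "s3://my-lake/day=1/b.json"], set())
state = run_job(["s3://my-lake/day=1/a.json", "s3://my-lake/day=2/c.json"], state)
# "Resetting" the bookmark (state = set()) forces a full reprocess.
```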

When would you pick Glue over EMR or EMR Serverless?

Glue is best when you want a managed Spark experience tightly integrated with the Data Catalog and you're happy with the Glue runtime. EMR/EMR Serverless wins when you need a specific Spark version, custom Hadoop ecosystem components, or fine-grained cluster control. EMR is often cheaper for very large, long-running clusters.

What is Flex execution and when should you avoid it?

Flex runs jobs on spare capacity at ~35% lower cost but with no SLA on start/run time — ideal for nightly batches and backfills. Avoid it for time-critical jobs (e.g., hourly SLAs, dashboards) where unpredictable start latency is unacceptable.
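Flex is chosen per job or per run. A sketch with the AWS CLI, assuming a job named `nightly-backfill` already exists:

```shell
# Run this execution on spare capacity; use STANDARD (the default)
# when start latency matters.
aws glue start-job-run \
    --job-name nightly-backfill \
    --execution-class FLEX
```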

How do you handle schema drift?

Use DynamicFrames and ResolveChoice transforms to surface and resolve type ambiguity. For curated outputs, write to Iceberg or Delta where schema evolution is explicit and ACID-safe.
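Continuing the script from the code example above, a sketch of resolving a drifted column, assuming crawlers have seen `age` as both int and string:

```python
from awsglue.transforms import ResolveChoice

# The catalog recorded "age" as a choice type (int or string);
# cast every value to long so the output has one consistent schema.
resolved = ResolveChoice.apply(
    frame=src,
    specs=[("age", "cast:long")],
    transformation_ctx="resolved",
)
```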

How do you optimize cost on a Spark Glue job?

Right-size worker type (G.1X for typical workloads, G.2X+ only for memory-bound), enable auto-scaling, use Flex when SLA permits, enable bookmarks for incrementals, prune partitions in create_dynamic_frame.from_catalog, and prefer Parquet/ZSTD output to keep downstream scans cheap.
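Partition pruning is a one-parameter change to the read in the code example above. A sketch, assuming `event_date` is a partition column of the table:

```python
# push_down_predicate is evaluated against partition columns before any
# data is read, so non-matching S3 prefixes are never listed or scanned.
src = glueContext.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="user_events",
    push_down_predicate="event_date >= '2024-06-01'",
    transformation_ctx="src",
)
```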

How do streaming Glue jobs differ from batch?

Streaming jobs run continuously on Spark Structured Streaming over Kinesis or MSK, checkpointing to S3. They require a minimum of 2 DPUs and bill for the entire time the job is up — good for low-latency lake hydration, but not always cheaper than Firehose for simple delivery.


AWS Glue ETL is the default serverless ETL platform on AWS. It removes Spark cluster management, integrates tightly with the catalog, and supports modern lake formats — making it the natural fit for most lake- and warehouse-bound data pipelines.