AWS Glue ETL Service

AWS Glue is a fully managed, serverless ETL service that runs Apache Spark (and Python shell) jobs to extract, transform, and load data across AWS services. It pairs with the Glue Data Catalog (metadata), Glue Crawlers (schema discovery), and Glue Workflows (orchestration) to form an end-to-end data integration platform — without you provisioning Spark clusters.


Key Features:

- Serverless Spark and Python shell runtimes (no clusters to provision or patch)
- Job bookmarks for incremental processing of new files and partitions
- Tight integration with the Glue Data Catalog and Crawlers for schema discovery
- Glue Studio visual authoring and Glue Workflows for orchestration
- Auto-scaling workers and Flex execution for cost control
- Streaming jobs over Kinesis and MSK via Spark Structured Streaming
- Support for open table formats (Apache Iceberg, Delta Lake, Apache Hudi) in recent runtimes


Common Use Cases:

- Batch ETL from raw S3 data into curated, partitioned Parquet
- Loading and transforming data into Redshift or other warehouses
- Incremental ingestion pipelines driven by job bookmarks
- Near-real-time lake hydration from Kinesis or MSK streams
- Populating and maintaining the Data Catalog with crawlers


Service Limits & Quotas:

- Most limits (concurrent job runs, DPUs per account, crawlers) are per-Region soft quotas, adjustable via Service Quotas
- Spark jobs need a minimum of 2 workers; streaming jobs a minimum of 2 DPUs
- Batch jobs have a default timeout of 48 hours; streaming jobs have none and run until stopped


Pricing Model:

- Billed per DPU-hour, metered per second with a 1-minute minimum (Glue 2.0+)
- Spark jobs: roughly $0.44 per DPU-hour in us-east-1; a G.1X worker is 1 DPU, a G.2X worker is 2
- Python shell jobs can run on fractional capacity (0.0625 or 1 DPU)
- Flex execution is about 35% cheaper in exchange for no start-time SLA
- Crawlers and Data Catalog storage/requests are billed separately


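The bill is simple multiplication. A sketch of the math, assuming the common $0.44/DPU-hour Spark rate (rates vary by Region and Glue version):

```python
def glue_spark_cost(dpus: int, runtime_minutes: float,
                    rate_per_dpu_hour: float = 0.44) -> float:
    """Estimate the cost of one job run: per-second billing, 1-minute minimum."""
    billed_minutes = max(runtime_minutes, 1.0)
    return dpus * (billed_minutes / 60.0) * rate_per_dpu_hour

# 10 G.1X workers (10 DPUs) running for 30 minutes:
print(round(glue_spark_cost(10, 30), 2))  # → 2.2
```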
Code Example:

A minimal batch job: read a table from the Data Catalog, filter rows, and write partitioned Parquet back to S3.


import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from a catalog table; bookmarks let reruns skip already-processed files
src = glueContext.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="user_events",
    transformation_ctx="src",
)

adults = Filter.apply(frame=src, f=lambda r: r["age"] is not None and r["age"] >= 30)

# Write Parquet output partitioned by event_date
glueContext.write_dynamic_frame.from_options(
    frame=adults,
    connection_type="s3",
    connection_options={
        "path": "s3://my-lake/curated/user_events/",
        "partitionKeys": ["event_date"],
    },
    format="parquet",
    transformation_ctx="sink",
)

job.commit()


DynamicFrame vs. DataFrame:

A DynamicFrame is Glue's schema-flexible wrapper around a Spark DataFrame. Each record is self-describing, so one frame can carry a field whose type varies across rows (a "choice" type) and resolve the ambiguity later with ResolveChoice. A DataFrame needs a single fixed schema up front but exposes the full Spark SQL API and Catalyst optimizations. Convert between them with toDF() and fromDF(): a common pattern is to read and clean with DynamicFrames, then drop to a DataFrame for complex joins and aggregations.


Common Interview Questions:

What is a job bookmark and when does it help?

A bookmark stores state about which source files or partitions a job has already processed. On the next run, Glue skips them. It saves time and money on incremental ingest jobs but must be reset (or disabled) when reprocessing historical data.
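The semantics can be sketched in plain Python, with a hypothetical `run_job` helper and a set standing in for the state that Glue actually persists server-side per job and transformation_ctx:

```python
# Illustrative sketch of job-bookmark semantics only; Glue stores and
# restores this state for you between runs.
def run_job(source_files, bookmark):
    """Process only files the bookmark has not seen; return the new state."""
    new_files = [f for f in source_files if f not in bookmark]
    for f in new_files:
        pass  # extract/transform/load f
    return bookmark | set(new_files)  # job.commit() persists this

# First run processes everything; the rerun skips already-seen files.
state = run_job(["s3://my-lake/day=1/a.json", "s3://my-lake/day=1/b.json"], set())
state = run_job(["s3://my-lake/day=1/a.json", "s3://my-lake/day=2/c.json"], state)
# "Resetting" the bookmark (state = set()) forces a full reprocess.
```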

When would you pick Glue over EMR or EMR Serverless?

Glue is best when you want a managed Spark experience tightly integrated with the Data Catalog and you're happy with the Glue runtime. EMR/EMR Serverless wins when you need a specific Spark version, custom Hadoop ecosystem components, or fine-grained cluster control. EMR is often cheaper for very large, long-running clusters.

What is Flex execution and when should you avoid it?

Flex runs jobs on spare capacity at ~35% lower cost but with no SLA on start/run time — ideal for nightly batches and backfills. Avoid it for time-critical jobs (e.g., hourly SLAs, dashboards) where unpredictable start latency is unacceptable.
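Flex is chosen per job or per run. A sketch with the AWS CLI, assuming a job named `nightly-backfill` already exists:

```shell
# Run this execution on spare capacity; use STANDARD (the default)
# when start latency matters.
aws glue start-job-run \
    --job-name nightly-backfill \
    --execution-class FLEX
```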

How do you handle schema drift?

Use DynamicFrames and ResolveChoice transforms to surface and resolve type ambiguity. For curated outputs, write to Iceberg or Delta where schema evolution is explicit and ACID-safe.
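Continuing the script from the code example above, a sketch of resolving a drifted column, assuming crawlers have seen `age` as both int and string:

```python
from awsglue.transforms import ResolveChoice

# The catalog recorded "age" as a choice type (int or string);
# cast every value to long so the output has one consistent schema.
resolved = ResolveChoice.apply(
    frame=src,
    specs=[("age", "cast:long")],
    transformation_ctx="resolved",
)
```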

How do you optimize cost on a Spark Glue job?

Right-size worker type (G.1X for typical workloads, G.2X+ only for memory-bound), enable auto-scaling, use Flex when SLA permits, enable bookmarks for incrementals, prune partitions in create_dynamic_frame.from_catalog, and prefer Parquet/ZSTD output to keep downstream scans cheap.
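Partition pruning is a one-parameter change to the read in the code example above. A sketch, assuming `event_date` is a partition column of the table:

```python
# push_down_predicate is evaluated against partition columns before any
# data is read, so non-matching S3 prefixes are never listed or scanned.
src = glueContext.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="user_events",
    push_down_predicate="event_date >= '2024-06-01'",
    transformation_ctx="src",
)
```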

How do streaming Glue jobs differ from batch?

Streaming jobs run continuously on Spark Structured Streaming over Kinesis or MSK, checkpointing to S3. They require a minimum of 2 DPUs and bill for the entire time the job is up — good for low-latency lake hydration, but not always cheaper than Firehose for simple delivery.


AWS Glue ETL is the default serverless ETL platform on AWS. It removes Spark cluster management, integrates tightly with the catalog, and supports modern lake formats — making it the natural fit for most lake- and warehouse-bound data pipelines.