AWS EMR (Elastic MapReduce) is a managed big data platform that runs distributed processing frameworks — Apache Spark, Hadoop, HBase, Presto/Trino, Flink, and Hive — on dynamically provisioned EC2, EKS, or serverless compute. EMR removes the need to install, configure, and patch the underlying cluster while preserving full access to open-source APIs and tunable cluster shape.
Launching an EMR cluster from boto3 that runs a Spark step reading from S3 and writing Parquet back:
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="etl-daily-2026-04-25",
    ReleaseLabel="emr-7.2.0",
    LogUri="s3://my-emr-logs/",  # step and daemon logs land here
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            # One master node hosts the cluster-level daemons.
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m6g.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            # Core nodes run tasks and store HDFS blocks, so keep them On-Demand.
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m6g.2xlarge", "InstanceCount": 2,
             "Market": "ON_DEMAND"},
            # Task nodes are compute-only, so Spot is safe here. With BidPrice
            # omitted, the Spot max price defaults to the On-Demand price.
            {"Name": "task", "InstanceRole": "TASK",
             "InstanceType": "m6g.2xlarge", "InstanceCount": 8,
             "Market": "SPOT"},
        ],
        # Auto-terminate once all steps finish: a transient cluster.
        "KeepJobFlowAliveWhenNoSteps": False,
        "Ec2SubnetId": "subnet-0123456789abcdef0",
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # instance profile for the EC2 nodes
    ServiceRole="EMR_DefaultRole",      # role the EMR service itself assumes
    VisibleToAllUsers=True,
    Steps=[{
        "Name": "Daily ETL",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/etl.py",
                "--input", "s3://my-bucket/raw/2026-04-25/",
                "--output", "s3://my-bucket/curated/2026-04-25/",
            ],
        },
    }],
    Tags=[{"Key": "Owner", "Value": "data-eng"}],
)
print(response["JobFlowId"])  # the cluster ID, e.g. j-XXXXXXXXXXXXX
EMR on EC2 is the original deployment model: full control, all frameworks, best for long-running clusters or interactive workloads. EMR Serverless is the simplest: submit a Spark or Hive job and AWS handles capacity, ideal for spiky or infrequent jobs where you don't want to pay for idle capacity. EMR on EKS lets you share Kubernetes clusters across teams and reuse existing platform tooling, best when you've already invested in EKS.
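For comparison, submitting the same Spark job to EMR Serverless is a single start_job_run call once an application exists. A sketch; the application ID and execution role ARN below are placeholders, and the application itself would first be created with create_application:

import boto3

serverless = boto3.client("emr-serverless", region_name="us-west-2")

# applicationId and executionRoleArn are hypothetical placeholders.
response = serverless.start_job_run(
    applicationId="00abc123def456",
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/etl.py",
            "entryPointArguments": [
                "--input", "s3://my-bucket/raw/2026-04-25/",
                "--output", "s3://my-bucket/curated/2026-04-25/",
            ],
        }
    },
)
print(response["jobRunId"])

Note there is no cluster shape to declare at all: no instance groups, no subnet, no scaling policy.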
S3 decouples compute from storage so the cluster can be transient: spin up, run the job, terminate. HDFS dies with the cluster, which forces long-lived clusters and ties compute cost to data volume. S3 is also cheaper, replicates across AZs by default, and integrates with Athena, Glue, Redshift Spectrum, and Lake Formation.
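The etl.py submitted in the step above could be as small as this. A sketch that assumes the raw data is JSON (the job body isn't shown in this article); EMRFS resolves the s3:// URIs, so nothing ever touches HDFS:

import argparse
from pyspark.sql import SparkSession

# Hypothetical job body; real transforms would go between the read and write.
parser = argparse.ArgumentParser()
parser.add_argument("--input")
parser.add_argument("--output")
args = parser.parse_args()

spark = SparkSession.builder.appName("etl-daily").getOrCreate()

df = spark.read.json(args.input)                  # read raw data straight from S3
df.write.mode("overwrite").parquet(args.output)   # write curated Parquet back to S3

spark.stop()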
The master node runs the cluster-level daemons (YARN ResourceManager, HDFS NameNode) and, in client deploy mode, the Spark driver. Core nodes run task processes and store HDFS data; losing a core node loses HDFS blocks. Task nodes run only compute, no HDFS, which makes them perfect for Spot: an interruption only kills tasks, which Spark/MapReduce will retry.
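Because task groups carry no HDFS state, they can also be grafted onto a running cluster. A sketch, assuming the JobFlowId printed at launch (the ID below is a placeholder):

import boto3

emr = boto3.client("emr", region_name="us-west-2")

# j-XXXXXXXXXXXXX stands in for the real cluster ID.
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",
    InstanceGroups=[{
        "Name": "task-spot-extra",
        "InstanceRole": "TASK",
        "InstanceType": "m6g.2xlarge",
        "InstanceCount": 4,
        "Market": "SPOT",  # interruptions only kill retryable tasks
    }],
)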
Custom Auto Scaling requires you to write CloudWatch-metric-based rules per node group. Managed Scaling is a single setting (min/max units) where EMR continuously evaluates YARN container demand, pending Spark stages, and HDFS utilization to add/remove core and task nodes — almost always better than hand-tuned rules.
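Attaching Managed Scaling to an existing cluster is one call. A sketch; the cluster ID is a placeholder and the limits are illustrative, not a recommendation:

import boto3

emr = boto3.client("emr", region_name="us-west-2")

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,            # floor: the two core nodes
            "MaximumCapacityUnits": 20,
            "MaximumOnDemandCapacityUnits": 5,    # capacity beyond this is Spot
            "MaximumCoreCapacityUnits": 4,        # cap the HDFS-bearing nodes
        }
    },
)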
Launch into a private subnet with a Security Configuration that enables Kerberos for in-cluster auth, in-transit TLS for Hadoop services, EMRFS server-side encryption (SSE-KMS) and TLS for S3, EBS volume encryption with a CMK, and IAM Roles for EMRFS to map user identities to S3 permissions. Add Lake Formation for column- and row-level access control on tables.
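A sketch of the encryption half of that Security Configuration (Kerberos, IAM Roles for EMRFS, and Lake Formation are omitted; the KMS key ARNs and certificate bundle location are placeholders):

import boto3
import json

emr = boto3.client("emr", region_name="us-west-2")

# Key ARNs and the cert bundle S3 object are hypothetical placeholders.
emr.create_security_configuration(
    Name="etl-security-config",
    SecurityConfiguration=json.dumps({
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                # EMRFS: SSE-KMS for data written to S3
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": "arn:aws:kms:us-west-2:123456789012:key/placeholder",
                },
                # EBS volumes encrypted with the same CMK
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": "arn:aws:kms:us-west-2:123456789012:key/placeholder",
                    "EnableEbsEncryption": True,
                },
            },
            # TLS between Hadoop services, from a PEM bundle zipped in S3
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": "s3://my-bucket/certs/emr-certs.zip",
                }
            },
        }
    }),
)

run_job_flow then references it by name through its SecurityConfiguration parameter.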
Start with the Spark History Server (port-forward via Session Manager, or persist event logs to S3 and view via the EMR console). Look for stage skew (one task much slower than others — often caused by skewed join keys), excessive shuffle, GC pauses on executors. Tune executor count/memory, enable adaptive query execution, repartition or salt skewed keys, and check that file sizes are appropriate (avoid millions of tiny files; aim for 128 MB–1 GB per file).
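Most of those tunings can be baked in at launch through the Configurations parameter of run_job_flow, using the spark-defaults classification. The values below are illustrative starting points, not recommendations:

# Passed as Configurations=[...] to run_job_flow; values are illustrative.
spark_tuning = [{
    "Classification": "spark-defaults",
    "Properties": {
        "spark.sql.adaptive.enabled": "true",           # adaptive query execution
        "spark.sql.adaptive.skewJoin.enabled": "true",  # split skewed join partitions
        "spark.sql.shuffle.partitions": "400",
        "spark.dynamicAllocation.enabled": "true",
        "spark.executor.memory": "8g",
        "spark.executor.cores": "4",
    },
}]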
AWS EMR is ideal for businesses that need to process large-scale datasets, perform data transformations, or run advanced analytics in a flexible, scalable, and cost-effective environment.