AWS EMR (Elastic MapReduce)

AWS EMR (Elastic MapReduce) is a managed big data platform that runs distributed processing frameworks — Apache Spark, Hadoop, HBase, Presto/Trino, Flink, and Hive — on dynamically provisioned EC2, EKS, or serverless compute. EMR removes the need to install, configure, and patch the underlying cluster while preserving full access to open-source APIs and tunable cluster shape.


Key Features:

  - Managed provisioning: EMR installs, configures, and patches the cluster software for you.
  - Framework choice: Spark, Hadoop, Hive, HBase, Presto/Trino, and Flink, versioned together under a single release label.
  - EMRFS: read and write Amazon S3 directly, as if it were a cluster filesystem.
  - Elasticity: resize or auto-scale clusters, and run task nodes on Spot Instances for savings.
  - Deployment options: EMR on EC2, EMR on EKS, and EMR Serverless.

Common Use Cases:

  - Batch ETL and scheduled data pipelines.
  - Interactive analytics and ad-hoc SQL with Spark SQL or Presto/Trino.
  - Feature engineering and large-scale model training for machine learning.
  - Log and clickstream processing, and data lake table maintenance.

Example Workflow:

  1. Data Storage: Store raw data in Amazon S3.
  2. Cluster Provisioning: Launch an EMR cluster with the necessary frameworks (e.g., Spark, Hadoop).
  3. Data Processing: Use the cluster to process and analyze the data, running jobs written in languages like Python, Scala, or SQL.
  4. Results Storage: Save the processed data or analysis results back to Amazon S3, DynamoDB, or Redshift for further use.
  5. Cluster Termination: Shut down the cluster when the job is complete to save costs.


Service Limits & Quotas:

  - EMR itself imposes few hard limits; cluster size is mostly bounded by your account's per-region EC2 instance quotas.
  - EMR API calls (e.g., RunJobFlow, DescribeCluster) are throttled, so use exponential backoff when polling.
  - Most quotas are adjustable through AWS Service Quotas.

Pricing Model:

  - A per-second EMR charge (one-minute minimum) is billed on top of the underlying EC2 and EBS cost.
  - Spot task nodes can cut compute cost substantially for fault-tolerant workloads.
  - EMR Serverless bills for the vCPU-seconds and GB-seconds a job actually consumes, with no idle-cluster cost.

Code Example:

Launching an EMR cluster with Spark via boto3; the submitted step reads raw data from S3 and writes Parquet back:

import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="etl-daily-2026-04-25",
    ReleaseLabel="emr-7.2.0",
    LogUri="s3://my-emr-logs/",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m6g.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m6g.2xlarge", "InstanceCount": 2,
             "Market": "ON_DEMAND"},
            {"Name": "task", "InstanceRole": "TASK",
             "InstanceType": "m6g.2xlarge", "InstanceCount": 8,
             # No BidPrice: EMR then caps the Spot price at the on-demand price
             "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the last step
        "Ec2SubnetId": "subnet-0123456789abcdef0",
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
    Steps=[{
        "Name": "Daily ETL",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/etl.py",
                "--input", "s3://my-bucket/raw/2026-04-25/",
                "--output", "s3://my-bucket/curated/2026-04-25/",
            ],
        },
    }],
    Tags=[{"Key": "Owner", "Value": "data-eng"}],
)
print(response["JobFlowId"])
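The etl.py referenced in the step above might look like the following sketch. The JSON input format and the cleanup logic are illustrative assumptions; only the --input/--output flags are fixed by the spark-submit arguments:

```python
# etl.py -- sketch of the PySpark job submitted by the step above.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)   # e.g. s3://my-bucket/raw/2026-04-25/
    parser.add_argument("--output", required=True)  # e.g. s3://my-bucket/curated/2026-04-25/
    return parser.parse_args(argv)


def run(args):
    # Imported here: spark-submit provides pyspark on the cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-etl").getOrCreate()
    df = spark.read.json(args.input)                 # raw layout is an assumption
    cleaned = df.dropDuplicates().na.drop()          # illustrative cleanup only
    cleaned.write.mode("overwrite").parquet(args.output)
    spark.stop()


# Entry point when launched by spark-submit (uncomment in the real script;
# left commented here so the sketch can run without pyspark installed):
# if __name__ == "__main__":
#     run(parse_args())
```

Writing Parquet with mode("overwrite") makes the daily run idempotent: re-running the same date replaces that partition's output instead of appending duplicates.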


Common Interview Questions:

EMR on EC2 vs. EMR Serverless vs. EMR on EKS — when do you pick each?

EMR on EC2 is the original deployment: full control, every framework, best for long-running clusters or interactive workloads. EMR Serverless is the simplest: submit a Spark or Hive job and AWS handles capacity, which is ideal for spiky or infrequent jobs where you don't want to pay for idle capacity. EMR on EKS lets you share Kubernetes clusters across teams and reuse existing platform tooling; it is the best fit when you have already invested in EKS.
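For contrast with run_job_flow, an EMR Serverless submission is just a job driver against a pre-created application. The application ID and role ARN below are placeholders, and the StartJobRun call is commented out because it requires live AWS credentials:

```python
def spark_job_driver(entry_point, entry_args):
    # jobDriver payload shape for the EMR Serverless StartJobRun API
    return {"sparkSubmit": {"entryPoint": entry_point,
                            "entryPointArguments": list(entry_args)}}


driver = spark_job_driver("s3://my-bucket/jobs/etl.py",
                          ["--input", "s3://my-bucket/raw/2026-04-25/"])

# import boto3
# emr_serverless = boto3.client("emr-serverless", region_name="us-west-2")
# emr_serverless.start_job_run(
#     applicationId="app-PLACEHOLDER",  # a pre-created Spark application
#     executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
#     jobDriver=driver,
# )
```

Note what is absent: no instance types, no node counts, no termination logic. Capacity management is the service's problem, which is exactly the trade-off described above.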

Why do most EMR jobs read from and write to S3 instead of HDFS?

S3 decouples compute from storage so the cluster can be transient — spin up, run the job, terminate. HDFS dies with the cluster, which forces long-lived clusters and ties cost to data volume. S3 is also cheaper, replicates across AZs by default, and integrates with Athena, Glue, Redshift Spectrum, and Lake Formation.

What is the role of master, core, and task nodes?

The master (primary) node runs the cluster-coordination daemons: the YARN ResourceManager and HDFS NameNode (and it hosts the Spark driver in client deploy mode). Core nodes run task processes and HDFS DataNodes; losing a core node loses HDFS blocks. Task nodes run compute only, no HDFS, which makes them perfect for Spot: an interruption only kills tasks, and Spark/MapReduce retries those.
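This division of labor shows up directly in the instance-group configuration; a small helper (hypothetical, mirroring the run_job_flow example earlier) makes the on-demand/Spot split explicit:

```python
def instance_group(name, role, instance_type, count, spot=False):
    # Task nodes run no HDFS DataNode, so losing one to a Spot
    # interruption only kills retryable tasks, never stored blocks.
    return {"Name": name, "InstanceRole": role,
            "InstanceType": instance_type, "InstanceCount": count,
            "Market": "SPOT" if spot else "ON_DEMAND"}


groups = [
    instance_group("master", "MASTER", "m6g.xlarge", 1),          # coordination daemons
    instance_group("core", "CORE", "m6g.2xlarge", 2),             # compute + HDFS: on-demand
    instance_group("task", "TASK", "m6g.2xlarge", 8, spot=True),  # compute only: Spot-safe
]
```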

How does EMR Managed Scaling differ from Auto Scaling?

Custom Auto Scaling requires you to write CloudWatch-metric-based rules per node group. Managed Scaling is a single setting (min/max units) where EMR continuously evaluates YARN container demand, pending Spark stages, and HDFS utilization to add/remove core and task nodes — almost always better than hand-tuned rules.
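Enabling Managed Scaling is a single ComputeLimits block. The cluster ID below is a placeholder and the API call is commented out because it needs a live cluster; the policy shape follows the PutManagedScalingPolicy API:

```python
managed_scaling_policy = {
    "ComputeLimits": {
        "UnitType": "Instances",            # count whole nodes; "VCPU" is also supported
        "MinimumCapacityUnits": 3,          # never shrink below master + 2 core
        "MaximumCapacityUnits": 20,
        "MaximumOnDemandCapacityUnits": 5,  # capacity beyond this may be Spot
        "MaximumCoreCapacityUnits": 5,      # cap HDFS-bearing core nodes
    }
}

# import boto3
# emr = boto3.client("emr", region_name="us-west-2")
# emr.put_managed_scaling_policy(
#     ClusterId="j-PLACEHOLDER",
#     ManagedScalingPolicy=managed_scaling_policy,
# )
```

Capping core capacity separately matters: scale-in removes task nodes first, so HDFS data is not lost to routine downsizing.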

How would you secure an EMR cluster handling regulated data?

Launch into a private subnet with a Security Configuration that enables Kerberos for in-cluster auth, in-transit TLS for Hadoop services, EMRFS server-side encryption (SSE-KMS) and TLS for S3, EBS volume encryption with a CMK, and IAM Roles for EMRFS to map user identities to S3 permissions. Add Lake Formation for column- and row-level access control on tables.
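Those controls are bundled into a Security Configuration that is created once and referenced at cluster launch. A sketch of the structure follows; the KMS key ARNs and certificate location are placeholders, and the create call is commented out since it needs credentials:

```python
import json

security_configuration = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": True,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",  # EMRFS server-side encryption
                "AwsKmsKey": "arn:aws:kms:us-west-2:123456789012:key/PLACEHOLDER",
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-west-2:123456789012:key/PLACEHOLDER",
                "EnableEbsEncryption": True,
            },
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://my-bucket/certs/node-certs.zip",
            }
        },
    },
    "AuthenticationConfiguration": {
        "KerberosConfiguration": {
            "Provider": "ClusterDedicatedKdc",
            "ClusterDedicatedKdcConfiguration": {"TicketLifetimeInHours": 24},
        }
    },
}

# import boto3
# boto3.client("emr").create_security_configuration(
#     Name="regulated-data",
#     SecurityConfiguration=json.dumps(security_configuration),
# )
```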

How do you debug a slow Spark job on EMR?

Start with the Spark History Server (port-forward via Session Manager, or persist event logs to S3 and view them in the EMR console). Look for stage skew (one task running much longer than the rest, often caused by skewed join keys), excessive shuffle, and GC pauses on executors. Tune executor count and memory, enable adaptive query execution, repartition or salt skewed keys, and check that file sizes are sensible (avoid millions of tiny files; aim for roughly 128 MB to 1 GB per file).
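Key salting can be illustrated without Spark. In this toy sketch, a deterministic byte-sum hash stands in for a hash partitioner; one hot key funnels 90% of the rows into a single partition until a salt suffix spreads it out:

```python
import random
from collections import Counter

random.seed(0)


def partition_of(key, n_partitions=4):
    # Deterministic stand-in for a hash partitioner
    return sum(key.encode()) % n_partitions


# 90 of 100 rows share one join key -> one straggler task
keys = ["hot"] * 90 + ["a", "b", "c"] * 3 + ["d"]
before = Counter(partition_of(k) for k in keys)

# Salt only the hot key. The other side of the join must replicate
# its matching "hot" row once per salt value so pairs still meet.
SALT_BUCKETS = 4
salted = [f"{k}#{random.randrange(SALT_BUCKETS)}" if k == "hot" else k
          for k in keys]
after = Counter(partition_of(k) for k in salted)
# Before salting one partition holds ~90 rows; after, the hot rows spread out.
```

The same idea in Spark means appending a random salt column to the skewed side and exploding the small side across all salt values before joining; adaptive query execution in recent Spark versions can also split skewed partitions automatically.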

AWS EMR is ideal for businesses that need to process large-scale datasets, perform data transformations, or run advanced analytics in a flexible, scalable, and cost-effective environment.