AWS EMR (Elastic MapReduce) is a managed big data platform that runs distributed processing frameworks — Apache Spark, Hadoop, HBase, Presto/Trino, Flink, and Hive — on dynamically provisioned EC2, EKS, or serverless compute. EMR removes the need to install, configure, and patch the underlying cluster while preserving full access to open-source APIs and tunable cluster shape.
Launching an EMR cluster from boto3 that runs a Spark step reading from S3 and writing Parquet back:
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="etl-daily-2026-04-25",
    ReleaseLabel="emr-7.2.0",
    LogUri="s3://my-emr-logs/",  # step and daemon logs land here
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            # One master node hosts the cluster-level daemons.
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m6g.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            # Core nodes run tasks and store HDFS blocks, so keep them On-Demand.
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m6g.2xlarge", "InstanceCount": 2,
             "Market": "ON_DEMAND"},
            # Task nodes are compute-only, so Spot is safe here. With BidPrice
            # omitted, the Spot max price defaults to the On-Demand price.
            {"Name": "task", "InstanceRole": "TASK",
             "InstanceType": "m6g.2xlarge", "InstanceCount": 8,
             "Market": "SPOT"},
        ],
        # Auto-terminate once all steps finish: a transient cluster.
        "KeepJobFlowAliveWhenNoSteps": False,
        "Ec2SubnetId": "subnet-0123456789abcdef0",
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # instance profile for the EC2 nodes
    ServiceRole="EMR_DefaultRole",      # role the EMR service itself assumes
    VisibleToAllUsers=True,
    Steps=[{
        "Name": "Daily ETL",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/etl.py",
                "--input", "s3://my-bucket/raw/2026-04-25/",
                "--output", "s3://my-bucket/curated/2026-04-25/",
            ],
        },
    }],
    Tags=[{"Key": "Owner", "Value": "data-eng"}],
)
print(response["JobFlowId"])  # the cluster ID, e.g. j-XXXXXXXXXXXXX
EMR on EC2 is the original deployment model: full control, all frameworks, best for long-running clusters or interactive workloads. EMR Serverless is the simplest: submit a Spark or Hive job and AWS handles capacity, ideal for spiky or infrequent jobs where you don't want to pay for idle capacity. EMR on EKS lets you share Kubernetes clusters across teams and reuse existing platform tooling, best when you've already invested in EKS.
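For comparison, submitting the same Spark job to EMR Serverless is a single start_job_run call once an application exists. A sketch; the application ID and execution role ARN below are placeholders, and the application itself would first be created with create_application:

import boto3

serverless = boto3.client("emr-serverless", region_name="us-west-2")

# applicationId and executionRoleArn are hypothetical placeholders.
response = serverless.start_job_run(
    applicationId="00abc123def456",
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/etl.py",
            "entryPointArguments": [
                "--input", "s3://my-bucket/raw/2026-04-25/",
                "--output", "s3://my-bucket/curated/2026-04-25/",
            ],
        }
    },
)
print(response["jobRunId"])

Note there is no cluster shape to declare at all: no instance groups, no subnet, no scaling policy.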
S3 decouples compute from storage so the cluster can be transient: spin up, run the job, terminate. HDFS dies with the cluster, which forces long-lived clusters and ties compute cost to data volume. S3 is also cheaper, replicates across AZs by default, and integrates with Athena, Glue, Redshift Spectrum, and Lake Formation.
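The etl.py submitted in the step above could be as small as this. A sketch that assumes the raw data is JSON (the job body isn't shown in this article); EMRFS resolves the s3:// URIs, so nothing ever touches HDFS:

import argparse
from pyspark.sql import SparkSession

# Hypothetical job body; real transforms would go between the read and write.
parser = argparse.ArgumentParser()
parser.add_argument("--input")
parser.add_argument("--output")
args = parser.parse_args()

spark = SparkSession.builder.appName("etl-daily").getOrCreate()

df = spark.read.json(args.input)                  # read raw data straight from S3
df.write.mode("overwrite").parquet(args.output)   # write curated Parquet back to S3

spark.stop()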
The master node runs the cluster-level daemons (YARN ResourceManager, HDFS NameNode) and, in client deploy mode, the Spark driver. Core nodes run task processes and store HDFS data; losing a core node loses HDFS blocks. Task nodes run only compute, no HDFS, which makes them perfect for Spot: an interruption only kills tasks, which Spark/MapReduce will retry.
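Because task groups carry no HDFS state, they can also be grafted onto a running cluster. A sketch, assuming the JobFlowId printed at launch (the ID below is a placeholder):

import boto3

emr = boto3.client("emr", region_name="us-west-2")

# j-XXXXXXXXXXXXX stands in for the real cluster ID.
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",
    InstanceGroups=[{
        "Name": "task-spot-extra",
        "InstanceRole": "TASK",
        "InstanceType": "m6g.2xlarge",
        "InstanceCount": 4,
        "Market": "SPOT",  # interruptions only kill retryable tasks
    }],
)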
Custom Auto Scaling requires you to write CloudWatch-metric-based rules per node group. Managed Scaling is a single setting (min/max units) where EMR continuously evaluates YARN container demand, pending Spark stages, and HDFS utilization to add/remove core and task nodes — almost always better than hand-tuned rules.
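Attaching Managed Scaling to an existing cluster is one call. A sketch; the cluster ID is a placeholder and the limits are illustrative, not a recommendation:

import boto3

emr = boto3.client("emr", region_name="us-west-2")

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,            # floor: the two core nodes
            "MaximumCapacityUnits": 20,
            "MaximumOnDemandCapacityUnits": 5,    # capacity beyond this is Spot
            "MaximumCoreCapacityUnits": 4,        # cap the HDFS-bearing nodes
        }
    },
)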
Launch into a private subnet with a Security Configuration that enables Kerberos for in-cluster auth, in-transit TLS for Hadoop services, EMRFS server-side encryption (SSE-KMS) and TLS for S3, EBS volume encryption with a CMK, and IAM Roles for EMRFS to map user identities to S3 permissions. Add Lake Formation for column- and row-level access control on tables.
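A sketch of the encryption half of that Security Configuration (Kerberos, IAM Roles for EMRFS, and Lake Formation are omitted; the KMS key ARNs and certificate bundle location are placeholders):

import boto3
import json

emr = boto3.client("emr", region_name="us-west-2")

# Key ARNs and the cert bundle S3 object are hypothetical placeholders.
emr.create_security_configuration(
    Name="etl-security-config",
    SecurityConfiguration=json.dumps({
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                # EMRFS: SSE-KMS for data written to S3
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": "arn:aws:kms:us-west-2:123456789012:key/placeholder",
                },
                # EBS volumes encrypted with the same CMK
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": "arn:aws:kms:us-west-2:123456789012:key/placeholder",
                    "EnableEbsEncryption": True,
                },
            },
            # TLS between Hadoop services, from a PEM bundle zipped in S3
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": "s3://my-bucket/certs/emr-certs.zip",
                }
            },
        }
    }),
)

run_job_flow then references it by name through its SecurityConfiguration parameter.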
Start with the Spark History Server (port-forward via Session Manager, or persist event logs to S3 and view via the EMR console). Look for stage skew (one task much slower than others — often caused by skewed join keys), excessive shuffle, GC pauses on executors. Tune executor count/memory, enable adaptive query execution, repartition or salt skewed keys, and check that file sizes are appropriate (avoid millions of tiny files; aim for 128 MB–1 GB per file).
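Most of those tunings can be baked in at launch through the Configurations parameter of run_job_flow, using the spark-defaults classification. The values below are illustrative starting points, not recommendations:

# Passed as Configurations=[...] to run_job_flow; values are illustrative.
spark_tuning = [{
    "Classification": "spark-defaults",
    "Properties": {
        "spark.sql.adaptive.enabled": "true",           # adaptive query execution
        "spark.sql.adaptive.skewJoin.enabled": "true",  # split skewed join partitions
        "spark.sql.shuffle.partitions": "400",
        "spark.dynamicAllocation.enabled": "true",
        "spark.executor.memory": "8g",
        "spark.executor.cores": "4",
    },
}]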
AWS EMR is ideal for businesses that need to process large-scale datasets, perform data transformations, or run advanced analytics in a flexible, scalable, and cost-effective environment.