Amazon EC2

Amazon EC2 (Elastic Compute Cloud)

Amazon EC2 provides resizable virtual machines (instances) in the AWS cloud. It remains the foundational AWS compute primitive — even managed services like RDS, EMR, and SageMaker run on EC2 underneath. EC2 is the right choice when you need full OS-level control, specialized hardware (GPU, high-memory, HPC), or software that doesn't fit containers or Lambda.

Key Concepts:

Instance Families: General purpose (M, plus burstable T), compute-optimized (C), memory-optimized (R, X), storage-optimized (I, D), and accelerated (G, P, Trn, Inf).
AMIs (Amazon Machine Images): Immutable OS + software snapshots used to launch instances; build with EC2 Image Builder or HashiCorp Packer.
EBS vs. Instance Store: EBS is network-attached, durable, and snapshotable; instance store is local NVMe, ephemeral, and fastest — useful for scratch space.
Placement Groups: Cluster (instances packed close together in one AZ for low latency), Spread (each instance on distinct hardware), Partition (groups of instances on separate racks). They control where instances land relative to each other.
Auto Scaling Groups: Launch/terminate instances based on metrics, schedules, or health checks; integrate with ELB for zero-downtime deploys.
Nitro System: Hypervisor + hardware offload that powers modern instance families — enables bare-metal instances and strong tenant isolation.

Pricing Models:

On-Demand: Per-second billing, no commitment — simple but most expensive.
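Savings Plans / Reserved Instances: 1- or 3-year commitment for up to ~72% discount. Compute Savings Plans apply across instance families, sizes, and regions; EC2 Instance Savings Plans are tied to one instance family in one region, and RIs to specific instance attributes.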
Spot Instances: Up to ~90% off on-demand pricing using spare capacity — may be interrupted with 2 minutes' notice. Ideal for batch, CI, big data, ML training with checkpointing.
Dedicated Hosts / Dedicated Instances: Physical server isolation for BYOL licensing and compliance.
Capacity Reservations: Guarantee capacity in a specific AZ with no term commitment. They are billed at the On-Demand rate whether or not instances are running in them. Useful for DR and predictable scale events.
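
To make the discount ceilings above concrete, here is a rough sketch of the arithmetic. The hourly rate is a made-up placeholder, not a published price, and the percentages are the "up to" maxima quoted above, so real savings will usually be smaller.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours / 12 months


def monthly_cost(hourly_rate: float, discount: float = 0.0,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Return the monthly cost of one instance after a fractional discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return round(hourly_rate * (1.0 - discount) * hours, 2)


# Hypothetical On-Demand rate, chosen only for illustration.
ON_DEMAND_RATE = 0.10  # USD per hour

costs = {
    "on_demand": monthly_cost(ON_DEMAND_RATE),
    "savings_plan_max": monthly_cost(ON_DEMAND_RATE, discount=0.72),
    "spot_max": monthly_cost(ON_DEMAND_RATE, discount=0.90),
}

for model, cost in costs.items():
    print(f"{model:>18}: ${cost:,.2f}/month")
```

The same function can be reused with real rates from the AWS pricing pages to compare a commitment against expected utilization.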

When to Use EC2 (vs. Alternatives):

Pick Lambda for short-running, event-driven work under 15 minutes.
Pick Fargate / ECS / EKS for containerized workloads.
Pick Batch / EMR for large-scale parallel jobs that AWS can orchestrate for you.
Pick EC2 when you need OS-level control, GPUs, specialized networking (EFA), custom kernels, or long-lived stateful processes.
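
The rules of thumb above can be written down as a small decision function. This is only a sketch of the list, with hypothetical parameter names; real choices also depend on cost, team skills, and existing tooling.

```python
def pick_compute(*, containerized: bool = False, event_driven: bool = False,
                 max_runtime_minutes: float = 0.0,
                 large_parallel_job: bool = False,
                 needs_os_control: bool = False, needs_gpu: bool = False,
                 long_lived_stateful: bool = False) -> str:
    """Map coarse workload traits to the service suggested by the list above."""
    # Requirements the list assigns to EC2 take priority over everything else.
    if needs_os_control or needs_gpu or long_lived_stateful:
        return "EC2"
    # Short, event-driven work fits Lambda's 15-minute ceiling.
    if event_driven and max_runtime_minutes < 15:
        return "Lambda"
    if containerized:
        return "Fargate / ECS / EKS"
    if large_parallel_job:
        return "Batch / EMR"
    # Nothing else matched: fall back to plain instances.
    return "EC2"


print(pick_compute(event_driven=True, max_runtime_minutes=2))
print(pick_compute(containerized=True))
print(pick_compute(containerized=True, needs_gpu=True))
```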

Service Limits & Quotas:

vCPU-based On-Demand limits: Default soft limits per region, tracked per group of instance families. Standard (A/C/D/H/I/M/R/T/Z) starts around 64 vCPUs, though new accounts can start lower; the G/VT and P (GPU) groups are much lower (often 0 by default, so they need a quota request).
Spot vCPU limits: Tracked separately from On-Demand; both adjustable through Service Quotas.
EBS volumes per region: Default soft limit starts at 5,000 volumes and 50 TiB total storage — adjustable.
Elastic IPs: 5 per region by default; request increase if needed.
Security groups per ENI: 5 by default, max 16; rules per SG default 60 inbound + 60 outbound.
Maximum EBS volume size: 64 TiB for io2 Block Express, 16 TiB for gp3/io2; max IOPS 256,000 (io2 Block Express).
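
Because On-Demand limits are counted in vCPUs, a quick pre-flight check before a large launch can save a failed deploy. The sketch below separates the arithmetic from the API call. The quota code is believed to identify the Standard On-Demand quota, but treat it as an assumption and confirm it in the Service Quotas console. The 64 vCPU figure in the example is hypothetical.

```python
def fits_quota(running_vcpus: int, vcpus_per_instance: int,
               instance_count: int, quota_vcpus: float) -> bool:
    """True if the launch keeps the account at or under its vCPU quota."""
    return running_vcpus + vcpus_per_instance * instance_count <= quota_vcpus


def get_standard_on_demand_quota(region: str = "us-west-2") -> float:
    """Fetch the applied vCPU quota for Standard On-Demand instances.

    Needs boto3 and AWS credentials. The quota code is an assumption to verify.
    """
    import boto3  # imported lazily so fits_quota() works without boto3

    client = boto3.client("service-quotas", region_name=region)
    resp = client.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
    return resp["Quota"]["Value"]


# Ten 4-vCPU instances on top of 32 running vCPUs, against a
# hypothetical quota of 64 vCPUs.
print(fits_quota(running_vcpus=32, vcpus_per_instance=4,
                 instance_count=10, quota_vcpus=64))
```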

Billing Details:

Per-second billing: 60-second minimum for On-Demand, Spot, and Reserved Instances running Amazon Linux, Windows, RHEL, Ubuntu, or Ubuntu Pro; other commercial distros such as SUSE Linux Enterprise Server bill per hour.
You pay for: running instance hours (vCPU + memory class), attached EBS storage (per GB-month + IOPS for gp3/io2), data transfer OUT to the internet, NAT Gateway data processing, and public IPv4 addresses, including Elastic IPs (charged hourly since February 2024 whether attached or idle).
You don't pay for: stopped instances (compute), inbound data transfer, data transfer between EC2 and S3 in the same region.
Common cost surprises: cross-AZ data transfer charges (~$0.01/GB each way), unattached EBS volumes silently accruing storage cost, idle Elastic IPs, NAT Gateway data processing fees ($0.045/GB) on chatty workloads, and forgetting to release capacity reservations.
Free tier: 750 hours/month of t2.micro or t3.micro for 12 months, plus 30 GB of EBS, for accounts created before July 15, 2025; newer accounts get a credit-based Free Tier instead.
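
Two of the cost surprises above are easy to estimate, and a third is easy to detect. The sketch below uses the per-GB rates quoted in this section (they vary by region) and a hypothetical helper that lists unattached EBS volumes. The helper takes any boto3 EC2 client as an argument.

```python
CROSS_AZ_RATE = 0.01         # USD per GB, charged in each direction
NAT_PROCESSING_RATE = 0.045  # USD per GB processed by a NAT Gateway


def cross_az_cost(gb_transferred: float) -> float:
    """Cross-AZ traffic is billed on both the sending and the receiving side."""
    return round(gb_transferred * CROSS_AZ_RATE * 2, 2)


def nat_processing_cost(gb_processed: float) -> float:
    """Data processing fee only; the hourly NAT Gateway charge is separate."""
    return round(gb_processed * NAT_PROCESSING_RATE, 2)


def find_unattached_volumes(ec2_client) -> list[str]:
    """Return IDs of EBS volumes in the 'available' (unattached) state."""
    paginator = ec2_client.get_paginator("describe_volumes")
    pages = paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    return [vol["VolumeId"] for page in pages for vol in page["Volumes"]]


monthly_gb = 10_000
print(f"cross-AZ: ${cross_az_cost(monthly_gb):,.2f}")
print(f"NAT processing: ${nat_processing_cost(monthly_gb):,.2f}")

# Usage against a real account (needs boto3 and credentials):
# import boto3
# print(find_unattached_volumes(boto3.client("ec2", region_name="us-west-2")))
```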

Code Example:

Launching a tagged EC2 instance with an IAM instance profile and user-data using boto3:

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

user_data = """#!/bin/bash
yum update -y
yum install -y python3
echo "ready" > /var/log/bootstrap.log
"""

response = ec2.run_instances(
    ImageId="ami-0abcdef1234567890",       # placeholder ID; substitute a current Amazon Linux 2023 AMI
    InstanceType="t3.small",
    MinCount=1, MaxCount=1,
    KeyName="my-keypair",
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    IamInstanceProfile={"Name": "ec2-app-role"},
    UserData=user_data,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [
            {"Key": "Name", "Value": "app-worker-01"},
            {"Key": "Environment", "Value": "prod"},
            {"Key": "Owner", "Value": "kevin"},
        ],
    }],
    MetadataOptions={"HttpTokens": "required"},  # IMDSv2 only
)
print(response["Instances"][0]["InstanceId"])
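
run_instances returns while the instance is still pending, so scripts usually wait before using it. A minimal follow-up sketch, assuming the same kind of boto3 EC2 client as above; the helper name is hypothetical.

```python
def wait_until_running(ec2_client, instance_id: str) -> dict:
    """Block until the instance is running, then return a short summary."""
    # The built-in waiter polls describe_instances until the state is "running".
    waiter = ec2_client.get_waiter("instance_running")
    waiter.wait(InstanceIds=[instance_id])

    resp = ec2_client.describe_instances(InstanceIds=[instance_id])
    instance = resp["Reservations"][0]["Instances"][0]
    return {
        "InstanceId": instance["InstanceId"],
        "State": instance["State"]["Name"],
        "PrivateIpAddress": instance.get("PrivateIpAddress"),
    }


# Usage, continuing from the launch call (needs boto3 and credentials):
# info = wait_until_running(ec2, response["Instances"][0]["InstanceId"])
# print(info)
```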

Common Interview Questions:

What is the difference between an EBS volume and an instance store?

EBS volumes are network-attached block storage that persist independently of the instance lifecycle and can be snapshotted, encrypted, and re-attached. Instance store is disk physically attached to the host (NVMe SSD on current families): extremely fast but ephemeral. Its data survives a reboot but is lost when the instance stops, hibernates, terminates, or the underlying hardware fails. Use EBS for anything that must outlive a stop or an instance failure, and instance store for scratch space (shuffle data, caches, temp files for ML training).

How do Spot Instances work and when would you use them?

Spot uses spare EC2 capacity at up to ~90% off On-Demand pricing. AWS can reclaim the instance with a 2-minute warning when capacity is needed elsewhere. Use Spot for fault-tolerant workloads — Spark/EMR jobs, CI runners, batch processing, ML training with checkpointing, stateless web tiers behind a load balancer. Use Spot Fleet or mixed-instance ASGs to spread across many instance types and AZs to reduce interruption risk.
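
A workload learns about the 2-minute warning by polling the instance metadata path spot/instance-action, which returns 404 until an interruption is scheduled and a small JSON document afterwards. The sketch below covers only the parsing step; the function name is hypothetical, and the fetch itself (with an IMDSv2 token) is assumed to happen elsewhere on the instance.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def seconds_until_interruption(notice_body: str,
                               now: Optional[datetime] = None) -> float:
    """Parse a Spot instance-action document and return the seconds remaining."""
    notice = json.loads(notice_body)
    # The time field is UTC in the form 2017-09-18T08:22:00Z.
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (deadline - now).total_seconds()


# Sample document in the documented shape, checked two minutes ahead.
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
checked_at = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(sample, checked_at))
```

A checkpointing job would call this in its polling loop and flush state as soon as a notice appears.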

What is the difference between a Security Group and a Network ACL?

Security Groups are stateful instance-level firewalls — return traffic for an allowed inbound flow is automatically permitted. They support allow rules only. Network ACLs are stateless subnet-level firewalls supporting both allow and deny rules; you must explicitly allow return traffic. SGs are the everyday tool; NACLs are coarse-grained and used for broad deny patterns (e.g., blocking known-bad IP ranges at the subnet edge).
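
The stateful/stateless difference is easiest to see in a toy model. The sketch below is a deliberate simplification, not how AWS evaluates traffic: it matches on port only and ignores protocol, CIDR, and port ranges. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int   # lower numbers are evaluated first
    port: int
    allow: bool


def nacl_allows(rules: list[NaclRule], port: int) -> bool:
    """Stateless: the lowest-numbered matching rule decides; no match is a deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port:
            return rule.allow
    return False


def sg_allows_inbound(allowed_ports: set[int], port: int) -> bool:
    """Security groups hold allow rules only: any match permits the traffic."""
    return port in allowed_ports


CLIENT_PORT = 50000  # ephemeral port the client expects the reply on

# NACL: inbound 443 is allowed, but no outbound rule covers the reply.
nacl_inbound = [NaclRule(100, 443, True)]
nacl_outbound: list[NaclRule] = []
nacl_request_ok = nacl_allows(nacl_inbound, 443)
nacl_reply_ok = nacl_allows(nacl_outbound, CLIENT_PORT)

# Security group: allowing the inbound request is enough, because the
# reply belongs to a tracked connection and is permitted automatically.
sg_request_ok = sg_allows_inbound({443}, 443)
sg_reply_ok = sg_request_ok

print(nacl_request_ok, nacl_reply_ok, sg_request_ok, sg_reply_ok)
```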

What is IMDSv2 and why does it matter?

The Instance Metadata Service v2 requires a session token obtained via a PUT request before any GET. This blocks most SSRF attacks, in which a vulnerable web app is tricked into fetching credentials from 169.254.169.254, because a typical SSRF can only issue simple GET requests. Always set HttpTokens=required on new launches. The 2019 Capital One breach involved SSRF against IMDSv1.
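
The two-step flow looks like this with only the standard library. The helper names are hypothetical; the header names and the 21600-second maximum TTL follow the IMDSv2 documentation. fetch_metadata only works when run on an EC2 instance.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def build_token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: a PUT that asks IMDS for a session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2: a GET that presents the token in a header."""
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_metadata(path: str, timeout: float = 2.0) -> str:
    """Run both steps and return the metadata value as text."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    request = build_metadata_request(path, token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode()


# On an instance: print(fetch_metadata("instance-id"))
```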

Reserved Instance vs. Savings Plan — which would you pick?

Savings Plans are usually the better default: Compute Savings Plans apply the commitment automatically across instance families, sizes, and regions, so you don't need to predict the exact instance type. RIs are still useful when you want a capacity reservation in a specific AZ (zonal RIs), or for services such as RDS, Redshift, and ElastiCache that Compute Savings Plans don't cover.

How do you achieve high availability with EC2?

Distribute instances across multiple Availability Zones inside an Auto Scaling Group, register them behind an Application or Network Load Balancer, use ELB health checks to replace failures automatically, and store state in shared services (RDS Multi-AZ, ElastiCache, S3) rather than on the instance. For region-level HA, replicate to a second region and front with Route 53 failover or Global Accelerator.
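
The single-region part of that answer maps onto one create_auto_scaling_group call. The sketch below only builds the request parameters. The builder name, defaults, and resource identifiers are hypothetical, and it assumes the caller passes subnets that sit in different AZs, which the code cannot verify from IDs alone.

```python
def build_asg_request(name: str, launch_template_name: str,
                      subnet_ids: list[str], target_group_arn: str,
                      min_size: int = 2, max_size: int = 6) -> dict:
    """Build keyword arguments for autoscaling.create_auto_scaling_group."""
    if len(subnet_ids) < 2:
        raise ValueError("pass subnets in at least two AZs for HA")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": min_size,
        # Comma-separated subnet IDs; the ASG balances instances across them.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        # Replace instances that the load balancer marks unhealthy.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 120,  # seconds allowed for boot
    }


# Usage (needs boto3 and credentials; all identifiers are placeholders):
# import boto3
# params = build_asg_request("app-asg", "app-template",
#                            ["subnet-aaa", "subnet-bbb"], "<target-group-arn>")
# boto3.client("autoscaling").create_auto_scaling_group(**params)
```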

Best Practices:

Immutable infrastructure — bake AMIs, don't configure live.
Use Systems Manager Session Manager instead of opening SSH (port 22) to the internet.
Attach instance roles rather than embedding AWS credentials.
Distribute across AZs behind an Elastic Load Balancer for HA.
Tag aggressively — cost allocation, automation, and access control all rely on tags.
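
Tagging rules only help if they are enforced. A minimal audit sketch, assuming instance dictionaries in the shape returned by describe_instances; the required tag set is hypothetical and mirrors the launch example earlier in these notes.

```python
REQUIRED_TAGS = frozenset({"Name", "Environment", "Owner"})


def missing_tags(instance: dict, required=REQUIRED_TAGS) -> set:
    """Return the required tag keys that the instance does not carry."""
    present = {tag["Key"] for tag in instance.get("Tags", [])}
    return set(required) - present


def audit(instances: list) -> dict:
    """Map instance ID to missing tag keys, listing only non-compliant ones."""
    report = {}
    for instance in instances:
        gaps = missing_tags(instance)
        if gaps:
            report[instance["InstanceId"]] = gaps
    return report


# Hand-written sample data in the describe_instances shape.
sample_instances = [
    {"InstanceId": "i-aaa", "Tags": [
        {"Key": "Name", "Value": "app-worker-01"},
        {"Key": "Environment", "Value": "prod"},
        {"Key": "Owner", "Value": "kevin"},
    ]},
    {"InstanceId": "i-bbb", "Tags": [{"Key": "Name", "Value": "scratch"}]},
    {"InstanceId": "i-ccc"},  # untagged instances have no "Tags" key
]

print(audit(sample_instances))
```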