AWS CloudWatch

AWS CloudWatch is the umbrella observability service for AWS: metrics, logs, traces, alarms, dashboards, and synthetic canaries in one platform. Most AWS services publish metrics into CloudWatch by default, and many also stream structured logs and events into adjacent services (CloudWatch Logs, EventBridge, X-Ray) that integrate with the CloudWatch console.

Key Features:

Metrics: Time-series numerical data with up to 1-second resolution for high-resolution custom metrics; most AWS-emitted metrics default to 1-minute granularity (EC2 basic monitoring reports every 5 minutes unless detailed monitoring is enabled).
Alarms: Threshold or anomaly-detection alarms triggering SNS, EC2 Auto Scaling, EventBridge, or Lambda actions; composite alarms combine multiple alarms with boolean logic.
Logs: CloudWatch Logs ingests application, system, and AWS service logs into log groups with retention from 1 day to indefinite; supports subscription filters to Lambda/Kinesis/Firehose.
Logs Insights: Purpose-built query language over log data with visualization; query time and cost scale with the volume of data scanned, so narrow the time range and log groups for fast results.
Dashboards: Multi-widget visual layouts mixing metrics, logs queries, alarm status, and explorer views; cross-account/cross-region viewing.
Anomaly Detection: ML-trained bands around metrics for alarms that adapt to seasonality without static thresholds.
Synthetics & RUM: Canary scripts simulate user journeys; Real User Monitoring captures browser performance and errors.
Container Insights & Lambda Insights: Pre-built dashboards and metrics for ECS, EKS, and Lambda.
Cross-Account Observability: Source/monitoring account model exposes metrics, logs, and traces from many accounts in one console.
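
As a rough sketch of the anomaly-detection alarms mentioned above, the helper below builds keyword arguments for boto3's `put_metric_alarm` using an `ANOMALY_DETECTION_BAND` metric-math expression. The helper name, the evaluation settings, and the band width of 2 are illustrative assumptions, not recommended values.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, band_width=2):
    """Build put_metric_alarm kwargs for a band-based alarm (hypothetical helper)."""
    return {
        "AlarmName": alarm_name,
        # Fire when the metric leaves the expected band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,        # assumed value
        "ThresholdMetricId": "band",   # points at the band expression below
        "TreatMissingData": "missing",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }
```

With a real client this would be passed as `boto3.client("cloudwatch").put_metric_alarm(**anomaly_alarm_params("OrdersAnomaly", "MyApp/Orders", "OrdersProcessed"))`.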

Common Use Cases:

Infrastructure Health: CPU, disk I/O, and network for EC2, RDS, and ECS tasks (EC2 memory and disk-space usage require the CloudWatch agent); Lambda errors and duration.
Application Performance Monitoring: Custom business metrics (orders/sec, queue depth) with alarms on degradation.
Log Centralization: Lambda, ECS, EKS, and EC2 logs streamed to CloudWatch Logs for searchable retention.
Auto Scaling Triggers: Target tracking and step scaling policies driven by CloudWatch alarms.
SLO Monitoring: Composite alarms and Logs Insights queries computing error budgets.
Security Operations: Metric filters on log groups (e.g., failed SSH logins) feeding alarms.
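
The failed-SSH-login case can be sketched as a metric filter. The helper below builds arguments for the CloudWatch Logs `put_metric_filter` call; the filter name, metric name, namespace, and the `"Failed password"` pattern (which assumes typical sshd log text) are all illustrative assumptions.

```python
def failed_ssh_filter_params(log_group_name):
    """Build put_metric_filter kwargs counting failed SSH logins (hypothetical names)."""
    return {
        "logGroupName": log_group_name,
        "filterName": "FailedSshLogins",
        # Quoted term: matches log events containing this exact phrase.
        "filterPattern": '"Failed password"',
        "metricTransformations": [{
            "metricName": "FailedSshLoginCount",
            "metricNamespace": "Security/SSH",
            "metricValue": "1",    # add 1 to the metric per matching event
            "defaultValue": 0.0,   # report 0 when nothing matches
        }],
    }
```

A call such as `boto3.client("logs").put_metric_filter(**failed_ssh_filter_params("/var/log/secure"))` would create the filter; an ordinary metric alarm on `FailedSshLoginCount` then completes the pipeline.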

Service Limits & Quotas:

Metrics per region: no hard limit; data points are retained for up to 15 months, rolled up over time (1m for 15 days, 5m for 63 days, 1h for 15 months).
Custom metric dimensions: 30 per metric.
Alarms per account per region: default soft limit 5,000 (raisable).
PutMetricData: default 150 transactions/sec, max 1,000 metrics per request.
Log group retention: 1 day to never-expire; indefinite by default (cost trap).
Log event size: 256 KB per event (hard).
Logs Insights query: 60-minute timeout; up to 50 log groups per query when selected by name, or up to 10,000 when selected by prefix or criteria.
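
One practical consequence of the per-request cap on PutMetricData is that large metric sets need batching. A minimal sketch, assuming the 1,000-metric cap listed above; the function names are hypothetical, and request payload size limits (not handled here) may force smaller batches.

```python
def batch_metric_data(metric_data, max_per_request=1000):
    """Yield slices of metric_data no larger than max_per_request."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    for start in range(0, len(metric_data), max_per_request):
        yield metric_data[start:start + max_per_request]


def publish_in_batches(cw_client, namespace, metric_data):
    """Send one PutMetricData call per batch; return the number of calls made."""
    calls = 0
    for batch in batch_metric_data(metric_data):
        cw_client.put_metric_data(Namespace=namespace, MetricData=batch)
        calls += 1
    return calls
```

Here `cw_client` is expected to be a `boto3.client("cloudwatch")` instance; passing it in keeps the helper easy to test with a stub.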

Pricing Model:

Custom metrics: $0.30 per metric per month for the first 10K metrics, tiered down.
API requests: PutMetricData $0.01 per 1,000 requests, GetMetricData $0.01 per 1,000 metrics requested.
Logs ingestion: $0.50/GB ingested (standard); $0.25/GB for Infrequent Access log class.
Logs storage: $0.03/GB/month archived after ingest.
Logs Insights: $0.005 per GB scanned.
Alarms: $0.10 per standard alarm per month, $0.30 per high-resolution alarm.
Common cost surprise: verbose application logs at default infinite retention — set retention on every log group, or convert chatty groups to Infrequent Access. Lambda logs especially balloon costs if functions log every invocation.
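
To make the log pricing concrete, here is a back-of-the-envelope estimator using only the ingestion and storage rates listed above. It is a sketch: real bills depend on region, free tier, compression of stored data, and Logs Insights usage, none of which are modeled.

```python
# Rates copied from the pricing list above (USD); treat them as assumptions
# that may differ by region or change over time.
INGEST_PER_GB = {"standard": 0.50, "infrequent_access": 0.25}
STORAGE_PER_GB_MONTH = 0.03


def estimate_monthly_log_cost(ingested_gb, stored_gb, log_class="standard"):
    """Return estimated monthly USD cost for ingestion plus storage."""
    ingestion = ingested_gb * INGEST_PER_GB[log_class]
    storage = stored_gb * STORAGE_PER_GB_MONTH
    return round(ingestion + storage, 2)
```

For example, 100 GB ingested with 200 GB retained comes to 56.00 in the standard class versus 31.00 in Infrequent Access, which shows why ingestion, not storage, usually dominates the bill.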

Code Example — Custom Metric, Alarm, and Logs Insights:


from datetime import datetime, timezone

import boto3

cw = boto3.client("cloudwatch", region_name="us-west-2")

# Publish one data point for a custom metric.
cw.put_metric_data(
    Namespace="MyApp/Orders",
    MetricData=[{
        "MetricName": "OrdersProcessed",
        "Dimensions": [{"Name": "Environment", "Value": "prod"}],
        "Value": 142,
        "Unit": "Count",
        "Timestamp": datetime.now(timezone.utc),
    }],
)

# Alarm when fewer than 1 order is processed in each of two consecutive
# 5-minute periods. Missing data is treated as breaching, so a publisher
# that goes silent also triggers the alarm.
cw.put_metric_alarm(
    AlarmName="OrdersStalled-prod",
    Namespace="MyApp/Orders",
    MetricName="OrdersProcessed",
    Dimensions=[{"Name": "Environment", "Value": "prod"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=1.0,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-west-2:111122223333:oncall-pager"],
)

Logs Insights Query (Lambda errors by function):


fields @timestamp, @log, @message
| filter @message like /ERROR/
| stats count() as errors by bin(5m), @log
| sort errors desc
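
A query like the one above can also be run programmatically. The sketch below uses the CloudWatch Logs `start_query` and `get_query_results` calls and polls until the query finishes; the helper name, polling interval, and poll limit are assumptions.

```python
import time


def run_insights_query(logs_client, log_group, query, start_ts, end_ts,
                       poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return rows as {field: value} dicts."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start_ts),   # epoch seconds
        endTime=int(end_ts),
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        resp = logs_client.get_query_results(queryId=query_id)
        status = resp["status"]
        if status == "Complete":
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in resp["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete in time")
```

`logs_client` is expected to be `boto3.client("logs")`. Results come back as strings, so numeric columns such as counts need converting by the caller.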

Common Interview Questions:

What's the difference between standard and high-resolution metrics?

Standard metrics have 1-minute granularity (the default). High-resolution metrics record at up to 1-second granularity, and alarms on them can evaluate over 10- or 30-second periods at a higher per-alarm price; they are worth it mainly for fast autoscaling or sub-minute SLOs.
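
In the PutMetricData API the difference comes down to one field: setting `StorageResolution` to 1 marks a datum as high-resolution (the default is 60). A small sketch, with a hypothetical helper name:

```python
from datetime import datetime, timezone


def high_res_datum(name, value, unit="Count"):
    """Build one MetricData entry stored at 1-second resolution."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
        "StorageResolution": 1,  # 1 = high resolution, 60 = standard
    }
```

The result goes in the `MetricData` list of `put_metric_data`, exactly like a standard datum.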

How long are CloudWatch metrics retained?

1-second data for 3 hours, 1-minute data for 15 days, 5-minute data for 63 days, 1-hour data for 15 months. After that, the data is gone — export to a long-term store (S3 via metric streams) if you need history beyond 15 months.
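
The retention tiers translate into a simple lookup: given how old the data is, what is the finest period still available? A minimal sketch, treating 15 months as 455 days (an approximation) and using a hypothetical function name:

```python
from datetime import timedelta

# (maximum age, finest period in seconds) from the tiers above.
# The 1-second tier only applies to high-resolution metrics.
RETENTION_TIERS = [
    (timedelta(hours=3), 1),
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
]


def finest_period_seconds(age):
    """Return the smallest period retained for data of this age, or None if expired."""
    for max_age, period in RETENTION_TIERS:
        if age <= max_age:
            return period
    return None
```

A dashboard asking for 1-minute data from 30 days ago would therefore need to request a 300-second period instead.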

What is the EMF (Embedded Metric Format) and why use it?

A JSON log format that CloudWatch Logs auto-extracts into metrics. Lets you log structured events from Lambda or ECS once and get both searchable logs and metrics, with high-cardinality context kept in log fields rather than in billed metric dimensions, and without extra PutMetricData API calls.
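
A minimal sketch of what an EMF record looks like, built by hand rather than with a helper library. The function name and the example fields are assumptions; the `_aws` / `CloudWatchMetrics` structure follows the published EMF layout.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit="Count",
               dimensions=None, **context):
    """Return one EMF log line as a JSON string."""
    dimensions = dimensions or {}
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set, listing the keys defined below.
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
    }
    record.update(dimensions)  # dimension values live at the top level
    record.update(context)     # extra fields stay log-only, not metric dimensions
    return json.dumps(record)
```

In Lambda, printing the result (for example `print(emf_record("MyApp/Orders", "OrdersProcessed", 142, dimensions={"Environment": "prod"}, order_id="o-123"))`) is enough, since stdout is shipped to CloudWatch Logs; `order_id` remains searchable in the log without becoming a dimension.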

Composite alarm vs. alarm action chain — when use each?

A composite alarm fires when a boolean expression over child alarms evaluates true (e.g., high-error AND low-traffic). Action chains run when one alarm transitions. Composite alarms are the right way to suppress noisy correlated alerts and define SLO conditions.
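
The boolean expression is passed to `put_composite_alarm` as an `AlarmRule` string built from functions such as `ALARM("name")`. A sketch of composing one, assuming child alarms with the hypothetical names shown in the usage note:

```python
def composite_rule(all_in_alarm, none_in_alarm=()):
    """Build an AlarmRule that requires some alarms firing and others not."""
    parts = [f'ALARM("{name}")' for name in all_in_alarm]
    parts += [f'NOT ALARM("{name}")' for name in none_in_alarm]
    if not parts:
        raise ValueError("at least one child alarm is required")
    return " AND ".join(parts)


def composite_alarm_params(alarm_name, rule, action_arns):
    """Build put_composite_alarm kwargs (hypothetical helper)."""
    return {
        "AlarmName": alarm_name,
        "AlarmRule": rule,
        "AlarmActions": list(action_arns),
        "ActionsEnabled": True,
    }
```

For instance, `composite_rule(["HighErrorRate", "LowTraffic"])` produces `ALARM("HighErrorRate") AND ALARM("LowTraffic")`, and adding `none_in_alarm=["DeployInProgress"]` would suppress the page during deployments.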

How do you reduce CloudWatch Logs cost on a chatty service?

Set explicit retention on every log group (default is forever), filter logs at the source (Lambda Powertools, log levels), use Infrequent Access log class at half the ingestion price, and avoid logging entire request/response payloads.
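
The first step, explicit retention everywhere, is easy to automate. The sketch below walks all log groups and sets a retention policy on those that have none; the function name and the 30-day default are assumptions, and log groups without a `retentionInDays` field are taken to be never-expire.

```python
def set_default_retention(logs_client, retention_days=30):
    """Apply retention_days to log groups with no retention; return their names."""
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # Groups that never expire have no retentionInDays key.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=retention_days,
                )
                updated.append(group["logGroupName"])
    return updated
```

Run with `boto3.client("logs")` on a schedule, this catches log groups that Lambda and other services create automatically with indefinite retention.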

CloudWatch vs. third-party (Datadog, New Relic)?

CloudWatch is usually the cheapest option for AWS-native telemetry, has the deepest AWS service coverage, and needs no agent for native AWS metrics. Third-party tools often win on UX, cross-cloud, APM, and richer alerting workflows. Most teams keep CloudWatch as the primary store and stream a subset to a third-party platform.

CloudWatch is the default observability fabric for AWS — start with it, enable retention policies on day one, and reach for third-party platforms only when application-level APM or cross-cloud correlation is required.