AWS CloudWatch

AWS CloudWatch is the umbrella observability service for AWS: metrics, logs, traces, alarms, dashboards, and synthetic canaries in one platform. Most AWS services publish metrics into CloudWatch automatically, and many also send logs and events to closely related services (CloudWatch Logs, EventBridge, formerly CloudWatch Events, and X-Ray) that integrate with the CloudWatch console.


Key Features:


Common Use Cases:


Service Limits & Quotas:


Pricing Model:


Code Example — Custom Metric, Alarm, and Logs Insights:


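import time

import boto3

cw = boto3.client("cloudwatch", region_name="us-west-2")

# Publish one data point for a custom metric. Namespace, metric name, and
# dimensions together identify the metric; each unique combination is billed
# as a separate custom metric.
cw.put_metric_data(
    Namespace="MyApp/Orders",
    MetricData=[{
        "MetricName": "OrdersProcessed",
        "Dimensions": [{"Name": "Environment", "Value": "prod"}],
        "Value": 142,
        "Unit": "Count",
        "Timestamp": time.time(),  # epoch seconds; a datetime also works
    }],
)

# Page on-call when fewer than 1 order is processed in each of two consecutive
# 5-minute periods. The alarm's dimensions must match the published ones exactly.
# TreatMissingData="breaching" also fires the alarm if the producer stops sending data.
cw.put_metric_alarm(
    AlarmName="OrdersStalled-prod",
    Namespace="MyApp/Orders",
    MetricName="OrdersProcessed",
    Dimensions=[{"Name": "Environment", "Value": "prod"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=1.0,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-west-2:111122223333:oncall-pager"],
)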

Logs Insights Query (Lambda errors by function):


fields @timestamp, @log, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m), @log
| sort errorCount desc
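
The same kind of query can be run from code with the CloudWatch Logs StartQuery and GetQueryResults APIs. The sketch below assumes a boto3 Logs client is passed in (so it can be tested without AWS); the function name run_insights_query and the constant ERRORS_BY_FUNCTION are hypothetical, and the query is a variant that aliases the count so the results can be sorted by it.

```python
import time
from typing import Any

# Hypothetical variant of the query above: alias the count so output can be sorted by it.
ERRORS_BY_FUNCTION = """fields @timestamp, @log, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m), @log
| sort errorCount desc"""

# Statuses after which polling should stop without results.
TERMINAL_FAILURES = {"Failed", "Cancelled", "Timeout", "Unknown"}


def run_insights_query(logs_client: Any, log_groups: list, start: int, end: int,
                       query: str = ERRORS_BY_FUNCTION,
                       poll_seconds: float = 1.0) -> list:
    """Start a Logs Insights query and poll until it reaches a final status.

    logs_client is assumed to be boto3.client("logs"); it is injected so the
    function can be exercised with a fake. start and end are epoch seconds.
    Returns one dict per result row, mapping field name to value.
    """
    query_id = logs_client.start_query(
        logGroupNames=log_groups,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]

    while True:
        resp = logs_client.get_query_results(queryId=query_id)
        status = resp["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in resp["results"]]
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)  # still Scheduled or Running
```

Usage might look like run_insights_query(boto3.client("logs"), ["/aws/lambda/order-processor"], now - 3600, now), where the log group name is hypothetical. Because each Lambda function writes to its own log group, grouping by @log effectively groups by function.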
  


Common Interview Questions:

What's the difference between standard and high-resolution metrics?

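Standard-resolution metrics have 1-minute granularity (many AWS service metrics, such as EC2 basic monitoring, arrive only every 5 minutes). High-resolution custom metrics, published with StorageResolution=1, are stored at 1-second granularity, and alarms on them can evaluate every 10 or 30 seconds at a higher per-alarm price. They are worth it only for fast autoscaling or sub-minute SLOs.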
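
A minimal sketch of publishing a high-resolution metric and alarming on it with a 10-second period. The namespace, metric name, alarm name, and threshold are hypothetical; the client is assumed to be boto3.client("cloudwatch") and is passed in so the functions can be tested offline.

```python
from typing import Any

NAMESPACE = "MyApp/Checkout"   # hypothetical namespace
METRIC = "LatencyMs"           # hypothetical metric name


def publish_latency(cw_client: Any, latency_ms: float) -> None:
    """Publish one data point as a high-resolution custom metric."""
    cw_client.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[{
            "MetricName": METRIC,
            "Value": latency_ms,
            "Unit": "Milliseconds",
            "StorageResolution": 1,  # 1 = high resolution; 60 (the default) = standard
        }],
    )


def create_fast_alarm(cw_client: Any, sns_topic_arn: str) -> None:
    """Alarm evaluated every 10 seconds, intended for a high-resolution metric."""
    cw_client.put_metric_alarm(
        AlarmName="CheckoutLatencyHigh",  # hypothetical
        Namespace=NAMESPACE,
        MetricName=METRIC,
        Statistic="Average",
        Period=10,            # 10 or 30 makes this a high-resolution alarm, billed at a higher rate
        EvaluationPeriods=3,  # three breaching 10-second periods (30 s) trigger the alarm
        Threshold=500.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],
    )
```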

How long are CloudWatch metrics retained?

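Data points with a period under 60 seconds (high-resolution metrics) are kept for 3 hours, 1-minute data for 15 days, 5-minute data for 63 days, and 1-hour data for 455 days (15 months). Finer data is aggregated into the coarser tiers as it ages and is deleted after 15 months, so if you need longer history, export it to a long-term store, for example with a metric stream that delivers through Amazon Data Firehose to S3.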
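
The retention tiers above determine the finest Period you can still request for data of a given age. This is a small illustrative helper (finest_period is a hypothetical name, not an AWS API) that encodes the tiers as stated.

```python
from datetime import timedelta
from typing import Optional

# Retention tiers as described above: (maximum data age, finest period kept, in seconds).
RETENTION_TIERS = [
    (timedelta(hours=3), 1),      # 1-second data, high-resolution metrics only
    (timedelta(days=15), 60),     # 1-minute data
    (timedelta(days=63), 300),    # 5-minute data
    (timedelta(days=455), 3600),  # 1-hour data (15 months)
]


def finest_period(age: timedelta, high_resolution: bool = False) -> Optional[int]:
    """Return the finest period (seconds) still stored for data this old, or None if expired."""
    for max_age, period in RETENTION_TIERS:
        if period == 1 and not high_resolution:
            continue  # standard-resolution metrics never had 1-second data
        if age <= max_age:
            return period
    return None
```

For example, a dashboard asking for data from 20 days ago should use a Period of at least 300 seconds; anything older than 455 days is gone.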

What is the EMF (Embedded Metric Format) and why use it?

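A JSON log format that CloudWatch Logs automatically extracts into metrics. You write a structured event from Lambda or ECS once and get both a searchable log line, which can carry high-cardinality fields such as request IDs, and metrics on the dimensions you declare, without separate PutMetricData API calls. Keep metric dimensions low-cardinality: every unique dimension combination is billed as its own custom metric.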
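
A sketch of building one EMF log line by hand, assuming the documented "_aws" metadata block (Timestamp in epoch milliseconds plus a CloudWatchMetrics directive). The namespace, function name, and field names are hypothetical. Only DurationMs becomes a metric; RequestId stays a plain, searchable log property.

```python
import json
import time
from typing import Optional


def emf_record(function_name: str, duration_ms: float, request_id: str,
               timestamp_ms: Optional[int] = None) -> str:
    """Build one Embedded Metric Format log line as a JSON string."""
    doc = {
        "_aws": {
            "Timestamp": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp/Orders",           # hypothetical namespace
                "Dimensions": [["FunctionName"]],      # low-cardinality dimension only
                "Metrics": [{"Name": "DurationMs", "Unit": "Milliseconds"}],
            }],
        },
        # Top-level members referenced by the directive above.
        "FunctionName": function_name,
        "DurationMs": duration_ms,
        # High-cardinality context: searchable in Logs Insights, not a metric dimension.
        "RequestId": request_id,
    }
    return json.dumps(doc)


# In Lambda, lines written to stdout land in the function's log group.
print(emf_record("order-processor", 87.5, "req-123"))
```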

Composite alarm vs. alarm action chain: when should you use each?

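A composite alarm changes state when a boolean rule over other alarms' states evaluates true (e.g., high-error AND low-traffic). In an action chain, each metric alarm runs its own actions whenever it changes state, so correlated symptoms page separately. Composite alarms are the right way to suppress noisy correlated alerts and define SLO conditions; plain per-alarm actions remain fine for simple, independent signals.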
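
A sketch of a composite alarm using the AlarmRule expression syntax (ALARM(...), AND, NOT). It shows one variant of the boolean rule: page on high errors only while traffic is not low. The alarm names and the helper composite_rule are hypothetical, the child alarms are assumed to exist already, and the client is assumed to be boto3.client("cloudwatch").

```python
from typing import Any


def composite_rule(error_alarm: str, low_traffic_alarm: str) -> str:
    """Rule that is true while errors are high and traffic is not low."""
    return f'ALARM("{error_alarm}") AND NOT ALARM("{low_traffic_alarm}")'


def create_composite_alarm(cw_client: Any, sns_topic_arn: str) -> None:
    cw_client.put_composite_alarm(
        AlarmName="ApiErrorsActionable-prod",  # hypothetical
        AlarmRule=composite_rule("ApiHighErrorRate-prod", "ApiLowTraffic-prod"),
        # Only the composite pages; the child alarms can be left without actions.
        AlarmActions=[sns_topic_arn],
    )
```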

How do you reduce CloudWatch Logs cost on a chatty service?

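Set explicit retention on every log group (the default is Never expire), filter logs at the source (Lambda Powertools, log levels), create high-volume, rarely queried groups with the Infrequent Access log class (about half the standard ingestion price, chosen at creation, and missing features such as metric filters), and avoid logging entire request/response payloads.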
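
A minimal sketch of the first and third tips: sweep all log groups and give any never-expiring group a retention policy, and create a new group in the Infrequent Access class. The function names and the 30-day default are hypothetical choices; the client is assumed to be boto3.client("logs"), injected so the code can be tested offline.

```python
from typing import Any


def set_default_retention(logs_client: Any, days: int = 30) -> list:
    """Give every never-expiring log group a retention policy; return the groups changed."""
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # retentionInDays is absent from the response when the group never expires.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=days,  # must be one of the allowed values (1, 7, 14, 30, 90, ...)
                )
                updated.append(group["logGroupName"])
    return updated


def create_infrequent_access_group(logs_client: Any, name: str) -> None:
    # The log class is fixed when the group is created and cannot be changed later.
    logs_client.create_log_group(logGroupName=name, logGroupClass="INFREQUENT_ACCESS")
```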

CloudWatch vs. third-party (Datadog, New Relic)?

CloudWatch is usually the cheapest option for AWS-native telemetry (basic service metrics are free), has the deepest AWS service coverage, and needs no agent for native AWS metrics. Third-party tools often win on UX, cross-cloud visibility, APM, and richer alerting workflows. Most teams keep CloudWatch as the primary store and stream a subset to a third-party platform.
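
One way to stream a subset is a CloudWatch metric stream limited by namespace filters. This sketch assumes an existing Firehose delivery stream (pointed at the vendor or S3) and an IAM role that CloudWatch can assume to write to it; the stream name and the function stream_namespaces are hypothetical, and the client is assumed to be boto3.client("cloudwatch").

```python
from typing import Any


def stream_namespaces(cw_client: Any, namespaces: list,
                      firehose_arn: str, role_arn: str) -> None:
    """Continuously stream only the listed metric namespaces to a Firehose delivery stream."""
    cw_client.put_metric_stream(
        Name="third-party-subset",  # hypothetical stream name
        FirehoseArn=firehose_arn,   # delivery stream targeting the vendor endpoint or S3
        RoleArn=role_arn,           # role CloudWatch assumes to write to Firehose
        OutputFormat="json",        # OpenTelemetry formats are also accepted
        IncludeFilters=[{"Namespace": ns} for ns in namespaces],
    )
```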


CloudWatch is the default observability fabric for AWS — start with it, enable retention policies on day one, and reach for third-party platforms only when application-level APM or cross-cloud correlation is required.