AWS AI/ML Services Overview
AWS organizes its AI and ML offerings into three layers. Picking the right layer is primarily about how much of the ML stack you want to own: more managed at the top, more flexible at the bottom.
Layer 1 — Generative AI (Foundation Models)
- Amazon Bedrock: Managed API for foundation models from Anthropic, Meta, Mistral, Cohere, AI21, Stability, and Amazon — plus Knowledge Bases (RAG), Agents, and Guardrails.
- Amazon Q: AWS-native AI assistant tuned for business data (Q Business) and AWS development (Q Developer) — built on Bedrock under the hood.
- SageMaker JumpStart: Deploy and fine-tune open-source foundation models (Llama, Mistral, Falcon, Stable Diffusion) on dedicated SageMaker endpoints.
Layer 2 — Task-Specific AI APIs
- Amazon Comprehend / Comprehend Medical: NLP — entities, sentiment, PII redaction, clinical concept extraction.
- Amazon Textract: Document AI — OCR, forms, tables, Analyze Expense / ID / Lending.
- Amazon Rekognition: Computer vision — labels, faces, moderation, custom image classifiers.
- Amazon Transcribe & Polly: Speech-to-text and text-to-speech; Transcribe Call Analytics surfaces conversation insights.
- Amazon Translate: Neural machine translation across 75+ languages.
- Amazon Personalize: Managed recommendation and personalization models trained on your interaction data.
- Amazon Forecast: Time-series forecasting using statistical + deep-learning ensembles (note: no longer open to new customers; AWS points new time-series work at SageMaker).
- Amazon Kendra: Enterprise semantic search across connectors to SharePoint, Confluence, S3, and more.
Layer 3 — ML Platform
- Amazon SageMaker: Full ML lifecycle — labeling, training, tuning, deployment, monitoring, feature store, pipelines.
- EC2 + Deep Learning AMIs: Raw compute for teams that want to manage their own training stack.
- AWS Trainium / Inferentia: Purpose-built accelerators for cost-efficient training and inference.
Choosing Between Layers:
- Start with task-specific APIs if a managed API matches your use case — fastest to production, no training data required.
- Use Bedrock when the task is generative, open-ended, or benefits from foundation-model reasoning. Add Knowledge Bases for RAG and Guardrails for safety.
- Drop to SageMaker when you need full control over training, hosting, or custom model architectures.
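The selection heuristic above can be sketched as a toy routing function — illustrative pseudologic only, not an AWS API:

```python
def choose_layer(well_defined: bool, generative: bool, needs_full_control: bool) -> str:
    """Toy sketch of the layer-selection heuristic described above."""
    if needs_full_control:
        # Custom architectures, hosting, or training -> drop to the platform layer.
        return "SageMaker (Layer 3)"
    if generative or not well_defined:
        # Open-ended or reasoning-heavy tasks -> foundation models.
        return "Bedrock (Layer 1)"
    # Well-defined task with a matching managed API -> fastest to production.
    return "Task-specific API (Layer 2)"

print(choose_layer(well_defined=True, generative=False, needs_full_control=False))
# Task-specific API (Layer 2)
```

Real systems route per-feature, not per-application — one product often uses all three layers at once.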
Service Limits & Quotas (Common Patterns):
- Bedrock: per-model quotas on tokens per minute (TPM) and requests per minute (RPM) vary by model and region; provisioned throughput available for guaranteed capacity.
- SageMaker: instance-type-specific quotas (e.g., default soft limit on ml.p4d/ml.p5 capacity); raise via Service Quotas with justification.
- Comprehend / Textract / Rekognition: per-API TPS and async job concurrency quotas — most are soft limits.
- Personalize: 5 datasets, 10 solutions, and 5 campaigns per dataset group by default.
- Translate: 5,000 bytes per real-time request; up to 20 MB per async batch document.
- Forecast: dataset import jobs scale to billions of rows; predictor training time grows with dataset and algorithm choice.
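Quota-aware clients handle these limits in code. For example, Translate's per-request byte cap means long documents must be split client-side. A minimal sketch, assuming words themselves fit under the limit (pass the limit in — check the current quota rather than hard-coding it):

```python
def chunk_by_bytes(text: str, max_bytes: int) -> list[str]:
    """Split text into chunks whose UTF-8 size stays <= max_bytes,
    breaking on whitespace so no word is cut mid-way.
    Assumes no single word exceeds max_bytes."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate.encode("utf-8")) > max_bytes and current:
            chunks.append(current)   # flush the full chunk
            current = word           # start the next one
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

parts = chunk_by_bytes("word " * 3000, max_bytes=5000)
print(len(parts), all(len(p.encode("utf-8")) <= 5000 for p in parts))
# 3 True
```

Each chunk then goes through a separate `TranslateText` call; re-joining the translated chunks is left to the caller.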
Pricing Model (Layer Patterns):
- Bedrock: per 1,000 input/output tokens per model; provisioned throughput per model unit per hour.
- Task APIs: per unit of input — characters (Comprehend, Translate), pages (Textract), images/minutes (Rekognition).
- SageMaker: per training-instance-hour, per endpoint-instance-hour (real-time), per inference for serverless endpoints, plus storage and data transfer.
- Custom models (Comprehend custom classification/entity recognition, Rekognition Custom Labels): training billed per hour; inference billed per hour the model is hosted.
- Common cost surprises: SageMaker endpoints left running idle (per-hour while up); Bedrock with very long contexts; Rekognition Custom Labels models hosted 24/7 in dev; chatty per-image API calls without batching.
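Token-based pricing is easy to estimate back-of-envelope. A sketch — the per-1K prices below are made-up placeholders, not real Bedrock rates; always pull current numbers from the price list:

```python
def estimate_bedrock_cost(input_tokens: int, output_tokens: int,
                          price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Back-of-envelope on-demand Bedrock cost. Prices vary by model and
    region -- the values passed in must come from the current price list."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Hypothetical monthly volume and hypothetical prices, for illustration only.
monthly = estimate_bedrock_cost(
    input_tokens=50_000_000, output_tokens=5_000_000,
    price_in_per_1k=0.003, price_out_per_1k=0.015,
)
print(f"${monthly:,.2f}")
# $225.00
```

Running this kind of estimate per model is the quickest way to decide whether a smaller model or provisioned throughput pays off.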
Code Example — Picking the Right Layer:
Same conceptual task ("understand this customer review") at three layers:
import json

import boto3
# Layer 2 (task API) — predictable cost, structured output
comprehend = boto3.client("comprehend")
review = "The screen is gorgeous but battery life is awful."
sent = comprehend.detect_sentiment(Text=review, LanguageCode="en")["Sentiment"]
ents = comprehend.detect_entities(Text=review, LanguageCode="en")["Entities"]
# Layer 1 (Bedrock) — open-ended reasoning, more flexible output
bedrock = boto3.client("bedrock-runtime")
prompt = (
    "Extract per-feature sentiment from this review as JSON list of "
    "{feature, sentiment, evidence}.\n\n" + review
)
resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-7-sonnet-20250219-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }),
)
print(json.loads(resp["body"].read())["content"][0]["text"])
# Layer 3 (SageMaker) — deploy a fine-tuned model when you need
# full control over architecture, latency, or custom output schemas.
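At Layer 3 the call shape is similar — you POST a payload to an endpoint you deployed. A sketch of the request construction; the endpoint name is a hypothetical placeholder, and the actual invoke is commented out so the snippet runs without AWS credentials:

```python
import json

# Payload format depends on the model container you deployed;
# {"inputs": ...} is a common Hugging Face convention, assumed here.
payload = json.dumps({"inputs": "The screen is gorgeous but battery life is awful."})

# With a deployed endpoint, the call would look like:
# runtime = boto3.client("sagemaker-runtime")
# resp = runtime.invoke_endpoint(
#     EndpointName="review-sentiment-endpoint",  # placeholder name
#     ContentType="application/json",
#     Body=payload,
# )
# result = json.loads(resp["Body"].read())

print(payload)
```

Unlike the Layer 1 and 2 calls, you also own the model artifact, the container, and the instance behind this endpoint — which is exactly the control (and operational burden) Layer 3 buys you.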
Common Interview Questions:
How do you choose between Bedrock and a task-specific API like Comprehend?
Pick the task API when your problem is well-defined and the API output matches what you need (NER, sentiment, OCR, labels) — predictable cost, low latency, structured response. Pick Bedrock when the task is open-ended, requires reasoning, or needs free-form generation. Often the best architecture combines both.
When is SageMaker the right answer?
When neither task APIs nor Bedrock cover your need: custom architectures, training on proprietary data at scale, custom inference logic, on-device export, or fine-grained latency/cost control. SageMaker also hosts JumpStart foundation models when you need dedicated capacity.
How does Amazon Q relate to Bedrock?
Q is built on Bedrock but exposes a higher-level assistant experience. Q Business connects to your enterprise data (S3, SharePoint, Confluence, Salesforce, ServiceNow). Q Developer is the Copilot-style coding assistant inside IDEs and the AWS console. Use Bedrock directly when you want to build your own assistant.
What's the difference between Bedrock Knowledge Bases and Kendra?
Both do retrieval over enterprise content. Knowledge Bases is purpose-built for RAG with foundation models — chunk, embed, retrieve, and augment a model prompt. Kendra is a full enterprise search product with deep connectors and ranking models — useful when you need a search experience first and RAG second.
Which of these services are HIPAA-eligible, and what does compliant use require?
Most production AWS AI services (Bedrock, SageMaker, Comprehend Medical, Textract, Rekognition, Transcribe Medical) are HIPAA-eligible under a signed BAA. Always verify the current eligibility list and configure region/data residency appropriately.
How do you keep generative AI costs predictable?
Cache responses where possible, use prompt caching when supported, choose smaller models (Haiku, Mistral 7B) for simple tasks, set max-token caps, monitor token usage via CloudWatch, and consider provisioned throughput when traffic is steady enough to justify the commitment.
The AWS AI/ML stack is layered for a reason: start at the highest layer that solves your problem and only drop down when the abstraction stops fitting. Most production systems mix and match — task APIs for primitives, Bedrock for reasoning, SageMaker for the few cases that need full control.