Amazon Comprehend
Amazon Comprehend is a managed natural-language processing (NLP) service that extracts insights from unstructured text. It exposes task-specific APIs — entity recognition, sentiment, key phrases, syntax, language detection, topic modeling, and PII redaction — without requiring ML expertise or model training. Comprehend Medical adds clinical concept extraction and ICD-10/RxNorm coding.
Key Features:
- Pre-Trained APIs: Ready-to-call endpoints for entities, sentiment (positive/negative/neutral/mixed), key phrases, language detection, and syntax parsing.
- Targeted Sentiment: Sentiment scored per entity, not just per document — useful for review analysis (e.g., "battery: negative, screen: positive").
- Custom Classification & Entity Recognition: Train domain-specific classifiers and entity recognizers from labeled examples when the generic models aren't enough.
- Comprehend Medical: Specialized variant that extracts medical entities (conditions, medications, dosages) and ICD-10-CM / RxNorm codes from clinical text while maintaining HIPAA eligibility.
- PII Detection & Redaction: Identifies and redacts personal data (names, SSNs, addresses, credit cards) — useful for compliance pipelines over logs or support tickets.
- Topic Modeling: Asynchronous job over a document corpus in S3 that produces topics and per-document topic distributions.
- Real-Time or Batch: Synchronous API for single documents, asynchronous jobs for large S3 datasets.
- Real-Time Endpoints: Custom models can be deployed as low-latency real-time endpoints with provisioned throughput (inference units).
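The batch sync APIs accept at most 25 documents per call, so larger workloads need client-side batching before invoking something like BatchDetectSentiment. A minimal sketch, where the helper name and document list are illustrative:

```python
# Sketch: group documents into lists of at most 25, the batch-API maximum,
# before calling a batch sync API such as batch_detect_sentiment.
def batches(docs, size=25):
    """Yield successive lists of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

# With a boto3 Comprehend client this would be used roughly as:
#   for batch in batches(all_docs):
#       resp = comprehend.batch_detect_sentiment(TextList=batch, LanguageCode="en")
#       # resp["ResultList"] holds per-document results; resp["ErrorList"] any failures

docs = [f"doc {i}" for i in range(60)]
print([len(b) for b in batches(docs)])  # [25, 25, 10]
```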
Common Use Cases:
- Support Ticket Triage: Classify incoming tickets by topic and sentiment to route to the right team.
- Compliance & PII Redaction: Scrub sensitive data from logs, chat transcripts, or documents before analytics or model training.
- Voice of Customer: Aggregate sentiment and key phrases across reviews and survey responses for product teams.
- Healthcare Text Mining: Extract structured clinical concepts from physician notes using Comprehend Medical.
- Content Tagging: Auto-tag articles or internal documents with entities and topics for search and discovery.
Service Limits & Quotas:
- Sync API document size: up to 100 KB UTF-8 per document for single-document calls.
- Batch sync APIs: up to 25 documents per call.
- Async jobs: millions of documents per S3 input prefix; output written to S3.
- Custom classifier training: minimum 10 examples per label (multi-class) or 50 documents (multi-label); tens of thousands recommended for accuracy.
- Languages: generic APIs support 12+ languages; custom models support a subset of Latin-script languages.
- Real-time endpoint inference units: default soft limit 10 IUs per endpoint.
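A pattern these limits force is splitting long documents before calling the sync APIs. A rough sketch that chunks on whitespace to stay under the 100 KB UTF-8 limit; the helper is illustrative, and a single word larger than the limit would still pass through unsplit:

```python
MAX_BYTES = 100_000  # sync-API per-document limit described above

def chunk_text(text, max_bytes=MAX_BYTES):
    """Split text into whitespace-delimited chunks of at most max_bytes UTF-8 bytes."""
    chunks, current, size = [], [], 0
    for word in text.split():
        wlen = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if size + wlen > max_bytes and current:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += wlen
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_text("one two three four", max_bytes=9))
# ['one two', 'three', 'four']
```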
Pricing Model:
- Sync APIs: billed per 100-character "unit" with a 3-unit minimum per request.
- Async jobs: billed per character processed (cheaper per character than sync).
- Custom models: training billed per hour of model training; inference billed per inference unit per hour for real-time endpoints, or per character for async.
- Comprehend Medical: separate per-100-character pricing, generally higher than standard.
- Common cost surprises: the 3-unit minimum makes very short documents expensive; idle real-time endpoints continue to bill for IUs; running async jobs over uncleaned data inflates character counts.
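The sync metering rule above (100-character units, 3-unit minimum) is easy to sketch; the per-unit rate below is a placeholder, not a real price:

```python
import math

def sync_units(text: str) -> int:
    """Billable units for one sync request: ceil(chars / 100), minimum 3."""
    return max(3, math.ceil(len(text) / 100))

def estimated_cost(texts, rate_per_unit=0.0001):  # placeholder rate, not AWS pricing
    return sum(sync_units(t) for t in texts) * rate_per_unit

print(sync_units("hi"))       # 3  (the minimum applies: short docs cost 3 units)
print(sync_units("x" * 450))  # 5
```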
Code Example:
import boto3
comprehend = boto3.client("comprehend", region_name="us-west-2")
text = "Order #A-482 shipped from Seattle on Tuesday and arrived damaged."
print(comprehend.detect_sentiment(Text=text, LanguageCode="en")["Sentiment"])
# NEGATIVE
for ent in comprehend.detect_entities(Text=text, LanguageCode="en")["Entities"]:
    print(ent["Type"], "->", ent["Text"])
# COMMERCIAL_ITEM -> Order #A-482
# LOCATION -> Seattle
# DATE -> Tuesday
# Detect and redact PII in one call
pii = comprehend.detect_pii_entities(
    Text="John Doe lives at 1 Main St, SSN 123-45-6789",
    LanguageCode="en",
)
for e in pii["Entities"]:
    print(e["Type"], e["BeginOffset"], e["EndOffset"])
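Since detect_pii_entities returns only offsets and types, the redaction itself happens in your code. A sketch using entities that mimic the response shape for the text above; the offsets here are illustrative, not captured API output:

```python
# Sketch: redact spans using BeginOffset/EndOffset, replacing each with its type.
def redact(text, entities, mask="[{type}]"):
    # Replace from the end of the string so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:e["BeginOffset"]] + mask.format(type=e["Type"]) + text[e["EndOffset"]:]
    return text

sample = "John Doe lives at 1 Main St, SSN 123-45-6789"
ents = [{"Type": "NAME", "BeginOffset": 0, "EndOffset": 8},
        {"Type": "ADDRESS", "BeginOffset": 18, "EndOffset": 27},
        {"Type": "SSN", "BeginOffset": 33, "EndOffset": 44}]
print(redact(sample, ents))
# [NAME] lives at [ADDRESS], SSN [SSN]
```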
Common Interview Questions:
When should you use Comprehend instead of a Bedrock LLM?
Use Comprehend for well-defined NLP primitives (NER, sentiment, PII redaction, language detection) where you need predictable cost, low latency, and a structured API response. Use Bedrock when the task is open-ended, requires reasoning, or benefits from instruction-following over arbitrary text.
What's the difference between standard sentiment and targeted sentiment?
Standard sentiment scores a whole document (positive/negative/neutral/mixed). Targeted sentiment associates sentiment with each detected entity in the text — essential for review analytics where one product can have positive sentiment for one feature and negative for another.
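In code, targeted sentiment results arrive as entities with mentions, each mention carrying its own sentiment. A sketch that flattens a DetectTargetedSentiment-style response; the sample dict mimics the response shape, and its values are made up:

```python
# Sketch: extract (mention text, sentiment) pairs from a targeted-sentiment
# response shaped like DetectTargetedSentiment output.
def mention_sentiments(response):
    pairs = []
    for entity in response.get("Entities", []):
        for m in entity.get("Mentions", []):
            pairs.append((m["Text"], m["MentionSentiment"]["Sentiment"]))
    return pairs

sample = {"Entities": [
    {"Mentions": [{"Text": "battery",
                   "MentionSentiment": {"Sentiment": "NEGATIVE"}}]},
    {"Mentions": [{"Text": "screen",
                   "MentionSentiment": {"Sentiment": "POSITIVE"}}]},
]}
print(mention_sentiments(sample))
# [('battery', 'NEGATIVE'), ('screen', 'POSITIVE')]
```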
When do you train a custom classifier or entity recognizer?
When the generic taxonomies don't match your domain — e.g., classifying support tickets into your internal categories, or extracting entities like SKUs, contract clauses, or medical specialties beyond Comprehend Medical's coverage.
How does PII detection differ from PII redaction?
DetectPiiEntities returns offsets and types but doesn't modify text — you redact in your code. The async PII redaction job rewrites documents in S3 with replacements (mask or type label). Useful for compliance scrub before downstream analytics.
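The async redaction path is a StartPiiEntitiesDetectionJob call with Mode set to ONLY_REDACTION. A sketch of the request parameters, where the S3 URIs and IAM role ARN are placeholders:

```python
# Sketch: build the request for an async PII redaction job. All S3 URIs and
# the role ARN below are placeholders you would replace with your own.
def redaction_job_params(in_uri, out_uri, role_arn):
    return {
        "InputDataConfig": {"S3Uri": in_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": out_uri},
        "Mode": "ONLY_REDACTION",
        "RedactionConfig": {"PiiEntityTypes": ["ALL"],
                            "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE"},
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
    }

# With a boto3 Comprehend client:
#   comprehend.start_pii_entities_detection_job(
#       JobName="pii-scrub",
#       **redaction_job_params("s3://bucket/in/", "s3://bucket/out/",
#                              "arn:aws:iam::123456789012:role/ExampleRole"))
```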
Is Comprehend Medical HIPAA eligible?
Yes — Comprehend Medical is in the AWS HIPAA-eligible services list under a signed BAA. It extracts conditions, medications, dosages, anatomy, and codes ICD-10-CM, RxNorm, and SNOMED CT.
How do you keep custom-classifier inference costs low?
Use async batch jobs when latency permits (cheaper per character). For real-time use, right-size inference units to actual peak QPS, delete idle endpoints in dev/test (they bill while provisioned), and combine multiple labels into a single multi-label classifier instead of multiple binary models.
Comprehend complements Bedrock and SageMaker by handling the well-defined NLP primitives that many applications need — reach for it before training a custom model when a task-specific API will do.