Knowledge Bases for Amazon Bedrock

Knowledge Bases for Amazon Bedrock is a managed retrieval-augmented generation (RAG) service. You point it at a data source (S3, a website, a SaaS connector), pick an embedding model and a vector store, and Bedrock handles ingestion, chunking, embedding, indexing, retrieval, citation tracking, and grounded generation. The result is a single API — retrieve for raw chunks, retrieve_and_generate for a fully-grounded answer — that replaces a meaningful slice of custom RAG plumbing.


1. Architecture Overview

A Knowledge Base is a thin orchestrator over four pieces: a data source, a parsing/chunking pipeline, an embedding model, and a vector store.

An ingestion job walks the data source, parses, chunks, embeds, and writes to the vector store. After ingestion, queries hit the vector store and (optionally) the FM for generation.


2. Supported Data Sources

Knowledge Bases ingest from S3, a web crawler, or SaaS connectors (Confluence, Salesforce, SharePoint). S3 is the most common and is what the rest of this section covers.

2.1 S3 Metadata Sidecars

Attach metadata to a chunk to enable filtered retrieval (e.g. only "year=2026" docs). Drop a JSON file next to each source file:


{
  "metadataAttributes": {
    "year":       { "value": { "type": "NUMBER", "numberValue": 2026 } },
    "department": { "value": { "type": "STRING", "stringValue": "HR" } },
    "tags":       { "value": { "type": "STRING_LIST", "stringListValue": ["policy", "leave"] } }
  }
}
  

Filename convention: if the source is policies/2026-leave.pdf, the sidecar is policies/2026-leave.pdf.metadata.json.
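A small helper makes the convention hard to get wrong. This is a sketch: the `sidecar_for` function is ours, and the typed-attribute schema mirrors the JSON shown above (verify the exact schema against the current Bedrock docs before relying on it):

```python
import json

def sidecar_for(source_key: str, attrs: dict) -> tuple[str, str]:
    """Return (sidecar S3 key, JSON body) for a source object's metadata sidecar."""
    def typed(v):
        # bool must be checked before int/float (bool is a subclass of int)
        if isinstance(v, bool):
            return {"value": {"type": "BOOLEAN", "booleanValue": v}}
        if isinstance(v, (int, float)):
            return {"value": {"type": "NUMBER", "numberValue": v}}
        if isinstance(v, list):
            return {"value": {"type": "STRING_LIST", "stringListValue": v}}
        return {"value": {"type": "STRING", "stringValue": str(v)}}

    body = {"metadataAttributes": {k: typed(v) for k, v in attrs.items()}}
    # Sidecar sits next to the source object: <key>.metadata.json
    return source_key + ".metadata.json", json.dumps(body, indent=2)

key, body = sidecar_for("policies/2026-leave.pdf", {"year": 2026, "department": "HR"})
# upload with e.g. s3.put_object(Bucket="company-hr-docs", Key=key, Body=body)
```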


3. Supported Vector Stores

Options include OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone, MongoDB Atlas, and Neptune Analytics. For each store you must pre-create the collection/database and pass field-mapping hints (vector field, text field, metadata field) when creating the KB.
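As one example beyond the OpenSearch Serverless config shown in section 6, an Aurora PostgreSQL (pgvector) backend uses the RDS storage type. A sketch, with placeholder ARNs and names; field names follow the bedrock-agent CreateKnowledgeBase API and are worth double-checking against current docs:

```python
# storageConfiguration for an Aurora PostgreSQL + pgvector backend.
# The table and its columns must already exist with the pgvector extension enabled.
rds_storage = {
    "type": "RDS",
    "rdsConfiguration": {
        "resourceArn": "arn:aws:rds:us-west-2:111111111111:cluster:kb-cluster",
        "credentialsSecretArn": "arn:aws:secretsmanager:us-west-2:111111111111:secret:kb-db-creds",
        "databaseName": "kb",
        "tableName": "bedrock_kb",
        "fieldMapping": {
            "primaryKeyField": "id",
            "vectorField": "embedding",
            "textField": "chunks",
            "metadataField": "metadata",
        },
    },
}
```

Pass this dict as `storageConfiguration` in `create_knowledge_base` in place of the OpenSearch block.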


4. Chunking Strategies
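The three strategies (fixed-size, hierarchical, semantic) are selected per data source via `vectorIngestionConfiguration`. Sketches of each follow; parameter names track the bedrock-agent CreateDataSource API, values are illustrative, and the semantic parameters in particular should be verified against current docs:

```python
# Fixed-size: cheapest; a sensible default for clean prose.
fixed = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {"maxTokens": 300, "overlapPercentage": 20},
}

# Hierarchical: small child chunks for retrieval precision,
# larger parent chunks returned for context.
hierarchical = {
    "chunkingStrategy": "HIERARCHICAL",
    "hierarchicalChunkingConfiguration": {
        "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
        "overlapTokens": 60,
    },
}

# Semantic: split where sentence-embedding similarity drops,
# preserving topical coherence at higher ingestion cost.
semantic = {
    "chunkingStrategy": "SEMANTIC",
    "semanticChunkingConfiguration": {
        "maxTokens": 300,
        "bufferSize": 1,
        "breakpointPercentileThreshold": 95,
    },
}
```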


5. Embedding Model Choice

Pick the embedding model up front and treat it as immutable: switching models means re-embedding and reindexing the entire corpus. Smaller dimensions (Titan v2 supports 256, 512, and 1024) cut vector storage proportionally, so 256d is ~4x smaller than 1024d, with a small recall penalty; worth measuring on your data.
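The storage arithmetic is simple enough to sanity-check up front (raw vector bytes only; HNSW graph and metadata overhead come on top):

```python
def vector_storage_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    # FLOAT32 = 4 bytes per dimension per vector.
    return num_chunks * dims * bytes_per_value

one_m_1024 = vector_storage_bytes(1_000_000, 1024)  # ~4.1 GB raw
one_m_256 = vector_storage_bytes(1_000_000, 256)    # ~1.0 GB raw, 4x smaller
```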


6. Create a Knowledge Base with boto3


import boto3

agent = boto3.client("bedrock-agent", region_name="us-west-2")

kb = agent.create_knowledge_base(
    name="hr-policies",
    description="Internal HR policy documents (US, EMEA, APAC).",
    roleArn="arn:aws:iam::111111111111:role/BedrockKBRole",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0",
            "embeddingModelConfiguration": {
                "bedrockEmbeddingModelConfiguration": {
                    "dimensions": 1024,
                    "embeddingDataType": "FLOAT32",
                }
            },
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-west-2:111111111111:collection/abc123",
            "vectorIndexName": "hr-policies-idx",
            "fieldMapping": {
                "vectorField":   "embedding",
                "textField":     "text",
                "metadataField": "metadata",
            },
        },
    },
)

kb_id = kb["knowledgeBase"]["knowledgeBaseId"]
print("KB:", kb_id)

ds = agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="hr-policies-s3",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn":               "arn:aws:s3:::company-hr-docs",
            "inclusionPrefixes":       ["policies/"],
            "bucketOwnerAccountId":    "111111111111",
        },
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                "levelConfigurations": [
                    {"maxTokens": 1500},  # parent
                    {"maxTokens": 300},   # child
                ],
                "overlapTokens": 60,
            },
        },
    },
)
ds_id = ds["dataSource"]["dataSourceId"]
print("DS:", ds_id)
  


7. Run an Ingestion Job

Ingestion jobs are async. Trigger one whenever the data source changes; Bedrock detects added/modified/deleted files and updates only the affected chunks (incremental sync).


import time

job = agent.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=ds_id)
job_id = job["ingestionJob"]["ingestionJobId"]

while True:
    status = agent.get_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=ds_id, ingestionJobId=job_id,
    )["ingestionJob"]
    state = status["status"]
    print(state, status.get("statistics", {}))
    if state in ("COMPLETE", "FAILED", "STOPPED"):
        break
    time.sleep(10)
  

The statistics block reports documents scanned, indexed, modified, deleted, and failed — log it to CloudWatch as your ingestion SLO.
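A minimal sketch of shaping that block into CloudWatch custom metrics. The field names follow the get_ingestion_job response (verify against the current response shape); the namespace and helper are ours:

```python
def stats_to_metric_data(stats: dict) -> list[dict]:
    """Map an ingestion job's statistics block to CloudWatch MetricData entries."""
    fields = [
        "numberOfDocumentsScanned",
        "numberOfNewDocumentsIndexed",
        "numberOfModifiedDocumentsIndexed",
        "numberOfDocumentsDeleted",
        "numberOfDocumentsFailed",
    ]
    return [
        {"MetricName": f, "Value": float(stats.get(f, 0)), "Unit": "Count"}
        for f in fields
    ]

# cw = boto3.client("cloudwatch")
# cw.put_metric_data(Namespace="KnowledgeBase/Ingestion",
#                    MetricData=stats_to_metric_data(status["statistics"]))
```

An alarm on `numberOfDocumentsFailed` catches silent recall degradation early.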

Trigger ingestion automatically by wiring an S3 EventBridge rule on Object Created events to a Lambda that calls start_ingestion_job.
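A sketch of that Lambda handler. The `KB_ID`/`DS_ID` environment variables and the `object_key` helper are our assumptions; the event shape is EventBridge's S3 "Object Created" detail:

```python
import os

def object_key(event: dict) -> str:
    # Pull the S3 object key out of an EventBridge "Object Created" event.
    return event.get("detail", {}).get("object", {}).get("key", "")

def handler(event, context):
    # boto3 imported lazily so this sketch stays importable without AWS deps.
    import boto3
    agent = boto3.client("bedrock-agent")
    job = agent.start_ingestion_job(
        knowledgeBaseId=os.environ["KB_ID"],
        dataSourceId=os.environ["DS_ID"],
        description=f"auto-sync for {object_key(event)}",
    )
    return {"ingestionJobId": job["ingestionJob"]["ingestionJobId"]}
```

Since sync is incremental, firing on every object event is usually cheap; debounce in the Lambda if your bucket sees bursts of writes.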


8. Retrieve and Retrieve-and-Generate

8.1 retrieve — raw chunks only

Use this when you want to do your own prompting, rerank with a different model, or display raw search results.


runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

resp = runtime.retrieve(
    knowledgeBaseId=kb_id,
    retrievalQuery={"text": "How many weeks of parental leave do EMEA employees get?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "overrideSearchType": "HYBRID",  # SEMANTIC | HYBRID
            "filter": {
                "andAll": [
                    {"equals":      {"key": "department", "value": "HR"}},
                    {"greaterThan": {"key": "year",       "value": 2024}},
                ]
            },
        }
    },
)

for r in resp["retrievalResults"]:
    print(round(r["score"], 3), r["location"], r["content"]["text"][:120])
  

8.2 retrieve_and_generate — grounded answer in one call


resp = runtime.retrieve_and_generate(
    input={"text": "How many weeks of parental leave do EMEA employees get?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-opus-4-7",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {"numberOfResults": 8, "overrideSearchType": "HYBRID"}
            },
            "generationConfiguration": {
                "inferenceConfig": {"textInferenceConfig": {
                    "temperature": 0.0, "maxTokens": 600,
                }},
                "promptTemplate": {"textPromptTemplate": (
                    "You are an HR assistant. Answer using ONLY the search results below. "
                    "If the answer is not present, say 'I don't have that policy on file.'\n\n"
                    "$search_results$\n\nQuestion: $query$"
                )},
            },
        },
    },
)
print(resp["output"]["text"])
  

8.3 Multi-turn Sessions

Pass sessionId from one call into the next so the KB carries chat context (it rewrites follow-up questions like "what about APAC?" into standalone queries before retrieving).


session_id = resp["sessionId"]
followup = runtime.retrieve_and_generate(
    input={"text": "What about APAC?"},
    sessionId=session_id,
    # same retrieveAndGenerateConfiguration dict as above; bind it to a
    # variable (e.g. rag_config) in 8.2 so it can be reused here
    retrieveAndGenerateConfiguration=rag_config,
)
print(followup["output"]["text"])
  


9. Citations and Grounding

Every retrieve_and_generate response includes a citations array that maps spans of the generated text to specific retrieved chunks. Surface these in the UI to let users verify the answer.


text = resp["output"]["text"]

for cite in resp.get("citations", []):
    span = cite["generatedResponsePart"]["textResponsePart"]["span"]
    quoted = text[span["start"]:span["end"] + 1]
    print(f"---\nCLAIM: {quoted}")
    for ref in cite["retrievedReferences"]:
        loc = ref["location"]
        kind = loc["type"]
        if kind == "S3":
            print(f"  source: {loc['s3Location']['uri']}")
        elif kind == "WEB":
            print(f"  source: {loc['webLocation']['url']}")
        print(f"  chunk:  {ref['content']['text'][:160]}...")
  

Citations are also the raw material for hallucination guardrails — wire them into a contextual-grounding guardrail (see Bedrock Guardrails) to block answers that drift from the cited context.
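`retrieve_and_generate` accepts a guardrail inside its `generationConfiguration`. A sketch, assuming a guardrail with a contextual-grounding policy already exists (the id and version below are placeholders):

```python
# Drop-in replacement for the generationConfiguration dict in section 8.2's call.
generation_config = {
    "guardrailConfiguration": {
        "guardrailId": "gr-abc123",   # placeholder: your guardrail's id
        "guardrailVersion": "1",      # placeholder: published version
    },
    "inferenceConfig": {"textInferenceConfig": {"temperature": 0.0, "maxTokens": 600}},
}
```

With the guardrail attached, answers whose grounding score falls below the policy threshold are blocked before they reach the caller.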


10. Advanced Parsing with FM-as-Parser

Default parsing extracts plain text — fine for prose but loses structure in slide decks, tables, and financial PDFs. Enable advanced parsing to use a foundation model to interpret each page as Markdown, preserving tables, headings, and figure captions.


agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="financial-reports-s3",
    dataSourceConfiguration={"type": "S3", "s3Configuration": {
        "bucketArn": "arn:aws:s3:::company-finance-docs",
    }},
    vectorIngestionConfiguration={
        "parsingConfiguration": {
            "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
            "bedrockFoundationModelConfiguration": {
                "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-opus-4-7",
                "parsingPrompt": {"parsingPromptText": (
                    "Convert each page to Markdown. Preserve tables as GitHub-flavored "
                    "Markdown tables. Render figures as '![figure: ]'."
                )},
            },
        },
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {"maxTokens": 500, "overlapPercentage": 15},
        },
    },
)
  

FM parsing costs more (one model call per page) and slows ingestion materially. Reserve it for documents where layout actually carries meaning — annual reports, scientific papers, regulatory filings.


11. When to Use a KB vs Roll Your Own

Knowledge Bases for Bedrock collapse most of the RAG plumbing into a managed service. The trade-off — as always with managed services — is lower flexibility on the retrieval pipeline. Start with the KB; reach for custom RAG only when an evaluation actually fails because of it.


12. Operational Tips


13. Cost Components


Common Interview Questions:

What is a Bedrock Knowledge Base and what does it manage for you?

A Knowledge Base is a managed RAG pipeline: it ingests documents from S3 (or web, Confluence, Salesforce, SharePoint), chunks them, embeds with a model like Titan or Cohere, writes vectors to a configured vector store, and exposes Retrieve and RetrieveAndGenerate APIs. AWS handles the ingestion job, retries, status tracking, and incremental sync — you bring the source bucket and pick the embedding model, chunking strategy, and vector store. It eliminates writing your own LangChain ingestion code.

How do you choose a vector store for a Knowledge Base?

OpenSearch Serverless is the default — fully managed, auto-scales, supports hybrid BM25 + vector, and integrates natively. Aurora PostgreSQL with pgvector is best when you already run Aurora and want SQL joins between vectors and operational data. Pinecone or MongoDB Atlas are options when those are your existing standard. Neptune Analytics fits when retrieval is graph-shaped. For most greenfield workloads, OpenSearch Serverless wins on operational simplicity; for tenant-isolated SaaS, one collection per tenant is usually safer than metadata filtering.

Compare fixed-size, hierarchical, and semantic chunking.

Fixed-size (e.g. 300 tokens with 60-token overlap) is cheapest and the right default for clean prose. Hierarchical chunking embeds both small child chunks (for retrieval precision) and larger parent chunks (returned for context) — better recall on long documents at roughly 2x ingestion cost. Semantic chunking splits at sentence-boundary embedding-similarity drops, preserving topical coherence — most expensive at ingest but best for mixed-topic documents like meeting transcripts or RFCs.

What is advanced parsing and when is it worth the cost?

Advanced parsing routes each page through a foundation model (Claude or Titan multimodal) that reads the page as an image and emits structured Markdown — preserving tables, multi-column layouts, equations, and figure captions that plain text extractors mangle. It costs one FM call per page so it can dwarf the embedding bill on large PDF corpora. Reserve it for layout-heavy documents (financial filings, scientific papers, scanned forms); for clean HTML or plain Markdown, the default parser is fine.

When do you call Retrieve vs. RetrieveAndGenerate?

Use Retrieve when you want raw chunks plus scores and need to compose your own prompt — for example to mix KB results with tool-call output, to apply a custom system prompt, or to use a model not supported by RetrieveAndGenerate. Use RetrieveAndGenerate for the standard "answer this with citations" path; AWS builds the prompt, calls the FM, and returns the answer with source attributions. RetrieveAndGenerate is fewer lines of code and gives you citations for free; Retrieve gives you control.

How do you keep a Knowledge Base fresh without re-embedding everything?

Use incremental sync — the ingestion job tracks document checksums in S3 and re-embeds only changed or new files; deleted files are removed from the vector store. Trigger sync on an EventBridge schedule (hourly/daily) or on S3 event notifications via Lambda for near-real-time updates. For high-velocity sources, partition the bucket by recency so each sync scans a smaller prefix. Monitor numberOfDocumentsFailed in the statistics block and the ingestion job duration in CloudWatch; failed documents silently degrade recall if unwatched.