LLM Engineering
Production patterns for building LLM-backed systems in 2026 — agents that actually work, structured output that actually parses, retrieval that actually retrieves, and inference stacks that actually scale. The pages below are notes from shipping these patterns in real systems, not vendor pitches.
The cluster covers six concrete areas: agent loops and the Model Context Protocol; reliable function calling and structured output across providers; evaluating RAG without fooling yourself; hybrid search and reranking; self-hosting open-weights models with vLLM; and orchestration frameworks (LangGraph, DSPy) — including when none of them are needed.
Pages in this Cluster
- Agents and MCP — ReAct loops, tool use, multi-agent orchestration, and the Model Context Protocol (MCP) for decoupling tools from agents (minimal tool loop sketched below).
- Function Calling and Structured Output — getting reliable JSON out of OpenAI, Anthropic, Bedrock, and Gemini using strict schemas, Pydantic, instructor, and constrained decoding (sketch below).
- RAG Evaluation with RAGAS — faithfulness, answer relevancy, context precision/recall; LLM-as-judge done correctly; offline vs online evaluation in CI (sketch below).
- Hybrid Search and Reranking — BM25 and dense retrieval fused with Reciprocal Rank Fusion, cross-encoder rerankers (bge, Cohere Rerank v3, ColBERT), and latency tradeoffs (RRF sketch below).
- vLLM and Quantization — PagedAttention, continuous batching, AWQ/GPTQ/FP8/GGUF, and how to pick a serving stack (vLLM, TGI, SGLang, llama.cpp); serving sketch below.
- LangGraph and DSPy — stateful graph orchestration with checkpointing vs declarative LM programs with prompt optimizers, and when neither is needed (graph sketch below).
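To make the bullets concrete, here are minimal, hedged sketches of the core technique behind each page. First, the agent loop: a single-tool version of the loop the Agents page builds on, written against the OpenAI Chat Completions tools API. The model name and the `get_weather` stub are illustrative; a ReAct-style agent is this loop plus reasoning traces in the prompt.

```python
# Minimal single-tool agent loop (OpenAI Chat Completions tools API).
# Assumes OPENAI_API_KEY is set; model name and tool are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    """Stub tool; a real implementation would call a weather API."""
    return json.dumps({"city": city, "forecast": "sunny", "temp_c": 21})

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Oslo?"}]
while True:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=TOOLS
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:        # model answered in plain text: done
        print(msg.content)
        break
    for call in msg.tool_calls:   # execute each requested tool, feed results back
        args = json.loads(call.function.arguments)
        if call.function.name == "get_weather":
            result = get_weather(**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
```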
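Structured output, sketched with Pydantic plus instructor: instructor patches the OpenAI client so that `response_model` returns a validated Pydantic object and retries on validation failure. The `Invoice` schema and model name are illustrative.

```python
# Schema-enforced extraction sketch: Pydantic + instructor over OpenAI.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Invoice(BaseModel):       # illustrative schema
    vendor: str
    total_usd: float
    line_items: list[str]

client = instructor.from_openai(OpenAI())

invoice = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,     # parse + validate, retry on failure
    messages=[{
        "role": "user",
        "content": "Extract: ACME bill, $120.50, 2 widgets, 1 gadget",
    }],
)
print(invoice.model_dump())     # guaranteed to match the Invoice schema
```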
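RAG evaluation, sketched against the 0.1-era RAGAS `evaluate` interface (the API has shifted across versions, so check the current docs). This assumes an LLM judge is configured via the usual provider environment variables; the sample row is illustrative.

```python
# Hedged RAGAS sketch: score one QA example on two LLM-as-judge metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_data = Dataset.from_dict({
    "question": ["What does RRF stand for?"],
    "answer":   ["Reciprocal Rank Fusion."],
    "contexts": [["RRF (Reciprocal Rank Fusion) merges ranked lists ..."]],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)   # per-metric scores for the dataset
```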
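Reciprocal Rank Fusion itself is a few lines: each ranked list contributes 1 / (k + rank) per document, with k = 60 from the original RRF paper. A self-contained sketch:

```python
# Reciprocal Rank Fusion: merge ranked lists by summing 1 / (k + rank).
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists (best first) into one combined ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["d3", "d1", "d7"]    # lexical ranking
dense_hits = ["d1", "d5", "d3"]    # vector ranking
print(rrf_fuse([bm25_hits, dense_hits]))   # d1 and d3 float to the top
```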
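Serving, sketched with vLLM's offline API: PagedAttention and continuous batching are on by default. The AWQ checkpoint named here is illustrative, and the `quantization` flag must match how the weights were actually quantized.

```python
# Minimal offline vLLM sketch with an AWQ-quantized model (illustrative id).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # assumed AWQ checkpoint
    quantization="awq",
)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```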
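And orchestration, sketched as a two-node LangGraph `StateGraph` with typed state and stubbed node logic, using the documented builder interface. Passing a checkpointer to `compile()` enables the persistence the page discusses.

```python
# Hedged LangGraph sketch: two-node StateGraph; node logic is a stub.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    answer: str

def draft(state: State) -> dict:
    # A real node would call an LLM here; stubbed for the sketch.
    return {"answer": f"Draft answer to: {state['question']}"}

def polish(state: State) -> dict:
    return {"answer": state["answer"].upper()}

builder = StateGraph(State)
builder.add_node("draft", draft)
builder.add_node("polish", polish)
builder.add_edge(START, "draft")
builder.add_edge("draft", "polish")
builder.add_edge("polish", END)

graph = builder.compile()   # pass a checkpointer here for persistence
print(graph.invoke({"question": "What is RRF?", "answer": ""}))
```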
Companion pages on the rest of the site: Amazon Bedrock, RAG, Vector Databases, Hugging Face.