SageMaker JumpStart is the model hub built into SageMaker. It packages hundreds of pre-trained foundation and task-specific models — Llama 3.x, Mistral, Falcon, Stable Diffusion, BGE embeddings, plus traditional CV/NLP models — together with deployable inference containers and ready-to-run fine-tuning recipes. JumpStart is the fastest route from "I want to host an open-weights model on AWS" to "I have an HTTPS endpoint in my VPC." It's also where you go when Bedrock doesn't host the model you need.
JumpStart wraps three things into one feature:

- A versioned model catalog, where each entry carries a stable model_id and model_version.
- Pre-built, deployable inference containers you can stand up with one SDK call.
- Ready-to-run fine-tuning recipes with managed defaults.
Under the hood every JumpStart model is just a regular SageMaker model: you can swap the container, override the entry-point script, attach a private VPC, encrypt with KMS, and use the resulting endpoint exactly like any other SageMaker endpoint.
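Because JumpStartModel subclasses the SDK's generic Model, the resolved defaults are plain attributes you can inspect (and override) before deploying. A quick sketch, using attribute names from the SageMaker Python SDK; verify against your SDK version:

```python
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-3-1-8b-instruct")
print(model.image_uri)   # resolved inference container (a DJL/TGI/LMI image)
print(model.model_data)  # S3 location of the packaged weights
print(model.env)         # container environment variables you can override
```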
Representative families currently available (the catalog updates frequently — check the SageMaker Studio UI for the live list):
- Text generation: Llama 3.x, Mistral, Falcon, served on TGI/LMI containers.
- Embeddings: BGE, E5.
- Vision and image generation: Stable Diffusion XL, SAM; image-generation models use a diffusers-based handler.

Each entry exposes a stable model_id (e.g. meta-textgeneration-llama-3-1-8b-instruct) plus a model_version. Pin both in production code; "*" for the version is fine for prototyping but will eventually drift.
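You can also enumerate the live catalog from code. A sketch using the SDK's notebook utility (return format may vary by SDK version):

```python
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Returns the catalog's model IDs; filter with a plain substring match.
llama_ids = [m for m in list_jumpstart_models() if "llama-3" in m]
print(llama_ids[:10])
```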
The high-level JumpStartModel class encapsulates container selection, environment variables, and endpoint config. A minimal deploy takes only a few lines:
```python
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    model_version="2.*",
)
predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,  # required for gated models like Llama
    endpoint_name="llama3-1-8b-instruct",
)

resp = predictor.predict({
    "inputs": [[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize CAP theorem in two sentences."},
    ]],
    "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9},
})
print(resp[0]["generated_text"])
```
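Production callers usually go through the low-level runtime API rather than holding a Predictor object. The equivalent call with boto3 (the payload schema mirrors the predict call above):

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")
out = smr.invoke_endpoint(
    EndpointName="llama3-1-8b-instruct",
    ContentType="application/json",
    Body=json.dumps({
        "inputs": [[{"role": "user", "content": "Ping?"}]],
        "parameters": {"max_new_tokens": 32},
    }),
)
print(json.loads(out["Body"].read()))
```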
To put the endpoint in a private subnet and encrypt its storage volume with a customer-managed KMS key, configure the model and the deploy call. In the Python SDK, vpc_config and enable_network_isolation are model-constructor arguments, not deploy arguments:

```python
model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    model_version="2.*",
    vpc_config={
        "Subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "SecurityGroupIds": ["sg-cccc3333"],
    },
    enable_network_isolation=True,  # container has no internet egress at all
)
predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,
    endpoint_name="llama3-1-8b-instruct-vpc",
    kms_key="arn:aws:kms:us-west-2:111111111111:key/abcd-1234",
)
```
enable_network_isolation=True is the right default for any model that doesn't need to phone home — once the endpoint is running, no outbound traffic leaves the container.
For long generations (Stable Diffusion video, 30k-token summaries), an Async endpoint writes the response to S3 and returns immediately. For bursty low-volume usage on smaller models, a Serverless endpoint cold-starts on demand and bills per millisecond of compute.
```python
from sagemaker.async_inference import AsyncInferenceConfig

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-async-out/llama-results/",
        max_concurrent_invocations_per_instance=4,
    ),
)
```
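The serverless variant looks similar. A sketch for a small, CPU-servable model; serverless endpoints have no GPUs, so this pattern does not fit the Llama deploy above, and the BGE model ID here is illustrative:

```python
from sagemaker.serverless import ServerlessInferenceConfig
from sagemaker.jumpstart.model import JumpStartModel

# Illustrative model ID for a small embedder; check the catalog for the real one.
embedder = JumpStartModel(model_id="huggingface-sentencesimilarity-bge-base-en-v1-5")
predictor = embedder.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=6144,  # 1024–6144, in 1 GB steps
        max_concurrency=8,
    ),
)
```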
The right instance is mostly a function of model size, target throughput, and budget. JumpStart populates a default; override it deliberately.
- g5.2xlarge for 7B FP16.
- g5.12xlarge (4x A10G) for 70B with quantization or tensor parallelism.
- g6e.12xlarge (4x L40S) is a sweet spot for 70B inference at lower cost than g5.48xlarge.

Rule of thumb for inference sizing: a model in FP16 needs ~2 bytes per parameter (so 7B ~ 14 GB, 70B ~ 140 GB), plus 20–40% headroom for the KV cache at your target context length. Quantization (INT8, AWQ, GPTQ) cuts that by 2–4x at a measurable but usually acceptable quality cost.
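The rule of thumb is simple enough to script. A back-of-envelope helper; the headroom fraction is an assumption to tune against your actual context length:

```python
def est_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                kv_headroom: float = 0.3) -> float:
    """Weights plus KV-cache headroom, in GB."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + kv_headroom)

print(f"{est_vram_gb(7):.0f} GB")        # ~18 GB: one 24 GB A10G (g5.2xlarge)
print(f"{est_vram_gb(70):.0f} GB")       # ~182 GB: multi-GPU territory in FP16
print(f"{est_vram_gb(70, 0.5):.0f} GB")  # ~46 GB at int4: fits 4x A10G (g5.12xlarge)
```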
JumpStart ships fine-tuning recipes for most LLMs in the catalog. QLoRA — 4-bit quantized base weights plus trainable LoRA adapters — is the default and is what you want for cost-effective domain adaptation on a single g5/g6 instance.
```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    model_version="2.*",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    environment={"accept_eula": "true"},
    hyperparameters={
        "epoch": "3",
        "learning_rate": "1e-4",
        "per_device_train_batch_size": "2",
        "gradient_accumulation_steps": "4",
        "max_input_length": "2048",
        # PEFT / QLoRA knobs
        "instruction_tuned": "True",
        "int8_quantization": "False",
        "enable_fsdp": "False",
        "peft_type": "lora",
        "lora_r": "16",
        "lora_alpha": "32",
        "lora_dropout": "0.05",
        "load_in_4bit": "True",
    },
)
estimator.fit({"training": "s3://my-bucket/jumpstart-train/instruction-data/"})

# Deploy the fine-tuned model
predictor = estimator.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    accept_eula=True,
)
```
Most JumpStart instruction-tuning recipes expect a train.jsonl in S3, one example per line, with an instruction-style schema:
{"instruction": "Summarize the following ticket.", "context": "Customer reports that...", "response": "Customer reports a billing error..."}
{"instruction": "Classify the sentiment.", "context": "I love the new release.", "response": "positive"}
The recipe wraps each row in the model's expected chat template (Llama 3 instruct, Mistral instruct, etc.) before tokenization. A template.json sidecar in the same S3 prefix can override the wrapping.
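A sketch of what that sidecar can look like; the prompt/completion key names and {placeholder} syntax follow the JumpStart Llama recipes, so check your model's documentation for the exact schema:

```json
{
    "prompt": "Below is an instruction, paired with context. Write a response.\n\nInstruction: {instruction}\n\nContext: {context}\n\nResponse:",
    "completion": " {response}"
}
```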
For 70B fine-tuning on a single node, set enable_fsdp=True and pick a multi-GPU instance (g5.48xlarge, p4d.24xlarge, p5.48xlarge). For multi-node, set instance_count>1 — the recipe uses NCCL over EFA on supported instances.
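A hypothetical two-node sketch for a 70B variant, reusing the hyperparameter names from the single-node example above (the model ID and counts are illustrative):

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-70b-instruct",  # illustrative ID
    model_version="2.*",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,  # multi-node: NCCL over EFA on supported instances
    environment={"accept_eula": "true"},
    hyperparameters={"enable_fsdp": "True", "peft_type": "lora"},
)
```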
| Dimension | JumpStart Endpoint | Bedrock Provisioned Throughput |
|---|---|---|
| Pricing unit | Per instance-hour, regardless of utilization | Per Model Unit (MU) per hour, with 1- or 6-month commit options |
| Model selection | Anything in the JumpStart catalog or your own container | Only models offered on Bedrock (Claude, Llama, Mistral, Cohere, Titan, Nova) |
| Custom weights | Yes — fine-tuned, distilled, or fully custom | Only Bedrock-fine-tuned variants; you can't bring arbitrary weights |
| Networking | VPC-attached, network-isolated, KMS — full control | VPC endpoint (PrivateLink) into Bedrock; model itself runs in AWS-managed VPC |
| Scaling | SageMaker autoscaling on instance count | Scale by purchasing additional Model Units |
| Ops surface | SageMaker endpoint config, IAM, CW metrics, traffic shifting | Bedrock APIs only — no instance to size or patch |
| Cold start | None (instances always on) | None (provisioned units always warm) |
Both options give you predictable throughput; the choice is mostly about which model and how much control. JumpStart hands you the keys to the instance; Bedrock PT never shows you the hardware at all.
The flip side: choose Bedrock when you want a serverless, pay-per-token API with no instances to manage, when you need Knowledge Bases / Agents / Guardrails out of the box, or when the model you want is on Bedrock and you don't need to fine-tune beyond what Bedrock supports. A common pattern is Bedrock for the chat surface (Claude on-demand) and JumpStart endpoints for the supporting models (a BGE embedder, a reranker, a vision model) that Bedrock doesn't host.
Before pinning a JumpStart model into a production endpoint, run it through SageMaker Clarify FM evaluation or your own held-out test set. The goal is to baseline accuracy, latency, and cost on your actual workload — not on the vendor's marketing benchmarks.
```python
import os
import time

# Simple latency + token-throughput probe
prompts = [open(f"prompts/{p}").read() for p in os.listdir("prompts")]

start = time.time()
total_out_tokens = 0
for p in prompts:
    resp = predictor.predict({
        "inputs": [[{"role": "user", "content": p}]],
        "parameters": {"max_new_tokens": 256, "temperature": 0.0},
    })
    # Whether a usage/completion_tokens field is present depends on the serving container.
    total_out_tokens += resp[0].get("usage", {}).get("completion_tokens", 0)
elapsed = time.time() - start

print(f"throughput: {total_out_tokens/elapsed:.1f} tok/s, "
      f"per-request avg: {elapsed/len(prompts)*1000:.0f} ms")
```
Pair this with a quality eval — golden-answer comparison, LLM-as-judge against Claude on Bedrock, or domain-specific metrics — and you have the three numbers that matter: quality, latency, and dollars per million tokens. Re-run on every new JumpStart version before promoting.
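For the quality leg, even a crude golden-answer check catches regressions between versions. A minimal sketch, assuming a hypothetical golden.jsonl with {"prompt": ..., "expected": ...} rows:

```python
import json

# Hypothetical eval set: one {"prompt": ..., "expected": ...} object per line.
rows = [json.loads(line) for line in open("golden.jsonl")]

hits = 0
for row in rows:
    resp = predictor.predict({
        "inputs": [[{"role": "user", "content": row["prompt"]}]],
        "parameters": {"max_new_tokens": 128, "temperature": 0.0},
    })
    # Substring match is crude; swap in your domain metric or an LLM judge.
    if row["expected"].strip().lower() in resp[0]["generated_text"].lower():
        hits += 1
print(f"golden-answer hit rate: {hits / len(rows):.1%}")
```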
- Pin model_version. JumpStart catalog entries get new versions whenever the upstream weights or container update. Production callers should pin to a specific version ("2.0.4"), not "*".
- Autoscale on SageMakerVariantInvocationsPerInstance. The default static instance count over-provisions during off hours; autoscaling typically cuts the bill 30–50%.
- Remember accept_eula=True. Forgetting this in CI/CD is a common deploy failure — surface a clear error early.

JumpStart is SageMaker's catalog of pre-trained models — open-source LLMs (Llama, Mistral, Falcon), embeddings (BGE, E5), vision models (Stable Diffusion XL, SAM), and classical ML — that you can deploy to a SageMaker endpoint or fine-tune with one SDK call. It bundles the weights, an optimized inference container (DJL, TGI, or LMI), example notebooks, and recommended instance types. Think of it as a managed model zoo wired into the SageMaker control plane.
Choose JumpStart when you need a model Bedrock doesn't host (Falcon, BGE embeddings, Stable Diffusion XL, smaller Llama variants), when you need full fine-tuning beyond Bedrock's recipes, when you want the model running inside your own VPC on instances you manage, or when you need to swap in a custom serving container. Choose Bedrock when the model is already in its catalog and you don't want to manage GPUs, autoscaling, or container patching. JumpStart is more control and more ops; Bedrock is less of both.
Most JumpStart LLMs support instruction fine-tuning and parameter-efficient fine-tuning via LoRA/QLoRA out of the box — you point the SDK at a training dataset in S3 (JSONL with instruction and response fields), pick the base model, and JumpStart launches a managed training job with sane defaults for learning rate, epochs, and LoRA rank. Domain adaptation (continued pretraining on raw text) is supported for some base models. Outputs land in S3 as a model artifact you then deploy to an endpoint.
Match VRAM to model size at the chosen quantization: 7-8B in fp16 fits on ml.g5.2xlarge (24 GB A10G); 13B in fp16 needs ml.g5.12xlarge (4x A10G) or quantize to int4 on a single g5; 70B in fp16 needs ml.p4d.24xlarge (8x A100) or AWQ/int4 on ml.g5.48xlarge. For embeddings and small classifiers, ml.g5.xlarge is plenty. For cost-sensitive serving of supported architectures, use Inferentia2 (ml.inf2) — JumpStart publishes pre-compiled Neuron variants. Always validate with SageMaker Inference Recommender before committing.
Pin model_version so deploys are reproducible and not silently re-pulling new weights. Enable autoscaling on SageMakerVariantInvocationsPerInstance with a CloudWatch target — the default static instance count over-provisions off hours and typically wastes 30–50%. For batchy or async workloads, use Async Inference (queues to S3, scales to zero) instead of Real-Time. Compile to Inferentia where the model supports it. Monitor token-per-dollar against an equivalent Bedrock on-demand call — if Bedrock is cheaper at your traffic volume, route there.
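Wiring up that autoscaling is a standard Application Auto Scaling call; a sketch (endpoint name, variant, and target value are illustrative):

```python
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/llama3-1-8b-instruct/variant/AllTraffic"

# Register the variant's instance count as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)
# Target-track on invocations per instance.
aas.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,  # tune to measured per-instance throughput
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```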
JumpStart wins when (a) the model is open-source and not in Bedrock, (b) you want LoRA fine-tuning with managed defaults rather than authoring a full training script, and (c) you need VPC-only deployment with KMS-encrypted endpoints under your own IAM. It beats a hand-rolled training pipeline because the container, distributed-training config, checkpointing, and serving stack are pre-built. It loses to a custom pipeline when you need exotic training tricks (custom optimizers, model parallelism beyond what's pre-wired, novel architectures) or when you need a non-standard inference framework. The sweet spot is fine-tuning and serving popular open models without operating the platform.