Meta's Llama 4 lineup changes what's possible with open-source AI. Scout packs 109B parameters into a Mixture-of-Experts architecture with a 10 million token context window. Maverick scales to 402B parameters and competes with proprietary models like GPT-4.5 and Gemini 2.5 Pro on reasoning benchmarks.
Both models are free to use under Meta's license. The only barrier is compute — you need the right GPU hardware and the right serving stack to run them efficiently.
This guide walks through deploying Llama 4 Scout and Maverick on cloud GPUs using vLLM. We cover exact VRAM requirements, recommended GPU configurations, step-by-step deployment, and practical tips for getting the best throughput per dollar.
Llama 4 Model Specs
Before choosing hardware, understand what you're deploying.
| Spec | Scout (109B) | Maverick (402B) |
|---|---|---|
| Total Parameters | 109B | 402B |
| Active Parameters (per token) | 17B | 17B |
| Expert Count | 16 experts | 128 experts |
| Context Window | 10M tokens | 1M tokens |
| Architecture | MoE | MoE |
| FP16 Model Size | ~218 GB | ~804 GB |
| INT4 Quantized Size | ~55 GB | ~200 GB |
| 1.78-bit Quantized | ~34 GB | ~122 GB |
Both models use a Mixture-of-Experts architecture, meaning they have far more total parameters than active parameters per token. Scout activates 17B of its 109B parameters on each forward pass, and Maverick likewise activates 17B of its 402B total. This is why inference is much faster than the total parameter count would suggest.
The critical difference is memory: even though both activate 17B parameters per inference step, the full model weights must be loaded into GPU memory. That's where your GPU selection matters.
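As a quick cross-check on the table above, weight memory is roughly total parameters times bits per parameter. The sketch below reproduces the FP16 and INT4 figures; the 1.78-bit sizes don't follow this simple formula because those dynamic quants keep some layers at higher precision.

```python
# Rough weight-memory estimate: total parameters x bits per parameter.
# Ignores runtime overhead and mixed-precision layers, so treat the output as ballpark figures.
def weight_memory_gb(total_params_billion: float, bits_per_param: float) -> float:
    return total_params_billion * 1e9 * bits_per_param / 8 / 1e9  # decimal GB

for name, params in [("Scout", 109), ("Maverick", 402)]:
    for label, bits in [("FP16", 16), ("INT4", 4)]:
        print(f"{name} {label}: ~{weight_memory_gb(params, bits):.1f} GB")
```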
GPU Requirements
Llama 4 Scout (109B)
Minimum viable setup — 1x H100 80GB:
With 4-bit quantization (AWQ or GPTQ), Scout fits on a single H100 80GB. The quantized model uses approximately 55 GB of VRAM, leaving headroom for KV cache. This works for moderate context lengths (up to ~32K tokens) and low batch sizes.
Recommended setup — 2x H100 80GB or 1x H200 141GB:
For production inference with longer contexts (64K-128K tokens), you'll want more VRAM for the KV cache. Two H100s with tensor parallelism, or a single H200, gives you the headroom to handle concurrent requests without running out of memory.
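To see why longer contexts push you toward a second GPU, here is a rough KV-cache sketch. The layer count, KV-head count, and head dimension below are illustrative placeholders rather than Scout's published configuration, so substitute the values from the model's config.json if you want real numbers.

```python
# Per-token KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes per value.
# The architecture numbers below are illustrative placeholders, NOT Scout's actual config;
# read num_hidden_layers, num_key_value_heads, and head_dim from the model's config.json.
def kv_cache_gb(context_len: int, num_layers: int = 48, num_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len / 1e9

for ctx in (32_768, 65_536, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache per sequence")
```

At that rate, a single 128K-token sequence costs tens of GB on top of the quantized weights, which is exactly the headroom a second H100 or an H200 buys you; the FP8 KV-cache option covered under performance tuning halves these figures.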
Full precision — 4x H100 80GB:
If you need FP16 inference (no quantization), Scout requires roughly 218 GB of VRAM. Four H100s provide 320 GB total, leaving ample room for KV cache and batching.
Budget option — 1x RTX 4090 24GB:
Using aggressive 1.78-bit quantization (via Unsloth's dynamic quants), the ~34 GB model runs on a single 24 GB consumer GPU, with the portion that doesn't fit offloaded to system RAM, at approximately 20 tokens/sec. Viable for development and experimentation, not production.
Llama 4 Maverick (402B)
Minimum viable setup — 4x H100 80GB:
With INT4 quantization, Maverick requires approximately 200 GB of VRAM. Four H100s (320 GB total) handle this with room for moderate KV cache.
Recommended setup — 8x H100 80GB or 8x H200 141GB:
For production serving with 100K+ context windows, 8 GPUs provide the memory and bandwidth needed. This matches the configuration used in published benchmarks showing ~126 tokens/sec throughput.
Aggressive quantization — 2x RTX 4090 48GB total:
Using 1.78-bit quantization, Maverick compresses to ~122 GB and can run on 2x RTX 4090s (48 GB of combined VRAM), with the rest of the weights offloaded to system RAM, at approximately 40 tokens/sec. Quality degrades somewhat at this compression level, but it's functional for testing.
Deploying with vLLM: Step by Step
vLLM is the recommended serving engine for Llama 4 in production. It provides continuous batching, PagedAttention for efficient memory management, and an OpenAI-compatible API out of the box.
Step 1: Set Up Your GPU Server
Provision a GPU server with the appropriate hardware. On Spheron AI, you can spin up baremetal or VM GPU servers with H100, H200, or B300 GPUs — available as both Spot and Dedicated instances.
Once your server is running, SSH in and verify GPU access:
```bash
nvidia-smi
```
You should see your GPUs listed with CUDA 12.x drivers.
Step 2: Install vLLM
```bash
pip install vllm --upgrade
```
For Llama 4 specifically, ensure you're on vLLM 0.8+, which includes native Llama 4 MoE support.
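Optionally, you can double-check GPU visibility from Python as well. This is a minimal sketch that assumes PyTorch is importable, which the vLLM install takes care of:

```python
import torch

# Confirm that PyTorch (pulled in by vLLM) sees every GPU and reports its memory.
assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")
```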
Step 3: Serve Scout
With quantization (recommended for cost efficiency):
```bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --port 8000
```
Full precision on 4 GPUs:
```bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --port 8000
```
Step 4: Serve Maverick
With quantization on 4 GPUs:
```bash
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --quantization awq \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --port 8000
```
Full precision on 8 GPUs:
```bash
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --port 8000
```
Step 5: Send Requests
vLLM exposes an OpenAI-compatible API. Test it with curl:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "Explain mixture of experts in 3 sentences."}],
    "max_tokens": 256
  }'
```
Or use the OpenAI Python SDK:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
GPU Configuration and Cost Comparison
Here's what each configuration costs on cloud GPUs at current market rates:
| Configuration | Model | GPUs | Est. VRAM Used | Monthly Cost (24/7) |
|---|---|---|---|---|
| Scout INT4 | Scout 109B | 1x H100 | ~55 GB | ~$1,440 |
| Scout FP16 | Scout 109B | 4x H100 | ~218 GB | ~$5,760 |
| Scout INT4 | Scout 109B | 1x H200 | ~55 GB | ~$2,520 |
| Maverick INT4 | Maverick 402B | 4x H100 | ~200 GB | ~$5,760 |
| Maverick FP16 | Maverick 402B | 8x H100 | ~804 GB | ~$11,520 |
| Maverick FP16 | Maverick 402B | 8x H200 | ~804 GB | ~$20,160 |
*Costs based on H100 at ~$2.00/hr and H200 at ~$3.50/hr market rates. Spot instances reduce these by 30-50%.*
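To adapt the table to your own negotiated rates, the arithmetic is just hourly price times GPU count times hours in the billing period (a 30-day month here; your provider's invoice may differ slightly):

```python
# Monthly cost = hourly rate per GPU x number of GPUs x 24 hours x 30 days.
def monthly_cost(hourly_rate: float, num_gpus: int) -> float:
    return hourly_rate * num_gpus * 24 * 30

print(monthly_cost(2.00, 1))   # 1x H100 -> 1440.0
print(monthly_cost(2.00, 4))   # 4x H100 -> 5760.0
print(monthly_cost(3.50, 8))   # 8x H200 -> 20160.0
```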
The sweet spot for most teams: Scout with INT4 quantization on a single H100. You get a model with a 10M-token context window (though a single GPU's KV cache only lets you use a fraction of it at once) and quality that is competitive with Claude-class models on many tasks, for roughly $1,440/month, less than most teams' API budgets.
Performance Tuning Tips
Use FP8 KV cache. vLLM supports FP8 KV cache quantization, which halves the memory used by the attention cache without meaningful quality loss. This lets you serve longer contexts or handle more concurrent requests with the same hardware.
```bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --port 8000
```
Right-size your context window. Don't set --max-model-len to 10M just because Scout supports it. Each token in the context window consumes KV cache memory. If your application uses 32K contexts, set the limit to 32768 and save the VRAM for batching more concurrent requests.
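To make the trade-off concrete, here is a rough sketch of how many full-length sequences fit in a fixed KV-cache budget. The ~0.2 MB/token figure reuses the illustrative FP16 estimate from the earlier KV-cache sketch and the 40 GB budget is hypothetical, so plug in your own numbers; vLLM's PagedAttention allocates cache on demand, so this is a worst-case bound.

```python
# How many full-length sequences fit in a fixed KV-cache budget?
# 0.2 MB/token and the 40 GB budget are illustrative assumptions, not measured values.
# PagedAttention allocates KV cache on demand, so this is a worst case (every sequence at max length).
KV_BYTES_PER_TOKEN = 0.2e6
KV_BUDGET_BYTES = 40e9

for max_model_len in (32_768, 131_072):
    seqs = int(KV_BUDGET_BYTES // (KV_BYTES_PER_TOKEN * max_model_len))
    print(f"--max-model-len {max_model_len}: ~{seqs} max-length sequences fit in the budget")
```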
Use Spot instances for development. Iterating on prompts, testing quantization levels, and benchmarking throughput doesn't require dedicated instances. Use Spot GPUs at reduced rates for all non-production work. On Spheron AI, B300 Spot instances start at $2.90/hr — giving you Blackwell-class hardware for experimentation at a fraction of the dedicated cost.
Scout vs Maverick: Which Should You Deploy?
Deploy Scout if: you need the longest possible context (10M tokens), your workload is document processing or RAG over large corpora, you want the best cost efficiency (single GPU deployment), or you're running a moderate-traffic API.
Deploy Maverick if: you need the highest reasoning and coding performance, your workload involves complex multi-step tasks, you're comparing against GPT-4.5 or Gemini 2.5 Pro-class models, or you have the GPU budget for 4-8 GPU deployment.
For most teams starting out, Scout is the better choice. It's dramatically cheaper to deploy, offers a context window 10x larger than Maverick, and its quality is within striking distance of Maverick on most benchmarks. Start with Scout, validate your use case, and upgrade to Maverick only if you need the extra reasoning capability.