Tutorial

Deploy Llama 4 Scout and Maverick on GPU Cloud: Complete Guide with vLLM

Written by Spheron · Feb 20, 2026
Llama 4 · Llama 4 Scout · Llama 4 Maverick · vLLM · GPU Cloud · LLM Deployment · Open Source AI · Meta AI

Meta's Llama 4 lineup changes what's possible with open-source AI. Scout packs 109B parameters into a Mixture-of-Experts architecture with a 10 million token context window. Maverick scales to 402B parameters and matches proprietary models like GPT-4.5 and Gemini 2.5 Pro on reasoning benchmarks.

Both models are free to use under Meta's license. The only barrier is compute — you need the right GPU hardware and the right serving stack to run them efficiently.

This guide walks through deploying Llama 4 Scout and Maverick on cloud GPUs using vLLM. We cover exact VRAM requirements, recommended GPU configurations, step-by-step deployment, and practical tips for getting the best throughput per dollar.

Llama 4 Model Specs

Before choosing hardware, understand what you're deploying.

| Spec | Scout (109B) | Maverick (402B) |
|---|---|---|
| Total Parameters | 109B | 402B |
| Active Parameters (per token) | 17B | 17B |
| Expert Count | 16 experts | 128 experts |
| Context Window | 10M tokens | 1M tokens |
| Architecture | MoE | MoE |
| FP16 Model Size | ~218 GB | ~804 GB |
| INT4 Quantized Size | ~55 GB | ~200 GB |
| 1.78-bit Quantized Size | ~34 GB | ~122 GB |

Both models use Mixture-of-Experts, meaning they have far more total parameters than active parameters per token. Scout activates 17B of its 109B parameters on each forward pass, and Maverick activates the same 17B out of its 402B total. This is why inference is much faster than you'd expect from the total parameter count.

The critical difference is memory: even though both activate 17B parameters per inference step, the full model weights must be loaded into GPU memory. That's where your GPU selection matters.
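
The weight sizes in the spec table fall out of simple bytes-per-parameter arithmetic. Here's a minimal sketch of that math in Python; the figures are approximate and ignore overhead such as quantization scales or layers kept at higher precision:

python
# Rough weight-memory estimate: total parameters x bytes per parameter.
# Real checkpoints land a few GB off these figures because of quantization
# scales, embeddings, and other layers kept at higher precision.
BYTES_PER_PARAM = {"FP16": 2.0, "INT4": 0.5}

for name, total_params in [("Scout", 109e9), ("Maverick", 402e9)]:
    for dtype, bytes_per in BYTES_PER_PARAM.items():
        gb = total_params * bytes_per / 1e9
        print(f"{name} {dtype}: ~{gb:.0f} GB of weights")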

GPU Requirements

Llama 4 Scout (109B)

Minimum viable setup — 1x H100 80GB:

With 4-bit quantization (AWQ or GPTQ), Scout fits on a single H100 80GB. The quantized model uses approximately 55 GB of VRAM, leaving headroom for KV cache. This works for moderate context lengths (up to ~32K tokens) and low batch sizes.

Recommended setup — 2x H100 80GB or 1x H200 141GB:

For production inference with longer contexts (64K-128K tokens), you'll want more VRAM for the KV cache. Two H100s with tensor parallelism, or a single H200, gives you the headroom to handle concurrent requests without running out of memory.

Full precision — 4x H100 80GB:

If you need FP16 inference (no quantization), Scout requires roughly 218 GB of VRAM. Four H100s provide 320 GB total, leaving ample room for KV cache and batching.

Budget option — 1x RTX 4090 24GB:

Using aggressive 1.78-bit quantization (via Unsloth), Scout compresses to roughly 34 GB, so it runs on a single 24 GB consumer GPU with part of the weights offloaded to system RAM, at approximately 20 tokens/sec. Viable for development and experimentation, not production.

Llama 4 Maverick (402B)

Minimum viable setup — 4x H100 80GB:

With INT4 quantization, Maverick requires approximately 200 GB of VRAM. Four H100s (320 GB total) handle this with room for moderate KV cache.

Recommended setup — 8x H100 80GB or 8x H200 141GB:

For production serving with 100K+ context windows, 8 GPUs provide the memory and bandwidth needed. This matches the configuration used in published benchmarks showing ~126 tokens/sec throughput.

Aggressive quantization — 2x RTX 4090 48GB total:

Using 1.78-bit quantization, Maverick compresses to ~122 GB, which still exceeds the 48 GB of combined VRAM, so the overflow is offloaded to system RAM; it runs at approximately 40 tokens/sec. Quality degrades somewhat at this compression level, but it's functional for testing.

Deploying with vLLM: Step by Step

vLLM is the recommended serving engine for Llama 4 in production. It provides continuous batching, PagedAttention for efficient memory management, and an OpenAI-compatible API out of the box.

Step 1: Set Up Your GPU Server

Provision a GPU server with the appropriate hardware. On Spheron AI, you can spin up bare-metal or VM GPU servers with H100, H200, or B300 GPUs, available as both Spot and Dedicated instances.

Once your server is running, SSH in and verify GPU access:

bash
nvidia-smi

You should see your GPUs listed with CUDA 12.x drivers.
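
If PyTorch is already on the box (it ships as a dependency of vLLM in the next step), you can run the same check from Python. A small sketch that prints per-device VRAM so you can confirm the node has enough memory before pulling model weights:

python
import torch

# List every visible CUDA device and its total memory.
assert torch.cuda.is_available(), "No CUDA devices visible"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")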

Step 2: Install vLLM

bash
pip install vllm --upgrade

For Llama 4 specifically, ensure you're on vLLM 0.8.3 or later, the release that added native Llama 4 MoE support.
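
To double-check which version you ended up with before serving, a one-liner sketch:

python
import vllm

# Llama 4's MoE layers need a recent vLLM build; confirm before serving.
print(vllm.__version__)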

Step 3: Serve Scout

With quantization (recommended for cost efficiency):

bash
# Note: --quantization awq expects AWQ-quantized weights, so point this at a
# pre-quantized Scout checkpoint rather than the BF16 base repo.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --quantization awq \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --port 8000

Full precision on 4 GPUs:

bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --port 8000
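
Once the server logs show it's ready, you can confirm the model is loaded by querying the OpenAI-compatible /v1/models endpoint. A minimal sketch using the requests library, assuming the server is on port 8000 as above:

python
import requests

# vLLM's OpenAI-compatible server lists loaded models at /v1/models.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])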

Step 4: Serve Maverick

With quantization on 4 GPUs:

bash
# As with Scout, --quantization awq assumes AWQ-quantized Maverick weights;
# point this at a pre-quantized checkpoint rather than the BF16 base repo.
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
    --quantization awq \
    --tensor-parallel-size 4 \
    --max-model-len 65536 \
    --port 8000

Full precision on 8 GPUs (BF16 weights total ~804 GB, which exceeds the 640 GB on 8x H100, so run this on 8x H200, or serve Meta's FP8 build of Maverick if you're on H100s):

bash
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 131072 \
    --port 8000

Step 5: Send Requests

vLLM exposes an OpenAI-compatible API. Test it with curl:

bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [{"role": "user", "content": "Explain mixture of experts in 3 sentences."}],
        "max_tokens": 256
    }'

Or use the OpenAI Python SDK:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    max_tokens=512
)
print(response.choices[0].message.content)
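
For interactive workloads you'll usually want tokens as they're generated rather than waiting for the full completion. The same client supports the standard streaming flag; a short sketch reusing the client from above:

python
# Stream tokens as they are generated instead of waiting for the full response.
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()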

GPU Configuration and Cost Comparison

Here's what each configuration costs on cloud GPUs at current market rates:

| Configuration | Model | GPUs | Est. VRAM Used | Monthly Cost (24/7) |
|---|---|---|---|---|
| Scout INT4 | Scout 109B | 1x H100 | ~55 GB | ~$1,440 |
| Scout FP16 | Scout 109B | 4x H100 | ~218 GB | ~$5,760 |
| Scout INT4 | Scout 109B | 1x H200 | ~55 GB | ~$2,520 |
| Maverick INT4 | Maverick 402B | 4x H100 | ~200 GB | ~$5,760 |
| Maverick FP8 | Maverick 402B | 8x H100 | ~402 GB | ~$11,520 |
| Maverick FP16 | Maverick 402B | 8x H200 | ~804 GB | ~$20,160 |

*Costs based on H100 at ~$2.00/hr and H200 at ~$3.50/hr market rates. Spot instances reduce these by 30-50%.*
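
The monthly figures are simple to reproduce: hourly rate times GPU count times roughly 720 hours per month. A quick sketch using the assumed rates from the footnote:

python
# Reproduce the table's monthly estimates: hourly rate x GPU count x 720 hours
# (30 days of 24/7 usage). Rates are the assumed market rates from the footnote.
HOURS_PER_MONTH = 24 * 30
RATES = {"H100": 2.00, "H200": 3.50}  # USD per GPU-hour (assumed)

configs = [
    ("Scout INT4", "H100", 1),
    ("Scout FP16", "H100", 4),
    ("Scout INT4", "H200", 1),
    ("Maverick INT4", "H100", 4),
    ("Maverick FP8", "H100", 8),
    ("Maverick FP16", "H200", 8),
]

for name, gpu, count in configs:
    monthly = RATES[gpu] * count * HOURS_PER_MONTH
    print(f"{name:>14} on {count}x {gpu}: ~${monthly:,.0f}/month")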

The sweet spot for most teams: Scout with INT4 quantization on a single H100. You get a 10M-context model that is competitive with Claude-class models on many tasks, for roughly $1,440/month, less than most API budgets.

Performance Tuning Tips

Use FP8 KV cache. vLLM supports FP8 KV cache quantization, which halves the memory used by the attention cache without meaningful quality loss. This lets you serve longer contexts or handle more concurrent requests with the same hardware.

bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --quantization awq \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --port 8000

Right-size your context window. Don't set --max-model-len to 10M just because Scout supports it. Each token in the context window consumes KV cache memory. If your application uses 32K contexts, set the limit to 32768 and save the VRAM for batching more concurrent requests.
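
To put numbers on that, you can estimate KV cache growth from the model's attention geometry. The sketch below uses illustrative layer, KV-head, and head-dimension values rather than Scout's published config, so read the real numbers from the checkpoint's config.json before relying on it:

python
# Rough KV cache estimate: 2 (K and V) x layers x kv_heads x head_dim x bytes
# per token, times context length, times concurrent sequences.
def kv_cache_gb(num_layers, num_kv_heads, head_dim, context_len,
                batch_size=1, bytes_per_elem=2):  # 2 bytes = FP16, 1 = FP8
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size / 1e9

# Illustrative values only -- substitute the real ones from config.json.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (32_768, 131_072, 1_000_000):
    gb = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx:>9} tokens: ~{gb:.1f} GB of KV cache per sequence")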

Use Spot instances for development. Iterating on prompts, testing quantization levels, and benchmarking throughput doesn't require dedicated instances. Use Spot GPUs at reduced rates for all non-production work. On Spheron AI, B300 Spot instances start at $2.90/hr — giving you Blackwell-class hardware for experimentation at a fraction of the dedicated cost.

Scout vs Maverick: Which Should You Deploy?

Deploy Scout if: you need the longest possible context (10M tokens), your workload is document processing or RAG over large corpora, you want the best cost efficiency (single GPU deployment), or you're running a moderate-traffic API.

Deploy Maverick if: you need the highest reasoning and coding performance, your workload involves complex multi-step tasks, you're comparing against GPT-4.5 or Gemini 2.5 Pro-class models, or you have the GPU budget for 4-8 GPU deployment.

For most teams starting out, Scout is the better choice. It's dramatically cheaper to deploy, offers a context window 10x larger than Maverick, and its quality is within striking distance of Maverick on most benchmarks. Start with Scout, validate your use case, and upgrade to Maverick only if you need the extra reasoning capability.
