DeepSeek V3.2 Speciale is the most capable open-source reasoning model available today. It surpasses GPT-5 on math benchmarks, matches Gemini 3.0 Pro on reasoning tasks, and achieved gold-medal level performance in the 2025 International Mathematical Olympiad. The model features DeepSeek Sparse Attention (DSA), a mechanism that dramatically reduces compute for long-context inputs.
The catch is hardware requirements. This is a 671B parameter Mixture-of-Experts model. You can't run it on a single GPU, and you need to get the configuration right to achieve reasonable throughput. Most deployment guides either skip the practical details or assume you have unlimited budget.
This guide gives you the exact hardware specs, vLLM configuration, and deployment steps — from the minimum setup that actually works to the production configuration that handles real traffic.
Model Specifications
| Spec | DeepSeek V3.2 Speciale |
|---|---|
| Total Parameters | 671B |
| Active Parameters | 37B per token |
| Architecture | Mixture of Experts (MoE) |
| Context Window | 128K tokens |
| FP16 Model Size | ~1.2 TB |
| FP8 Model Size | ~600 GB |
| Key Feature | DeepSeek Sparse Attention (DSA) |
| Supported Hardware | Hopper (H100/H200) and Blackwell (B200/B300) only |
That last line is important: only Hopper and Blackwell datacenter GPUs are supported. Consumer GPUs (RTX 4090, RTX 5090) and older datacenter GPUs (A100) won't work with the official model weights. The model relies on FP8 tensor core operations that require Hopper-class hardware or newer.
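Before committing to a deployment, it's worth confirming the GPU class up front. A minimal check, not specific to this model (the compute_cap query field needs a reasonably recent NVIDIA driver):
# Hopper (H100/H200) reports compute capability 9.0; datacenter Blackwell (B200/B300) reports 10.x
# Ampere (A100) reports 8.0 and won't run the official FP8 weights
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv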
GPU Hardware Requirements
Minimum Viable: 8x H100 80GB (640 GB total)
This is the floor for running V3.2 Speciale. With FP8 quantization, the model fits across 8x H100s using tensor parallelism. You'll have limited headroom for KV cache, which means shorter effective context lengths (32K-64K) and smaller batch sizes.
Configuration: tensor-parallel-size 8, FP8 precision, max context ~64K tokens.
Recommended: 8x H200 141GB (1.13 TB total)
The H200's 141 GB per GPU gives you 1.13 TB total — enough for the FP8 model weights plus generous KV cache for full 128K context. This is what you want for production inference with long-context workloads.
Configuration: tensor-parallel-size 8, FP8 precision, max context 128K tokens.
Optimal: 8x B300 288GB (2.3 TB total)
On Blackwell hardware, V3.2 Speciale runs with maximum headroom. The B300's 288 GB per GPU gives you 2.3 TB total — enough for FP16 weights with room to spare, or FP8 weights with massive KV cache for high-concurrency serving.
Configuration: tensor-parallel-size 8, FP8 or FP16 precision, max context 128K tokens with high batch sizes.
What Won't Work
- 4x H100 80GB (320 GB): Not enough VRAM for the full model even in FP8. The model weights alone require ~600 GB.
- Any number of A100s: Hopper-class hardware is required. The model's FP8 operations are incompatible with A100 tensor cores.
- Consumer GPUs: RTX 4090/5090 top out at 24-32 GB each, nowhere near the ~600 GB the weights require, and the model's optimized FP8/DSA kernels target datacenter Hopper and Blackwell parts.
Step-by-Step Deployment with vLLM
Prerequisites
Provision a GPU server with 8x H100, H200, or B300 GPUs. On Spheron AI, you can get H100 servers as Spot instances starting at $1.49/hr per GPU, H200 Dedicated instances, or B300 GPUs starting at $2.90/hr Spot.
SSH into your server and verify the GPU setup:
nvidia-smi
# Verify 8 GPUs visible, CUDA 12.x, driver 535+
Install Dependencies
pip install vllm --upgrade
# DeepGEMM is required for MoE computation
pip install git+https://github.com/deepseek-ai/DeepGEMM
DeepGEMM provides optimized kernels for the MoE layers. Without it, inference falls back to slower generic implementations.
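A quick sanity check that the kernels are importable; this assumes the package installs under the module name deep_gemm, which is what the upstream repo uses at the time of writing:
# Should exit silently; an ImportError means the DeepGEMM install didn't take
python -c "import deep_gemm"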
Download Model Weights
The model weights are approximately 600 GB (FP8). Download from Hugging Face:
huggingface-cli download deepseek-ai/DeepSeek-V3.2-Speciale \
--local-dir /data/models/deepseek-v3.2-speciale
This takes 30-60 minutes depending on your network bandwidth. Use a persistent storage volume so you don't re-download on instance restarts.
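Two optional precautions, neither specific to this model: check free disk space before starting, and enable hf_transfer (a Hugging Face Hub extra) if your link is fast enough that the default downloader becomes the bottleneck:
# Make sure the volume has comfortably more than 600 GB free
df -h /data
# Optional: hf_transfer speeds up downloads on high-bandwidth links
pip install -U hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download deepseek-ai/DeepSeek-V3.2-Speciale \
--local-dir /data/models/deepseek-v3.2-speciale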
Launch vLLM Server
Standard deployment on 8x H100:
vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--max-model-len 65536 \
--port 8000
Production deployment with FP8 KV cache:
vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--kv-cache-dtype fp8 \
--max-model-len 131072 \
--port 8000
Using FP8 KV cache halves the memory consumed by the attention cache, letting you either serve longer contexts or handle more concurrent requests.
Optimized deployment with Expert Parallelism:
For best throughput, DeepSeek recommends running with Expert Parallelism (EP) rather than pure Tensor Parallelism:
# EP mode: better throughput for MoE models
VLLM_USE_DEEP_GEMM=1 vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
--data-parallel-size 8 \
--enable-expert-parallel \
--gpu-memory-utilization 0.9 \
--max-model-len 131072 \
--port 8000
Note: if you encounter stability issues with EP mode, fall back to plain tensor parallelism (TP=8 without --enable-expert-parallel), which is more robust but slightly slower.
Test the API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V3.2-Speciale",
"messages": [
{"role": "user", "content": "Prove that the square root of 2 is irrational."}
],
"max_tokens": 1024,
"temperature": 0.6
}'
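If the request hangs or errors, two quick checks against vLLM's standard endpoints help narrow things down:
# Server liveness
curl http://localhost:8000/health
# Confirm the exact model name the server registered (it must match the "model" field above)
curl http://localhost:8000/v1/models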
Cost Analysis
Running V3.2 Speciale full-time isn't cheap, but at sustained volume it's dramatically cheaper than API pricing for equivalent capability.
| Configuration | GPUs | Hourly Cost | Monthly (24/7) | Cost per 1M tokens |
|---|---|---|---|---|
| 8x H100 (Spot) | 8x H100 | ~$12.00/hr | ~$8,640 | ~$0.30 |
| 8x H100 (Dedicated) | 8x H100 | ~$16.00/hr | ~$11,520 | ~$0.40 |
| 8x H200 (Dedicated) | 8x H200 | ~$28.00/hr | ~$20,160 | ~$0.25 |
| 8x B300 (Spot) | 8x B300 | ~$23.20/hr | ~$16,704 | ~$0.10 |
Compare this to API pricing for equivalent-capability models: GPT-5 at $3-15 per million tokens, Claude Opus at $15-75 per million tokens. The break-even volume is simply the fixed monthly cluster cost divided by the API price you'd otherwise pay: against Claude Opus-level output pricing, the ~$8,640/month H100 Spot cluster pays for itself at a bit over 100M tokens per month; against a $15 per million token API, you need roughly 575M tokens per month before self-hosting wins.
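As a rough break-even sketch (fixed monthly cluster cost divided by the API's per-million-token price; the figures are the estimates from the table above, not measurements):
CLUSTER_MONTHLY=8640   # 8x H100 Spot, ~$12/hr x 720 hrs
API_PER_M=15           # example API price per 1M tokens
echo "Break-even: $((CLUSTER_MONTHLY / API_PER_M))M tokens/month"   # 576 at $15/M, ~115 at $75/M, 2880 at $3/M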
Performance Tuning
Context length vs throughput tradeoff. Every doubling of max context length halves your maximum concurrent batch size (roughly). If your application doesn't need 128K context, set --max-model-len lower and use the freed VRAM for higher concurrency.
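For example, if your requests rarely exceed ~32K tokens, a configuration along these lines trades context headroom for batch size (--max-num-seqs caps concurrent sequences; the value here is illustrative, not a tuned recommendation):
vllm serve deepseek-ai/DeepSeek-V3.2-Speciale \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--max-num-seqs 256 \
--port 8000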
DeepGEMM tuning. Some users report better performance with VLLM_USE_DEEP_GEMM=0 on certain GPU configurations (particularly H20s). Benchmark with and without to find what works for your hardware.
Temperature for reasoning. DeepSeek recommends temperature 0.6 for reasoning tasks. Lower temperatures (0.1-0.3) produce more deterministic but sometimes less thorough reasoning chains.
Persistent storage matters. The model is ~600 GB. Re-downloading on every instance restart wastes time and bandwidth. Always use persistent network-attached storage, and pre-download the weights before switching to Spot instances to avoid interruption during the download.
When to Use V3.2 Speciale vs Other Models
Use V3.2 Speciale when: your workload is math-heavy reasoning, complex code generation, scientific analysis, or any task where chain-of-thought reasoning quality matters more than speed. It's the best open-source model for tasks that require multi-step logical reasoning.
Use Llama 4 Scout instead when: you need ultra-long context (10M tokens vs 128K), lower deployment cost (single GPU vs 8 GPUs), or faster inference speed on simpler tasks.
Use Llama 4 Maverick instead when: you need strong general-purpose performance with better cost efficiency than V3.2 Speciale (Maverick uses 402B total but only 17B active, vs Speciale's 671B total / 37B active).
The open-source model landscape in 2026 gives you real choices. V3.2 Speciale is the reasoning champion. Llama 4 Scout is the context-length and efficiency champion. Pick based on your workload, not hype.