Spheron GPU Catalog

Rent NVIDIA RTX PRO 6000 GPUs on Demand from $0.59/hr

96GB GDDR7 ECC Blackwell, built to run 70B FP8 LLMs on a single GPU.

At a glance

You can rent an NVIDIA RTX PRO 6000 Blackwell on Spheron from $0.59 per GPU per hour on dedicated capacity (99.99% SLA, non-interruptible), with spot pricing cheaper still. Billing is per-minute with no long-term contracts, and instances deploy in under 2 minutes across data center partners in multiple regions. Each card ships with 96GB GDDR7 ECC, 1.79 TB/s memory bandwidth, 24,064 CUDA cores, and 5th-generation Tensor Cores with native FP4 support, giving you the largest single-GPU VRAM available outside HBM datacenter SKUs. It is a strong fit for teams that need to run 30B-70B LLMs at FP8 on a single GPU, fine-tune medium-sized models with LoRA, or handle professional rendering and visualization workloads without stepping up to H100 pricing.
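The claim that a 70B model fits at FP8 with KV cache headroom follows from simple arithmetic. A back-of-envelope sizing sketch in shell, using assumed figures (70B parameters; the Llama 70B GQA layout of 80 layers, 8 KV heads, head dimension 128):

```shell
# Back-of-envelope VRAM budget for a 70B model at FP8 (assumed layout)
PARAMS_B=70        # billions of parameters
BYTES_PER_PARAM=1  # FP8 stores one byte per weight
VRAM_GB=96         # RTX PRO 6000 capacity

WEIGHTS_GB=$(( PARAMS_B * BYTES_PER_PARAM ))
HEADROOM_GB=$(( VRAM_GB - WEIGHTS_GB ))

# KV cache bytes per token: layers * 2 (K and V) * kv_heads * head_dim * 1 byte (FP8)
KV_BYTES_PER_TOKEN=$(( 80 * 2 * 8 * 128 ))
KV_TOKENS=$(( HEADROOM_GB * 1000000000 / KV_BYTES_PER_TOKEN ))

echo "weights ~${WEIGHTS_GB} GB, headroom ~${HEADROOM_GB} GB"
echo "KV cache capacity ~${KV_TOKENS} tokens at FP8"
```

Roughly 26GB of headroom works out to on the order of 160K tokens of FP8 KV cache before accounting for activations and framework overhead, which is why moderate batch sizes fit comfortably on one card.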

GPU Architecture: NVIDIA Blackwell
VRAM: 96 GB GDDR7 ECC
Memory Bandwidth: 1.79 TB/s

Technical specifications

GPU Architecture
NVIDIA Blackwell
VRAM
96 GB GDDR7 ECC
Memory Bandwidth
1.79 TB/s
Tensor Cores
5th Gen (752 cores)
CUDA Cores
24,064
RT Cores
4th Gen (188 cores)
FP32 Performance
126 TFLOPS
FP16 Tensor (dense)
504 TFLOPS
FP8 Tensor (dense)
1,008 TFLOPS
FP4 Tensor (dense)
2,016 TFLOPS
Form Factor
Workstation (dual-slot PCIe)
Interconnect
PCIe Gen5 x16
NVLink
Not supported
TDP
600W (Max-Q: 300W)
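Once an instance is running, you can confirm these specs from the shell with `nvidia-smi`, the standard NVIDIA driver utility; the query fields below are part of its documented CSV interface:

```shell
# Report GPU model, total VRAM, power limit, and max PCIe generation in CSV form
nvidia-smi --query-gpu=name,memory.total,power.limit,pcie.link.gen.max \
  --format=csv

# Full per-GPU driver and CUDA runtime summary
nvidia-smi
```

On an RTX PRO 6000 instance you should see the name, 96GB class memory total, and PCIe Gen5 reported; the power limit will reflect whether the host runs the 600W or Max-Q 300W configuration.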

Pricing comparison

Provider | Price/hr | Savings vs Spheron
Spheron (your price) | $0.59/hr | -
Vast.ai | $1.00/hr | 1.7x more expensive
Hyperstack | $1.80/hr | 3.1x more expensive
RunPod | $1.69/hr | 2.9x more expensive
CoreWeave | $2.50/hr | 4.2x more expensive
Custom & Reserved

Need More RTX PRO 6000 Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more RTX PRO 6000 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the RTX PRO 6000

Scenario 01

Pick RTX PRO 6000 Blackwell if

You want to run 30B-70B LLMs at FP8 on a single GPU without paying H100 rates. 96GB GDDR7 lets Llama 3.3 70B FP8, Qwen 2.5 32B FP16, and 70B AWQ models fit comfortably with KV cache headroom. Best single-GPU VRAM capacity below the H100/H200 price tier.

Recommended fit
Scenario 02

Pick RTX 5090 instead if

Your models fit in 32GB and you want the cheapest Blackwell hourly rate. RTX 5090 matches PRO 6000 on memory bandwidth (1.79 TB/s) and FP4 support, but lacks ECC and caps out at 32GB. Great for 7B-13B inference, SDXL, and Flux.

Recommended fit
Scenario 03

Pick L40S instead if

You need a datacenter-certified SKU with 48GB ECC and long-term multi-tenant support, and you don't need Blackwell FP4. L40S is purpose-built for inference serving and is widely available across hyperscalers.

Recommended fit
Scenario 04

Pick H100 or B200 instead if

You need HBM bandwidth (3.35-8 TB/s) and NVLink for multi-GPU tensor parallelism on 100B+ models. PCIe PRO 6000 has no NVLink, so scale-out is limited to data parallelism. For trillion-parameter training, go B200.

Recommended fit

Ideal use cases

Use case / 01
🎨

Professional Rendering

Leverage 4th generation RT Cores and Blackwell architecture for real-time ray tracing, CAD/CAM workflows, and digital content creation.

- Real-time ray tracing for architectural visualization
- CAD/CAM design and engineering workflows
- Digital content creation and VFX pipelines
- Product design and photorealistic rendering
Use case / 02
🧠

AI Development & Fine-Tuning

Perfect for fine-tuning 7B-32B models and running 70B FP8 on a single GPU with 96GB of GDDR7 ECC memory.

- LoRA and QLoRA fine-tuning of 7B-32B models
- Llama 3.3 70B FP8 and 70B AWQ inference
- Qwen 2.5 32B FP16 fine-tuning with headroom for KV cache
- Transfer learning and domain adaptation
Use case / 03

AI Inference

Cost-effective inference for 30B-70B models on a single GPU, with FP4 and FP8 Tensor Core acceleration.

- Llama 3.3 70B FP8 and 70B AWQ on a single GPU
- Real-time image generation and diffusion models
- Production inference APIs with dynamic batching
- Edge AI and embedded deployment testing
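The dynamic-batching point is easy to demonstrate once a vLLM endpoint is up (the URL and model name here assume the Llama 3.3 70B deployment shown later on this page): concurrent requests are coalesced server-side by continuous batching, so aggregate throughput rises with load.

```shell
# Send 8 requests in parallel; vLLM batches them together on the GPU
seq 1 8 | xargs -P 8 -I{} curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.3-70B-Instruct","prompt":"Write one sentence about topic {}.","max_tokens":32}'
```

Compare wall-clock time for 8 parallel requests against 8 sequential ones; with batching the parallel run finishes in far less than 8x the single-request latency.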
Use case / 04
🔬

Scientific Visualization

Accelerate medical imaging, molecular visualization, and engineering simulation with professional-grade GPU compute.

- Medical imaging and DICOM visualization
- Molecular dynamics and protein structure visualization
- Engineering simulation and CFD post-processing
- Geospatial data analysis and 3D mapping

Performance benchmarks

Llama 3.1 8B Inference
~8,990 tokens/s
vLLM, batched aggregate
Llama 3.1 70B Inference
~24,000 tok/s
vLLM FP8, 100 concurrent requests (aggregate)
30B AWQ Throughput
~8,400 tokens/s
matches 4x RTX 4090 (CloudRift)
SDXL 1024x1024
~11 img/min
FP16, base + refiner
Memory Bandwidth
1.79 TB/s
GDDR7, 512-bit bus
vs RTX 6000 Ada
~2x faster
Blackwell FP4 + 2x VRAM
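To sanity-check numbers like these on your own instance, here is a crude single-stream tokens-per-second measurement against a running vLLM endpoint (assumes `jq` is installed and the deployment from the section below is live; `ignore_eos` is a vLLM extension that forces the full token budget to be generated):

```shell
# Time one 512-token completion and derive tokens/sec from the usage field
START=$(date +%s)
TOKENS=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.3-70B-Instruct","prompt":"Benchmark:","max_tokens":512,"ignore_eos":true}' \
  | jq '.usage.completion_tokens')
ELAPSED=$(( $(date +%s) - START ))
echo "single-stream: ~$(( TOKENS / ELAPSED )) tokens/s"
```

Note that single-stream throughput is much lower than the batched aggregate figures above; the ~24,000 tok/s number is the sum across 100 concurrent requests.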

Serve Llama 3.3 70B FP8 on a single RTX PRO 6000

96GB GDDR7 is enough to load Llama 3.3 70B at FP8 (~70GB weights) with room for KV cache at moderate batch sizes. vLLM gives you an OpenAI-compatible endpoint in one command.

bash
# SSH into your RTX PRO 6000 instance
ssh root@<instance-ip>

# Install vLLM with CUDA 12.4+ (Blackwell FP8 kernels)
pip install "vllm>=0.6.3"

# Launch Llama 3.3 70B at FP8
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.3-70B-Instruct","messages":[{"role":"user","content":"Hello"}]}'

For 30B-class models (Qwen 2.5 32B, Mixtral 8x7B), FP16 fits comfortably and lets you serve higher concurrency.
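For the 30B-class case, a sketch of the equivalent serve command (Qwen/Qwen2.5-32B-Instruct is the Hugging Face model ID; the context length and memory fraction here are illustrative, not tuned values):

```shell
# Qwen 2.5 32B at FP16: ~64GB of weights leaves ~30GB for KV cache on 96GB
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92
```

Skipping quantization at this size keeps full FP16 quality while the extra headroom supports longer contexts or more concurrent streams than the 70B FP8 deployment.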

Related resources

FAQ / 11

Frequently asked questions

Also consider