Spheron GPU Catalog

Rent NVIDIA RTX 5090 GPUs on Demand from $0.86/hr

32GB GDDR7 Blackwell, deployed in under 2 minutes.

At a glance

You can rent an NVIDIA RTX 5090 on Spheron starting at $0.86 per GPU per hour on dedicated instances (99.99% SLA, non-interruptible), with spot instances cheaper still. Per-minute billing, no contracts, and deployment in under 2 minutes across data center partners in multiple regions. The RTX 5090 packs 32GB of GDDR7 memory and 5th gen Tensor Cores, making it the best price-to-performance choice for LoRA/QLoRA fine-tuning of 7B-13B models, Stable Diffusion XL inference, local LLM serving with Ollama or vLLM, and general AI development work. Launch a container with your CUDA/PyTorch image, SSH in, and start training in minutes.

GPU Architecture: NVIDIA Blackwell
VRAM: 32 GB GDDR7
Memory Bandwidth: 1.79 TB/s
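
Once your instance is up, it's worth a quick sanity check that the GPU and CUDA stack are visible before kicking off a job. A minimal sketch, assuming a standard CUDA/PyTorch container image (exact image and driver versions depend on your deployment):

bash
# Confirm the GPU is visible and report its VRAM and driver version
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv

# Verify PyTorch can see the card and CUDA is usable
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.is_available())"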

Technical specifications

GPU Architecture: NVIDIA Blackwell
VRAM: 32 GB GDDR7
Memory Bandwidth: 1.79 TB/s
Tensor Cores: 5th Generation
CUDA Cores: 21,760
RT Cores: 4th Generation
FP32 Performance: 104.8 TFLOPS
FP16 Tensor (dense): 209.5 TFLOPS
FP8 Tensor (dense): 419 TFLOPS
INT8 Tensor (dense): 838 TOPS
FP4 Tensor (sparse): 3,352 TOPS
System RAM: 24 GB DDR5
vCPUs: 8
Storage: 200 GB NVMe SSD
Host Interface: PCIe Gen5
TDP: 575W

Pricing comparison

Provider              Price/hr   Savings
Spheron (your price)  $0.86/hr   —
CloudRift             $0.65/hr   —
NeevCloud             $0.69/hr   —
RunPod (Community)    $0.69/hr   —
RunPod (Secure)       $0.99/hr   1.2x more expensive
Custom & Reserved

Need More RTX 5090 Capacity Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more RTX 5090 capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the RTX 5090

Scenario 01

Pick RTX 5090 if

Your workload is LoRA/QLoRA fine-tuning on 7B-13B models, Stable Diffusion XL or Flux inference, or local LLM serving where 32GB VRAM is plenty. You want the cheapest Blackwell-generation GPU with 5th gen Tensor Cores and aren't bottlenecked by multi-GPU interconnect.

Scenario 02

Pick RTX 4090 instead if

You need the absolute lowest hourly rate and 24GB VRAM is enough for your model. Your workload doesn't benefit from Blackwell's 2x AI throughput or the bandwidth jump from GDDR6X to GDDR7.

Scenario 03

Pick RTX PRO 6000 instead if

You need 48GB or 96GB VRAM on Blackwell silicon to serve 30B+ quantized models on a single GPU, or you want pro-tier drivers and ECC memory for production workloads.

Scenario 04

Pick H100 instead if

You're training or fine-tuning 30B+ parameter models end-to-end, need HBM3 bandwidth and NVLink/InfiniBand for multi-GPU, or your workload requires the Hopper FP8 Transformer Engine.

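The dividing lines in these scenarios are mostly VRAM arithmetic: FP16 weights take about 2 bytes per parameter, plus roughly 20% headroom for KV cache and runtime overhead (the 20% is a rough rule of thumb, not a measured figure):

bash
# FP16 VRAM estimate: params (billions) x 2 bytes x 1.2 overhead factor
echo "13B model: $(( 13 * 2 * 12 / 10 )) GB"   # ~31 GB, just fits in the 5090's 32 GB
echo "30B model: $(( 30 * 2 * 12 / 10 )) GB"   # ~72 GB, needs quantization or a bigger card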

Ideal use cases

Use case / 01
🛠️

AI Prototyping & Development

Rapidly iterate on AI models at low cost, making the RTX 5090 ideal for development workflows and early-stage experimentation.

Model architecture experimentation · Rapid prototyping · Development and debugging · CI/CD ML pipelines
Use case / 02
🎯

Small Model Fine-Tuning

Perform LoRA and QLoRA fine-tuning of models up to 13B parameters with 32GB of fast GDDR7 memory.

Domain-specific fine-tuning (7B-13B models) · Instruction tuning · RLHF experiments · Adapter training
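
For a concrete starting point on this card, Hugging Face's TRL CLI can run a 4-bit QLoRA fine-tune of a 7B model end to end. A minimal sketch with placeholder model/dataset names and starter hyperparameters you'd tune for real work (verify flags against your installed TRL version):

bash
# Install the fine-tuning stack
pip install trl peft bitsandbytes transformers datasets

# QLoRA supervised fine-tune of a 7B model; 4-bit weights plus LoRA
# adapters fit comfortably in the 5090's 32 GB VRAM
trl sft \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --dataset_name trl-lib/Capybara \
  --use_peft \
  --load_in_4bit \
  --lora_r 16 \
  --lora_alpha 32 \
  --output_dir ./mistral-7b-qlora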
Use case / 03
💰

Cost-Effective Inference

Deploy smaller models at minimal cost for production inference workloads that demand high throughput at a budget-friendly price.

7B model inference · Chatbot deployment · Image classification APIs · Real-time NLP services
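
If you just need a small model answering requests cheaply, Ollama is the shortest path from bare instance to endpoint; the model tag below is illustrative:

bash
# Install Ollama and pull a 4-bit quantized Llama 3.1 8B
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b

# Query the local HTTP API (the installer starts the server as a service;
# run `ollama serve` manually if it isn't already running)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Say hello in five words."
}'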
Use case / 04
📚

AI Education & Research

Affordable GPU access for learning, research, and open-source contributions without the overhead of expensive data center GPUs.

ML courses and workshops · Academic research · Kaggle competitions · Open-source model development

Performance benchmarks

Llama 3.1 8B Inference: ~3,500 tokens/s (FP16, vLLM batched)
Llama 3.1 8B (Q4_K_M): ~65 tokens/s (llama.cpp, single stream)
Stable Diffusion XL: ~16 img/min (1024x1024, base + refiner)
Mistral 7B QLoRA: ~720 tokens/s (INT4 fine-tuning)
Memory Bandwidth: 1,792 GB/s (GDDR7, 512-bit bus)
vs RTX 4090: +28-50% LLM tokens/s uplift
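
The single-stream figure is easy to sanity-check yourself: llama.cpp ships a llama-bench tool. A minimal sketch, assuming you've already downloaded a Q4_K_M GGUF of the model (the path is a placeholder):

bash
# Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Report prompt-processing and generation tokens/s for the quantized model
./build/bin/llama-bench -m /path/to/llama-3.1-8b-instruct-Q4_K_M.gguf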

Serve Llama 3.1 8B on RTX 5090 with vLLM

Spin up an OpenAI-compatible inference endpoint on a single RTX 5090. The 32GB GDDR7 fits Llama 3.1 8B in FP16 with room for an 8K context window.

bash
# SSH into your RTX 5090 instance
ssh root@<instance-ip>

# Install vLLM (CUDA 12.x compatible)
pip install vllm

# Serve Llama 3.1 8B in FP16 on a single RTX 5090
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --port 8000

# Test the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
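
A note on the flags: an 8B model in FP16 needs roughly 16 GB for weights alone, so --gpu-memory-utilization 0.9 leaves about 13 GB of the 32 GB for KV cache at the 8K context, with a small reserve for the CUDA context itself. If you hit out-of-memory errors, lowering --max-model-len is usually the first knob to turn.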
