Spheron GPU Catalog

Rent NVIDIA L40S GPUs on Demand from $0.72/hr

48GB GDDR6 ECC Ada Lovelace data center GPU, tuned for inference, video, and visual AI.

At a glance

You can rent an NVIDIA L40S on Spheron starting at $0.72 per GPU-hour on dedicated instances (99.99% SLA, non-interruptible), with spot pricing cheaper still. Billing is per-minute with no long-term contracts, and instances deploy in under 2 minutes across data center partners in multiple regions. Each card ships with 48 GB of GDDR6 ECC memory, 4th-generation Tensor Cores with FP8 support, 3rd-generation RT Cores, and hardware AV1 encode. The L40S is purpose-built for production inference of 7B-30B LLMs, Stable Diffusion and SDXL serving, video transcoding pipelines, and mixed AI + graphics workloads where you need data center reliability without H100 pricing.
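
Per-minute billing means a short session costs exactly what you use. A quick back-of-envelope check, assuming the $0.72/hr dedicated rate (the numbers here are just illustrative arithmetic):

bash
# Per-minute billing: $0.72/hr works out to $0.012 per minute
# A 45-minute session therefore costs $0.54
printf "45 min session: \$%.2f\n" "$(echo "0.72 / 60 * 45" | bc -l)"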

GPU Architecture: NVIDIA Ada Lovelace
VRAM: 48 GB GDDR6 (with ECC)
Memory Bandwidth: 864 GB/s

Technical specifications

GPU Architecture: NVIDIA Ada Lovelace
VRAM: 48 GB GDDR6 (with ECC)
Memory Bandwidth: 864 GB/s
Tensor Cores: 4th Generation
CUDA Cores: 18,176
RT Cores: 3rd Generation
FP32 Performance: 91.6 TFLOPS
FP16 Performance: 183.2 TFLOPS
INT8 Performance: 733 TOPS
System RAM: 128 GB DDR5
vCPUs: 22
Storage: 625 GB NVMe SSD
Interconnect: PCIe Gen4
TDP: 350W
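
Once an instance is up, you can verify the card, memory, and ECC state yourself. A quick check using standard nvidia-smi query fields (nothing Spheron-specific; exact reported totals vary slightly since ECC reserves some VRAM):

bash
# Confirm GPU model, total VRAM, ECC mode, and power limit
nvidia-smi --query-gpu=name,memory.total,ecc.mode.current,power.limit \
  --format=csv
# Expect an NVIDIA L40S with ~48 GB total, ECC enabled, 350 W limit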

Pricing comparison

Provider              Price/hr   Savings
Spheron (your price)  $0.72/hr   -
RunPod                $0.79/hr   1.1x more expensive
Lambda Labs           $1.29/hr   1.8x more expensive
CoreWeave             $1.89/hr   2.6x more expensive
AWS (g6e.xlarge)      $1.86/hr   2.6x more expensive
Custom & Reserved

Need More L40S Than What's Listed?

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more L40S capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the L40S

Scenario 01

Pick L40S if

You're running production inference for 7B-30B LLMs, SDXL serving, or video transcoding pipelines and need ECC and data center drivers without H100 pricing. It's also the pick when you need FP8 support but not HBM-class bandwidth, and when hardware AV1 encode is on the requirements list.

Scenario 02

Pick A100 80GB instead if

Your workload is training-heavy and bandwidth-bound. A100 has 2 TB/s HBM2e (vs 864 GB/s GDDR6 on L40S), making it faster for pre-training and fine-tuning. L40S wins at inference, A100 wins at training.
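
The "bandwidth-bound" point is easy to quantify with a back-of-envelope decode ceiling: at batch size 1, every generated token reads all the weights once, so single-stream tokens/s is at most bandwidth divided by weight bytes. A rough sketch (idealized; it ignores KV cache traffic and overheads):

bash
# Decode ceiling ~= memory bandwidth / bytes of weights read per token
# A 13B model at FP16 is ~26 GB of weights touched per token at batch 1
echo "L40S: $(echo 864/26 | bc) tok/s  A100: $(echo 2000/26 | bc) tok/s"

Batching amortizes those weight reads across many concurrent requests, which is why the L40S still shines at high-concurrency inference despite the bandwidth gap.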

Scenario 03

Pick RTX 4090 instead if

Your model fits in 24GB and you're running dev / testing workloads where ECC and multi-tenant isolation don't matter. RTX 4090 is roughly half the hourly rate of L40S.

Scenario 04

Pick H100 instead if

You need HBM3 bandwidth (3.35 TB/s) or NVLink for multi-GPU tensor parallelism. H100 is the right pick for 70B+ inference or any training job where memory bandwidth is the bottleneck.


Ideal use cases

Use case / 01

AI Inference at Scale

Run cost-effective inference workloads with 48GB memory and INT8 support for high-throughput production deployments.

Production LLM inference (up to 30B params) · Multi-model serving · Recommendation system deployment · Real-time classification APIs
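
One common pattern for the multi-model item is running two vLLM servers on a single 48 GB card, each pinned to a slice of VRAM. A sketch, not a Spheron-specific recipe; the model IDs and memory splits are illustrative:

bash
# Two independent OpenAI-compatible endpoints on one L40S
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 --port 8000 --gpu-memory-utilization 0.45 &
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --port 8001 --gpu-memory-utilization 0.45 &
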
Use case / 02
🎬

Video Processing & Encoding

Leverage hardware-accelerated video pipelines for live streaming, transcoding, and video analytics at scale.

Live video transcoding · Cloud gaming · Video analytics · Real-time virtual production
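
Transcoding runs on the card's dedicated NVENC/NVDEC engines, leaving the CUDA cores free for analytics. A minimal sketch, assuming an ffmpeg build with NVIDIA hardware acceleration compiled in (file names are illustrative; av1_nvenc needs ffmpeg 5.1+):

bash
# 4K HEVC -> H.264, decode and encode both on the GPU
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -i input_4k.mp4 -c:v h264_nvenc -preset p4 -b:v 20M out_h264.mp4
# Ada's hardware AV1 encoder
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -i input_4k.mp4 -c:v av1_nvenc -b:v 10M out_av1.mp4
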
Use case / 03
🖼️

Visual Computing & Rendering

Combine AI acceleration with professional graphics capabilities for rendering and visualization workloads.

3D rendering workloads · Virtual desktop infrastructure (VDI) · Architectural visualization · Product design rendering
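
For headless rendering, the RT Cores are reachable through Cycles' OptiX backend. A minimal sketch, assuming Blender is installed and scene.blend is your own file:

bash
# Render frame 1 in background mode with Cycles on the GPU via OptiX
blender -b scene.blend -E CYCLES -f 1 -- --cycles-device OPTIX
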
Use case / 04
🔄

Mixed AI + Graphics Workloads

Take advantage of the L40S's unique combination of AI and graphics acceleration for next-generation creative and visual AI applications.

AI-powered video editing · Generative AI for visual content · Neural radiance fields (NeRF) · Real-time style transfer

Performance benchmarks

LLaMA 2 13B Inference: 2,800 tokens/s (FP16, batch 32)
Stable Diffusion XL: 32 img/min (1024x1024, FP16)
Video Transcoding: 8x real-time (4K H.265 to H.264)
BERT Large Inference: 6,200 seq/sec (INT8)
Ray Tracing: 3rd-gen RT Cores (hardware RT; A100 has none)
VDI User Density: 3x more users per GPU vs previous generation

Serve Llama 3.1 8B at FP8 on L40S

L40S's 48GB GDDR6 ECC and FP8 Tensor Cores make it a strong fit for production 7B-13B inference with heavy concurrency. vLLM gives you an OpenAI-compatible endpoint in one command.

bash
# SSH into your L40S instance
ssh root@<instance-ip>

# Install vLLM
pip install vllm

# Launch Llama 3.1 8B FP8 with high concurrency
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.9

# Test the endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Hello","max_tokens":50}'

For ~30B models, FP8 weights still fit with room for KV cache at moderate batch sizes: Qwen 2.5 32B at FP8 is roughly 32 GB of weights, leaving over 10 GB of headroom on a 48 GB card. Mixture-of-experts models like Mixtral 8x7B (47B total parameters) need 4-bit AWQ to fit.
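
As a concrete example of the 30B case, a sketch that mirrors the 8B command above (context length trimmed as a judgment call at this model size):

bash
# Qwen 2.5 32B at FP8: ~32 GB of weights, remainder goes to KV cache
vllm serve Qwen/Qwen2.5-32B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95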

Related resources

FAQ / 11

Frequently asked questions

Also consider