
AI GPU Buyers Guide 2026: How to Evaluate Cloud GPU Providers

Written by Mitrasish, Co-founder · Apr 15, 2026
Tags: AI Buyers Guide, GPU Buying Guide, GPU Cloud Evaluation, AI Infrastructure, GPU Performance, Bare Metal GPUs, Cost Optimization, GPU Cloud

_Updated April 2026 with current Spheron pricing and the framework we use with teams evaluating GPU cloud providers._

Every GPU cloud ad promises the same three things: fastest hardware, lowest price, infinite scale. Most are wrong about at least two. Specs on a datasheet are the starting point, not the answer. What separates a provider you can ship on from one that breaks you six months in is how the platform behaves under real workloads: whether performance holds up across long runs, whether pricing stays predictable as you scale, and whether you actually control the machine.

This is a working AI buyers guide for teams shopping GPU cloud in 2026. It walks through what to evaluate, what to measure, and what the current market looks like, including the GPU, pricing, and workload decisions we see teams get wrong most often. For side-by-side provider comparisons, see our top 10 cloud GPU providers analysis and the GPU cloud pricing comparison 2026.

The Four Questions That Actually Matter

Most buyer conversations start from the wrong question. Teams ask "which provider has H100s available?" when they should be asking:

  1. Control. How much of the machine do I own? Can I install my own CUDA, patch the kernel, run profilers?
  2. Consistency. Does throughput stay stable across a 48-hour training run? What happens when the data center fills up?
  3. Pricing truth. What does this cost per month including egress, storage, minimums, and idle time?
  4. Right-sizing. Am I buying the GPU this workload actually needs, or the one the provider wants me to buy?

If a provider can't answer these clearly, the rest of the pitch doesn't matter. For accurate workload sizing, see our GPU memory requirements for LLMs guide and the GPU requirements cheat sheet.

Control: Do You Actually Own the Machine?

The fastest GPU on paper means nothing if you can't configure the environment around it. Many cloud platforms lock you into container sandboxes, restrict driver installation, or hide the hardware behind layers of virtualization that look clean in benchmarks and fail in production.

This matters for more workloads than people realize. LLM fine-tuning, RLHF pipelines, multi-node training with NCCL tuning, custom CUDA kernels, video AI pipelines with GStreamer bindings, and any research workload that touches low-level profiling tools all need real control. Without it, you hit mystery slowdowns at the worst possible moment.

Spheron gives full VM access with root on every deployment. You configure the OS, install your own CUDA version, patch the kernel, run Nsight or DCGM, and do the things that keep training jobs on schedule. Every instance runs on bare metal, which means no hypervisor tax, no shared-memory contention, and no noisy neighbors eating your PCIe bandwidth. Teams consistently measure 15-20% faster compute and cleaner multi-node networking compared to virtualized alternatives.

Before you pick a provider, run a simple test: deploy a small training job, install a custom kernel or driver version, and run nvidia-smi with full permissions. If any of that is blocked, you're running in someone else's sandbox.
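
Here's a minimal sketch of that probe in Python, assuming a Linux instance where you expect passwordless sudo; the three commands are illustrative, not an exhaustive acceptance test:

```python
import subprocess

def run(cmd: str) -> bool:
    """Run a shell command and report pass/fail."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    ok = result.returncode == 0
    print(f"{'PASS' if ok else 'FAIL'}: {cmd}")
    return ok

# Three checks that separate a real VM from a container sandbox.
checks = [
    "sudo -n true",            # passwordless root actually available?
    "nvidia-smi -q -d CLOCK",  # full GPU query, not a stripped-down view
    "sudo dmesg | tail -n 5",  # kernel log access, blocked in most sandboxes
]

results = [run(c) for c in checks]
if all(results):
    print("Full machine access confirmed; safe to try a custom driver install.")
else:
    print("At least one low-level check is blocked: you're in a sandbox.")
```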

Consistency: The Thing Benchmarks Hide

Performance consistency is where most GPU clouds quietly fail. A GPU that hits peak throughput on a morning benchmark but slows down in the afternoon when the data center fills up is not useful for a 48-hour training run. An inference endpoint that swings from 80 ms to 400 ms without explanation is a production liability.

Two design choices cause most of this pain: (1) virtualized GPUs sharing PCIe lanes and HBM bandwidth with other tenants, and (2) single-region deployments that collapse when the region gets oversubscribed. Bare metal fixes the first. A multi-provider, multi-region footprint fixes the second.

Spheron aggregates supply from data center partners across multiple regions globally, which means a workload isn't pinned to a single geography or a single failure zone. If one partner slows down, jobs continue elsewhere. If a region goes offline, it doesn't take your inference endpoint with it. Combined with bare-metal single-tenancy, this is why teams building production agents, real-time inference, and 24/7 batch pipelines report better stability than on larger clouds that advertise more raw capacity.

Measure this yourself during evaluation. Run the same workload at different times of day for a week and compare throughput distributions. If p99 throughput is more than 20% below p50, you have a consistency problem that will show up in production.
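
A minimal sketch of that check, assuming you've logged one throughput number (e.g. tokens/sec) per run into a list; the 20% tolerance comes straight from the rule above:

```python
import statistics

def is_consistent(samples: list[float], tolerance: float = 0.20) -> bool:
    """True if worst-case throughput stays within tolerance of typical.

    samples: one throughput measurement per run, collected at different
    times of day across the evaluation week.
    """
    ordered = sorted(samples)  # ascending, so the slow tail comes first
    p50 = statistics.median(ordered)
    # "p99 throughput" in the tail sense: the level 99% of runs stay
    # above, i.e. the 1st percentile of the throughput distribution.
    tail = ordered[round(0.01 * (len(ordered) - 1))]
    shortfall = (p50 - tail) / p50
    print(f"p50={p50:.1f}  p99-tail={tail:.1f}  shortfall={shortfall:.1%}")
    return shortfall <= tolerance

# A week of hourly runs gives ~168 samples; a False result here is the
# consistency problem that will show up in production.
```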

Pricing Truth: What the Hourly Rate Hides

The hourly GPU rate is the only piece of the pricing story providers advertise. It's also the least reliable predictor of the monthly bill. On hyperscalers, the hourly rate is typically 40-60% of the total; the rest comes from egress bandwidth, persistent storage, NAT gateways, cross-region replication, snapshot fees, and minimum rental commitments.

A realistic comparison looks like this. For a team serving 70B inference at roughly 10k requests per hour with model checkpoint syncs twice a day:

  • Hourly GPU: $2.50-12.29 depending on provider (H100 SXM5; AWS P5 at ~$6.88/GPU post-2025 cut, Azure ND H100 v5 at ~$12.29/GPU)
  • Egress: $0-150/month (free on neo-clouds, $100-150 on hyperscalers for typical traffic)
  • Storage: $0-60/month (flat or included on neo-clouds, $50-60 on hyperscalers)
  • NAT gateway / IP: $0-45/month (included on neo-clouds, $30-45 on hyperscalers)

The delta between a specialist cloud and a hyperscaler on the same silicon commonly lands 2-3x across the full bill. Egress alone can exceed the GPU cost when you're moving large checkpoints. The GPU cost optimization playbook walks through the patterns that save the most money in practice.
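
To make that concrete, here's a rough monthly-bill sketch using the illustrative figures above (one H100-class GPU, 24/7); your actual line items will differ:

```python
HOURS_PER_MONTH = 730

def monthly_bill(gpu_rate: float, gpus: int, egress: float,
                 storage: float, networking: float) -> float:
    """Full monthly cost: GPU time plus the line items the hourly rate hides."""
    return gpu_rate * gpus * HOURS_PER_MONTH + egress + storage + networking

# Upper-end figures from the comparison above (illustrative, not quotes).
neo = monthly_bill(gpu_rate=2.50, gpus=1, egress=0, storage=0, networking=0)
hyper = monthly_bill(gpu_rate=6.88, gpus=1, egress=150, storage=60, networking=45)

print(f"specialist cloud: ${neo:,.0f}/mo")      # ~$1,825
print(f"hyperscaler:      ${hyper:,.0f}/mo")    # ~$5,277
print(f"multiple:         {hyper / neo:.1f}x")  # ~2.9x, inside the 2-3x range
```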

Spheron's billing is per-minute GPU time with no hidden warm-up charges, no egress surprises, and no idle penalties. If the GPU is running, you pay. If it's off, you don't. That simplicity matters more than people expect, especially for iterative development where instances cycle on and off all day.
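
The effect is easy to quantify. A rough sketch, assuming a hypothetical dev pattern of 20 short sessions a day on an L40S:

```python
import math

# Assumed dev pattern: 20 sessions a day, ~12 minutes each, at $0.72/hr.
sessions, minutes_each, hourly_rate = 20, 12, 0.72

per_minute = sessions * minutes_each / 60 * hourly_rate
hour_rounded = sessions * math.ceil(minutes_each / 60) * hourly_rate

print(f"per-minute billing:   ${per_minute:.2f}/day")    # $2.88
print(f"hour-rounded billing: ${hour_rounded:.2f}/day")  # $14.40, 5x the cost
```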

Right-Sizing: The Most Expensive Mistake

The single most common mistake we see: a team with a 7B model running on H100 SXM5 because "that's what everyone recommends." An H100 SXM5 at $2.50/hr for a workload an L40S at $0.72/hr could handle is a 3.5x cost multiplier on the same latency budget. Over a year of 24/7 inference, that's tens of thousands of dollars evaporated.
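
A quick back-of-envelope using the on-demand rates above (one GPU, 24/7; multiply by your replica count):

```python
h100, l40s = 2.50, 0.72  # on-demand $/hr from the example above
hours_per_year = 24 * 365

print(f"multiplier: {h100 / l40s:.1f}x")  # ~3.5x
print(f"overspend:  ${(h100 - l40s) * hours_per_year:,.0f}/yr per GPU")  # ~$15,593
```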

Match the GPU to the model and the concurrency target. Here's the working rule of thumb:

  • <7B model, small batch, dev work: RTX 4090 ($0.55/hr) or RTX 5090 ($0.76/hr)
  • 7B-13B production inference: L40S ($0.72/hr) or A100 80GB ($1.07/hr)
  • 30B-70B training or inference: A100 80GB or H100 SXM5 ($2.50/hr on-demand, $1.03/hr spot)
  • 70B+ long-context inference: H200 ($4.54/hr) or B200 spot ($2.12/hr)
  • 100B+ or frontier training: H200, B200, B300 spot ($2.45/hr), or GB200 clusters

For detailed workload-to-GPU matching, see best GPU for AI inference in 2026 and best NVIDIA GPUs for LLMs.
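
If you want the rule of thumb as code, here's a sketch; the suggest_gpu helper and its thresholds are just the list above made executable, not a Spheron API:

```python
def suggest_gpu(params_b: float, training: bool = False,
                long_context: bool = False) -> str:
    """Map model size (billions of parameters) to the GPU tier above."""
    if params_b < 7 and not training:
        return "RTX 4090 or RTX 5090"        # dev, small batch
    if params_b <= 13:
        return "L40S or A100 80GB"           # production inference
    if params_b <= 70 and not long_context:
        return "A100 80GB or H100 SXM5"      # training or inference
    if params_b <= 100:
        return "H200 or B200 spot"           # long-context inference
    return "H200, B200, B300 spot, or GB200 clusters"  # frontier scale

print(suggest_gpu(7))                      # L40S or A100 80GB
print(suggest_gpu(70, long_context=True))  # H200 or B200 spot
```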

GPU Comparison Matrix

Quick reference for every GPU Spheron offers, current per-GPU per-hour pricing, and the workload where each shines. Each link in the Rent column goes to the dedicated rental page with live inventory.

| GPU | VRAM | Best Use Case | On-Demand | Spot | Rent |
|---|---|---|---|---|---|
| RTX 4090 | 24 GB | Dev, fine-tuning, diffusion | $0.55/hr | N/A | Rent → |
| RTX 5090 | 32 GB | Dev, small-model inference | $0.76/hr | N/A | Rent → |
| RTX PRO 6000 | 96 GB | Workstation-class, 70B QLoRA | $0.93/hr | $0.72/hr | Rent → |
| L40S | 48 GB | 7B-30B inference, video AI | $0.72/hr | N/A | Rent → |
| A100 80G | 80 GB | Mid-size training & inference | $1.07/hr | $0.60/hr | Rent → |
| GH200 | 96 GB | Grace Hopper hybrid compute | $1.97/hr | N/A | Rent → |
| H100 SXM5 | 80 GB | 70B training, multi-GPU HGX | $2.50/hr | $1.03/hr | Rent → |
| H200 SXM5 | 141 GB | 70B+ inference, long context | $4.54/hr | N/A | Rent → |
| B200 SXM6 | 192 GB | FP4 inference, frontier training | $6.02/hr | $2.12/hr | Rent → |
| B300 SXM6 | 288 GB | Frontier training, long context | $6.80/hr | $2.45/hr | Rent → |

Pricing fluctuates based on GPU availability. The prices above are current as of 15 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Where Spheron Fits in the Market

Most GPU clouds fall into one of three buckets: hyperscalers (AWS, GCP, Azure) sell scale but charge aggressively and lock you into enterprise procurement cycles; specialist clouds (Lambda, CoreWeave, Nebius) sell performance but limit regions and hardware variety; marketplaces (Vast.ai, SFCompute) sell variety but reliability is uneven.

Spheron blends the three: bare-metal performance, marketplace-style pricing, and a multi-provider regional footprint behind a single console. You aren't locked into any one data center operator, which means supply stays available during shortages and pricing stays competitive because providers underneath compete for your workload. For how this compares head-to-head with specific competitors, see our analyses of Spheron vs RunPod, Spheron vs CoreWeave, and Spheron vs Vast.ai.

What Changed in 2026

A few shifts worth calling out if you last evaluated the market more than six months ago:

  • B200 and B300 entered mainstream availability. B200 SXM6 is now $6.02/hr on-demand on Spheron with $2.12/hr spot, a big drop from the $6-8/hr quotes common in late 2025. B200 spot is genuinely competitive with H100 PCIe on-demand and offers 2.4x the memory bandwidth plus native FP4.
  • A100 repriced. A100 80GB SXM4 is $1.07/hr on-demand and $0.60/hr spot. That's roughly 3x cheaper than GCP's A2 instances ($3.30/hr) for the same silicon and still one of the best training and fine-tuning options for models up to 70B.
  • Hyperscaler gap narrowed but is still wide. AWS cut P5 instance pricing by 44% in June 2025, bringing H100 SXM from ~$12.29/GPU down to ~$6.88/GPU, and GCP trimmed A3 rates, but Azure's ND H100 v5 is still roughly $12.29/GPU. The gap between hyperscalers and specialist clouds is still 2-3x on identical silicon.
  • Spot became a real production option. With proper checkpointing, spot instances on H100 ($1.03/hr) and B300 ($2.45/hr) now handle long training runs reliably. The spot GPU training case study walks through the specifics of a 70B training run on spot that saved 73%.
  • Inference cost-per-token compressed. B200 and H200 brought per-token costs for 70B inference into the $0.10-0.20/M range on spot. See cost-per-token math in the GPU cloud pricing comparison.
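
The cost-per-token arithmetic behind that last point is simple enough to sanity-check yourself. A sketch, where the ~4,500 tok/s sustained throughput is an assumed figure for illustration, not a measured Spheron benchmark:

```python
def cost_per_million_tokens(rate_per_hour: float, tokens_per_second: float) -> float:
    """$ per 1M generated tokens for a GPU at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return rate_per_hour / tokens_per_hour * 1_000_000

# Assumed: a 70B model sustaining ~4,500 tok/s on a $2.12/hr B200 spot instance.
print(f"${cost_per_million_tokens(2.12, 4500):.2f}/M tokens")  # ~$0.13/M
```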

A Buying Framework You Can Use Today

If you take one thing from this guide, use this decision flow when evaluating any provider:

  1. Prove control. Deploy, install a custom driver, run a profiler. If blocked, move on.
  2. Prove consistency. Run the same workload across different times of day and across regions if the provider offers multi-region. Measure p99 throughput, not peak.
  3. Prove pricing. Build a realistic monthly cost simulation with egress, storage, and minimums included. Don't trust hourly rate comparisons in isolation.
  4. Prove right-sizing. Match GPU to model. If your workload fits in 48 GB, don't pay for 80.
  5. Start small, measure, then scale. Use on-demand until you have 30+ days of stable baseline load. Move to spot for fault-tolerant workloads. Negotiate reserved only against a known pattern.
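
For step 5, the thing that makes spot viable is checkpointing. A minimal PyTorch-style sketch, assuming a Hugging Face-style model whose forward pass returns an object with a .loss attribute; in practice you'd save to durable object storage, not local disk:

```python
import os
import torch

CKPT = "checkpoint.pt"  # in practice: durable object storage, not local disk

def train_with_checkpoints(model, optimizer, data_loader, save_every=500):
    """Resume-aware loop: a spot preemption costs only the steps since
    the last save, not the whole run."""
    step = 0
    if os.path.exists(CKPT):  # picked up again after an interruption
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        step = state["step"]

    for batch in data_loader:
        loss = model(batch).loss  # assumes an HF-style forward returning .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step % save_every == 0:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, CKPT)
```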

Teams that follow this framework typically save 50-75% versus hyperscalers and get better stability than their previous cloud, because they buy based on measured behavior, not marketing claims.


GPU buying is about matching workload to hardware and pricing model, not chasing the newest GPU. Spheron gives you bare-metal control, transparent billing, and the hardware palette to match every stage of your AI stack across data center partners globally.

View all GPU pricing → | Rent H100 → | Rent B200 → | Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.