Tutorial

How to Migrate Your AI Workloads from AWS, GCP, and Azure to Alternative GPU Clouds

Written by Spheron · Feb 17, 2026
AWS Migration · GCP Migration · Azure Migration · GPU Cloud · AI Infrastructure · Cost Savings · Docker · Kubernetes

You already know hyperscaler GPU pricing is expensive. An H100 on AWS runs roughly $3.90/hr on-demand. Azure's ND96isr H100 v5 works out to about $12.29 per GPU per hour. GCP's A3 instances have dropped to around $3.00/GPU after aggressive cuts, but that's still 2-3x what you'd pay on alternative GPU clouds where the same hardware starts at $1.49/hr.

The problem isn't awareness — it's inertia. Your training scripts reference S3 buckets, your CI/CD pipeline deploys to EC2, and your team knows the AWS console by muscle memory. Migrating feels risky and time-consuming.

It doesn't have to be. This guide walks through the full migration path, step by step, from auditing what you're currently spending to running your first workload on cheaper infrastructure. The goal is a portable setup that lets you deploy on whichever provider has the best price — without rewriting your code every time you switch.

Step 1: Audit Your Current GPU Spend

Before migrating anything, you need a clear picture of what you're actually paying. Most teams underestimate their GPU costs because the charges are spread across compute, storage, networking, and data transfer.

Pull the following numbers from your cloud billing dashboard for the last 90 days:

Compute costs. Filter by GPU instance types (AWS: p4d, p5 families. GCP: a2, a3 families. Azure: NC, ND series). Calculate your average hourly spend and average utilization. If you're running p4d.24xlarge instances at $32.77/hr and your GPU utilization averages 40%, you're paying for 60% idle compute.

Data egress costs. This is the hidden tax that makes migration feel scary. All three hyperscalers charge for outbound data transfer: AWS charges $0.09/GB for the first 10 TB/month, Azure charges $0.087/GB, and GCP charges $0.12/GB. If you're moving 5 TB of training data out, that's $450-600 in egress fees alone. It's a real cost, but it's a one-time cost. Compare it to the monthly savings from cheaper GPUs and you'll see it pays for itself in days.

Storage costs. Identify how much data you have stored (datasets, checkpoints, model artifacts) and what tier it's on. Training datasets on AWS S3 Standard cost $0.023/GB/month. A 10 TB dataset costs $230/month just to park.

Networking costs. Multi-node training generates inter-node traffic. AWS charges for cross-AZ data transfer. Check whether your training runs are distributed across availability zones — if so, you're paying a network premium.

Once you have these numbers, calculate your effective cost-per-GPU-hour including all overhead. It's usually 20-40% higher than the listed instance price.
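
To make that calculation concrete, here's a back-of-the-envelope sketch in Python. Every figure below is an illustrative placeholder, not a quote from any provider's price list; substitute the numbers from your own bill.

python
# Back-of-the-envelope effective cost-per-GPU-hour.
# All figures are illustrative placeholders.

instance_hourly = 32.77        # p4d.24xlarge on-demand, 8 GPUs
gpus_per_instance = 8
hours_this_month = 250         # hours the instance actually ran

compute_cost = instance_hourly * hours_this_month
storage_cost = 0.023 * 20_000  # S3 Standard $/GB/month * GB stored
egress_cost = 0.09 * 5_000     # $/GB * GB transferred out this month
cross_az_cost = 1_000.0        # cross-AZ transfer, read off the bill

total = compute_cost + storage_cost + egress_cost + cross_az_cost
gpu_hours = gpus_per_instance * hours_this_month

listed = instance_hourly / gpus_per_instance
effective = total / gpu_hours

print(f"Listed:    ${listed:.2f}/GPU-hr")
print(f"Effective: ${effective:.2f}/GPU-hr  (+{100 * (effective / listed - 1):.0f}%)")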

Step 2: Containerize Your Workloads

This is the single most important step in the entire migration. If your training and inference code runs inside Docker containers, it runs anywhere — AWS, alternative clouds, bare metal, your local machine. If it doesn't, you're tied to whatever environment it was built for.

Most ML workloads are already partially containerized. If yours isn't, here's the pattern:

dockerfile
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04

# System dependencies
RUN apt-get update && apt-get install -y python3 python3-pip git

# Python dependencies
COPY requirements.txt /app/requirements.txt
RUN pip3 install -r /app/requirements.txt

# Application code
COPY . /app
WORKDIR /app

# Configuration via environment variables, not hardcoded paths
ENV MODEL_PATH=/data/models
ENV DATASET_PATH=/data/datasets
ENV CHECKPOINT_PATH=/data/checkpoints

CMD ["python3", "train.py"]

Three principles make this portable:

Externalize all configuration. Model paths, dataset locations, hyperparameters, API keys: everything goes into environment variables or config files, never hardcoded. When you move to a new provider, you change the environment variables, not the code (see the sketch after this list).

Use bind mounts for data. Mount your datasets and checkpoint directories from the host filesystem rather than baking them into the image. This keeps your container small and your data portable.

Pin your dependencies. Use exact versions in requirements.txt (not torch>=2.0 but torch==2.5.1+cu124). Different providers may have different base images, and pinned dependencies prevent subtle breakage.
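
To show the first principle in practice, here's a minimal sketch of how train.py might read its configuration. The path variables mirror the ENV lines in the Dockerfile above; the hyperparameter names and defaults are hypothetical.

python
import os

# Everything that varies between providers comes from the environment,
# with the Dockerfile ENV values acting as defaults.
MODEL_PATH = os.environ.get("MODEL_PATH", "/data/models")
DATASET_PATH = os.environ.get("DATASET_PATH", "/data/datasets")
CHECKPOINT_PATH = os.environ.get("CHECKPOINT_PATH", "/data/checkpoints")

# Hyperparameters are environment-driven too, so a new provider only
# needs different runtime flags, not a code change.
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "32"))
LEARNING_RATE = float(os.environ.get("LEARNING_RATE", "3e-4"))

def main():
    print(f"Reading data from {DATASET_PATH}, "
          f"writing checkpoints to {CHECKPOINT_PATH}")
    # ... training loop goes here ...

if __name__ == "__main__":
    main()

On a new provider you change only the docker run -e flags (or the values in an env file), never this script.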

If your workload already runs in Docker, skip ahead. If it uses provider-specific tooling like SageMaker, you'll need to extract the training logic into standalone scripts first — more on that below.

Step 3: Decouple from Provider-Specific Services

This is where most migrations stall. Over time, codebases accumulate dependencies on provider-specific services: S3 for storage, SageMaker for training orchestration, CloudWatch for logging, IAM for authentication.

Here's how to handle each category:

Storage: S3 / GCS / Azure Blob → Portable Alternatives

You have two options: use an S3-compatible storage service (most alternative providers offer this), or use a provider-agnostic tool.

The fastest path is to swap out the SDK calls. If your code uses boto3 to read from S3, you can point it at any S3-compatible endpoint by changing two environment variables:

python
import os

import boto3

s3 = boto3.client(
    's3',
    endpoint_url=os.environ.get('S3_ENDPOINT', 'https://s3.amazonaws.com'),
    aws_access_key_id=os.environ['S3_ACCESS_KEY'],
    aws_secret_access_key=os.environ['S3_SECRET_KEY']
)

When running on AWS, S3_ENDPOINT points to S3. On an alternative provider with S3-compatible storage, it points to their endpoint. Same code, different config.

For initial data transfer, use rclone — it handles multi-cloud copies efficiently and supports resumable transfers:

bash
rclone copy s3:my-training-data /local/data --progress --transfers 16

Training Orchestration: SageMaker / Vertex AI / Azure ML → Docker + Scripts

If your training runs through a managed service, extract the training code into standalone scripts. Most managed services wrap ordinary PyTorch or TensorFlow training loops with provider-specific launchers. The core logic is portable — the wrapper isn't.

The replacement is straightforward: a Docker container with your training script, launched with docker run or torchrun for distributed training. You lose some convenience (automatic hyperparameter tuning, built-in experiment tracking), but you gain portability and dramatically lower costs.
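
If you go the torchrun route, here's a rough skeleton of the distributed setup the standalone script needs. The model is a placeholder and the actual training loop is elided.

python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop using a DistributedSampler goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

On a single 8-GPU node you would launch it with something like torchrun --nproc_per_node=8 train.py.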

For experiment tracking, switch to a self-hosted or cloud-agnostic tool like Weights & Biases, MLflow, or Neptune. These work identically regardless of where your training runs.
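
As one example, here's a minimal MLflow sketch that logs to a self-hosted tracking server if MLFLOW_TRACKING_URI is set, or to local files otherwise. The experiment name and metrics are placeholders.

python
import os

import mlflow

# Point at a self-hosted tracking server, or fall back to local files.
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "file:./mlruns"))
mlflow.set_experiment("migration-smoke-test")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("provider", os.environ.get("PROVIDER_NAME", "unknown"))
    for step in range(100):
        loss = 1.0 / (step + 1)  # placeholder metric
        mlflow.log_metric("loss", loss, step=step)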

Logging and Monitoring: CloudWatch / Stackdriver → Portable Tools

Replace provider-specific logging with standard output. Write logs to stdout/stderr, and use a provider-agnostic log aggregator if you need persistence. Most GPU clouds support basic monitoring out of the box, and tools like Grafana with Prometheus work everywhere.
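
A minimal version of that pattern in Python: log to stdout and let whatever runs the container decide where the stream ends up.

python
import logging
import sys

# Log to stdout only; the provider or log aggregator running the
# container decides where the stream goes.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

log = logging.getLogger("train")
log.info("starting training run")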

Step 4: Test on the New Provider

Don't migrate your entire fleet at once. Start with a single, non-critical workload — an evaluation run, a small fine-tuning job, or a batch inference task.

The testing checklist:

Verify GPU access. Run nvidia-smi inside your container to confirm the GPU is visible and the CUDA version matches your requirements.

Run a short training job. Use the same hyperparameters as your production run but train for only 100-500 steps. Compare loss curves, throughput (samples/sec), and GPU utilization against your hyperscaler baseline.

Test checkpoint save/restore. Save a checkpoint, terminate the instance, start a new one, and resume from the checkpoint. This validates your persistence setup and confirms you can survive instance interruptions (a minimal sketch follows this checklist).

Benchmark throughput. For inference, measure tokens/sec and p95 latency under load. For training, measure samples/sec and GPU utilization. These numbers tell you whether the alternative provider's hardware performs equivalently.
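
Here is the minimal save/resume pattern the checkpoint test exercises, sketched in PyTorch. The model and optimizer are placeholders, and the path reuses the CHECKPOINT_PATH variable from the Dockerfile.

python
import os

import torch

CKPT = os.path.join(os.environ.get("CHECKPOINT_PATH", "/data/checkpoints"),
                    "latest.pt")

model = torch.nn.Linear(1024, 1024)              # placeholder model
optimizer = torch.optim.AdamW(model.parameters())
start_step = 0

# Resume if a checkpoint exists (e.g. after an instance was terminated).
if os.path.exists(CKPT):
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

# ... train for a while, then save ...
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": start_step}, CKPT)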

If performance matches your baseline (within 5-10%), you're ready to migrate more workloads. If it doesn't, check for NUMA topology differences, NVLink vs. PCIe connectivity, or network bandwidth constraints between nodes.

Step 5: Migrate Data

Data migration is the most time-consuming step, but it only happens once. The approach depends on how much data you're moving.

Under 1 TB: Direct transfer over the network. Use rclone or rsync with compression and parallel transfers. At typical cloud egress speeds, 1 TB transfers in 2-4 hours.

1-10 TB: Same approach, but schedule it during off-peak hours when egress bandwidth is less contested. Consider compressing datasets first; training data often compresses 2-5x with zstd (see the sketch after this list).

Over 10 TB: Consider a staged approach. Transfer the most-used datasets first and keep older or rarely-used data on the original cloud (accessed via network when needed). Over time, migrate everything.
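
If you compress from Python rather than the zstd CLI, here's a minimal sketch using the zstandard package (pip install zstandard). The file names are placeholders.

python
import zstandard as zstd

# Compress a dataset shard before transfer; level 3 is a reasonable
# speed/ratio default, higher levels trade speed for size.
cctx = zstd.ZstdCompressor(level=3)
with open("train_shard_000.tar", "rb") as src, \
     open("train_shard_000.tar.zst", "wb") as dst:
    cctx.copy_stream(src, dst)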

The egress cost math: at AWS's $0.09/GB rate, transferring 10 TB costs $900. If your new GPU cloud saves you $2,000/month on compute, that egress cost pays for itself in two weeks. Don't let it block you — it's a rounding error relative to the ongoing savings.

Step 6: Set Up Your Multi-Cloud Workflow

Once you've validated performance on the alternative provider, the smartest move is to maintain portability rather than locking into a new single provider. Your containerized workloads can now run anywhere, so take advantage of that.

The production setup looks like this:

Container registry. Push your Docker images to a provider-agnostic registry (Docker Hub, GitHub Container Registry, or a self-hosted registry). Any GPU cloud can pull from these.

Infrastructure as Code. Define your GPU requirements in a config file rather than provider-specific templates. Instead of a CloudFormation template, maintain a simple YAML that specifies GPU type, count, storage, and environment variables (a sketch of loading such a file follows this list).

Data sync. Keep your primary datasets on whichever storage is cheapest, and use rclone to sync to new providers as needed. Checkpoints should save to storage that's local to wherever the training is running.

Orchestration. For simple workloads, a shell script that provisions instances, pulls your container, and starts training is sufficient. For complex multi-node training, tools like Kubernetes with the NVIDIA GPU Operator provide portable orchestration across any cloud.
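
As a sketch of what that config-driven approach can look like, here's a hypothetical gpu_config.yaml being loaded by a provisioning script. The file name, schema, and key names are assumptions for illustration, not any provider's API.

python
import yaml  # pip install pyyaml

REQUIRED_KEYS = {"gpu_type", "gpu_count", "storage_gb", "image", "env"}

def load_gpu_config(path: str = "gpu_config.yaml") -> dict:
    """Load the provider-agnostic GPU spec used by a provisioning script."""
    with open(path) as f:
        config = yaml.safe_load(f)
    missing = REQUIRED_KEYS - set(config or {})
    if missing:
        raise ValueError(f"{path} is missing keys: {sorted(missing)}")
    return config

if __name__ == "__main__":
    cfg = load_gpu_config()
    print(f"Requesting {cfg['gpu_count']}x {cfg['gpu_type']} "
          f"with {cfg['storage_gb']} GB storage, image {cfg['image']}")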

Spheron AI simplifies this by giving you access to baremetal and VM GPU servers across 5+ providers through a single console. You specify the GPU type, count, and storage you need — and get a ready-to-use server where you can run Docker, Kubernetes, or anything else your workload requires. No separate accounts, no provider-specific APIs, no billing fragmentation.

The Migration Checklist

Here's the condensed version for teams ready to move:

  1. Audit spend — Calculate true cost-per-GPU-hour including egress, storage, and networking overhead.
  2. Containerize — Package workloads in Docker with externalized configuration and pinned dependencies.
  3. Decouple — Replace provider-specific storage, orchestration, and logging with portable alternatives.
  4. Test — Run a small workload on the new provider. Compare throughput, latency, and cost against baseline.
  5. Migrate data — Transfer datasets using rclone. Start with critical data, backfill the rest.
  6. Go multi-cloud — Maintain portability so you can always deploy on the cheapest available option.

What Teams Actually Save

The math varies by workload, but the pattern is consistent. Teams running H100s on AWS at $3.90/hr that switch to marketplace pricing at $1.49-$2.10/hr see 45-60% cost reductions on compute alone. Factor in zero or lower egress fees, cheaper storage, and simpler billing from alternative providers, and the total savings often exceed 50%.

The migration itself takes 1-2 weeks for a typical team with 3-5 GPU workloads. The one-time egress cost is typically recouped within the first month of running on cheaper infrastructure.

The hardest part isn't the technical migration — it's deciding to start. Everything after that is just Docker containers and config files.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.

