Fine-tuning a single model is straightforward. Fine-tuning 10 models simultaneously — each with different datasets, hyperparameters, and quality requirements — is an infrastructure problem that most guides ignore entirely.
This case study documents how an AI API company built a multi-tenant fine-tuning pipeline that runs 10 concurrent QLoRA jobs on bare metal H100 servers. The architecture handles job scheduling, GPU allocation, checkpoint management, and quality validation — all without a single job interfering with another.
The result: 10 customer-specific model variants fine-tuned in parallel, completing in 18 hours instead of the 180 hours it would have taken running sequentially. Total cost per fine-tuning job: $73.60.
The Business Context
The company operates an AI API platform serving enterprise customers. Each customer needs a model variant fine-tuned on their proprietary data — customer support transcripts, internal documentation, domain-specific terminology, and compliance-sensitive language patterns.
Their customer pipeline looked like this: onboard a new enterprise client, collect their training data (typically 50K-500K instruction-response pairs), fine-tune a model variant, evaluate quality against the customer's benchmarks, then deploy to a dedicated inference endpoint. The entire process from data collection to deployment needed to happen within 48 hours to meet their SLA.
The bottleneck was fine-tuning. Running one job at a time on a single GPU server meant each customer waited in a queue. With 10 customers onboarding in the same week, the queue stretched to 12 days — far beyond their 48-hour SLA.
Hardware Configuration
They provisioned two bare metal servers, each with 8x H100 80GB GPUs connected via NVLink.
Why bare metal over VMs: Fine-tuning workloads are GPU-memory-bound. Virtualization overhead reduces available VRAM by 2-5% and adds latency to GPU memory operations. On an 80GB GPU running QLoRA with a 70B base model, that 2-5% can be the difference between fitting the job comfortably and running out of memory during activation spikes. Bare metal removes that overhead entirely.
Why H100 over A100: The H100's 80GB of HBM3 with 3.35 TB/s bandwidth provides roughly 1.6-2x the memory bandwidth of the A100 (about 2 TB/s for the 80GB SXM variant, 1.6 TB/s for the 40GB). For QLoRA fine-tuning, where much of each step goes to streaming quantized base model weights and writing gradient updates to the adapter layers, this bandwidth advantage translates directly into faster training iterations.
Server specifications:
| Component | Per Server | Total (2 Servers) |
|---|---|---|
| GPUs | 8x H100 80GB SXM | 16x H100 |
| Total VRAM | 640 GB | 1,280 GB |
| GPU Interconnect | NVLink 4 (900 GB/s) | — |
| System RAM | 2 TB DDR5 | 4 TB |
| Storage | 8 TB NVMe (local) + 20 TB NAS | — |
| Network | 400 Gbps InfiniBand | — |
The Multi-Tenant Architecture
The core challenge was running 10 independent fine-tuning jobs without interference — no GPU memory leaks between jobs, no shared state corruption, no one job starving another of compute.
GPU Allocation Strategy
Each QLoRA fine-tuning job for a Llama 3.3 70B base model requires approximately 48-55 GB of VRAM: the 4-bit quantized base model (~35 GB), adapter weights (~2 GB), optimizer states (~4 GB), and activations (~7-14 GB depending on sequence length and batch size). This means each job fits on a single H100 80GB GPU with 25-32 GB of headroom.
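A quick sanity check of that budget (the component sizes are the estimates above, not measured values):

```python
# Rough per-job VRAM budget on a single H100 80GB (estimates from the text above)
GPU_VRAM_GB = 80
base_model_4bit = 35      # 70B weights quantized to NF4
adapter_weights = 2       # LoRA adapter parameters
optimizer_states = 4      # optimizer states for the adapter parameters
activations = (7, 14)     # varies with sequence length and batch size

low = base_model_4bit + adapter_weights + optimizer_states + activations[0]
high = base_model_4bit + adapter_weights + optimizer_states + activations[1]
print(f"estimated usage: {low}-{high} GB, "
      f"headroom: {GPU_VRAM_GB - high}-{GPU_VRAM_GB - low} GB")
# estimated usage: 48-55 GB, headroom: 25-32 GB
```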
With 16 GPUs across 2 servers, they allocated one GPU per job — running 10 concurrent jobs with 6 GPUs held in reserve for job failures, reruns, and evaluation workloads.
The allocation was static, not dynamic. Each job was pinned to a specific GPU using CUDA_VISIBLE_DEVICES. Static allocation eliminated the risk of GPU memory fragmentation that occurs when jobs dynamically share GPUs.
```bash
# Job launcher — each job gets exactly one GPU
# SERVER_ID selects which of the two 8-GPU servers runs the job;
# dispatch to the second server (e.g., over SSH) is omitted here for brevity
for i in $(seq 0 9); do
  GPU_ID=$((i % 8))
  SERVER_ID=$((i / 8))
  CUDA_VISIBLE_DEVICES=$GPU_ID python finetune.py \
    --config configs/customer_${i}.yaml \
    --output_dir /nas/checkpoints/customer_${i} \
    --gpu_id $GPU_ID \
    &
done
```

QLoRA Configuration
Every job used the same base model (Llama 3.3 70B, 4-bit quantized with bitsandbytes NF4) but different LoRA configurations tuned to each customer's data characteristics.
The base QLoRA configuration:
```python
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig
import torch
# Quantization config — same for all jobs
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# LoRA config — per-customer tuning
lora_config = LoraConfig(
r=64, # Rank — higher for complex domains
lora_alpha=128, # Scaling factor
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
```

The LoRA rank (r) varied by customer. Customers with highly specialized domains (medical, legal) used r=128 for more expressive adapters. Customers with general-purpose customization (tone, formatting) used r=32 for faster training and smaller adapter files.
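A minimal sketch of how these two configs could be wired into model loading, assuming the Llama 3.3 70B Instruct checkpoint from Hugging Face and the per-domain rank policy described above (the model ID, the domain-to-rank mapping, and the helper itself are illustrative; the training loop is omitted):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-3.3-70B-Instruct"              # assumed model ID
CUSTOMER_RANK = {"legal": 128, "medical": 128, "support": 32}  # illustrative mapping

def build_model(domain: str):
    # Same quantization settings for every job
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        quantization_config=bnb_config,
        device_map={"": 0},  # the job sees exactly one GPU via CUDA_VISIBLE_DEVICES
    )
    # Per-customer rank selection; alpha keeps the 2x ratio used above
    r = CUSTOMER_RANK.get(domain, 64)
    lora_config = LoraConfig(
        r=r,
        lora_alpha=2 * r,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()
    return model
```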
Job Scheduling and Monitoring
They built a lightweight job scheduler that tracked each fine-tuning run's progress and handled failures automatically.
```python
# Simplified job monitor
import subprocess
import json
import time
from pathlib import Path


class FinetuneJobMonitor:
    def __init__(self, jobs_config_path):
        self.jobs = json.loads(Path(jobs_config_path).read_text())
        self.status = {j["customer_id"]: "pending" for j in self.jobs}

    def check_gpu_health(self, gpu_id, server):
        """Check GPU memory usage and temperature over SSH."""
        result = subprocess.run(
            ["ssh", server, f"nvidia-smi --id={gpu_id} --query-gpu=memory.used,temperature.gpu --format=csv,noheader"],
            capture_output=True, text=True
        )
        mem_used, temp = result.stdout.strip().split(", ")
        return {
            "memory_mb": int(mem_used.replace(" MiB", "")),
            "temp_c": int(temp),
        }

    def check_training_progress(self, customer_id):
        """Read the latest training metrics from the trainer_state.json log."""
        log_path = f"/nas/logs/{customer_id}/trainer_state.json"
        try:
            state = json.loads(Path(log_path).read_text())
            latest = state["log_history"][-1] if state.get("log_history") else {}
            return {
                "step": state["global_step"],
                "loss": latest.get("loss"),
                "learning_rate": latest.get("learning_rate"),
            }
        except (FileNotFoundError, json.JSONDecodeError):
            return None

    def run(self, check_interval=60):
        # Keep polling until every job reaches a terminal state
        while any(s not in ("completed", "failed") for s in self.status.values()):
            for job in self.jobs:
                cid = job["customer_id"]
                if self.status[cid] in ("completed", "failed"):
                    continue
                progress = self.check_training_progress(cid)
                if progress and progress["step"] >= job["total_steps"]:
                    self.status[cid] = "completed"
                    print(f"[{cid}] Completed at step {progress['step']}")
                elif progress:
                    gpu_health = self.check_gpu_health(job["gpu_id"], job["server"])
                    loss = progress["loss"]
                    loss_str = f"{loss:.4f}" if loss is not None else "n/a"
                    print(f"[{cid}] Step {progress['step']}/{job['total_steps']} "
                          f"loss={loss_str} "
                          f"GPU mem={gpu_health['memory_mb']}MB "
                          f"temp={gpu_health['temp_c']}C")
            time.sleep(check_interval)
```

Checkpoint and Artifact Management
With 10 jobs running simultaneously, checkpoint storage adds up fast. Each QLoRA checkpoint is relatively small (the adapter weights are typically 200-800 MB depending on LoRA rank), but jobs saved a checkpoint every 200 steps and ran 1,500-5,000 steps each, so a single run could produce up to 25 checkpoints.
Storage strategy:
- Local NVMe for active checkpoints (fast write speed, no network bottleneck during training)
- Network-attached storage (NAS) for completed checkpoints and final adapters
- Retention policy: Keep only the 2 most recent checkpoints per job on local storage. Archive final adapters to NAS permanently.
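A sketch of what the archival step can look like (the /nas/adapters destination and the helper are assumptions for illustration; the case study doesn't publish its archival script):

```python
import shutil
from pathlib import Path

LOCAL = Path("/local-nvme/checkpoints")  # active checkpoints on fast local NVMe
NAS = Path("/nas/adapters")              # assumed long-term archive location

def archive_final_adapter(customer_id: str) -> Path:
    """Copy the highest-step checkpoint for a customer from local NVMe to the NAS."""
    checkpoints = sorted(
        (LOCAL / customer_id).glob("checkpoint-*"),
        key=lambda p: int(p.name.split("-")[-1]),
    )
    if not checkpoints:
        raise FileNotFoundError(f"No checkpoints found for {customer_id}")
    final = checkpoints[-1]
    dest = NAS / customer_id / final.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(final, dest, dirs_exist_ok=True)
    return dest
```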
Local cleanup ran as a cron job on each server:

```bash
# Checkpoint cleanup cron — runs every 30 minutes
# Keep only the 2 most recent checkpoints per job on local NVMe
for job_dir in /local-nvme/checkpoints/*/; do
  find "$job_dir" -maxdepth 1 -type d -name 'checkpoint-*' |
    sort -V |
    head -n -2 |
    xargs -r rm -rf
done
```

Performance Results
All 10 fine-tuning jobs completed within 18 hours. The longest job (a legal domain customer with 500K training examples and r=128) took 17.2 hours. The shortest (a customer service tone adaptation with 50K examples and r=32) completed in 4.1 hours.
Training Metrics
| Metric | Average Across 10 Jobs | Range |
|---|---|---|
| Training steps | 3,200 | 1,500 - 5,000 |
| Training time | 11.4 hours | 4.1 - 17.2 hours |
| Peak GPU utilization | 94% | 89% - 97% |
| Peak VRAM usage | 62 GB | 48 - 71 GB |
| Final training loss | 0.82 | 0.61 - 1.14 |
| Adapter size | 480 MB | 190 - 820 MB |
Peak GPU utilization averaged 94% across the 10 active GPUs — indicating that QLoRA fine-tuning on bare metal H100s is almost entirely compute-bound, with minimal idle time from data loading or checkpoint writes.
Quality Validation
Each fine-tuned adapter was evaluated against the customer's held-out test set (10% of their training data, never seen during training) plus a general capability benchmark (MMLU subset) to verify the adapter didn't degrade base model performance.
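A minimal sketch of how a held-out evaluation like this can be run against one adapter (the model ID, file paths, prompt format, and containment-based scoring are assumptions, not the company's actual harness):

```python
import json
import torch
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # assumed model ID
ADAPTER_DIR = "/nas/adapters/customer_3"          # assumed adapter location

def load_finetuned(adapter_dir: str):
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, quantization_config=bnb, device_map={"": 0})
    return PeftModel.from_pretrained(base, adapter_dir)

def evaluate(model, tokenizer, test_path: str) -> float:
    """Containment-based accuracy over a held-out JSONL of {prompt, expected} pairs."""
    examples = [json.loads(line) for line in Path(test_path).read_text().splitlines()]
    correct = 0
    for ex in examples:
        inputs = tokenizer(ex["prompt"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
        correct += int(ex["expected"].strip().lower() in answer.strip().lower())
    return correct / len(examples)

model = load_finetuned(ADAPTER_DIR)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
print(f"Held-out accuracy: {evaluate(model, tokenizer, '/nas/eval/customer_3.jsonl'):.1%}")
```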
| Customer Domain | Base Model Score | Fine-Tuned Score | General Capability Δ |
|---|---|---|---|
| Legal contracts | 71.2% | 93.8% | -0.3% |
| Medical records | 68.5% | 91.2% | -0.5% |
| Customer support | 74.1% | 89.7% | -0.1% |
| Financial analysis | 69.8% | 90.4% | -0.4% |
| Technical docs | 76.3% | 92.1% | -0.2% |
| E-commerce | 72.0% | 87.3% | +0.1% |
| HR/recruiting | 70.5% | 88.9% | -0.2% |
| Insurance claims | 67.9% | 90.6% | -0.6% |
| Real estate | 73.4% | 89.1% | -0.1% |
| Logistics | 71.8% | 87.8% | -0.3% |
Average domain-specific improvement: +18.5 percentage points. Average general capability change: -0.26% — effectively zero. This confirms that QLoRA with appropriate rank selection preserves the base model's general capabilities while dramatically improving domain performance.
Cost Analysis
| Item | Cost |
|---|---|
| 2x bare metal 8x H100 servers (18 hrs × $16.00/hr each) | $576 |
| Network-attached storage (20 TB, 1 month) | $160 |
| Total for 10 fine-tuning jobs | $736 |
| Cost per fine-tuning job | $73.60 |
Compare this to the alternatives:
| Approach | Cost per Job | Time for 10 Jobs | Total Cost |
|---|---|---|---|
| Sequential on 1x 8-GPU server | $115 | 7.5 days | $1,150 |
| Concurrent on 2x 8-GPU servers | $73.60 | 18 hours | $736 |
| API-based fine-tuning (typical) | $200-500 | 2-5 days | $2,000-5,000 |
| Managed fine-tuning platform | $300-800 | 1-3 days | $3,000-8,000 |
The concurrent approach is 36% cheaper than running sequentially (the servers are billed for 18 busy hours rather than 7.5 days of mostly idle GPUs) and 75-91% cheaper than managed fine-tuning platforms.
Lessons Learned
Static GPU allocation beats dynamic scheduling for fine-tuning. Dynamic GPU schedulers (like Kubernetes with GPU sharing) add complexity and introduce memory fragmentation risks. For fine-tuning workloads where each job's VRAM requirement is predictable, pinning jobs to specific GPUs is simpler and more reliable.
LoRA rank is the most impactful hyperparameter. Across 10 customer deployments, the single variable that most affected final quality was LoRA rank. Domains with specialized vocabulary and reasoning patterns (legal, medical) needed r=128. General-purpose adaptations (tone, formatting) worked well with r=32. Over-provisioning rank wastes training time; under-provisioning caps quality.
Bare metal eliminates the VRAM margin problem. On virtualized GPU instances, VRAM overhead from the hypervisor consumed 2-5 GB per GPU. At r=128 with a 70B base model, some jobs peaked at 71 GB VRAM usage — leaving only 9 GB of headroom on an 80GB GPU. With virtualization overhead eating into that margin, those jobs would have been one activation spike away from an OOM. Bare metal gave them the full 80 GB.
Checkpoint storage is cheap — losing a training run is expensive. They budgeted $160/month for 20 TB of network storage. A single lost training run (due to a job failure without recent checkpoints) would cost $57 in wasted GPU time plus hours of delay. Aggressive checkpointing with cheap network storage is always the right tradeoff.
For teams running regular fine-tuning workloads, bare metal GPU servers provide the most predictable and cost-effective infrastructure. On Spheron AI, you can provision multi-GPU bare metal servers with H100, H200, and B300 GPUs — available as both Spot and Dedicated instances through a single console.