Fine-tuning a single model is straightforward. Fine-tuning 10 models simultaneously — each with different datasets, hyperparameters, and quality requirements — is an infrastructure problem that most guides ignore entirely.
This case study documents how an AI API company built a multi-tenant fine-tuning pipeline that runs 10 concurrent QLoRA jobs on bare metal H100 servers. The architecture handles job scheduling, GPU allocation, checkpoint management, and quality validation — all without a single job interfering with another.
The result: 10 customer-specific model variants fine-tuned in parallel, completing in 18 hours instead of the 180 hours it would have taken running sequentially. Total cost per fine-tuning job: $73.60.
The Business Context
The company operates an AI API platform serving enterprise customers. Each customer needs a model variant fine-tuned on their proprietary data — customer support transcripts, internal documentation, domain-specific terminology, and compliance-sensitive language patterns.
Their customer pipeline looked like this: onboard a new enterprise client, collect their training data (typically 50K-500K instruction-response pairs), fine-tune a model variant, evaluate quality against the customer's benchmarks, then deploy to a dedicated inference endpoint. The entire process from data collection to deployment needed to happen within 48 hours to meet their SLA.
The bottleneck was fine-tuning. Running one job at a time on a single GPU server meant each customer waited in a queue. With 10 customers onboarding in the same week, the queue stretched to 12 days — far beyond their 48-hour SLA.
Hardware Configuration
They provisioned two bare metal servers, each with 8x H100 80GB GPUs connected via NVLink.
Why bare metal over VMs: Fine-tuning workloads are GPU-memory-bound. Virtualization overhead reduces available VRAM by 2-5% and adds latency to GPU memory operations. On an 80GB GPU running QLoRA with a 70B base model, that 2-5% can be the difference between fitting the job comfortably and running out of memory during activation spikes. Bare metal removes that overhead entirely.
Why H100 over A100: The H100's 80GB of HBM3 with 3.35 TB/s bandwidth provides roughly 1.6-2x the memory bandwidth of the A100 (about 2 TB/s for the 80GB SXM variant, 1.6 TB/s for the 40GB). For QLoRA fine-tuning, where much of each step goes to streaming quantized base model weights and writing gradient updates to the adapter layers, this bandwidth advantage translates directly into faster training iterations.
Server specifications:
| Component | Per Server | Total (2 Servers) |
|---|---|---|
| GPUs | 8x H100 80GB SXM | 16x H100 |
| Total VRAM | 640 GB | 1,280 GB |
| GPU Interconnect | NVLink 4 (900 GB/s) | — |
| System RAM | 2 TB DDR5 | 4 TB |
| Storage | 8 TB NVMe (local) + 20 TB NAS | — |
| Network | 400 Gbps InfiniBand | — |
The Multi-Tenant Architecture
The core challenge was running 10 independent fine-tuning jobs without interference — no GPU memory leaks between jobs, no shared state corruption, no one job starving another of compute.
GPU Allocation Strategy
Each QLoRA fine-tuning job for a Llama 3.3 70B base model requires approximately 48-55 GB of VRAM: the 4-bit quantized base model (~35 GB), adapter weights (~2 GB), optimizer states (~4 GB), and activations (~7-14 GB depending on sequence length and batch size). This means each job fits on a single H100 80GB GPU with 25-32 GB of headroom.
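A quick sanity check of that budget (the component sizes are the estimates above, not measured values):

```python
# Rough per-job VRAM budget on a single H100 80GB (estimates from the text above)
GPU_VRAM_GB = 80
base_model_4bit = 35      # 70B weights quantized to NF4
adapter_weights = 2       # LoRA adapter parameters
optimizer_states = 4      # optimizer states for the adapter parameters
activations = (7, 14)     # varies with sequence length and batch size

low = base_model_4bit + adapter_weights + optimizer_states + activations[0]
high = base_model_4bit + adapter_weights + optimizer_states + activations[1]
print(f"estimated usage: {low}-{high} GB, "
      f"headroom: {GPU_VRAM_GB - high}-{GPU_VRAM_GB - low} GB")
# estimated usage: 48-55 GB, headroom: 25-32 GB
```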
With 16 GPUs across 2 servers, they allocated one GPU per job — running 10 concurrent jobs with 6 GPUs held in reserve for job failures, reruns, and evaluation workloads.
The allocation was static, not dynamic. Each job was pinned to a specific GPU using CUDA_VISIBLE_DEVICES. Static allocation eliminated the risk of GPU memory fragmentation that occurs when jobs dynamically share GPUs.
```bash
# Job launcher — each job gets exactly one GPU
# SERVER_ID selects which of the two 8-GPU servers runs the job;
# dispatch to the second server (e.g., over SSH) is omitted here for brevity
for i in $(seq 0 9); do
  GPU_ID=$((i % 8))
  SERVER_ID=$((i / 8))
  CUDA_VISIBLE_DEVICES=$GPU_ID python finetune.py \
    --config configs/customer_${i}.yaml \
    --output_dir /nas/checkpoints/customer_${i} \
    --gpu_id $GPU_ID \
    &
done
```

QLoRA Configuration
Every job used the same base model (Llama 3.3 70B, 4-bit quantized with bitsandbytes NF4) but different LoRA configurations tuned to each customer's data characteristics.
The base QLoRA configuration:
```python
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig
import torch
# Quantization config — same for all jobs
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# LoRA config — per-customer tuning
lora_config = LoraConfig(
r=64, # Rank — higher for complex domains
lora_alpha=128, # Scaling factor
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
```

The LoRA rank (r) varied by customer. Customers with highly specialized domains (medical, legal) used r=128 for more expressive adapters. Customers with general-purpose customization (tone, formatting) used r=32 for faster training and smaller adapter files.
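A minimal sketch of how these two configs could be wired into model loading, assuming the Llama 3.3 70B Instruct checkpoint from Hugging Face and the per-domain rank policy described above (the model ID, the domain-to-rank mapping, and the helper itself are illustrative; the training loop is omitted):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-3.3-70B-Instruct"              # assumed model ID
CUSTOMER_RANK = {"legal": 128, "medical": 128, "support": 32}  # illustrative mapping

def build_model(domain: str):
    # Same quantization settings for every job
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        quantization_config=bnb_config,
        device_map={"": 0},  # the job sees exactly one GPU via CUDA_VISIBLE_DEVICES
    )
    # Per-customer rank selection; alpha keeps the 2x ratio used above
    r = CUSTOMER_RANK.get(domain, 64)
    lora_config = LoraConfig(
        r=r,
        lora_alpha=2 * r,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()
    return model
```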
Job Scheduling and Monitoring
They built a lightweight job scheduler that tracked each fine-tuning run's progress and handled failures automatically.
```python
# Simplified job monitor
import subprocess
import json
import time
from pathlib import Path


class FinetuneJobMonitor:
    def __init__(self, jobs_config_path):
        self.jobs = json.loads(Path(jobs_config_path).read_text())
        self.status = {j["customer_id"]: "pending" for j in self.jobs}

    def check_gpu_health(self, gpu_id, server):
        """Check GPU memory usage and temperature over SSH."""
        result = subprocess.run(
            ["ssh", server, f"nvidia-smi --id={gpu_id} --query-gpu=memory.used,temperature.gpu --format=csv,noheader"],
            capture_output=True, text=True
        )
        mem_used, temp = result.stdout.strip().split(", ")
        return {
            "memory_mb": int(mem_used.replace(" MiB", "")),
            "temp_c": int(temp),
        }

    def check_training_progress(self, customer_id):
        """Read the latest training metrics from the trainer_state.json log."""
        log_path = f"/nas/logs/{customer_id}/trainer_state.json"
        try:
            state = json.loads(Path(log_path).read_text())
            latest = state["log_history"][-1] if state.get("log_history") else {}
            return {
                "step": state["global_step"],
                "loss": latest.get("loss"),
                "learning_rate": latest.get("learning_rate"),
            }
        except (FileNotFoundError, json.JSONDecodeError):
            return None

    def run(self, check_interval=60):
        # Keep polling until every job reaches a terminal state
        while any(s not in ("completed", "failed") for s in self.status.values()):
            for job in self.jobs:
                cid = job["customer_id"]
                if self.status[cid] in ("completed", "failed"):
                    continue
                progress = self.check_training_progress(cid)
                if progress and progress["step"] >= job["total_steps"]:
                    self.status[cid] = "completed"
                    print(f"[{cid}] Completed at step {progress['step']}")
                elif progress:
                    gpu_health = self.check_gpu_health(job["gpu_id"], job["server"])
                    loss = progress["loss"]
                    loss_str = f"{loss:.4f}" if loss is not None else "n/a"
                    print(f"[{cid}] Step {progress['step']}/{job['total_steps']} "
                          f"loss={loss_str} "
                          f"GPU mem={gpu_health['memory_mb']}MB "
                          f"temp={gpu_health['temp_c']}C")
            time.sleep(check_interval)
```

Checkpoint and Artifact Management
With 10 jobs running simultaneously, checkpoint storage adds up fast. Each QLoRA checkpoint is relatively small (the adapter weights are typically 200-800 MB depending on LoRA rank), but jobs saved a checkpoint every 200 steps and ran 1,500-5,000 steps each, so a single run could produce up to 25 checkpoints.
Storage strategy:
- Local NVMe for active checkpoints (fast write speed, no network bottleneck during training)
- Network-attached storage (NAS) for completed checkpoints and final adapters
- Retention policy: Keep only the 2 most recent checkpoints per job on local storage. Archive final adapters to NAS permanently.
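A sketch of what the archival step can look like (the /nas/adapters destination and the helper are assumptions for illustration; the case study doesn't publish its archival script):

```python
import shutil
from pathlib import Path

LOCAL = Path("/local-nvme/checkpoints")  # active checkpoints on fast local NVMe
NAS = Path("/nas/adapters")              # assumed long-term archive location

def archive_final_adapter(customer_id: str) -> Path:
    """Copy the highest-step checkpoint for a customer from local NVMe to the NAS."""
    checkpoints = sorted(
        (LOCAL / customer_id).glob("checkpoint-*"),
        key=lambda p: int(p.name.split("-")[-1]),
    )
    if not checkpoints:
        raise FileNotFoundError(f"No checkpoints found for {customer_id}")
    final = checkpoints[-1]
    dest = NAS / customer_id / final.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(final, dest, dirs_exist_ok=True)
    return dest
```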
Local cleanup ran as a cron job on each server:

```bash
# Checkpoint cleanup cron — runs every 30 minutes
# Keep only the 2 most recent checkpoints per job on local NVMe
for job_dir in /local-nvme/checkpoints/*/; do
  find "$job_dir" -maxdepth 1 -type d -name 'checkpoint-*' |
    sort -V |
    head -n -2 |
    xargs -r rm -rf
done
```

Performance Results
All 10 fine-tuning jobs completed within 18 hours. The longest job (a legal domain customer with 500K training examples and r=128) took 17.2 hours. The shortest (a customer service tone adaptation with 50K examples and r=32) completed in 4.1 hours.
Training Metrics
| Metric | Average Across 10 Jobs | Range |
|---|---|---|
| Training steps | 3,200 | 1,500 - 5,000 |
| Training time | 11.4 hours | 4.1 - 17.2 hours |
| Peak GPU utilization | 94% | 89% - 97% |
| Peak VRAM usage | 62 GB | 48 - 71 GB |
| Final training loss | 0.82 | 0.61 - 1.14 |
| Adapter size | 480 MB | 190 - 820 MB |
Peak GPU utilization averaged 94% across the 10 active GPUs — indicating that QLoRA fine-tuning on bare metal H100s is almost entirely compute-bound, with minimal idle time from data loading or checkpoint writes.
Quality Validation
Each fine-tuned adapter was evaluated against the customer's held-out test set (10% of their training data, never seen during training) plus a general capability benchmark (MMLU subset) to verify the adapter didn't degrade base model performance.
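A minimal sketch of how a held-out evaluation like this can be run against one adapter (the model ID, file paths, prompt format, and containment-based scoring are assumptions, not the company's actual harness):

```python
import json
import torch
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # assumed model ID
ADAPTER_DIR = "/nas/adapters/customer_3"          # assumed adapter location

def load_finetuned(adapter_dir: str):
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, quantization_config=bnb, device_map={"": 0})
    return PeftModel.from_pretrained(base, adapter_dir)

def evaluate(model, tokenizer, test_path: str) -> float:
    """Containment-based accuracy over a held-out JSONL of {prompt, expected} pairs."""
    examples = [json.loads(line) for line in Path(test_path).read_text().splitlines()]
    correct = 0
    for ex in examples:
        inputs = tokenizer(ex["prompt"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
        correct += int(ex["expected"].strip().lower() in answer.strip().lower())
    return correct / len(examples)

model = load_finetuned(ADAPTER_DIR)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
print(f"Held-out accuracy: {evaluate(model, tokenizer, '/nas/eval/customer_3.jsonl'):.1%}")
```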
| Customer Domain | Base Model Score | Fine-Tuned Score | General Capability Δ |
|---|---|---|---|
| Legal contracts | 71.2% | 93.8% | -0.3% |
| Medical records | 68.5% | 91.2% | -0.5% |
| Customer support | 74.1% | 89.7% | -0.1% |
| Financial analysis | 69.8% | 90.4% | -0.4% |
| Technical docs | 76.3% | 92.1% | -0.2% |
| E-commerce | 72.0% | 87.3% | +0.1% |
| HR/recruiting | 70.5% | 88.9% | -0.2% |
| Insurance claims | 67.9% | 90.6% | -0.6% |
| Real estate | 73.4% | 89.1% | -0.1% |
| Logistics | 71.8% | 87.8% | -0.3% |
Average domain-specific improvement: +18.5 percentage points. Average general capability change: -0.26% — effectively zero. This confirms that QLoRA with appropriate rank selection preserves the base model's general capabilities while dramatically improving domain performance.
Cost Analysis
| Item | Cost |
|---|---|
| 2x bare metal 8x H100 servers (18 hrs × $16.00/hr each) | $576 |
| Network-attached storage (20 TB, 1 month) | $160 |
| Total for 10 fine-tuning jobs | $736 |
| Cost per fine-tuning job | $73.60 |
Compare this to the alternatives:
| Approach | Cost per Job | Time for 10 Jobs | Total Cost |
|---|---|---|---|
| Sequential on 1x 8-GPU server | $115 | 7.5 days | $1,150 |
| Concurrent on 2x 8-GPU servers | $73.60 | 18 hours | $736 |
| API-based fine-tuning (typical) | $200-500 | 2-5 days | $2,000-5,000 |
| Managed fine-tuning platform | $300-800 | 1-3 days | $3,000-8,000 |
The concurrent approach is 36% cheaper than running sequentially (the servers are billed for 18 busy hours rather than 7.5 days of mostly idle GPUs) and 75-91% cheaper than managed fine-tuning platforms.
Lessons Learned
Static GPU allocation beats dynamic scheduling for fine-tuning. Dynamic GPU schedulers (like Kubernetes with GPU sharing) add complexity and introduce memory fragmentation risks. For fine-tuning workloads where each job's VRAM requirement is predictable, pinning jobs to specific GPUs is simpler and more reliable.
LoRA rank is the most impactful hyperparameter. Across 10 customer deployments, the single variable that most affected final quality was LoRA rank. Domains with specialized vocabulary and reasoning patterns (legal, medical) needed r=128. General-purpose adaptations (tone, formatting) worked well with r=32. Over-provisioning rank wastes training time; under-provisioning caps quality.
Bare metal eliminates the VRAM margin problem. On virtualized GPU instances, VRAM overhead from the hypervisor consumed 2-5 GB per GPU. At r=128 with a 70B base model, some jobs peaked at 71 GB VRAM usage — leaving only 9 GB of headroom on an 80GB GPU. With virtualization overhead eating into that margin, those jobs would have been one activation spike away from an OOM. Bare metal gave them the full 80 GB.
Checkpoint storage is cheap — losing a training run is expensive. They budgeted $160/month for 20 TB of network storage. A single lost training run (due to a job failure without recent checkpoints) would cost $57 in wasted GPU time plus hours of delay. Aggressive checkpointing with cheap network storage is always the right tradeoff.
For teams running regular fine-tuning workloads, bare metal GPU servers provide the most predictable and cost-effective infrastructure. On Spheron AI, you can provision multi-GPU bare metal servers with H100, H200, and B300 GPUs — available as both Spot and Dedicated instances through a single console.