ML Infrastructure Framework
Overview
A comprehensive skill for designing and managing machine learning infrastructure — covering GPU cluster architecture, storage systems, networking, job scheduling, monitoring, and cost management. Enables teams to build and operate reliable, cost-efficient ML platforms that scale from single-GPU experiments to thousand-GPU training runs.
When to Use
- Designing ML infrastructure from scratch
- Choosing between cloud providers for GPU workloads
- Setting up job scheduling and resource management
- Building CI/CD for ML model training and deployment
- Optimizing infrastructure costs for ML workloads
- Planning multi-tenant GPU cluster architecture
Quick Start
```shell
# Infrastructure-as-code for ML cluster

# Option 1: Kubernetes + GPU operator
helm install gpu-operator nvidia/gpu-operator
kubectl apply -f ml-platform/

# Option 2: SLURM cluster
ansible-playbook setup-slurm.yml

# Option 3: SkyPilot (multi-cloud)
pip install skypilot
sky launch --gpus A100:8 train.yaml
```
Architecture Decision Matrix
| Requirement | Solution | Pros | Cons |
|---|---|---|---|
| < 10 GPUs | Single cloud instance | Simple, fast | No fault tolerance |
| 10-100 GPUs | Kubernetes + GPU Operator | Flexible, multi-tenant | Complex setup |
| 100+ GPUs | SLURM + bare metal | Best performance | High maintenance |
| Multi-cloud | SkyPilot | Cost optimization | Added abstraction |
| Spot/preemptible | Any + checkpointing | 60-70% cost savings | Interruption handling |
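The spot/preemptible row above depends entirely on frequent checkpointing: cloud providers typically send SIGTERM shortly before reclaiming a spot node, and the training loop must save state and exit cleanly. A minimal framework-free sketch (`save_checkpoint` and the state dict are illustrative stand-ins for e.g. `torch.save`):

```python
import signal

class PreemptionHandler:
    """Flag SIGTERM so spot interruptions lose minimal work."""

    def __init__(self):
        self.interrupted = False
        # Providers send SIGTERM shortly before reclaiming spot capacity
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.interrupted = True

def save_checkpoint(step, state):
    # Stand-in for framework-specific checkpointing (e.g. torch.save)
    state["step"] = step
    return state

def train(num_steps, checkpoint_every=100):
    handler = PreemptionHandler()
    state = {"step": 0}
    for step in range(1, num_steps + 1):
        # ... one training step ...
        if step % checkpoint_every == 0 or handler.interrupted:
            state = save_checkpoint(step, state)
        if handler.interrupted:
            break  # exit cleanly; resume from the last checkpoint on restart
    return state

state = train(250)  # checkpoints at steps 100 and 200
```

The same pattern works under SLURM (which sends a configurable signal before the time limit) and Kubernetes (SIGTERM on pod eviction).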
Cloud Provider Comparison
| Provider | GPU Types | Pricing (H100/hr) | Networking | Best For |
|---|---|---|---|---|
| AWS | A10G, A100, H100 | ~$32 (p5.48xlarge) | EFA | Enterprise, broad services |
| GCP | A100, H100, TPU | ~$30 (a3-highgpu-8g) | gVNIC | TPU access, GKE |
| Azure | A100, H100 | ~$27 (ND-H100-v5) | InfiniBand | Enterprise, Azure ML |
| Lambda | A100, H100 | ~$20 (8xH100) | InfiniBand | ML-focused, simple |
| CoreWeave | A100, H100 | ~$22 (8xH100) | InfiniBand | GPU-native cloud |
| RunPod | A100, H100 | ~$18 (8xH100) | Varies | Budget, serverless |
Monitoring Stack
```yaml
# Prometheus + Grafana for ML infrastructure
# Key metrics to track:
metrics:
  gpu:
    - gpu_utilization_percent     # Target: >80%
    - gpu_memory_used_bytes       # Monitor for OOM risk
    - gpu_temperature_celsius     # Alert >85°C
    - gpu_power_draw_watts        # Cost correlation
  training:
    - training_loss               # Model convergence
    - throughput_samples_per_sec  # Training speed
    - checkpoint_save_duration    # I/O bottleneck
    - gradient_norm               # Training stability
  infrastructure:
    - node_cpu_utilization        # Data loading bottleneck
    - network_bandwidth_bytes     # Inter-node communication
    - storage_iops                # Data pipeline throughput
    - job_queue_depth             # Scheduling efficiency
```
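The thresholds noted next to the GPU metrics above can be turned into a simple alert check. A sketch (the threshold values mirror the comments in the metric list; the evaluation loop is illustrative, not Prometheus alerting syntax):

```python
# Alert thresholds mirroring the comments in the metric list above
THRESHOLDS = {
    "gpu_utilization_percent": ("below", 80.0),  # target: >80%
    "gpu_temperature_celsius": ("above", 85.0),  # alert >85°C
}

def evaluate(samples):
    """Return alert messages for metric samples violating a threshold."""
    alerts = []
    for metric, value in samples.items():
        if metric not in THRESHOLDS:
            continue
        direction, limit = THRESHOLDS[metric]
        if direction == "below" and value < limit:
            alerts.append(f"{metric}={value} below target {limit}")
        elif direction == "above" and value > limit:
            alerts.append(f"{metric}={value} above limit {limit}")
    return alerts

alerts = evaluate({"gpu_utilization_percent": 42.0,
                   "gpu_temperature_celsius": 88.0,
                   "gpu_power_draw_watts": 650.0})  # two alerts fire
```

In production the same thresholds would live in Prometheus alerting rules; the point here is that each metric carries an explicit target worth encoding, not just graphing.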
Best Practices
- Use Infrastructure-as-Code — Terraform/Pulumi for reproducible GPU clusters
- Implement preemption handling — Checkpoint frequently when using spot instances
- Separate compute and storage — Use shared filesystems for datasets and checkpoints
- Monitor GPU utilization — Low utilization (<50%) means wasted money
- Use mixed instance types — Cheap CPUs for data preprocessing, expensive GPUs for training only
- Automate cluster teardown — Prevent idle GPU instances from burning budget
- Use spot/preemptible instances — 60-70% savings for fault-tolerant workloads
- Cache frequently-used data — Local SSD for active datasets, network storage for archival
- Implement access controls — Multi-tenant clusters need resource quotas and authentication
- Track cost per experiment — Tag resources by project/user for accountability
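Cost-per-experiment tracking from the last bullet can be as simple as aggregating tagged usage records. A sketch (the record shape and rates are hypothetical; in practice these rows come from your cloud billing export, keyed by the resource tags you applied):

```python
from collections import defaultdict

# Hypothetical usage records parsed from a cloud billing export,
# with resources tagged by project and user
usage = [
    {"project": "llm-pretrain", "user": "alice", "gpu_hours": 512, "rate": 2.5},
    {"project": "llm-pretrain", "user": "bob",   "gpu_hours": 128, "rate": 2.5},
    {"project": "vision-ft",    "user": "alice", "gpu_hours": 64,  "rate": 2.5},
]

def cost_by(records, key):
    """Sum GPU cost grouped by a tag ('project' or 'user')."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["gpu_hours"] * r["rate"]
    return dict(totals)

by_project = cost_by(usage, "project")
by_user = cost_by(usage, "user")
```

None of this works unless tagging is enforced at provision time, which is one more argument for the Infrastructure-as-Code bullet above: tags defined in Terraform/Pulumi cannot be forgotten.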
Troubleshooting
GPU underutilization
```shell
# Check data pipeline throughput
nvidia-smi dmon -s u -d 1   # Monitor GPU utilization, 1s interval

# If GPU utilization < 50%, the job is likely bottlenecked on data loading.
# Solutions: more CPU workers, faster storage, prefetching.
```
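The usual fix for a data-loading bottleneck is prefetching: load the next batch on a background thread while the GPU works on the current one. A minimal framework-free sketch (real training code would typically use the DataLoader worker/prefetch settings of its framework instead):

```python
import queue
import threading

def prefetch(batches, depth=2):
    """Wrap a batch iterable so loading overlaps with consumption."""
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking end of stream

    def producer():
        for batch in batches:
            q.put(batch)  # blocks once 'depth' batches are buffered
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is done:
            return
        yield batch

# Usage: the loop body (the "GPU step") overlaps with background loading
out = list(prefetch(range(5)))
```

`depth` bounds memory use: with `depth=2`, at most two batches wait in host memory while one is being consumed.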
Inter-node communication slow
```shell
# Check network bandwidth
iperf3 -c <other-node> -P 8   # Test bandwidth with 8 parallel streams

# Ensure traffic is going over InfiniBand, not Ethernet
ibstat                        # Check IB link status
```
Storage IOPS bottleneck
```shell
# Use local NVMe for hot data.
# Use a parallel filesystem (Lustre, BeeGFS) for shared data.
# Prefetch the next batch while the current one trains.
```