ML Infrastructure Framework
Overview
A comprehensive skill for designing and managing machine learning infrastructure — covering GPU cluster architecture, storage systems, networking, job scheduling, monitoring, and cost management. Enables teams to build and operate reliable, cost-efficient ML platforms that scale from single-GPU experiments to thousand-GPU training runs.
When to Use
- Designing ML infrastructure from scratch
- Choosing between cloud providers for GPU workloads
- Setting up job scheduling and resource management
- Building CI/CD for ML model training and deployment
- Optimizing infrastructure costs for ML workloads
- Planning multi-tenant GPU cluster architecture
Quick Start
```shell
# Infrastructure-as-code for ML cluster

# Option 1: Kubernetes + GPU operator
helm install gpu-operator nvidia/gpu-operator
kubectl apply -f ml-platform/

# Option 2: SLURM cluster
ansible-playbook setup-slurm.yml

# Option 3: SkyPilot (multi-cloud)
pip install skypilot
sky launch --gpus A100:8 train.yaml
```
Architecture Decision Matrix
| Requirement | Solution | Pros | Cons |
|---|---|---|---|
| < 10 GPUs | Single cloud instance | Simple, fast | No fault tolerance |
| 10-100 GPUs | Kubernetes + GPU Operator | Flexible, multi-tenant | Complex setup |
| 100+ GPUs | SLURM + bare metal | Best performance | High maintenance |
| Multi-cloud | SkyPilot | Cost optimization | Added abstraction |
| Spot/preemptible | Any + checkpointing | 60-70% cost savings | Interruption handling |
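The spot/preemptible row above depends entirely on frequent checkpointing: cloud providers typically send SIGTERM shortly before reclaiming a spot node, and the training loop must save state and exit cleanly. A minimal framework-free sketch (`save_checkpoint` and the state dict are illustrative stand-ins for e.g. `torch.save`):

```python
import signal

class PreemptionHandler:
    """Flag SIGTERM so spot interruptions lose minimal work."""

    def __init__(self):
        self.interrupted = False
        # Providers send SIGTERM shortly before reclaiming spot capacity
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.interrupted = True

def save_checkpoint(step, state):
    # Stand-in for framework-specific checkpointing (e.g. torch.save)
    state["step"] = step
    return state

def train(num_steps, checkpoint_every=100):
    handler = PreemptionHandler()
    state = {"step": 0}
    for step in range(1, num_steps + 1):
        # ... one training step ...
        if step % checkpoint_every == 0 or handler.interrupted:
            state = save_checkpoint(step, state)
        if handler.interrupted:
            break  # exit cleanly; resume from the last checkpoint on restart
    return state

state = train(250)  # checkpoints at steps 100 and 200
```

The same pattern works under SLURM (which sends a configurable signal before the time limit) and Kubernetes (SIGTERM on pod eviction).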
Cloud Provider Comparison
| Provider | GPU Types | Pricing (H100/hr) | Networking | Best For |
|---|---|---|---|---|
| AWS | A10G, A100, H100 | ~$32 (p5.48xlarge) | EFA | Enterprise, broad services |
| GCP | A100, H100, TPU | ~$30 (a3-highgpu-8g) | gVNIC | TPU access, GKE |
| Azure | A100, H100 | ~$27 (ND-H100-v5) | InfiniBand | Enterprise, Azure ML |
| Lambda | A100, H100 | ~$20 (8xH100) | InfiniBand | ML-focused, simple |
| CoreWeave | A100, H100 | ~$22 (8xH100) | InfiniBand | GPU-native cloud |
| RunPod | A100, H100 | ~$18 (8xH100) | Varies | Budget, serverless |
Monitoring Stack
```yaml
# Prometheus + Grafana for ML infrastructure
# Key metrics to track:
metrics:
  gpu:
    - gpu_utilization_percent     # Target: >80%
    - gpu_memory_used_bytes       # Monitor for OOM risk
    - gpu_temperature_celsius     # Alert >85°C
    - gpu_power_draw_watts        # Cost correlation
  training:
    - training_loss               # Model convergence
    - throughput_samples_per_sec  # Training speed
    - checkpoint_save_duration    # I/O bottleneck
    - gradient_norm               # Training stability
  infrastructure:
    - node_cpu_utilization        # Data loading bottleneck
    - network_bandwidth_bytes     # Inter-node communication
    - storage_iops                # Data pipeline throughput
    - job_queue_depth             # Scheduling efficiency
```
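The thresholds noted next to the GPU metrics above can be turned into a simple alert check. A sketch (the threshold values mirror the comments in the metric list; the evaluation loop is illustrative, not Prometheus alerting syntax):

```python
# Alert thresholds mirroring the comments in the metric list above
THRESHOLDS = {
    "gpu_utilization_percent": ("below", 80.0),  # target: >80%
    "gpu_temperature_celsius": ("above", 85.0),  # alert >85°C
}

def evaluate(samples):
    """Return alert messages for metric samples violating a threshold."""
    alerts = []
    for metric, value in samples.items():
        if metric not in THRESHOLDS:
            continue
        direction, limit = THRESHOLDS[metric]
        if direction == "below" and value < limit:
            alerts.append(f"{metric}={value} below target {limit}")
        elif direction == "above" and value > limit:
            alerts.append(f"{metric}={value} above limit {limit}")
    return alerts

alerts = evaluate({"gpu_utilization_percent": 42.0,
                   "gpu_temperature_celsius": 88.0,
                   "gpu_power_draw_watts": 650.0})  # two alerts fire
```

In production the same thresholds would live in Prometheus alerting rules; the point here is that each metric carries an explicit target worth encoding, not just graphing.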
Best Practices
- Use Infrastructure-as-Code — Terraform/Pulumi for reproducible GPU clusters
- Implement preemption handling — Checkpoint frequently when using spot instances
- Separate compute and storage — Use shared filesystems for datasets and checkpoints
- Monitor GPU utilization — Low utilization (<50%) means wasted money
- Use mixed instance types — Cheap CPUs for data preprocessing, expensive GPUs for training only
- Automate cluster teardown — Prevent idle GPU instances from burning budget
- Use spot/preemptible instances — 60-70% savings for fault-tolerant workloads
- Cache frequently-used data — Local SSD for active datasets, network storage for archival
- Implement access controls — Multi-tenant clusters need resource quotas and authentication
- Track cost per experiment — Tag resources by project/user for accountability
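Cost-per-experiment tracking from the last bullet can be as simple as aggregating tagged usage records. A sketch (the record shape and rates are hypothetical; in practice these rows come from your cloud billing export, keyed by the resource tags you applied):

```python
from collections import defaultdict

# Hypothetical usage records parsed from a cloud billing export,
# with resources tagged by project and user
usage = [
    {"project": "llm-pretrain", "user": "alice", "gpu_hours": 512, "rate": 2.5},
    {"project": "llm-pretrain", "user": "bob",   "gpu_hours": 128, "rate": 2.5},
    {"project": "vision-ft",    "user": "alice", "gpu_hours": 64,  "rate": 2.5},
]

def cost_by(records, key):
    """Sum GPU cost grouped by a tag ('project' or 'user')."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["gpu_hours"] * r["rate"]
    return dict(totals)

by_project = cost_by(usage, "project")
by_user = cost_by(usage, "user")
```

None of this works unless tagging is enforced at provision time, which is one more argument for the Infrastructure-as-Code bullet above: tags defined in Terraform/Pulumi cannot be forgotten.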
Troubleshooting
GPU underutilization
```shell
# Check data pipeline throughput
nvidia-smi dmon -s u -d 1   # Monitor GPU utilization, 1s interval

# If GPU utilization < 50%, the job is likely bottlenecked on data loading.
# Solutions: more CPU workers, faster storage, prefetching.
```
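The usual fix for a data-loading bottleneck is prefetching: load the next batch on a background thread while the GPU works on the current one. A minimal framework-free sketch (real training code would typically use the DataLoader worker/prefetch settings of its framework instead):

```python
import queue
import threading

def prefetch(batches, depth=2):
    """Wrap a batch iterable so loading overlaps with consumption."""
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking end of stream

    def producer():
        for batch in batches:
            q.put(batch)  # blocks once 'depth' batches are buffered
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is done:
            return
        yield batch

# Usage: the loop body (the "GPU step") overlaps with background loading
out = list(prefetch(range(5)))
```

`depth` bounds memory use: with `depth=2`, at most two batches wait in host memory while one is being consumed.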
Inter-node communication slow
```shell
# Check network bandwidth
iperf3 -c <other-node> -P 8   # Test bandwidth with 8 parallel streams

# Ensure traffic is going over InfiniBand, not Ethernet
ibstat                        # Check IB link status
```
Storage IOPS bottleneck
```shell
# Use local NVMe for hot data.
# Use a parallel filesystem (Lustre, BeeGFS) for shared data.
# Prefetch the next batch while the current one trains.
```