
ML Infrastructure Framework

Overview

A comprehensive skill for designing and managing machine learning infrastructure — covering GPU cluster architecture, storage systems, networking, job scheduling, monitoring, and cost management. Enables teams to build and operate reliable, cost-efficient ML platforms that scale from single-GPU experiments to thousand-GPU training runs.

When to Use

  • Designing ML infrastructure from scratch
  • Choosing between cloud providers for GPU workloads
  • Setting up job scheduling and resource management
  • Building CI/CD for ML model training and deployment
  • Optimizing infrastructure costs for ML workloads
  • Planning multi-tenant GPU cluster architecture

Quick Start

```shell
# Infrastructure-as-code for an ML cluster

# Option 1: Kubernetes + GPU operator
helm install gpu-operator nvidia/gpu-operator
kubectl apply -f ml-platform/

# Option 2: SLURM cluster
ansible-playbook setup-slurm.yml

# Option 3: SkyPilot (multi-cloud)
pip install skypilot
sky launch --gpus A100:8 train.yaml
```
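
A SkyPilot task file like the `train.yaml` referenced above might look as follows. This is a hedged sketch: the setup and run commands, `requirements.txt`, and `train.py` are illustrative placeholders, not part of this skill.

```yaml
# Illustrative SkyPilot task (train.yaml) — commands and paths are placeholders
resources:
  accelerators: A100:8   # matches `sky launch --gpus A100:8`
  use_spot: true         # spot instances for cost savings (see Best Practices)

num_nodes: 1

setup: |
  pip install -r requirements.txt

run: |
  python train.py
```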

Architecture Decision Matrix

| Requirement | Solution | Pros | Cons |
|---|---|---|---|
| < 10 GPUs | Single cloud instance | Simple, fast | No fault tolerance |
| 10-100 GPUs | Kubernetes + GPU Operator | Flexible, multi-tenant | Complex setup |
| 100+ GPUs | SLURM + bare metal | Best performance | High maintenance |
| Multi-cloud | SkyPilot | Cost optimization | Added abstraction |
| Spot/preemptible | Any + checkpointing | 70% cost savings | Interruption handling |
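
The decision matrix above can be sketched as a small helper. The function name and the exact thresholds are illustrative, encoding only the rough tiers from the table:

```python
# Hypothetical helper encoding the decision matrix; thresholds are
# illustrative tiers, not hard rules.
def recommend_stack(num_gpus: int, multi_cloud: bool = False) -> str:
    """Map a GPU count (and multi-cloud need) to an orchestration choice."""
    if multi_cloud:
        return "SkyPilot"
    if num_gpus < 10:
        return "Single cloud instance"
    if num_gpus <= 100:
        return "Kubernetes + GPU Operator"
    return "SLURM + bare metal"

print(recommend_stack(4))                       # Single cloud instance
print(recommend_stack(64))                      # Kubernetes + GPU Operator
print(recommend_stack(512))                     # SLURM + bare metal
print(recommend_stack(512, multi_cloud=True))   # SkyPilot
```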

Cloud Provider Comparison

| Provider | GPU Types | Pricing (8×H100/hr) | Networking | Best For |
|---|---|---|---|---|
| AWS | A10G, A100, H100 | ~$32 (p5.48xlarge) | EFA | Enterprise, broad services |
| GCP | A100, H100, TPU | ~$30 (a3-highgpu-8g) | gVNIC | TPU access, GKE |
| Azure | A100, H100 | ~$27 (ND-H100-v5) | InfiniBand | Enterprise, Azure ML |
| Lambda | A100, H100 | ~$20 (8xH100) | InfiniBand | ML-focused, simple |
| CoreWeave | A100, H100 | ~$22 (8xH100) | InfiniBand | GPU-native cloud |
| RunPod | A100, H100 | ~$18 (8xH100) | Varies | Budget, serverless |
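
Since all the quoted prices are for 8-GPU instances, normalizing to per-GPU cost makes the comparison direct. A quick sketch (prices are the approximate list prices from the table; actual pricing varies by region and commitment):

```python
# Approximate list prices per 8-GPU H100 instance, from the table above.
instance_price_per_hr = {
    "AWS p5.48xlarge": 32.0,
    "GCP a3-highgpu-8g": 30.0,
    "Azure ND-H100-v5": 27.0,
    "Lambda 8xH100": 20.0,
    "CoreWeave 8xH100": 22.0,
    "RunPod 8xH100": 18.0,
}

GPUS_PER_INSTANCE = 8

# Cheapest first, normalized to per-GPU hourly cost
for name, price in sorted(instance_price_per_hr.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${price / GPUS_PER_INSTANCE:.2f}/GPU-hr")
```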

Monitoring Stack

```yaml
# Prometheus + Grafana for ML infrastructure
# Key metrics to track:
metrics:
  gpu:
    - gpu_utilization_percent      # Target: >80%
    - gpu_memory_used_bytes        # Monitor for OOM risk
    - gpu_temperature_celsius      # Alert >85°C
    - gpu_power_draw_watts         # Cost correlation
  training:
    - training_loss                # Model convergence
    - throughput_samples_per_sec   # Training speed
    - checkpoint_save_duration     # I/O bottleneck
    - gradient_norm                # Training stability
  infrastructure:
    - node_cpu_utilization         # Data loading bottleneck
    - network_bandwidth_bytes      # Inter-node communication
    - storage_iops                 # Data pipeline throughput
    - job_queue_depth              # Scheduling efficiency
```
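
For the GPU temperature and utilization thresholds, a Prometheus alerting rule is a natural fit. The sketch below assumes metrics exposed by the NVIDIA DCGM exporter (`DCGM_FI_DEV_GPU_TEMP`, `DCGM_FI_DEV_GPU_UTIL`); if your exporter uses different metric names, adjust the `expr` fields accordingly:

```yaml
# Illustrative Prometheus alerting rules, assuming DCGM exporter metric names
groups:
  - name: gpu-alerts
    rules:
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature above 85°C on {{ $labels.instance }}"
      - alert: GPUUnderutilized
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization below 50% on {{ $labels.instance }}"
```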

Best Practices

  1. Use Infrastructure-as-Code — Terraform/Pulumi for reproducible GPU clusters
  2. Implement preemption handling — Checkpoint frequently when using spot instances
  3. Separate compute and storage — Use shared filesystems for datasets and checkpoints
  4. Monitor GPU utilization — Low utilization (<50%) means wasted money
  5. Use mixed instance types — Cheap CPUs for data preprocessing, expensive GPUs for training only
  6. Automate cluster teardown — Prevent idle GPU instances from burning budget
  7. Use spot/preemptible instances — 60-70% savings for fault-tolerant workloads
  8. Cache frequently-used data — Local SSD for active datasets, network storage for archival
  9. Implement access controls — Multi-tenant clusters need resource quotas and authentication
  10. Track cost per experiment — Tag resources by project/user for accountability

Troubleshooting

GPU underutilization

```shell
# Check data pipeline throughput
nvidia-smi dmon -s u -d 1   # Monitor GPU utilization

# If GPU util < 50%, likely bottlenecked on data loading
# Solution: more CPU workers, faster storage, prefetch
```
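
The prefetch fix can be illustrated framework-agnostically: a background thread loads the next batches into a bounded queue while the consumer (standing in for the GPU) works on the current one, so I/O and compute overlap. `load_batch` is a stand-in for real disk/network reads:

```python
# Sketch of host-side prefetching: overlap data loading with "compute".
# load_batch is an illustrative stand-in for real I/O.
import queue
import threading
import time

def load_batch(i):
    time.sleep(0.01)  # simulate disk/network latency
    return list(range(i, i + 4))

def prefetcher(num_batches, out_q):
    for i in range(num_batches):
        out_q.put(load_batch(i))  # blocks when the queue is full
    out_q.put(None)  # sentinel: no more data

q = queue.Queue(maxsize=2)  # bounded: at most 2 batches loaded ahead
threading.Thread(target=prefetcher, args=(8, q), daemon=True).start()

consumed = []
while (batch := q.get()) is not None:
    consumed.append(batch)  # the training step would run here, overlapping I/O

print(len(consumed))  # 8
```

Frameworks like PyTorch's `DataLoader` provide the same pattern via worker processes and `prefetch_factor`, which is usually the right tool in practice.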

Inter-node communication slow

```shell
# Check network bandwidth between nodes
iperf3 -c <other-node> -P 8   # Test with 8 parallel streams

# Ensure traffic uses InfiniBand, not Ethernet
ibstat   # Check IB link status
```

Storage IOPS bottleneck

```shell
# Use local NVMe for hot data
# Parallel filesystem (Lustre, BeeGFS) for shared data
# Prefetch next batch while training
```