Ultimate Infrastructure SkyPilot
A comprehensive skill for orchestrating multi-cloud workloads. Includes structured workflows, validation checks, and reusable patterns for AI research.
SkyPilot: Multi-Cloud ML Infrastructure
Overview
A comprehensive skill for running ML workloads across clouds using SkyPilot, the framework that automatically finds the cheapest GPU instances across AWS, GCP, Azure, Lambda, RunPod, and more. SkyPilot handles provisioning, setup, execution, and teardown, letting you focus on ML while it optimizes cost and availability.
When to Use
- Running ML training across multiple cloud providers
- Need automatic spot instance management with failover
- Want cost optimization across clouds
- Running development clusters that auto-terminate
- Need reproducible cloud environments
- Managing multi-node distributed training
Quick Start
    # Install
    pip install skypilot

    # Check cloud credentials
    sky check

    # Launch a GPU instance
    sky launch --gpus A100 my_task.yaml

    # Or quick interactive session
    sky launch --gpus A100 --idle-minutes-to-autostop 30
Task Definition
    # train.yaml
    name: llm-training

    resources:
      cloud: aws  # or gcp, azure, lambda, any
      accelerators: A100:8
      disk_size: 500
      use_spot: true

    setup: |
      pip install torch transformers accelerate
      git clone https://github.com/my-org/training-code

    run: |
      cd training-code
      accelerate launch --num_processes=8 train.py \
        --model meta-llama/Llama-3-8B \
        --output_dir /data/checkpoints
Multi-Cloud Cost Optimization
    # Let SkyPilot find the cheapest option across all clouds
    sky launch --gpus A100:8 train.yaml

    # SkyPilot automatically:
    # 1. Checks prices across AWS, GCP, Azure, Lambda, etc.
    # 2. Tries spot instances first (60-70% cheaper)
    # 3. Falls back to on-demand if spot unavailable
    # 4. Selects cheapest region

    # Show pricing across clouds
    sky show-gpus --all
GPU Pricing (approximate):

| GPU | Cloud | Region | Spot $/hr | On-Demand $/hr |
|---|---|---|---|---|
| A100 | Lambda | us-east | n/a | $1.29 |
| A100 | GCP | us-east | $0.98 | $3.67 |
| A100 | AWS | us-east | $1.12 | $4.10 |
| H100 | Lambda | us-east | n/a | $2.49 |
| H100 | GCP | us-west | $1.85 | $11.49 |
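The selection logic described above (compare prices, prefer spot where it is listed, fall back to on-demand) can be illustrated with a small Python sketch over the approximate figures in the table. This is not SkyPilot's optimizer; the `OFFERS` list and the `cheapest` and `spot_discount` helpers are hypothetical names introduced only for this example.

```python
# Approximate prices copied from the pricing table:
# (gpu, cloud, spot $/hr, on-demand $/hr). None means no spot price is listed.
OFFERS = [
    ("A100", "Lambda", None, 1.29),
    ("A100", "GCP", 0.98, 3.67),
    ("A100", "AWS", 1.12, 4.10),
    ("H100", "Lambda", None, 2.49),
    ("H100", "GCP", 1.85, 11.49),
]


def cheapest(gpu, allow_spot=True):
    """Return (cloud, kind, price) for the lowest listed hourly price of `gpu`."""
    candidates = []
    for name, cloud, spot, on_demand in OFFERS:
        if name != gpu:
            continue
        candidates.append((on_demand, cloud, "on-demand"))
        if allow_spot and spot is not None:
            candidates.append((spot, cloud, "spot"))
    if not candidates:
        raise ValueError(f"no offers listed for {gpu}")
    price, cloud, kind = min(candidates)  # tuples compare by price first
    return cloud, kind, price


def spot_discount(spot, on_demand):
    """Fractional saving of spot relative to on-demand on the same cloud."""
    return 1 - spot / on_demand


if __name__ == "__main__":
    print(cheapest("A100"))
    print(cheapest("A100", allow_spot=False))
    print(f"GCP A100 spot discount: {spot_discount(0.98, 3.67):.0%}")
```

With these listed figures, the cheapest A100 is GCP spot, and the cheapest on-demand A100 is Lambda. Real prices change often, so treat the numbers as placeholders and check `sky show-gpus` before relying on them.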
Spot Instance Management
    # train_with_spot.yaml
    resources:
      accelerators: A100:8
      use_spot: true
      # Auto-recover on preemption (newer SkyPilot releases name this field job_recovery)
      spot_recovery: FAILOVER

    # On preemption, SkyPilot:
    # - Relaunches on a different spot instance/cloud
    # - Reruns the `run` command, which resumes from the last checkpoint
    # Saving checkpoints is the training script's job; SkyPilot does not do it for you.

    run: |
      python train.py \
        --resume-from-checkpoint /data/checkpoints/latest \
        --checkpoint-dir /data/checkpoints
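Recovery only helps if the training script can find and load its own checkpoints after a relaunch. The following is a minimal sketch of that resume pattern in Python, assuming a hypothetical `step-N.json` naming scheme and a toy training loop; a real `train.py` would save model and optimizer state with its framework's own utilities.

```python
import json
import re
from pathlib import Path

# Hypothetical naming scheme for this sketch: step-<N>.json
STEP_PATTERN = re.compile(r"^step-(\d+)\.json$")


def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-numbered checkpoint, or None if none exist."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = STEP_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def save_checkpoint(checkpoint_dir, step, state):
    """Write to a temp file, then rename, so a preemption mid-write
    cannot leave a truncated file under the final name."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    final = directory / f"step-{step}.json"
    tmp = directory / f"step-{step}.json.tmp"
    tmp.write_text(json.dumps({"step": step, "state": state}))
    tmp.replace(final)
    return final


def train(checkpoint_dir, total_steps, save_every):
    """Toy loop: resume from the latest checkpoint if present, else start at 0.
    Returns the step the run resumed from."""
    start, state = 0, {"loss": None}
    ckpt = latest_checkpoint(checkpoint_dir)
    if ckpt is not None:
        saved = json.loads(ckpt.read_text())
        start, state = saved["step"], saved["state"]
    for step in range(start + 1, total_steps + 1):
        state = {"loss": 1.0 / step}  # stand-in for a real training step
        if step % save_every == 0:
            save_checkpoint(checkpoint_dir, step, state)
    return start
```

Because the script decides where to resume, the same `run` command works for both the first launch and every relaunch after a preemption. The checkpoint directory should live on storage that outlives the instance, such as a mounted bucket.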
Multi-Node Training
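    # distributed_train.yaml
    name: multi-node-training
    num_nodes: 4  # 4 nodes × 8 GPUs = 32 GPUs

    resources:
      accelerators: H100:8
      use_spot: true

    setup: |
      pip install deepspeed transformers

    run: |
      # DeepSpeed expects a hostfile of "<host> slots=<gpus>" lines.
      # Build it from the node IPs SkyPilot exports, then launch from the head node.
      echo "$SKYPILOT_NODE_IPS" | sed 's/$/ slots=8/' > ~/hostfile
      if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
        deepspeed --num_gpus=8 --num_nodes=4 \
          --hostfile ~/hostfile \
          train.py --deepspeed ds_config.json
      fi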
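Inside a multi-node task, each node can work out its role from the environment variables SkyPilot sets (`SKYPILOT_NODE_IPS`, `SKYPILOT_NODE_RANK`, `SKYPILOT_NUM_GPUS_PER_NODE`). Below is a minimal Python sketch of that derivation. The `distributed_env` and `hostfile_lines` helpers and the port number are assumptions made for illustration, not part of SkyPilot.

```python
import os


def distributed_env(environ=None, port=29500):
    """Derive rendezvous settings from SkyPilot's per-node environment variables."""
    env = os.environ if environ is None else environ
    # One IP per line, ordered by node rank, so the first entry is the head node.
    node_ips = env["SKYPILOT_NODE_IPS"].split()
    gpus_per_node = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
    return {
        "master_addr": node_ips[0],
        "master_port": port,  # arbitrary port chosen for this sketch
        "num_nodes": len(node_ips),
        "node_rank": int(env["SKYPILOT_NODE_RANK"]),
        "world_size": len(node_ips) * gpus_per_node,
    }


def hostfile_lines(environ=None):
    """Render DeepSpeed-style hostfile lines: '<ip> slots=<gpus per node>'."""
    env = os.environ if environ is None else environ
    slots = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
    return [f"{ip} slots={slots}" for ip in env["SKYPILOT_NODE_IPS"].split()]


if __name__ == "__main__":
    # Example values standing in for what a 4-node, 8-GPU task would see.
    example = {
        "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2\n10.0.0.3\n10.0.0.4",
        "SKYPILOT_NODE_RANK": "0",
        "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    }
    print(distributed_env(example))
    print("\n".join(hostfile_lines(example)))
```

Passing the environment in as a dictionary keeps the helpers easy to test off-cluster; on a real node, calling them with no argument reads `os.environ`.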
Managed Jobs
    # Submit a managed job (survives spot preemptions)
    sky jobs launch train.yaml

    # Check job status
    sky jobs queue

    # Stream logs
    sky jobs logs JOB_ID

    # Cancel job
    sky jobs cancel JOB_ID
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| cloud | any | Target cloud (aws, gcp, azure, lambda, any) |
| accelerators | None | GPU type and count (A100:8, H100:4) |
| use_spot | false | Use spot/preemptible instances |
| spot_recovery | FAILOVER | Recovery strategy |
| disk_size | 256 | Disk size in GB |
| disk_tier | medium | Disk performance tier |
| num_nodes | 1 | Number of nodes |
| idle_minutes_to_autostop | None | Auto-terminate after idle (minutes) |
| region | None | Specific cloud region |
| zone | None | Specific availability zone |
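To make the defaults in the table concrete, here is a minimal Python sketch that merges user overrides onto them and parses an accelerator spec such as `A100:8`. The default values are copied from the table; the helper names (`parse_accelerators`, `resolve`, `total_gpus`) are hypothetical and not part of SkyPilot, which performs its own validation.

```python
# Defaults as listed in the configuration reference table.
DEFAULTS = {
    "cloud": "any",
    "accelerators": None,
    "use_spot": False,
    "spot_recovery": "FAILOVER",
    "disk_size": 256,
    "disk_tier": "medium",
    "num_nodes": 1,
    "idle_minutes_to_autostop": None,
    "region": None,
    "zone": None,
}


def parse_accelerators(spec):
    """Split 'A100:8' into ('A100', 8); a bare 'A100' means a count of 1."""
    name, sep, count = spec.partition(":")
    if not name:
        raise ValueError(f"invalid accelerator spec: {spec!r}")
    return name, int(count) if sep else 1


def resolve(overrides):
    """Merge overrides onto the defaults, rejecting unknown parameter names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    if config["num_nodes"] < 1:
        raise ValueError("num_nodes must be at least 1")
    return config


def total_gpus(config):
    """GPUs across the whole cluster: nodes times accelerators per node."""
    if config["accelerators"] is None:
        return 0
    _, count = parse_accelerators(config["accelerators"])
    return config["num_nodes"] * count


if __name__ == "__main__":
    cfg = resolve({"accelerators": "H100:8", "num_nodes": 4, "use_spot": True})
    print(cfg["disk_size"], total_gpus(cfg))
```

Rejecting unknown keys catches typos such as `use-spot` early, before any instance is provisioned.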
Best Practices
- Use `cloud: any`: Let SkyPilot find the cheapest option across all providers
- Enable spot instances: 60-70% savings with automatic failover
- Set auto-stop: `--idle-minutes-to-autostop 30` prevents forgotten instances
- Use managed jobs: `sky jobs launch` handles preemption recovery automatically
- Mount cloud storage: Use SkyPilot file mounts for datasets and checkpoints
- Use `sky show-gpus`: Check real-time pricing before launching
- Add setup steps: Install dependencies in `setup:` for reproducibility
- Use multi-node for large models: SkyPilot handles network setup automatically
- Check credentials: Run `sky check` to verify all cloud providers are configured
- Use YAML tasks: Version control your infrastructure alongside code
Troubleshooting
No available instances
    # Check availability across clouds
    # (--all lists every GPU type and is not combined with a GPU name)
    sky show-gpus A100

    # Try a different GPU type or cloud
    sky launch --gpus H100 train.yaml

    # Or use on-demand instead of spot (in the task YAML)
    resources:
      use_spot: false
Spot preemption during training
    # Enable spot recovery with checkpointing
    resources:
      use_spot: true
      spot_recovery: FAILOVER

    # Ensure your training script saves checkpoints frequently
Multi-node networking issues
    # SkyPilot uses SSH tunneling by default
    # For better performance, ensure nodes are in the same region/zone
    resources:
      region: us-east-1
      zone: us-east-1a