SkyPilot — Multi-Cloud ML Infrastructure

Overview

A comprehensive skill for running ML workloads across clouds using SkyPilot — the framework that automatically finds the cheapest GPU instances across AWS, GCP, Azure, Lambda, RunPod, and more. SkyPilot handles provisioning, setup, execution, and teardown — letting you focus on ML while it optimizes cost and availability.

When to Use

  • Running ML training across multiple cloud providers
  • Need automatic spot instance management with failover
  • Want cost optimization across clouds
  • Running development clusters that auto-terminate
  • Need reproducible cloud environments
  • Managing multi-node distributed training

Quick Start

```bash
# Install
pip install skypilot

# Check cloud credentials
sky check

# Launch a GPU instance
sky launch --gpus A100 my_task.yaml

# Or a quick interactive session
sky launch --gpus A100 --idle-minutes-to-autostop 30
```

Task Definition

```yaml
# train.yaml
name: llm-training

resources:
  cloud: aws            # or gcp, azure, lambda; omit to search all clouds
  accelerators: A100:8
  disk_size: 500
  use_spot: true

setup: |
  pip install torch transformers accelerate
  git clone https://github.com/my-org/training-code

run: |
  cd training-code
  accelerate launch --num_processes=8 train.py \
    --model meta-llama/Llama-3-8B \
    --output_dir /data/checkpoints
```

Multi-Cloud Cost Optimization

```bash
# Let SkyPilot find the cheapest option across all clouds
sky launch --gpus A100:8 train.yaml

# SkyPilot automatically:
# 1. Checks prices across AWS, GCP, Azure, Lambda, etc.
# 2. Tries spot instances first (often 60-70% cheaper)
# 3. Falls back to on-demand if spot is unavailable
# 4. Selects the cheapest region

# Show pricing across clouds
sky show-gpus --all
```

GPU pricing (approximate):

```text
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ GPU    │ Cloud  │ Region  │ Spot $/hr  │ OnDemand    │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│ A100   │ Lambda │ us-east │    —       │ $1.29       │
│ A100   │ GCP    │ us-east │ $0.98      │ $3.67       │
│ A100   │ AWS    │ us-east │ $1.12      │ $4.10       │
│ H100   │ Lambda │ us-east │    —       │ $2.49       │
│ H100   │ GCP    │ us-west │ $1.85      │ $11.49      │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
```
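The spot-first, cheapest-wins behavior described above can be sketched in a few lines. The offer list here is hypothetical (loosely based on the table); the real selection is done by SkyPilot's optimizer, not user code:

```python
# Toy version of the selection logic: prefer the cheapest spot offer,
# fall back to the cheapest on-demand offer if no spot capacity exists.
offers = [
    {"cloud": "AWS",    "spot": 1.12, "on_demand": 4.10},
    {"cloud": "GCP",    "spot": 0.98, "on_demand": 3.67},
    {"cloud": "Lambda", "spot": None, "on_demand": 1.29},  # no spot tier
]

def pick(offers, use_spot=True):
    """Return the cheapest offer, trying spot first when requested."""
    if use_spot:
        spot = [o for o in offers if o["spot"] is not None]
        if spot:
            return min(spot, key=lambda o: o["spot"])
    return min(offers, key=lambda o: o["on_demand"])

print(pick(offers)["cloud"])                      # cheapest spot
print(pick(offers, use_spot=False)["cloud"])      # cheapest on-demand
```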

Spot Instance Management

```yaml
# train_with_spot.yaml
resources:
  accelerators: A100:8
  use_spot: true
  spot_recovery: FAILOVER   # auto-relaunch on preemption

# On preemption, SkyPilot relaunches on a different spot instance (or cloud)
# and re-runs the task; your script resumes from its last saved checkpoint.
run: |
  python train.py \
    --resume-from-checkpoint /data/checkpoints/latest \
    --checkpoint-dir /data/checkpoints
```
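Recovery only works if the training script itself can pick up where it left off. A minimal sketch of that checkpoint/resume contract (the file layout and helper names are assumptions for illustration, not SkyPilot APIs):

```python
import json
import os

def save_checkpoint(ckpt_dir: str, step: int, state: dict) -> None:
    """Write a checkpoint, then atomically update the 'latest' pointer,
    so a preemption mid-write never leaves a dangling reference."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step_{step}.json")
    with open(path, "w") as f:
        json.dump({"step": step, **state}, f)
    tmp = os.path.join(ckpt_dir, "latest.tmp")
    with open(tmp, "w") as f:
        f.write(path)
    os.replace(tmp, os.path.join(ckpt_dir, "latest"))  # atomic rename

def load_latest(ckpt_dir: str):
    """Return the most recent checkpoint dict, or None on a fresh start."""
    pointer = os.path.join(ckpt_dir, "latest")
    if not os.path.exists(pointer):
        return None
    with open(pointer) as f:
        path = f.read().strip()
    with open(path) as f:
        return json.load(f)
```

On relaunch, the run command calls `load_latest()` first; `None` means a fresh start, anything else means resume from that step.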

Multi-Node Training

```yaml
# distributed_train.yaml
name: multi-node-training
num_nodes: 4   # 4 nodes Ɨ 8 GPUs = 32 GPUs

resources:
  accelerators: H100:8
  use_spot: true

setup: |
  pip install deepspeed transformers

run: |
  # The run section executes on every node; only the head node (rank 0)
  # drives the DeepSpeed launcher. SkyPilot exports SKYPILOT_NODE_RANK
  # and SKYPILOT_NODE_IPS on each node.
  if [ "${SKYPILOT_NODE_RANK}" = "0" ]; then
    echo "$SKYPILOT_NODE_IPS" | awk '{print $1" slots=8"}' > /tmp/hostfile
    deepspeed --num_gpus=8 --num_nodes=4 --hostfile /tmp/hostfile \
      train.py --deepspeed ds_config.json
  fi
```
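Launchers derive each process's global rank from the node index SkyPilot exports (`SKYPILOT_NODE_RANK`, 0-based) and the per-node GPU count. A small sketch of that arithmetic (the 8-GPU default mirrors the config above; `global_rank` is a hypothetical helper):

```python
import os

def global_rank(local_rank: int, gpus_per_node: int = 8) -> int:
    """Global process rank = node_rank * gpus_per_node + local_rank.
    SKYPILOT_NODE_RANK is set by SkyPilot on every node (0 on the head)."""
    node_rank = int(os.environ.get("SKYPILOT_NODE_RANK", "0"))
    return node_rank * gpus_per_node + local_rank
```

So GPU 3 on node 2 of the 4-node cluster above is global rank 2 Ɨ 8 + 3 = 19, out of 32 processes total.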

Managed Jobs

```bash
# Submit a managed job (survives spot preemptions)
sky jobs launch train.yaml

# Check job status
sky jobs queue

# Stream logs
sky jobs logs JOB_ID

# Cancel a job
sky jobs cancel JOB_ID
```

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| `cloud` | unset (any) | Target cloud (`aws`, `gcp`, `azure`, `lambda`); leave unset to search all |
| `accelerators` | None | GPU type and count (`A100:8`, `H100:4`) |
| `use_spot` | false | Use spot/preemptible instances |
| `spot_recovery` | FAILOVER | Recovery strategy for preempted spot jobs |
| `disk_size` | 256 | Disk size in GB |
| `disk_tier` | medium | Disk performance tier |
| `num_nodes` | 1 | Number of nodes |
| `idle_minutes_to_autostop` | None | Auto-stop the cluster after this many idle minutes |
| `region` | None | Specific cloud region |
| `zone` | None | Specific availability zone |

Best Practices

  1. Leave cloud unset — Let SkyPilot pick the cheapest option across all configured providers
  2. Enable spot instances — Often 60-70% savings, with automatic failover on preemption
  3. Set auto-stop — --idle-minutes-to-autostop 30 prevents forgotten instances
  4. Use managed jobs — sky jobs launch handles preemption recovery automatically
  5. Mount cloud storage — Use SkyPilot file mounts for datasets and checkpoints
  6. Use sky show-gpus — Check real-time pricing before launching
  7. Add setup steps — Install dependencies in setup: for reproducibility
  8. Use multi-node for large models — SkyPilot handles network setup automatically
  9. Check credentials — Run sky check to verify all cloud providers are configured
  10. Use YAML tasks — Version control your infrastructure alongside code

Troubleshooting

No available instances

```bash
# Check availability across clouds
sky show-gpus A100 --all

# Try a different GPU type or cloud
sky launch --gpus H100 train.yaml
```

```yaml
# Or use on-demand instead of spot
resources:
  use_spot: false
```

Spot preemption during training

```yaml
# Enable spot recovery with checkpointing
resources:
  use_spot: true
  spot_recovery: FAILOVER
# Also ensure your training script saves checkpoints frequently
```

Multi-node networking issues

```yaml
# SkyPilot uses SSH tunneling by default.
# For better performance, keep all nodes in the same region/zone:
resources:
  region: us-east-1
  zone: us-east-1a
```