SkyPilot — Multi-Cloud ML Infrastructure

Overview

A comprehensive skill for running ML workloads across clouds using SkyPilot — the framework that automatically finds the cheapest GPU instances across AWS, GCP, Azure, Lambda, RunPod, and more. SkyPilot handles provisioning, setup, execution, and teardown — letting you focus on ML while it optimizes cost and availability.

When to Use

  • Running ML training across multiple cloud providers
  • Need automatic spot instance management with failover
  • Want cost optimization across clouds
  • Running development clusters that auto-terminate
  • Need reproducible cloud environments
  • Managing multi-node distributed training

Quick Start

```bash
# Install
pip install skypilot

# Check cloud credentials
sky check

# Launch a GPU instance
sky launch --gpus A100 my_task.yaml

# Or a quick interactive session
sky launch --gpus A100 --idle-minutes-to-autostop 30
```

Task Definition

```yaml
# train.yaml
name: llm-training

resources:
  cloud: aws            # or gcp, azure, lambda; omit to search all clouds
  accelerators: A100:8
  disk_size: 500
  use_spot: true

setup: |
  pip install torch transformers accelerate
  git clone https://github.com/my-org/training-code

run: |
  cd training-code
  accelerate launch --num_processes=8 train.py \
    --model meta-llama/Llama-3-8B \
    --output_dir /data/checkpoints
```

Multi-Cloud Cost Optimization

```bash
# Let SkyPilot find the cheapest option across all clouds
sky launch --gpus A100:8 train.yaml

# SkyPilot automatically:
# 1. Checks prices across AWS, GCP, Azure, Lambda, etc.
# 2. Tries spot instances first (often 60-70% cheaper)
# 3. Falls back to on-demand if spot is unavailable
# 4. Selects the cheapest region

# Show pricing across clouds
sky show-gpus --all
```

GPU pricing (approximate):

```text
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ GPU    │ Cloud  │ Region  │ Spot $/hr  │ OnDemand    │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│ A100   │ Lambda │ us-east │    —       │ $1.29       │
│ A100   │ GCP    │ us-east │ $0.98      │ $3.67       │
│ A100   │ AWS    │ us-east │ $1.12      │ $4.10       │
│ H100   │ Lambda │ us-east │    —       │ $2.49       │
│ H100   │ GCP    │ us-west │ $1.85      │ $11.49      │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
```
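The spot-first, cheapest-wins behavior described above can be sketched in a few lines. The offer list here is hypothetical (loosely based on the table); the real selection is done by SkyPilot's optimizer, not user code:

```python
# Toy version of the selection logic: prefer the cheapest spot offer,
# fall back to the cheapest on-demand offer if no spot capacity exists.
offers = [
    {"cloud": "AWS",    "spot": 1.12, "on_demand": 4.10},
    {"cloud": "GCP",    "spot": 0.98, "on_demand": 3.67},
    {"cloud": "Lambda", "spot": None, "on_demand": 1.29},  # no spot tier
]

def pick(offers, use_spot=True):
    """Return the cheapest offer, trying spot first when requested."""
    if use_spot:
        spot = [o for o in offers if o["spot"] is not None]
        if spot:
            return min(spot, key=lambda o: o["spot"])
    return min(offers, key=lambda o: o["on_demand"])

print(pick(offers)["cloud"])                      # cheapest spot
print(pick(offers, use_spot=False)["cloud"])      # cheapest on-demand
```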

Spot Instance Management

```yaml
# train_with_spot.yaml
resources:
  accelerators: A100:8
  use_spot: true
  spot_recovery: FAILOVER   # auto-relaunch on preemption

# On preemption, SkyPilot relaunches on a different spot instance (or cloud)
# and re-runs the task; your script resumes from its last saved checkpoint.
run: |
  python train.py \
    --resume-from-checkpoint /data/checkpoints/latest \
    --checkpoint-dir /data/checkpoints
```
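Recovery only works if the training script itself can pick up where it left off. A minimal sketch of that checkpoint/resume contract (the file layout and helper names are assumptions for illustration, not SkyPilot APIs):

```python
import json
import os

def save_checkpoint(ckpt_dir: str, step: int, state: dict) -> None:
    """Write a checkpoint, then atomically update the 'latest' pointer,
    so a preemption mid-write never leaves a dangling reference."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step_{step}.json")
    with open(path, "w") as f:
        json.dump({"step": step, **state}, f)
    tmp = os.path.join(ckpt_dir, "latest.tmp")
    with open(tmp, "w") as f:
        f.write(path)
    os.replace(tmp, os.path.join(ckpt_dir, "latest"))  # atomic rename

def load_latest(ckpt_dir: str):
    """Return the most recent checkpoint dict, or None on a fresh start."""
    pointer = os.path.join(ckpt_dir, "latest")
    if not os.path.exists(pointer):
        return None
    with open(pointer) as f:
        path = f.read().strip()
    with open(path) as f:
        return json.load(f)
```

On relaunch, the run command calls `load_latest()` first; `None` means a fresh start, anything else means resume from that step.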

Multi-Node Training

```yaml
# distributed_train.yaml
name: multi-node-training
num_nodes: 4   # 4 nodes Ɨ 8 GPUs = 32 GPUs

resources:
  accelerators: H100:8
  use_spot: true

setup: |
  pip install deepspeed transformers

run: |
  # The run section executes on every node; only the head node (rank 0)
  # drives the DeepSpeed launcher. SkyPilot exports SKYPILOT_NODE_RANK
  # and SKYPILOT_NODE_IPS on each node.
  if [ "${SKYPILOT_NODE_RANK}" = "0" ]; then
    echo "$SKYPILOT_NODE_IPS" | awk '{print $1" slots=8"}' > /tmp/hostfile
    deepspeed --num_gpus=8 --num_nodes=4 --hostfile /tmp/hostfile \
      train.py --deepspeed ds_config.json
  fi
```
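Launchers derive each process's global rank from the node index SkyPilot exports (`SKYPILOT_NODE_RANK`, 0-based) and the per-node GPU count. A small sketch of that arithmetic (the 8-GPU default mirrors the config above; `global_rank` is a hypothetical helper):

```python
import os

def global_rank(local_rank: int, gpus_per_node: int = 8) -> int:
    """Global process rank = node_rank * gpus_per_node + local_rank.
    SKYPILOT_NODE_RANK is set by SkyPilot on every node (0 on the head)."""
    node_rank = int(os.environ.get("SKYPILOT_NODE_RANK", "0"))
    return node_rank * gpus_per_node + local_rank
```

So GPU 3 on node 2 of the 4-node cluster above is global rank 2 Ɨ 8 + 3 = 19, out of 32 processes total.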

Managed Jobs

```bash
# Submit a managed job (survives spot preemptions)
sky jobs launch train.yaml

# Check job status
sky jobs queue

# Stream logs
sky jobs logs JOB_ID

# Cancel a job
sky jobs cancel JOB_ID
```

Configuration Reference

| Parameter | Default | Description |
|---|---|---|
| `cloud` | unset (any) | Target cloud (`aws`, `gcp`, `azure`, `lambda`); leave unset to search all |
| `accelerators` | None | GPU type and count (`A100:8`, `H100:4`) |
| `use_spot` | false | Use spot/preemptible instances |
| `spot_recovery` | FAILOVER | Recovery strategy for preempted spot jobs |
| `disk_size` | 256 | Disk size in GB |
| `disk_tier` | medium | Disk performance tier |
| `num_nodes` | 1 | Number of nodes |
| `idle_minutes_to_autostop` | None | Auto-stop the cluster after this many idle minutes |
| `region` | None | Specific cloud region |
| `zone` | None | Specific availability zone |

Best Practices

  1. Leave cloud unset — Let SkyPilot pick the cheapest option across all configured providers
  2. Enable spot instances — Often 60-70% savings, with automatic failover on preemption
  3. Set auto-stop — --idle-minutes-to-autostop 30 prevents forgotten instances
  4. Use managed jobs — sky jobs launch handles preemption recovery automatically
  5. Mount cloud storage — Use SkyPilot file mounts for datasets and checkpoints
  6. Use sky show-gpus — Check real-time pricing before launching
  7. Add setup steps — Install dependencies in setup: for reproducibility
  8. Use multi-node for large models — SkyPilot handles network setup automatically
  9. Check credentials — Run sky check to verify all cloud providers are configured
  10. Use YAML tasks — Version control your infrastructure alongside code

Troubleshooting

No available instances

```bash
# Check availability across clouds
sky show-gpus A100 --all

# Try a different GPU type or cloud
sky launch --gpus H100 train.yaml
```

```yaml
# Or use on-demand instead of spot
resources:
  use_spot: false
```

Spot preemption during training

```yaml
# Enable spot recovery with checkpointing
resources:
  use_spot: true
  spot_recovery: FAILOVER
# Also ensure your training script saves checkpoints frequently
```

Multi-node networking issues

```yaml
# SkyPilot uses SSH tunneling by default.
# For better performance, keep all nodes in the same region/zone:
resources:
  region: us-east-1
  zone: us-east-1a
```