Infrastructure Lambda Elite

Production-ready skill for managing reserved and on-demand GPU cloud instances. Includes structured workflows, validation checks, and reusable patterns for AI research.


Lambda Labs GPU Cloud Infrastructure

Overview

A comprehensive skill for provisioning and managing GPU compute infrastructure on Lambda Labs — the cloud platform offering on-demand and reserved NVIDIA H100, A100, and A10 instances for AI/ML workloads. Covers instance management, multi-node training setup, persistent storage, cost optimization, and integration with popular ML frameworks.

When to Use

  • Need GPU instances for ML training or inference
  • Running distributed training across multiple GPU nodes
  • Want simpler pricing than AWS/GCP for GPU workloads
  • Need persistent storage for datasets and checkpoints
  • Setting up development environments for ML
  • Cost-optimizing GPU compute for training runs

Quick Start

```bash
# Install Lambda CLI
pip install lambda-cloud

# Configure credentials
lambda config set api-key YOUR_API_KEY

# List available instance types
lambda instances list-types

# Launch an instance
lambda instances create \
  --instance-type gpu_8x_h100_sxm \
  --region us-east-1 \
  --ssh-key my-key \
  --name "training-node"

# SSH into instance
ssh ubuntu@<instance-ip>
```

Instance Types

| Instance | GPUs | GPU Memory | vCPUs | RAM    | Storage     | Price/hr |
|----------|------|------------|-------|--------|-------------|----------|
| 1x A10   | 1    | 24 GB      | 30    | 200 GB | 512 GB SSD  | ~$0.75   |
| 1x A100  | 1    | 40/80 GB   | 30    | 200 GB | 512 GB SSD  | ~$1.29   |
| 8x A100  | 8    | 640 GB     | 124   | 1.8 TB | 6.1 TB SSD  | ~$10.32  |
| 1x H100  | 1    | 80 GB      | 26    | 200 GB | 512 GB SSD  | ~$2.49   |
| 8x H100  | 8    | 640 GB     | 208   | 1.8 TB | 6.1 TB SSD  | ~$19.92  |
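With flat hourly rates, the cost of a training run can be estimated up front. A minimal sketch using the approximate on-demand prices from the table above (`estimate_cost` is a hypothetical helper; actual billing may differ):

```bash
# Rough run-cost estimator using the approximate on-demand rates above.
# Usage: estimate_cost <price_per_hour> <num_nodes> <hours>
estimate_cost() {
  price=$1; nodes=$2; hours=$3
  # awk handles the floating-point math portably
  awk -v p="$price" -v n="$nodes" -v h="$hours" \
    'BEGIN { printf "%.2f\n", p * n * h }'
}

# Example: 4 nodes of 8x H100 (~$19.92/hr each) for a 72-hour run
estimate_cost 19.92 4 72
```

A quick estimate like this is also useful when deciding whether a reserved-instance quote beats on-demand for a long run.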

Multi-Node Training Setup

```bash
# Launch 4 nodes
for i in $(seq 0 3); do
  lambda instances create \
    --instance-type gpu_8x_h100_sxm \
    --region us-east-1 \
    --ssh-key my-key \
    --name "train-node-$i" \
    --filesystem my-dataset
done

# On each node, set environment
export MASTER_ADDR=<node-0-ip>
export MASTER_PORT=29500
export WORLD_SIZE=32   # 4 nodes × 8 GPUs
export NODE_RANK=$i    # 0, 1, 2, 3

# Launch training
torchrun --nproc_per_node=8 \
  --nnodes=4 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  train.py
```
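Since every node runs the same `torchrun` command except for its rank, it can help to generate the invocation from a single helper instead of editing it on each node by hand. A sketch assuming the 4-node × 8-GPU layout above (`print_torchrun_cmd` is a hypothetical helper name):

```bash
# Prints the torchrun invocation for a given node rank, so all nodes
# run an identical script parameterized only by rank and master address.
# Assumes 4 nodes x 8 GPUs (WORLD_SIZE=32) as in the example above.
print_torchrun_cmd() {
  node_rank=$1; master_addr=$2
  nnodes=4; nproc=8
  echo "torchrun --nproc_per_node=$nproc --nnodes=$nnodes" \
       "--node_rank=$node_rank --master_addr=$master_addr" \
       "--master_port=29500 train.py"
}

# On node 2, with node 0 at 10.0.0.1:
print_torchrun_cmd 2 10.0.0.1
```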

Persistent Storage

```bash
# Create a persistent filesystem
lambda filesystems create \
  --name my-dataset \
  --region us-east-1

# Attach to instance at launch
lambda instances create \
  --instance-type gpu_8x_h100_sxm \
  --filesystem my-dataset

# Filesystem is mounted at /home/ubuntu/my-dataset
# Data persists across instance restarts
```
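Checkpoints written outside the mounted filesystem are lost when the instance terminates, so a preflight check before training is cheap insurance. A minimal sketch assuming the mount convention above (`ensure_ckpt_dir` is a hypothetical helper):

```bash
# Preflight check: refuse to train unless the checkpoint directory
# exists (i.e. the filesystem is attached) and is writable.
ensure_ckpt_dir() {
  dir=$1
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
  echo "ok: $dir"
}

# Path follows the /home/ubuntu/<filesystem-name> mount convention above
ensure_ckpt_dir /home/ubuntu/my-dataset/checkpoints \
  || echo "attach the filesystem before training"
```

Running this at the top of a training launch script fails fast instead of silently checkpointing to the ephemeral root disk.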

Best Practices

  1. Use persistent filesystems — Don't lose checkpoints when instances terminate
  2. Use reserved instances for long runs — 1-year reservations save 30-50%
  3. Monitor capacity — On-demand instances may not always be available in every region
  4. Use multi-node for large models — 8x H100 nodes with InfiniBand
  5. Set up auto-checkpointing — Save every 30 minutes in case of preemption
  6. Use Lambda's pre-built images — PyTorch, CUDA, and drivers pre-installed
  7. Attach filesystems at launch — Faster than downloading datasets each time
  8. Use SSH keys, not passwords — More secure and scriptable
  9. Track costs with tags — Tag instances by project for cost allocation
  10. Terminate idle instances — No auto-shutdown; manual cleanup required
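Practice 10 can be partially automated with a small watchdog run from cron. A sketch with hypothetical thresholds; the shutdown decision is factored out of the `nvidia-smi` call so the logic can be exercised on a machine without a GPU:

```bash
# Decide whether a node looks idle. Threshold of 5% GPU utilization is
# an assumed default, not a Lambda recommendation.
# Usage: should_shutdown <gpu_util_percent> [threshold]
should_shutdown() {
  util=$1; threshold=${2:-5}
  [ "$util" -lt "$threshold" ] && echo yes || echo no
}

# On the instance, run from cron (e.g. every 15 minutes):
#   util=$(nvidia-smi --query-gpu=utilization.gpu \
#            --format=csv,noheader,nounits | sort -n | tail -1)
#   [ "$(should_shutdown "$util")" = yes ] && sudo shutdown -h now
should_shutdown 3    # idle
should_shutdown 80   # busy
```

A real deployment would also want a grace period (several consecutive idle samples) so a node is not shut down between training epochs.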

Troubleshooting

Instance launch fails

```bash
# Check availability
lambda instances list-types --region us-east-1

# Try a different region
lambda instances list-types --region us-west-2
```

NCCL errors on multi-node

```bash
# Lambda instances use InfiniBand; set the correct interface
export NCCL_SOCKET_IFNAME=eno1
export NCCL_IB_DISABLE=0
```
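Interface names vary across images and instance types, so hard-coding `eno1` may fail on some nodes. A sketch of a fallback-based picker (`pick_nccl_iface` is a hypothetical helper; confirm actual names with `ip -o link` on your nodes):

```bash
# Pick the NCCL socket interface from a list of interface names on
# stdin: prefer eno1 (as in the fix above), otherwise fall back to the
# first non-loopback interface.
pick_nccl_iface() {
  ifaces=$(cat)
  echo "$ifaces" | grep -qx 'eno1' && { echo eno1; return; }
  echo "$ifaces" | grep -v '^lo$' | head -1
}

# On a node:
#   export NCCL_SOCKET_IFNAME=$(ip -o link | awk -F': ' '{print $2}' | pick_nccl_iface)
printf 'lo\neno1\ndocker0\n' | pick_nccl_iface
```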

Filesystem not mounting

```bash
# Verify filesystem and instance are in the same region
lambda filesystems list

# Check mount point
ls /home/ubuntu/
```