Infrastructure Lambda Elite

Production-ready skill for managing reserved and on-demand GPU cloud instances. Includes structured workflows, validation checks, and reusable patterns for AI research.


Lambda Labs GPU Cloud Infrastructure

Overview

A comprehensive skill for provisioning and managing GPU compute infrastructure on Lambda Labs — the cloud platform offering on-demand and reserved NVIDIA H100, A100, and A10 instances for AI/ML workloads. Covers instance management, multi-node training setup, persistent storage, cost optimization, and integration with popular ML frameworks.

When to Use

  • Need GPU instances for ML training or inference
  • Running distributed training across multiple GPU nodes
  • Want simpler pricing than AWS/GCP for GPU workloads
  • Need persistent storage for datasets and checkpoints
  • Setting up development environments for ML
  • Cost-optimizing GPU compute for training runs

Quick Start

```bash
# Install Lambda CLI
pip install lambda-cloud

# Configure credentials
lambda config set api-key YOUR_API_KEY

# List available instance types
lambda instances list-types

# Launch an instance
lambda instances create \
  --instance-type gpu_8x_h100_sxm \
  --region us-east-1 \
  --ssh-key my-key \
  --name "training-node"

# SSH into instance
ssh ubuntu@<instance-ip>
```

Instance Types

| Instance | GPUs | GPU Memory | vCPUs | RAM    | Storage     | Price/hr |
|----------|------|------------|-------|--------|-------------|----------|
| 1x A10   | 1    | 24 GB      | 30    | 200 GB | 512 GB SSD  | ~$0.75   |
| 1x A100  | 1    | 40/80 GB   | 30    | 200 GB | 512 GB SSD  | ~$1.29   |
| 8x A100  | 8    | 640 GB     | 124   | 1.8 TB | 6.1 TB SSD  | ~$10.32  |
| 1x H100  | 1    | 80 GB      | 26    | 200 GB | 512 GB SSD  | ~$2.49   |
| 8x H100  | 8    | 640 GB     | 208   | 1.8 TB | 6.1 TB SSD  | ~$19.92  |
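With flat hourly rates, the cost of a training run can be estimated up front. A minimal sketch using the approximate on-demand prices from the table above (`estimate_cost` is a hypothetical helper; actual billing may differ):

```bash
# Rough run-cost estimator using the approximate on-demand rates above.
# Usage: estimate_cost <price_per_hour> <num_nodes> <hours>
estimate_cost() {
  price=$1; nodes=$2; hours=$3
  # awk handles the floating-point math portably
  awk -v p="$price" -v n="$nodes" -v h="$hours" \
    'BEGIN { printf "%.2f\n", p * n * h }'
}

# Example: 4 nodes of 8x H100 (~$19.92/hr each) for a 72-hour run
estimate_cost 19.92 4 72
```

A quick estimate like this is also useful when deciding whether a reserved-instance quote beats on-demand for a long run.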

Multi-Node Training Setup

```bash
# Launch 4 nodes
for i in $(seq 0 3); do
  lambda instances create \
    --instance-type gpu_8x_h100_sxm \
    --region us-east-1 \
    --ssh-key my-key \
    --name "train-node-$i" \
    --filesystem my-dataset
done

# On each node, set environment
export MASTER_ADDR=<node-0-ip>
export MASTER_PORT=29500
export WORLD_SIZE=32   # 4 nodes × 8 GPUs
export NODE_RANK=$i    # 0, 1, 2, 3

# Launch training
torchrun --nproc_per_node=8 \
  --nnodes=4 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  train.py
```
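Since every node runs the same `torchrun` command except for its rank, it can help to generate the invocation from a single helper instead of editing it on each node by hand. A sketch assuming the 4-node × 8-GPU layout above (`print_torchrun_cmd` is a hypothetical helper name):

```bash
# Prints the torchrun invocation for a given node rank, so all nodes
# run an identical script parameterized only by rank and master address.
# Assumes 4 nodes x 8 GPUs (WORLD_SIZE=32) as in the example above.
print_torchrun_cmd() {
  node_rank=$1; master_addr=$2
  nnodes=4; nproc=8
  echo "torchrun --nproc_per_node=$nproc --nnodes=$nnodes" \
       "--node_rank=$node_rank --master_addr=$master_addr" \
       "--master_port=29500 train.py"
}

# On node 2, with node 0 at 10.0.0.1:
print_torchrun_cmd 2 10.0.0.1
```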

Persistent Storage

```bash
# Create a persistent filesystem
lambda filesystems create \
  --name my-dataset \
  --region us-east-1

# Attach to instance at launch
lambda instances create \
  --instance-type gpu_8x_h100_sxm \
  --filesystem my-dataset

# Filesystem is mounted at /home/ubuntu/my-dataset
# Data persists across instance restarts
```
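Checkpoints written outside the mounted filesystem are lost when the instance terminates, so a preflight check before training is cheap insurance. A minimal sketch assuming the mount convention above (`ensure_ckpt_dir` is a hypothetical helper):

```bash
# Preflight check: refuse to train unless the checkpoint directory
# exists (i.e. the filesystem is attached) and is writable.
ensure_ckpt_dir() {
  dir=$1
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
  echo "ok: $dir"
}

# Path follows the /home/ubuntu/<filesystem-name> mount convention above
ensure_ckpt_dir /home/ubuntu/my-dataset/checkpoints \
  || echo "attach the filesystem before training"
```

Running this at the top of a training launch script fails fast instead of silently checkpointing to the ephemeral root disk.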

Best Practices

  1. Use persistent filesystems — Don't lose checkpoints when instances terminate
  2. Use reserved instances for long runs — 1-year reservations save 30-50%
  3. Monitor capacity — On-demand instances may not always be available in every region
  4. Use multi-node for large models — 8x H100 nodes with InfiniBand
  5. Set up auto-checkpointing — Save every 30 minutes in case of preemption
  6. Use Lambda's pre-built images — PyTorch, CUDA, and drivers pre-installed
  7. Attach filesystems at launch — Faster than downloading datasets each time
  8. Use SSH keys, not passwords — More secure and scriptable
  9. Track costs with tags — Tag instances by project for cost allocation
  10. Terminate idle instances — No auto-shutdown; manual cleanup required
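Practice 10 can be partially automated with a small watchdog run from cron. A sketch with hypothetical thresholds; the shutdown decision is factored out of the `nvidia-smi` call so the logic can be exercised on a machine without a GPU:

```bash
# Decide whether a node looks idle. Threshold of 5% GPU utilization is
# an assumed default, not a Lambda recommendation.
# Usage: should_shutdown <gpu_util_percent> [threshold]
should_shutdown() {
  util=$1; threshold=${2:-5}
  [ "$util" -lt "$threshold" ] && echo yes || echo no
}

# On the instance, run from cron (e.g. every 15 minutes):
#   util=$(nvidia-smi --query-gpu=utilization.gpu \
#            --format=csv,noheader,nounits | sort -n | tail -1)
#   [ "$(should_shutdown "$util")" = yes ] && sudo shutdown -h now
should_shutdown 3    # idle
should_shutdown 80   # busy
```

A real deployment would also want a grace period (several consecutive idle samples) so a node is not shut down between training epochs.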

Troubleshooting

Instance launch fails

```bash
# Check availability
lambda instances list-types --region us-east-1

# Try a different region
lambda instances list-types --region us-west-2
```

NCCL errors on multi-node

```bash
# Lambda instances use InfiniBand; set the correct interface
export NCCL_SOCKET_IFNAME=eno1
export NCCL_IB_DISABLE=0
```
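Interface names vary across images and instance types, so hard-coding `eno1` may fail on some nodes. A sketch of a fallback-based picker (`pick_nccl_iface` is a hypothetical helper; confirm actual names with `ip -o link` on your nodes):

```bash
# Pick the NCCL socket interface from a list of interface names on
# stdin: prefer eno1 (as in the fix above), otherwise fall back to the
# first non-loopback interface.
pick_nccl_iface() {
  ifaces=$(cat)
  echo "$ifaces" | grep -qx 'eno1' && { echo eno1; return; }
  echo "$ifaces" | grep -v '^lo$' | head -1
}

# On a node:
#   export NCCL_SOCKET_IFNAME=$(ip -o link | awk -F': ' '{print $2}' | pick_nccl_iface)
printf 'lo\neno1\ndocker0\n' | pick_nccl_iface
```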

Filesystem not mounting

```bash
# Verify filesystem and instance are in the same region
lambda filesystems list

# Check mount point
ls /home/ubuntu/
```