Pro Distributed Workspace
Streamline your workflow with this high-level PyTorch-oriented skill. Includes structured workflows, validation checks, and reusable patterns for AI research.
Distributed Computing Workspace
Overview
A comprehensive skill for setting up and managing distributed computing environments — covering cluster configuration, job scheduling, resource management, and monitoring. Enables efficient utilization of multi-node GPU clusters for training, inference, and data processing workloads with tools like SLURM, Kubernetes, Ray, and cloud-native solutions.
When to Use
- Setting up GPU clusters for ML training
- Managing multi-node job scheduling
- Configuring SLURM or Kubernetes for ML workloads
- Building development environments for distributed computing
- Monitoring and debugging distributed jobs
- Optimizing resource utilization across a cluster
Quick Start
```bash
# SLURM cluster — submit a training job
sbatch train_job.sh

# Kubernetes — deploy a training job
kubectl apply -f training-job.yaml

# Ray cluster — start and connect
ray start --head --num-gpus=4
ray job submit -- python train.py
```
Cluster Setup Patterns
SLURM Job Script
```bash
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96
#SBATCH --mem=0               # Use all available memory
#SBATCH --time=48:00:00
#SBATCH --partition=gpu
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Set master address
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export WORLD_SIZE=$((SLURM_NNODES * 8))

# Launch distributed training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py --config config.yaml
```
Kubernetes Training Job
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:25.04-py3
          command: ["torchrun"]
          args:
            - "--nproc_per_node=8"
            - "--nnodes=4"
            - "train.py"
          resources:
            limits:
              nvidia.com/gpu: 8
              memory: "512Gi"
            requests:
              cpu: "96"
          volumeMounts:
            - name: shared-data
              mountPath: /data
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data
      restartPolicy: OnFailure
```
Ray Cluster Configuration
```python
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

ray.init()

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=8,
        use_gpu=True,
        resources_per_worker={"GPU": 1, "CPU": 12},
    ),
    run_config=ray.train.RunConfig(
        storage_path="/shared/results",
        checkpoint_config=ray.train.CheckpointConfig(
            num_to_keep=3,
        ),
    ),
)
result = trainer.fit()
```
Resource Management
| Resource | Tool | Purpose |
|---|---|---|
| GPUs | SLURM, K8s, Ray | Allocate and schedule GPU access |
| Storage | NFS, Lustre, S3 | Shared data and checkpoint storage |
| Network | InfiniBand, RoCE | High-speed inter-node communication |
| Monitoring | Prometheus, Grafana | Track utilization and health |
| Logging | ELK, Loki | Centralized log aggregation |
| Containers | Docker, Singularity | Reproducible environments |
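For the monitoring row above, a minimal sketch of turning `nvidia-smi` query output into structured per-GPU metrics. It assumes the standard `--query-gpu`/`--format=csv,noheader,nounits` interface; the `gpu_utilization` helper and its field choices are illustrative, not part of any library.

```python
import subprocess

def gpu_utilization(sample=None):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output.

    If `sample` is None, invoke nvidia-smi; otherwise parse the given text
    (handy for testing on machines without GPUs).
    """
    if sample is None:
        sample = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,utilization.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    gpus = []
    for line in sample.strip().splitlines():
        # Each line: "<index>, <util %>, <mem used MiB>, <mem total MiB>"
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total),
        })
    return gpus
```

Feeding these numbers into a Prometheus exporter (or simply logging them per step) makes communication and data-loading stalls visible as utilization dips.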
Environment Configuration
| Variable | Purpose | Example |
|---|---|---|
| MASTER_ADDR | Master node hostname | node-0 |
| MASTER_PORT | Communication port | 29500 |
| WORLD_SIZE | Total number of processes | 32 |
| RANK | Global process rank | 0-31 |
| LOCAL_RANK | Per-node process rank | 0-7 |
| NCCL_SOCKET_IFNAME | Network interface for NCCL | eth0 |
| NCCL_IB_DISABLE | Disable InfiniBand transport (1 = disable) | 0 (IB enabled) |
| CUDA_VISIBLE_DEVICES | GPUs visible to the process | 0,1,2,3 |
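These variables are typically read once at process startup. A minimal sketch of collecting them with single-process fallbacks (the `dist_env` helper and its defaults are illustrative; torchrun sets the real values):

```python
import os

def dist_env():
    """Read the standard torch.distributed environment variables,
    falling back to single-process defaults when unset."""
    env = {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
    }
    # Sanity check: the global rank must lie inside the world
    assert 0 <= env["rank"] < env["world_size"], "RANK out of range for WORLD_SIZE"
    return env
```

Failing fast on an inconsistent RANK/WORLD_SIZE pair is cheaper than debugging a hang at the first collective.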
Best Practices
- Use containerized environments — Docker/Singularity ensures reproducibility across nodes
- Shared filesystem for checkpoints — NFS or Lustre ensures all nodes can save/load
- Set NCCL environment variables — Tune for your network topology
- Monitor GPU utilization — Low utilization means communication or data loading bottlenecks
- Use preemption-aware checkpointing — Save frequently on preemptible instances
- Test at small scale first — Run on 1-2 nodes before scaling to full cluster
- Log everything centrally — Distributed logs are hard to correlate without aggregation
- Version your training configs — Track hyperparameters, data versions, and code versions together
- Set memory limits — Prevent OOM kills by setting explicit memory bounds
- Automate cluster teardown — Clean up resources automatically to avoid idle costs
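The preemption-aware checkpointing practice above can be sketched as a small policy object: save on a fixed wall-clock interval, or immediately after the preemption signal (SIGTERM on most spot/preemptible instances). The `CheckpointPolicy` class is a hypothetical helper, not a library API; the injectable clock exists only to make it testable.

```python
import signal
import time

class CheckpointPolicy:
    """Decide when to checkpoint: on a wall-clock interval, or as soon as
    a preemption notice (SIGTERM) arrives."""

    def __init__(self, interval_s=600, now=time.monotonic):
        self.interval_s = interval_s
        self.now = now          # injectable clock, defaults to monotonic time
        self.last_save = now()
        self.preempted = False

    def install_handler(self):
        # Spot/preemptible VMs usually receive SIGTERM shortly before reclaim
        signal.signal(signal.SIGTERM,
                      lambda *_: setattr(self, "preempted", True))

    def should_save(self):
        if self.preempted or self.now() - self.last_save >= self.interval_s:
            self.last_save = self.now()
            self.preempted = False
            return True
        return False
```

In the training loop, call `policy.should_save()` once per step and write a checkpoint (then exit cleanly if preempted) whenever it returns True.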
Troubleshooting
Nodes can't communicate
```bash
# Check NCCL connectivity
NCCL_DEBUG=INFO torchrun --nproc_per_node=2 test_connectivity.py

# Verify firewall rules
sudo iptables -L | grep 29500

# Open port if blocked
sudo iptables -A INPUT -p tcp --dport 29500 -j ACCEPT
```
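Before turning on NCCL debug logging, it is worth confirming the rendezvous port is reachable at all. A stdlib-only sketch (the `port_open` helper is hypothetical, not a torch or NCCL API):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from a worker node against `MASTER_ADDR:MASTER_PORT`; a False here points to firewall rules or a wrong interface rather than an NCCL problem.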
Jobs stuck in SLURM queue
```bash
# Check partition availability
sinfo -p gpu

# Check job priority
sprio -j $JOBID

# Check what is currently running on the partition
squeue --partition=gpu --states=RUNNING
```
GPU memory fragmentation
```python
import os
import torch

# Cap the per-process memory pool to reduce fragmentation
torch.cuda.set_per_process_memory_fraction(0.95)

# Or let the allocator grow segments instead of fragmenting
# (must be set before the first CUDA allocation)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
```