Pro Distributed Workspace
Streamline your workflow with this high-level PyTorch-oriented skill. Includes structured workflows, validation checks, and reusable patterns for AI research.
Distributed Computing Workspace
Overview
A comprehensive skill for setting up and managing distributed computing environments — covering cluster configuration, job scheduling, resource management, and monitoring. Enables efficient utilization of multi-node GPU clusters for training, inference, and data processing workloads with tools like SLURM, Kubernetes, Ray, and cloud-native solutions.
When to Use
- Setting up GPU clusters for ML training
- Managing multi-node job scheduling
- Configuring SLURM or Kubernetes for ML workloads
- Building development environments for distributed computing
- Monitoring and debugging distributed jobs
- Optimizing resource utilization across a cluster
Quick Start
```bash
# SLURM cluster — submit a training job
sbatch train_job.sh

# Kubernetes — deploy a training job
kubectl apply -f training-job.yaml

# Ray cluster — start and connect
ray start --head --num-gpus=4
ray job submit -- python train.py
```
Cluster Setup Patterns
SLURM Job Script
```bash
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96
#SBATCH --mem=0               # Use all available memory
#SBATCH --time=48:00:00
#SBATCH --partition=gpu
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Set master address
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export WORLD_SIZE=$((SLURM_NNODES * 8))

# Launch distributed training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py --config config.yaml
```
Kubernetes Training Job
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:25.04-py3
          command: ["torchrun"]
          args:
            - "--nproc_per_node=8"
            - "--nnodes=4"
            - "train.py"
          resources:
            limits:
              nvidia.com/gpu: 8
              memory: "512Gi"
            requests:
              cpu: "96"
          volumeMounts:
            - name: shared-data
              mountPath: /data
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data
      restartPolicy: OnFailure
```
Ray Cluster Configuration
```python
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

ray.init()

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=8,
        use_gpu=True,
        resources_per_worker={"GPU": 1, "CPU": 12},
    ),
    run_config=ray.train.RunConfig(
        storage_path="/shared/results",
        checkpoint_config=ray.train.CheckpointConfig(
            num_to_keep=3,
        ),
    ),
)
result = trainer.fit()
```
Resource Management
| Resource | Tool | Purpose |
|---|---|---|
| GPUs | SLURM, K8s, Ray | Allocate and schedule GPU access |
| Storage | NFS, Lustre, S3 | Shared data and checkpoint storage |
| Network | InfiniBand, RoCE | High-speed inter-node communication |
| Monitoring | Prometheus, Grafana | Track utilization and health |
| Logging | ELK, Loki | Centralized log aggregation |
| Containers | Docker, Singularity | Reproducible environments |
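For the monitoring row above, a minimal sketch of turning `nvidia-smi` query output into structured per-GPU metrics. It assumes the standard `--query-gpu`/`--format=csv,noheader,nounits` interface; the `gpu_utilization` helper and its field choices are illustrative, not part of any library.

```python
import subprocess

def gpu_utilization(sample=None):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output.

    If `sample` is None, invoke nvidia-smi; otherwise parse the given text
    (handy for testing on machines without GPUs).
    """
    if sample is None:
        sample = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,utilization.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    gpus = []
    for line in sample.strip().splitlines():
        # Each line: "<index>, <util %>, <mem used MiB>, <mem total MiB>"
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total),
        })
    return gpus
```

Feeding these numbers into a Prometheus exporter (or simply logging them per step) makes communication and data-loading stalls visible as utilization dips.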
Environment Configuration
| Variable | Purpose | Example |
|---|---|---|
| MASTER_ADDR | Master node hostname | node-0 |
| MASTER_PORT | Communication port | 29500 |
| WORLD_SIZE | Total number of processes | 32 |
| RANK | Global process rank | 0-31 |
| LOCAL_RANK | Per-node process rank | 0-7 |
| NCCL_SOCKET_IFNAME | Network interface for NCCL | eth0 |
| NCCL_IB_DISABLE | Disable InfiniBand transport (1 = disable) | 0 (IB enabled) |
| CUDA_VISIBLE_DEVICES | GPUs visible to the process | 0,1,2,3 |
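These variables are typically read once at process startup. A minimal sketch of collecting them with single-process fallbacks (the `dist_env` helper and its defaults are illustrative; torchrun sets the real values):

```python
import os

def dist_env():
    """Read the standard torch.distributed environment variables,
    falling back to single-process defaults when unset."""
    env = {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
    }
    # Sanity check: the global rank must lie inside the world
    assert 0 <= env["rank"] < env["world_size"], "RANK out of range for WORLD_SIZE"
    return env
```

Failing fast on an inconsistent RANK/WORLD_SIZE pair is cheaper than debugging a hang at the first collective.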
Best Practices
- Use containerized environments — Docker/Singularity ensures reproducibility across nodes
- Shared filesystem for checkpoints — NFS or Lustre ensures all nodes can save/load
- Set NCCL environment variables — Tune for your network topology
- Monitor GPU utilization — Low utilization means communication or data loading bottlenecks
- Use preemption-aware checkpointing — Save frequently on preemptible instances
- Test at small scale first — Run on 1-2 nodes before scaling to full cluster
- Log everything centrally — Distributed logs are hard to correlate without aggregation
- Version your training configs — Track hyperparameters, data versions, and code versions together
- Set memory limits — Prevent OOM kills by setting explicit memory bounds
- Automate cluster teardown — Clean up resources automatically to avoid idle costs
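The preemption-aware checkpointing practice above can be sketched as a small policy object: save on a fixed wall-clock interval, or immediately after the preemption signal (SIGTERM on most spot/preemptible instances). The `CheckpointPolicy` class is a hypothetical helper, not a library API; the injectable clock exists only to make it testable.

```python
import signal
import time

class CheckpointPolicy:
    """Decide when to checkpoint: on a wall-clock interval, or as soon as
    a preemption notice (SIGTERM) arrives."""

    def __init__(self, interval_s=600, now=time.monotonic):
        self.interval_s = interval_s
        self.now = now          # injectable clock, defaults to monotonic time
        self.last_save = now()
        self.preempted = False

    def install_handler(self):
        # Spot/preemptible VMs usually receive SIGTERM shortly before reclaim
        signal.signal(signal.SIGTERM,
                      lambda *_: setattr(self, "preempted", True))

    def should_save(self):
        if self.preempted or self.now() - self.last_save >= self.interval_s:
            self.last_save = self.now()
            self.preempted = False
            return True
        return False
```

In the training loop, call `policy.should_save()` once per step and write a checkpoint (then exit cleanly if preempted) whenever it returns True.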
Troubleshooting
Nodes can't communicate
```bash
# Check NCCL connectivity
NCCL_DEBUG=INFO torchrun --nproc_per_node=2 test_connectivity.py

# Verify firewall rules
sudo iptables -L | grep 29500

# Open port if blocked
sudo iptables -A INPUT -p tcp --dport 29500 -j ACCEPT
```
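Before turning on NCCL debug logging, it is worth confirming the rendezvous port is reachable at all. A stdlib-only sketch (the `port_open` helper is hypothetical, not a torch or NCCL API):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from a worker node against `MASTER_ADDR:MASTER_PORT`; a False here points to firewall rules or a wrong interface rather than an NCCL problem.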
Jobs stuck in SLURM queue
```bash
# Check partition availability
sinfo -p gpu

# Check job priority
sprio -j $JOBID

# Check what is currently running on the partition
squeue --partition=gpu --states=RUNNING
```
GPU memory fragmentation
```python
import os
import torch

# Cap the per-process memory pool to reduce fragmentation
torch.cuda.set_per_process_memory_fraction(0.95)

# Or let the allocator grow segments instead of fragmenting
# (must be set before the first CUDA allocation)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
```