
Pro Distributed Workspace

Streamline your workflow with this high-level PyTorch framework. Includes structured workflows, validation checks, and reusable patterns for AI research.


Distributed Computing Workspace

Overview

A comprehensive skill for setting up and managing distributed computing environments — covering cluster configuration, job scheduling, resource management, and monitoring. Enables efficient utilization of multi-node GPU clusters for training, inference, and data processing workloads with tools like SLURM, Kubernetes, Ray, and cloud-native solutions.

When to Use

  • Setting up GPU clusters for ML training
  • Managing multi-node job scheduling
  • Configuring SLURM or Kubernetes for ML workloads
  • Building development environments for distributed computing
  • Monitoring and debugging distributed jobs
  • Optimizing resource utilization across a cluster

Quick Start

```bash
# SLURM cluster — submit a training job
sbatch train_job.sh

# Kubernetes — deploy training job
kubectl apply -f training-job.yaml

# Ray cluster — start and connect
ray start --head --num-gpus=4
ray job submit -- python train.py
```

Cluster Setup Patterns

SLURM Job Script

```bash
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96
#SBATCH --mem=0                  # Use all available memory
#SBATCH --time=48:00:00
#SBATCH --partition=gpu
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Set master address
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=29500
export WORLD_SIZE=$((SLURM_NNODES * 8))

# Launch distributed training
srun torchrun \
  --nnodes=$SLURM_NNODES \
  --nproc_per_node=8 \
  --rdzv_id=$SLURM_JOB_ID \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  train.py --config config.yaml
```

Kubernetes Training Job

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:25.04-py3
          command: ["torchrun"]
          args:
            - "--nproc_per_node=8"
            - "--nnodes=4"
            - "train.py"
          resources:
            limits:
              nvidia.com/gpu: 8
              memory: "512Gi"
            requests:
              cpu: "96"
          volumeMounts:
            - name: shared-data
              mountPath: /data
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data
      restartPolicy: OnFailure
```

Ray Cluster Configuration

```python
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

ray.init()

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=8,
        use_gpu=True,
        resources_per_worker={"GPU": 1, "CPU": 12},
    ),
    run_config=ray.train.RunConfig(
        storage_path="/shared/results",
        checkpoint_config=ray.train.CheckpointConfig(
            num_to_keep=3,
        ),
    ),
)
result = trainer.fit()
```

Resource Management

| Resource | Tool | Purpose |
|---|---|---|
| GPUs | SLURM, K8s, Ray | Allocate and schedule GPU access |
| Storage | NFS, Lustre, S3 | Shared data and checkpoint storage |
| Network | InfiniBand, RoCE | High-speed inter-node communication |
| Monitoring | Prometheus, Grafana | Track utilization and health |
| Logging | ELK, Loki | Centralized log aggregation |
| Containers | Docker, Singularity | Reproducible environments |
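The monitoring row above is often bootstrapped with `nvidia-smi` before Prometheus exporters are in place. The following is a minimal sketch of that approach: the parser is pure Python, and `parse_gpu_utilization`, `query_gpus`, and the chosen query fields are illustrative names, not part of any library.

```python
import subprocess

def parse_gpu_utilization(csv_text):
    """Parse the CSV emitted by
    `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total
     --format=csv,noheader,nounits` into per-GPU dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(index),
            "utilization_pct": int(util),
            "memory_used_mib": int(mem_used),
            "memory_total_mib": int(mem_total),
        })
    return gpus

def query_gpus():
    """Shell out to nvidia-smi on a GPU node (illustrative; requires a GPU host)."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ], text=True)
    return parse_gpu_utilization(out)
```

Running a parser like this per node and pushing the dicts to a time-series store gives a rough utilization dashboard until a proper exporter (e.g. DCGM) is deployed.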

Environment Configuration

| Variable | Purpose | Example |
|---|---|---|
| MASTER_ADDR | Master node hostname | node-0 |
| MASTER_PORT | Communication port | 29500 |
| WORLD_SIZE | Total number of processes | 32 |
| RANK | Global process rank | 0-31 |
| LOCAL_RANK | Per-node process rank | 0-7 |
| NCCL_SOCKET_IFNAME | Network interface for NCCL | eth0 |
| NCCL_IB_DISABLE | Disable InfiniBand | 0 (enabled) |
| CUDA_VISIBLE_DEVICES | Available GPUs | 0,1,2,3 |
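The variables in this table fit together arithmetically: the global `RANK` is the node's index times the processes per node, plus `LOCAL_RANK`. A small sketch, assuming SLURM sets `SLURM_NODEID` and torchrun sets `LOCAL_RANK` (the `distributed_env` helper itself is hypothetical):

```python
import os

def distributed_env(gpus_per_node: int = 8):
    """Derive rendezvous settings from scheduler-provided environment
    variables, following the table above."""
    node_rank = int(os.environ.get("SLURM_NODEID", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", gpus_per_node))
    # Global rank: node index * processes per node + per-node rank
    rank = node_rank * gpus_per_node + local_rank
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", 29500)),
        "rank": rank,
        "local_rank": local_rank,
        "world_size": world_size,
    }
```

For example, `LOCAL_RANK=5` on node 3 of a 4-node, 8-GPU-per-node cluster gives global rank 3 * 8 + 5 = 29 out of `WORLD_SIZE=32`. In practice torchrun computes and exports `RANK` for you; this is only to make the relationship explicit.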

Best Practices

  1. Use containerized environments — Docker/Singularity ensures reproducibility across nodes
  2. Shared filesystem for checkpoints — NFS or Lustre ensures all nodes can save/load
  3. Set NCCL environment variables — Tune for your network topology
  4. Monitor GPU utilization — Low utilization means communication or data loading bottlenecks
  5. Use preemption-aware checkpointing — Save frequently on preemptible instances
  6. Test at small scale first — Run on 1-2 nodes before scaling to full cluster
  7. Log everything centrally — Distributed logs are hard to correlate without aggregation
  8. Version your training configs — Track hyperparameters, data versions, and code versions together
  9. Set memory limits — Prevent OOM kills by setting explicit memory bounds
  10. Automate cluster teardown — Clean up resources automatically to avoid idle costs
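Practices 5 and 9 above combine naturally: most schedulers send SIGTERM shortly before killing a preempted job, so a handler can flag the training loop to checkpoint and exit. A minimal sketch, assuming pickle-serializable state (the `PreemptionCheckpointer` class is illustrative, not a library API):

```python
import os
import pickle
import signal
import tempfile

class PreemptionCheckpointer:
    """Preemption-aware checkpointing: SIGTERM sets a flag, and the training
    loop saves atomically at the next step boundary."""

    def __init__(self, path):
        self.path = path
        self.preempted = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.preempted = True  # do no real work inside the handler

    def save(self, state: dict):
        # Write to a temp file, then atomically rename, so the checkpoint
        # is never half-written even if the node dies mid-save.
        directory = os.path.dirname(self.path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, self.path)

    def load(self):
        with open(self.path, "rb") as f:
            return pickle.load(f)
```

In a real loop you would call `save` every N steps and immediately when `preempted` flips to True, then exit cleanly so the scheduler can requeue the job from the last checkpoint.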

Troubleshooting

Nodes can't communicate

```bash
# Check NCCL connectivity
NCCL_DEBUG=INFO torchrun --nproc_per_node=2 test_connectivity.py

# Verify firewall rules
sudo iptables -L | grep 29500

# Open port if blocked
sudo iptables -A INPUT -p tcp --dport 29500 -j ACCEPT
```

Jobs stuck in SLURM queue

```bash
# Check partition availability
sinfo -p gpu

# Check job priority
sprio -j $JOBID

# Check resource availability
squeue --partition=gpu --states=RUNNING
```

GPU memory fragmentation

```python
import os
import torch

# Cap the per-process CUDA memory pool
torch.cuda.set_per_process_memory_fraction(0.95)

# Or let the allocator grow segments instead of fragmenting;
# must be set before the first CUDA allocation
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
```