PyTorch Lightning System

All-in-one skill covering deep learning frameworks and PyTorch. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT


Organize and scale deep learning training with PyTorch Lightning, a framework that eliminates boilerplate while maintaining full PyTorch flexibility. This skill covers LightningModule design, training configuration, distributed training, logging, and production deployment workflows.

When to Use This Skill

Choose PyTorch Lightning System when you need to:

  • Structure PyTorch training code with clean separation of model, data, and training logic
  • Scale training from single GPU to multi-GPU or multi-node with minimal code changes
  • Add experiment tracking, checkpointing, and early stopping without manual implementation
  • Train models with mixed precision, gradient accumulation, or custom training loops

Consider alternatives when:

  • You need a completely custom training loop with no framework constraints (use raw PyTorch)
  • You need fast prototyping with pre-built architectures (use Hugging Face Transformers)
  • You need non-neural ML models (use scikit-learn or XGBoost)

Quick Start

```shell
pip install lightning torch torchvision
```

```python
import lightning as L
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms


class ImageClassifier(L.LightningModule):
    def __init__(self, num_classes=10, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.model = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(28 * 28, 256),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.2),
            torch.nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log("train_loss", loss, prog_bar=True)
        self.log("train_acc", acc, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)


# Data
transform = transforms.Compose([transforms.ToTensor()])
dataset = datasets.MNIST("./data", train=True, download=True, transform=transform)
train_ds, val_ds = random_split(dataset, [55000, 5000])

# Train
model = ImageClassifier()
trainer = L.Trainer(max_epochs=10, accelerator="auto")
trainer.fit(model, DataLoader(train_ds, batch_size=64), DataLoader(val_ds, batch_size=64))
```

Core Concepts

LightningModule Methods

| Method | Purpose | When called |
| --- | --- | --- |
| `__init__` | Define model architecture | Instantiation |
| `forward` | Inference/prediction | Manual call or predict |
| `training_step` | Training loss computation | Each training batch |
| `validation_step` | Validation metric computation | Each validation batch |
| `test_step` | Test evaluation | Each test batch |
| `configure_optimizers` | Optimizer and scheduler setup | Training start |
| `on_train_epoch_end` | End-of-epoch logic | After each training epoch |

Trainer Configuration

```python
import lightning as L
from lightning.pytorch.callbacks import (
    ModelCheckpoint,
    EarlyStopping,
    LearningRateMonitor,
)
from lightning.pytorch.loggers import TensorBoardLogger

trainer = L.Trainer(
    max_epochs=50,
    accelerator="gpu",
    devices=2,                   # 2 GPUs
    strategy="ddp",              # Distributed data parallel
    precision="16-mixed",        # Mixed precision training
    gradient_clip_val=1.0,
    accumulate_grad_batches=4,   # Effective batch = 4 × actual batch
    callbacks=[
        ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=3),
        EarlyStopping(monitor="val_loss", patience=5, mode="min"),
        LearningRateMonitor(logging_interval="step"),
    ],
    logger=TensorBoardLogger("logs/", name="experiment"),
    log_every_n_steps=10,
    check_val_every_n_epoch=1,
)
```

Key Trainer Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `max_epochs` | Maximum training epochs | `1000` |
| `accelerator` | Hardware (`cpu`, `gpu`, `tpu`, `auto`) | `"auto"` |
| `devices` | Number of devices | `"auto"` |
| `strategy` | Distributed strategy (`ddp`, `fsdp`, `deepspeed`) | `"auto"` |
| `precision` | Training precision (`32`, `16-mixed`, `bf16-mixed`) | `"32-true"` |
| `gradient_clip_val` | Gradient norm clipping value | `None` |
| `accumulate_grad_batches` | Gradient accumulation steps | `1` |
| `log_every_n_steps` | Logging frequency (steps) | `50` |
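The accumulation and device settings multiply together: gradients from several forward passes (and, under DDP, from every device) contribute to each optimizer step. A small arithmetic sketch, assuming a per-device DataLoader batch of 64 as in the Quick Start (the helper function is illustrative, not a Lightning API):

```python
def effective_batch_size(per_device_batch, accumulate_grad_batches=1, devices=1):
    """Samples contributing to each optimizer step under DDP-style data parallelism."""
    return per_device_batch * accumulate_grad_batches * devices


# e.g. 4 accumulation steps on 2 GPUs, batch 64 per GPU:
print(effective_batch_size(64, accumulate_grad_batches=4, devices=2))  # 512
```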

Best Practices

  1. Use save_hyperparameters() in __init__ — Call self.save_hyperparameters() to automatically save all constructor arguments. This enables checkpoint loading with Model.load_from_checkpoint(path) without manually specifying hyperparameters, and logs them to your experiment tracker.

  2. Let Lightning handle device placement — Never call .to(device) or .cuda() manually. Lightning handles device placement automatically based on the accelerator setting. Manual device management conflicts with Lightning's device handling and breaks multi-GPU training.

  3. Use self.log() for all metrics — All metrics logged with self.log() appear in your logger (TensorBoard, W&B), are available for callbacks (early stopping, checkpointing), and work correctly across distributed training. Don't use print() or manual logging.

  4. Start with single GPU, then scale — Develop and debug on one GPU with Trainer(accelerator="gpu", devices=1). Once training works, add devices=2, strategy="ddp" for multi-GPU. Lightning handles data distribution and gradient synchronization automatically.

  5. Use LightningDataModule for data organization — Wrap your data loading in a LightningDataModule class with prepare_data(), setup(), train_dataloader(), and val_dataloader() methods. This keeps data logic separate from model logic and enables data reuse.

Common Issues

Training runs but validation metrics never improve — Check that validation_step is actually being called by setting check_val_every_n_epoch=1. Also verify that the validation DataLoader shuffling is disabled (shuffle=False) — shuffled validation causes inconsistent metrics between epochs.
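A self-contained illustration of the fixed-order requirement (synthetic tensors stand in for a real validation set):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

val_ds = TensorDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))

# shuffle=False (the DataLoader default) keeps batch order identical on every
# pass, so validation metrics are comparable from epoch to epoch.
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False)

first_pass = next(iter(val_loader))[0]
second_pass = next(iter(val_loader))[0]
assert torch.equal(first_pass, second_pass)  # deterministic order
```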

Multi-GPU training hangs at start — DDP strategy requires all processes to synchronize. If one GPU process crashes, others hang indefinitely. Check that all GPUs have sufficient memory, NCCL is installed correctly, and no GPU is occupied by another process. Start with devices=1 to verify single-GPU works first.

Checkpoint loading fails with missing keys — The checkpoint was saved with a different model architecture. Use strict=False in load_from_checkpoint() to load partial weights, or ensure the model class matches exactly. Check the checkpoint's hyper_parameters to verify the expected architecture.
