PyTorch Lightning System
Organize and scale deep learning training with PyTorch Lightning, a framework that eliminates boilerplate while maintaining full PyTorch flexibility. This skill covers LightningModule design, training configuration, distributed training, logging, and production deployment workflows.
When to Use This Skill
Choose PyTorch Lightning System when you need to:
- Structure PyTorch training code with clean separation of model, data, and training logic
- Scale training from single GPU to multi-GPU or multi-node with minimal code changes
- Add experiment tracking, checkpointing, and early stopping without manual implementation
- Train models with mixed precision, gradient accumulation, or custom training loops
Consider alternatives when:
- You need a completely custom training loop with no framework constraints (use raw PyTorch)
- You need fast prototyping with pre-built architectures (use Hugging Face Transformers)
- You need non-neural ML models (use scikit-learn or XGBoost)
Quick Start
```bash
pip install lightning torch torchvision
```
```python
import lightning as L
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms


class ImageClassifier(L.LightningModule):
    def __init__(self, num_classes=10, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.model = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(28 * 28, 256),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.2),
            torch.nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log("train_loss", loss, prog_bar=True)
        self.log("train_acc", acc, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)


# Data
transform = transforms.Compose([transforms.ToTensor()])
dataset = datasets.MNIST("./data", train=True, download=True, transform=transform)
train_ds, val_ds = random_split(dataset, [55000, 5000])

# Train
model = ImageClassifier()
trainer = L.Trainer(max_epochs=10, accelerator="auto")
trainer.fit(model, DataLoader(train_ds, batch_size=64), DataLoader(val_ds, batch_size=64))
```
Core Concepts
LightningModule Methods
| Method | Purpose | When Called |
|---|---|---|
| `__init__` | Define model architecture | Instantiation |
| `forward` | Inference/prediction | Manual call or `predict` |
| `training_step` | Training loss computation | Each training batch |
| `validation_step` | Validation metric computation | Each validation batch |
| `test_step` | Test evaluation | Each test batch |
| `configure_optimizers` | Optimizer and scheduler setup | Training start |
| `on_train_epoch_end` | End-of-epoch logic | After each training epoch |
Trainer Configuration
```python
import lightning as L
from lightning.pytorch.callbacks import (
    ModelCheckpoint,
    EarlyStopping,
    LearningRateMonitor,
)
from lightning.pytorch.loggers import TensorBoardLogger

trainer = L.Trainer(
    max_epochs=50,
    accelerator="gpu",
    devices=2,                  # 2 GPUs
    strategy="ddp",             # Distributed data parallel
    precision="16-mixed",       # Mixed precision training
    gradient_clip_val=1.0,
    accumulate_grad_batches=4,  # Effective batch = 4 × actual batch
    callbacks=[
        ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=3),
        EarlyStopping(monitor="val_loss", patience=5, mode="min"),
        LearningRateMonitor(logging_interval="step"),
    ],
    logger=TensorBoardLogger("logs/", name="experiment"),
    log_every_n_steps=10,
    check_val_every_n_epoch=1,
)
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `max_epochs` | Maximum training epochs | 1000 |
| `accelerator` | Hardware (`cpu`, `gpu`, `tpu`, `auto`) | `"auto"` |
| `devices` | Number of devices | `"auto"` |
| `strategy` | Distributed strategy (`ddp`, `fsdp`, `deepspeed`) | `"auto"` |
| `precision` | Training precision (`32-true`, `16-mixed`, `bf16-mixed`) | `"32-true"` |
| `gradient_clip_val` | Gradient norm clipping value | `None` |
| `accumulate_grad_batches` | Gradient accumulation steps | 1 |
| `log_every_n_steps` | Logging frequency | 50 |
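Gradient accumulation and data-parallel training compound: under DDP, each optimizer step sees per-device batch size × accumulation steps × number of devices samples. A quick arithmetic check (plain Python, no Lightning required; the helper name is illustrative):

```python
def effective_batch_size(per_device_batch: int, accumulate_grad_batches: int, devices: int) -> int:
    """Samples contributing to each optimizer step under DDP with gradient accumulation."""
    return per_device_batch * accumulate_grad_batches * devices


# Matches the Trainer above: batch_size=64 per GPU, accumulate_grad_batches=4, devices=2
print(effective_batch_size(64, 4, 2))  # → 512
```

If you scale up `devices` or `accumulate_grad_batches`, remember to revisit the learning rate, since the effective batch size grows with both.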
Best Practices
- **Use `save_hyperparameters()` in `__init__`** — Call `self.save_hyperparameters()` to automatically save all constructor arguments. This enables checkpoint loading with `Model.load_from_checkpoint(path)` without manually specifying hyperparameters, and logs them to your experiment tracker.
- **Let Lightning handle device placement** — Never call `.to(device)` or `.cuda()` manually. Lightning places modules and batches automatically based on the `accelerator` setting; manual device management conflicts with this and breaks multi-GPU training.
- **Use `self.log()` for all metrics** — Metrics logged with `self.log()` appear in your logger (TensorBoard, W&B), are available to callbacks (early stopping, checkpointing), and are reduced correctly across distributed training. Don't use `print()` or manual logging.
- **Start with a single GPU, then scale** — Develop and debug on one GPU with `Trainer(accelerator="gpu", devices=1)`. Once training works, add `devices=2, strategy="ddp"` for multi-GPU. Lightning handles data distribution and gradient synchronization automatically.
- **Use `LightningDataModule` for data organization** — Wrap your data loading in a `LightningDataModule` class with `prepare_data()`, `setup()`, `train_dataloader()`, and `val_dataloader()` methods. This keeps data logic separate from model logic and enables data reuse.
Common Issues
**Training runs but validation metrics never improve** — Confirm that `validation_step` is actually being called (validation runs every epoch by default; `check_val_every_n_epoch` controls the interval) and that a validation DataLoader was passed to `fit()`. Also verify that shuffling is disabled on the validation DataLoader (`shuffle=False`) — shuffled validation causes inconsistent metrics between epochs.

**Multi-GPU training hangs at start** — The DDP strategy requires all processes to synchronize; if one GPU process crashes, the others hang indefinitely. Check that all GPUs have sufficient memory, that NCCL is installed correctly, and that no GPU is occupied by another process. Start with `devices=1` to verify that single-GPU training works first.

**Checkpoint loading fails with missing keys** — The checkpoint was saved with a different model architecture. Use `strict=False` in `load_from_checkpoint()` to load partial weights, or ensure the model class matches exactly. Inspect the checkpoint's `hyper_parameters` entry to verify the expected architecture.
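The `strict=False` behavior is easiest to see with plain PyTorch state dicts, which Lightning checkpoints wrap: `load_state_dict` reports which keys were missing or unexpected instead of raising. A minimal sketch (the two toy models are hypothetical stand-ins for an old and a new architecture):

```python
import torch

# "Checkpoint" saved from an older, one-layer architecture.
old = torch.nn.Sequential(torch.nn.Linear(4, 8))
ckpt = old.state_dict()

# The current architecture added a second layer; strict loading would raise.
new = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
result = new.load_state_dict(ckpt, strict=False)

# Keys present in the model but absent from the checkpoint are reported,
# not silently initialized — inspect them before trusting the load.
print(result.missing_keys)  # → ['1.weight', '1.bias']
```

After a partial load, the reported layers still hold their random initialization, so plan to fine-tune or re-train them rather than using the model as-is.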