PyTorch Lightning System
Organize and scale deep learning training with PyTorch Lightning, a framework that eliminates boilerplate while maintaining full PyTorch flexibility. This skill covers LightningModule design, training configuration, distributed training, logging, and production deployment workflows.
When to Use This Skill
Choose PyTorch Lightning System when you need to:
- Structure PyTorch training code with clean separation of model, data, and training logic
- Scale training from single GPU to multi-GPU or multi-node with minimal code changes
- Add experiment tracking, checkpointing, and early stopping without manual implementation
- Train models with mixed precision, gradient accumulation, or custom training loops
Consider alternatives when:
- You need a completely custom training loop with no framework constraints (use raw PyTorch)
- You need fast prototyping with pre-built architectures (use Hugging Face Transformers)
- You need non-neural ML models (use scikit-learn or XGBoost)
Quick Start
```bash
pip install lightning torch torchvision
```
```python
import lightning as L
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms


class ImageClassifier(L.LightningModule):
    def __init__(self, num_classes=10, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.model = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(28 * 28, 256),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.2),
            torch.nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log("train_loss", loss, prog_bar=True)
        self.log("train_acc", acc, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)


# Data
transform = transforms.Compose([transforms.ToTensor()])
dataset = datasets.MNIST("./data", train=True, download=True, transform=transform)
train_ds, val_ds = random_split(dataset, [55000, 5000])

# Train
model = ImageClassifier()
trainer = L.Trainer(max_epochs=10, accelerator="auto")
trainer.fit(model, DataLoader(train_ds, batch_size=64), DataLoader(val_ds, batch_size=64))
```
Core Concepts
LightningModule Methods
| Method | Purpose | When Called |
|---|---|---|
| `__init__` | Define model architecture | Instantiation |
| `forward` | Inference/prediction | Manual call or `predict` |
| `training_step` | Training loss computation | Each training batch |
| `validation_step` | Validation metric computation | Each validation batch |
| `test_step` | Test evaluation | Each test batch |
| `configure_optimizers` | Optimizer and scheduler setup | Training start |
| `on_train_epoch_end` | End-of-epoch logic | After each training epoch |
Trainer Configuration
```python
import lightning as L
from lightning.pytorch.callbacks import (
    ModelCheckpoint,
    EarlyStopping,
    LearningRateMonitor,
)
from lightning.pytorch.loggers import TensorBoardLogger

trainer = L.Trainer(
    max_epochs=50,
    accelerator="gpu",
    devices=2,                  # 2 GPUs
    strategy="ddp",             # Distributed data parallel
    precision="16-mixed",       # Mixed precision training
    gradient_clip_val=1.0,
    accumulate_grad_batches=4,  # Effective batch = 4 × actual batch
    callbacks=[
        ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=3),
        EarlyStopping(monitor="val_loss", patience=5, mode="min"),
        LearningRateMonitor(logging_interval="step"),
    ],
    logger=TensorBoardLogger("logs/", name="experiment"),
    log_every_n_steps=10,
    check_val_every_n_epoch=1,
)
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `max_epochs` | Maximum training epochs | 1000 |
| `accelerator` | Hardware (`cpu`, `gpu`, `tpu`, `auto`) | `"auto"` |
| `devices` | Number of devices | `"auto"` |
| `strategy` | Distributed strategy (`ddp`, `fsdp`, `deepspeed`) | `"auto"` |
| `precision` | Training precision (`32-true`, `16-mixed`, `bf16-mixed`) | `"32-true"` |
| `gradient_clip_val` | Gradient norm clipping value | `None` |
| `accumulate_grad_batches` | Gradient accumulation steps | 1 |
| `log_every_n_steps` | Logging frequency | 50 |
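Gradient accumulation and data-parallel training compound: under DDP, each optimizer step sees per-device batch size × accumulation steps × number of devices samples. A quick arithmetic check (plain Python, no Lightning required; the helper name is illustrative):

```python
def effective_batch_size(per_device_batch: int, accumulate_grad_batches: int, devices: int) -> int:
    """Samples contributing to each optimizer step under DDP with gradient accumulation."""
    return per_device_batch * accumulate_grad_batches * devices


# Matches the Trainer above: batch_size=64 per GPU, accumulate_grad_batches=4, devices=2
print(effective_batch_size(64, 4, 2))  # → 512
```

If you scale up `devices` or `accumulate_grad_batches`, remember to revisit the learning rate, since the effective batch size grows with both.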
Best Practices
- **Use `save_hyperparameters()` in `__init__`** — Call `self.save_hyperparameters()` to automatically save all constructor arguments. This enables checkpoint loading with `Model.load_from_checkpoint(path)` without manually specifying hyperparameters, and logs them to your experiment tracker.
- **Let Lightning handle device placement** — Never call `.to(device)` or `.cuda()` manually. Lightning places modules and batches automatically based on the `accelerator` setting; manual device management conflicts with this and breaks multi-GPU training.
- **Use `self.log()` for all metrics** — Metrics logged with `self.log()` appear in your logger (TensorBoard, W&B), are available to callbacks (early stopping, checkpointing), and are reduced correctly across distributed training. Don't use `print()` or manual logging.
- **Start with a single GPU, then scale** — Develop and debug on one GPU with `Trainer(accelerator="gpu", devices=1)`. Once training works, add `devices=2, strategy="ddp"` for multi-GPU. Lightning handles data distribution and gradient synchronization automatically.
- **Use `LightningDataModule` for data organization** — Wrap your data loading in a `LightningDataModule` class with `prepare_data()`, `setup()`, `train_dataloader()`, and `val_dataloader()` methods. This keeps data logic separate from model logic and enables data reuse.
Common Issues
**Training runs but validation metrics never improve** — Confirm that `validation_step` is actually being called (validation runs every epoch by default; `check_val_every_n_epoch` controls the interval) and that a validation DataLoader was passed to `fit()`. Also verify that shuffling is disabled on the validation DataLoader (`shuffle=False`) — shuffled validation causes inconsistent metrics between epochs.

**Multi-GPU training hangs at start** — The DDP strategy requires all processes to synchronize; if one GPU process crashes, the others hang indefinitely. Check that all GPUs have sufficient memory, that NCCL is installed correctly, and that no GPU is occupied by another process. Start with `devices=1` to verify that single-GPU training works first.

**Checkpoint loading fails with missing keys** — The checkpoint was saved with a different model architecture. Use `strict=False` in `load_from_checkpoint()` to load partial weights, or ensure the model class matches exactly. Inspect the checkpoint's `hyper_parameters` entry to verify the expected architecture.
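The `strict=False` behavior is easiest to see with plain PyTorch state dicts, which Lightning checkpoints wrap: `load_state_dict` reports which keys were missing or unexpected instead of raising. A minimal sketch (the two toy models are hypothetical stand-ins for an old and a new architecture):

```python
import torch

# "Checkpoint" saved from an older, one-layer architecture.
old = torch.nn.Sequential(torch.nn.Linear(4, 8))
ckpt = old.state_dict()

# The current architecture added a second layer; strict loading would raise.
new = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
result = new.load_state_dict(ckpt, strict=False)

# Keys present in the model but absent from the checkpoint are reported,
# not silently initialized — inspect them before trusting the load.
print(result.missing_keys)  # → ['1.weight', '1.bias']
```

After a partial load, the reported layers still hold their random initialization, so plan to fine-tune or re-train them rather than using the model as-is.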