Comprehensive MLOps with Weights & Biases
Overview
Weights & Biases (W&B) is the leading MLOps platform for experiment tracking, hyperparameter optimization, model versioning, and team collaboration across the entire machine learning lifecycle. With over 200,000 ML practitioners and 100+ framework integrations, W&B provides a unified dashboard for logging metrics, comparing runs, managing artifacts, and orchestrating hyperparameter sweeps. This template covers everything from basic experiment tracking through advanced production MLOps workflows including model registry, dataset versioning, and automated evaluation pipelines.
When to Use
- Experiment tracking: You need to log, visualize, and compare training metrics across dozens or hundreds of runs
- Hyperparameter optimization: You want automated sweep strategies (Bayesian, grid, random) with early stopping
- Model registry: You need versioned model artifacts with lineage tracking from data to deployment
- Team collaboration: Multiple researchers need shared dashboards, reports, and reproducible experiments
- Production monitoring: You need to track model performance, data drift, and pipeline health over time
- Artifact management: You need to version datasets, models, and evaluation results with full provenance
Choose alternatives when you need fully open-source self-hosted solutions (consider MLflow), or when you only need basic TensorBoard-style visualization.
Quick Start
```bash
# Install W&B
pip install wandb

# Authenticate (opens browser for API key)
wandb login

# Or set API key via environment variable
export WANDB_API_KEY=your_api_key_here
```
```python
import wandb

# Initialize a tracked run
run = wandb.init(
    project="my-first-project",
    config={
        "learning_rate": 0.001,
        "epochs": 10,
        "batch_size": 32,
        "architecture": "ResNet50"
    }
)

# Log metrics during training
for epoch in range(10):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)
    wandb.log({
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss
    })

# Finish the run
wandb.finish()
```
Core Concepts
Projects and Runs
Every experiment execution is a Run that belongs to a Project. Runs capture configuration, metrics, system information, and artifacts automatically.
```python
run = wandb.init(
    project="image-classification",
    name="resnet50-baseline-v2",        # Human-readable run name
    tags=["baseline", "resnet", "v2"],  # Filterable tags
    group="architecture-comparison",    # Group related runs
    job_type="train",                   # Categorize run type
    notes="Testing ResNet50 with augmented data pipeline"
)

# Access run metadata
print(f"Run ID: {run.id}")
print(f"Run URL: {run.url}")
print(f"Run name: {run.name}")
```
Configuration Tracking
W&B captures hyperparameters and makes them searchable and comparable across all runs.
```python
config = {
    "model": {
        "architecture": "ResNet50",
        "pretrained": True,
        "dropout": 0.3
    },
    "training": {
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 50,
        "optimizer": "AdamW",
        "weight_decay": 0.01
    },
    "data": {
        "dataset": "ImageNet",
        "augmentation": "randaugment",
        "image_size": 224
    }
}

wandb.init(project="my-project", config=config)

# Access nested config
lr = wandb.config["training"]["learning_rate"]

# Update config mid-run
wandb.config.update({"training.warmup_steps": 500})
```
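The W&B UI displays nested configs as dot-separated keys (e.g. `config.training.learning_rate`) for filtering and comparison. The sketch below mimics that flattening locally; `flatten_config` is a hypothetical helper, not part of the wandb API:

```python
def flatten_config(config, parent_key=""):
    """Flatten a nested config dict into dot-separated keys,
    mirroring how W&B displays nested configs in the UI."""
    items = {}
    for key, value in config.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested sections, carrying the prefix along
            items.update(flatten_config(value, full_key))
        else:
            items[full_key] = value
    return items

config = {"training": {"learning_rate": 0.001, "batch_size": 32}}
print(flatten_config(config))
# {'training.learning_rate': 0.001, 'training.batch_size': 32}
```

This is also the key format `wandb.config.update` accepts for targeted updates, as in the `"training.warmup_steps"` example above.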
Metric Logging
```python
# Log scalar metrics
wandb.log({"loss": 0.5, "accuracy": 0.92})

# Log with explicit step
wandb.log({"loss": 0.3}, step=500)

# Log images
wandb.log({"predictions": [
    wandb.Image(img, caption=f"Pred: {pred}")
    for img, pred in zip(images, predictions)
]})

# Log histograms
wandb.log({"weight_distribution": wandb.Histogram(model.fc.weight.data.cpu())})

# Log confusion matrix
wandb.log({"conf_mat": wandb.plot.confusion_matrix(
    probs=None,
    y_true=ground_truth,
    preds=predictions,
    class_names=["cat", "dog", "bird"]
)})

# Log tables for detailed inspection
table = wandb.Table(
    columns=["image", "prediction", "ground_truth", "confidence"],
    data=[
        [wandb.Image(img), pred, gt, conf]
        for img, pred, gt, conf in zip(images, preds, gts, confs)
    ]
)
wandb.log({"prediction_table": table})
```
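To keep the `train/`, `val/`, `test/` prefixes consistent across a codebase, a small helper can namespace a metrics dict before it reaches `wandb.log`. This is a hypothetical convenience function, not part of the wandb API:

```python
def namespaced(prefix, metrics):
    """Prefix every metric name with 'prefix/' so charts group
    automatically under that section in the W&B dashboard."""
    return {f"{prefix}/{name}": value for name, value in metrics.items()}

print(namespaced("train", {"loss": 0.41, "accuracy": 0.88}))
# {'train/loss': 0.41, 'train/accuracy': 0.88}

# Usage (assumes an active run):
# wandb.log(namespaced("val", {"loss": val_loss, "accuracy": val_acc}))
```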
Hyperparameter Sweeps
```python
# Define sweep configuration
sweep_config = {
    "method": "bayes",  # Bayesian optimization
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-2
        },
        "batch_size": {"values": [16, 32, 64, 128]},
        "optimizer": {"values": ["adam", "sgd", "adamw"]},
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5
        }
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 5,
        "eta": 3
    }
}

# Create sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")

# Define training function
def train():
    run = wandb.init()
    config = wandb.config
    model = build_model(config)
    optimizer = get_optimizer(config.optimizer, config.learning_rate)
    for epoch in range(50):
        train_loss = train_epoch(model, optimizer, config.batch_size)
        val_acc = validate(model)
        wandb.log({"train/loss": train_loss, "val/accuracy": val_acc})

# Launch sweep agent (runs 100 trials)
wandb.agent(sweep_id, function=train, count=100)
```
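A malformed sweep config only fails once `wandb.sweep` is called, so it can help to sanity-check the dict locally first. The sketch below is a hypothetical validator covering just the fields used above, not an exhaustive check of the sweep schema:

```python
def validate_sweep_config(cfg):
    """Return a list of problems found in a W&B-style sweep config
    dict; an empty list means the basic fields look well-formed."""
    errors = []
    if cfg.get("method") not in {"bayes", "grid", "random"}:
        errors.append("method must be 'bayes', 'grid', or 'random'")
    if cfg.get("metric", {}).get("goal") not in {"minimize", "maximize"}:
        errors.append("metric.goal must be 'minimize' or 'maximize'")
    if not cfg.get("parameters"):
        errors.append("parameters must be a non-empty dict")
    return errors

good = {
    "method": "bayes",
    "metric": {"name": "val/accuracy", "goal": "maximize"},
    "parameters": {"learning_rate": {"values": [1e-3, 1e-4]}},
}
assert validate_sweep_config(good) == []
print(validate_sweep_config({"method": "genetic"}))
# reports three problems: bad method, missing goal, missing parameters
```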
Artifacts and Model Registry
```python
# Log a dataset artifact
dataset_artifact = wandb.Artifact(
    name="training-data-v3",
    type="dataset",
    description="Cleaned and augmented training split",
    metadata={"size": "50K images", "split": "train", "version": "3.0"}
)
dataset_artifact.add_dir("data/train/")
wandb.log_artifact(dataset_artifact)

# Log a model artifact
model_artifact = wandb.Artifact(
    name="resnet50-classifier",
    type="model",
    metadata={"accuracy": 0.95, "architecture": "ResNet50"}
)
model_artifact.add_file("checkpoints/best_model.pth")
wandb.log_artifact(model_artifact, aliases=["best", "production"])

# Use artifacts in downstream runs
run = wandb.init(project="evaluation")
artifact = run.use_artifact("training-data-v3:latest")
data_dir = artifact.download()

# Link model to registry
run.link_artifact(model_artifact, "model-registry/production-classifier")
```
Framework Integrations
```python
# --- HuggingFace Transformers ---
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    report_to="wandb",
    run_name="bert-finetune-v1",
    logging_steps=50,
    save_steps=500
)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
trainer.train()

# --- PyTorch Lightning ---
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="lightning-project", log_model="all")
trainer = pl.Trainer(logger=wandb_logger, max_epochs=10)
trainer.fit(model, datamodule=dm)

# --- Keras / TensorFlow ---
from wandb.integration.keras import WandbMetricsLogger

model.fit(x_train, y_train, callbacks=[WandbMetricsLogger()])
```
Configuration Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `project` | str | None | Project name for grouping runs |
| `name` | str | Auto | Human-readable run name |
| `config` | dict | None | Hyperparameters and settings |
| `tags` | list | `[]` | Filterable labels for runs |
| `group` | str | None | Group name for related runs |
| `job_type` | str | None | Run category (train, eval, etc.) |
| `mode` | str | `"online"` | `"online"`, `"offline"`, or `"disabled"` |
| `resume` | str | None | `"allow"`, `"must"`, `"never"`, or a run ID |
| `save_code` | bool | False | Save main script and git patch |
| `notes` | str | None | Markdown notes for the run |
| Sweep Parameter | Values | Description |
|---|---|---|
| `method` | `"bayes"`, `"grid"`, `"random"` | Search strategy |
| `metric.goal` | `"minimize"`, `"maximize"` | Optimization direction |
| `early_terminate.type` | `"hyperband"` | Early stopping strategy |
| `early_terminate.eta` | int | Reduction factor for Hyperband |
| `parameters.*.distribution` | `"uniform"`, `"log_uniform_values"`, `"normal"`, `"categorical"` | Parameter sampling |
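For `method: "grid"`, the total number of trials is the product of the value counts of the swept parameters, which is worth estimating before launching. A quick sketch (the helper is hypothetical and assumes every parameter lists explicit `values`; continuous distributions are not enumerable and would need discretizing first):

```python
from math import prod

def grid_size(parameters):
    """Number of trials a grid sweep will run over parameters
    that each enumerate explicit 'values'."""
    return prod(len(spec["values"]) for spec in parameters.values())

params = {
    "batch_size": {"values": [16, 32, 64, 128]},
    "optimizer": {"values": ["adam", "sgd", "adamw"]},
}
print(grid_size(params))  # 4 * 3 = 12 trials
```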
Best Practices
- Use hierarchical metric names: Prefix metrics with `train/`, `val/`, `test/` to organize dashboards cleanly and enable automatic grouping in the W&B UI.
- Log system metrics alongside training metrics: Track GPU utilization, memory usage, and throughput to identify bottlenecks early and correlate hardware performance with model quality.
- Version everything as artifacts: Datasets, model checkpoints, evaluation results, and configuration files should all be versioned artifacts with metadata for complete reproducibility.
- Use sweep early termination: Configure Hyperband early stopping in sweeps to kill underperforming trials quickly and focus compute budget on promising hyperparameter regions.
- Create shareable reports: Use W&B Reports to combine charts, tables, and markdown narratives into documents that serve as experiment summaries for team reviews.
- Tag runs systematically: Establish a tagging convention (e.g., experiment phase, model family, dataset version) so runs remain filterable as your project grows to hundreds of experiments.
- Set `save_code=True`: Enable code saving to capture the exact script and git diff for every run, ensuring you can always reproduce any experiment.
- Use `wandb.watch()` for gradient tracking: Call `wandb.watch(model)` to log gradient and parameter histograms, helping diagnose vanishing or exploding gradients during training.
- Configure offline mode for clusters: Use `WANDB_MODE=offline` for training on nodes without internet, then sync with `wandb sync` when connectivity is available.
- Separate sweep agents from training logic: Keep your training function pure and parameterized by `wandb.config` so the same code works for manual runs and automated sweeps.
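For the offline-cluster practice, one common pattern is to set the relevant environment variables in the job script before `wandb.init()` runs. `WANDB_MODE` is documented; the `/scratch/wandb` path is an assumed cluster scratch directory, and `WANDB_DIR` controls where run files are written:

```python
import os

# Force offline logging on nodes without internet access;
# runs are written to local disk instead of streamed to the server.
os.environ["WANDB_MODE"] = "offline"
os.environ["WANDB_DIR"] = "/scratch/wandb"  # assumed scratch path

# import wandb                             # init picks up the env vars
# run = wandb.init(project="cluster-training")
# ... train and wandb.log(...) as usual ...
# Later, from a connected node: `wandb sync /scratch/wandb/wandb/offline-run-*`
```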
Troubleshooting
Runs not appearing in dashboard
Ensure `wandb.init()` is called before any logging. Check that `WANDB_MODE` is not set to `disabled`. Verify your API key with `wandb login --verify`.
Sweep agent exits immediately
Confirm the sweep ID is correct and the sweep has not been stopped in the UI. Check that the training function calls `wandb.init()` at the start.
Large artifacts failing to upload
Break large datasets into multiple smaller artifacts. Use `artifact.add_reference()` for cloud-stored data instead of uploading directly. Increase the timeout with `WANDB_HTTP_TIMEOUT=300`.
Offline runs not syncing
Run `wandb sync --sync-all` in the directory containing the `wandb/` folder. Ensure the offline runs directory has not been moved or renamed.
Duplicate metric names causing chart issues
Use consistent metric naming across all runs in a project. Avoid logging the same metric name with different step frequencies in a single run.
Memory usage growing during long training
Call `wandb.log()` with `commit=True` (the default) to flush data. For very long runs, consider increasing the `_stats_sample_rate_seconds` setting.
Config not showing in UI
Pass config to `wandb.init(config=...)` rather than logging it as a metric. For nested configs, use dot notation in the UI filter: `config.model.architecture`.