Complete ML Experiment Tracking Toolkit
Boost productivity with intelligent tracking of machine learning experiments and dashboards. Built for Claude Code with best practices and real-world patterns.
ML Experiment Tracking Toolkit
Comprehensive machine learning experiment tracking and management framework covering experiment logging, hyperparameter tracking, model comparison, reproducibility, and collaboration for ML teams.
When to Use This Skill
Choose ML Experiment Tracking when:
- Running multiple model training experiments with different configurations
- Comparing model performance across hyperparameter variations
- Ensuring ML experiment reproducibility across team members
- Managing model artifacts, metrics, and metadata
- Building MLOps pipelines with experiment lineage
Consider alternatives when:
- Need simple script logging — use built-in Python logging
- Need data pipeline orchestration — use Airflow or Prefect
- Need model serving — use serving-specific tools (TFServing, Triton)
Quick Start
```bash
# Install experiment tracking tools
pip install mlflow wandb

# Activate experiment tracking
claude skill activate complete-ml-experiment-tracking-toolkit

# Set up tracking
claude "Set up MLflow experiment tracking for our NLP model training pipeline"
```
Example: MLflow Experiment Tracking
```python
import mlflow
from mlflow.tracking import MlflowClient

# Set up experiment
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("sentiment-classification")

# Run experiment with auto-logging
with mlflow.start_run(run_name="distilbert-lr-2e5") as run:
    # Log parameters
    mlflow.log_params({
        "model": "distilbert-base-uncased",
        "learning_rate": 2e-5,
        "batch_size": 16,
        "epochs": 3,
        "max_length": 256,
        "weight_decay": 0.01,
    })

    # Train model
    trainer.train()
    metrics = trainer.evaluate()

    # Log metrics
    mlflow.log_metrics({
        "accuracy": metrics["eval_accuracy"],
        "f1": metrics["eval_f1"],
        "loss": metrics["eval_loss"],
        "train_time_seconds": training_time,
    })

    # Log model artifact
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        registered_model_name="sentiment-classifier",
    )

    # Log additional artifacts
    mlflow.log_artifact("training_config.yaml")
    mlflow.log_artifact("confusion_matrix.png")

    # Set tags for organization
    mlflow.set_tags({
        "dataset": "imdb-v2",
        "team": "nlp",
        "stage": "development",
    })

    print(f"Run ID: {run.info.run_id}")
```
Core Concepts
Experiment Tracking Components
| Component | Description | Example |
|---|---|---|
| Experiment | Group of related runs | "sentiment-classification" |
| Run | Single training execution | "distilbert-lr-2e5-epoch3" |
| Parameters | Input configuration values | learning_rate=2e-5, batch_size=16 |
| Metrics | Output performance measurements | accuracy=0.92, f1=0.89 |
| Artifacts | Files produced by the run | Model weights, plots, configs |
| Tags | Metadata labels for organization | team=nlp, stage=production |
| Model Registry | Versioned model catalog | sentiment-v1, sentiment-v2 |
Platform Comparison
| Feature | MLflow | W&B | Neptune | ClearML |
|---|---|---|---|---|
| Self-hosted | Yes | No | Yes | Yes |
| Auto-logging | Good | Excellent | Good | Good |
| Visualization | Basic | Excellent | Good | Good |
| Collaboration | Basic | Excellent | Good | Good |
| Model Registry | Yes | Yes | Yes | Yes |
| Cost | Free (OSS) | Free tier + paid | Free tier + paid | Free (OSS) |
Configuration
| Parameter | Description | Default |
|---|---|---|
| tracking_uri | MLflow tracking server URL | ./mlruns (local) |
| experiment_name | Default experiment name | Required |
| auto_log | Enable framework auto-logging | true |
| log_models | Auto-log model artifacts | true |
| artifact_storage | Artifact store: local, s3, gcs | local |
| registry_uri | Model registry location | Same as tracking |
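The table above can be mapped onto MLflow's standard client environment variables. A minimal sketch — the server URL and experiment name are placeholders, and only tracking_uri, experiment_name, and registry_uri have direct env-var equivalents:

```shell
# Point the MLflow client at a remote tracking server (placeholder URL)
export MLFLOW_TRACKING_URI="http://mlflow-server:5000"

# Default experiment for new runs
export MLFLOW_EXPERIMENT_NAME="sentiment-classification"

# Registry location defaults to the tracking server when unset
export MLFLOW_REGISTRY_URI="$MLFLOW_TRACKING_URI"
```

auto_log and log_models are enabled in code instead, e.g. via mlflow.autolog(log_models=True).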
Best Practices
- Log everything — storage is cheaper than re-running experiments. Log all hyperparameters, metrics at every epoch, system metrics (GPU utilization, memory), random seeds, and git commit hashes. You never know which detail will be important for reproducing a result.
- Use auto-logging for framework integration — MLflow and W&B auto-log training metrics, model parameters, and artifacts for PyTorch, TensorFlow, and HuggingFace. Enable auto-logging first, then add custom logging for domain-specific metrics.
- Set random seeds and log them for reproducibility — Set seeds for Python random, NumPy, PyTorch, and CUDA. Log the seed value with each run. Without fixed seeds, you can't reproduce exact results even with identical parameters.
- Compare experiments systematically, not manually — Use the tracking UI's comparison features to view metrics side-by-side across runs. Create automated reports that compare new runs against the current best model on all key metrics.
- Register production models with stage transitions — Use the model registry to move models through stages: None → Staging → Production → Archived. Each transition should require approval and include automated validation checks.
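The seed-setting practice above can be sketched as a small helper. This sketch uses only the standard library; the commented lines show where NumPy/PyTorch seeding would go in real training code:

```python
import random

def set_seed(seed: int) -> int:
    """Seed Python's RNG; in real training code also seed the
    framework RNGs (commented out here to stay stdlib-only)."""
    random.seed(seed)
    # np.random.seed(seed)              # if using NumPy
    # torch.manual_seed(seed)           # if using PyTorch
    # torch.cuda.manual_seed_all(seed)  # if using CUDA
    return seed  # return it so the caller can log it with the run

# Identical seeds reproduce identical draws
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
assert first == second
```

Returning the seed makes it natural to log it in the same call that sets it, e.g. mlflow.log_param("seed", set_seed(42)).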
Common Issues
Experiment tracking server storage grows uncontrollably. Implement artifact retention policies. Delete artifacts from old, unsuccessful runs. Compress large model artifacts. Use S3/GCS with lifecycle policies for automatic cleanup of artifacts older than N days.
Can't reproduce results despite identical logged parameters. The usual culprits are unlogged sources of randomness: data shuffling order, GPU non-determinism, library version differences, and hardware-specific behavior. Log the full pip freeze output, set torch.use_deterministic_algorithms(True), and use containerized training environments.
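One way to capture the pip freeze equivalent at run time is to snapshot installed package versions from inside the training process. A stdlib-only sketch (the returned dict could then be logged as run tags or a JSON artifact):

```python
import platform
from importlib import metadata

def capture_environment() -> dict:
    """Snapshot interpreter, OS, and package versions so that
    'identical parameters' also means an identical environment."""
    env = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    # Record every installed distribution (the pip-freeze equivalent)
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            env[f"pkg.{name.lower()}"] = dist.version
    return env

env = capture_environment()
print(f"captured {len(env)} environment entries")
```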
Too many experiments make it hard to find relevant runs. Use consistent tagging conventions (team, project, stage) and experiment naming. Create saved views or dashboards for frequently compared experiments. Archive completed experiments to reduce clutter.
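A consistent naming convention is easy to enforce with a tiny helper. This is an illustrative sketch only; the team/model/parameter layout is an assumption, not a standard:

```python
def run_name(team: str, model: str, **params) -> str:
    """Build a consistent run name like 'nlp/distilbert/bs=16_lr=2e-05'.
    Parameters are sorted so equivalent configs always produce
    the same name, making runs easy to search and deduplicate."""
    parts = "_".join(f"{k}={params[k]}" for k in sorted(params))
    return f"{team}/{model}/{parts}" if parts else f"{team}/{model}"

print(run_name("nlp", "distilbert", lr=2e-5, bs=16))
# → nlp/distilbert/bs=16_lr=2e-05
```

The same helper can feed mlflow.start_run(run_name=...) so names and tags never drift apart across team members.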