Complete ML Experiment Tracking Toolkit
Boost productivity with intelligent tracking of machine learning experiments and dashboards. Built for Claude Code with best practices and real-world patterns.
ML Experiment Tracking Toolkit
Comprehensive machine learning experiment tracking and management framework covering experiment logging, hyperparameter tracking, model comparison, reproducibility, and collaboration for ML teams.
When to Use This Skill
Choose ML Experiment Tracking when:
- Running multiple model training experiments with different configurations
- Comparing model performance across hyperparameter variations
- Ensuring ML experiment reproducibility across team members
- Managing model artifacts, metrics, and metadata
- Building MLOps pipelines with experiment lineage
Consider alternatives when:
- Need simple script logging — use built-in Python logging
- Need data pipeline orchestration — use Airflow or Prefect
- Need model serving — use serving-specific tools (TFServing, Triton)
Quick Start
```bash
# Install experiment tracking tools
pip install mlflow wandb

# Activate experiment tracking
claude skill activate complete-ml-experiment-tracking-toolkit

# Set up tracking
claude "Set up MLflow experiment tracking for our NLP model training pipeline"
```
Example: MLflow Experiment Tracking
```python
import mlflow
from mlflow.tracking import MlflowClient

# Set up experiment
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("sentiment-classification")

# Run experiment with auto-logging
with mlflow.start_run(run_name="distilbert-lr-2e5") as run:
    # Log parameters
    mlflow.log_params({
        "model": "distilbert-base-uncased",
        "learning_rate": 2e-5,
        "batch_size": 16,
        "epochs": 3,
        "max_length": 256,
        "weight_decay": 0.01,
    })

    # Train model
    trainer.train()
    metrics = trainer.evaluate()

    # Log metrics
    mlflow.log_metrics({
        "accuracy": metrics["eval_accuracy"],
        "f1": metrics["eval_f1"],
        "loss": metrics["eval_loss"],
        "train_time_seconds": training_time,
    })

    # Log model artifact
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        registered_model_name="sentiment-classifier",
    )

    # Log additional artifacts
    mlflow.log_artifact("training_config.yaml")
    mlflow.log_artifact("confusion_matrix.png")

    # Set tags for organization
    mlflow.set_tags({
        "dataset": "imdb-v2",
        "team": "nlp",
        "stage": "development",
    })

    print(f"Run ID: {run.info.run_id}")
```
Core Concepts
Experiment Tracking Components
| Component | Description | Example |
|---|---|---|
| Experiment | Group of related runs | "sentiment-classification" |
| Run | Single training execution | "distilbert-lr-2e5-epoch3" |
| Parameters | Input configuration values | learning_rate=2e-5, batch_size=16 |
| Metrics | Output performance measurements | accuracy=0.92, f1=0.89 |
| Artifacts | Files produced by the run | Model weights, plots, configs |
| Tags | Metadata labels for organization | team=nlp, stage=production |
| Model Registry | Versioned model catalog | sentiment-v1, sentiment-v2 |
Platform Comparison
| Feature | MLflow | W&B | Neptune | ClearML |
|---|---|---|---|---|
| Self-hosted | Yes | No | Yes | Yes |
| Auto-logging | Good | Excellent | Good | Good |
| Visualization | Basic | Excellent | Good | Good |
| Collaboration | Basic | Excellent | Good | Good |
| Model Registry | Yes | Yes | Yes | Yes |
| Cost | Free (OSS) | Free tier + paid | Free tier + paid | Free (OSS) |
Configuration
| Parameter | Description | Default |
|---|---|---|
| tracking_uri | MLflow tracking server URL | ./mlruns (local) |
| experiment_name | Default experiment name | Required |
| auto_log | Enable framework auto-logging | true |
| log_models | Auto-log model artifacts | true |
| artifact_storage | Artifact store: local, s3, gcs | local |
| registry_uri | Model registry location | Same as tracking |
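The table above can be mapped onto MLflow's standard client environment variables. A minimal sketch — the server URL and experiment name are placeholders, and only tracking_uri, experiment_name, and registry_uri have direct env-var equivalents:

```shell
# Point the MLflow client at a remote tracking server (placeholder URL)
export MLFLOW_TRACKING_URI="http://mlflow-server:5000"

# Default experiment for new runs
export MLFLOW_EXPERIMENT_NAME="sentiment-classification"

# Registry location defaults to the tracking server when unset
export MLFLOW_REGISTRY_URI="$MLFLOW_TRACKING_URI"
```

auto_log and log_models are enabled in code instead, e.g. via mlflow.autolog(log_models=True).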
Best Practices
- Log everything — storage is cheaper than re-running experiments. Log all hyperparameters, metrics at every epoch, system metrics (GPU utilization, memory), random seeds, and git commit hashes. You never know which detail will be important for reproducing a result.
- Use auto-logging for framework integration — MLflow and W&B auto-log training metrics, model parameters, and artifacts for PyTorch, TensorFlow, and HuggingFace. Enable auto-logging first, then add custom logging for domain-specific metrics.
- Set random seeds and log them for reproducibility — Set seeds for Python random, NumPy, PyTorch, and CUDA. Log the seed value with each run. Without fixed seeds, you can't reproduce exact results even with identical parameters.
- Compare experiments systematically, not manually — Use the tracking UI's comparison features to view metrics side-by-side across runs. Create automated reports that compare new runs against the current best model on all key metrics.
- Register production models with stage transitions — Use the model registry to move models through stages: None → Staging → Production → Archived. Each transition should require approval and include automated validation checks.
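The seed-setting practice above can be sketched as a small helper. This sketch uses only the standard library; the commented lines show where NumPy/PyTorch seeding would go in real training code:

```python
import random

def set_seed(seed: int) -> int:
    """Seed Python's RNG; in real training code also seed the
    framework RNGs (commented out here to stay stdlib-only)."""
    random.seed(seed)
    # np.random.seed(seed)              # if using NumPy
    # torch.manual_seed(seed)           # if using PyTorch
    # torch.cuda.manual_seed_all(seed)  # if using CUDA
    return seed  # return it so the caller can log it with the run

# Identical seeds reproduce identical draws
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
assert first == second
```

Returning the seed makes it natural to log it in the same call that sets it, e.g. mlflow.log_param("seed", set_seed(42)).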
Common Issues
Experiment tracking server storage grows uncontrollably. Implement artifact retention policies. Delete artifacts from old, unsuccessful runs. Compress large model artifacts. Use S3/GCS with lifecycle policies for automatic cleanup of artifacts older than N days.
Can't reproduce results despite identical logged parameters. The usual culprits are unlogged sources of randomness: data shuffling order, GPU non-determinism, library version differences, and hardware-specific behavior. Log the full pip freeze output, set torch.use_deterministic_algorithms(True), and use containerized training environments.
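One way to capture the pip freeze equivalent at run time is to snapshot installed package versions from inside the training process. A stdlib-only sketch (the returned dict could then be logged as run tags or a JSON artifact):

```python
import platform
from importlib import metadata

def capture_environment() -> dict:
    """Snapshot interpreter, OS, and package versions so that
    'identical parameters' also means an identical environment."""
    env = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    # Record every installed distribution (the pip-freeze equivalent)
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            env[f"pkg.{name.lower()}"] = dist.version
    return env

env = capture_environment()
print(f"captured {len(env)} environment entries")
```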
Too many experiments make it hard to find relevant runs. Use consistent tagging conventions (team, project, stage) and experiment naming. Create saved views or dashboards for frequently compared experiments. Archive completed experiments to reduce clutter.
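A consistent naming convention is easy to enforce with a tiny helper. This is an illustrative sketch only; the team/model/parameter layout is an assumption, not a standard:

```python
def run_name(team: str, model: str, **params) -> str:
    """Build a consistent run name like 'nlp/distilbert/bs=16_lr=2e-05'.
    Parameters are sorted so equivalent configs always produce
    the same name, making runs easy to search and deduplicate."""
    parts = "_".join(f"{k}={params[k]}" for k in sorted(params))
    return f"{team}/{model}/{parts}" if parts else f"{team}/{model}"

print(run_name("nlp", "distilbert", lr=2e-5, bs=16))
# → nlp/distilbert/bs=16_lr=2e-05
```

The same helper can feed mlflow.start_run(run_name=...) so names and tags never drift apart across team members.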