
Complete ML Experiment Tracking Toolkit

Track machine learning experiments intelligently with dashboards. Built for Claude Code with best practices and real-world patterns.


ML Experiment Tracking Toolkit

Comprehensive machine learning experiment tracking and management framework covering experiment logging, hyperparameter tracking, model comparison, reproducibility, and collaboration for ML teams.

When to Use This Skill

Choose ML Experiment Tracking when:

  • Running multiple model training experiments with different configurations
  • Comparing model performance across hyperparameter variations
  • Ensuring ML experiment reproducibility across team members
  • Managing model artifacts, metrics, and metadata
  • Building MLOps pipelines with experiment lineage

Consider alternatives when:

  • Need simple script logging — use built-in Python logging
  • Need data pipeline orchestration — use Airflow or Prefect
  • Need model serving — use serving-specific tools (TFServing, Triton)

Quick Start

```bash
# Install experiment tracking tools
pip install mlflow wandb

# Activate experiment tracking
claude skill activate complete-ml-experiment-tracking-toolkit

# Set up tracking
claude "Set up MLflow experiment tracking for our NLP model training pipeline"
```

Example: MLflow Experiment Tracking

```python
import mlflow
from mlflow.tracking import MlflowClient

# Set up experiment
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("sentiment-classification")

# Run experiment with auto-logging
with mlflow.start_run(run_name="distilbert-lr-2e5") as run:
    # Log parameters
    mlflow.log_params({
        "model": "distilbert-base-uncased",
        "learning_rate": 2e-5,
        "batch_size": 16,
        "epochs": 3,
        "max_length": 256,
        "weight_decay": 0.01,
    })

    # Train model
    trainer.train()
    metrics = trainer.evaluate()

    # Log metrics
    mlflow.log_metrics({
        "accuracy": metrics["eval_accuracy"],
        "f1": metrics["eval_f1"],
        "loss": metrics["eval_loss"],
        "train_time_seconds": training_time,
    })

    # Log model artifact
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        registered_model_name="sentiment-classifier",
    )

    # Log additional artifacts
    mlflow.log_artifact("training_config.yaml")
    mlflow.log_artifact("confusion_matrix.png")

    # Set tags for organization
    mlflow.set_tags({
        "dataset": "imdb-v2",
        "team": "nlp",
        "stage": "development",
    })

    print(f"Run ID: {run.info.run_id}")
```

Core Concepts

Experiment Tracking Components

| Component | Description | Example |
| --- | --- | --- |
| Experiment | Group of related runs | "sentiment-classification" |
| Run | Single training execution | "distilbert-lr-2e5-epoch3" |
| Parameters | Input configuration values | learning_rate=2e-5, batch_size=16 |
| Metrics | Output performance measurements | accuracy=0.92, f1=0.89 |
| Artifacts | Files produced by the run | Model weights, plots, configs |
| Tags | Metadata labels for organization | team=nlp, stage=production |
| Model Registry | Versioned model catalog | sentiment-v1, sentiment-v2 |
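As a rough illustration of how these components fit together, the pieces of one run can be modeled as a single record. This sketch is purely conceptual; the `RunRecord` class and its fields are hypothetical and not part of any tracking library's API:

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    """Hypothetical sketch of what one tracked run stores."""
    experiment: str                                 # group of related runs
    run_name: str                                   # single training execution
    params: dict = field(default_factory=dict)      # input configuration values
    metrics: dict = field(default_factory=dict)     # output measurements
    artifacts: list = field(default_factory=list)   # file paths produced by the run
    tags: dict = field(default_factory=dict)        # metadata labels

run = RunRecord(
    experiment="sentiment-classification",
    run_name="distilbert-lr-2e5",
    params={"learning_rate": 2e-5, "batch_size": 16},
    metrics={"accuracy": 0.92, "f1": 0.89},
    artifacts=["model/weights.bin", "confusion_matrix.png"],
    tags={"team": "nlp", "stage": "development"},
)
print(run.experiment, run.metrics["accuracy"])
```

Real trackers add run IDs, timestamps, and lineage on top of this, but the mental model of a run as parameters in, metrics and artifacts out, tags on the side carries across MLflow, W&B, Neptune, and ClearML.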

Platform Comparison

| Feature | MLflow | W&B | Neptune | ClearML |
| --- | --- | --- | --- | --- |
| Self-hosted | Yes | No | Yes | Yes |
| Auto-logging | Good | Excellent | Good | Good |
| Visualization | Basic | Excellent | Good | Good |
| Collaboration | Basic | Excellent | Good | Good |
| Model Registry | Yes | Yes | Yes | Yes |
| Cost | Free (OSS) | Free tier + paid | Free tier + paid | Free (OSS) |

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| tracking_uri | MLflow tracking server URL | ./mlruns (local) |
| experiment_name | Default experiment name | Required |
| auto_log | Enable framework auto-logging | true |
| log_models | Auto-log model artifacts | true |
| artifact_storage | Artifact store: local, s3, gcs | local |
| registry_uri | Model registry location | Same as tracking |
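A minimal sketch of how these settings might be collected before initializing a tracker. The parameter names follow the table above, but the `build_config` helper itself is hypothetical, not part of MLflow:

```python
DEFAULTS = {
    "tracking_uri": "./mlruns",   # local file store
    "auto_log": True,             # enable framework auto-logging
    "log_models": True,           # auto-log model artifacts
    "artifact_storage": "local",  # or "s3", "gcs"
    "registry_uri": None,         # None -> same as tracking_uri
}

def build_config(experiment_name, **overrides):
    """experiment_name is required; everything else falls back to defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    cfg = {**DEFAULTS, **overrides, "experiment_name": experiment_name}
    if cfg["registry_uri"] is None:
        cfg["registry_uri"] = cfg["tracking_uri"]
    return cfg

cfg = build_config("sentiment-classification", artifact_storage="s3")
```

Resolving "same as tracking" explicitly, as the last step does, avoids surprises when the registry later moves to its own backend.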

Best Practices

  1. Log everything — storage is cheaper than re-running experiments — Log all hyperparameters, metrics at every epoch, system metrics (GPU utilization, memory), random seeds, and git commit hashes. You never know which detail will be important for reproducing a result.

  2. Use auto-logging for framework integration — MLflow and W&B auto-log training metrics, model parameters, and artifacts for PyTorch, TensorFlow, and HuggingFace. Enable auto-logging first, then add custom logging for domain-specific metrics.

  3. Set random seeds and log them for reproducibility — Set seeds for Python random, NumPy, PyTorch, and CUDA. Log the seed value with each run. Without fixed seeds, you can't reproduce exact results even with identical parameters.

  4. Compare experiments systematically, not manually — Use the tracking UI's comparison features to view metrics side-by-side across runs. Create automated reports that compare new runs against the current best model on all key metrics.

  5. Register production models with stage transitions — Use the model registry to move models through stages: None → Staging → Production → Archived. Each transition should require approval and include automated validation checks.
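Best practice 3 above can be sketched as a single helper. The NumPy and PyTorch branches are guarded by try/except so the snippet runs even where those libraries are not installed:

```python
import os
import random

def set_seed(seed: int) -> int:
    """Seed common sources of randomness and return the seed for logging."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    except ImportError:
        pass
    return seed  # log this with the run, e.g. mlflow.log_param("seed", seed)

seed = set_seed(42)
```

Returning the seed makes it natural to log in the same line it is set, so the value is never silently dropped from the run record.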

Common Issues

Experiment tracking server storage grows uncontrollably. Implement artifact retention policies. Delete artifacts from old, unsuccessful runs. Compress large model artifacts. Use S3/GCS with lifecycle policies for automatic cleanup of artifacts older than N days.

Can't reproduce results despite identical logged parameters. Missing sources of randomness: data shuffling order, GPU non-determinism, library version differences, and hardware-specific behavior. Log the full pip freeze output, set torch.use_deterministic_algorithms(True), and use containerized training environments.
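One way to capture that environment information alongside a run is to snapshot the interpreter version, platform, and installed packages in a single text artifact. The helper and artifact filename below are illustrative; `mlflow.log_text` would be one way to attach the result to a run:

```python
import platform
import subprocess
import sys

def environment_snapshot() -> str:
    """Return a text snapshot of the runtime environment for reproducibility."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    # Full dependency list, equivalent to running `pip freeze`
    freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=False,
    )
    lines.append(freeze.stdout)
    return "\n".join(lines)

snapshot = environment_snapshot()
# e.g. mlflow.log_text(snapshot, "environment.txt") to attach it to a run
```

Logging this once per run costs a few kilobytes and removes the guesswork when a result fails to reproduce months later on different library versions.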

Too many experiments make it hard to find relevant runs. Use consistent tagging conventions (team, project, stage) and experiment naming. Create saved views or dashboards for frequently compared experiments. Archive completed experiments to reduce clutter.
