
Agent · Cliptics · data-ai · v1.0.0 · MIT

Architect ML Engineer

An agent covering the complete machine learning lifecycle, from pipeline development through model training, validation, deployment, and monitoring. It focuses on building production-ready ML systems that deliver reliable predictions at scale.

When to Use This Agent

Choose ML Engineer when:

  • Building end-to-end ML pipelines from data ingestion to model serving
  • Implementing reproducible training workflows with experiment tracking
  • Designing model validation frameworks with proper testing strategies
  • Setting up continuous training and deployment automation
  • Monitoring model performance and data quality in production

Consider alternatives when:

  • Doing exploratory research without production constraints (use a data science agent)
  • Optimizing inference latency without changing the ML pipeline (use an ML deployment agent)
  • Building data ingestion without ML components (use a data engineering agent)

Quick Start

```yaml
# .claude/agents/architect-ml-engineer.yml
name: ML Engineer
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior ML engineer. Build production-ready ML systems covering
  the full lifecycle: data pipelines, feature engineering, training,
  validation, deployment, and monitoring. Prioritize reproducibility,
  reliability, and maintainability.
```

Example invocation:

claude --agent architect-ml-engineer "Build a training pipeline for our recommendation model that includes feature engineering, hyperparameter tuning with Optuna, experiment tracking in MLflow, and automated model validation before deployment"

Core Concepts

ML Pipeline Architecture

Data Sources → Feature Pipeline → Training Pipeline → Validation → Deployment
     │              │                  │                 │            │
  Raw data    Feature store      Experiment tracker   Test suite   Model registry
  CDC/batch   Transformations    Hyperparameter opt   A/B config   Canary deploy
  Streaming   Versioning         Distributed training Bias checks  Monitoring

Pipeline Component Responsibilities

| Component | Input | Output | Tools |
|---|---|---|---|
| Data Validation | Raw data | Quality report | Great Expectations |
| Feature Engineering | Validated data | Feature matrix | Feast, custom |
| Training | Features + config | Trained model | PyTorch, XGBoost |
| Experiment Tracking | Metrics + artifacts | Experiment record | MLflow, W&B |
| Model Validation | Model + test data | Pass/fail + report | Custom test suite |
| Model Registry | Validated model | Versioned artifact | MLflow Registry |
| Deployment | Registered model | Serving endpoint | KServe, Seldon |
| Monitoring | Predictions + actuals | Drift/performance alerts | Evidently, custom |
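In practice the Data Validation stage would be backed by Great Expectations, but the shape of its output is easy to see in a library-free sketch: compute per-column null rates and range violations for a batch, and roll them up into a pass/fail quality report. Column names and thresholds below are illustrative, not part of any real schema.

```python
from typing import Any

def quality_report(rows: list[dict[str, Any]],
                   ranges: dict[str, tuple[float, float]],
                   max_null_rate: float = 0.01) -> dict:
    """Compute per-column null rates and range violations for a batch of rows."""
    columns = {k for row in rows for k in row}
    report = {"row_count": len(rows), "columns": {}, "passed": True}
    for col in columns:
        values = [row.get(col) for row in rows]
        null_rate = sum(v is None for v in values) / len(rows)
        out_of_range = 0
        if col in ranges:
            lo, hi = ranges[col]
            out_of_range = sum(
                v is not None and not (lo <= v <= hi) for v in values
            )
        ok = null_rate <= max_null_rate and out_of_range == 0
        report["columns"][col] = {
            "null_rate": null_rate,
            "out_of_range": out_of_range,
            "ok": ok,
        }
        report["passed"] = report["passed"] and ok
    return report

# One null rating out of four rows blows the 1% null-rate budget.
rows = [{"user_age": 34, "rating": 4.5},
        {"user_age": 27, "rating": None},
        {"user_age": 51, "rating": 3.0},
        {"user_age": 19, "rating": 5.0}]
report = quality_report(rows, ranges={"rating": (0.0, 5.0)})
print(report["passed"])  # False: rating null rate is 25%
```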

Reproducible Training Configuration

```yaml
# training_config.yaml
experiment:
  name: recommendation-v3
  tracking_uri: http://mlflow:5000
data:
  source: s3://data/features/v2/
  split:
    train: 0.7
    validation: 0.15
    test: 0.15
  seed: 42
model:
  architecture: two-tower
  embedding_dim: 128
  hidden_layers: [256, 128, 64]
training:
  epochs: 50
  batch_size: 512
  learning_rate: 0.001
  early_stopping:
    patience: 5
    metric: val_ndcg@10
validation:
  min_ndcg: 0.35
  max_latency_ms: 50
  bias_checks:
    - metric: demographic_parity
      threshold: 0.1
```
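A training script would load this file (e.g. with PyYAML), log it to the experiment tracker, and seed everything it touches so the same config always yields the same splits. A minimal sketch of the seeding and split logic, with the relevant config keys mirrored as a plain dict for brevity:

```python
import random

# Mirrors the data section of training_config.yaml.
config = {
    "data": {"split": {"train": 0.7, "validation": 0.15, "test": 0.15},
             "seed": 42},
}

def split_indices(n: int, split: dict[str, float], seed: int):
    """Deterministically shuffle dataset indices and carve out train/val/test."""
    assert abs(sum(split.values()) - 1.0) < 1e-9, "splits must sum to 1"
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # dedicated RNG: reruns reproduce exactly
    n_train = int(n * split["train"])
    n_val = int(n * split["validation"])
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

train, val, test = split_indices(1000, config["data"]["split"],
                                 config["data"]["seed"])
print(len(train), len(val), len(test))  # 700 150 150
```

Because the RNG is constructed from the pinned seed rather than taken from global state, rerunning the script months later reproduces the identical partition.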

Configuration

| Parameter | Description | Default |
|---|---|---|
| experiment_tracker | Experiment tracking platform | MLflow |
| feature_store | Feature storage and serving | Feast |
| orchestrator | Pipeline orchestration tool | Airflow |
| model_registry | Model versioning system | MLflow Registry |
| training_framework | ML framework | PyTorch |
| validation_suite | Model validation tool | Custom + Great Expectations |
| deployment_target | Serving infrastructure | Kubernetes |

Best Practices

  1. Make training pipelines configuration-driven. Every parameter that affects model output—data version, hyperparameters, feature set, split seed—should live in a versioned configuration file, not in code. This makes experiments reproducible without code changes and enables hyperparameter sweeps by simply varying the config. The training script reads the config and logs it alongside results.

  2. Implement automated model validation gates. Before any model reaches production, it must pass automated checks: accuracy above minimum thresholds on the test set, latency within SLA on representative hardware, bias metrics within acceptable ranges, and no regression on critical edge cases. These gates prevent deploying models that looked good in aggregate but fail on important subsets.

  3. Version data and features with the same rigor as code. When a model's performance changes, you need to determine whether the data, features, or model changed. Use DVC or a feature store with versioning to track exactly which data produced which model. Pin data versions in training configs so experiments are reproducible months later.

  4. Build monitoring that detects problems before users do. Track prediction distributions, feature distributions, and model confidence scores in real time. Set alerts for distributional shifts that precede accuracy degradation. Compare production feature distributions against training distributions to catch data pipeline issues before they affect model quality.

  5. Design for retraining from day one. Your first model deployment should include an automated retraining pipeline, even if it runs manually at first. Designing retraining after deployment is much harder because it requires retroactively instrumenting data collection, feature computation, and validation. A pipeline that retrains weekly prevents gradual model staleness.
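Practices 1 and 2 combine naturally: the validation gate reads its thresholds from the same versioned config the training run used. A hedged sketch of such a gate (the metric names, direction encoding, and thresholds are illustrative, not a fixed API):

```python
def validation_gate(metrics: dict[str, float],
                    thresholds: dict[str, tuple[str, float]]):
    """Check candidate-model metrics against configured gates.

    thresholds maps metric name -> (direction, limit), where direction is
    'min' (metric must be >= limit) or 'max' (metric must be <= limit).
    Returns (passed, list of human-readable failure reasons).
    """
    failures = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} < required {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} > allowed {limit}")
    return (not failures, failures)

# Gates mirroring the validation block of training_config.yaml.
gates = {
    "ndcg@10": ("min", 0.35),
    "p99_latency_ms": ("max", 50),
    "demographic_parity_gap": ("max", 0.1),
}
ok, reasons = validation_gate(
    {"ndcg@10": 0.38, "p99_latency_ms": 61, "demographic_parity_gap": 0.04},
    gates,
)
print(ok, reasons)  # False ['p99_latency_ms: 61 > allowed 50']
```

A model that clears its accuracy bar but misses the latency SLA is rejected with an explicit reason, which is exactly the failure mode aggregate offline metrics hide.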

Common Issues

Experiments cannot be reproduced weeks later. This happens when some aspect of the environment wasn't captured: library versions, random seeds, data preprocessing steps, or system-level dependencies. Use containerized training environments with pinned dependencies, log the full environment specification alongside results, and verify reproducibility by re-running a subset of experiments periodically.
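A low-effort way to capture the environment is to snapshot the interpreter, OS, and installed-package versions at the start of every run and log the result alongside the experiment. A stdlib-only sketch; a real pipeline would also record the container image digest and GPU driver version:

```python
import platform
import sys
from importlib import metadata

def environment_snapshot() -> dict:
    """Record interpreter, OS, and installed-package versions for a run."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {
            dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()
        },
    }

snap = environment_snapshot()
print(snap["python"].split()[0])  # e.g. "3.11.4"
```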

Training pipeline is too slow for rapid iteration. Profile the full pipeline to find bottlenecks. Common culprits: redundant data loading (cache preprocessed data), unoptimized data loaders (increase worker count, use memory-mapped datasets), full dataset training for experiments (use stratified sampling for quick iterations). Separate fast experimentation runs (sampled data, fewer epochs) from full training runs.
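For those sampled fast-iteration runs, stratified sampling keeps class proportions intact so quick experiments remain comparable to full training. A minimal sketch that samples dataset indices per label (real loaders would feed these indices to the data loader; the label names are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(labels: list, fraction: float, seed: int = 42) -> list[int]:
    """Return dataset indices sampled per label so class balance is preserved."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    rng = random.Random(seed)
    sampled = []
    for idx in by_label.values():
        k = max(1, round(len(idx) * fraction))  # keep at least one per class
        sampled.extend(rng.sample(idx, k))
    return sorted(sampled)

# A 10% sample of a 90/10 imbalanced dataset keeps the 90/10 ratio.
labels = ["click"] * 900 + ["no_click"] * 100
subset = stratified_sample(labels, fraction=0.1)
print(len(subset))  # 100: 90 clicks + 10 no-clicks
```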

Model performs differently on different hardware. Floating-point behavior varies between CPU types, GPU architectures, and library versions. Quantized models are especially sensitive to hardware differences. Always validate on the target deployment hardware before approving a model. Include hardware specification in your model card and CI validation, and test on the exact GPU type used in production.
