Architect ML Engineer
An agent covering the complete machine learning lifecycle from pipeline development through model training, validation, deployment, and monitoring, focused on building production-ready ML systems that deliver reliable predictions at scale.
When to Use This Agent
Choose ML Engineer when:
- Building end-to-end ML pipelines from data ingestion to model serving
- Implementing reproducible training workflows with experiment tracking
- Designing model validation frameworks with proper testing strategies
- Setting up continuous training and deployment automation
- Monitoring model performance and data quality in production
Consider alternatives when:
- Doing exploratory research without production constraints (use a data science agent)
- Optimizing inference latency without changing the ML pipeline (use an ML deployment agent)
- Building data ingestion without ML components (use a data engineering agent)
Quick Start
```yaml
# .claude/agents/architect-ml-engineer.yml
name: ML Engineer
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior ML engineer. Build production-ready ML systems covering
  the full lifecycle: data pipelines, feature engineering, training,
  validation, deployment, and monitoring. Prioritize reproducibility,
  reliability, and maintainability.
```
Example invocation:
```bash
claude --agent architect-ml-engineer "Build a training pipeline for our recommendation model that includes feature engineering, hyperparameter tuning with Optuna, experiment tracking in MLflow, and automated model validation before deployment"
```
Core Concepts
ML Pipeline Architecture
```
Data Sources → Feature Pipeline  → Training Pipeline    → Validation  → Deployment
     │              │                     │                   │             │
 Raw data      Feature store      Experiment tracker     Test suite   Model registry
 CDC/batch     Transformations    Hyperparameter opt     A/B config   Canary deploy
 Streaming     Versioning         Distributed training   Bias checks  Monitoring
```
Pipeline Component Responsibilities
| Component | Input | Output | Tools |
|---|---|---|---|
| Data Validation | Raw data | Quality report | Great Expectations |
| Feature Engineering | Validated data | Feature matrix | Feast, custom |
| Training | Features + config | Trained model | PyTorch, XGBoost |
| Experiment Tracking | Metrics + artifacts | Experiment record | MLflow, W&B |
| Model Validation | Model + test data | Pass/fail + report | Custom test suite |
| Model Registry | Validated model | Versioned artifact | MLflow Registry |
| Deployment | Registered model | Serving endpoint | KServe, Seldon |
| Monitoring | Predictions + actuals | Drift/performance alerts | Evidently, custom |
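The Model Validation component above can be sketched as a simple gate function. This is a minimal stdlib sketch, not any specific library's API; the metric names and thresholds mirror the validation section of the training config but are otherwise illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class GateResult:
    passed: bool
    failures: list = field(default_factory=list)

def validate_model(metrics: dict, thresholds: dict) -> GateResult:
    """Compare candidate-model metrics against promotion thresholds.

    `thresholds` maps metric name -> (direction, limit), where direction
    is "min" (metric must be >= limit) or "max" (metric must be <= limit).
    """
    failures = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} < required {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} > allowed {limit}")
    return GateResult(passed=not failures, failures=failures)

# Illustrative thresholds, mirroring the validation block of the config
thresholds = {
    "ndcg@10": ("min", 0.35),
    "p99_latency_ms": ("max", 50),
    "demographic_parity_gap": ("max", 0.1),
}
result = validate_model(
    {"ndcg@10": 0.41, "p99_latency_ms": 38, "demographic_parity_gap": 0.04},
    thresholds,
)
```

In a real pipeline this gate runs after training and before registry promotion, and the full `GateResult` is logged alongside the model artifact.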
Reproducible Training Configuration
```yaml
# training_config.yaml
experiment:
  name: recommendation-v3
  tracking_uri: http://mlflow:5000
data:
  source: s3://data/features/v2/
  split:
    train: 0.7
    validation: 0.15
    test: 0.15
  seed: 42
model:
  architecture: two-tower
  embedding_dim: 128
  hidden_layers: [256, 128, 64]
training:
  epochs: 50
  batch_size: 512
  learning_rate: 0.001
  early_stopping:
    patience: 5
    metric: val_ndcg@10
validation:
  min_ndcg: 0.35
  max_latency_ms: 50
  bias_checks:
    - metric: demographic_parity
      threshold: 0.1
```
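Once a config like this is loaded, it should be fingerprinted and logged with every run so results trace back to the exact configuration. A minimal stdlib sketch, assuming the YAML has already been parsed into a dict (the `config` dict and hash length here are illustrative):

```python
import hashlib
import json
import random

def config_fingerprint(config: dict) -> str:
    """Deterministic hash of a training config, logged with every run."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def set_seeds(seed: int) -> None:
    """Seed every RNG the pipeline touches. A real pipeline would also
    seed numpy/torch here; only the stdlib RNG is shown."""
    random.seed(seed)

config = {
    "data": {"source": "s3://data/features/v2/", "seed": 42},
    "training": {"epochs": 50, "batch_size": 512, "learning_rate": 0.001},
}
set_seeds(config["data"]["seed"])
run_id = config_fingerprint(config)
```

Because the JSON serialization is canonical (sorted keys, fixed separators), the same config always produces the same fingerprint, regardless of key order in the source file.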
Configuration
| Parameter | Description | Default |
|---|---|---|
| experiment_tracker | Experiment tracking platform | MLflow |
| feature_store | Feature storage and serving | Feast |
| orchestrator | Pipeline orchestration tool | Airflow |
| model_registry | Model versioning system | MLflow Registry |
| training_framework | ML framework | PyTorch |
| validation_suite | Model validation tool | Custom + Great Expectations |
| deployment_target | Serving infrastructure | Kubernetes |
Best Practices
- Make training pipelines configuration-driven. Every parameter that affects model output (data version, hyperparameters, feature set, split seed) should live in a versioned configuration file, not in code. This makes experiments reproducible without code changes and enables hyperparameter sweeps by simply varying the config. The training script reads the config and logs it alongside results.
- Implement automated model validation gates. Before any model reaches production, it must pass automated checks: accuracy above minimum thresholds on the test set, latency within SLA on representative hardware, bias metrics within acceptable ranges, and no regression on critical edge cases. These gates prevent deploying models that looked good in aggregate but fail on important subsets.
- Version data and features with the same rigor as code. When a model's performance changes, you need to determine whether the data, features, or model changed. Use DVC or a feature store with versioning to track exactly which data produced which model. Pin data versions in training configs so experiments are reproducible months later.
- Build monitoring that detects problems before users do. Track prediction distributions, feature distributions, and model confidence scores in real time. Set alerts for distributional shifts that precede accuracy degradation. Compare production feature distributions against training distributions to catch data pipeline issues before they affect model quality.
- Design for retraining from day one. Your first model deployment should include an automated retraining pipeline, even if it runs manually at first. Designing retraining after deployment is much harder because it requires retroactively instrumenting data collection, feature computation, and validation. A pipeline that retrains weekly prevents gradual model staleness.
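The monitoring practice above (comparing production feature distributions against training distributions) is often implemented with the Population Stability Index. A minimal stdlib sketch; the bin count, epsilon floor, and the 0.2 alert threshold in the comment are conventional choices, not mandates:

```python
import math
import random

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a training (expected) and a
    production (actual) sample of one feature. PSI > 0.2 is a common
    rule-of-thumb alert threshold; tune it per feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def histogram(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge bins
            idx = max(0, min(int((v - lo) / width), bins - 1))
            counts[idx] += 1
        total = len(values)
        # Floor each bucket at a tiny probability to keep the log finite
        return [max(c / total, 1e-6) for c in counts]

    p, q = histogram(expected), histogram(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

random.seed(0)
train = [random.gauss(0, 1) for _ in range(5000)]
same = [random.gauss(0, 1) for _ in range(5000)]
shifted = [random.gauss(1.0, 1) for _ in range(5000)]
```

In practice each monitored feature gets its own PSI computed on a rolling window of production traffic, with the training-set histogram precomputed and stored next to the model artifact.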
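The retraining practice above needs a trigger policy: retrain on a fixed cadence or when drift crosses the alert threshold, whichever comes first. A hypothetical sketch; the function name, 7-day cadence, and 0.2 drift threshold are illustrative defaults:

```python
from datetime import datetime, timedelta

def should_retrain(last_trained: datetime, drift_score: float, *,
                   max_age_days: int = 7, drift_threshold: float = 0.2,
                   now: datetime = None) -> tuple:
    """Decide whether to kick off the retraining pipeline.

    Returns (decision, reason): retrain when the model is stale
    (weekly cadence) or when feature drift crossed the alert threshold.
    """
    now = now or datetime.utcnow()
    if now - last_trained > timedelta(days=max_age_days):
        return True, "stale"
    if drift_score > drift_threshold:
        return True, "drift"
    return False, "ok"
```

The orchestrator (Airflow in the default configuration) would call this on a schedule and, on a `True` decision, launch the same configuration-driven training pipeline used for the initial deployment.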
Common Issues
Experiments cannot be reproduced weeks later. This happens when some aspect of the environment wasn't captured: library versions, random seeds, data preprocessing steps, or system-level dependencies. Use containerized training environments with pinned dependencies, log the full environment specification alongside results, and verify reproducibility by re-running a subset of experiments periodically.
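Logging the environment specification can be as simple as snapshotting interpreter and platform details alongside the run. A minimal stdlib sketch; a real pipeline would also record installed package versions (e.g. via `importlib.metadata`) and the container image digest:

```python
import platform
import sys

def capture_environment(seed: int) -> dict:
    """Snapshot the details needed to rerun this experiment, stored
    alongside the run's metrics and config fingerprint."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
    }

spec = capture_environment(42)
```

Re-running a subset of experiments periodically, as suggested above, then amounts to diffing a fresh snapshot against the stored one before comparing metrics.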
Training pipeline is too slow for rapid iteration. Profile the full pipeline to find bottlenecks. Common culprits: redundant data loading (cache preprocessed data), unoptimized data loaders (increase worker count, use memory-mapped datasets), full dataset training for experiments (use stratified sampling for quick iterations). Separate fast experimentation runs (sampled data, fewer epochs) from full training runs.
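Stratified sampling for fast iteration runs can be sketched in a few lines of stdlib Python; the row format and the `key` callable here are illustrative:

```python
import random
from collections import defaultdict

def stratified_sample(rows: list, key, fraction: float, seed: int = 42) -> list:
    """Down-sample a dataset for quick experiment iterations while
    keeping every class (stratum) represented in proportion."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[key(row)].append(row)
    sample = []
    for stratum_rows in by_stratum.values():
        # Keep at least one row per stratum so rare classes never vanish
        k = max(1, round(len(stratum_rows) * fraction))
        sample.extend(rng.sample(stratum_rows, k))
    return sample

rows = [{"label": i % 3, "x": i} for i in range(3000)]
small = stratified_sample(rows, key=lambda r: r["label"], fraction=0.1)
```

Seeding the sampler keeps the fast-iteration subset stable across runs, so experiment metrics on sampled data remain comparable.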
Model performs differently on different hardware. Floating-point behavior varies between CPU types, GPU architectures, and library versions. Quantized models are especially sensitive to hardware differences. Always validate on the target deployment hardware before approving a model. Include hardware specification in your model card and CI validation, and test on the exact GPU type used in production.