
Agent · Cliptics · data-ai · v1.0.0 · MIT

Architect ML Engineer

An agent covering the complete machine learning lifecycle, from pipeline development through model training, validation, deployment, and monitoring. It focuses on building production-ready ML systems that deliver reliable predictions at scale.

When to Use This Agent

Choose ML Engineer when:

  • Building end-to-end ML pipelines from data ingestion to model serving
  • Implementing reproducible training workflows with experiment tracking
  • Designing model validation frameworks with proper testing strategies
  • Setting up continuous training and deployment automation
  • Monitoring model performance and data quality in production

Consider alternatives when:

  • Doing exploratory research without production constraints (use a data science agent)
  • Optimizing inference latency without changing the ML pipeline (use an ML deployment agent)
  • Building data ingestion without ML components (use a data engineering agent)

Quick Start

```yaml
# .claude/agents/architect-ml-engineer.yml
name: ML Engineer
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior ML engineer. Build production-ready ML systems covering
  the full lifecycle: data pipelines, feature engineering, training,
  validation, deployment, and monitoring. Prioritize reproducibility,
  reliability, and maintainability.
```

Example invocation:

claude --agent architect-ml-engineer "Build a training pipeline for our recommendation model that includes feature engineering, hyperparameter tuning with Optuna, experiment tracking in MLflow, and automated model validation before deployment"

Core Concepts

ML Pipeline Architecture

Data Sources → Feature Pipeline → Training Pipeline → Validation → Deployment
     │              │                  │                 │            │
  Raw data    Feature store      Experiment tracker   Test suite   Model registry
  CDC/batch   Transformations    Hyperparameter opt   A/B config   Canary deploy
  Streaming   Versioning         Distributed training Bias checks  Monitoring

Pipeline Component Responsibilities

| Component | Input | Output | Tools |
|---|---|---|---|
| Data Validation | Raw data | Quality report | Great Expectations |
| Feature Engineering | Validated data | Feature matrix | Feast, custom |
| Training | Features + config | Trained model | PyTorch, XGBoost |
| Experiment Tracking | Metrics + artifacts | Experiment record | MLflow, W&B |
| Model Validation | Model + test data | Pass/fail + report | Custom test suite |
| Model Registry | Validated model | Versioned artifact | MLflow Registry |
| Deployment | Registered model | Serving endpoint | KServe, Seldon |
| Monitoring | Predictions + actuals | Drift/performance alerts | Evidently, custom |
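In practice the Data Validation stage would be backed by Great Expectations, but the shape of its output is easy to see in a library-free sketch: compute per-column null rates and range violations for a batch, and roll them up into a pass/fail quality report. Column names and thresholds below are illustrative, not part of any real schema.

```python
from typing import Any

def quality_report(rows: list[dict[str, Any]],
                   ranges: dict[str, tuple[float, float]],
                   max_null_rate: float = 0.01) -> dict:
    """Compute per-column null rates and range violations for a batch of rows."""
    columns = {k for row in rows for k in row}
    report = {"row_count": len(rows), "columns": {}, "passed": True}
    for col in columns:
        values = [row.get(col) for row in rows]
        null_rate = sum(v is None for v in values) / len(rows)
        out_of_range = 0
        if col in ranges:
            lo, hi = ranges[col]
            out_of_range = sum(
                v is not None and not (lo <= v <= hi) for v in values
            )
        ok = null_rate <= max_null_rate and out_of_range == 0
        report["columns"][col] = {
            "null_rate": null_rate,
            "out_of_range": out_of_range,
            "ok": ok,
        }
        report["passed"] = report["passed"] and ok
    return report

# One null rating out of four rows blows the 1% null-rate budget.
rows = [{"user_age": 34, "rating": 4.5},
        {"user_age": 27, "rating": None},
        {"user_age": 51, "rating": 3.0},
        {"user_age": 19, "rating": 5.0}]
report = quality_report(rows, ranges={"rating": (0.0, 5.0)})
print(report["passed"])  # False: rating null rate is 25%
```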

Reproducible Training Configuration

```yaml
# training_config.yaml
experiment:
  name: recommendation-v3
  tracking_uri: http://mlflow:5000
data:
  source: s3://data/features/v2/
  split:
    train: 0.7
    validation: 0.15
    test: 0.15
  seed: 42
model:
  architecture: two-tower
  embedding_dim: 128
  hidden_layers: [256, 128, 64]
training:
  epochs: 50
  batch_size: 512
  learning_rate: 0.001
  early_stopping:
    patience: 5
    metric: val_ndcg@10
validation:
  min_ndcg: 0.35
  max_latency_ms: 50
  bias_checks:
    - metric: demographic_parity
      threshold: 0.1
```
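A training script would load this file (e.g. with PyYAML), log it to the experiment tracker, and seed everything it touches so the same config always yields the same splits. A minimal sketch of the seeding and split logic, with the relevant config keys mirrored as a plain dict for brevity:

```python
import random

# Mirrors the data section of training_config.yaml.
config = {
    "data": {"split": {"train": 0.7, "validation": 0.15, "test": 0.15},
             "seed": 42},
}

def split_indices(n: int, split: dict[str, float], seed: int):
    """Deterministically shuffle dataset indices and carve out train/val/test."""
    assert abs(sum(split.values()) - 1.0) < 1e-9, "splits must sum to 1"
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # dedicated RNG: reruns reproduce exactly
    n_train = int(n * split["train"])
    n_val = int(n * split["validation"])
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

train, val, test = split_indices(1000, config["data"]["split"],
                                 config["data"]["seed"])
print(len(train), len(val), len(test))  # 700 150 150
```

Because the RNG is constructed from the pinned seed rather than taken from global state, rerunning the script months later reproduces the identical partition.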

Configuration

| Parameter | Description | Default |
|---|---|---|
| experiment_tracker | Experiment tracking platform | MLflow |
| feature_store | Feature storage and serving | Feast |
| orchestrator | Pipeline orchestration tool | Airflow |
| model_registry | Model versioning system | MLflow Registry |
| training_framework | ML framework | PyTorch |
| validation_suite | Model validation tool | Custom + Great Expectations |
| deployment_target | Serving infrastructure | Kubernetes |

Best Practices

  1. Make training pipelines configuration-driven. Every parameter that affects model output—data version, hyperparameters, feature set, split seed—should live in a versioned configuration file, not in code. This makes experiments reproducible without code changes and enables hyperparameter sweeps by simply varying the config. The training script reads the config and logs it alongside results.

  2. Implement automated model validation gates. Before any model reaches production, it must pass automated checks: accuracy above minimum thresholds on the test set, latency within SLA on representative hardware, bias metrics within acceptable ranges, and no regression on critical edge cases. These gates prevent deploying models that looked good in aggregate but fail on important subsets.

  3. Version data and features with the same rigor as code. When a model's performance changes, you need to determine whether the data, features, or model changed. Use DVC or a feature store with versioning to track exactly which data produced which model. Pin data versions in training configs so experiments are reproducible months later.

  4. Build monitoring that detects problems before users do. Track prediction distributions, feature distributions, and model confidence scores in real time. Set alerts for distributional shifts that precede accuracy degradation. Compare production feature distributions against training distributions to catch data pipeline issues before they affect model quality.

  5. Design for retraining from day one. Your first model deployment should include an automated retraining pipeline, even if it runs manually at first. Designing retraining after deployment is much harder because it requires retroactively instrumenting data collection, feature computation, and validation. A pipeline that retrains weekly prevents gradual model staleness.
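Practices 1 and 2 combine naturally: the validation gate reads its thresholds from the same versioned config the training run used. A hedged sketch of such a gate (the metric names, direction encoding, and thresholds are illustrative, not a fixed API):

```python
def validation_gate(metrics: dict[str, float],
                    thresholds: dict[str, tuple[str, float]]):
    """Check candidate-model metrics against configured gates.

    thresholds maps metric name -> (direction, limit), where direction is
    'min' (metric must be >= limit) or 'max' (metric must be <= limit).
    Returns (passed, list of human-readable failure reasons).
    """
    failures = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} < required {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} > allowed {limit}")
    return (not failures, failures)

# Gates mirroring the validation block of training_config.yaml.
gates = {
    "ndcg@10": ("min", 0.35),
    "p99_latency_ms": ("max", 50),
    "demographic_parity_gap": ("max", 0.1),
}
ok, reasons = validation_gate(
    {"ndcg@10": 0.38, "p99_latency_ms": 61, "demographic_parity_gap": 0.04},
    gates,
)
print(ok, reasons)  # False ['p99_latency_ms: 61 > allowed 50']
```

A model that clears its accuracy bar but misses the latency SLA is rejected with an explicit reason, which is exactly the failure mode aggregate offline metrics hide.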

Common Issues

Experiments cannot be reproduced weeks later. This happens when some aspect of the environment wasn't captured: library versions, random seeds, data preprocessing steps, or system-level dependencies. Use containerized training environments with pinned dependencies, log the full environment specification alongside results, and verify reproducibility by re-running a subset of experiments periodically.
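A low-effort way to capture the environment is to snapshot the interpreter, OS, and installed-package versions at the start of every run and log the result alongside the experiment. A stdlib-only sketch; a real pipeline would also record the container image digest and GPU driver version:

```python
import platform
import sys
from importlib import metadata

def environment_snapshot() -> dict:
    """Record interpreter, OS, and installed-package versions for a run."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {
            dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()
        },
    }

snap = environment_snapshot()
print(snap["python"].split()[0])  # e.g. "3.11.4"
```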

Training pipeline is too slow for rapid iteration. Profile the full pipeline to find bottlenecks. Common culprits: redundant data loading (cache preprocessed data), unoptimized data loaders (increase worker count, use memory-mapped datasets), full dataset training for experiments (use stratified sampling for quick iterations). Separate fast experimentation runs (sampled data, fewer epochs) from full training runs.
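For those sampled fast-iteration runs, stratified sampling keeps class proportions intact so quick experiments remain comparable to full training. A minimal sketch that samples dataset indices per label (real loaders would feed these indices to the data loader; the label names are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(labels: list, fraction: float, seed: int = 42) -> list[int]:
    """Return dataset indices sampled per label so class balance is preserved."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    rng = random.Random(seed)
    sampled = []
    for idx in by_label.values():
        k = max(1, round(len(idx) * fraction))  # keep at least one per class
        sampled.extend(rng.sample(idx, k))
    return sorted(sampled)

# A 10% sample of a 90/10 imbalanced dataset keeps the 90/10 ratio.
labels = ["click"] * 900 + ["no_click"] * 100
subset = stratified_sample(labels, fraction=0.1)
print(len(subset))  # 100: 90 clicks + 10 no-clicks
```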

Model performs differently on different hardware. Floating-point behavior varies between CPU types, GPU architectures, and library versions. Quantized models are especially sensitive to hardware differences. Always validate on the target deployment hardware before approving a model. Include hardware specification in your model card and CI validation, and test on the exact GPU type used in production.
