
Architect MLOps Engineer

An agent for building and maintaining ML platforms. It covers infrastructure automation, CI/CD for ML, model versioning, feature store management, and operational excellence, enabling data scientists and ML engineers to ship models reliably.

When to Use This Agent

Choose MLOps Engineer when:

  • Setting up ML infrastructure (training clusters, serving platforms, feature stores)
  • Building CI/CD pipelines for ML model training and deployment
  • Implementing model monitoring, alerting, and automated retraining
  • Designing reproducible ML environments with containerization
  • Managing model registries and deployment automation

Consider alternatives when:

  • Training models or tuning hyperparameters (use an ML engineer agent)
  • Doing data exploration and analysis (use a data science agent)
  • Setting up general DevOps without ML components (use a DevOps agent)

Quick Start

```yaml
# .claude/agents/architect-mlops-engineer.yml
name: MLOps Engineer
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior MLOps engineer. Build and maintain ML platforms that
  enable reliable model development, deployment, and monitoring. Focus on
  automation, reproducibility, and operational excellence across the ML
  lifecycle.
```

Example invocation:

claude --agent architect-mlops-engineer "Set up a CI/CD pipeline that automatically trains, validates, and deploys our recommendation model when new training data arrives or code changes are pushed"

Core Concepts

MLOps Maturity Levels

Level          Characteristics                                 Automation
0 - Manual     Notebooks, manual deployment, no tracking       None
1 - Basic      Experiment tracking, scripted training          Partial
2 - Automated  CI/CD for ML, automated testing                 Training + deploy
3 - Full       Continuous training, monitoring, auto-retrain   End-to-end

ML CI/CD Pipeline

Code Push β†’ Lint/Test β†’ Train β†’ Validate β†’ Register β†’ Deploy β†’ Monitor
    β”‚          β”‚          β”‚        β”‚           β”‚          β”‚         β”‚
  Git PR    Unit tests  Full or  Accuracy    MLflow    Canary    Drift
  Review    Data tests  sample   Latency    Registry  Blue/green Alerts
  Hooks     Lint        Config   Bias       Version   Rollback  Retrain
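The Validate stage above gates registration on quality and performance thresholds. A minimal sketch of such a gate (the metric names and threshold values are illustrative, not prescribed by any particular registry):

```python
# Validation gate: a candidate model is registered only if every metric
# clears its threshold. Metric names and limits are illustrative.
THRESHOLDS = {
    "accuracy": ("min", 0.90),          # must be at least this
    "p99_latency_ms": ("max", 120.0),   # must be at most this
    "bias_gap": ("max", 0.05),          # max metric gap across groups
}

def validate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (passed, failure reasons) for a dict of evaluation metrics."""
    failures = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing metric")
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return (not failures, failures)
```

A CI job would run this after evaluation and skip the Register step when `validate` returns failures.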

Infrastructure Components

```yaml
# MLOps platform architecture
platform:
  compute:
    training: kubernetes      # GPU node pools
    serving: kubernetes       # CPU/GPU inference nodes
    notebooks: jupyterhub     # Development environment
  storage:
    artifacts: s3             # Models, configs, logs
    features: feast           # Online + offline feature store
    experiments: mlflow       # Metrics, parameters, artifacts
    data: delta-lake          # Versioned training data
  orchestration:
    pipelines: airflow        # Training pipeline scheduling
    ci_cd: github-actions     # Code and model CI/CD
    monitoring: grafana       # Metrics and alerting
  governance:
    registry: mlflow          # Model versioning and staging
    lineage: openlineage      # Data and model lineage
    access: opa               # Policy-based access control
```

Configuration

Parameter           Description                           Default
compute_platform    Training and serving infrastructure   Kubernetes
ci_cd_tool          CI/CD platform                        GitHub Actions
artifact_store      Model and data artifact storage       S3
experiment_tracker  Experiment tracking system            MLflow
feature_store       Feature serving platform              Feast
monitoring_stack    Metrics and alerting tools            Prometheus + Grafana
container_runtime   Containerization platform             Docker
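These defaults can be held in one typed configuration object with per-environment overrides, rather than scattered across scripts. A hedged sketch (field names mirror the table; the environment-variable names are illustrative):

```python
# Typed platform configuration: table defaults, overridable via
# environment variables. Variable names here are illustrative.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class PlatformConfig:
    compute_platform: str = "kubernetes"
    ci_cd_tool: str = "github-actions"
    artifact_store: str = "s3"
    experiment_tracker: str = "mlflow"
    feature_store: str = "feast"
    monitoring_stack: str = "prometheus+grafana"
    container_runtime: str = "docker"

def load_platform_config() -> PlatformConfig:
    """Apply environment overrides on top of the table defaults."""
    return PlatformConfig(
        compute_platform=os.getenv("COMPUTE_PLATFORM", PlatformConfig.compute_platform),
        ci_cd_tool=os.getenv("CI_CD_TOOL", PlatformConfig.ci_cd_tool),
        artifact_store=os.getenv("ARTIFACT_STORE", PlatformConfig.artifact_store),
        experiment_tracker=os.getenv("EXPERIMENT_TRACKER", PlatformConfig.experiment_tracker),
        feature_store=os.getenv("FEATURE_STORE", PlatformConfig.feature_store),
        monitoring_stack=os.getenv("MONITORING_STACK", PlatformConfig.monitoring_stack),
        container_runtime=os.getenv("CONTAINER_RUNTIME", PlatformConfig.container_runtime),
    )
```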

Best Practices

  1. Treat ML pipelines as software with full CI/CD. ML code deserves the same engineering rigor as application code: linting, unit tests, integration tests, code review, and automated deployment. Test data preprocessing functions with known inputs and expected outputs. Test model training with a tiny dataset to verify the pipeline runs end-to-end. Run these tests on every pull request.

  2. Containerize everything for reproducibility. Training environments, serving environments, and development environments should all run in containers with pinned dependencies. A model trained in a container with specific library versions should produce identical results regardless of where the container runs. Use multi-stage builds to keep images small while including all necessary dependencies.

  3. Implement progressive model deployment. Never deploy a new model to 100% of traffic immediately. Use canary deployments (5% β†’ 25% β†’ 50% β†’ 100%) with automated rollback triggers. Monitor key metrics at each stage: prediction accuracy, latency, error rate, and business metrics. The deployment pipeline should automatically roll back if any metric degrades beyond a threshold.

  4. Centralize configuration and secrets management. ML pipelines have many configuration parameters: data paths, hyperparameters, serving endpoints, API keys, and database credentials. Use a configuration management system (like Hydra configs + vault for secrets) rather than scattering values across code, environment variables, and config files. Centralization prevents configuration drift between environments.

  5. Build self-healing infrastructure. Design ML infrastructure to recover automatically from common failures: restart crashed training jobs from checkpoints, retry failed API calls with backoff, replace unhealthy serving replicas, and alert on-call engineers only for issues requiring human judgment. Use Kubernetes health checks, readiness probes, and pod disruption budgets to maintain availability.
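The retry-with-backoff pattern from practice 5 can be sketched in a few lines (delays and attempt counts are illustrative; a production version would also cap total wait time and add jitter):

```python
# Retry with exponential backoff: a common self-healing primitive for
# flaky API calls. Parameters are illustrative.
import time

def retry(fn, attempts: int = 3, base_delay: float = 0.01):
    """Call fn(), retrying on exception with exponentially growing delay."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # exhausted: surface the failure to the caller
            time.sleep(base_delay * (2 ** i))
```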

Common Issues

Training jobs fail intermittently in distributed settings. Distributed training introduces failure modes absent in single-node training: network partitions between workers, GPU memory errors on specific nodes, and stragglers that slow the entire job. Implement checkpointing every N steps to recover from failures. Use fault-tolerant training frameworks like PyTorch Elastic that replace failed workers without restarting the entire job. Log per-worker metrics to identify consistently problematic nodes.
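The checkpoint-and-resume idea can be sketched as follows. This is a minimal illustration: state is a plain JSON file, whereas a real job would persist model and optimizer state to object storage.

```python
# Checkpoint every N steps so a restarted job resumes from the last
# saved step instead of step 0. File-based state is for illustration.
import json
import os

def train(total_steps: int, ckpt_path: str, checkpoint_every: int = 100) -> int:
    """Run (or resume) a training loop; return the step resumed from."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["step"]
    for step in range(start, total_steps):
        # ... one training step would run here ...
        if (step + 1) % checkpoint_every == 0:
            with open(ckpt_path, "w") as f:
                json.dump({"step": step + 1}, f)
    return start
```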

Model serving infrastructure can't handle traffic spikes. Auto-scaling based on CPU or GPU utilization reacts too slowly for traffic bursts. Use predictive scaling based on historical traffic patterns for known spikes (product launches, marketing campaigns). For unpredictable spikes, maintain a buffer of warm instances (20-30% above baseline) and pre-compute predictions for the most common inputs to serve from cache during overload.
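The warm-buffer sizing and cache-first serving described above amount to a few lines of logic (the 25% buffer and the cache structure are illustrative):

```python
# Warm-pool sizing plus cache-first serving for common inputs.
# Buffer fraction and cache layout are illustrative.
import math

def warm_pool_size(baseline_instances: int, buffer_fraction: float = 0.25) -> int:
    """Baseline capacity plus a warm buffer, rounded up."""
    return math.ceil(baseline_instances * (1 + buffer_fraction))

def serve(features, cache: dict, model_predict):
    """Prefer the precomputed cache; fall back to the live model."""
    key = tuple(features)
    if key in cache:
        return cache[key]
    return model_predict(features)
```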

Development and production environments produce different model behavior. Environment parity failures come from library version differences, hardware differences (CPU vs GPU floating point), and data pipeline discrepancies. Eliminate these by using identical Docker images across environments, validating model outputs on both CPU and GPU during CI, and running production data through the development pipeline periodically to catch preprocessing differences.
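A CI parity check compares outputs from the two environments within a floating-point tolerance rather than demanding exact equality, since CPU and GPU kernels can differ in the last bits. A minimal sketch (tolerances are illustrative):

```python
# Element-wise closeness check between two environments' model outputs,
# similar in spirit to numpy.allclose. Tolerances are illustrative.
def outputs_match(a: list[float], b: list[float],
                  rel_tol: float = 1e-5, abs_tol: float = 1e-6) -> bool:
    """True if outputs agree within combined absolute/relative tolerance."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= abs_tol + rel_tol * abs(y) for x, y in zip(a, b))
```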
