Architect MLOps Engineer
An agent for building and maintaining ML platforms covering infrastructure automation, CI/CD for ML, model versioning, feature store management, and operational excellence that enables data scientists and ML engineers to ship models reliably.
When to Use This Agent
Choose MLOps Engineer when:
- Setting up ML infrastructure (training clusters, serving platforms, feature stores)
- Building CI/CD pipelines for ML model training and deployment
- Implementing model monitoring, alerting, and automated retraining
- Designing reproducible ML environments with containerization
- Managing model registries and deployment automation
Consider alternatives when:
- Training models or tuning hyperparameters (use an ML engineer agent)
- Doing data exploration and analysis (use a data science agent)
- Setting up general DevOps without ML components (use a DevOps agent)
Quick Start
```yaml
# .claude/agents/architect-mlops-engineer.yml
name: MLOps Engineer
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior MLOps engineer. Build and maintain ML platforms that
  enable reliable model development, deployment, and monitoring. Focus on
  automation, reproducibility, and operational excellence across the ML
  lifecycle.
```
Example invocation:
```bash
claude --agent architect-mlops-engineer "Set up a CI/CD pipeline that automatically trains, validates, and deploys our recommendation model when new training data arrives or code changes are pushed"
```
Core Concepts
MLOps Maturity Levels
| Level | Characteristics | Automation |
|---|---|---|
| 0 - Manual | Notebooks, manual deployment, no tracking | None |
| 1 - Basic | Experiment tracking, scripted training | Partial |
| 2 - Automated | CI/CD for ML, automated testing | Training + deploy |
| 3 - Full | Continuous training, monitoring, auto-retrain | End-to-end |
ML CI/CD Pipeline
```
Code Push → Lint/Test  → Train   → Validate → Register → Deploy     → Monitor
    ↓           ↓           ↓         ↓          ↓          ↓            ↓
  Git PR    Unit tests   Full or   Accuracy   MLflow     Canary       Drift
  Review    Data tests   sample    Latency    Registry   Blue/green   Alerts
  Hooks     Lint         Config    Bias       Version    Rollback     Retrain
```
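The Validate stage above can be sketched as a simple gate: registration proceeds only if every candidate metric clears its threshold. The metric names and limits here are illustrative assumptions, not part of the agent spec.

```python
# Sketch of the "Validate" gate: block registration unless the candidate
# model meets configured thresholds on accuracy, latency, etc.

def validate_candidate(metrics: dict, thresholds: dict) -> tuple[bool, list[str]]:
    """Return (passed, failures) for a candidate model's metrics."""
    failures = []
    # Higher-is-better metrics (e.g. accuracy) must meet a floor.
    for name, floor in thresholds.get("min", {}).items():
        if metrics.get(name, float("-inf")) < floor:
            failures.append(f"{name}={metrics.get(name)} below floor {floor}")
    # Lower-is-better metrics (e.g. p99 latency) must stay under a ceiling.
    for name, ceiling in thresholds.get("max", {}).items():
        if metrics.get(name, float("inf")) > ceiling:
            failures.append(f"{name}={metrics.get(name)} above ceiling {ceiling}")
    return (not failures, failures)

thresholds = {"min": {"accuracy": 0.90}, "max": {"p99_latency_ms": 150}}
ok, why = validate_candidate({"accuracy": 0.93, "p99_latency_ms": 120}, thresholds)
```

A failing gate returns the list of violated thresholds, which the pipeline can surface in the CI log before skipping the Register step.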
Infrastructure Components
```yaml
# MLOps platform architecture
platform:
  compute:
    training: kubernetes    # GPU node pools
    serving: kubernetes     # CPU/GPU inference nodes
    notebooks: jupyterhub   # Development environment
  storage:
    artifacts: s3           # Models, configs, logs
    features: feast         # Online + offline feature store
    experiments: mlflow     # Metrics, parameters, artifacts
    data: delta-lake        # Versioned training data
  orchestration:
    pipelines: airflow      # Training pipeline scheduling
    ci_cd: github-actions   # Code and model CI/CD
    monitoring: grafana     # Metrics and alerting
  governance:
    registry: mlflow        # Model versioning and staging
    lineage: openlineage    # Data and model lineage
    access: opa             # Policy-based access control
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `compute_platform` | Training and serving infrastructure | Kubernetes |
| `ci_cd_tool` | CI/CD platform | GitHub Actions |
| `artifact_store` | Model and data artifact storage | S3 |
| `experiment_tracker` | Experiment tracking system | MLflow |
| `feature_store` | Feature serving platform | Feast |
| `monitoring_stack` | Metrics and alerting tools | Prometheus + Grafana |
| `container_runtime` | Containerization platform | Docker |
Best Practices
- Treat ML pipelines as software with full CI/CD. ML code deserves the same engineering rigor as application code: linting, unit tests, integration tests, code review, and automated deployment. Test data preprocessing functions with known inputs and expected outputs. Test model training with a tiny dataset to verify the pipeline runs end-to-end. Run these tests on every pull request.
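The "known inputs and expected outputs" advice looks like ordinary unit tests. A minimal sketch, with `normalize()` standing in for whatever transform the pipeline actually uses:

```python
# Unit-testing a preprocessing step with known inputs and expected outputs,
# run on every pull request. normalize() is a hypothetical stand-in.

def normalize(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize_known_values():
    assert normalize([0.0, 5.0, 10.0]) == [0.0, 0.5, 1.0]

def test_normalize_constant_input():
    # Degenerate input must not divide by zero.
    assert normalize([3.0, 3.0]) == [0.0, 0.0]
```

Edge cases like the constant-input branch are exactly where silent preprocessing bugs tend to hide, so they deserve their own test.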
- Containerize everything for reproducibility. Training environments, serving environments, and development environments should all run in containers with pinned dependencies. A model trained in a container with specific library versions should produce identical results regardless of where the container runs. Use multi-stage builds to keep images small while including all necessary dependencies.
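One way to verify the "identical results regardless of where the container runs" claim in CI is to hash the model's predictions on a fixed probe set and compare digests across environments. The probe data and rounding scheme below are illustrative assumptions:

```python
# Fingerprint predictions so two container runs can be compared by digest.
import hashlib
import json

def prediction_fingerprint(predictions: list[float], ndigits: int = 6) -> str:
    """Stable digest of rounded predictions; rounding absorbs float noise."""
    rounded = [round(p, ndigits) for p in predictions]
    payload = json.dumps(rounded).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Run this in each environment and diff the digests in CI.
fp = prediction_fingerprint([0.1234567, 0.7654321])
```

Strict bit-for-bit equality can be too brittle across hardware; rounding before hashing trades a little sensitivity for fewer false alarms.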
- Implement progressive model deployment. Never deploy a new model to 100% of traffic immediately. Use canary deployments (5% → 25% → 50% → 100%) with automated rollback triggers. Monitor key metrics at each stage: prediction accuracy, latency, error rate, and business metrics. The deployment pipeline should automatically roll back if any metric degrades beyond a threshold.
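The canary progression with rollback triggers reduces to a small decision function: at each stage, either roll back, promote to the next stage, or finish. Stage sizes mirror the text; metric names are illustrative.

```python
# Canary rollout decision: advance traffic in stages, roll back if any
# monitored metric crosses its degradation threshold.

STAGES = [5, 25, 50, 100]  # percent of traffic

def next_action(stage_idx: int, metrics: dict, thresholds: dict) -> str:
    """Return 'rollback', 'promote', or 'done' for the current stage."""
    for name, limit in thresholds.items():
        if metrics.get(name, 0.0) > limit:
            return "rollback"   # any degraded metric aborts the rollout
    return "done" if STAGES[stage_idx] == 100 else "promote"

# Error rate within threshold at the 5% stage -> promote to 25%.
action = next_action(0, {"error_rate": 0.002}, {"error_rate": 0.01})
```

In practice the deployment controller would call this after each stage's soak period, with metrics pulled from the monitoring stack.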
- Centralize configuration and secrets management. ML pipelines have many configuration parameters: data paths, hyperparameters, serving endpoints, API keys, and database credentials. Use a configuration management system (like Hydra configs + Vault for secrets) rather than scattering values across code, environment variables, and config files. Centralization prevents configuration drift between environments.
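A minimal sketch of the centralization idea, using a plain frozen dataclass plus per-environment overlays in place of Hydra and Vault (which would replace this in a real setup); all field names and values here are hypothetical:

```python
# One typed config object with environment overlays, instead of values
# scattered across code, env vars, and ad-hoc config files.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PipelineConfig:
    data_path: str
    learning_rate: float
    serving_endpoint: str

BASE = PipelineConfig("s3://bucket/train", 1e-3, "http://serve.dev")

def for_env(base: PipelineConfig, overrides: dict) -> PipelineConfig:
    """Overlay environment-specific values on the base config."""
    return replace(base, **overrides)

prod = for_env(BASE, {"serving_endpoint": "http://serve.prod"})
```

Freezing the dataclass means environments can only differ through explicit overlays, which is what prevents drift.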
- Build self-healing infrastructure. Design ML infrastructure to recover automatically from common failures: restart crashed training jobs from checkpoints, retry failed API calls with backoff, replace unhealthy serving replicas, and alert on-call engineers only for issues requiring human judgment. Use Kubernetes health checks, readiness probes, and pod disruption budgets to maintain availability.
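The "retry failed API calls with backoff" piece can be sketched as a small helper; delays are capped exponential, and jitter (usually added in production) is omitted for brevity:

```python
# Retry-with-backoff helper for transient failures in API calls.
import time

def retry(fn, attempts: int = 4, base_delay: float = 0.5,
          max_delay: float = 8.0, sleep=time.sleep):
    """Call fn(); on exception, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted retries: surface the real error
            sleep(min(base_delay * (2 ** attempt), max_delay))
```

Passing `sleep` as a parameter keeps the helper testable without real delays; the same pattern is what libraries like tenacity implement with more options.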
Common Issues
Training jobs fail intermittently in distributed settings. Distributed training introduces failure modes absent in single-node training: network partitions between workers, GPU memory errors on specific nodes, and stragglers that slow the entire job. Implement checkpointing every N steps to recover from failures. Use fault-tolerant training frameworks like PyTorch Elastic that replace failed workers without restarting the entire job. Log per-worker metrics to identify consistently problematic nodes.
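The checkpoint-every-N-steps advice reduces to a loop that persists its step counter and resumes from it after a crash. Storage is a dict here for illustration; a real job would write checkpoints to durable object storage:

```python
# Checkpointing skeleton: on restart, the loop resumes from the last
# saved step instead of step 0.

def train(total_steps: int, ckpt_every: int, store: dict) -> int:
    """Run (or resume) a training loop, checkpointing every ckpt_every steps."""
    step = store.get("step", 0)      # resume point after a crash
    while step < total_steps:
        step += 1                    # stand-in for one optimization step
        if step % ckpt_every == 0:
            store["step"] = step     # persist progress
    return step

fresh = {}
train(10, 3, fresh)                  # checkpoints land at steps 3, 6, 9
resumed = {"step": 7}                # a restarted job skips steps 1-7
train(10, 3, resumed)
```

The checkpoint interval trades recovery time against I/O overhead: smaller N loses less work on failure but writes more often.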
Model serving infrastructure can't handle traffic spikes. Auto-scaling based on CPU or GPU utilization reacts too slowly for traffic bursts. Use predictive scaling based on historical traffic patterns for known spikes (product launches, marketing campaigns). For unpredictable spikes, maintain a buffer of warm instances (20-30% above baseline) and pre-compute predictions for the most common inputs to serve from cache during overload.
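The 20-30% warm-buffer rule of thumb translates into a simple capacity calculation; `rps_per_replica` and the buffer fraction are deployment-specific assumptions:

```python
# Size the serving fleet: replicas for baseline load plus a warm buffer.
import math

def desired_replicas(baseline_rps: float, rps_per_replica: float,
                     buffer_frac: float = 0.25) -> int:
    """Replicas needed for baseline traffic plus a warm-instance buffer."""
    base = math.ceil(baseline_rps / rps_per_replica)
    return math.ceil(base * (1 + buffer_frac))

n = desired_replicas(900, 100)  # 9 baseline replicas + 25% buffer
```

An autoscaler target would use this as a floor, letting reactive scaling add capacity above it for sustained growth.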
Development and production environments produce different model behavior. Environment parity failures come from library version differences, hardware differences (CPU vs GPU floating point), and data pipeline discrepancies. Eliminate these by using identical Docker images across environments, validating model outputs on both CPU and GPU during CI, and running production data through the development pipeline periodically to catch preprocessing differences.
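The CPU-vs-GPU validation step needs a tolerance-based comparison rather than exact equality, since kernels can differ in the low-order bits. A minimal sketch:

```python
# Parity check: compare model outputs from two environments on the same
# probe inputs, within a float tolerance.
import math

def outputs_match(a: list[float], b: list[float],
                  rel_tol: float = 1e-5, abs_tol: float = 1e-6) -> bool:
    """True if every pair of outputs agrees within tolerance."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
               for x, y in zip(a, b))
```

Run in CI against outputs captured from both environments, a failure here catches parity drift before it reaches production.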