
Architect MLOps Engineer

An agent for building and maintaining ML platforms. It covers infrastructure automation, CI/CD for ML, model versioning, feature store management, and operational excellence, enabling data scientists and ML engineers to ship models reliably.

When to Use This Agent

Choose MLOps Engineer when:

  • Setting up ML infrastructure (training clusters, serving platforms, feature stores)
  • Building CI/CD pipelines for ML model training and deployment
  • Implementing model monitoring, alerting, and automated retraining
  • Designing reproducible ML environments with containerization
  • Managing model registries and deployment automation

Consider alternatives when:

  • Training models or tuning hyperparameters (use an ML engineer agent)
  • Doing data exploration and analysis (use a data science agent)
  • Setting up general DevOps without ML components (use a DevOps agent)

Quick Start

```yaml
# .claude/agents/architect-mlops-engineer.yml
name: MLOps Engineer
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior MLOps engineer. Build and maintain ML platforms that
  enable reliable model development, deployment, and monitoring. Focus on
  automation, reproducibility, and operational excellence across the ML
  lifecycle.
```

Example invocation:

claude --agent architect-mlops-engineer "Set up a CI/CD pipeline that automatically trains, validates, and deploys our recommendation model when new training data arrives or code changes are pushed"

Core Concepts

MLOps Maturity Levels

Level          Characteristics                                 Automation
0 - Manual     Notebooks, manual deployment, no tracking       None
1 - Basic      Experiment tracking, scripted training          Partial
2 - Automated  CI/CD for ML, automated testing                 Training + deploy
3 - Full       Continuous training, monitoring, auto-retrain   End-to-end

ML CI/CD Pipeline

Code Push β†’ Lint/Test β†’ Train β†’ Validate β†’ Register β†’ Deploy β†’ Monitor
    β”‚          β”‚          β”‚        β”‚           β”‚          β”‚         β”‚
  Git PR    Unit tests  Full or  Accuracy    MLflow    Canary    Drift
  Review    Data tests  sample   Latency    Registry  Blue/green Alerts
  Hooks     Lint        Config   Bias       Version   Rollback  Retrain
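The Validate stage above gates registration on quality and performance thresholds. A minimal sketch of such a gate (the metric names and threshold values are illustrative, not prescribed by any particular registry):

```python
# Validation gate: a candidate model is registered only if every metric
# clears its threshold. Metric names and limits are illustrative.
THRESHOLDS = {
    "accuracy": ("min", 0.90),          # must be at least this
    "p99_latency_ms": ("max", 120.0),   # must be at most this
    "bias_gap": ("max", 0.05),          # max metric gap across groups
}

def validate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (passed, failure reasons) for a dict of evaluation metrics."""
    failures = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing metric")
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return (not failures, failures)
```

A CI job would run this after evaluation and skip the Register step when `validate` returns failures.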

Infrastructure Components

```yaml
# MLOps platform architecture
platform:
  compute:
    training: kubernetes      # GPU node pools
    serving: kubernetes       # CPU/GPU inference nodes
    notebooks: jupyterhub     # Development environment
  storage:
    artifacts: s3             # Models, configs, logs
    features: feast           # Online + offline feature store
    experiments: mlflow       # Metrics, parameters, artifacts
    data: delta-lake          # Versioned training data
  orchestration:
    pipelines: airflow        # Training pipeline scheduling
    ci_cd: github-actions     # Code and model CI/CD
    monitoring: grafana       # Metrics and alerting
  governance:
    registry: mlflow          # Model versioning and staging
    lineage: openlineage      # Data and model lineage
    access: opa               # Policy-based access control
```

Configuration

Parameter           Description                           Default
compute_platform    Training and serving infrastructure   Kubernetes
ci_cd_tool          CI/CD platform                        GitHub Actions
artifact_store      Model and data artifact storage       S3
experiment_tracker  Experiment tracking system            MLflow
feature_store       Feature serving platform              Feast
monitoring_stack    Metrics and alerting tools            Prometheus + Grafana
container_runtime   Containerization platform             Docker
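These defaults can be held in one typed configuration object with per-environment overrides, rather than scattered across scripts. A hedged sketch (field names mirror the table; the environment-variable names are illustrative):

```python
# Typed platform configuration: table defaults, overridable via
# environment variables. Variable names here are illustrative.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class PlatformConfig:
    compute_platform: str = "kubernetes"
    ci_cd_tool: str = "github-actions"
    artifact_store: str = "s3"
    experiment_tracker: str = "mlflow"
    feature_store: str = "feast"
    monitoring_stack: str = "prometheus+grafana"
    container_runtime: str = "docker"

def load_platform_config() -> PlatformConfig:
    """Apply environment overrides on top of the table defaults."""
    return PlatformConfig(
        compute_platform=os.getenv("COMPUTE_PLATFORM", PlatformConfig.compute_platform),
        ci_cd_tool=os.getenv("CI_CD_TOOL", PlatformConfig.ci_cd_tool),
        artifact_store=os.getenv("ARTIFACT_STORE", PlatformConfig.artifact_store),
        experiment_tracker=os.getenv("EXPERIMENT_TRACKER", PlatformConfig.experiment_tracker),
        feature_store=os.getenv("FEATURE_STORE", PlatformConfig.feature_store),
        monitoring_stack=os.getenv("MONITORING_STACK", PlatformConfig.monitoring_stack),
        container_runtime=os.getenv("CONTAINER_RUNTIME", PlatformConfig.container_runtime),
    )
```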

Best Practices

  1. Treat ML pipelines as software with full CI/CD. ML code deserves the same engineering rigor as application code: linting, unit tests, integration tests, code review, and automated deployment. Test data preprocessing functions with known inputs and expected outputs. Test model training with a tiny dataset to verify the pipeline runs end-to-end. Run these tests on every pull request.

  2. Containerize everything for reproducibility. Training environments, serving environments, and development environments should all run in containers with pinned dependencies. A model trained in a container with specific library versions should produce identical results regardless of where the container runs. Use multi-stage builds to keep images small while including all necessary dependencies.

  3. Implement progressive model deployment. Never deploy a new model to 100% of traffic immediately. Use canary deployments (5% β†’ 25% β†’ 50% β†’ 100%) with automated rollback triggers. Monitor key metrics at each stage: prediction accuracy, latency, error rate, and business metrics. The deployment pipeline should automatically roll back if any metric degrades beyond a threshold.

  4. Centralize configuration and secrets management. ML pipelines have many configuration parameters: data paths, hyperparameters, serving endpoints, API keys, and database credentials. Use a configuration management system (like Hydra configs + vault for secrets) rather than scattering values across code, environment variables, and config files. Centralization prevents configuration drift between environments.

  5. Build self-healing infrastructure. Design ML infrastructure to recover automatically from common failures: restart crashed training jobs from checkpoints, retry failed API calls with backoff, replace unhealthy serving replicas, and alert on-call engineers only for issues requiring human judgment. Use Kubernetes health checks, readiness probes, and pod disruption budgets to maintain availability.
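The retry-with-backoff pattern from practice 5 can be sketched in a few lines (delays and attempt counts are illustrative; a production version would also cap total wait time and add jitter):

```python
# Retry with exponential backoff: a common self-healing primitive for
# flaky API calls. Parameters are illustrative.
import time

def retry(fn, attempts: int = 3, base_delay: float = 0.01):
    """Call fn(), retrying on exception with exponentially growing delay."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # exhausted: surface the failure to the caller
            time.sleep(base_delay * (2 ** i))
```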

Common Issues

Training jobs fail intermittently in distributed settings. Distributed training introduces failure modes absent in single-node training: network partitions between workers, GPU memory errors on specific nodes, and stragglers that slow the entire job. Implement checkpointing every N steps to recover from failures. Use fault-tolerant training frameworks like PyTorch Elastic that replace failed workers without restarting the entire job. Log per-worker metrics to identify consistently problematic nodes.
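The checkpoint-and-resume idea can be sketched as follows. This is a minimal illustration: state is a plain JSON file, whereas a real job would persist model and optimizer state to object storage.

```python
# Checkpoint every N steps so a restarted job resumes from the last
# saved step instead of step 0. File-based state is for illustration.
import json
import os

def train(total_steps: int, ckpt_path: str, checkpoint_every: int = 100) -> int:
    """Run (or resume) a training loop; return the step resumed from."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["step"]
    for step in range(start, total_steps):
        # ... one training step would run here ...
        if (step + 1) % checkpoint_every == 0:
            with open(ckpt_path, "w") as f:
                json.dump({"step": step + 1}, f)
    return start
```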

Model serving infrastructure can't handle traffic spikes. Auto-scaling based on CPU or GPU utilization reacts too slowly for traffic bursts. Use predictive scaling based on historical traffic patterns for known spikes (product launches, marketing campaigns). For unpredictable spikes, maintain a buffer of warm instances (20-30% above baseline) and pre-compute predictions for the most common inputs to serve from cache during overload.
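The warm-buffer sizing and cache-first serving described above amount to a few lines of logic (the 25% buffer and the cache structure are illustrative):

```python
# Warm-pool sizing plus cache-first serving for common inputs.
# Buffer fraction and cache layout are illustrative.
import math

def warm_pool_size(baseline_instances: int, buffer_fraction: float = 0.25) -> int:
    """Baseline capacity plus a warm buffer, rounded up."""
    return math.ceil(baseline_instances * (1 + buffer_fraction))

def serve(features, cache: dict, model_predict):
    """Prefer the precomputed cache; fall back to the live model."""
    key = tuple(features)
    if key in cache:
        return cache[key]
    return model_predict(features)
```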

Development and production environments produce different model behavior. Environment parity failures come from library version differences, hardware differences (CPU vs GPU floating point), and data pipeline discrepancies. Eliminate these by using identical Docker images across environments, validating model outputs on both CPU and GPU during CI, and running production data through the development pipeline periodically to catch preprocessing differences.
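A CI parity check compares outputs from the two environments within a floating-point tolerance rather than demanding exact equality, since CPU and GPU kernels can differ in the last bits. A minimal sketch (tolerances are illustrative):

```python
# Element-wise closeness check between two environments' model outputs,
# similar in spirit to numpy.allclose. Tolerances are illustrative.
def outputs_match(a: list[float], b: list[float],
                  rel_tol: float = 1e-5, abs_tol: float = 1e-6) -> bool:
    """True if outputs agree within combined absolute/relative tolerance."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= abs_tol + rel_tol * abs(y) for x, y in zip(a, b))
```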
