# AI Engineer Assistant
An autonomous agent that helps design, implement, and deploy AI systems from model selection through production infrastructure, covering the full ML engineering lifecycle with emphasis on performance, scalability, and responsible AI practices.
## When to Use This Agent
Choose AI Engineer Assistant when:
- Designing end-to-end AI system architectures for production
- Selecting models, frameworks, and infrastructure for ML workloads
- Building training and inference pipelines with proper monitoring
- Implementing MLOps practices including CI/CD for ML
- Optimizing model performance, latency, and cost in production
Consider alternatives when:
- Doing exploratory data analysis without model building (use a data analyst agent)
- Building traditional software without AI components (use a standard dev agent)
- Running one-off experiments in notebooks (use a data science agent)
## Quick Start

```yaml
# .claude/agents/ai-engineer-assistant.yml
name: AI Engineer Assistant
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior AI engineer. Design and implement production-grade AI
  systems covering model selection, training pipelines, inference
  optimization, and deployment infrastructure. Prioritize reliability,
  scalability, and ethical AI practices.
```
Example invocation:

```bash
claude --agent ai-engineer-assistant "Design an inference pipeline for our text classification model that handles 1000 req/s with p99 latency under 100ms, using our existing Kubernetes cluster"
```
## Core Concepts

### AI System Architecture Layers
| Layer | Components | Key Decisions |
|---|---|---|
| Data | Ingestion, storage, versioning | Feature store vs inline, batch vs streaming |
| Training | Pipelines, orchestration, tracking | Framework choice, distributed strategy |
| Model | Architecture, optimization, validation | Model size, quantization, distillation |
| Serving | Inference, scaling, caching | Real-time vs batch, GPU vs CPU |
| Monitoring | Drift detection, metrics, alerts | Accuracy tracking, data quality checks |
| Governance | Bias testing, explainability, audit | Fairness metrics, model cards |
### Model Selection Framework

```python
def select_model(requirements):
    """
    Evaluate models across key dimensions:
      accuracy:    Does it meet quality thresholds?
      latency:     Can it serve within SLA? (p50, p95, p99)
      throughput:  Handles expected QPS?
      cost:        Training + inference within budget?
      maintenance: Team can operate and update it?
      compliance:  Meets data/privacy requirements?
    """
    candidates = filter_by_task_type(requirements.task)
    scored = [
        (model, weighted_score(model, requirements))
        for model in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```
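The helpers `filter_by_task_type` and `weighted_score` are left undefined in the snippet. One possible sketch of the scoring step — the weights, the `model.scores` shape, and `min_compliance` are illustrative assumptions, not part of the template:

```python
from types import SimpleNamespace

# Illustrative dimension weights; tune per project.
WEIGHTS = {
    "accuracy": 0.30,
    "latency": 0.20,
    "throughput": 0.15,
    "cost": 0.15,
    "maintenance": 0.10,
    "compliance": 0.10,
}

def weighted_score(model, requirements):
    """Score a candidate model in [0, 1] across the evaluation dimensions.

    Assumes `model.scores` is a dict of dimension -> score in [0, 1].
    A hard requirement (compliance below the floor) zeroes the whole score.
    """
    if model.scores.get("compliance", 0.0) < requirements.min_compliance:
        return 0.0
    return sum(WEIGHTS[dim] * model.scores.get(dim, 0.0) for dim in WEIGHTS)

# Toy candidate that is perfect on every dimension.
candidate = SimpleNamespace(scores={dim: 1.0 for dim in WEIGHTS})
reqs = SimpleNamespace(min_compliance=0.5)
score = weighted_score(candidate, reqs)
```

Making compliance a hard gate rather than a weighted term reflects the usual reality that a model that fails privacy requirements is unusable at any accuracy.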
### Production Inference Pipeline

```
Request → Load Balancer → Preprocessing → Model Server → Postprocessing → Response
               │                               │
               └── Health checks       ┌───────┴────────┐
                                       │ Model Registry │
                                       │  (versioned)   │
                                       └────────────────┘

Monitoring: latency, throughput, error rate, drift
```
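The stages in the diagram can be sketched end to end in a few functions, with a versioned in-memory dict standing in for a real model registry. All names here are illustrative; a production server would run many replicas behind the load balancer:

```python
# Minimal sketch of the serving path: versioned registry lookup,
# preprocessing, inference, postprocessing.
MODEL_REGISTRY = {}  # version string -> callable model

def register_model(version, model_fn):
    MODEL_REGISTRY[version] = model_fn

def preprocess(text):
    # Same normalization must run at training time to avoid skew.
    return text.lower().strip()

def postprocess(score, threshold=0.5):
    return {"label": "positive" if score >= threshold else "negative",
            "score": score}

def serve(request, version="v1"):
    model = MODEL_REGISTRY[version]          # versioned lookup
    features = preprocess(request["text"])
    score = model(features)
    return postprocess(score)

# Toy model: scores 1.0 if "good" appears in the normalized text.
register_model("v1", lambda text: 1.0 if "good" in text else 0.0)
result = serve({"text": "  This is GOOD  "})
```

Keeping pre- and postprocessing as separate stages, as the diagram does, is what lets you swap the model version without touching the rest of the path.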
## Configuration

| Parameter | Description | Default |
|---|---|---|
| framework | Preferred ML framework | PyTorch |
| serving_platform | Model serving infrastructure | TorchServe |
| experiment_tracker | Experiment tracking tool | MLflow |
| gpu_type | Target GPU for optimization | A100 |
| max_latency_ms | Target p99 latency constraint | 100 |
| quantization | Default quantization strategy | FP16 |
| monitoring | Monitoring stack preference | Prometheus/Grafana |
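For instance, a latency-sensitive project might override a few of these while keeping the rest at their defaults. The key names mirror the table; the file layout is an assumption, not something the template mandates:

```yaml
# Hypothetical per-project overrides; unlisted parameters keep the defaults
max_latency_ms: 50     # tighter than the 100 ms default
quantization: INT8     # trade accuracy headroom for latency
gpu_type: A100
```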
## Best Practices

- **Version everything, not just models.** Track data versions, preprocessing code, hyperparameters, and environment configurations together. When a model performs differently in production than in training, you need to reproduce the exact conditions. Tools like DVC for data and MLflow for experiments make this manageable.
- **Start with the simplest model that meets requirements.** A logistic regression serving 10,000 requests per second costs a fraction of a transformer doing the same job at 100 requests per second. Establish baseline metrics with simple models first, then justify complexity increases with measured improvements that matter to the business.
- **Design for graceful degradation.** Production AI systems must handle model failures without crashing the application. Implement fallback strategies: cached predictions for common inputs, rule-based defaults when the model is unavailable, and circuit breakers that route traffic away from unhealthy model replicas.
- **Separate feature computation from model inference.** Feature stores let you compute expensive features once and reuse them across training and serving, eliminating training-serving skew. Online feature stores serve precomputed features at low latency. This separation also makes it easy to swap models without rebuilding the feature pipeline.
- **Monitor for data drift, not just accuracy.** Accuracy metrics require labeled data, which often arrives with delay. Statistical drift detection on input features catches problems in real time. Track feature distributions, prediction distributions, and data quality metrics. Alert when distributions shift beyond thresholds, even before accuracy metrics degrade.
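The distribution-shift alerting in the last point can be sketched with a population stability index (PSI) over binned feature values — a common choice, though the metric and the 0.2 alert threshold here are conventions, not requirements of the template:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live
    sample of one numeric feature. Rule of thumb: < 0.1 stable,
    0.1-0.2 moderate shift, > 0.2 alert (thresholds are conventional)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def frac(sample, i):
        left, right = lo + i * width, lo + (i + 1) * width
        n = sum(left <= x < right or (i == bins - 1 and x == hi)
                for x in sample)
        return max(n / len(sample), 1e-6)   # floor avoids log(0)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

baseline = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
same = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]   # mass moved to [0.5, 1)
```

Running `psi` per feature on a sliding window of live traffic, and alerting when it crosses the threshold, gives the early warning described above without waiting for labels.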
## Common Issues
**Training-serving skew causes accuracy drops in production.** This happens when feature computation differs between training and serving environments. Use the same feature extraction code for both paths, ideally through a shared feature store. Test for skew by running production data through the training pipeline and comparing feature values. Even small numerical differences from library versions or floating-point handling can compound into significant prediction errors.
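The skew test described can be as simple as replaying the same raw records through both feature paths and diffing the outputs within a tolerance. The two featurizer functions below are placeholders for your actual pipelines:

```python
def check_skew(records, train_featurize, serve_featurize, tol=1e-6):
    """Replay raw records through the training and serving feature paths
    and report any feature whose values diverge beyond `tol`.

    Both featurizers are assumed to return a dict of
    feature name -> float for a single record."""
    mismatches = []
    for i, record in enumerate(records):
        train_f = train_featurize(record)
        serve_f = serve_featurize(record)
        for name in train_f.keys() | serve_f.keys():
            a, b = train_f.get(name), serve_f.get(name)
            if a is None or b is None or abs(a - b) > tol:
                mismatches.append((i, name, a, b))
    return mismatches

# Toy skew: the serving path rounds a feature the training path does not.
train_path = lambda r: {"len": float(len(r))}
serve_path = lambda r: {"len": float(round(len(r), -1))}
issues = check_skew(["short", "a much longer record here"],
                    train_path, serve_path)
```

Running this regularly on a sample of live traffic turns the "compare feature values" advice into an automated regression check.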
**Model latency exceeds SLA under load.** Profile the full request path, not just model inference. Often preprocessing, tokenization, or postprocessing dominate latency. Apply optimizations in order of impact: batching requests, model quantization (FP32 to FP16 or INT8), ONNX Runtime or TensorRT conversion, input truncation, and response caching for repeated queries. Each step typically yields 2-4x improvement.
**Model deployment breaks existing functionality.** Implement canary deployments that route a small percentage of traffic to the new model version while monitoring key metrics. Define rollback criteria (accuracy drops below threshold, latency exceeds SLA, error rate spikes) and automate the rollback trigger. Shadow deployments that run new models in parallel without serving results are even safer for high-stakes applications.
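The canary routing and automated rollback trigger described above can be sketched as follows; the traffic fraction, thresholds, and metric names are illustrative assumptions:

```python
import random

def route(request, canary_fraction=0.05):
    """Send a small, random fraction of traffic to the canary version."""
    return "canary" if random.random() < canary_fraction else "stable"

def should_rollback(metrics, baseline):
    """Automated rollback trigger for the criteria described above.
    Both arguments are dicts with accuracy, p99_ms, and error_rate."""
    return (
        metrics["accuracy"] < baseline["accuracy"] - 0.02        # accuracy drop
        or metrics["p99_ms"] > 100                               # latency SLA
        or metrics["error_rate"] > 2 * baseline["error_rate"]    # error spike
    )

baseline = {"accuracy": 0.92, "p99_ms": 80, "error_rate": 0.001}
healthy  = {"accuracy": 0.91, "p99_ms": 85, "error_rate": 0.001}
degraded = {"accuracy": 0.85, "p99_ms": 140, "error_rate": 0.01}
```

In practice the rollback check runs on windowed canary metrics, and a shadow deployment is the same idea with `route` always returning `"stable"` while the canary's predictions are logged but never served.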