
AI Engineer Assistant

An autonomous agent that helps design, implement, and deploy AI systems from model selection through production infrastructure, covering the full ML engineering lifecycle with emphasis on performance, scalability, and responsible AI practices.

When to Use This Agent

Choose AI Engineer Assistant when:

  • Designing end-to-end AI system architectures for production
  • Selecting models, frameworks, and infrastructure for ML workloads
  • Building training and inference pipelines with proper monitoring
  • Implementing MLOps practices including CI/CD for ML
  • Optimizing model performance, latency, and cost in production

Consider alternatives when:

  • Doing exploratory data analysis without model building (use a data analyst agent)
  • Building traditional software without AI components (use a standard dev agent)
  • Running one-off experiments in notebooks (use a data science agent)

Quick Start

```yaml
# .claude/agents/ai-engineer-assistant.yml
name: AI Engineer Assistant
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior AI engineer. Design and implement production-grade
  AI systems covering model selection, training pipelines, inference
  optimization, and deployment infrastructure. Prioritize reliability,
  scalability, and ethical AI practices.
```

Example invocation:

claude --agent ai-engineer-assistant "Design an inference pipeline for our text classification model that handles 1000 req/s with p99 latency under 100ms, using our existing Kubernetes cluster"

Core Concepts

AI System Architecture Layers

| Layer | Components | Key Decisions |
| --- | --- | --- |
| Data | Ingestion, storage, versioning | Feature store vs inline, batch vs streaming |
| Training | Pipelines, orchestration, tracking | Framework choice, distributed strategy |
| Model | Architecture, optimization, validation | Model size, quantization, distillation |
| Serving | Inference, scaling, caching | Real-time vs batch, GPU vs CPU |
| Monitoring | Drift detection, metrics, alerts | Accuracy tracking, data quality checks |
| Governance | Bias testing, explainability, audit | Fairness metrics, model cards |

Model Selection Framework

```python
def select_model(requirements):
    """
    Evaluate models across key dimensions:
      accuracy:    Does it meet quality thresholds?
      latency:     Can it serve within SLA? (p50, p95, p99)
      throughput:  Handles expected QPS?
      cost:        Training + inference within budget?
      maintenance: Team can operate and update it?
      compliance:  Meets data/privacy requirements?
    """
    candidates = filter_by_task_type(requirements.task)
    scored = [
        (model, weighted_score(model, requirements))
        for model in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```
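The helpers `filter_by_task_type` and `weighted_score` are left undefined above. A minimal, self-contained sketch of one way they might work, with a toy model registry and illustrative weights (every name and score below is hypothetical), repeating `select_model` so the example runs on its own:

```python
from dataclasses import dataclass, field

# Hypothetical requirements object; weights express the relative
# importance of each evaluation dimension.
@dataclass
class Requirements:
    task: str
    weights: dict = field(default_factory=lambda: {
        "accuracy": 0.4, "latency": 0.3, "cost": 0.3})

# Toy candidate registry: per-dimension scores normalized to [0, 1].
MODEL_REGISTRY = [
    {"name": "logreg", "task": "classification",
     "scores": {"accuracy": 0.70, "latency": 0.95, "cost": 0.95}},
    {"name": "distilbert", "task": "classification",
     "scores": {"accuracy": 0.88, "latency": 0.70, "cost": 0.60}},
    {"name": "resnet50", "task": "vision",
     "scores": {"accuracy": 0.90, "latency": 0.60, "cost": 0.50}},
]

def filter_by_task_type(task):
    """Keep only candidates built for the requested task."""
    return [m for m in MODEL_REGISTRY if m["task"] == task]

def weighted_score(model, requirements):
    """Weighted sum of the model's per-dimension scores."""
    return sum(w * model["scores"][dim]
               for dim, w in requirements.weights.items())

def select_model(requirements):
    candidates = filter_by_task_type(requirements.task)
    scored = [(m["name"], weighted_score(m, requirements))
              for m in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

With these weights, `select_model(Requirements(task="classification"))` ranks the cheap, fast logistic regression above the transformer, which is exactly the trade-off a latency- and cost-weighted profile should produce.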

Production Inference Pipeline

```
Request → Load Balancer → Preprocessing → Model Server → Postprocessing → Response
              │                                  │
              └── Health checks          ┌───────┴────────┐
                                         │ Model Registry │
                                         │  (versioned)   │
                                         └────────────────┘
              Monitoring: latency, throughput, error rate, drift
```
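The request path above can be sketched as a plain function chain. Everything here (the stub model, the label set, the registry layout) is illustrative, not a specific serving framework:

```python
# Versioned registry: model name -> version -> callable. The stub
# model ignores its input and returns fixed class scores.
SERVING_REGISTRY = {"sentiment": {"v1": lambda feats: [0.2, 0.8]}}

def preprocess(text):
    """Normalize the raw request payload before inference."""
    return text.strip().lower()

def postprocess(logits, labels=("negative", "positive")):
    """Map raw model output to a labeled prediction."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return {"label": labels[best], "score": logits[best]}

def handle_request(text, model_name="sentiment", version="v1"):
    model = SERVING_REGISTRY[model_name][version]  # versioned lookup
    feats = preprocess(text)
    logits = model(feats)
    return postprocess(logits)
```

Keeping preprocessing and postprocessing as separate, testable functions is what later lets you reuse the exact same code in the training path (see the training-serving skew discussion below).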

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| framework | Preferred ML framework | PyTorch |
| serving_platform | Model serving infrastructure | TorchServe |
| experiment_tracker | Experiment tracking tool | MLflow |
| gpu_type | Target GPU for optimization | A100 |
| max_latency_ms | Target p99 latency constraint (ms) | 100 |
| quantization | Default quantization strategy | FP16 |
| monitoring | Monitoring stack preference | Prometheus/Grafana |

Best Practices

  1. Version everything, not just models. Track data versions, preprocessing code, hyperparameters, and environment configurations together. When a model performs differently in production than in training, you need to reproduce the exact conditions. Tools like DVC for data and MLflow for experiments make this manageable.

  2. Start with the simplest model that meets requirements. A logistic regression serving at 10,000 requests per second costs a fraction of a transformer doing the same job at 100 requests per second. Establish baseline metrics with simple models first, then justify complexity increases with measured improvements that matter to the business.

  3. Design for graceful degradation. Production AI systems must handle model failures without crashing the application. Implement fallback strategies: cached predictions for common inputs, rule-based defaults when the model is unavailable, and circuit breakers that route traffic away from unhealthy model replicas.

  4. Separate feature computation from model inference. Feature stores let you compute expensive features once and reuse them across training and serving, eliminating training-serving skew. Online feature stores serve precomputed features at low latency. This separation also makes it easy to swap models without rebuilding the feature pipeline.

  5. Monitor for data drift, not just accuracy. Accuracy metrics require labeled data, which often arrives with delay. Statistical drift detection on input features catches problems in real-time. Track feature distributions, prediction distributions, and data quality metrics. Alert when distributions shift beyond thresholds, even before accuracy metrics degrade.
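The drift detection in practice 5 can be sketched with a Population Stability Index (PSI) check in pure Python. The bin count and the 0.1/0.25 thresholds are common rules of thumb, not fixed standards:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live
    sample of one numeric feature. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant drift."""
    lo, hi = min(expected), max(expected)
    # Equal-width bin edges derived from the baseline range.
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        n = len(sample)
        # Smooth empty buckets so the log term stays defined.
        return [max(c / n, 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

In production you would run this per feature on a schedule (baseline = training data, actual = a recent serving window) and alert when the index crosses your chosen threshold.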

Common Issues

Training-serving skew causes accuracy drops in production. This happens when feature computation differs between training and serving environments. Use the same feature extraction code for both paths, ideally through a shared feature store. Test for skew by running production data through the training pipeline and comparing feature values. Even small numerical differences from library versions or floating-point handling can compound into significant prediction errors.
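A skew test of the kind described can be as simple as replaying the same raw records through both feature paths and flagging divergent rows. The feature functions below are hypothetical, with a deliberate missing `.strip()` in the serving path as the planted bug:

```python
import math

def training_features(record):
    text = record["text"].strip().lower()
    return [record["amount"] / 100.0, float(len(text))]

def serving_features(record):
    text = record["text"].lower()  # forgot .strip(): the skew bug
    return [record["amount"] / 100.0, float(len(text))]

def skew_report(records, rel_tol=1e-9):
    """Return indices of records whose feature vectors diverge
    between the training path and the serving path."""
    bad = []
    for i, rec in enumerate(records):
        t, s = training_features(rec), serving_features(rec)
        if any(not math.isclose(a, b, rel_tol=rel_tol)
               for a, b in zip(t, s)):
            bad.append(i)
    return bad
```

Running this over a sample of production records pinpoints exactly which rows and which feature paths disagree, which is much faster to debug than an aggregate accuracy drop.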

Model latency exceeds SLA under load. Profile the full request path, not just model inference; preprocessing, tokenization, or postprocessing often dominate latency. Apply optimizations in order of impact: request batching, model quantization (FP32 to FP16 or INT8), ONNX Runtime or TensorRT conversion, input truncation, and response caching for repeated queries. Gains vary by model and workload, but batching, quantization, and runtime conversion each commonly yield a 2-4x improvement.
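Of the optimizations listed, response caching is the cheapest to sketch. This toy version uses `functools.lru_cache` and assumes deterministic model output for a given input; clear the cache on every model deploy:

```python
from functools import lru_cache

CALLS = {"model": 0}  # counter so the cache effect is observable

def run_model(text):
    """Stand-in for the expensive inference call."""
    CALLS["model"] += 1
    return {"label": "positive" if "good" in text else "negative"}

@lru_cache(maxsize=10_000)
def cached_predict(text):
    # Cache values must be hashable, so store the dict as a tuple.
    return tuple(sorted(run_model(text).items()))

def predict(text):
    # Normalize the cache key so trivially different requests hit.
    return dict(cached_predict(text.strip().lower()))
```

On a new deploy, call `cached_predict.cache_clear()` so stale predictions from the previous model version are not served.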

Model deployment breaks existing functionality. Implement canary deployments that route a small percentage of traffic to the new model version while monitoring key metrics. Define rollback criteria (accuracy drops below threshold, latency exceeds SLA, error rate spikes) and automate the rollback trigger. Shadow deployments that run new models in parallel without serving results are even safer for high-stakes applications.
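A canary rollout with automated rollback might look like this toy router. The 5% default traffic slice and the error-rate/latency criteria are illustrative, not recommendations:

```python
import random

class CanaryRouter:
    """Route a small fraction of traffic to a candidate model and
    trip an automatic rollback when its live metrics violate the
    predefined criteria."""

    def __init__(self, stable, canary, fraction=0.05,
                 max_error_rate=0.02, max_p99_ms=100.0):
        self.stable, self.canary = stable, canary
        self.fraction = fraction
        self.max_error_rate = max_error_rate
        self.max_p99_ms = max_p99_ms
        self.rolled_back = False

    def pick(self):
        """Choose which model version serves this request."""
        if self.rolled_back or random.random() >= self.fraction:
            return self.stable
        return self.canary

    def report(self, error_rate, p99_ms):
        """Feed live canary metrics; rollback on any violation."""
        if error_rate > self.max_error_rate or p99_ms > self.max_p99_ms:
            self.rolled_back = True
```

A shadow deployment is the same idea with `fraction` effectively zero for serving: the canary receives copies of requests and its outputs are logged and compared, but never returned to users.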
