
Machine Learning Assistant

Battle-tested agent for optimizing, deploying, and serving ML models in production. Includes structured workflows, validation checks, and reusable patterns for data and AI work.

Agent · Cliptics · data-ai · v1.0.0 · MIT


An agent specialized in deploying and serving ML models at scale. It covers model optimization, inference infrastructure, real-time serving, and edge deployment, with the goal of building reliable, performant ML systems that handle production workloads efficiently.

When to Use This Agent

Choose Machine Learning Assistant when:

  • Optimizing trained models for production inference performance
  • Setting up model serving infrastructure (TorchServe, Triton, TFServing)
  • Implementing real-time prediction APIs with latency constraints
  • Deploying models to edge devices or mobile platforms
  • Building model monitoring and automated retraining pipelines

Consider alternatives when:

  • Training models from scratch in research settings (use a data science agent)
  • Designing full AI system architectures (use an AI engineer agent)
  • Working on MLOps infrastructure without model work (use an MLOps agent)

Quick Start

```yaml
# .claude/agents/machine-learning-assistant.yml
name: Machine Learning Assistant
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior ML deployment engineer. Optimize and deploy ML models
  for production, focusing on inference performance, serving reliability,
  and operational efficiency. Handle the full path from trained model to
  production endpoint.
```

Example invocation:

claude --agent machine-learning-assistant "Optimize our BERT-based sentiment classifier for production: reduce p99 latency below 50ms, support 500 QPS, and deploy on Kubernetes with auto-scaling"

Core Concepts

Model Optimization Pipeline

Trained Model (FP32)
    ↓ Prune (remove low-weight connections)
Pruned Model (10-50% smaller)
    ↓ Quantize (FP32 → FP16 → INT8)
Quantized Model (2-4x faster, 2-4x smaller)
    ↓ Export (ONNX, TorchScript, SavedModel)
Optimized Format
    ↓ Platform optimize (TensorRT, OpenVINO, CoreML)
Deployment-Ready Model
    ↓ Serve (Triton, TorchServe, custom)
Production Endpoint
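The quantize step in the pipeline above can be sketched numerically. This is a minimal, framework-free illustration of per-tensor affine INT8 quantization; real toolchains (e.g. PyTorch quantization, TensorRT) typically calibrate per channel, and the function names here are illustrative:

```python
def int8_params(values):
    """Compute scale and zero-point mapping [min, max] onto [-128, 127]."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # range must include zero
    scale = (hi - lo) / 255.0 or 1.0         # guard against a constant tensor
    zero_point = round(-128 - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point):
    # q = clamp(round(x / scale) + zero_point) into the INT8 range
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

weights = [-0.52, -0.1, 0.0, 0.3, 0.91]
scale, zp = int8_params(weights)
recovered = dequantize(quantize(weights, scale, zp), scale, zp)
# round-trip error is bounded by half the quantization step
assert max(abs(a - b) for a, b in zip(weights, recovered)) <= scale / 2 + 1e-9
```

This is why INT8 usually costs little accuracy: the error per weight is at most half the quantization step, which shrinks with the tensor's dynamic range.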

Serving Architecture Comparison

| Platform     | Best For                         | GPU Support | Batching   | Multi-Model |
|--------------|----------------------------------|-------------|------------|-------------|
| Triton       | Multi-framework, high throughput | Excellent   | Dynamic    | Yes         |
| TorchServe   | PyTorch-native models            | Good        | Yes        | Yes         |
| TF Serving   | TensorFlow models                | Good        | Yes        | Yes         |
| ONNX Runtime | Cross-platform deployment        | Good        | Manual     | Manual      |
| vLLM         | LLM inference                    | Excellent   | Continuous | Limited     |
| Ray Serve    | Complex pipelines                | Good        | Custom     | Yes         |
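The Batching column refers to how a server groups concurrent requests before each forward pass. As a toy sketch of the dynamic-batching policy (a batch opens when the first request arrives, then closes after a fixed window or when it fills up, whichever comes first); `form_batches` is an illustrative name, not any serving framework's API:

```python
def form_batches(arrivals, window_ms=5.0, max_batch=32):
    """Group request timestamps (in ms) into batches.

    A batch closes when the window expires or max_batch is reached.
    """
    batches, current, deadline = [], [], None
    for t in sorted(arrivals):
        if current and (t > deadline or len(current) == max_batch):
            batches.append(current)           # close the full/expired batch
            current = []
        if not current:
            deadline = t + window_ms          # window starts at first arrival
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Requests arriving close together share a batch; a gap starts a new one.
print(form_batches([0, 1, 2, 40, 41], window_ms=5))
# -> [[0, 1, 2], [40, 41]]
```

Tuning the window trades latency for throughput, which is the subject of best practice 2 below.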

Deployment Patterns

```yaml
# Canary rollout for an ML model with KServe: route 10% of traffic to v2
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-classifier
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://models/sentiment/v2
      resources:
        requests:
          cpu: 2
          memory: 2Gi
        limits:
          nvidia.com/gpu: 1
          memory: 4Gi
```

Configuration

| Parameter        | Description                  | Default         |
|------------------|------------------------------|-----------------|
| serving_platform | Model serving infrastructure | Triton          |
| export_format    | Model export format          | ONNX            |
| quantization     | Default quantization level   | FP16            |
| batch_size       | Dynamic batching max size    | 32              |
| max_latency_ms   | Latency SLA target           | 100             |
| gpu_type         | Target GPU for optimization  | T4              |
| scaling_metric   | Auto-scaling trigger metric  | GPU utilization |

Best Practices

  1. Benchmark end-to-end, not just model inference. A model that runs in 10ms can sit behind a serving stack that adds 90ms of overhead from serialization, preprocessing, network hops, and postprocessing. Measure total request latency from the client's perspective and optimize the slowest component first. Often it's the data preprocessing or result formatting, not the model itself.

  2. Use dynamic batching to maximize GPU utilization. Serving one request at a time wastes GPU parallelism. Dynamic batching collects requests over a short window and processes them together. Configure the batch window to balance latency (shorter window) against throughput (larger batches). A 5ms batching window with max batch size of 32 often doubles throughput with minimal latency impact.

  3. Quantize aggressively, validate carefully. FP16 quantization rarely affects accuracy and halves memory usage. INT8 quantization typically loses less than 1% accuracy while providing another 2x speedup. Always validate quantized model accuracy on your test set before deploying. Some model architectures and tasks are more sensitive to quantization than others—measure rather than assume.

  4. Implement model versioning with instant rollback. Store models in a versioned registry (MLflow, S3 with versioning) and deploy with canary or blue-green strategies. Keep the previous version warm so rollback takes seconds, not minutes. Define automated rollback triggers based on accuracy metrics, latency, or error rate that execute without human intervention.

  5. Cache predictions for repeated inputs. Many production workloads have skewed input distributions where a small percentage of unique inputs covers a large percentage of requests. A prediction cache (Redis or in-memory LRU) for the top 1,000 inputs can reduce GPU load significantly. Hash the preprocessed input tensor rather than the raw request to maximize cache hit rates.
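Practice 5 can be sketched as an in-memory LRU cache keyed by a hash of the preprocessed feature vector. `run_model` is a stand-in for real inference, and the class is an illustration, not a library API:

```python
import hashlib
from collections import OrderedDict

class PredictionCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = self.misses = 0

    @staticmethod
    def _key(features):
        # Hash the preprocessed feature vector, not the raw request, so
        # requests that normalize to the same input share a cache entry.
        blob = ",".join(f"{x:.6f}" for x in features).encode()
        return hashlib.sha256(blob).hexdigest()

    def get_or_compute(self, features, run_model):
        key = self._key(features)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)      # mark as most recently used
            return self._store[key]
        self.misses += 1
        result = run_model(features)
        self._store[key] = result
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least recently used
        return result
```

If you retrain regularly, include the model version in the cache key (or flush on deploy) so stale predictions never outlive the model that produced them.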

Common Issues

GPU memory exhaustion under load. This happens when batch sizes grow unconstrained or when model loading doesn't account for activation memory. Set explicit max batch sizes based on profiled memory usage at peak. Reserve 20-30% GPU memory headroom for activation tensors and framework overhead. Use nvidia-smi monitoring to track memory usage patterns under realistic load before setting production limits.
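The headroom rule above reduces to simple arithmetic once you have profiled numbers. A back-of-envelope sizing helper (the figures in the example are illustrative, not measured):

```python
def max_batch_size(gpu_mem_gb, model_mem_gb, act_per_sample_gb, headroom=0.25):
    """Largest batch that fits after reserving headroom for framework overhead."""
    budget = gpu_mem_gb * (1 - headroom) - model_mem_gb
    return max(0, int(budget / act_per_sample_gb))

# 16 GB T4, 2.5 GB of model weights, ~0.3 GB of activations per sample:
print(max_batch_size(16, 2.5, 0.3))   # -> 31
```

Treat the result as an upper bound for the serving config's max batch size, and confirm it under realistic load before shipping.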

Model accuracy degrades gradually over time in production. Data drift causes the production input distribution to shift away from the training distribution. Implement continuous monitoring that compares production feature distributions against training baselines. Set up automated alerts when drift exceeds thresholds (Population Stability Index > 0.2 or KL divergence > 0.1). Schedule regular retraining with recent production data.
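The PSI check described above fits in a short, dependency-free function. Equal-width bins derived from the training baseline are one common binning choice, not the only one:

```python
import math

def psi(baseline, production, n_bins=10):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1   # index of v's bin
        # floor at a small epsilon so empty bins don't produce log(0)
        return [max(c / len(values), 1e-4) for c in counts]

    p = bin_fractions(baseline)
    q = bin_fractions(production)
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))

# Identical distributions give PSI ~ 0; a large shift trips the 0.2 alert.
```

Run this per feature on a rolling window of production traffic and alert when any feature crosses the threshold.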

Cold start latency is unacceptable for serverless deployment. Model loading can take 5-30 seconds depending on model size and framework. Mitigate with provisioned concurrency (keep warm instances), model weight caching in shared memory, and smaller model formats (ONNX loads faster than full PyTorch checkpoints). For latency-critical services, use always-on instances rather than scale-to-zero serverless patterns.
