
Machine Learning Assistant

Battle-tested agent for optimizing, deploying, and serving ML models in production. Includes structured workflows, validation checks, and reusable patterns for data and AI work.

Agent · Cliptics · data-ai · v1.0.0 · MIT


An agent specialized in deploying and serving ML models at scale. It covers model optimization, inference infrastructure, real-time serving, and edge deployment, with the goal of building reliable, performant ML systems that handle production workloads efficiently.

When to Use This Agent

Choose Machine Learning Assistant when:

  • Optimizing trained models for production inference performance
  • Setting up model serving infrastructure (TorchServe, Triton, TFServing)
  • Implementing real-time prediction APIs with latency constraints
  • Deploying models to edge devices or mobile platforms
  • Building model monitoring and automated retraining pipelines

Consider alternatives when:

  • Training models from scratch in research settings (use a data science agent)
  • Designing full AI system architectures (use an AI engineer agent)
  • Working on MLOps infrastructure without model work (use an MLOps agent)

Quick Start

```yaml
# .claude/agents/machine-learning-assistant.yml
name: Machine Learning Assistant
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior ML deployment engineer. Optimize and deploy ML models
  for production, focusing on inference performance, serving reliability,
  and operational efficiency. Handle the full path from trained model to
  production endpoint.
```

Example invocation:

claude --agent machine-learning-assistant "Optimize our BERT-based sentiment classifier for production: reduce p99 latency below 50ms, support 500 QPS, and deploy on Kubernetes with auto-scaling"

Core Concepts

Model Optimization Pipeline

Trained Model (FP32)
    ↓ Prune (remove low-weight connections)
Pruned Model (10-50% smaller)
    ↓ Quantize (FP32 → FP16 → INT8)
Quantized Model (2-4x faster, 2-4x smaller)
    ↓ Export (ONNX, TorchScript, SavedModel)
Optimized Format
    ↓ Platform optimize (TensorRT, OpenVINO, CoreML)
Deployment-Ready Model
    ↓ Serve (Triton, TorchServe, custom)
Production Endpoint
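The quantize step in the pipeline above can be sketched numerically. This is a minimal, framework-free illustration of per-tensor affine INT8 quantization; real toolchains (e.g. PyTorch quantization, TensorRT) typically calibrate per channel, and the function names here are illustrative:

```python
def int8_params(values):
    """Compute scale and zero-point mapping [min, max] onto [-128, 127]."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # range must include zero
    scale = (hi - lo) / 255.0 or 1.0         # guard against a constant tensor
    zero_point = round(-128 - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point):
    # q = clamp(round(x / scale) + zero_point) into the INT8 range
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

weights = [-0.52, -0.1, 0.0, 0.3, 0.91]
scale, zp = int8_params(weights)
recovered = dequantize(quantize(weights, scale, zp), scale, zp)
# round-trip error is bounded by half the quantization step
assert max(abs(a - b) for a, b in zip(weights, recovered)) <= scale / 2 + 1e-9
```

This is why INT8 usually costs little accuracy: the error per weight is at most half the quantization step, which shrinks with the tensor's dynamic range.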

Serving Architecture Comparison

| Platform     | Best For                         | GPU Support | Batching   | Multi-Model |
|--------------|----------------------------------|-------------|------------|-------------|
| Triton       | Multi-framework, high throughput | Excellent   | Dynamic    | Yes         |
| TorchServe   | PyTorch-native models            | Good        | Yes        | Yes         |
| TF Serving   | TensorFlow models                | Good        | Yes        | Yes         |
| ONNX Runtime | Cross-platform deployment        | Good        | Manual     | Manual      |
| vLLM         | LLM inference                    | Excellent   | Continuous | Limited     |
| Ray Serve    | Complex pipelines                | Good        | Custom     | Yes         |
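The Batching column refers to how a server groups concurrent requests before each forward pass. As a toy sketch of the dynamic-batching policy (a batch opens when the first request arrives, then closes after a fixed window or when it fills up, whichever comes first); `form_batches` is an illustrative name, not any serving framework's API:

```python
def form_batches(arrivals, window_ms=5.0, max_batch=32):
    """Group request timestamps (in ms) into batches.

    A batch closes when the window expires or max_batch is reached.
    """
    batches, current, deadline = [], [], None
    for t in sorted(arrivals):
        if current and (t > deadline or len(current) == max_batch):
            batches.append(current)           # close the full/expired batch
            current = []
        if not current:
            deadline = t + window_ms          # window starts at first arrival
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Requests arriving close together share a batch; a gap starts a new one.
print(form_batches([0, 1, 2, 40, 41], window_ms=5))
# -> [[0, 1, 2], [40, 41]]
```

Tuning the window trades latency for throughput, which is the subject of best practice 2 below.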

Deployment Patterns

```yaml
# Canary rollout for an ML model with KServe: route 10% of traffic to v2
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-classifier
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://models/sentiment/v2
      resources:
        requests:
          cpu: 2
          memory: 2Gi
        limits:
          nvidia.com/gpu: 1
          memory: 4Gi
```

Configuration

| Parameter        | Description                  | Default         |
|------------------|------------------------------|-----------------|
| serving_platform | Model serving infrastructure | Triton          |
| export_format    | Model export format          | ONNX            |
| quantization     | Default quantization level   | FP16            |
| batch_size       | Dynamic batching max size    | 32              |
| max_latency_ms   | Latency SLA target           | 100             |
| gpu_type         | Target GPU for optimization  | T4              |
| scaling_metric   | Auto-scaling trigger metric  | GPU utilization |

Best Practices

  1. Benchmark end-to-end, not just model inference. A model that runs in 10ms can sit behind a serving stack that adds 90ms of overhead from serialization, preprocessing, network hops, and postprocessing. Measure total request latency from the client's perspective and optimize the slowest component first. Often it's the data preprocessing or result formatting, not the model itself.

  2. Use dynamic batching to maximize GPU utilization. Serving one request at a time wastes GPU parallelism. Dynamic batching collects requests over a short window and processes them together. Configure the batch window to balance latency (shorter window) against throughput (larger batches). A 5ms batching window with max batch size of 32 often doubles throughput with minimal latency impact.

  3. Quantize aggressively, validate carefully. FP16 quantization rarely affects accuracy and halves memory usage. INT8 quantization typically loses less than 1% accuracy while providing another 2x speedup. Always validate quantized model accuracy on your test set before deploying. Some model architectures and tasks are more sensitive to quantization than others—measure rather than assume.

  4. Implement model versioning with instant rollback. Store models in a versioned registry (MLflow, S3 with versioning) and deploy with canary or blue-green strategies. Keep the previous version warm so rollback takes seconds, not minutes. Define automated rollback triggers based on accuracy metrics, latency, or error rate that execute without human intervention.

  5. Cache predictions for repeated inputs. Many production workloads have skewed input distributions where a small percentage of unique inputs covers a large percentage of requests. A prediction cache (Redis or in-memory LRU) for the top 1,000 inputs can reduce GPU load significantly. Hash the preprocessed input tensor rather than the raw request to maximize cache hit rates.
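Practice 5 can be sketched as an in-memory LRU cache keyed by a hash of the preprocessed feature vector. `run_model` is a stand-in for real inference, and the class is an illustration, not a library API:

```python
import hashlib
from collections import OrderedDict

class PredictionCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = self.misses = 0

    @staticmethod
    def _key(features):
        # Hash the preprocessed feature vector, not the raw request, so
        # requests that normalize to the same input share a cache entry.
        blob = ",".join(f"{x:.6f}" for x in features).encode()
        return hashlib.sha256(blob).hexdigest()

    def get_or_compute(self, features, run_model):
        key = self._key(features)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)      # mark as most recently used
            return self._store[key]
        self.misses += 1
        result = run_model(features)
        self._store[key] = result
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least recently used
        return result
```

If you retrain regularly, include the model version in the cache key (or flush on deploy) so stale predictions never outlive the model that produced them.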

Common Issues

GPU memory exhaustion under load. This happens when batch sizes grow unconstrained or when model loading doesn't account for activation memory. Set explicit max batch sizes based on profiled memory usage at peak. Reserve 20-30% GPU memory headroom for activation tensors and framework overhead. Use nvidia-smi monitoring to track memory usage patterns under realistic load before setting production limits.
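The headroom rule above reduces to simple arithmetic once you have profiled numbers. A back-of-envelope sizing helper (the figures in the example are illustrative, not measured):

```python
def max_batch_size(gpu_mem_gb, model_mem_gb, act_per_sample_gb, headroom=0.25):
    """Largest batch that fits after reserving headroom for framework overhead."""
    budget = gpu_mem_gb * (1 - headroom) - model_mem_gb
    return max(0, int(budget / act_per_sample_gb))

# 16 GB T4, 2.5 GB of model weights, ~0.3 GB of activations per sample:
print(max_batch_size(16, 2.5, 0.3))   # -> 31
```

Treat the result as an upper bound for the serving config's max batch size, and confirm it under realistic load before shipping.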

Model accuracy degrades gradually over time in production. Data drift causes the production input distribution to shift away from the training distribution. Implement continuous monitoring that compares production feature distributions against training baselines. Set up automated alerts when drift exceeds thresholds (Population Stability Index > 0.2 or KL divergence > 0.1). Schedule regular retraining with recent production data.
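The PSI check described above fits in a short, dependency-free function. Equal-width bins derived from the training baseline are one common binning choice, not the only one:

```python
import math

def psi(baseline, production, n_bins=10):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1   # index of v's bin
        # floor at a small epsilon so empty bins don't produce log(0)
        return [max(c / len(values), 1e-4) for c in counts]

    p = bin_fractions(baseline)
    q = bin_fractions(production)
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))

# Identical distributions give PSI ~ 0; a large shift trips the 0.2 alert.
```

Run this per feature on a rolling window of production traffic and alert when any feature crosses the threshold.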

Cold start latency is unacceptable for serverless deployment. Model loading can take 5-30 seconds depending on model size and framework. Mitigate with provisioned concurrency (keep warm instances), model weight caching in shared memory, and smaller model formats (ONNX loads faster than full PyTorch checkpoints). For latency-critical services, use always-on instances rather than scale-to-zero serverless patterns.
