Pro Replicate Model Runner Toolkit
Professional-grade skill for discovering, comparing, and running AI models via API. Built for Claude Code with best practices and real-world patterns.
Complete Replicate API integration guide for running ML models in the cloud, covering model discovery, inference API usage, custom model deployment, and cost optimization for production applications.
When to Use This Skill
Choose Replicate Model Runner when:
- Running open-source ML models without managing GPU infrastructure
- Integrating image generation, LLMs, or audio models into applications
- Deploying custom models with Cog containers
- Comparing model outputs across different versions
- Building ML-powered features with pay-per-use pricing
Consider alternatives when:
- Need lowest latency — self-host models on your own GPUs
- Need HuggingFace ecosystem — use HuggingFace Inference Endpoints
- Need a managed ML platform — use AWS SageMaker or Vertex AI
Quick Start
```shell
# Install Replicate SDK
pip install replicate

# Set API token
export REPLICATE_API_TOKEN=r8_...

# Activate toolkit
claude skill activate pro-replicate-model-runner-toolkit

# Run a model
claude "Run SDXL on Replicate to generate product images"
```
Example: Replicate API Usage
```typescript
import Replicate from 'replicate';

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

// Run an image generation model
async function generateImage(prompt: string) {
  const output = await replicate.run(
    "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
    {
      input: {
        prompt,
        negative_prompt: "blurry, low quality",
        width: 1024,
        height: 1024,
        num_outputs: 1,
        guidance_scale: 7.5,
      },
    }
  );
  return output; // Array of image URLs
}

// Flatten chat messages into a single prompt string
function formatMessages(messages: { role: string; content: string }[]) {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n");
}

// Stream an LLM response token by token
async function chatWithModel(messages: { role: string; content: string }[]) {
  const stream = replicate.stream("meta/llama-2-70b-chat", {
    input: {
      prompt: formatMessages(messages),
      max_tokens: 1024,
      temperature: 0.7,
    },
  });
  for await (const event of stream) {
    process.stdout.write(event.toString());
  }
}

// Webhook-based async prediction
async function asyncPrediction(prompt: string, webhookUrl: string) {
  const prediction = await replicate.predictions.create({
    version: "stability-ai/sdxl:39ed52f2...",
    input: { prompt },
    webhook: webhookUrl,
    webhook_events_filter: ["completed"],
  });
  return prediction.id; // Poll or wait for webhook
}
```
Core Concepts
Replicate API Patterns
| Pattern | Description | Use Case |
|---|---|---|
| Synchronous | replicate.run() — wait for result | Simple scripts, CLI tools |
| Streaming | replicate.stream() — token-by-token | LLM chat, real-time output |
| Async + Webhook | Create prediction, receive webhook | Production apps, long jobs |
| Async + Poll | Create prediction, poll for status | Serverless, queue-based |
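The Async + Poll pattern amounts to a loop over the prediction status until it reaches a terminal state. A minimal sketch, with the status getter injected (in real use it would wrap `replicate.predictions.get(id)`; the types here are illustrative, not the SDK's):

```typescript
type PredictionStatus = "starting" | "processing" | "succeeded" | "failed" | "canceled";

interface PredictionSnapshot {
  status: PredictionStatus;
  output?: unknown;
  error?: string;
}

// Poll until the prediction reaches a terminal state. The status getter is
// injected so the loop can wrap replicate.predictions.get(id) in real use.
async function pollPrediction(
  getStatus: () => Promise<PredictionSnapshot>,
  intervalMs = 1000,
  maxAttempts = 300
): Promise<PredictionSnapshot> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const snapshot = await getStatus();
    if (["succeeded", "failed", "canceled"].includes(snapshot.status)) {
      return snapshot;
    }
    // Wait before the next status check
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Prediction polling timed out");
}
```

The interval and attempt cap together bound total wait time, which keeps a serverless function from polling forever.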
Model Categories on Replicate
| Category | Popular Models | Use Case |
|---|---|---|
| Image Generation | SDXL, Flux, Kandinsky | Content creation, design |
| LLMs | Llama, Mistral, Mixtral | Text generation, chat |
| Image-to-Image | ControlNet, IP-Adapter | Style transfer, editing |
| Audio | Whisper, MusicGen, Bark | Transcription, music |
| Video | Stable Video, AnimateDiff | Video generation |
| Upscaling | Real-ESRGAN, SwinIR | Image enhancement |
Configuration
| Parameter | Description | Default |
|---|---|---|
api_token | Replicate API authentication token | Required |
timeout | Prediction timeout (seconds) | 300 |
webhook_url | URL for async prediction callbacks | null |
max_retries | Retry count on failure | 3 |
poll_interval | Status polling interval (ms) | 1000 |
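These parameters can be gathered into a small typed config that applies the documented defaults. The `RunnerConfig` shape below is an illustrative assumption, not an official SDK type:

```typescript
interface RunnerConfig {
  apiToken: string;           // Required: Replicate API token
  timeoutSeconds: number;     // Prediction timeout
  webhookUrl: string | null;  // URL for async prediction callbacks
  maxRetries: number;         // Retry count on failure
  pollIntervalMs: number;     // Status polling interval
}

// Merge caller overrides with the documented defaults; only the token is required.
function resolveConfig(
  overrides: Partial<RunnerConfig> & { apiToken: string }
): RunnerConfig {
  return {
    timeoutSeconds: 300,
    webhookUrl: null,
    maxRetries: 3,
    pollIntervalMs: 1000,
    ...overrides,
  };
}
```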
Best Practices
- Use webhooks for production, polling for development — Webhooks eliminate polling overhead and provide instant notification. For development and debugging, polling is simpler and doesn't require a public endpoint.
- Pin model versions for production stability — Always specify the full version hash, not just the model name. Model owners can update the default version, which may change behavior. Pinned versions guarantee consistent results.
- Implement request queuing for cost control — Replicate charges per second of compute. Queue requests and batch them when possible. Set maximum concurrent prediction limits to prevent cost spikes from traffic bursts.
- Cache model outputs for repeated inputs — If the same prompt generates the same type of output, cache results keyed by a hash of the input parameters. This eliminates redundant compute for common requests.
- Use the cheapest model that meets quality requirements — Start with smaller, faster models and only upgrade when quality is insufficient. A 7B-parameter model running in 2 seconds at $0.001 is often better for many use cases than a 70B model in 30 seconds at $0.05.
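The caching practice above needs a stable key derived from the input parameters. One sketch, assuming Node's built-in `crypto` module and an in-memory `Map` (swap in Redis or similar for production):

```typescript
import { createHash } from "node:crypto";

// Build a deterministic cache key from model id plus input parameters.
// Keys are sorted so property order does not change the hash.
function cacheKey(model: string, input: Record<string, unknown>): string {
  const canonical = JSON.stringify(
    Object.keys(input).sort().map((k) => [k, input[k]])
  );
  return createHash("sha256").update(`${model}:${canonical}`).digest("hex");
}

// In-memory cache for illustration; use a shared store in production.
const outputCache = new Map<string, unknown>();

// Run the model only on a cache miss; the run callback would wrap replicate.run().
async function runCached(
  model: string,
  input: Record<string, unknown>,
  run: () => Promise<unknown>
): Promise<unknown> {
  const key = cacheKey(model, input);
  if (outputCache.has(key)) return outputCache.get(key);
  const output = await run();
  outputCache.set(key, output);
  return output;
}
```

Note that this assumes flat input objects; nested inputs would need recursive key sorting before hashing.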
Common Issues
Prediction times out for large models or long generations. Increase the client timeout and use async predictions with webhooks for jobs that take more than 60 seconds. Cold starts on Replicate can add 30-60 seconds for models that aren't frequently used.
Model output quality varies between runs despite identical inputs. Set a fixed seed for deterministic output. Without a seed, the model uses random noise initialization, producing different results each time. Use seed: 42 (or any fixed integer) for reproducibility.
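A small helper can enforce the fixed-seed practice by defaulting `seed` when the caller omits it (the helper name is illustrative, not part of the SDK):

```typescript
// Ensure a fixed seed is present so repeated runs are reproducible.
// A caller-supplied seed in the input takes precedence over the default.
function withSeed<T extends Record<string, unknown>>(
  input: T,
  seed = 42
): T & { seed: number } {
  return { seed, ...input } as T & { seed: number };
}
```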
API rate limits cause prediction failures during traffic spikes. Implement exponential backoff retry logic. Queue predictions and limit concurrency to stay within rate limits. Contact Replicate for higher rate limits if you have consistent high-volume needs.
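The exponential-backoff retry can be sketched as a generic wrapper; the delay values and retry count are illustrative defaults, not Replicate recommendations:

```typescript
// Retry an async operation with exponential backoff plus jitter.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 100ms jitter
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

In practice you would rethrow immediately on non-retryable errors (e.g. invalid input) and only back off on rate-limit or transient failures.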