# Pro Modal
Deploy Python functions to the cloud with Modal's serverless platform for GPU computing, batch processing, and scalable API endpoints. This skill covers function definitions, container configuration, GPU workloads, scheduled jobs, and web endpoint deployment without managing infrastructure.
## When to Use This Skill
Choose Pro Modal when you need to:
- Run GPU-accelerated ML inference or training without managing servers
- Deploy Python functions as auto-scaling API endpoints
- Execute batch processing jobs that scale to hundreds of containers
- Schedule recurring data processing or model training tasks
Consider alternatives when:
- You need persistent long-running services (use traditional cloud VMs or Kubernetes)
- You need sub-50ms latency for every request (use edge computing or pre-warmed containers)
- You need to run non-Python workloads (use AWS Lambda or Cloud Functions)
## Quick Start
```bash
# Install Modal
pip install modal

# Authenticate
modal token new
```
```python
import modal

app = modal.App("my-first-app")

# Define a simple function that runs in the cloud
@app.function()
def square(x):
    return x ** 2

# Run it
@app.local_entrypoint()
def main():
    result = square.remote(42)
    print(f"42² = {result}")
```
```bash
# Run ephemerally in the cloud
modal run my_app.py

# Or deploy persistently
modal deploy my_app.py
```
## Core Concepts
### Container and Resource Configuration
| Decorator/Option | Purpose | Example Use |
|---|---|---|
| `@app.function()` | Basic cloud function | CPU tasks, data processing |
| `gpu="T4"` | Attach GPU | ML inference |
| `gpu="A100"` | High-end GPU | Model training |
| `image=` | Custom container image | Dependencies, system packages |
| `schedule=` | Cron scheduling | Periodic batch jobs |
| `@app.cls()` | Stateful class with lifecycle | Model loading, connection pools |
| `@modal.web_endpoint()` | HTTP endpoint | REST APIs |
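Several of these options combine on a single function. A minimal configuration sketch (the app name, package choices, and cron expression are illustrative assumptions, not from the source; it requires the `modal` package and an authenticated workspace to run):

```python
import modal

app = modal.App("nightly-report")

# Hypothetical container image; the pinned packages are illustrative
report_image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "pandas==2.1.4", "requests==2.31.0"
)

# One decorator combines image, GPU, and schedule from the table above
@app.function(
    image=report_image,
    gpu="T4",                          # attach an entry-level GPU
    schedule=modal.Cron("0 2 * * *"),  # nightly at 02:00 UTC
)
def nightly_report():
    import pandas as pd
    # ... fetch data, run inference, write the report ...
    print("report generated")
```

Because `schedule=` is set, deploying the app once with `modal deploy` is enough; Modal invokes the function on the cron cadence without further action.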
### GPU-Accelerated ML Inference
```python
import modal

app = modal.App("llm-inference")

# Define container image with ML dependencies
inference_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch", "transformers", "accelerate")
)

@app.cls(
    image=inference_image,
    gpu="A10G",
    container_idle_timeout=300,
    allow_concurrent_inputs=10,
)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        """Load the model once when the container starts."""
        from transformers import pipeline
        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-2-7b-chat-hf",
            device_map="auto",
            torch_dtype="auto",
        )

    @modal.method()
    def generate(self, prompt, max_tokens=256):
        result = self.pipe(prompt, max_new_tokens=max_tokens)
        return result[0]["generated_text"]

    @modal.web_endpoint(method="POST")
    def api(self, request: dict):
        text = self.generate(request["prompt"], request.get("max_tokens", 256))
        return {"generated_text": text}

@app.local_entrypoint()
def main():
    gen = TextGenerator()
    result = gen.generate.remote("Explain quantum computing in simple terms:")
    print(result)
```
### Batch Processing with Map
```python
import modal

app = modal.App("batch-processor")

image = modal.Image.debian_slim().pip_install("pillow", "requests")

@app.function(image=image, concurrency_limit=50)
def process_image(url):
    """Process a single image — runs in parallel across containers."""
    import requests
    from PIL import Image
    from io import BytesIO

    response = requests.get(url, timeout=30)
    img = Image.open(BytesIO(response.content))
    original_size = img.size
    original_format = img.format  # read before convert(), which drops it

    # Resize and convert
    img = img.resize((512, 512))
    img = img.convert("RGB")

    return {
        "url": url,
        "original_size": original_size,
        "format": original_format,
        "status": "processed",
    }

@app.local_entrypoint()
def main():
    urls = [f"https://picsum.photos/id/{i}/800/600" for i in range(100)]
    # Process all images in parallel — Modal scales automatically
    results = list(process_image.map(urls))
    print(f"Processed {len(results)} images")
```
### Scheduled Jobs
```python
import modal

app = modal.App("daily-pipeline")

@app.function(schedule=modal.Cron("0 6 * * *"))  # Daily at 6 AM UTC
def daily_data_sync():
    """Runs automatically every day."""
    import requests

    # Fetch fresh data
    response = requests.get("https://api.example.com/daily-export")
    data = response.json()

    # Process and store (transform_data and upload_to_storage are
    # placeholders for your own pipeline logic)
    processed = transform_data(data)
    upload_to_storage(processed)

    print(f"Synced {len(data)} records")
```
## Configuration
| Parameter | Description | Default |
|---|---|---|
| `gpu` | GPU type (`T4`, `A10G`, `A100`, `H100`) | None (CPU only) |
| `memory` | RAM allocation in MB | 128 |
| `timeout` | Maximum execution time (seconds) | 300 |
| `concurrency_limit` | Max parallel container instances | 100 |
| `container_idle_timeout` | Keep-alive duration (seconds) | 60 |
| `retries` | Automatic retry count on failure | 0 |
| `allow_concurrent_inputs` | Requests per container | 1 |
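As a configuration sketch, several of these parameters applied to one batch function (the app name, values, and function body are illustrative, and running it requires the `modal` package):

```python
import modal

app = modal.App("tuned-batch")

# Illustrative values; tune them for your own workload
@app.function(
    memory=2048,            # 2 GB RAM instead of the 128 MB default
    timeout=900,            # allow up to 15 minutes per call
    retries=2,              # retry transient failures twice
    concurrency_limit=20,   # cap fan-out at 20 containers
)
def crunch(batch):
    return sum(batch)
```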
## Best Practices

- **Use `@modal.enter()` for expensive initialization** — Load ML models, establish database connections, and initialize heavy objects in the `@modal.enter()` lifecycle method. This runs once when the container starts, not on every function call, dramatically reducing per-request latency.
- **Right-size your GPU selection** — Start with a T4 for inference tasks and only upgrade to A10G or A100 if you measure insufficient performance. GPU costs scale significantly — an A100 costs 10x more per hour than a T4. Profile your workload before committing to expensive hardware.
- **Use `.map()` for batch workloads** — When processing lists of items, use `function.map(items)` instead of a loop of `.remote()` calls. Map distributes work across containers automatically and handles failures and retries at the framework level.
- **Set `container_idle_timeout` appropriately** — For APIs with steady traffic, set 300-600 seconds to keep containers warm and avoid cold starts. For batch jobs that run once, set 0 to release resources immediately after completion.
- **Pin dependency versions in your image** — Use `pip_install("torch==2.1.0", "transformers==4.36.0")` with exact versions rather than `pip_install("torch", "transformers")`. Unpinned versions cause non-reproducible builds when packages update between deployments.
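The `.map()` practice above fans one function out over a list and lets the framework own the loop. A local analogy using only the standard library (this is not Modal's API, just an illustration of the same fan-out shape with `ThreadPoolExecutor`):

```python
from concurrent.futures import ThreadPoolExecutor

def process(x):
    # Stand-in for a Modal function body
    return x ** 2

items = list(range(10))

# Local analogy to `results = list(process.map(items))` in Modal:
# the executor, not a hand-written loop, distributes the calls
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, items))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

As with Modal's `.map()`, results come back in input order regardless of which worker finished first.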
## Common Issues
**Cold start latency is too high** — Large container images (especially with PyTorch + model weights) take 30-60 seconds to start. Use `modal.Image.from_registry()` with a pre-built Docker image, enable `keep_warm=1` to maintain a minimum warm container, or use Modal's model caching with `modal.Volume` to avoid re-downloading weights.

**Out of memory errors on GPU** — The model fits locally but crashes on Modal's GPU. This happens because Modal containers have less system RAM than your local machine by default. Increase the `memory` parameter, use `torch_dtype=torch.float16` when loading the model, or upgrade to a GPU with more VRAM (A10G: 24 GB, A100: 40/80 GB).

**Function calls timing out at 300 seconds** — The default timeout is conservative. For long-running tasks like model training or large batch processing, increase `timeout` in the decorator: `@app.function(timeout=3600)` allows up to 1 hour. For very long tasks, consider breaking work into smaller chunks and using `.map()`.
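The chunk-and-map advice above can be sketched with a plain helper (the name `chunked` and the chunk size are hypothetical; the helper itself is framework-independent):

```python
def chunked(items, size):
    """Split a list into chunks of at most `size` items, so each
    chunk stays well under the per-call timeout when mapped."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_chunk(chunk):
    # Stand-in for a Modal function body that handles one chunk
    return sum(chunk)

batches = chunked(list(range(10)), 4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# With Modal this would be: results = list(process_chunk.map(batches))
results = [process_chunk(b) for b in batches]
print(sum(results))  # 45
```

Each mapped call now finishes quickly, and a failure or timeout loses only one chunk rather than the whole job.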