Pro Modal

Comprehensive skill for Python, cloud, and serverless development. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT

Deploy Python functions to the cloud with Modal's serverless platform for GPU computing, batch processing, and scalable API endpoints. This skill covers function definitions, container configuration, GPU workloads, scheduled jobs, and web endpoint deployment without managing infrastructure.

When to Use This Skill

Choose Pro Modal when you need to:

  • Run GPU-accelerated ML inference or training without managing servers
  • Deploy Python functions as auto-scaling API endpoints
  • Execute batch processing jobs that scale to hundreds of containers
  • Schedule recurring data processing or model training tasks

Consider alternatives when:

  • You need persistent long-running services (use traditional cloud VMs or Kubernetes)
  • You need sub-50ms latency for every request (use edge computing or pre-warmed containers)
  • You need to run non-Python workloads (use AWS Lambda or Cloud Functions)

Quick Start

```bash
# Install Modal
pip install modal

# Authenticate
modal token new
```

```python
import modal

app = modal.App("my-first-app")

# Define a simple function that runs in the cloud
@app.function()
def square(x):
    return x ** 2

# Run it
@app.local_entrypoint()
def main():
    result = square.remote(42)
    print(f"42² = {result}")
```

```bash
# Deploy and run
modal run my_app.py
```

Core Concepts

Container and Resource Configuration

| Decorator/Option | Purpose | Example use |
|---|---|---|
| `@app.function()` | Basic cloud function | CPU tasks, data processing |
| `gpu="T4"` | Attach a GPU | ML inference |
| `gpu="A100"` | High-end GPU | Model training |
| `image=` | Custom container image | Dependencies, system packages |
| `schedule=` | Cron scheduling | Periodic batch jobs |
| `@app.cls()` | Stateful class with lifecycle | Model loading, connection pools |
| `@modal.web_endpoint()` | HTTP endpoint | REST APIs |

GPU-Accelerated ML Inference

```python
import modal

app = modal.App("llm-inference")

# Define container image with ML dependencies
inference_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch", "transformers", "accelerate")
)

@app.cls(
    image=inference_image,
    gpu="A10G",
    container_idle_timeout=300,
    allow_concurrent_inputs=10,
)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        """Load model once when container starts."""
        from transformers import pipeline

        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-2-7b-chat-hf",
            device_map="auto",
            torch_dtype="auto",
        )

    @modal.method()
    def generate(self, prompt, max_tokens=256):
        result = self.pipe(prompt, max_new_tokens=max_tokens)
        return result[0]["generated_text"]

    @modal.web_endpoint(method="POST")
    def api(self, request: dict):
        text = self.generate(request["prompt"], request.get("max_tokens", 256))
        return {"generated_text": text}

@app.local_entrypoint()
def main():
    gen = TextGenerator()
    result = gen.generate.remote("Explain quantum computing in simple terms:")
    print(result)
```

Batch Processing with Map

```python
import modal

app = modal.App("batch-processor")
image = modal.Image.debian_slim().pip_install("pillow", "requests")

@app.function(image=image, concurrency_limit=50)
def process_image(url):
    """Process a single image — runs in parallel across containers."""
    from io import BytesIO

    import requests
    from PIL import Image

    response = requests.get(url, timeout=30)
    img = Image.open(BytesIO(response.content))

    # Capture metadata before mutating: resize() returns a new image and
    # convert() drops the original format
    original_size = img.size
    original_format = img.format

    # Resize and convert
    img = img.resize((512, 512))
    img = img.convert("RGB")

    return {
        "url": url,
        "original_size": original_size,
        "format": original_format,
        "status": "processed",
    }

@app.local_entrypoint()
def main():
    urls = [f"https://picsum.photos/id/{i}/800/600" for i in range(100)]
    # Process all images in parallel — Modal scales automatically
    results = list(process_image.map(urls))
    print(f"Processed {len(results)} images")
```

Scheduled Jobs

```python
import modal

app = modal.App("daily-pipeline")

@app.function(schedule=modal.Cron("0 6 * * *"))  # Daily at 6 AM UTC
def daily_data_sync():
    """Runs automatically every day."""
    import requests

    # Fetch fresh data
    response = requests.get("https://api.example.com/daily-export")
    data = response.json()

    # Process and store (transform_data and upload_to_storage are
    # placeholders for your own pipeline logic)
    processed = transform_data(data)
    upload_to_storage(processed)
    print(f"Synced {len(data)} records")
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| `gpu` | GPU type (`T4`, `A10G`, `A100`, `H100`) | None (CPU only) |
| `memory` | RAM allocation in MB | 128 |
| `timeout` | Maximum execution time (seconds) | 300 |
| `concurrency_limit` | Max parallel container instances | 100 |
| `container_idle_timeout` | Keep-alive duration (seconds) | 60 |
| `retries` | Automatic retry count on failure | 0 |
| `allow_concurrent_inputs` | Requests per container | 1 |
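These parameters compose on a single decorator. A minimal sketch, assuming Modal is installed and authenticated; the app name, function, and values are illustrative, not recommendations:

```python
import modal

app = modal.App("configured-app")

# Hypothetical worker combining several of the parameters above
@app.function(
    gpu="T4",              # attach a T4 GPU (default is CPU only)
    memory=2048,           # 2 GB RAM instead of the 128 MB default
    timeout=600,           # allow up to 10 minutes per call
    retries=2,             # retry failed calls twice
    concurrency_limit=20,  # cap fan-out at 20 containers
)
def transcode(payload: bytes) -> bytes:
    """Placeholder body; replace with real work."""
    return payload
```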

Best Practices

  1. Use @modal.enter() for expensive initialization — Load ML models, establish database connections, and initialize heavy objects in the @modal.enter() lifecycle method. This runs once when the container starts, not on every function call, dramatically reducing per-request latency.

  2. Right-size your GPU selection — Start with a T4 for inference tasks and only upgrade to A10G or A100 if you measure insufficient performance. GPU costs scale significantly — an A100 costs 10x more per hour than a T4. Profile your workload before committing to expensive hardware.

  3. Use .map() for batch workloads — When processing lists of items, use function.map(items) instead of a loop of .remote() calls. Map distributes work across containers automatically and handles failures and retries at the framework level.

  4. Set container_idle_timeout appropriately — For APIs with steady traffic, set 300-600 seconds to keep containers warm and avoid cold starts. For batch jobs that run once, set 0 to release resources immediately after completion.

  5. Pin dependency versions in your image — Use pip_install("torch==2.1.0", "transformers==4.36.0") with exact versions rather than pip_install("torch", "transformers"). Unpinned versions cause non-reproducible builds when packages update between deployments.
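Practice 5 as a complete image definition. The versions shown are illustrative; pin whatever versions you actually test against:

```python
import modal

# Pinned versions: every deploy rebuilds the exact same environment
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.1.0",
        "transformers==4.36.0",
    )
)
```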

Common Issues

Cold start latency is too high — Large container images (especially with PyTorch + model weights) take 30-60 seconds to start. Use modal.Image.from_registry() with a pre-built Docker image, enable keep_warm=1 to maintain a minimum warm container, or use Modal's model caching with modal.Volume to avoid re-downloading weights.
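A sketch of the warm-container and weight-caching mitigations together. The volume name and mount path are assumptions; adapt them to your app:

```python
import modal

app = modal.App("warm-inference")

# Persistent volume so model weights download once, not on every cold start
weights = modal.Volume.from_name("model-weights", create_if_missing=True)

@app.function(
    keep_warm=1,                  # always keep one container ready
    volumes={"/cache": weights},  # files under /cache survive restarts
)
def infer(prompt: str):
    # Load weights from /cache here instead of fetching them on each start
    ...
```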

Out of memory errors on GPU — The model fits locally but crashes on Modal's GPU. This usually happens because the Modal container has less system RAM or VRAM than your local machine. Increase the memory parameter, load the model with torch_dtype=torch.float16, or upgrade to a GPU with more VRAM (A10G: 24 GB, A100: 40/80 GB).
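A quick back-of-envelope check before choosing a GPU. This estimates weight memory only (activations and the KV cache need extra headroom), and shows why float16 loading roughly halves the footprint:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Approximate VRAM needed just to hold model weights."""
    # 1e9 params per billion, divided by 1e9 bytes per GB, cancels out
    return params_billion * bytes_per_param

fp32 = weight_memory_gb(7.0, 4)  # 7B model in float32
fp16 = weight_memory_gb(7.0, 2)  # same model in float16
print(fp32, fp16)  # 28.0 14.0 — float16 fits on a 24 GB A10G, float32 does not
```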

Function calls timing out at 300 seconds — The default timeout is conservative. For long-running tasks like model training or large batch processing, increase timeout in the decorator: @app.function(timeout=3600) for up to 1 hour. For very long tasks, consider breaking work into smaller chunks and using map().
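The chunking idea as code: split one long job into batches that each finish well under the timeout, then fan the batches out. The chunk size is an assumption you tune to your workload:

```python
def chunk(items, size):
    """Split a long job into batches that each fit inside the timeout."""
    return [items[i:i + size] for i in range(0, len(items), size)]

batches = chunk(list(range(10)), 4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# Each batch can then be dispatched with your_modal_function.map(batches)
```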
