# Pro Modal
Deploy Python functions to the cloud with Modal's serverless platform for GPU computing, batch processing, and scalable API endpoints. This skill covers function definitions, container configuration, GPU workloads, scheduled jobs, and web endpoint deployment without managing infrastructure.
## When to Use This Skill
Choose Pro Modal when you need to:
- Run GPU-accelerated ML inference or training without managing servers
- Deploy Python functions as auto-scaling API endpoints
- Execute batch processing jobs that scale to hundreds of containers
- Schedule recurring data processing or model training tasks
Consider alternatives when:
- You need persistent long-running services (use traditional cloud VMs or Kubernetes)
- You need sub-50ms latency for every request (use edge computing or pre-warmed containers)
- You need to run non-Python workloads (use AWS Lambda or Cloud Functions)
## Quick Start
```bash
# Install Modal
pip install modal

# Authenticate
modal token new
```
```python
import modal

app = modal.App("my-first-app")

# Define a simple function that runs in the cloud
@app.function()
def square(x):
    return x ** 2

# Run it
@app.local_entrypoint()
def main():
    result = square.remote(42)
    print(f"42² = {result}")
```
```bash
# Run ephemerally in the cloud
modal run my_app.py

# Or deploy persistently
modal deploy my_app.py
```
## Core Concepts
### Container and Resource Configuration
| Decorator/Option | Purpose | Example Use |
|---|---|---|
| `@app.function()` | Basic cloud function | CPU tasks, data processing |
| `gpu="T4"` | Attach GPU | ML inference |
| `gpu="A100"` | High-end GPU | Model training |
| `image=` | Custom container image | Dependencies, system packages |
| `schedule=` | Cron scheduling | Periodic batch jobs |
| `@app.cls()` | Stateful class with lifecycle | Model loading, connection pools |
| `@modal.web_endpoint()` | HTTP endpoint | REST APIs |
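Several of these options combine on a single function. A minimal configuration sketch (the app name, package choices, and cron expression are illustrative assumptions, not from the source; it requires the `modal` package and an authenticated workspace to run):

```python
import modal

app = modal.App("nightly-report")

# Hypothetical container image; the pinned packages are illustrative
report_image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "pandas==2.1.4", "requests==2.31.0"
)

# One decorator combines image, GPU, and schedule from the table above
@app.function(
    image=report_image,
    gpu="T4",                          # attach an entry-level GPU
    schedule=modal.Cron("0 2 * * *"),  # nightly at 02:00 UTC
)
def nightly_report():
    import pandas as pd
    # ... fetch data, run inference, write the report ...
    print("report generated")
```

Because `schedule=` is set, deploying the app once with `modal deploy` is enough; Modal invokes the function on the cron cadence without further action.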
### GPU-Accelerated ML Inference
```python
import modal

app = modal.App("llm-inference")

# Define container image with ML dependencies
inference_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch", "transformers", "accelerate")
)

@app.cls(
    image=inference_image,
    gpu="A10G",
    container_idle_timeout=300,
    allow_concurrent_inputs=10,
)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        """Load the model once when the container starts."""
        from transformers import pipeline
        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-2-7b-chat-hf",
            device_map="auto",
            torch_dtype="auto",
        )

    @modal.method()
    def generate(self, prompt, max_tokens=256):
        result = self.pipe(prompt, max_new_tokens=max_tokens)
        return result[0]["generated_text"]

    @modal.web_endpoint(method="POST")
    def api(self, request: dict):
        text = self.generate(request["prompt"], request.get("max_tokens", 256))
        return {"generated_text": text}

@app.local_entrypoint()
def main():
    gen = TextGenerator()
    result = gen.generate.remote("Explain quantum computing in simple terms:")
    print(result)
```
### Batch Processing with Map
```python
import modal

app = modal.App("batch-processor")

image = modal.Image.debian_slim().pip_install("pillow", "requests")

@app.function(image=image, concurrency_limit=50)
def process_image(url):
    """Process a single image — runs in parallel across containers."""
    import requests
    from PIL import Image
    from io import BytesIO

    response = requests.get(url, timeout=30)
    img = Image.open(BytesIO(response.content))
    original_size = img.size
    original_format = img.format  # read before convert(), which drops it

    # Resize and convert
    img = img.resize((512, 512))
    img = img.convert("RGB")

    return {
        "url": url,
        "original_size": original_size,
        "format": original_format,
        "status": "processed",
    }

@app.local_entrypoint()
def main():
    urls = [f"https://picsum.photos/id/{i}/800/600" for i in range(100)]
    # Process all images in parallel — Modal scales automatically
    results = list(process_image.map(urls))
    print(f"Processed {len(results)} images")
```
### Scheduled Jobs
```python
import modal

app = modal.App("daily-pipeline")

@app.function(schedule=modal.Cron("0 6 * * *"))  # Daily at 6 AM UTC
def daily_data_sync():
    """Runs automatically every day."""
    import requests

    # Fetch fresh data
    response = requests.get("https://api.example.com/daily-export")
    data = response.json()

    # Process and store (transform_data and upload_to_storage are
    # placeholders for your own pipeline logic)
    processed = transform_data(data)
    upload_to_storage(processed)

    print(f"Synced {len(data)} records")
```
## Configuration
| Parameter | Description | Default |
|---|---|---|
| `gpu` | GPU type (`T4`, `A10G`, `A100`, `H100`) | None (CPU only) |
| `memory` | RAM allocation in MB | 128 |
| `timeout` | Maximum execution time (seconds) | 300 |
| `concurrency_limit` | Max parallel container instances | 100 |
| `container_idle_timeout` | Keep-alive duration (seconds) | 60 |
| `retries` | Automatic retry count on failure | 0 |
| `allow_concurrent_inputs` | Requests per container | 1 |
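As a configuration sketch, several of these parameters applied to one batch function (the app name, values, and function body are illustrative, and running it requires the `modal` package):

```python
import modal

app = modal.App("tuned-batch")

# Illustrative values; tune them for your own workload
@app.function(
    memory=2048,            # 2 GB RAM instead of the 128 MB default
    timeout=900,            # allow up to 15 minutes per call
    retries=2,              # retry transient failures twice
    concurrency_limit=20,   # cap fan-out at 20 containers
)
def crunch(batch):
    return sum(batch)
```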
## Best Practices

- **Use `@modal.enter()` for expensive initialization** — Load ML models, establish database connections, and initialize heavy objects in the `@modal.enter()` lifecycle method. This runs once when the container starts, not on every function call, dramatically reducing per-request latency.
- **Right-size your GPU selection** — Start with a T4 for inference tasks and only upgrade to A10G or A100 if you measure insufficient performance. GPU costs scale significantly — an A100 costs 10x more per hour than a T4. Profile your workload before committing to expensive hardware.
- **Use `.map()` for batch workloads** — When processing lists of items, use `function.map(items)` instead of a loop of `.remote()` calls. Map distributes work across containers automatically and handles failures and retries at the framework level.
- **Set `container_idle_timeout` appropriately** — For APIs with steady traffic, set 300-600 seconds to keep containers warm and avoid cold starts. For batch jobs that run once, set 0 to release resources immediately after completion.
- **Pin dependency versions in your image** — Use `pip_install("torch==2.1.0", "transformers==4.36.0")` with exact versions rather than `pip_install("torch", "transformers")`. Unpinned versions cause non-reproducible builds when packages update between deployments.
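The `.map()` practice above fans one function out over a list and lets the framework own the loop. A local analogy using only the standard library (this is not Modal's API, just an illustration of the same fan-out shape with `ThreadPoolExecutor`):

```python
from concurrent.futures import ThreadPoolExecutor

def process(x):
    # Stand-in for a Modal function body
    return x ** 2

items = list(range(10))

# Local analogy to `results = list(process.map(items))` in Modal:
# the executor, not a hand-written loop, distributes the calls
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, items))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

As with Modal's `.map()`, results come back in input order regardless of which worker finished first.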
## Common Issues
**Cold start latency is too high** — Large container images (especially with PyTorch + model weights) take 30-60 seconds to start. Use `modal.Image.from_registry()` with a pre-built Docker image, enable `keep_warm=1` to maintain a minimum warm container, or use Modal's model caching with `modal.Volume` to avoid re-downloading weights.

**Out of memory errors on GPU** — The model fits locally but crashes on Modal's GPU. This happens because Modal containers have less system RAM than your local machine by default. Increase the `memory` parameter, use `torch_dtype=torch.float16` when loading the model, or upgrade to a GPU with more VRAM (A10G: 24 GB, A100: 40/80 GB).

**Function calls timing out at 300 seconds** — The default timeout is conservative. For long-running tasks like model training or large batch processing, increase `timeout` in the decorator: `@app.function(timeout=3600)` allows up to 1 hour. For very long tasks, consider breaking work into smaller chunks and using `.map()`.
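The chunk-and-map advice above can be sketched with a plain helper (the name `chunked` and the chunk size are hypothetical; the helper itself is framework-independent):

```python
def chunked(items, size):
    """Split a list into chunks of at most `size` items, so each
    chunk stays well under the per-call timeout when mapped."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_chunk(chunk):
    # Stand-in for a Modal function body that handles one chunk
    return sum(chunk)

batches = chunked(list(range(10)), 4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# With Modal this would be: results = list(process_chunk.map(batches))
results = [process_chunk(b) for b in batches]
print(sum(results))  # 45
```

Each mapped call now finishes quickly, and a failure or timeout loses only one chunk rather than the whole job.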