Advisor PagerDuty Champion
A PagerDuty integration and incident management agent that helps you configure alerting, design escalation policies, manage on-call rotations, and optimize incident response workflows.
When to Use This Agent
Choose Advisor PagerDuty Champion when:
- Setting up PagerDuty services, escalation policies, and on-call schedules
- Integrating PagerDuty with monitoring tools (Datadog, CloudWatch, Prometheus)
- Designing incident response workflows and runbooks
- Optimizing alert routing to reduce noise and improve response times
- Analyzing incident metrics (MTTA, MTTR, escalation rates)
Consider alternatives when:
- Setting up monitoring and observability infrastructure (use a DevOps agent)
- Debugging a specific production incident (use an incident response agent)
- Building alerting rules for application metrics (use a monitoring agent)
Quick Start
```yaml
# .claude/agents/advisor-pagerduty-champion.yml
name: Advisor PagerDuty Champion
description: Configure and optimize PagerDuty incident management
model: claude-sonnet
tools:
  - Read
  - Write
  - Bash
  - WebSearch
```
Example invocation:
```
claude "Design a PagerDuty setup for our three microservices with appropriate escalation policies, on-call rotations, and Datadog integration"
```
Core Concepts
PagerDuty Configuration Architecture
| Component | Purpose | Example |
|---|---|---|
| Service | Represents a monitored system | payment-service-prod |
| Integration | Connects monitoring tool to service | Datadog → payment-service |
| Escalation Policy | Defines who to notify and when | L1 engineer → L2 lead → manager |
| Schedule | On-call rotation definition | Weekly rotation, 4 engineers |
| Event Rule | Routes and transforms incoming alerts | Suppress non-critical overnight |
| Response Play | Automated incident response actions | Page team, create Slack channel |
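To make the component model above concrete, the sketch below builds the request bodies used to create a service and attach a monitoring integration via the PagerDuty REST API v2 (`POST /services`, then `POST /services/{id}/integrations`). The escalation policy ID and vendor ID are placeholders, and the exact field names should be verified against the current API reference before use.

```python
# Sketch: construct PagerDuty REST API v2 request bodies for a new
# service plus a monitoring integration. Field names follow the public
# API shape but are assumptions to verify, not a tested client.

def build_service_payload(name, escalation_policy_id, auto_resolve_minutes=240):
    """Request body for POST https://api.pagerduty.com/services."""
    return {
        "service": {
            "type": "service",
            "name": name,
            "escalation_policy": {
                "id": escalation_policy_id,
                "type": "escalation_policy_reference",
            },
            # Auto-resolve incidents left open past the timeout (seconds).
            "auto_resolve_timeout": auto_resolve_minutes * 60,
        }
    }

def build_integration_payload(vendor_id):
    """Request body for POST /services/{service_id}/integrations."""
    return {
        "integration": {
            "type": "events_api_v2_inbound_integration",
            "vendor": {"id": vendor_id, "type": "vendor_reference"},
        }
    }

payload = build_service_payload("payment-service-prod", "PABC123")
```

Sending these payloads with an authenticated HTTP client (and the `Authorization: Token token=...` header) is left out deliberately; the point is the one-service-per-deployable-unit shape of the request.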
Escalation Policy Design
```yaml
# Recommended escalation structure
escalation_policy:
  name: "Payment Service - Production"
  repeat_enabled: true
  num_loops: 3
  escalation_rules:
    - escalation_delay_in_minutes: 5
      targets:
        - type: schedule_reference
          id: "payment-oncall-primary"
    - escalation_delay_in_minutes: 10
      targets:
        - type: schedule_reference
          id: "payment-oncall-secondary"
        - type: user_reference
          id: "tech-lead-user-id"
    - escalation_delay_in_minutes: 15
      targets:
        - type: user_reference
          id: "engineering-manager-id"
```
Alert Routing Rules
```yaml
# Event orchestration rules
event_rules:
  - name: "Suppress non-critical during maintenance"
    conditions:
      - field: "severity"
        operator: "equals"
        value: "warning"
      - field: "custom_details.environment"
        operator: "equals"
        value: "staging"
    actions:
      suppress: true
  - name: "Critical payment alerts — immediate page"
    conditions:
      - field: "summary"
        operator: "contains"
        value: "payment"
      - field: "severity"
        operator: "equals"
        value: "critical"
    actions:
      severity: "critical"
      priority: "P1"
  - name: "Batch low-severity alerts into digest"
    conditions:
      - field: "severity"
        operator: "equals"
        value: "info"
    actions:
      severity: "info"
      suppress:
        threshold_value: 10
        threshold_time_unit: "minutes"
```
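The matching logic behind rules like these can be approximated in a few lines: every condition in a rule is ANDed against the incoming event, with dotted field paths reaching into `custom_details`. This is a simplified local model for reasoning about rule behavior, not PagerDuty's actual orchestration engine.

```python
# Minimal sketch of event-rule matching: a rule fires only if
# every condition holds against the incoming event (logical AND).

def get_field(event, dotted_path):
    """Resolve a dotted path like 'custom_details.environment'."""
    value = event
    for part in dotted_path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value

def matches(rule, event):
    for cond in rule["conditions"]:
        actual = get_field(event, cond["field"])
        if cond["operator"] == "equals" and actual != cond["value"]:
            return False
        if cond["operator"] == "contains" and cond["value"] not in (actual or ""):
            return False
    return True

suppress_rule = {
    "conditions": [
        {"field": "severity", "operator": "equals", "value": "warning"},
        {"field": "custom_details.environment", "operator": "equals", "value": "staging"},
    ]
}
staging_event = {"severity": "warning", "custom_details": {"environment": "staging"}}
```

A useful property of this model: testing candidate rules against a sample of last month's events before deploying them shows exactly which alerts would have been suppressed or re-prioritized.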
Configuration
| Parameter | Description | Default |
|---|---|---|
| pd_api_version | PagerDuty API version | v2 |
| integration_type | Primary monitoring integration | datadog |
| escalation_loops | Number of escalation repeat cycles | 3 |
| ack_timeout | Minutes before re-alerting an unacknowledged incident | 5 |
| resolve_timeout | Minutes before auto-resolving an incident | 240 |
| notification_channels | Alert delivery methods (push, sms, phone, email) | ["push", "phone"] |
Best Practices
- Create separate services for each independently deployable unit. A single PagerDuty service for "backend" that receives alerts from 10 microservices makes it impossible to route alerts to the right team or measure reliability per service. Create one PagerDuty service per microservice per environment. This enables targeted escalation policies, per-service SLO tracking, and accurate incident categorization.
- Design escalation policies with increasing blast radius. Start with the primary on-call engineer (5 min), escalate to the secondary on-call (10 min), then the tech lead (15 min), then the engineering manager (20 min). Each level broadens the response team. Never start with a group page — it creates diffusion of responsibility where everyone assumes someone else will respond.
- Use event orchestration to suppress noise before it reaches on-call. Configure rules that suppress known low-impact alerts during off-hours, batch duplicate alerts within time windows, and auto-resolve alerts that clear within thresholds. On-call engineers should receive fewer than 5 actionable pages per week. More than that indicates either poor alert tuning or genuine reliability issues — either way, it requires intervention.
- Define severity levels with specific, objective criteria. Map alert severity to PagerDuty urgency with clear definitions: Critical (revenue impact, data loss) → High urgency page; Warning (degraded performance, single-node failure) → Low urgency notification; Info (capacity trending, minor anomalies) → No page, dashboard only. Subjective severity levels lead to inconsistent alerting and on-call burnout.
- Review incident metrics monthly and adjust configurations. Track MTTA (mean time to acknowledge), MTTR (mean time to resolve), escalation rate, and noise ratio (suppressed/total alerts). If MTTA exceeds 5 minutes, the notification method may be ineffective. If escalation rate exceeds 20%, the primary on-call may be overloaded. Use data to drive configuration improvements, not gut feelings.
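The monthly review in the last practice reduces to a handful of arithmetic checks. The sketch below computes MTTA, MTTR, escalation rate, and noise ratio from a list of incident records and flags the thresholds mentioned above (5-minute MTTA, 20% escalation rate); the record field names are illustrative, not the PagerDuty analytics schema.

```python
# Compute on-call health metrics from incident records. Each record
# holds minutes-to-acknowledge, minutes-to-resolve, and whether the
# incident escalated past the primary on-call.

def review_metrics(incidents, suppressed_alerts, total_alerts):
    n = len(incidents)
    mtta = sum(i["ack_minutes"] for i in incidents) / n
    mttr = sum(i["resolve_minutes"] for i in incidents) / n
    escalation_rate = sum(1 for i in incidents if i["escalated"]) / n
    noise_ratio = suppressed_alerts / total_alerts
    return {
        "mtta": mtta,
        "mttr": mttr,
        "escalation_rate": escalation_rate,
        "noise_ratio": noise_ratio,
        # Thresholds from the practice above.
        "mtta_ok": mtta <= 5,
        "escalation_ok": escalation_rate <= 0.20,
    }

incidents = [
    {"ack_minutes": 3, "resolve_minutes": 40, "escalated": False},
    {"ack_minutes": 9, "resolve_minutes": 120, "escalated": True},
]
report = review_metrics(incidents, suppressed_alerts=80, total_alerts=100)
```

A report where `mtta_ok` or `escalation_ok` is false is the trigger to revisit notification channels or on-call load, per the guidance above.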
Common Issues
On-call engineers receive pages for non-actionable alerts. Monitoring systems often send alerts for conditions that self-resolve (brief CPU spikes, single request failures, auto-scaling events). Configure suppression rules with time windows: only page if the condition persists for 5+ minutes. Use PagerDuty's intelligent alert grouping to merge related alerts into a single incident rather than paging separately for each symptom.
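The "only page if it persists" rule can also be enforced client-side, before events reach PagerDuty, by tracking when a condition first fired and forwarding it only once it has been continuously active for the window. A minimal sketch, assuming a 5-minute window:

```python
from datetime import datetime, timedelta

# Only page once a condition has been continuously firing for the
# persistence window; transient spikes that clear sooner never page.

class PersistenceGate:
    def __init__(self, window=timedelta(minutes=5)):
        self.window = window
        self.first_seen = {}  # condition key -> time it started firing

    def should_page(self, key, firing, now):
        if not firing:
            self.first_seen.pop(key, None)  # condition cleared, reset timer
            return False
        start = self.first_seen.setdefault(key, now)
        return now - start >= self.window

gate = PersistenceGate()
t0 = datetime(2024, 1, 1, 12, 0)
```

A brief CPU spike that clears after two minutes resets the timer and never pages; a sustained breach pages once the window elapses.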
Escalation policies do not account for timezone-distributed teams. A policy that escalates from a US engineer to a US tech lead at 3am means both are woken up. For distributed teams, configure follow-the-sun schedules where the on-call rotates to the timezone where it is business hours. Set up separate schedules per timezone and reference them in the escalation policy based on time-of-day routing rules.
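Time-of-day routing for a follow-the-sun setup amounts to a lookup from the current UTC hour to the regional schedule whose business hours cover it. The schedule names and hour boundaries below are illustrative placeholders, not real schedule IDs:

```python
# Simplified follow-the-sun router: pick the regional on-call
# schedule whose business hours (expressed in UTC) cover the
# current hour, so nobody is paged at 3am local time.
REGIONS = [
    # (schedule_id, utc_start_hour, utc_end_hour)
    ("apac-oncall", 0, 8),
    ("emea-oncall", 8, 16),
    ("amer-oncall", 16, 24),
]

def active_schedule(utc_hour):
    for schedule_id, start, end in REGIONS:
        if start <= utc_hour < end:
            return schedule_id
    raise ValueError("hour out of range")
```

In PagerDuty itself this is expressed by referencing per-region schedules from the escalation policy (or via event orchestration time-of-day conditions); the sketch just makes the hour-to-schedule mapping explicit.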
Integration alerts arrive with insufficient context to diagnose the issue. A PagerDuty alert saying "CPU > 90%" does not help the on-call engineer diagnose the root cause. Configure monitoring integrations to include: the affected service, the specific metric value and threshold, a link to the relevant dashboard, and a link to the runbook. The on-call engineer should be able to start investigating within 30 seconds of opening the alert.
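Context requirements like these can be enforced at the integration layer by building every event through a helper that demands the service, metric value versus threshold, and both links. The sketch follows the general shape of a PagerDuty Events API v2 payload (`payload`, `links`, `custom_details`); the URLs and exact field names are assumptions to verify against the Events API reference.

```python
# Build an alert payload that always carries enough context to start
# investigating: service, metric value vs. threshold, dashboard link,
# and runbook link.

def build_alert(routing_key, service, metric, value, threshold,
                dashboard_url, runbook_url, severity="critical"):
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"{service}: {metric} at {value} (threshold {threshold})",
            "source": service,
            "severity": severity,
            "custom_details": {
                "metric": metric,
                "value": value,
                "threshold": threshold,
            },
        },
        "links": [
            {"href": dashboard_url, "text": "Dashboard"},
            {"href": runbook_url, "text": "Runbook"},
        ],
    }

alert = build_alert("RKEY", "payment-service-prod", "cpu_utilization",
                    94, 90, "https://example.com/dash", "https://example.com/runbook")
```

Because the helper takes the dashboard and runbook URLs as required arguments, an integration that cannot supply them fails at build time rather than producing a context-free page.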