
Advisor PagerDuty Champion

Production-ready agent that responds to PagerDuty incidents, configures alerting, and analyzes incident metrics. Includes structured workflows, validation checks, and reusable patterns for development tools.

Agent · Cliptics · development tools · v1.0.0 · MIT

Advisor PagerDuty Champion

A PagerDuty integration and incident management agent that helps you configure alerting, design escalation policies, manage on-call rotations, and optimize incident response workflows.

When to Use This Agent

Choose Advisor PagerDuty Champion when:

  • Setting up PagerDuty services, escalation policies, and on-call schedules
  • Integrating PagerDuty with monitoring tools (Datadog, CloudWatch, Prometheus)
  • Designing incident response workflows and runbooks
  • Optimizing alert routing to reduce noise and improve response times
  • Analyzing incident metrics (MTTA, MTTR, escalation rates)

Consider alternatives when:

  • Setting up monitoring and observability infrastructure (use a DevOps agent)
  • Debugging a specific production incident (use an incident response agent)
  • Building alerting rules for application metrics (use a monitoring agent)

Quick Start

```yaml
# .claude/agents/advisor-pagerduty-champion.yml
name: Advisor PagerDuty Champion
description: Configure and optimize PagerDuty incident management
model: claude-sonnet
tools:
  - Read
  - Write
  - Bash
  - WebSearch
```

Example invocation:

```bash
claude "Design a PagerDuty setup for our three microservices with appropriate escalation policies, on-call rotations, and Datadog integration"
```

Core Concepts

PagerDuty Configuration Architecture

| Component | Purpose | Example |
| --- | --- | --- |
| Service | Represents a monitored system | payment-service-prod |
| Integration | Connects monitoring tool to service | Datadog → payment-service |
| Escalation Policy | Defines who to notify and when | L1 engineer → L2 lead → manager |
| Schedule | On-call rotation definition | Weekly rotation, 4 engineers |
| Event Rule | Routes and transforms incoming alerts | Suppress non-critical overnight |
| Response Play | Automated incident response actions | Page team, create Slack channel |
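
Services are created through the PagerDuty REST API v2, which expects a typed JSON body with an escalation-policy reference. A minimal sketch of composing that body (the policy ID and description are placeholders, not values from this template):

```python
# Sketch: compose a PagerDuty REST API v2 service-creation payload.
# Field names follow the public API; the escalation policy ID here
# is a hypothetical placeholder.

def service_payload(name: str, escalation_policy_id: str, description: str = "") -> dict:
    """Request body for POST https://api.pagerduty.com/services."""
    return {
        "service": {
            "type": "service",
            "name": name,
            "description": description,
            "escalation_policy": {
                "id": escalation_policy_id,
                "type": "escalation_policy_reference",
            },
        }
    }

payload = service_payload(
    "payment-service-prod", "PABC123", "Payment processing, production"
)
```

Creating one such payload per microservice per environment keeps the one-service-per-deployable-unit convention from the table above mechanical rather than manual.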

Escalation Policy Design

```yaml
# Recommended escalation structure
escalation_policy:
  name: "Payment Service - Production"
  repeat_enabled: true
  num_loops: 3
  escalation_rules:
    - escalation_delay_in_minutes: 5
      targets:
        - type: schedule_reference
          id: "payment-oncall-primary"
    - escalation_delay_in_minutes: 10
      targets:
        - type: schedule_reference
          id: "payment-oncall-secondary"
        - type: user_reference
          id: "tech-lead-user-id"
    - escalation_delay_in_minutes: 15
      targets:
        - type: user_reference
          id: "engineering-manager-id"
```
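
To see how this policy behaves over time, here is a small sketch that maps minutes-since-trigger to the escalation level currently being paged, assuming each rule's delay is the time spent at that level before escalating and the ladder repeats `num_loops` times:

```python
# Sketch: which escalation level is paging at minute t under the
# policy above (delays 5, 10, 15; three repeat loops).

def level_at(minute: int, delays=(5, 10, 15), num_loops: int = 3):
    cycle = sum(delays)                  # 30 minutes per full pass
    if minute >= cycle * num_loops:
        return None                      # policy exhausted, incident stays open
    t = minute % cycle                   # position within the current pass
    for level, delay in enumerate(delays, start=1):
        if t < delay:
            return level
        t -= delay

assert level_at(0) == 1      # primary on-call paged immediately
assert level_at(7) == 2      # secondary + tech lead after 5 minutes
assert level_at(20) == 3     # engineering manager after 15 more
assert level_at(95) is None  # all three 30-minute loops exhausted
```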

Alert Routing Rules

```yaml
# Event orchestration rules
event_rules:
  - name: "Suppress non-critical during maintenance"
    conditions:
      - field: "severity"
        operator: "equals"
        value: "warning"
      - field: "custom_details.environment"
        operator: "equals"
        value: "staging"
    actions:
      suppress: true
  - name: "Critical payment alerts — immediate page"
    conditions:
      - field: "summary"
        operator: "contains"
        value: "payment"
      - field: "severity"
        operator: "equals"
        value: "critical"
    actions:
      severity: "critical"
      priority: "P1"
  - name: "Batch low-severity alerts into digest"
    conditions:
      - field: "severity"
        operator: "equals"
        value: "info"
    actions:
      severity: "info"
      suppress:
        threshold_value: 10
        threshold_time_unit: "minutes"
```
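
The matching semantics are simple: every condition in a rule must hold, and the first matching rule wins. A minimal sketch of that evaluator (simplified operators, dotted-path field lookup; rule names here are illustrative):

```python
# Sketch: first-match rule evaluation over dotted-path alert fields.

def get_field(alert, path):
    for key in path.split("."):
        if not isinstance(alert, dict):
            return None
        alert = alert.get(key)
    return alert

def matches(alert, conditions):
    for cond in conditions:
        value = get_field(alert, cond["field"])
        if cond["operator"] == "equals" and value != cond["value"]:
            return False
        if cond["operator"] == "contains" and cond["value"] not in (value or ""):
            return False
    return True

def first_matching_rule(alert, rules):
    return next((r["name"] for r in rules if matches(alert, r["conditions"])), None)

rules = [
    {"name": "suppress-staging-warnings", "conditions": [
        {"field": "severity", "operator": "equals", "value": "warning"},
        {"field": "custom_details.environment", "operator": "equals", "value": "staging"}]},
    {"name": "critical-payment-page", "conditions": [
        {"field": "summary", "operator": "contains", "value": "payment"},
        {"field": "severity", "operator": "equals", "value": "critical"}]},
]

alert = {"summary": "payment latency spike", "severity": "critical",
         "custom_details": {"environment": "prod"}}
assert first_matching_rule(alert, rules) == "critical-payment-page"
```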

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| pd_api_version | PagerDuty API version | v2 |
| integration_type | Primary monitoring integration | datadog |
| escalation_loops | Number of escalation repeat cycles | 3 |
| ack_timeout | Minutes before re-alerting on an unacknowledged incident | 5 |
| resolve_timeout | Minutes before auto-resolving | 240 |
| notification_channels | Alert delivery channels (push, sms, phone, email) | ["push", "phone"] |

Best Practices

  1. Create separate services for each independently deployable unit. A single PagerDuty service for "backend" that receives alerts from 10 microservices makes it impossible to route alerts to the right team or measure reliability per service. Create one PagerDuty service per microservice per environment. This enables targeted escalation policies, per-service SLO tracking, and accurate incident categorization.

  2. Design escalation policies with increasing blast radius. Start with the primary on-call engineer (5 min), escalate to the secondary on-call (10 min), then the tech lead (15 min), then the engineering manager (20 min). Each level broadens the response team. Never start with a group page — it creates diffusion of responsibility where everyone assumes someone else will respond.

  3. Use event orchestration to suppress noise before it reaches on-call. Configure rules that suppress known low-impact alerts during off-hours, batch duplicate alerts within time windows, and auto-resolve alerts that clear within thresholds. On-call engineers should receive fewer than 5 actionable pages per week. More than that indicates either poor alert tuning or genuine reliability issues — either way, it requires intervention.

  4. Define severity levels with specific, objective criteria. Map alert severity to PagerDuty urgency with clear definitions: Critical (revenue impact, data loss) → High urgency page; Warning (degraded performance, single-node failure) → Low urgency notification; Info (capacity trending, minor anomalies) → No page, dashboard only. Subjective severity levels lead to inconsistent alerting and on-call burnout.

  5. Review incident metrics monthly and adjust configurations. Track MTTA (mean time to acknowledge), MTTR (mean time to resolve), escalation rate, and noise ratio (suppressed/total alerts). If MTTA exceeds 5 minutes, the notification method may be ineffective. If escalation rate exceeds 20%, the primary on-call may be overloaded. Use data to drive configuration improvements, not gut feelings.
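
The monthly review in practice is a small aggregation over incident records. A sketch of computing the metrics named above (timestamps are minutes since trigger; real data would come from PagerDuty's analytics, and the sample incidents are invented):

```python
# Sketch: MTTA, MTTR, and escalation rate from incident records.

def review_metrics(incidents):
    acks = [i["ack_min"] for i in incidents if i.get("ack_min") is not None]
    resolves = [i["resolve_min"] for i in incidents]
    escalated = sum(1 for i in incidents if i.get("escalated"))
    return {
        "mtta_min": sum(acks) / len(acks),
        "mttr_min": sum(resolves) / len(resolves),
        "escalation_rate": escalated / len(incidents),
    }

incidents = [
    {"ack_min": 2, "resolve_min": 30, "escalated": False},
    {"ack_min": 4, "resolve_min": 50, "escalated": False},
    {"ack_min": 12, "resolve_min": 160, "escalated": True},
    {"ack_min": 2, "resolve_min": 40, "escalated": False},
]

m = review_metrics(incidents)
assert m["mtta_min"] == 5.0          # at the 5-minute threshold: check notification methods
assert m["escalation_rate"] == 0.25  # above 20%: primary on-call may be overloaded
```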

Common Issues

On-call engineers receive pages for non-actionable alerts. Monitoring systems often send alerts for conditions that self-resolve (brief CPU spikes, single request failures, auto-scaling events). Configure suppression rules with time windows: only page if the condition persists for 5+ minutes. Use PagerDuty's intelligent alert grouping to merge related alerts into a single incident rather than paging separately for each symptom.
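
The persistence check reduces to tracking how long the condition has held. A minimal sketch, assuming 1-minute metric samples and the 5-minute window suggested above:

```python
# Sketch: page only when a breach persists for the full window.
# `observations` are (minute, breached) samples at 1-minute resolution.

def should_page(observations, window_min=5):
    streak = 0
    for _, breached in observations:
        streak = streak + 1 if breached else 0
        if streak >= window_min:
            return True
    return False

blip = [(m, m in (3, 4)) for m in range(10)]       # 2-minute CPU spike
sustained = [(m, m >= 2) for m in range(10)]       # breach from minute 2 onward

assert should_page(blip) is False       # self-resolving spike: no page
assert should_page(sustained) is True   # persisted 5+ minutes: page
```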

Escalation policies do not account for timezone-distributed teams. A policy that escalates from a US engineer to a US tech lead at 3am means both are woken up. For distributed teams, configure follow-the-sun schedules where the on-call rotates to the timezone where it is business hours. Set up separate schedules per timezone and reference them in the escalation policy based on time-of-day routing rules.
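
The time-of-day routing can be sketched as picking the schedule whose region is currently in business hours. The schedule IDs and UTC offsets below are illustrative placeholders, not part of this template:

```python
# Sketch: follow-the-sun schedule selection by local business hours.
from datetime import datetime, timezone

SCHEDULES = {  # schedule id -> UTC offset of the region
    "oncall-us-east": -5,
    "oncall-europe": +1,
    "oncall-apac": +9,
}

def follow_the_sun(utc_now: datetime) -> str:
    for schedule_id, offset in SCHEDULES.items():
        local_hour = (utc_now.hour + offset) % 24
        if 9 <= local_hour < 17:         # business hours in that region
            return schedule_id
    return "oncall-us-east"              # fallback: default primary

# 14:00 UTC is 09:00 in US East; 02:00 UTC is 11:00 in APAC
assert follow_the_sun(datetime(2024, 3, 4, 14, 0, tzinfo=timezone.utc)) == "oncall-us-east"
assert follow_the_sun(datetime(2024, 3, 4, 2, 0, tzinfo=timezone.utc)) == "oncall-apac"
```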

Integration alerts arrive with insufficient context to diagnose the issue. A PagerDuty alert saying "CPU > 90%" does not help the on-call engineer diagnose the root cause. Configure monitoring integrations to include: the affected service, the specific metric value and threshold, a link to the relevant dashboard, and a link to the runbook. The on-call engineer should be able to start investigating within 30 seconds of opening the alert.
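
A sketch of building such an alert in the PagerDuty Events API v2 shape, which supports `custom_details` and `links` for exactly this purpose (the URLs and routing key below are placeholders):

```python
# Sketch: Events API v2 trigger payload carrying enough context to
# start investigating immediately. Field names follow the public
# Events API v2; the URLs and routing key are hypothetical.

def enriched_alert(routing_key, service, metric, value, threshold,
                   dashboard_url, runbook_url):
    """Body for POST https://events.pagerduty.com/v2/enqueue."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"{service}: {metric} = {value} (threshold {threshold})",
            "source": service,
            "severity": "critical",
            "custom_details": {"metric": metric, "value": value,
                               "threshold": threshold},
        },
        "links": [
            {"href": dashboard_url, "text": "Service dashboard"},
            {"href": runbook_url, "text": "Runbook"},
        ],
    }

alert = enriched_alert("R0UT1NGKEY", "payment-service-prod", "cpu_percent",
                       94, 90, "https://example.com/dash", "https://example.com/runbook")
```

The summary names the service, the metric value, and the threshold in one line, and the links put the dashboard and runbook one click away.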
