
Advisor PagerDuty Champion

Production-ready agent that responds to PagerDuty incidents, configures alerting, and analyzes incident metrics. Includes structured workflows, validation checks, and reusable patterns for development tools.

Agent · Cliptics · development tools · v1.0.0 · MIT

Advisor PagerDuty Champion

A PagerDuty integration and incident management agent that helps you configure alerting, design escalation policies, manage on-call rotations, and optimize incident response workflows.

When to Use This Agent

Choose Advisor PagerDuty Champion when:

  • Setting up PagerDuty services, escalation policies, and on-call schedules
  • Integrating PagerDuty with monitoring tools (Datadog, CloudWatch, Prometheus)
  • Designing incident response workflows and runbooks
  • Optimizing alert routing to reduce noise and improve response times
  • Analyzing incident metrics (MTTA, MTTR, escalation rates)

Consider alternatives when:

  • Setting up monitoring and observability infrastructure (use a DevOps agent)
  • Debugging a specific production incident (use an incident response agent)
  • Building alerting rules for application metrics (use a monitoring agent)

Quick Start

```yaml
# .claude/agents/advisor-pagerduty-champion.yml
name: Advisor PagerDuty Champion
description: Configure and optimize PagerDuty incident management
model: claude-sonnet
tools:
  - Read
  - Write
  - Bash
  - WebSearch
```

Example invocation:

```bash
claude "Design a PagerDuty setup for our three microservices with appropriate escalation policies, on-call rotations, and Datadog integration"
```

Core Concepts

PagerDuty Configuration Architecture

| Component | Purpose | Example |
| --- | --- | --- |
| Service | Represents a monitored system | payment-service-prod |
| Integration | Connects monitoring tool to service | Datadog → payment-service |
| Escalation Policy | Defines who to notify and when | L1 engineer → L2 lead → manager |
| Schedule | On-call rotation definition | Weekly rotation, 4 engineers |
| Event Rule | Routes and transforms incoming alerts | Suppress non-critical overnight |
| Response Play | Automated incident response actions | Page team, create Slack channel |
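
Services are created through the PagerDuty REST API v2, which expects a typed JSON body with an escalation-policy reference. A minimal sketch of composing that body (the policy ID and description are placeholders, not values from this template):

```python
# Sketch: compose a PagerDuty REST API v2 service-creation payload.
# Field names follow the public API; the escalation policy ID here
# is a hypothetical placeholder.

def service_payload(name: str, escalation_policy_id: str, description: str = "") -> dict:
    """Request body for POST https://api.pagerduty.com/services."""
    return {
        "service": {
            "type": "service",
            "name": name,
            "description": description,
            "escalation_policy": {
                "id": escalation_policy_id,
                "type": "escalation_policy_reference",
            },
        }
    }

payload = service_payload(
    "payment-service-prod", "PABC123", "Payment processing, production"
)
```

Creating one such payload per microservice per environment keeps the one-service-per-deployable-unit convention from the table above mechanical rather than manual.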

Escalation Policy Design

```yaml
# Recommended escalation structure
escalation_policy:
  name: "Payment Service - Production"
  repeat_enabled: true
  num_loops: 3
  escalation_rules:
    - escalation_delay_in_minutes: 5
      targets:
        - type: schedule_reference
          id: "payment-oncall-primary"
    - escalation_delay_in_minutes: 10
      targets:
        - type: schedule_reference
          id: "payment-oncall-secondary"
        - type: user_reference
          id: "tech-lead-user-id"
    - escalation_delay_in_minutes: 15
      targets:
        - type: user_reference
          id: "engineering-manager-id"
```
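
To see how this policy behaves over time, here is a small sketch that maps minutes-since-trigger to the escalation level currently being paged, assuming each rule's delay is the time spent at that level before escalating and the ladder repeats `num_loops` times:

```python
# Sketch: which escalation level is paging at minute t under the
# policy above (delays 5, 10, 15; three repeat loops).

def level_at(minute: int, delays=(5, 10, 15), num_loops: int = 3):
    cycle = sum(delays)                  # 30 minutes per full pass
    if minute >= cycle * num_loops:
        return None                      # policy exhausted, incident stays open
    t = minute % cycle                   # position within the current pass
    for level, delay in enumerate(delays, start=1):
        if t < delay:
            return level
        t -= delay

assert level_at(0) == 1      # primary on-call paged immediately
assert level_at(7) == 2      # secondary + tech lead after 5 minutes
assert level_at(20) == 3     # engineering manager after 15 more
assert level_at(95) is None  # all three 30-minute loops exhausted
```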

Alert Routing Rules

```yaml
# Event orchestration rules
event_rules:
  - name: "Suppress non-critical during maintenance"
    conditions:
      - field: "severity"
        operator: "equals"
        value: "warning"
      - field: "custom_details.environment"
        operator: "equals"
        value: "staging"
    actions:
      suppress: true
  - name: "Critical payment alerts — immediate page"
    conditions:
      - field: "summary"
        operator: "contains"
        value: "payment"
      - field: "severity"
        operator: "equals"
        value: "critical"
    actions:
      severity: "critical"
      priority: "P1"
  - name: "Batch low-severity alerts into digest"
    conditions:
      - field: "severity"
        operator: "equals"
        value: "info"
    actions:
      severity: "info"
      suppress:
        threshold_value: 10
        threshold_time_unit: "minutes"
```
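
The matching semantics are simple: every condition in a rule must hold, and the first matching rule wins. A minimal sketch of that evaluator (simplified operators, dotted-path field lookup; rule names here are illustrative):

```python
# Sketch: first-match rule evaluation over dotted-path alert fields.

def get_field(alert, path):
    for key in path.split("."):
        if not isinstance(alert, dict):
            return None
        alert = alert.get(key)
    return alert

def matches(alert, conditions):
    for cond in conditions:
        value = get_field(alert, cond["field"])
        if cond["operator"] == "equals" and value != cond["value"]:
            return False
        if cond["operator"] == "contains" and cond["value"] not in (value or ""):
            return False
    return True

def first_matching_rule(alert, rules):
    return next((r["name"] for r in rules if matches(alert, r["conditions"])), None)

rules = [
    {"name": "suppress-staging-warnings", "conditions": [
        {"field": "severity", "operator": "equals", "value": "warning"},
        {"field": "custom_details.environment", "operator": "equals", "value": "staging"}]},
    {"name": "critical-payment-page", "conditions": [
        {"field": "summary", "operator": "contains", "value": "payment"},
        {"field": "severity", "operator": "equals", "value": "critical"}]},
]

alert = {"summary": "payment latency spike", "severity": "critical",
         "custom_details": {"environment": "prod"}}
assert first_matching_rule(alert, rules) == "critical-payment-page"
```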

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| pd_api_version | PagerDuty API version | v2 |
| integration_type | Primary monitoring integration | datadog |
| escalation_loops | Number of escalation repeat cycles | 3 |
| ack_timeout | Minutes before re-alerting on an unacknowledged incident | 5 |
| resolve_timeout | Minutes before auto-resolving | 240 |
| notification_channels | Alert delivery channels (push, sms, phone, email) | ["push", "phone"] |

Best Practices

  1. Create separate services for each independently deployable unit. A single PagerDuty service for "backend" that receives alerts from 10 microservices makes it impossible to route alerts to the right team or measure reliability per service. Create one PagerDuty service per microservice per environment. This enables targeted escalation policies, per-service SLO tracking, and accurate incident categorization.

  2. Design escalation policies with increasing blast radius. Start with the primary on-call engineer (5 min), escalate to the secondary on-call (10 min), then the tech lead (15 min), then the engineering manager (20 min). Each level broadens the response team. Never start with a group page — it creates diffusion of responsibility where everyone assumes someone else will respond.

  3. Use event orchestration to suppress noise before it reaches on-call. Configure rules that suppress known low-impact alerts during off-hours, batch duplicate alerts within time windows, and auto-resolve alerts that clear within thresholds. On-call engineers should receive fewer than 5 actionable pages per week. More than that indicates either poor alert tuning or genuine reliability issues — either way, it requires intervention.

  4. Define severity levels with specific, objective criteria. Map alert severity to PagerDuty urgency with clear definitions: Critical (revenue impact, data loss) → High urgency page; Warning (degraded performance, single-node failure) → Low urgency notification; Info (capacity trending, minor anomalies) → No page, dashboard only. Subjective severity levels lead to inconsistent alerting and on-call burnout.

  5. Review incident metrics monthly and adjust configurations. Track MTTA (mean time to acknowledge), MTTR (mean time to resolve), escalation rate, and noise ratio (suppressed/total alerts). If MTTA exceeds 5 minutes, the notification method may be ineffective. If escalation rate exceeds 20%, the primary on-call may be overloaded. Use data to drive configuration improvements, not gut feelings.
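
The monthly review in practice is a small aggregation over incident records. A sketch of computing the metrics named above (timestamps are minutes since trigger; real data would come from PagerDuty's analytics, and the sample incidents are invented):

```python
# Sketch: MTTA, MTTR, and escalation rate from incident records.

def review_metrics(incidents):
    acks = [i["ack_min"] for i in incidents if i.get("ack_min") is not None]
    resolves = [i["resolve_min"] for i in incidents]
    escalated = sum(1 for i in incidents if i.get("escalated"))
    return {
        "mtta_min": sum(acks) / len(acks),
        "mttr_min": sum(resolves) / len(resolves),
        "escalation_rate": escalated / len(incidents),
    }

incidents = [
    {"ack_min": 2, "resolve_min": 30, "escalated": False},
    {"ack_min": 4, "resolve_min": 50, "escalated": False},
    {"ack_min": 12, "resolve_min": 160, "escalated": True},
    {"ack_min": 2, "resolve_min": 40, "escalated": False},
]

m = review_metrics(incidents)
assert m["mtta_min"] == 5.0          # at the 5-minute threshold: check notification methods
assert m["escalation_rate"] == 0.25  # above 20%: primary on-call may be overloaded
```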

Common Issues

On-call engineers receive pages for non-actionable alerts. Monitoring systems often send alerts for conditions that self-resolve (brief CPU spikes, single request failures, auto-scaling events). Configure suppression rules with time windows: only page if the condition persists for 5+ minutes. Use PagerDuty's intelligent alert grouping to merge related alerts into a single incident rather than paging separately for each symptom.
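
The persistence check reduces to tracking how long the condition has held. A minimal sketch, assuming 1-minute metric samples and the 5-minute window suggested above:

```python
# Sketch: page only when a breach persists for the full window.
# `observations` are (minute, breached) samples at 1-minute resolution.

def should_page(observations, window_min=5):
    streak = 0
    for _, breached in observations:
        streak = streak + 1 if breached else 0
        if streak >= window_min:
            return True
    return False

blip = [(m, m in (3, 4)) for m in range(10)]       # 2-minute CPU spike
sustained = [(m, m >= 2) for m in range(10)]       # breach from minute 2 onward

assert should_page(blip) is False       # self-resolving spike: no page
assert should_page(sustained) is True   # persisted 5+ minutes: page
```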

Escalation policies do not account for timezone-distributed teams. A policy that escalates from a US engineer to a US tech lead at 3am means both are woken up. For distributed teams, configure follow-the-sun schedules where the on-call rotates to the timezone where it is business hours. Set up separate schedules per timezone and reference them in the escalation policy based on time-of-day routing rules.
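
The time-of-day routing can be sketched as picking the schedule whose region is currently in business hours. The schedule IDs and UTC offsets below are illustrative placeholders, not part of this template:

```python
# Sketch: follow-the-sun schedule selection by local business hours.
from datetime import datetime, timezone

SCHEDULES = {  # schedule id -> UTC offset of the region
    "oncall-us-east": -5,
    "oncall-europe": +1,
    "oncall-apac": +9,
}

def follow_the_sun(utc_now: datetime) -> str:
    for schedule_id, offset in SCHEDULES.items():
        local_hour = (utc_now.hour + offset) % 24
        if 9 <= local_hour < 17:         # business hours in that region
            return schedule_id
    return "oncall-us-east"              # fallback: default primary

# 14:00 UTC is 09:00 in US East; 02:00 UTC is 11:00 in APAC
assert follow_the_sun(datetime(2024, 3, 4, 14, 0, tzinfo=timezone.utc)) == "oncall-us-east"
assert follow_the_sun(datetime(2024, 3, 4, 2, 0, tzinfo=timezone.utc)) == "oncall-apac"
```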

Integration alerts arrive with insufficient context to diagnose the issue. A PagerDuty alert saying "CPU > 90%" does not help the on-call engineer diagnose the root cause. Configure monitoring integrations to include: the affected service, the specific metric value and threshold, a link to the relevant dashboard, and a link to the runbook. The on-call engineer should be able to start investigating within 30 seconds of opening the alert.
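
A sketch of building such an alert in the PagerDuty Events API v2 shape, which supports `custom_details` and `links` for exactly this purpose (the URLs and routing key below are placeholders):

```python
# Sketch: Events API v2 trigger payload carrying enough context to
# start investigating immediately. Field names follow the public
# Events API v2; the URLs and routing key are hypothetical.

def enriched_alert(routing_key, service, metric, value, threshold,
                   dashboard_url, runbook_url):
    """Body for POST https://events.pagerduty.com/v2/enqueue."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"{service}: {metric} = {value} (threshold {threshold})",
            "source": service,
            "severity": "critical",
            "custom_details": {"metric": metric, "value": value,
                               "threshold": threshold},
        },
        "links": [
            {"href": dashboard_url, "text": "Service dashboard"},
            {"href": runbook_url, "text": "Runbook"},
        ],
    }

alert = enriched_alert("R0UT1NGKEY", "payment-service-prod", "cpu_percent",
                       94, 90, "https://example.com/dash", "https://example.com/runbook")
```

The summary names the service, the metric value, and the threshold in one line, and the links put the dashboard and runbook one click away.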
