# Expert Rootly Incident Responder
An experienced SRE and incident-response agent specializing in production incident analysis and resolution with Rootly. It helps you analyze incidents, leverage historical data, and coordinate effective responses.
## When to Use This Agent
Choose Expert Rootly Incident Responder when:
- Managing active production incidents through the Rootly platform
- Analyzing past incidents to identify patterns and systemic issues
- Creating and refining incident runbooks and response playbooks
- Conducting post-incident reviews and generating action items
- Setting up incident severity definitions and escalation procedures
Consider alternatives when:
- Setting up PagerDuty alerting (use a PagerDuty agent)
- Debugging code-level issues (use a debugging agent)
- Designing system resilience (use a chaos engineering agent)
## Quick Start
```yaml
# .claude/agents/expert-rootly-incident-responder.yml
name: Expert Rootly Incident Responder
description: Incident management and analysis with Rootly
model: claude-sonnet
tools:
  - Read
  - Write
  - Bash
  - WebSearch
  - Glob
  - Grep
```
Example invocation:
```bash
claude "Analyze the last 5 production incidents in Rootly, identify common root causes, and suggest preventive measures"
```
## Core Concepts

### Incident Lifecycle Management
| Phase | Actions | Rootly Features |
|---|---|---|
| Detection | Alert received, incident created | Auto-creation from PagerDuty/Datadog |
| Triage | Severity assigned, team notified | Severity definitions, escalation rules |
| Response | Investigation, mitigation | Slack channel, war room, runbook |
| Resolution | Fix applied, services restored | Status updates, timeline tracking |
| Review | Post-mortem, action items | Retrospective template, follow-ups |
| Prevention | Systemic fixes implemented | Action item tracking, metrics |
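The lifecycle above can be sketched as a small state machine. This is an illustrative sketch only: the phase names mirror the table, but the transition map is an assumption, not a Rootly API.

```typescript
// Incident lifecycle phases, taken from the table above.
type Phase =
  | 'detection' | 'triage' | 'response'
  | 'resolution' | 'review' | 'prevention';

// Allowed forward transitions (illustrative; real workflows may loop back,
// e.g. from review to response if the fix did not hold).
const transitions: Record<Phase, Phase[]> = {
  detection: ['triage'],
  triage: ['response'],
  response: ['resolution'],
  resolution: ['review'],
  review: ['prevention'],
  prevention: [],
};

function canAdvance(from: Phase, to: Phase): boolean {
  return transitions[from].includes(to);
}
```

Encoding the phases explicitly makes it easy to validate status updates before they are written to the incident timeline.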
### Incident Response Template
```markdown
## Incident: Payment Service Outage

### Timeline
- 14:23 UTC — PagerDuty alert: payment-service error rate > 5%
- 14:25 UTC — Incident created in Rootly (SEV-1)
- 14:27 UTC — IC assigned: @oncall-engineer
- 14:30 UTC — Root cause identified: database connection pool exhaustion
- 14:35 UTC — Mitigation: increased pool size from 20 to 50
- 14:38 UTC — Error rate returning to baseline
- 14:45 UTC — Incident resolved, monitoring for stability

### Impact
- Duration: 22 minutes
- Affected users: ~2,400 (attempted payments during window)
- Revenue impact: estimated $18,000 in delayed transactions
- No data loss or corruption

### Root Cause
The connection pool was configured for 20 connections based on initial traffic estimates. A marketing campaign drove 3x normal traffic, exhausting the pool. Queued requests timed out after 30s, returning 500 errors to users.

### Action Items
1. [P0] Increase connection pool to 100 with overflow to 200
2. [P1] Add connection pool utilization alert at 70% threshold
3. [P2] Implement connection pooling with PgBouncer
4. [P2] Add load testing for 5x traffic scenarios
```
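As a sanity check on the impact section, the incident duration can be derived from the first and last timeline entries. This helper is a sketch with assumed field names, not part of any Rootly SDK:

```typescript
interface TimelineEntry {
  time: string;   // "HH:MM" in UTC, e.g. "14:23"
  event: string;  // free-text description
}

// Minutes elapsed between the first and last timeline entries
// (assumes both fall within the same UTC day).
function incidentDurationMinutes(timeline: TimelineEntry[]): number {
  const toMinutes = (t: string): number => {
    const [h, m] = t.split(':').map(Number);
    return h * 60 + m;
  };
  const first = toMinutes(timeline[0].time);
  const last = toMinutes(timeline[timeline.length - 1].time);
  return last - first;
}
```

Applied to the example timeline (14:23 alert through 14:45 resolved), this reproduces the 22-minute duration stated under Impact.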
### Incident Analysis Patterns
```typescript
// Pattern analysis across historical incidents
interface IncidentPattern {
  category: string;
  frequency: number;
  averageMTTR: number;
  commonRootCauses: string[];
  affectedServices: string[];
  preventionStatus: 'addressed' | 'in-progress' | 'unaddressed';
}

const patterns: IncidentPattern[] = [
  {
    category: 'Database Capacity',
    frequency: 5, // Last 6 months
    averageMTTR: 28, // Minutes
    commonRootCauses: [
      'Connection pool exhaustion',
      'Disk space full',
      'Slow queries under load',
    ],
    affectedServices: ['payment-service', 'order-service', 'user-service'],
    preventionStatus: 'in-progress',
  },
];
```
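One way to put this structure to work (a sketch, not part of Rootly's API) is to rank patterns by total toll, frequency times average MTTR, and surface only those whose prevention work is not yet complete:

```typescript
// Re-declared here for self-containment; matches the interface above.
interface IncidentPattern {
  category: string;
  frequency: number;
  averageMTTR: number;
  commonRootCauses: string[];
  affectedServices: string[];
  preventionStatus: 'addressed' | 'in-progress' | 'unaddressed';
}

// Rank unresolved patterns by total minutes lost (frequency × average MTTR),
// so the costliest systemic weaknesses come first.
function rankByToll(patterns: IncidentPattern[]): IncidentPattern[] {
  return patterns
    .filter(p => p.preventionStatus !== 'addressed')
    .sort((a, b) => b.frequency * b.averageMTTR - a.frequency * a.averageMTTR);
}
```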
## Configuration
| Parameter | Description | Default |
|---|---|---|
| `rootly_integration` | Rootly API connection | Required |
| `severity_levels` | Incident severity definitions | SEV1-SEV4 |
| `auto_create_channel` | Auto-create Slack incident channel | `true` |
| `postmortem_template` | Post-incident review template | `blameless` |
| `action_item_tracking` | Track follow-up action items | `true` |
| `pattern_analysis_window` | Lookback period for pattern analysis | 6 months |
## Best Practices
- **Assign an Incident Commander (IC) within the first 5 minutes.** The IC's role is coordination, not investigation. They manage communication, track the timeline, make escalation decisions, and ensure the right people are involved. Without an IC, multiple engineers investigate in parallel without sharing findings, duplicate effort, and miscommunicate status. The IC does not need to be the most senior person — they need to be organized and communicative.
- **Update the incident status every 15 minutes during active incidents.** Stakeholders (executives, customer support, affected teams) need regular updates to manage their own response. A status update template: "What we know, what we're doing, when the next update will be." Even if there is no new information, explicitly saying "still investigating, no new findings" prevents stakeholders from interrupting the response team for updates.
- **Run blameless post-mortems within 48 hours of resolution.** Memory fades quickly. Conduct the review while the incident is fresh, focusing on what happened (timeline), why it happened (root cause analysis), and what to change (action items). Blameless means no personal blame — the goal is systemic improvement. Document contributing factors in the process, tools, and architecture, not in individual mistakes.
- **Track action items from post-mortems as engineering tickets with deadlines.** Post-mortem action items that are not tracked formally get forgotten. Create tickets in your project tracker immediately during the review. Assign owners and deadlines. P0 action items (prevent recurrence of SEV-1 incidents) should be completed within the current sprint. Review open action items in weekly engineering meetings.
- **Analyze incident patterns quarterly to drive systemic improvements.** Individual incidents fix immediate problems. Pattern analysis identifies systemic weaknesses. If 5 out of 10 incidents involve database capacity, the fix is not 5 individual capacity increases — it is a capacity planning process, autoscaling infrastructure, and better load testing. Quarterly reviews transform incident response from reactive to preventive.
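The quarterly analysis above can start from a simple frequency count. This sketch (the `Incident` fields are assumptions for illustration) aggregates raw incidents into per-category frequency and average MTTR, the inputs the pattern table needs:

```typescript
interface Incident {
  category: string;    // e.g. 'Database Capacity'
  mttrMinutes: number; // time to resolution for this incident
}

interface CategorySummary {
  frequency: number;
  averageMTTR: number;
}

// Group incidents by category, tracking count and a running average MTTR.
function summarize(incidents: Incident[]): Map<string, CategorySummary> {
  const out = new Map<string, CategorySummary>();
  for (const i of incidents) {
    const s = out.get(i.category) ?? { frequency: 0, averageMTTR: 0 };
    // Incremental mean: fold the new MTTR into the running average.
    s.averageMTTR = (s.averageMTTR * s.frequency + i.mttrMinutes) / (s.frequency + 1);
    s.frequency += 1;
    out.set(i.category, s);
  }
  return out;
}
```

A category that dominates the output is a candidate for a systemic fix rather than another one-off remediation.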
## Common Issues
**Post-mortem action items accumulate without completion.** Teams diligently create action items but rarely complete them because they compete with feature work for engineering capacity. Allocate a fixed percentage of sprint capacity (15-20%) for reliability improvements. Track action item completion rate as a team metric. Escalate unaddressed P0 action items that are more than 2 weeks old to engineering leadership.
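The completion-rate metric and the stale-P0 escalation rule can be computed directly from tracked tickets. A minimal sketch, with assumed ticket fields rather than any specific tracker's API:

```typescript
interface ActionItem {
  priority: 'P0' | 'P1' | 'P2';
  done: boolean;
  ageDays: number; // days since the item was created
}

// Fraction of action items completed (1 when there are none to complete).
function completionRate(items: ActionItem[]): number {
  if (items.length === 0) return 1;
  return items.filter(i => i.done).length / items.length;
}

// P0 items still open after the 2-week threshold — candidates for escalation.
function staleP0s(items: ActionItem[]): ActionItem[] {
  return items.filter(i => i.priority === 'P0' && !i.done && i.ageDays > 14);
}
```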
**Incident severity is inconsistently assigned, undermining escalation policies.** Without clear severity definitions, some engineers classify every incident as SEV-1 (over-alerting) while others minimize severity to avoid attention (under-alerting). Create a severity matrix with specific, measurable criteria: SEV-1 = "complete service outage or data loss affecting >1% of users," SEV-2 = "degraded service affecting <1% of users," etc. Review severity assignments in post-mortems to calibrate.
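Measurable criteria can be encoded so triage is a lookup, not a judgment call. This classifier is a sketch of just the two criteria quoted above; the input signals and the SEV-3 fallback are assumptions:

```typescript
interface SeveritySignals {
  completeOutage: boolean;
  dataLoss: boolean;
  affectedUserFraction: number; // 0..1
}

// SEV-1: complete service outage or data loss affecting >1% of users.
// SEV-2: degraded service affecting <1% of users.
// SEV-3: assumed fallback for incidents with no user-visible impact.
function classifySeverity(s: SeveritySignals): 'SEV-1' | 'SEV-2' | 'SEV-3' {
  if ((s.completeOutage || s.dataLoss) && s.affectedUserFraction > 0.01) {
    return 'SEV-1';
  }
  if (s.affectedUserFraction > 0) return 'SEV-2';
  return 'SEV-3';
}
```

Reviewing the function's output against post-mortem severity decisions is a cheap way to calibrate the matrix over time.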
**Incident responders spend too much time on communication instead of investigation.** During a SEV-1 incident, the investigating engineer is interrupted every 2 minutes by stakeholders asking for updates. The IC role prevents this: all communication goes through the IC, freeing the investigation team to focus. Configure Rootly's auto-status updates to push information to stakeholders automatically, reducing the communication burden on the response team.