# Expert Rootly Incident Responder
An experienced SRE and incident-response agent specializing in production incident analysis and resolution with Rootly. It helps you analyze incidents, leverage historical data, and coordinate effective responses.
## When to Use This Agent
Choose Expert Rootly Incident Responder when:
- Managing active production incidents through the Rootly platform
- Analyzing past incidents to identify patterns and systemic issues
- Creating and refining incident runbooks and response playbooks
- Conducting post-incident reviews and generating action items
- Setting up incident severity definitions and escalation procedures
Consider alternatives when:
- Setting up PagerDuty alerting (use a PagerDuty agent)
- Debugging code-level issues (use a debugging agent)
- Designing system resilience (use a chaos engineering agent)
## Quick Start
```yaml
# .claude/agents/expert-rootly-incident-responder.yml
name: Expert Rootly Incident Responder
description: Incident management and analysis with Rootly
model: claude-sonnet
tools:
  - Read
  - Write
  - Bash
  - WebSearch
  - Glob
  - Grep
```
Example invocation:
```bash
claude "Analyze the last 5 production incidents in Rootly, identify common root causes, and suggest preventive measures"
```
## Core Concepts

### Incident Lifecycle Management
| Phase | Actions | Rootly Features |
|---|---|---|
| Detection | Alert received, incident created | Auto-creation from PagerDuty/Datadog |
| Triage | Severity assigned, team notified | Severity definitions, escalation rules |
| Response | Investigation, mitigation | Slack channel, war room, runbook |
| Resolution | Fix applied, services restored | Status updates, timeline tracking |
| Review | Post-mortem, action items | Retrospective template, follow-ups |
| Prevention | Systemic fixes implemented | Action item tracking, metrics |
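The lifecycle above can be sketched as a small state machine. This is an illustrative sketch only: the phase names mirror the table, but the transition map is an assumption, not a Rootly API.

```typescript
// Incident lifecycle phases, taken from the table above.
type Phase =
  | 'detection' | 'triage' | 'response'
  | 'resolution' | 'review' | 'prevention';

// Allowed forward transitions (illustrative; real workflows may loop back,
// e.g. from review to response if the fix did not hold).
const transitions: Record<Phase, Phase[]> = {
  detection: ['triage'],
  triage: ['response'],
  response: ['resolution'],
  resolution: ['review'],
  review: ['prevention'],
  prevention: [],
};

function canAdvance(from: Phase, to: Phase): boolean {
  return transitions[from].includes(to);
}
```

Encoding the phases explicitly makes it easy to validate status updates before they are written to the incident timeline.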
### Incident Response Template
```markdown
## Incident: Payment Service Outage

### Timeline
- 14:23 UTC — PagerDuty alert: payment-service error rate > 5%
- 14:25 UTC — Incident created in Rootly (SEV-1)
- 14:27 UTC — IC assigned: @oncall-engineer
- 14:30 UTC — Root cause identified: database connection pool exhaustion
- 14:35 UTC — Mitigation: increased pool size from 20 to 50
- 14:38 UTC — Error rate returning to baseline
- 14:45 UTC — Incident resolved, monitoring for stability

### Impact
- Duration: 22 minutes
- Affected users: ~2,400 (attempted payments during window)
- Revenue impact: estimated $18,000 in delayed transactions
- No data loss or corruption

### Root Cause
The connection pool was configured for 20 connections based on initial traffic estimates. A marketing campaign drove 3x normal traffic, exhausting the pool. Queued requests timed out after 30s, returning 500 errors to users.

### Action Items
1. [P0] Increase connection pool to 100 with overflow to 200
2. [P1] Add connection pool utilization alert at 70% threshold
3. [P2] Implement connection pooling with PgBouncer
4. [P2] Add load testing for 5x traffic scenarios
```
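As a sanity check on the impact section, the incident duration can be derived from the first and last timeline entries. This helper is a sketch with assumed field names, not part of any Rootly SDK:

```typescript
interface TimelineEntry {
  time: string;   // "HH:MM" in UTC, e.g. "14:23"
  event: string;  // free-text description
}

// Minutes elapsed between the first and last timeline entries
// (assumes both fall within the same UTC day).
function incidentDurationMinutes(timeline: TimelineEntry[]): number {
  const toMinutes = (t: string): number => {
    const [h, m] = t.split(':').map(Number);
    return h * 60 + m;
  };
  const first = toMinutes(timeline[0].time);
  const last = toMinutes(timeline[timeline.length - 1].time);
  return last - first;
}
```

Applied to the example timeline (14:23 alert through 14:45 resolved), this reproduces the 22-minute duration stated under Impact.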
### Incident Analysis Patterns
```typescript
// Pattern analysis across historical incidents
interface IncidentPattern {
  category: string;
  frequency: number;
  averageMTTR: number;
  commonRootCauses: string[];
  affectedServices: string[];
  preventionStatus: 'addressed' | 'in-progress' | 'unaddressed';
}

const patterns: IncidentPattern[] = [
  {
    category: 'Database Capacity',
    frequency: 5, // Last 6 months
    averageMTTR: 28, // Minutes
    commonRootCauses: [
      'Connection pool exhaustion',
      'Disk space full',
      'Slow queries under load',
    ],
    affectedServices: ['payment-service', 'order-service', 'user-service'],
    preventionStatus: 'in-progress',
  },
];
```
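One way to put this structure to work (a sketch, not part of Rootly's API) is to rank patterns by total toll, frequency times average MTTR, and surface only those whose prevention work is not yet complete:

```typescript
// Re-declared here for self-containment; matches the interface above.
interface IncidentPattern {
  category: string;
  frequency: number;
  averageMTTR: number;
  commonRootCauses: string[];
  affectedServices: string[];
  preventionStatus: 'addressed' | 'in-progress' | 'unaddressed';
}

// Rank unresolved patterns by total minutes lost (frequency × average MTTR),
// so the costliest systemic weaknesses come first.
function rankByToll(patterns: IncidentPattern[]): IncidentPattern[] {
  return patterns
    .filter(p => p.preventionStatus !== 'addressed')
    .sort((a, b) => b.frequency * b.averageMTTR - a.frequency * a.averageMTTR);
}
```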
## Configuration
| Parameter | Description | Default |
|---|---|---|
| `rootly_integration` | Rootly API connection | Required |
| `severity_levels` | Incident severity definitions | SEV1-SEV4 |
| `auto_create_channel` | Auto-create Slack incident channel | `true` |
| `postmortem_template` | Post-incident review template | `blameless` |
| `action_item_tracking` | Track follow-up action items | `true` |
| `pattern_analysis_window` | Lookback period for pattern analysis | 6 months |
## Best Practices
- **Assign an Incident Commander (IC) within the first 5 minutes.** The IC's role is coordination, not investigation. They manage communication, track the timeline, make escalation decisions, and ensure the right people are involved. Without an IC, multiple engineers investigate in parallel without sharing findings, duplicate effort, and miscommunicate status. The IC does not need to be the most senior person — they need to be organized and communicative.
- **Update the incident status every 15 minutes during active incidents.** Stakeholders (executives, customer support, affected teams) need regular updates to manage their own response. A status update template: "What we know, what we're doing, when the next update will be." Even if there is no new information, explicitly saying "still investigating, no new findings" prevents stakeholders from interrupting the response team for updates.
- **Run blameless post-mortems within 48 hours of resolution.** Memory fades quickly. Conduct the review while the incident is fresh, focusing on what happened (timeline), why it happened (root cause analysis), and what to change (action items). Blameless means no personal blame — the goal is systemic improvement. Document contributing factors in the process, tools, and architecture, not in individual mistakes.
- **Track action items from post-mortems as engineering tickets with deadlines.** Post-mortem action items that are not tracked formally get forgotten. Create tickets in your project tracker immediately during the review. Assign owners and deadlines. P0 action items (prevent recurrence of SEV-1 incidents) should be completed within the current sprint. Review open action items in weekly engineering meetings.
- **Analyze incident patterns quarterly to drive systemic improvements.** Individual incidents fix immediate problems. Pattern analysis identifies systemic weaknesses. If 5 out of 10 incidents involve database capacity, the fix is not 5 individual capacity increases — it is a capacity planning process, autoscaling infrastructure, and better load testing. Quarterly reviews transform incident response from reactive to preventive.
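The quarterly analysis above can start from a simple frequency count. This sketch (the `Incident` fields are assumptions for illustration) aggregates raw incidents into per-category frequency and average MTTR, the inputs the pattern table needs:

```typescript
interface Incident {
  category: string;    // e.g. 'Database Capacity'
  mttrMinutes: number; // time to resolution for this incident
}

interface CategorySummary {
  frequency: number;
  averageMTTR: number;
}

// Group incidents by category, tracking count and a running average MTTR.
function summarize(incidents: Incident[]): Map<string, CategorySummary> {
  const out = new Map<string, CategorySummary>();
  for (const i of incidents) {
    const s = out.get(i.category) ?? { frequency: 0, averageMTTR: 0 };
    // Incremental mean: fold the new MTTR into the running average.
    s.averageMTTR = (s.averageMTTR * s.frequency + i.mttrMinutes) / (s.frequency + 1);
    s.frequency += 1;
    out.set(i.category, s);
  }
  return out;
}
```

A category that dominates the output is a candidate for a systemic fix rather than another one-off remediation.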
## Common Issues
**Post-mortem action items accumulate without completion.** Teams diligently create action items but rarely complete them because they compete with feature work for engineering capacity. Allocate a fixed percentage of sprint capacity (15-20%) for reliability improvements. Track action item completion rate as a team metric. Escalate unaddressed P0 action items that are more than 2 weeks old to engineering leadership.
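The completion-rate metric and the stale-P0 escalation rule can be computed directly from tracked tickets. A minimal sketch, with assumed ticket fields rather than any specific tracker's API:

```typescript
interface ActionItem {
  priority: 'P0' | 'P1' | 'P2';
  done: boolean;
  ageDays: number; // days since the item was created
}

// Fraction of action items completed (1 when there are none to complete).
function completionRate(items: ActionItem[]): number {
  if (items.length === 0) return 1;
  return items.filter(i => i.done).length / items.length;
}

// P0 items still open after the 2-week threshold — candidates for escalation.
function staleP0s(items: ActionItem[]): ActionItem[] {
  return items.filter(i => i.priority === 'P0' && !i.done && i.ageDays > 14);
}
```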
**Incident severity is inconsistently assigned, undermining escalation policies.** Without clear severity definitions, some engineers classify every incident as SEV-1 (over-alerting) while others minimize severity to avoid attention (under-alerting). Create a severity matrix with specific, measurable criteria: SEV-1 = "complete service outage or data loss affecting >1% of users," SEV-2 = "degraded service affecting <1% of users," etc. Review severity assignments in post-mortems to calibrate.
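Measurable criteria can be encoded so triage is a lookup, not a judgment call. This classifier is a sketch of just the two criteria quoted above; the input signals and the SEV-3 fallback are assumptions:

```typescript
interface SeveritySignals {
  completeOutage: boolean;
  dataLoss: boolean;
  affectedUserFraction: number; // 0..1
}

// SEV-1: complete service outage or data loss affecting >1% of users.
// SEV-2: degraded service affecting <1% of users.
// SEV-3: assumed fallback for incidents with no user-visible impact.
function classifySeverity(s: SeveritySignals): 'SEV-1' | 'SEV-2' | 'SEV-3' {
  if ((s.completeOutage || s.dataLoss) && s.affectedUserFraction > 0.01) {
    return 'SEV-1';
  }
  if (s.affectedUserFraction > 0) return 'SEV-2';
  return 'SEV-3';
}
```

Reviewing the function's output against post-mortem severity decisions is a cheap way to calibrate the matrix over time.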
**Incident responders spend too much time on communication instead of investigation.** During a SEV-1 incident, the investigating engineer is interrupted every 2 minutes by stakeholders asking for updates. The IC role prevents this: all communication goes through the IC, freeing the investigation team to focus. Configure Rootly's auto-status updates to push information to stakeholders automatically, reducing the communication burden on the response team.