
Incident Response Command

Production incident response workflow with structured diagnostics, root cause analysis, hotfix implementation, and post-mortem documentation. Guides you through a systematic triage process when production is down.

Community command · deployment · v1.0.0 · MIT

Command

/incident

Description

Structured incident response workflow for production issues. Follows a triage-diagnose-fix-verify-document cycle with clear escalation points. Designed to minimize mean time to resolution (MTTR) during production outages.

Behavior

Arguments

  • $ARGUMENTS -- Incident description (e.g., "500 errors on /api/checkout since 14:30 UTC")

Incident Response Phases

Phase 1: Triage (2 minutes)

  1. Classify severity:

    • P0: Service down, all users affected
    • P1: Major feature broken, many users affected
    • P2: Minor feature broken, some users affected
    • P3: Cosmetic/minor issue, workaround exists
  2. Assess blast radius:

```bash
# Check service health
curl -s https://your-app.com/api/health | jq .

# Check recent deploys
git log --oneline --since="6 hours ago" origin/main

# Check error rates (if logging available)
# tail -100 /var/log/app/error.log
```
  3. Quick wins check: Was there a recent deploy that could be reverted?
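The severity buckets above can be approximated from an observed 5xx error rate. A minimal sketch, assuming an integer-percentage input; the thresholds are illustrative only and should be tuned to your service:

```shell
#!/bin/sh
# Map an observed 5xx error rate (integer percent) to a severity bucket.
# Thresholds are illustrative, not prescriptive.
classify_severity() {
  rate=$1
  if [ "$rate" -ge 90 ]; then echo "P0"      # effectively all users affected
  elif [ "$rate" -ge 20 ]; then echo "P1"    # major feature broken
  elif [ "$rate" -ge 1 ]; then echo "P2"     # some users affected
  else echo "P3"                             # cosmetic / negligible
  fi
}

classify_severity 35   # prints P1 under these thresholds
```

A helper like this keeps the triage call consistent across responders instead of relying on gut feel mid-incident.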

Phase 2: Diagnose (5-15 minutes)

  1. Gather evidence:

    • Error messages and stack traces
    • Affected endpoints/pages
    • Timeline of when it started
    • What changed (deploys, config changes, traffic spikes)
  2. Form hypotheses (ranked by likelihood):

    1. Recent deploy introduced a bug (check git log)
    2. Database connection pool exhausted (check connections)
    3. Third-party service outage (check status pages)
    4. Traffic spike exceeding capacity (check metrics)
  3. Narrow down: Test each hypothesis with targeted commands
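Hypothesis 1 is usually the fastest to test: find the timestamp of the first 5xx and compare it against recent deploy times. A sketch over a fabricated log excerpt (the real log path and column layout will differ):

```shell
#!/bin/sh
# Fabricated stand-in for an app error log: timestamp, status, method, path
cat > /tmp/error.sample.log <<'EOF'
2025-03-25T14:29:58Z 200 GET /api/checkout
2025-03-25T14:30:12Z 500 GET /api/checkout
2025-03-25T14:30:15Z 500 GET /api/checkout
EOF

# Timestamp of the first 5xx -- compare against `git log --since="6 hours ago"`
first_error=$(awk '$2 ~ /^5/ { print $1; exit }' /tmp/error.sample.log)
echo "first 5xx at: $first_error"
```

If the first error lands minutes after a deploy, that deploy moves to the top of the suspect list and Option A (revert) becomes the default fix.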

Phase 3: Fix

Option A: Revert (if caused by recent deploy)

```bash
# Identify the last known good commit
git log --oneline -10 origin/main

# Create a revert commit
git revert <bad-commit> --no-edit

# Deploy the revert
```

Option B: Hotfix (if revert is not feasible)

```bash
# Create hotfix branch from production
git checkout -b hotfix/incident-$(date +%Y%m%d) origin/main

# Apply the minimal fix
# ... implement fix ...

# Fast-track to production
git push -u origin hotfix/incident-$(date +%Y%m%d)
gh pr create --title "HOTFIX: [description]" --label urgent
```

Option C: Mitigate (if fix needs more time)

  • Enable feature flag to disable broken feature
  • Scale up infrastructure if capacity issue
  • Switch to backup/fallback service
  • Add rate limiting if under attack
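As one concrete shape for the feature-flag option, an environment-variable kill switch is the simplest mechanism. The flag name and wiring below are hypothetical, purely to illustrate the pattern; real flag systems (LaunchDarkly, Unleash, a config table) work analogously:

```shell
#!/bin/sh
# Hypothetical kill switch -- CHECKOUT_ENABLED is an illustrative name,
# not a flag this command defines.
CHECKOUT_ENABLED=false

if [ "${CHECKOUT_ENABLED:-true}" = "false" ]; then
  echo "checkout disabled: serving fallback response"
else
  echo "checkout enabled"
fi
```

Defaulting the flag to `true` when unset means a missing variable fails open to normal behavior, so the switch only takes effect when explicitly flipped.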

Phase 4: Verify

  1. Confirm the fix resolves the issue
  2. Monitor error rates for 15 minutes
  3. Check for secondary issues
  4. Communicate resolution to stakeholders
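For step 2, the error rate can be spot-checked straight from access logs while dashboards catch up. A sketch over sample data, assuming a log format whose first field is the HTTP status (adapt the field extraction to your format):

```shell
#!/bin/sh
# Sample access-log excerpt: one status code per request line
cat > /tmp/access.sample.log <<'EOF'
200
200
500
200
502
200
200
200
200
200
EOF

# Share of 5xx responses in the window
total=$(wc -l < /tmp/access.sample.log)
errors=$(grep -c '^5' /tmp/access.sample.log)
echo "5xx rate: $((errors * 100 / total))%"
```

Re-running a check like this every few minutes during the 15-minute watch window gives a quick confirmation that the rate is trending back toward baseline.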

Phase 5: Post-Mortem

Generate a post-mortem document:

```markdown
## Incident Post-Mortem

**Date**: 2025-03-25
**Duration**: 45 minutes (14:30 - 15:15 UTC)
**Severity**: P1
**Impact**: Checkout flow returning 500 errors for ~30% of users

### Timeline

- 14:15 -- Deploy #287 pushed to production
- 14:30 -- Error rate spike detected
- 14:35 -- Incident declared, triage started
- 14:45 -- Root cause identified: null pointer in payment validation
- 15:00 -- Hotfix deployed
- 15:15 -- Error rates back to normal, incident resolved

### Root Cause

Deploy #287 changed the payment validation logic but did not account for guest checkout, where `user.paymentMethods` is null.

### Action Items

- [ ] Add null safety checks to payment flow
- [ ] Add integration test for guest checkout
- [ ] Set up alerting on 5xx spike > 5%
```

Examples

```bash
# Start incident response
/incident 500 errors on checkout API since 14:30

# Investigate a specific error
/incident users getting "session expired" immediately after login

# Respond to a performance issue
/incident API response times 10x slower than normal
```