
Incident Response Command

Production incident response workflow with structured diagnostics, root cause analysis, hotfix implementation, and post-mortem documentation. Guides you through a systematic triage process when production is down.

Community command · deployment · v1.0.0 · MIT

Command

/incident

Description

Structured incident response workflow for production issues. Follows a triage-diagnose-fix-verify-document cycle with clear escalation points. Designed to minimize mean time to resolution (MTTR) during production outages.

Behavior

Arguments

  • $ARGUMENTS -- Incident description (e.g., "500 errors on /api/checkout since 14:30 UTC")

Incident Response Phases

Phase 1: Triage (2 minutes)

  1. Classify severity:

    • P0: Service down, all users affected
    • P1: Major feature broken, many users affected
    • P2: Minor feature broken, some users affected
    • P3: Cosmetic/minor issue, workaround exists
  2. Assess blast radius:

```bash
# Check service health
curl -s https://your-app.com/api/health | jq .

# Check recent deploys
git log --oneline --since="6 hours ago" origin/main

# Check error rates (if logging available)
# tail -100 /var/log/app/error.log
```
  3. Quick wins check: Was there a recent deploy that could be reverted?
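The severity buckets above can be approximated from an observed 5xx error rate. A minimal sketch, assuming an integer-percentage input; the thresholds are illustrative only and should be tuned to your service:

```shell
#!/bin/sh
# Map an observed 5xx error rate (integer percent) to a severity bucket.
# Thresholds are illustrative, not prescriptive.
classify_severity() {
  rate=$1
  if [ "$rate" -ge 90 ]; then echo "P0"      # effectively all users affected
  elif [ "$rate" -ge 20 ]; then echo "P1"    # major feature broken
  elif [ "$rate" -ge 1 ]; then echo "P2"     # some users affected
  else echo "P3"                             # cosmetic / negligible
  fi
}

classify_severity 35   # prints P1 under these thresholds
```

A helper like this keeps the triage call consistent across responders instead of relying on gut feel mid-incident.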

Phase 2: Diagnose (5-15 minutes)

  1. Gather evidence:

    • Error messages and stack traces
    • Affected endpoints/pages
    • Timeline of when it started
    • What changed (deploys, config changes, traffic spikes)
  2. Form hypotheses (ranked by likelihood):

    1. Recent deploy introduced a bug (check git log)
    2. Database connection pool exhausted (check connections)
    3. Third-party service outage (check status pages)
    4. Traffic spike exceeding capacity (check metrics)
  3. Narrow down: Test each hypothesis with targeted commands
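Hypothesis 1 is usually the fastest to test: find the timestamp of the first 5xx and compare it against recent deploy times. A sketch over a fabricated log excerpt (the real log path and column layout will differ):

```shell
#!/bin/sh
# Fabricated stand-in for an app error log: timestamp, status, method, path
cat > /tmp/error.sample.log <<'EOF'
2025-03-25T14:29:58Z 200 GET /api/checkout
2025-03-25T14:30:12Z 500 GET /api/checkout
2025-03-25T14:30:15Z 500 GET /api/checkout
EOF

# Timestamp of the first 5xx -- compare against `git log --since="6 hours ago"`
first_error=$(awk '$2 ~ /^5/ { print $1; exit }' /tmp/error.sample.log)
echo "first 5xx at: $first_error"
```

If the first error lands minutes after a deploy, that deploy moves to the top of the suspect list and Option A (revert) becomes the default fix.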

Phase 3: Fix

Option A: Revert (if caused by recent deploy)

```bash
# Identify the last known good commit
git log --oneline -10 origin/main

# Create a revert commit
git revert <bad-commit> --no-edit

# Deploy the revert
```

Option B: Hotfix (if revert is not feasible)

```bash
# Create hotfix branch from production
git checkout -b hotfix/incident-$(date +%Y%m%d) origin/main

# Apply the minimal fix
# ... implement fix ...

# Fast-track to production
git push -u origin hotfix/incident-$(date +%Y%m%d)
gh pr create --title "HOTFIX: [description]" --label urgent
```

Option C: Mitigate (if fix needs more time)

  • Enable feature flag to disable broken feature
  • Scale up infrastructure if capacity issue
  • Switch to backup/fallback service
  • Add rate limiting if under attack
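As one concrete shape for the feature-flag option, an environment-variable kill switch is the simplest mechanism. The flag name and wiring below are hypothetical, purely to illustrate the pattern; real flag systems (LaunchDarkly, Unleash, a config table) work analogously:

```shell
#!/bin/sh
# Hypothetical kill switch -- CHECKOUT_ENABLED is an illustrative name,
# not a flag this command defines.
CHECKOUT_ENABLED=false

if [ "${CHECKOUT_ENABLED:-true}" = "false" ]; then
  echo "checkout disabled: serving fallback response"
else
  echo "checkout enabled"
fi
```

Defaulting the flag to `true` when unset means a missing variable fails open to normal behavior, so the switch only takes effect when explicitly flipped.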

Phase 4: Verify

  1. Confirm the fix resolves the issue
  2. Monitor error rates for 15 minutes
  3. Check for secondary issues
  4. Communicate resolution to stakeholders
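For step 2, the error rate can be spot-checked straight from access logs while dashboards catch up. A sketch over sample data, assuming a log format whose first field is the HTTP status (adapt the field extraction to your format):

```shell
#!/bin/sh
# Sample access-log excerpt: one status code per request line
cat > /tmp/access.sample.log <<'EOF'
200
200
500
200
502
200
200
200
200
200
EOF

# Share of 5xx responses in the window
total=$(wc -l < /tmp/access.sample.log)
errors=$(grep -c '^5' /tmp/access.sample.log)
echo "5xx rate: $((errors * 100 / total))%"
```

Re-running a check like this every few minutes during the 15-minute watch window gives a quick confirmation that the rate is trending back toward baseline.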

Phase 5: Post-Mortem

Generate a post-mortem document:

```markdown
## Incident Post-Mortem

**Date**: 2025-03-25
**Duration**: 45 minutes (14:30 - 15:15 UTC)
**Severity**: P1
**Impact**: Checkout flow returning 500 errors for ~30% of users

### Timeline

- 14:15 -- Deploy #287 pushed to production
- 14:30 -- Error rate spike detected
- 14:35 -- Incident declared, triage started
- 14:45 -- Root cause identified: null pointer in payment validation
- 15:00 -- Hotfix deployed
- 15:15 -- Error rates back to normal, incident resolved

### Root Cause

Deploy #287 changed the payment validation logic but did not account for guest checkout, where `user.paymentMethods` is null.

### Action Items

- [ ] Add null safety checks to payment flow
- [ ] Add integration test for guest checkout
- [ ] Set up alerting on 5xx spike > 5%
```

Examples

```bash
# Start incident response
/incident 500 errors on checkout API since 14:30

# Investigate a specific error
/incident users getting "session expired" immediately after login

# Respond to a performance issue
/incident API response times 10x slower than normal
```