
Advanced Root Cause Tracing Kit

Streamline your workflow with this skill for tracing errors back to their original triggers and causes. Built for Claude Code with best practices and real-world patterns.

Skill · Community · debugging · v1.0.0 · MIT

Root Cause Tracing Kit

Systematic root cause analysis toolkit that traces bugs, failures, and regressions back to their origin through structured investigation techniques and causal chain mapping.

When to Use This Skill

Choose Root Cause Tracing when:

  • A bug keeps recurring despite surface-level fixes
  • Production incidents need thorough post-mortem investigation
  • Test failures have unclear or non-obvious causes
  • Performance degradations appear without clear code changes
  • You need to document causal chains for team knowledge sharing

Consider alternatives when:

  • The bug is immediately obvious from the error message
  • You need quick hotfixes rather than deep analysis
  • Issues are in third-party libraries you cannot modify

Quick Start

```shell
# Activate the root cause tracing skill
claude skill activate advanced-root-cause-tracing-kit

# Investigate a failing test
claude "Trace the root cause of the failing payment integration test"

# Analyze a production incident
claude "Root cause analysis: users seeing 500 errors on checkout since deploy v2.4.1"
```

Example Investigation Flow

```typescript
// Symptom: OrderService.createOrder() throws NullPointerException

async createOrder(userId: string, items: CartItem[]) {
  const user = await this.userRepo.findById(userId);

  // Step 1: Identify the failing line
  // const address = user.defaultAddress; // NPE here - user.defaultAddress is null

  // Step 2: Trace why defaultAddress is null
  // -> User was created via SSO flow
  // -> SSO registration skips address collection step
  // -> Root cause: SSO onboarding flow missing address prompt

  // Fix: Add address check with fallback
  const address = user.defaultAddress ?? await this.promptForAddress(userId);
  // ...
}
```

Core Concepts

Investigation Methodology

| Phase | Action | Output |
| --- | --- | --- |
| Symptom Collection | Gather error logs, stack traces, user reports | Symptom map |
| Timeline Construction | Identify when the issue first appeared | Change window |
| Hypothesis Formation | List possible causes ranked by likelihood | Hypothesis tree |
| Evidence Gathering | Test each hypothesis with data | Confirmed/eliminated causes |
| Causal Chain Mapping | Trace confirmed cause to its origin | Root cause document |
| Fix Verification | Confirm fix addresses root cause, not symptom | Regression test |
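The phase outputs in the table above can be sketched as a minimal data model. The type names and field shapes below are illustrative assumptions, not part of the skill's actual API:

```typescript
// Hypothetical data model mirroring the investigation phases:
// symptom map -> change window -> hypothesis tree -> causal chain -> regression test.
type Hypothesis = {
  description: string;
  likelihood: "high" | "medium" | "low";
  status: "open" | "confirmed" | "eliminated";
};

type Investigation = {
  symptoms: string[];                           // Symptom Collection
  changeWindow: { start: string; end: string }; // Timeline Construction
  hypotheses: Hypothesis[];                     // Hypothesis Formation + Evidence Gathering
  causalChain: string[];                        // Causal Chain Mapping, ordered immediate -> root
  regressionTest?: string;                      // Fix Verification
};

// The root cause is the deepest confirmed link in the causal chain.
function rootCause(inv: Investigation): string | undefined {
  return inv.causalChain[inv.causalChain.length - 1];
}
```

Keeping the chain ordered from immediate cause to root cause makes the final entry the fix target and everything before it a candidate for secondary mitigations.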

Tracing Techniques

| Technique | Best For | Approach |
| --- | --- | --- |
| Binary Search (git bisect) | Regressions with known good state | Bisect commits to find breaking change |
| Dependency Tracing | Failures after library updates | Compare dependency trees before/after |
| Data Flow Analysis | Incorrect output values | Trace variable values through execution path |
| Log Correlation | Distributed system failures | Correlate timestamps across service logs |
| Fault Tree Analysis | Complex system failures | Top-down decomposition of failure modes |
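As a concrete illustration of the log correlation technique, the sketch below merges entries from several services into one timeline and keeps only those near the failure timestamp. The entry shape and window size are assumptions for the example:

```typescript
type LogEntry = { service: string; ts: number; message: string };

// Merge per-service logs into a single timeline, then keep only entries
// within `windowMs` of the failure timestamp, sorted chronologically.
function correlate(
  logs: LogEntry[][],
  failureTs: number,
  windowMs: number
): LogEntry[] {
  return logs
    .flat()
    .filter((e) => Math.abs(e.ts - failureTs) <= windowMs)
    .sort((a, b) => a.ts - b.ts);
}
```

In practice the timestamps would come from structured logs with clock-skew correction, but even this naive merge often reveals which service misbehaved first.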

Causal Chain Template

```markdown
## Incident: [Title]

**Symptom**: Users cannot complete checkout
**Impact**: 15% of orders failing since 2024-03-10 14:00 UTC

### Causal Chain

1. **Immediate cause**: PaymentGateway.charge() returns timeout error
2. **Contributing factor**: Gateway connection pool exhausted (max 10)
3. **Underlying cause**: New retry logic holds connections during backoff
4. **Root cause**: Retry implementation uses synchronous sleep instead of releasing connection
5. **Systemic factor**: No connection pool monitoring or alerting

### Fix

- Primary: Use async retry with connection release between attempts
- Secondary: Add connection pool utilization alerts at 80% threshold
- Preventive: Add integration test for concurrent payment processing
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| max_depth | Maximum causal chain depth to investigate | 7 |
| include_git_history | Search git history for related changes | true |
| log_window | Time window for log analysis | 48h |
| hypothesis_limit | Maximum hypotheses to evaluate in parallel | 5 |
| include_dependencies | Check dependency changes in analysis | true |
| output_format | Report format: markdown, json, or jira | markdown |
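A typed view of these parameters might look like the following. The interface name is an assumption; the fields and defaults simply mirror the table above, not a published configuration schema:

```typescript
// Illustrative typed config; names and defaults taken from the table above.
interface TracingConfig {
  max_depth: number;
  include_git_history: boolean;
  log_window: string; // e.g. "48h"
  hypothesis_limit: number;
  include_dependencies: boolean;
  output_format: "markdown" | "json" | "jira";
}

const defaults: TracingConfig = {
  max_depth: 7,
  include_git_history: true,
  log_window: "48h",
  hypothesis_limit: 5,
  include_dependencies: true,
  output_format: "markdown",
};
```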

Best Practices

  1. Never fix symptoms without understanding causes — Patching the immediate error without tracing the root cause leads to recurring failures and growing technical debt.

  2. Use the "5 Whys" technique systematically — When you find a cause, ask "why did this happen?" at least five times. Each answer peels back a layer closer to the true root cause.

  3. Preserve evidence before fixing — Capture logs, database states, heap dumps, and reproduction steps before applying any fix. Evidence disappears quickly in production systems.

  4. Build a timeline of changes — Compare the failure onset time against deployment logs, config changes, dependency updates, and infrastructure events to narrow the investigation window.

  5. Document every root cause analysis — Even minor investigations produce institutional knowledge. Maintain an RCA database that teams can search to identify patterns across incidents.
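The "5 Whys" practice from the list above can be captured as a simple chain. The example answers below echo the checkout incident from the causal chain template and are purely illustrative:

```typescript
// Each entry answers "why?" for the one before it; the last entry is
// the candidate root cause.
const fiveWhys: string[] = [
  "Checkout returns 500 errors",                                            // symptom
  "Why? PaymentGateway.charge() times out",
  "Why? Gateway connection pool is exhausted",
  "Why? Retry logic holds connections during backoff",
  "Why? Retry uses synchronous sleep instead of releasing the connection",  // root cause
];

function deepestWhy(chain: string[]): string {
  return chain[chain.length - 1];
}
```

Stopping at the first or second "why" is what produces symptom-level patches; the fixable root cause usually sits several layers down.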

Common Issues

**Investigation leads to dead ends with no clear root cause.** Expand the investigation scope beyond code. Check infrastructure changes, DNS updates, certificate rotations, third-party API modifications, and data migration scripts. Many "code bugs" are actually environment or configuration issues that don't appear in git history.

**Root cause fix introduces new regressions.** Always write a regression test that reproduces the original failure before implementing the fix. Run the full test suite after fixing and specifically test adjacent functionality that shares the same code paths or data structures.
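A regression test for the NullPointerException from the earlier example investigation might look like the sketch below. The `resolveAddress` helper and the stand-in prompt function are assumptions introduced for illustration, not part of the original service:

```typescript
// Reproduces the SSO-user scenario: a user created without a default
// address. The fallback mirrors the earlier fix (defaultAddress ?? prompt).
type User = { id: string; defaultAddress?: string };

async function resolveAddress(
  user: User,
  promptForAddress: (id: string) => Promise<string>
): Promise<string> {
  return user.defaultAddress ?? (await promptForAddress(user.id));
}

// Regression check: an SSO user with no address must not crash and
// must fall back to the prompted address.
async function regressionTest(): Promise<boolean> {
  const ssoUser: User = { id: "u1" }; // no defaultAddress, as in the original bug
  const addr = await resolveAddress(ssoUser, async () => "123 Main St");
  return addr === "123 Main St";
}
```

Because this test encodes the exact triggering condition (a missing `defaultAddress`), it will fail loudly if a future refactor reintroduces the unguarded access.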

**Team disagrees on what constitutes the "root cause" versus a contributing factor.** Use the causal chain format to separate immediate triggers from underlying causes and systemic factors. The root cause is the deepest fixable point in the chain — going deeper hits organizational or architectural constraints that require separate planning.
