Advanced Root Cause Tracing Kit
Streamline your workflow with this skill for tracing errors back to their original triggers and causes. Built for Claude Code with best practices and real-world patterns.
Root Cause Tracing Kit
Systematic root cause analysis toolkit that traces bugs, failures, and regressions back to their origin through structured investigation techniques and causal chain mapping.
When to Use This Skill
Choose Root Cause Tracing when:
- A bug keeps recurring despite surface-level fixes
- Production incidents need thorough post-mortem investigation
- Test failures have unclear or non-obvious causes
- Performance degradations appear without clear code changes
- You need to document causal chains for team knowledge sharing
Consider alternatives when:
- The bug is immediately obvious from the error message
- You need quick hotfixes rather than deep analysis
- Issues are in third-party libraries you cannot modify
Quick Start
    # Activate the root cause tracing skill
    claude skill activate advanced-root-cause-tracing-kit

    # Investigate a failing test
    claude "Trace the root cause of the failing payment integration test"

    # Analyze a production incident
    claude "Root cause analysis: users seeing 500 errors on checkout since deploy v2.4.1"
Example Investigation Flow
    // Symptom: OrderService.createOrder() throws NullPointerException
    // (in TypeScript this surfaces as a TypeError on null access)

    // Step 1: Identify the failing line
    async createOrder(userId: string, items: CartItem[]) {
      const user = await this.userRepo.findById(userId);
      // Before the fix: const address = user.defaultAddress;
      // defaultAddress is null here, so any later access on it fails

      // Step 2: Trace why defaultAddress is null
      //   -> User was created via SSO flow
      //   -> SSO registration skips address collection step
      //   -> Root cause: SSO onboarding flow missing address prompt

      // Fix: Add address check with fallback
      const address = user.defaultAddress ?? await this.promptForAddress(userId);
    }
Core Concepts
Investigation Methodology
| Phase | Action | Output |
|---|---|---|
| Symptom Collection | Gather error logs, stack traces, user reports | Symptom map |
| Timeline Construction | Identify when the issue first appeared | Change window |
| Hypothesis Formation | List possible causes ranked by likelihood | Hypothesis tree |
| Evidence Gathering | Test each hypothesis with data | Confirmed/eliminated causes |
| Causal Chain Mapping | Trace confirmed cause to its origin | Root cause document |
| Fix Verification | Confirm fix addresses root cause, not symptom | Regression test |
Tracing Techniques
| Technique | Best For | Approach |
|---|---|---|
| Binary Search (git bisect) | Regressions with known good state | Bisect commits to find breaking change |
| Dependency Tracing | Failures after library updates | Compare dependency trees before/after |
| Data Flow Analysis | Incorrect output values | Trace variable values through execution path |
| Log Correlation | Distributed system failures | Correlate timestamps across service logs |
| Fault Tree Analysis | Complex system failures | Top-down decomposition of failure modes |
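The binary search row can be illustrated without git itself. The sketch below is a minimal model of the search that `git bisect` automates, under the assumption that commits are ordered oldest to newest, the first is known good, the last is known bad, and every commit after the regression is also bad. `findFirstBadCommit`, the `isBad` predicate, and the commit ids are hypothetical names for illustration; in practice `isBad` would check out the commit and run the failing test.

```typescript
// Hypothetical sketch of the search that `git bisect` automates.
// Assumes commits[0] is good, the last commit is bad, and badness is
// monotonic (once a commit is bad, all later commits are bad too).
function findFirstBadCommit<T>(
  commits: readonly T[],
  isBad: (commit: T) => boolean,
): { firstBad: T; checks: number } {
  if (commits.length < 2) {
    throw new Error("need at least one known-good and one known-bad commit");
  }
  let good = 0;                  // index of the newest commit known to be good
  let bad = commits.length - 1;  // index of the oldest commit known to be bad
  let checks = 0;                // how many commits had to be tested

  while (bad - good > 1) {
    const mid = Math.floor((good + bad) / 2);
    checks += 1;
    if (isBad(commits[mid])) {
      bad = mid;   // regression was introduced at mid or earlier
    } else {
      good = mid;  // regression was introduced after mid
    }
  }
  return { firstBad: commits[bad], checks };
}

// Usage with made-up commit ids; "c5" is the commit that introduces the bug.
const commitLog = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"];
const bisectResult = findFirstBadCommit(
  commitLog,
  (commit) => commitLog.indexOf(commit) >= 4,
);
```

Each check halves the remaining range, so the number of commits tested grows with the logarithm of the history length rather than linearly.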
Causal Chain Template
    ## Incident: [Title]

    **Symptom**: Users cannot complete checkout
    **Impact**: 15% of orders failing since 2024-03-10 14:00 UTC

    ### Causal Chain
    1. **Immediate cause**: PaymentGateway.charge() returns timeout error
    2. **Contributing factor**: Gateway connection pool exhausted (max 10)
    3. **Underlying cause**: New retry logic holds connections during backoff
    4. **Root cause**: Retry implementation uses synchronous sleep instead of releasing connection
    5. **Systemic factor**: No connection pool monitoring or alerting

    ### Fix
    - Primary: Use async retry with connection release between attempts
    - Secondary: Add connection pool utilization alerts at 80% threshold
    - Preventive: Add integration test for concurrent payment processing
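If causal chains are stored as data rather than free text, the template above can be generated consistently. The following is one possible sketch, not part of the kit: `CausalLink`, `IncidentReport`, and `renderIncident` are hypothetical names, and the sample values are copied from the template.

```typescript
// Hypothetical types mirroring the template; none of these names come from the kit.
type LinkKind =
  | "Immediate cause"
  | "Contributing factor"
  | "Underlying cause"
  | "Root cause"
  | "Systemic factor";

interface CausalLink {
  kind: LinkKind;
  description: string;
}

interface IncidentReport {
  title: string;
  symptom: string;
  impact: string;
  chain: CausalLink[]; // ordered from the immediate cause toward the origin
  fixes: { label: string; action: string }[];
}

// Render a report as markdown in the same layout as the template.
function renderIncident(report: IncidentReport): string {
  const chainLines = report.chain.map(
    (link, i) => `${i + 1}. **${link.kind}**: ${link.description}`,
  );
  const fixLines = report.fixes.map((fix) => `- ${fix.label}: ${fix.action}`);
  return [
    `## Incident: ${report.title}`,
    `**Symptom**: ${report.symptom}`,
    `**Impact**: ${report.impact}`,
    "### Causal Chain",
    ...chainLines,
    "### Fix",
    ...fixLines,
  ].join("\n");
}

// Sample values taken from the template above.
const checkoutReport = renderIncident({
  title: "[Title]",
  symptom: "Users cannot complete checkout",
  impact: "15% of orders failing since 2024-03-10 14:00 UTC",
  chain: [
    { kind: "Immediate cause", description: "PaymentGateway.charge() returns timeout error" },
    { kind: "Contributing factor", description: "Gateway connection pool exhausted (max 10)" },
    { kind: "Underlying cause", description: "New retry logic holds connections during backoff" },
    { kind: "Root cause", description: "Retry implementation uses synchronous sleep instead of releasing connection" },
    { kind: "Systemic factor", description: "No connection pool monitoring or alerting" },
  ],
  fixes: [
    { label: "Primary", action: "Use async retry with connection release between attempts" },
    { label: "Secondary", action: "Add connection pool utilization alerts at 80% threshold" },
    { label: "Preventive", action: "Add integration test for concurrent payment processing" },
  ],
});
```

Restricting `kind` to a fixed set of labels keeps every report using the same vocabulary, which makes past incidents easier to search and compare.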
Configuration
| Parameter | Description | Default |
|---|---|---|
| max_depth | Maximum causal chain depth to investigate | 7 |
| include_git_history | Search git history for related changes | true |
| log_window | Time window for log analysis | 48h |
| hypothesis_limit | Maximum hypotheses to evaluate in parallel | 5 |
| include_dependencies | Check dependency changes in analysis | true |
| output_format | Report format: markdown, json, or jira | markdown |
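This document does not say how these parameters are supplied, so the following is only a sketch of how the table could be expressed as a typed object with defaults. `TracingConfig`, `DEFAULT_CONFIG`, and `resolveConfig` are hypothetical names; the parameter names and default values are copied from the table.

```typescript
// Hypothetical typed view of the parameters in the table.
// How the kit actually reads configuration is not specified here.
type OutputFormat = "markdown" | "json" | "jira";

interface TracingConfig {
  max_depth: number;
  include_git_history: boolean;
  log_window: string; // duration string such as "48h"
  hypothesis_limit: number;
  include_dependencies: boolean;
  output_format: OutputFormat;
}

// Defaults copied from the table.
const DEFAULT_CONFIG: TracingConfig = {
  max_depth: 7,
  include_git_history: true,
  log_window: "48h",
  hypothesis_limit: 5,
  include_dependencies: true,
  output_format: "markdown",
};

// True for whole numbers greater than zero.
function isPositiveInteger(value: number): boolean {
  return Math.floor(value) === value && value > 0;
}

// Merge overrides onto the defaults and reject obviously invalid values.
function resolveConfig(overrides: Partial<TracingConfig> = {}): TracingConfig {
  const config: TracingConfig = { ...DEFAULT_CONFIG, ...overrides };
  if (!isPositiveInteger(config.max_depth)) {
    throw new Error("max_depth must be a positive integer");
  }
  if (!isPositiveInteger(config.hypothesis_limit)) {
    throw new Error("hypothesis_limit must be a positive integer");
  }
  if (["markdown", "json", "jira"].indexOf(config.output_format) === -1) {
    throw new Error("output_format must be markdown, json, or jira");
  }
  return config;
}

// Usage: override only the report format and keep every other default.
const jsonReportConfig = resolveConfig({ output_format: "json" });
```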
Best Practices
- Never fix symptoms without understanding causes. Patching the immediate error without tracing the root cause leads to recurring failures and growing technical debt.
- Use the "5 Whys" technique systematically. When you find a cause, ask "why did this happen?" at least five times. Each answer peels back a layer closer to the true root cause.
- Preserve evidence before fixing. Capture logs, database states, heap dumps, and reproduction steps before applying any fix. Evidence disappears quickly in production systems.
- Build a timeline of changes. Compare the failure onset time against deployment logs, config changes, dependency updates, and infrastructure events to narrow the investigation window.
- Document every root cause analysis. Even minor investigations produce institutional knowledge. Maintain an RCA database that teams can search to identify patterns across incidents.
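The timeline practice can be sketched as a small filter: given the failure onset and a list of recorded changes, keep only the changes that landed in a lookback window before the onset. `ChangeEvent` and `changesBeforeOnset` are hypothetical names, and the sample events are invented for illustration; only the onset time is taken from the causal chain template.

```typescript
// Hypothetical change record; real entries would be exported from whatever
// deployment, config, and infrastructure tooling a team actually uses.
interface ChangeEvent {
  at: string; // ISO 8601 timestamp in UTC
  source: "deploy" | "config" | "dependency" | "infrastructure";
  summary: string;
}

// Return the changes that landed within `windowHours` before the failure
// onset, newest first, so the most recent change is reviewed first.
function changesBeforeOnset(
  events: ChangeEvent[],
  onsetIso: string,
  windowHours: number,
): ChangeEvent[] {
  const onset = Date.parse(onsetIso);
  const windowStart = onset - windowHours * 60 * 60 * 1000;
  return events
    .filter((e) => {
      const t = Date.parse(e.at);
      return t >= windowStart && t <= onset;
    })
    .sort((a, b) => Date.parse(b.at) - Date.parse(a.at));
}

// Invented sample events around the onset used in the causal chain template.
const recordedChanges: ChangeEvent[] = [
  { at: "2024-03-07T09:00:00Z", source: "dependency", summary: "HTTP client upgrade" },
  { at: "2024-03-09T16:30:00Z", source: "config", summary: "Gateway timeout raised" },
  { at: "2024-03-10T13:40:00Z", source: "deploy", summary: "Release with new retry logic" },
  { at: "2024-03-10T15:00:00Z", source: "infrastructure", summary: "Node pool scaled up" },
];

// 48 hours matches the default log_window in the configuration table.
const suspectChanges = changesBeforeOnset(recordedChanges, "2024-03-10T14:00:00Z", 48);
```

With this sample data, the dependency upgrade falls outside the 48-hour window and the scale-up happened after the onset, which leaves the deploy and the config change as the candidates to examine.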
Common Issues
Investigation leads to dead ends with no clear root cause. Expand the investigation scope beyond code. Check infrastructure changes, DNS updates, certificate rotations, third-party API modifications, and data migration scripts. Many "code bugs" are actually environment or configuration issues that don't appear in git history.
Root cause fix introduces new regressions. Always write a regression test that reproduces the original failure before implementing the fix. Run the full test suite after fixing and specifically test adjacent functionality that shares the same code paths or data structures.
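As a sketch of the "reproduce first, then fix" order, the check below encodes the trigger from the example investigation (a user created via SSO with no default address) as a regression test. `Customer`, `resolveShippingAddress`, and the fallback callback are hypothetical stand-ins for the real service code, simplified to synchronous calls and written without a test framework.

```typescript
// Hypothetical stand-in for the address lookup inside createOrder().
interface Customer {
  id: string;
  defaultAddress: string | null; // null for users created via the SSO flow
}

// Fixed behavior: fall back instead of passing null further down the call chain.
function resolveShippingAddress(
  customer: Customer,
  fallback: (customerId: string) => string,
): string {
  return customer.defaultAddress ?? fallback(customer.id);
}

// Regression test: reproduces the original trigger (SSO user without address).
// Written before the fix, it should fail; after the fix, it should pass.
function testSsoUserWithoutAddress(): void {
  const ssoCustomer: Customer = { id: "u-sso-1", defaultAddress: null };
  const address = resolveShippingAddress(
    ssoCustomer,
    (id) => `prompted-address-for-${id}`,
  );
  if (address !== "prompted-address-for-u-sso-1") {
    throw new Error(`expected fallback address, got ${address}`);
  }
}

// Adjacent path: users who already have an address must be unaffected.
function testUserWithAddressIsUnchanged(): void {
  const customer: Customer = { id: "u-2", defaultAddress: "1 Main St" };
  const address = resolveShippingAddress(customer, () => "should-not-be-used");
  if (address !== "1 Main St") {
    throw new Error(`expected stored address, got ${address}`);
  }
}

testSsoUserWithoutAddress();
testUserWithAddressIsUnchanged();
```

The second test covers the adjacent code path, so the fix is checked both against the original failure and against the behavior it must not change.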
Team disagrees on what constitutes the "root cause" versus a contributing factor. Use the causal chain format to separate immediate triggers from underlying causes and systemic factors. The root cause is the deepest fixable point in the chain — going deeper hits organizational or architectural constraints that require separate planning.