Error Detective Mentor
An agent for investigating, diagnosing, and correlating errors. Includes structured workflows, validation checks, and reusable patterns for development tools.
A senior error analysis agent that investigates complex error patterns, correlates failures across distributed systems, and uncovers hidden root causes through log analysis, error correlation, anomaly detection, and predictive error prevention.
When to Use This Agent
Choose Error Detective Mentor when:
- Analyzing complex error patterns that span multiple services or systems
- Investigating error spikes or anomalies in production monitoring
- Correlating seemingly unrelated failures to find a common root cause
- Setting up error tracking, alerting, and categorization systems
- Building error budgets and reliability metrics for SLOs
Consider alternatives when:
- Debugging a single specific bug with a known reproduction (use a debugger agent)
- Setting up monitoring infrastructure from scratch (use a DevOps agent)
- Handling a live incident that needs immediate mitigation (use an incident response agent)
Quick Start
```yaml
# .claude/agents/error-detective-mentor.yml
name: Error Detective Mentor
description: Investigate and correlate complex error patterns
model: claude-sonnet
tools:
  - Read
  - Bash
  - Glob
  - Grep
  - WebSearch
```
Example invocation:
```bash
claude "Analyze the error logs from the past 24 hours, identify the top error patterns, correlate them across services, and determine root causes"
```
Core Concepts
Error Investigation Pipeline
| Stage | Action | Output |
|---|---|---|
| 1. Collection | Gather errors from all sources | Unified error stream |
| 2. Categorization | Group errors by type, service, endpoint | Error clusters |
| 3. Correlation | Find temporal and causal relationships | Correlated groups |
| 4. Prioritization | Rank by user impact and frequency | Priority queue |
| 5. Root Cause Analysis | Trace each cluster to its origin | Root cause report |
| 6. Prevention | Recommend fixes and monitoring | Action items |
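The middle stages of the pipeline above can be sketched as plain functions. This is an illustrative outline, not a real API; the `RawError` and `Cluster` shapes and the grouping key are assumptions for the example.

```typescript
// Hypothetical types for the pipeline's unified error stream and clusters.
type RawError = { service: string; message: string; timestamp: number };
type Cluster = { fingerprint: string; errors: RawError[] };

// Stage 2 (Categorization): group errors by service + first message line.
function categorize(errors: RawError[]): Cluster[] {
  const clusters = new Map<string, RawError[]>();
  for (const e of errors) {
    const key = `${e.service}:${e.message.split("\n")[0]}`;
    clusters.set(key, [...(clusters.get(key) ?? []), e]);
  }
  return [...clusters.entries()].map(([fingerprint, errs]) => ({
    fingerprint,
    errors: errs,
  }));
}

// Stage 4 (Prioritization): rank clusters by frequency; a real
// implementation would also weigh user-facing impact.
function prioritize(clusters: Cluster[]): Cluster[] {
  return [...clusters].sort((a, b) => b.errors.length - a.errors.length);
}
```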
Error Correlation Techniques
```typescript
// Temporal correlation: errors that spike together share a cause
interface ErrorCluster {
  service: string;
  error: string;
  rate: string;      // e.g., "45/min (baseline: 2/min)"
  startedAt: string; // ISO timestamp
}

interface ErrorCorrelation {
  primary: ErrorCluster;
  correlated: ErrorCluster[];
  timeWindow: string; // e.g., "within 30 seconds"
  confidence: number; // 0-1
  hypothesis: string;
}

// Example correlation analysis
const correlation: ErrorCorrelation = {
  primary: {
    service: 'payment-service',
    error: 'ConnectionTimeoutError',
    rate: '45/min (baseline: 2/min)',
    startedAt: '2026-03-14T14:23:00Z',
  },
  correlated: [
    {
      service: 'order-service',
      error: 'PaymentServiceUnavailable',
      rate: '38/min (baseline: 0/min)',
      startedAt: '2026-03-14T14:23:15Z', // 15s after primary
    },
    {
      service: 'notification-service',
      error: 'OrderConfirmationFailed',
      rate: '35/min (baseline: 0/min)',
      startedAt: '2026-03-14T14:23:45Z', // 45s after primary
    },
  ],
  timeWindow: 'within 60 seconds',
  confidence: 0.95,
  hypothesis:
    'Database connection pool exhaustion in payment-service ' +
    'cascading to dependent services',
};
```
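Detecting this kind of cascade can be automated by comparing spike start times pairwise. The helper below is a minimal sketch, not part of any real monitoring SDK: it flags every pair of services where one spike precedes another within a window, with the shortest-lag pairs listed first.

```typescript
// One error spike per service, with its start time in epoch milliseconds.
interface Spike {
  service: string;
  startedAt: number;
}

// Return lagged (cause, effect) candidates: pairs where one service's
// spike precedes another's within windowMs, sorted by lag ascending.
function correlateSpikes(
  spikes: Spike[],
  windowMs: number,
): Array<{ cause: string; effect: string; lagMs: number }> {
  const pairs: Array<{ cause: string; effect: string; lagMs: number }> = [];
  for (const a of spikes) {
    for (const b of spikes) {
      const lag = b.startedAt - a.startedAt;
      if (a !== b && lag > 0 && lag <= windowMs) {
        pairs.push({ cause: a.service, effect: b.service, lagMs: lag });
      }
    }
  }
  return pairs.sort((x, y) => x.lagMs - y.lagMs);
}
```

Applied to the payment-service example above, the shortest lag (15s) points from payment-service to order-service, matching the hypothesis that payment-service erred first.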
Error Budget Tracking
```markdown
## Service Level Objective: Order Processing

### Error Budget (30-day rolling window)

- SLO Target: 99.9% success rate
- Budget: 43.2 minutes of downtime (0.1% of 30 days)
- Consumed: 28.7 minutes (66.4% of budget)
- Remaining: 14.5 minutes (33.6% of budget)
- Burn Rate: 2.2x normal (budget will exhaust in 6.6 days at current rate)

### Top Error Budget Consumers

1. Payment timeout cascade (Mar 14): 18.2 minutes
2. Database failover (Mar 9): 6.3 minutes
3. Deployment rollback (Mar 6): 4.2 minutes
```
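The budget figures in that report follow standard error-budget accounting, sketched below with the example's own numbers (99.9% SLO, 30-day window, 28.7 minutes consumed). The function name is illustrative.

```typescript
// Compute an error budget from an SLO target and a rolling window.
// sloTarget is a fraction (0.999 = 99.9%); consumedMin is downtime so far.
function errorBudget(sloTarget: number, windowDays: number, consumedMin: number) {
  const totalMin = windowDays * 24 * 60 * (1 - sloTarget); // allowed downtime
  const remainingMin = totalMin - consumedMin;
  const consumedPct = (consumedMin / totalMin) * 100;
  return { totalMin, remainingMin, consumedPct };
}
```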
Configuration
| Parameter | Description | Default |
|---|---|---|
| log_sources | Error log sources (sentry, datadog, cloudwatch) | Auto-detect |
| time_window | Analysis time window | 24h |
| correlation_threshold | Minimum confidence for correlation | 0.7 |
| severity_filter | Minimum error severity to analyze | warning |
| include_budget | Include error budget analysis | true |
| alert_on_anomaly | Alert when error rates spike above baseline | true |
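The `alert_on_anomaly` behavior can be approximated with a rolling baseline rather than a fixed threshold. This is a hypothetical check, not the agent's actual implementation; a common rule of thumb is to alert at 3x the baseline rate.

```typescript
// Compare the current per-minute error rate against the mean of recent
// rates; alert only when it exceeds the baseline by `multiplier`.
function isAnomalous(
  recentRates: number[],
  currentRate: number,
  multiplier = 3,
): boolean {
  if (recentRates.length === 0) return false; // no baseline yet
  const baseline =
    recentRates.reduce((sum, r) => sum + r, 0) / recentRates.length;
  return currentRate >= baseline * multiplier;
}
```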
Best Practices
- Categorize errors by impact before investigating. Not all errors deserve investigation. Categorize by user-facing impact: silent errors (logged but no user effect), degraded experience (slower response, missing features), broken functionality (feature does not work), and data loss (irreversible damage). Investigate data loss and broken functionality first, regardless of error count. A single data corruption error is more urgent than thousands of retry-success warnings.
- Build error fingerprints that group related occurrences. Raw error messages with unique request IDs, timestamps, or user data create thousands of "unique" errors that are actually the same bug. Create fingerprints based on error type + stack trace + endpoint, ignoring variable data. Good fingerprinting reduces 10,000 raw errors to 15 unique issues, making prioritization possible.
- Correlate errors across time windows, not just simultaneous events. Cascading failures appear as sequential error spikes across services, not simultaneous ones. Use sliding time windows (30s, 1m, 5m) to detect correlations where service A's errors precede service B's errors by a consistent lag. The first service to error is usually closest to the root cause.
- Track error rates as percentages, not absolute counts. An endpoint with 100 errors sounds bad until you learn it handles 10 million requests (0.001% error rate). Another endpoint with 5 errors sounds fine until you learn it handles 50 requests (10% error rate). Always normalize error counts against request volume. Display error rates as percentages on dashboards and alerts.
- Establish baseline error rates and alert on deviations, not thresholds. A static threshold of "alert when errors exceed 100/minute" either alerts too often during normal traffic or misses problems during low traffic. Instead, compute a rolling baseline and alert when the current rate is 3x or more above the baseline. This adapts to traffic patterns and detects anomalies at any scale.
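The fingerprinting practice above can be sketched in a few lines. The normalization regexes here are examples only; a production fingerprinter would also strip hostnames, memory addresses, and locale-specific text, and would hash the stack trace.

```typescript
// Collapse occurrences of the same bug into one issue by stripping
// variable data (UUIDs, numbers) before keying on type + endpoint.
function fingerprint(
  errorType: string,
  message: string,
  endpoint: string,
): string {
  const normalized = message
    .replace(
      /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi,
      "<uuid>",
    )
    .replace(/\d+/g, "<n>");
  return `${errorType}|${endpoint}|${normalized}`;
}
```

With this scheme, "request 123 failed after 5000ms" and "request 456 failed after 250ms" on the same endpoint collapse into one fingerprint.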
Common Issues
Alert fatigue causes real errors to be ignored. When the monitoring system fires hundreds of alerts daily, the team stops responding. Review all alerts quarterly and delete any that have not led to an actionable investigation in the past 3 months. Consolidate related alerts into summary alerts. Use severity levels rigorously: page for service outage, notify for degradation, log for anomalies. The team should receive fewer than 5 actionable alerts per on-call shift.
Error logs lack sufficient context to diagnose issues. Generic messages like "Something went wrong" or "Request failed" without request IDs, user context, or input data make investigation impossible. Establish structured logging standards that include: request ID, user ID (hashed for privacy), endpoint, relevant input parameters, and the full error chain (including cause). Require code reviews to verify that new error handling includes sufficient context.
Cascading failures make root cause identification ambiguous. When service A fails, causing B, C, and D to fail, error dashboards show four services with problems. Without dependency mapping and temporal ordering, teams investigate all four simultaneously. Build a service dependency graph and overlay error timelines to identify the originating service. The service with the earliest error spike and no upstream errors is the likely root cause.
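The dependency-overlay heuristic described above can be sketched as follows. The data shapes are assumptions for illustration: one spike start time per erroring service, and a graph mapping each service to the services it depends on.

```typescript
// Among erroring services, pick the one with the earliest spike whose
// own dependencies are all healthy (no error spike of their own).
function likelyRootCause(
  spikes: Map<string, number>,   // service -> spike start (epoch ms)
  deps: Map<string, string[]>,   // service -> services it depends on
): string | undefined {
  const candidates = [...spikes.keys()].filter((svc) =>
    (deps.get(svc) ?? []).every((dep) => !spikes.has(dep)),
  );
  return candidates.sort((a, b) => spikes.get(a)! - spikes.get(b)!)[0];
}
```

In the payment cascade from earlier, payment-service is the only erroring service with no erroring dependency, so it is returned as the likely origin.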