Error Detective Mentor

Boost productivity with this agent when you need to diagnose errors. Includes structured workflows, validation checks, and reusable patterns for development tools.

Agent · Cliptics · development tools · v1.0.0 · MIT

Error Detective Mentor

A senior error analysis agent that investigates complex error patterns, correlates failures across distributed systems, and uncovers hidden root causes through log analysis, error correlation, anomaly detection, and predictive error prevention.

When to Use This Agent

Choose Error Detective Mentor when:

  • Analyzing complex error patterns that span multiple services or systems
  • Investigating error spikes or anomalies in production monitoring
  • Correlating seemingly unrelated failures to find a common root cause
  • Setting up error tracking, alerting, and categorization systems
  • Building error budgets and reliability metrics for SLOs

Consider alternatives when:

  • Debugging a single specific bug with a known reproduction (use a debugger agent)
  • Setting up monitoring infrastructure from scratch (use a DevOps agent)
  • Handling a live incident that needs immediate mitigation (use an incident response agent)

Quick Start

```yaml
# .claude/agents/error-detective-mentor.yml
name: Error Detective Mentor
description: Investigate and correlate complex error patterns
model: claude-sonnet
tools:
  - Read
  - Bash
  - Glob
  - Grep
  - WebSearch
```

Example invocation:

```shell
claude "Analyze the error logs from the past 24 hours, identify the top error patterns, correlate them across services, and determine root causes"
```

Core Concepts

Error Investigation Pipeline

| Stage | Action | Output |
| --- | --- | --- |
| 1. Collection | Gather errors from all sources | Unified error stream |
| 2. Categorization | Group errors by type, service, endpoint | Error clusters |
| 3. Correlation | Find temporal and causal relationships | Correlated groups |
| 4. Prioritization | Rank by user impact and frequency | Priority queue |
| 5. Root Cause Analysis | Trace each cluster to its origin | Root cause report |
| 6. Prevention | Recommend fixes and monitoring | Action items |
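Stages 2 and 4 of the pipeline above can be sketched as plain functions. This is an illustrative sketch only; the type names (`CollectedError`, `Cluster`) and the `service:type` grouping key are assumptions, not the agent's actual implementation.

```typescript
// Minimal sketch of categorization (stage 2) and prioritization (stage 4).
interface CollectedError {
  service: string;
  type: string;
  message: string;
  timestamp: string;
}

interface Cluster {
  fingerprint: string;
  errors: CollectedError[];
}

// Group errors into clusters by a simplified fingerprint (service + type).
function categorize(errors: CollectedError[]): Cluster[] {
  const groups = new Map<string, CollectedError[]>();
  for (const e of errors) {
    const key = `${e.service}:${e.type}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key)!.push(e);
  }
  return [...groups.entries()].map(([fingerprint, errs]) => ({
    fingerprint,
    errors: errs,
  }));
}

// Rank clusters by frequency; a real pipeline would also weight user impact.
function prioritize(clusters: Cluster[]): Cluster[] {
  return [...clusters].sort((a, b) => b.errors.length - a.errors.length);
}
```

A real categorizer would use full fingerprints (see Best Practices below); the frequency sort here is only one input to prioritization.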

Error Correlation Techniques

```typescript
// Temporal correlation: errors that spike together share a cause
interface ErrorCluster {
  service: string;
  error: string;
  rate: string;
  startedAt: string;
}

interface ErrorCorrelation {
  primary: ErrorCluster;
  correlated: ErrorCluster[];
  timeWindow: string; // e.g., "within 30 seconds"
  confidence: number; // 0-1
  hypothesis: string;
}

// Example correlation analysis
const correlation: ErrorCorrelation = {
  primary: {
    service: 'payment-service',
    error: 'ConnectionTimeoutError',
    rate: '45/min (baseline: 2/min)',
    startedAt: '2026-03-14T14:23:00Z',
  },
  correlated: [
    {
      service: 'order-service',
      error: 'PaymentServiceUnavailable',
      rate: '38/min (baseline: 0/min)',
      startedAt: '2026-03-14T14:23:15Z', // 15s after primary
    },
    {
      service: 'notification-service',
      error: 'OrderConfirmationFailed',
      rate: '35/min (baseline: 0/min)',
      startedAt: '2026-03-14T14:23:45Z', // 45s after primary
    },
  ],
  timeWindow: 'within 60 seconds',
  confidence: 0.95,
  hypothesis:
    'Database connection pool exhaustion in payment-service ' +
    'cascading to dependent services',
};
```
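The 15s and 45s lags in the example above can be derived mechanically from the spike start times. The helper below is an illustrative sketch (`lagsFromPrimary` is a hypothetical name, not a real API): sort spiking clusters by start time and report each one's lag behind the earliest spike.

```typescript
// Order spiking clusters by start time and compute lag behind the earliest.
interface SpikingCluster {
  service: string;
  startedAt: string; // ISO 8601 timestamp
}

function lagsFromPrimary(
  clusters: SpikingCluster[],
): { service: string; lagSeconds: number }[] {
  const sorted = [...clusters].sort(
    (a, b) => Date.parse(a.startedAt) - Date.parse(b.startedAt),
  );
  const origin = Date.parse(sorted[0].startedAt);
  return sorted.map((c) => ({
    service: c.service,
    lagSeconds: (Date.parse(c.startedAt) - origin) / 1000,
  }));
}
```

Consistent, nonzero lags across repeated incidents are what distinguish a cascade from a shared external cause, where lags would cluster near zero.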

Error Budget Tracking

```markdown
## Service Level Objective: Order Processing

### Error Budget (30-day rolling window)
- SLO Target: 99.9% success rate
- Budget: 43.2 minutes of downtime (0.1% of 30 days)
- Consumed: 28.7 minutes (66.4% of budget)
- Remaining: 14.5 minutes (33.6% of budget)
- Burn Rate: 2.2x normal (budget will exhaust in 6.6 days at current rate)

### Top Error Budget Consumers
1. Payment timeout cascade (Mar 14): 18.2 minutes
2. Database failover (Mar 9): 6.3 minutes
3. Deployment rollback (Mar 6): 4.2 minutes
```
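The budget figures in the report above follow standard SRE-style error budget arithmetic. The sketch below is an illustrative reading of those formulas (`errorBudget` is a hypothetical helper, not part of the agent): a 30-day window has 43,200 minutes, so a 99.9% SLO leaves 0.1% of that, 43.2 minutes, as budget.

```typescript
// Error budget arithmetic for a rolling window.
function errorBudget(
  sloTarget: number,     // e.g., 0.999 for 99.9%
  windowDays: number,    // e.g., 30
  consumedMinutes: number,
) {
  const windowMinutes = windowDays * 24 * 60;            // 30 days -> 43,200 min
  const budgetMinutes = windowMinutes * (1 - sloTarget); // 0.1% -> 43.2 min
  return {
    budgetMinutes,
    consumedPct: (consumedMinutes / budgetMinutes) * 100,
    remainingMinutes: budgetMinutes - consumedMinutes,
  };
}
```

Plugging in the report's numbers (`errorBudget(0.999, 30, 28.7)`) reproduces the 43.2-minute budget, roughly 66.4% consumed, and 14.5 minutes remaining.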

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| log_sources | Error log sources (sentry, datadog, cloudwatch) | Auto-detect |
| time_window | Analysis time window | 24h |
| correlation_threshold | Minimum confidence for correlation | 0.7 |
| severity_filter | Minimum error severity to analyze | warning |
| include_budget | Include error budget analysis | true |
| alert_on_anomaly | Alert when error rates spike above baseline | true |

Best Practices

  1. Categorize errors by impact before investigating. Not all errors deserve investigation. Categorize by user-facing impact: silent errors (logged but no user effect), degraded experience (slower response, missing features), broken functionality (feature does not work), and data loss (irreversible damage). Investigate data loss and broken functionality first, regardless of error count. A single data corruption error is more urgent than thousands of retry-success warnings.

  2. Build error fingerprints that group related occurrences. Raw error messages with unique request IDs, timestamps, or user data create thousands of "unique" errors that are actually the same bug. Create fingerprints based on error type + stack trace + endpoint, ignoring variable data. Good fingerprinting reduces 10,000 raw errors to 15 unique issues, making prioritization possible.

  3. Correlate errors across time windows, not just simultaneous events. Cascading failures appear as sequential error spikes across services, not simultaneous ones. Use sliding time windows (30s, 1m, 5m) to detect correlations where service A's errors precede service B's errors by a consistent lag. The first service to error is usually closest to the root cause.

  4. Track error rates as percentages, not absolute counts. An endpoint with 100 errors sounds bad until you learn it handles 10 million requests (0.001% error rate). Another endpoint with 5 errors sounds fine until you learn it handles 50 requests (10% error rate). Always normalize error counts against request volume. Display error rates as percentages on dashboards and alerts.

  5. Establish baseline error rates and alert on deviations, not thresholds. A static threshold of "alert when errors exceed 100/minute" either alerts too often during normal traffic or misses problems during low traffic. Instead, compute a rolling baseline and alert when the current rate is 3x or more above the baseline. This adapts to traffic patterns and detects anomalies at any scale.
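The fingerprinting approach in practice 2 can be sketched as a normalization step that strips variable data before building a grouping key. Everything here (the `fingerprint` helper and the specific regexes) is an illustrative assumption, not the agent's actual implementation.

```typescript
// Build a stable fingerprint: error type + endpoint + a message with
// variable data (UUIDs, numbers, quoted strings) replaced by placeholders.
function fingerprint(
  errorType: string,
  endpoint: string,
  message: string,
): string {
  const normalized = message
    .replace(
      /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi,
      '<uuid>',
    )
    .replace(/\b\d+\b/g, '<n>')
    .replace(/"[^"]*"/g, '<str>');
  return `${errorType}|${endpoint}|${normalized}`;
}
```

With this scheme, `Request 123 for "abc" failed` and `Request 456 for "xyz" failed` on the same endpoint collapse into a single issue instead of two.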
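The rolling-baseline alerting in practice 5 reduces to a small check; the sketch below assumes a simple rolling mean and the 3x multiplier mentioned above (`shouldAlert` is a hypothetical name, and real systems often use percentile or seasonal baselines instead).

```typescript
// Alert when the current error rate deviates 3x or more from a rolling
// baseline, rather than crossing a static threshold.
function shouldAlert(
  history: number[],   // recent per-interval error rates
  current: number,     // current interval's error rate
  multiplier = 3,
): boolean {
  if (history.length === 0) return false;
  const baseline = history.reduce((sum, v) => sum + v, 0) / history.length;
  return baseline > 0 && current >= multiplier * baseline;
}
```

Because the baseline tracks recent traffic, the same check fires for a jump from 2/min to 9/min on a quiet service and from 200/min to 900/min on a busy one.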

Common Issues

Alert fatigue causes real errors to be ignored. When the monitoring system fires hundreds of alerts daily, the team stops responding. Review all alerts quarterly and delete any that have not led to an actionable investigation in the past 3 months. Consolidate related alerts into summary alerts. Use severity levels rigorously: page for service outage, notify for degradation, log for anomalies. The team should receive fewer than 5 actionable alerts per on-call shift.

Error logs lack sufficient context to diagnose issues. Generic messages like "Something went wrong" or "Request failed" without request IDs, user context, or input data make investigation impossible. Establish structured logging standards that include: request ID, user ID (hashed for privacy), endpoint, relevant input parameters, and the full error chain (including cause). Require code reviews to verify that new error handling includes sufficient context.
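A structured log entry carrying the context fields listed above might look like the sketch below. This is an illustrative shape, not a specific logging library's API; `errorLogEntry` is a hypothetical helper, and only the hashing uses a real Node.js built-in (`node:crypto`).

```typescript
import { createHash } from 'node:crypto';

// Build a structured error log entry with request ID, hashed user ID,
// endpoint, input parameters, and the full error chain.
function errorLogEntry(
  err: Error,
  ctx: {
    requestId: string;
    userId: string;
    endpoint: string;
    params: Record<string, unknown>;
  },
) {
  return {
    level: 'error',
    requestId: ctx.requestId,
    // Hash the user ID so logs remain correlatable without exposing identity.
    userId: createHash('sha256').update(ctx.userId).digest('hex'),
    endpoint: ctx.endpoint,
    params: ctx.params,
    error: {
      name: err.name,
      message: err.message,
      cause: (err as Error & { cause?: unknown }).cause ?? null,
      stack: err.stack,
    },
  };
}
```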

Cascading failures make root cause identification ambiguous. When service A fails, causing B, C, and D to fail, error dashboards show four services with problems. Without dependency mapping and temporal ordering, teams investigate all four simultaneously. Build a service dependency graph and overlay error timelines to identify the originating service. The service with the earliest error spike and no upstream errors is the likely root cause.
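The heuristic above, earliest spike with no spiking upstream dependency, can be sketched as a small graph check. The data shapes (`ServiceSpike`, the `dependsOn` adjacency map) and the `likelyRootCause` helper are assumptions for illustration.

```typescript
// Among spiking services, pick the one with the earliest spike whose
// upstream dependencies are not themselves spiking.
interface ServiceSpike {
  service: string;
  spikeStart: number; // epoch milliseconds
}

function likelyRootCause(
  spikes: ServiceSpike[],
  dependsOn: Record<string, string[]>, // service -> upstream services it calls
): string | null {
  const spiking = new Set(spikes.map((s) => s.service));
  const candidates = spikes.filter(
    (s) => !(dependsOn[s.service] ?? []).some((up) => spiking.has(up)),
  );
  candidates.sort((a, b) => a.spikeStart - b.spikeStart);
  return candidates[0]?.service ?? null;
}
```

In the payment cascade example, order-service and notification-service are excluded because their upstreams are spiking, leaving payment-service as the candidate closest to the root cause.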
