Error Detective Mentor

Boost productivity with this agent when you need to diagnose errors. Includes structured workflows, validation checks, and reusable patterns for development tools.

Agent · Cliptics · development tools · v1.0.0 · MIT

Error Detective Mentor

A senior error analysis agent that investigates complex error patterns, correlates failures across distributed systems, and uncovers hidden root causes through log analysis, error correlation, anomaly detection, and predictive error prevention.

When to Use This Agent

Choose Error Detective Mentor when:

  • Analyzing complex error patterns that span multiple services or systems
  • Investigating error spikes or anomalies in production monitoring
  • Correlating seemingly unrelated failures to find a common root cause
  • Setting up error tracking, alerting, and categorization systems
  • Building error budgets and reliability metrics for SLOs

Consider alternatives when:

  • Debugging a single specific bug with a known reproduction (use a debugger agent)
  • Setting up monitoring infrastructure from scratch (use a DevOps agent)
  • Handling a live incident that needs immediate mitigation (use an incident response agent)

Quick Start

```yaml
# .claude/agents/error-detective-mentor.yml
name: Error Detective Mentor
description: Investigate and correlate complex error patterns
model: claude-sonnet
tools:
  - Read
  - Bash
  - Glob
  - Grep
  - WebSearch
```

Example invocation:

```shell
claude "Analyze the error logs from the past 24 hours, identify the top error patterns, correlate them across services, and determine root causes"
```

Core Concepts

Error Investigation Pipeline

| Stage | Action | Output |
| --- | --- | --- |
| 1. Collection | Gather errors from all sources | Unified error stream |
| 2. Categorization | Group errors by type, service, endpoint | Error clusters |
| 3. Correlation | Find temporal and causal relationships | Correlated groups |
| 4. Prioritization | Rank by user impact and frequency | Priority queue |
| 5. Root Cause Analysis | Trace each cluster to its origin | Root cause report |
| 6. Prevention | Recommend fixes and monitoring | Action items |
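Stages 2 and 4 of the pipeline above can be sketched as plain functions. This is an illustrative sketch only; the type names (`CollectedError`, `Cluster`) and the `service:type` grouping key are assumptions, not the agent's actual implementation.

```typescript
// Minimal sketch of categorization (stage 2) and prioritization (stage 4).
interface CollectedError {
  service: string;
  type: string;
  message: string;
  timestamp: string;
}

interface Cluster {
  fingerprint: string;
  errors: CollectedError[];
}

// Group errors into clusters by a simplified fingerprint (service + type).
function categorize(errors: CollectedError[]): Cluster[] {
  const groups = new Map<string, CollectedError[]>();
  for (const e of errors) {
    const key = `${e.service}:${e.type}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key)!.push(e);
  }
  return [...groups.entries()].map(([fingerprint, errs]) => ({
    fingerprint,
    errors: errs,
  }));
}

// Rank clusters by frequency; a real pipeline would also weight user impact.
function prioritize(clusters: Cluster[]): Cluster[] {
  return [...clusters].sort((a, b) => b.errors.length - a.errors.length);
}
```

A real categorizer would use full fingerprints (see Best Practices below); the frequency sort here is only one input to prioritization.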

Error Correlation Techniques

```typescript
// Temporal correlation: errors that spike together share a cause
interface ErrorCluster {
  service: string;
  error: string;
  rate: string;
  startedAt: string;
}

interface ErrorCorrelation {
  primary: ErrorCluster;
  correlated: ErrorCluster[];
  timeWindow: string; // e.g., "within 30 seconds"
  confidence: number; // 0-1
  hypothesis: string;
}

// Example correlation analysis
const correlation: ErrorCorrelation = {
  primary: {
    service: 'payment-service',
    error: 'ConnectionTimeoutError',
    rate: '45/min (baseline: 2/min)',
    startedAt: '2026-03-14T14:23:00Z',
  },
  correlated: [
    {
      service: 'order-service',
      error: 'PaymentServiceUnavailable',
      rate: '38/min (baseline: 0/min)',
      startedAt: '2026-03-14T14:23:15Z', // 15s after primary
    },
    {
      service: 'notification-service',
      error: 'OrderConfirmationFailed',
      rate: '35/min (baseline: 0/min)',
      startedAt: '2026-03-14T14:23:45Z', // 45s after primary
    },
  ],
  timeWindow: 'within 60 seconds',
  confidence: 0.95,
  hypothesis:
    'Database connection pool exhaustion in payment-service ' +
    'cascading to dependent services',
};
```
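The 15s and 45s lags in the example above can be derived mechanically from the spike start times. The helper below is an illustrative sketch (`lagsFromPrimary` is a hypothetical name, not a real API): sort spiking clusters by start time and report each one's lag behind the earliest spike.

```typescript
// Order spiking clusters by start time and compute lag behind the earliest.
interface SpikingCluster {
  service: string;
  startedAt: string; // ISO 8601 timestamp
}

function lagsFromPrimary(
  clusters: SpikingCluster[],
): { service: string; lagSeconds: number }[] {
  const sorted = [...clusters].sort(
    (a, b) => Date.parse(a.startedAt) - Date.parse(b.startedAt),
  );
  const origin = Date.parse(sorted[0].startedAt);
  return sorted.map((c) => ({
    service: c.service,
    lagSeconds: (Date.parse(c.startedAt) - origin) / 1000,
  }));
}
```

Consistent, nonzero lags across repeated incidents are what distinguish a cascade from a shared external cause, where lags would cluster near zero.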

Error Budget Tracking

```markdown
## Service Level Objective: Order Processing

### Error Budget (30-day rolling window)
- SLO Target: 99.9% success rate
- Budget: 43.2 minutes of downtime (0.1% of 30 days)
- Consumed: 28.7 minutes (66.4% of budget)
- Remaining: 14.5 minutes (33.6% of budget)
- Burn Rate: 2.2x normal (budget will exhaust in 6.6 days at current rate)

### Top Error Budget Consumers
1. Payment timeout cascade (Mar 14): 18.2 minutes
2. Database failover (Mar 9): 6.3 minutes
3. Deployment rollback (Mar 6): 4.2 minutes
```
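The budget figures in the report above follow standard SRE-style error budget arithmetic. The sketch below is an illustrative reading of those formulas (`errorBudget` is a hypothetical helper, not part of the agent): a 30-day window has 43,200 minutes, so a 99.9% SLO leaves 0.1% of that, 43.2 minutes, as budget.

```typescript
// Error budget arithmetic for a rolling window.
function errorBudget(
  sloTarget: number,     // e.g., 0.999 for 99.9%
  windowDays: number,    // e.g., 30
  consumedMinutes: number,
) {
  const windowMinutes = windowDays * 24 * 60;            // 30 days -> 43,200 min
  const budgetMinutes = windowMinutes * (1 - sloTarget); // 0.1% -> 43.2 min
  return {
    budgetMinutes,
    consumedPct: (consumedMinutes / budgetMinutes) * 100,
    remainingMinutes: budgetMinutes - consumedMinutes,
  };
}
```

Plugging in the report's numbers (`errorBudget(0.999, 30, 28.7)`) reproduces the 43.2-minute budget, roughly 66.4% consumed, and 14.5 minutes remaining.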

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| log_sources | Error log sources (sentry, datadog, cloudwatch) | Auto-detect |
| time_window | Analysis time window | 24h |
| correlation_threshold | Minimum confidence for correlation | 0.7 |
| severity_filter | Minimum error severity to analyze | warning |
| include_budget | Include error budget analysis | true |
| alert_on_anomaly | Alert when error rates spike above baseline | true |

Best Practices

  1. Categorize errors by impact before investigating. Not all errors deserve investigation. Categorize by user-facing impact: silent errors (logged but no user effect), degraded experience (slower response, missing features), broken functionality (feature does not work), and data loss (irreversible damage). Investigate data loss and broken functionality first, regardless of error count. A single data corruption error is more urgent than thousands of retry-success warnings.

  2. Build error fingerprints that group related occurrences. Raw error messages with unique request IDs, timestamps, or user data create thousands of "unique" errors that are actually the same bug. Create fingerprints based on error type + stack trace + endpoint, ignoring variable data. Good fingerprinting reduces 10,000 raw errors to 15 unique issues, making prioritization possible.

  3. Correlate errors across time windows, not just simultaneous events. Cascading failures appear as sequential error spikes across services, not simultaneous ones. Use sliding time windows (30s, 1m, 5m) to detect correlations where service A's errors precede service B's errors by a consistent lag. The first service to error is usually closest to the root cause.

  4. Track error rates as percentages, not absolute counts. An endpoint with 100 errors sounds bad until you learn it handles 10 million requests (0.001% error rate). Another endpoint with 5 errors sounds fine until you learn it handles 50 requests (10% error rate). Always normalize error counts against request volume. Display error rates as percentages on dashboards and alerts.

  5. Establish baseline error rates and alert on deviations, not thresholds. A static threshold of "alert when errors exceed 100/minute" either alerts too often during normal traffic or misses problems during low traffic. Instead, compute a rolling baseline and alert when the current rate is 3x or more above the baseline. This adapts to traffic patterns and detects anomalies at any scale.
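The fingerprinting approach in practice 2 can be sketched as a normalization step that strips variable data before building a grouping key. Everything here (the `fingerprint` helper and the specific regexes) is an illustrative assumption, not the agent's actual implementation.

```typescript
// Build a stable fingerprint: error type + endpoint + a message with
// variable data (UUIDs, numbers, quoted strings) replaced by placeholders.
function fingerprint(
  errorType: string,
  endpoint: string,
  message: string,
): string {
  const normalized = message
    .replace(
      /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi,
      '<uuid>',
    )
    .replace(/\b\d+\b/g, '<n>')
    .replace(/"[^"]*"/g, '<str>');
  return `${errorType}|${endpoint}|${normalized}`;
}
```

With this scheme, `Request 123 for "abc" failed` and `Request 456 for "xyz" failed` on the same endpoint collapse into a single issue instead of two.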
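The rolling-baseline alerting in practice 5 reduces to a small check; the sketch below assumes a simple rolling mean and the 3x multiplier mentioned above (`shouldAlert` is a hypothetical name, and real systems often use percentile or seasonal baselines instead).

```typescript
// Alert when the current error rate deviates 3x or more from a rolling
// baseline, rather than crossing a static threshold.
function shouldAlert(
  history: number[],   // recent per-interval error rates
  current: number,     // current interval's error rate
  multiplier = 3,
): boolean {
  if (history.length === 0) return false;
  const baseline = history.reduce((sum, v) => sum + v, 0) / history.length;
  return baseline > 0 && current >= multiplier * baseline;
}
```

Because the baseline tracks recent traffic, the same check fires for a jump from 2/min to 9/min on a quiet service and from 200/min to 900/min on a busy one.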

Common Issues

Alert fatigue causes real errors to be ignored. When the monitoring system fires hundreds of alerts daily, the team stops responding. Review all alerts quarterly and delete any that have not led to an actionable investigation in the past 3 months. Consolidate related alerts into summary alerts. Use severity levels rigorously: page for service outage, notify for degradation, log for anomalies. The team should receive fewer than 5 actionable alerts per on-call shift.

Error logs lack sufficient context to diagnose issues. Generic messages like "Something went wrong" or "Request failed" without request IDs, user context, or input data make investigation impossible. Establish structured logging standards that include: request ID, user ID (hashed for privacy), endpoint, relevant input parameters, and the full error chain (including cause). Require code reviews to verify that new error handling includes sufficient context.
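A structured log entry carrying the context fields listed above might look like the sketch below. This is an illustrative shape, not a specific logging library's API; `errorLogEntry` is a hypothetical helper, and only the hashing uses a real Node.js built-in (`node:crypto`).

```typescript
import { createHash } from 'node:crypto';

// Build a structured error log entry with request ID, hashed user ID,
// endpoint, input parameters, and the full error chain.
function errorLogEntry(
  err: Error,
  ctx: {
    requestId: string;
    userId: string;
    endpoint: string;
    params: Record<string, unknown>;
  },
) {
  return {
    level: 'error',
    requestId: ctx.requestId,
    // Hash the user ID so logs remain correlatable without exposing identity.
    userId: createHash('sha256').update(ctx.userId).digest('hex'),
    endpoint: ctx.endpoint,
    params: ctx.params,
    error: {
      name: err.name,
      message: err.message,
      cause: (err as Error & { cause?: unknown }).cause ?? null,
      stack: err.stack,
    },
  };
}
```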

Cascading failures make root cause identification ambiguous. When service A fails, causing B, C, and D to fail, error dashboards show four services with problems. Without dependency mapping and temporal ordering, teams investigate all four simultaneously. Build a service dependency graph and overlay error timelines to identify the originating service. The service with the earliest error spike and no upstream errors is the likely root cause.
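The heuristic above, earliest spike with no spiking upstream dependency, can be sketched as a small graph check. The data shapes (`ServiceSpike`, the `dependsOn` adjacency map) and the `likelyRootCause` helper are assumptions for illustration.

```typescript
// Among spiking services, pick the one with the earliest spike whose
// upstream dependencies are not themselves spiking.
interface ServiceSpike {
  service: string;
  spikeStart: number; // epoch milliseconds
}

function likelyRootCause(
  spikes: ServiceSpike[],
  dependsOn: Record<string, string[]>, // service -> upstream services it calls
): string | null {
  const spiking = new Set(spikes.map((s) => s.service));
  const candidates = spikes.filter(
    (s) => !(dependsOn[s.service] ?? []).some((up) => spiking.has(up)),
  );
  candidates.sort((a, b) => a.spikeStart - b.spikeStart);
  return candidates[0]?.service ?? null;
}
```

In the payment cascade example, order-service and notification-service are excluded because their upstreams are spiking, leaving payment-service as the candidate closest to the root cause.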
