
Monitoring Specialist Copilot

Boost productivity with this monitoring, observability, and infrastructure specialist. Includes structured workflows, validation checks, and reusable patterns for DevOps and infrastructure work.

AgentCliptics · devops infrastructure · v1.0.0 · MIT

Monitoring Specialist Copilot

A monitoring and observability specialist focused on metrics collection, log aggregation, alerting strategy, and performance analytics across infrastructure and application layers.

When to Use This Agent

Choose Monitoring Specialist Copilot when:

  • Setting up monitoring infrastructure (Prometheus, Grafana, Datadog)
  • Designing alerting strategies and SLO-based monitoring
  • Implementing distributed tracing and log correlation
  • Building operational dashboards for system health
  • Diagnosing monitoring gaps that allow incidents to go undetected

Consider alternatives when:

  • Debugging application-level issues (use a debugging agent)
  • Setting up CI/CD pipelines (use a DevOps agent)
  • Managing incidents (use an incident response agent)

Quick Start

```yaml
# .claude/agents/monitoring-specialist-copilot.yml
name: Monitoring Specialist Copilot
description: Monitoring, observability, and alerting strategy
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
```

Example invocation:

claude "Set up a monitoring stack with Prometheus, Grafana, and Alertmanager for our Kubernetes cluster, including SLO-based alerting for our API services"

Core Concepts

Observability Pillars

| Pillar  | Tool                   | What It Answers          |
|---------|------------------------|--------------------------|
| Metrics | Prometheus, Datadog    | "Is the system healthy?" |
| Logs    | Loki, ELK, CloudWatch  | "What happened?"         |
| Traces  | Jaeger, Tempo, Zipkin  | "Where is time spent?"   |
| Events  | PagerDuty, Rootly      | "What changed?"          |

SLO-Based Alerting

```yaml
# Prometheus alerting rules based on SLOs
groups:
  - name: slo-alerts
    rules:
      # Error rate SLO: 99.9% success rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.001
        for: 5m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error rate exceeds SLO target"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      # Latency SLO: P99 < 1s
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1.0
        for: 5m
        labels:
          severity: warning
          slo: latency

      # Error budget burn rate
      - alert: ErrorBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) / (1 - 0.999) > 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than sustainable"
```
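The 14.4 multiplier in the burn-rate rule comes from simple arithmetic: burn rate is the observed error rate divided by the budgeted rate (1 - SLO), and a budget consumed at burn rate B lasts period / B. A minimal sketch (the 30-day SLO window is an assumption; adjust to your own period):

```python
# Burn rate B = observed error rate / (1 - SLO).
# A budget consumed at rate B is exhausted in period / B.
PERIOD_HOURS = 30 * 24  # assumed 30-day SLO window = 720 hours

def hours_to_exhaustion(burn_rate: float, period_hours: float = PERIOD_HOURS) -> float:
    """Hours until the error budget runs out at a constant burn rate."""
    return period_hours / burn_rate

print(hours_to_exhaustion(14.4))  # 50.0 hours: budget gone in about two days
print(hours_to_exhaustion(1.0))   # 720.0 hours: budget lasts the full window
```

A sustained 14.4x burn therefore justifies a page, while a 1x burn is exactly on budget.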

Dashboard Design

```markdown
## Service Health Dashboard Layout

### Row 1: Golden Signals (4 panels)
| Traffic (req/s) | Error Rate (%) | Latency P95 (ms) | Saturation (%) |
|-----------------|----------------|------------------|----------------|

### Row 2: SLO Status (3 panels)
| Availability SLO | Latency SLO | Error Budget Remaining |
|------------------|-------------|------------------------|

### Row 3: Infrastructure (4 panels)
| CPU Usage | Memory Usage | Disk I/O | Network I/O |
|-----------|--------------|----------|-------------|

### Row 4: Dependencies (variable)
| Database Latency | Cache Hit Rate | Queue Depth | External API Status |
|------------------|----------------|-------------|---------------------|
```

Configuration

| Parameter        | Description                                          | Default      |
|------------------|------------------------------------------------------|--------------|
| metrics_backend  | Metrics storage (prometheus, datadog, cloudwatch)    | prometheus   |
| log_backend      | Log aggregation (loki, elasticsearch, cloudwatch)    | loki         |
| tracing_backend  | Distributed tracing (jaeger, tempo, zipkin)          | tempo        |
| dashboard_tool   | Dashboard platform (grafana, datadog, custom)        | grafana      |
| alerting_tool    | Alert management (alertmanager, pagerduty, opsgenie) | alertmanager |
| slo_target       | Default SLO availability target                      | 99.9%        |

Best Practices

  1. Monitor the four golden signals for every service: traffic, errors, latency, and saturation. These four metrics cover 90% of operational monitoring needs. Traffic shows demand, errors show failures, latency shows user experience, and saturation shows resource utilization. Start with these four for every new service before adding domain-specific metrics. A dashboard without these four is incomplete.

  2. Alert on SLO burn rate, not raw thresholds. A static alert at "error rate > 1%" fires regardless of whether the error budget can absorb it. SLO-based alerting uses burn rate: how fast the error budget is being consumed relative to the SLO target. A 14.4x burn rate alert means a 30-day error budget will be exhausted in roughly two days (720 hours / 14.4 = 50 hours), which is genuinely urgent. A brief 2x burn rate may be normal and does not warrant a page.

  3. Centralize logs with structured formatting and correlation IDs. Every log line should be structured JSON with a consistent schema: timestamp, level, service, request ID, message, and relevant context. Propagate a correlation ID (request ID) across all services in a request chain. This enables querying all logs for a single user request across 5 services with one filter: request_id = "abc123".

  4. Build dashboards for specific audiences, not generic catch-alls. A single "system health" dashboard that shows 50 panels serves no one well. Create role-specific dashboards: an on-call dashboard with alerts and SLO status, a product dashboard with business metrics, a capacity dashboard with resource trends. Each dashboard should answer the questions its audience actually asks without requiring scrolling or panel hopping.

  5. Implement distributed tracing for all inter-service communication. Without traces, debugging a slow request that spans 5 services requires correlating logs across all five. With traces, one trace view shows the entire request flow with per-service timing. Use OpenTelemetry SDKs to instrument applications automatically. The setup cost is minimal compared to the debugging time saved.
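Practice 3 above can be sketched in a few lines. This is an illustrative stand-in rather than any specific logging library's API; the field names are assumptions to align with your own log schema:

```python
import json
import time
import uuid

def make_log(service: str, level: str, message: str, request_id: str, **context) -> str:
    """Render one structured log line; every service in the chain reuses request_id."""
    record = {
        "timestamp": time.time(),
        "level": level,
        "service": service,
        "request_id": request_id,
        "message": message,
        **context,
    }
    return json.dumps(record)

# The edge service generates the correlation ID once and forwards it downstream,
# typically in an HTTP header such as X-Request-ID.
request_id = str(uuid.uuid4())
print(make_log("api-gateway", "info", "request received", request_id, path="/orders"))
print(make_log("orders-service", "info", "order created", request_id, order_id=42))
```

In Loki or Elasticsearch, retrieving the whole request chain is then a single field filter on request_id.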

Common Issues

Alert fatigue causes on-call engineers to ignore real alerts. When the team receives hundreds of alerts daily, attention degrades and real incidents are missed. Audit every alert: delete alerts that have not led to action in 3 months, increase thresholds on overly sensitive alerts, and consolidate related alerts into summary notifications. Target fewer than 5 actionable pages per on-call shift.

Dashboards show healthy metrics while users report problems. Infrastructure metrics (CPU, memory) can be green while users experience errors. This happens when monitoring measures system health but not user experience. Add synthetic monitoring (automated user journeys), real user monitoring (browser performance), and business metrics (conversion rate, order completion) alongside infrastructure metrics.

Monitoring costs grow faster than infrastructure costs. High-cardinality metrics (unique request IDs, user IDs as label values), verbose log levels in production, and long retention periods drive monitoring costs upward. Reduce cardinality by aggregating metrics at the service level rather than per-request. Set log levels to info in production (not debug). Implement tiered retention: 7 days hot, 30 days warm, 90 days cold/archived.
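The cardinality point is easy to quantify: each unique combination of label values becomes its own time series, so the series count is the product of distinct values per label. A rough sketch (the label sets are hypothetical):

```python
def series_count(label_values: dict) -> int:
    """Number of time series a metric can generate: product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(set(values))
    return total

# Putting user_id on the metric multiplies series by the user count...
per_user = {"service": ["api"], "status": ["200", "500"],
            "user_id": [f"u{i}" for i in range(10_000)]}
# ...while aggregating at the service level keeps cardinality flat.
aggregated = {"service": ["api"], "status": ["200", "500"]}

print(series_count(per_user))    # 20000
print(series_count(aggregated))  # 2
```

Dropping one high-cardinality label here cuts storage and query cost by four orders of magnitude for the same service-level signal.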
