
Monitoring Specialist Copilot

Boost productivity with this monitoring, observability, and infrastructure specialist. Includes structured workflows, validation checks, and reusable patterns for DevOps and infrastructure work.

AgentCliptics · devops infrastructure · v1.0.0 · MIT

Monitoring Specialist Copilot

A monitoring and observability specialist focused on metrics collection, log aggregation, alerting strategy, and performance analytics across infrastructure and application layers.

When to Use This Agent

Choose Monitoring Specialist Copilot when:

  • Setting up monitoring infrastructure (Prometheus, Grafana, Datadog)
  • Designing alerting strategies and SLO-based monitoring
  • Implementing distributed tracing and log correlation
  • Building operational dashboards for system health
  • Diagnosing monitoring gaps that allow incidents to go undetected

Consider alternatives when:

  • Debugging application-level issues (use a debugging agent)
  • Setting up CI/CD pipelines (use a DevOps agent)
  • Managing incidents (use an incident response agent)

Quick Start

```yaml
# .claude/agents/monitoring-specialist-copilot.yml
name: Monitoring Specialist Copilot
description: Monitoring, observability, and alerting strategy
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
```

Example invocation:

claude "Set up a monitoring stack with Prometheus, Grafana, and Alertmanager for our Kubernetes cluster, including SLO-based alerting for our API services"

Core Concepts

Observability Pillars

| Pillar  | Tool                   | What It Answers          |
|---------|------------------------|--------------------------|
| Metrics | Prometheus, Datadog    | "Is the system healthy?" |
| Logs    | Loki, ELK, CloudWatch  | "What happened?"         |
| Traces  | Jaeger, Tempo, Zipkin  | "Where is time spent?"   |
| Events  | PagerDuty, Rootly      | "What changed?"          |

SLO-Based Alerting

```yaml
# Prometheus alerting rules based on SLOs
groups:
  - name: slo-alerts
    rules:
      # Error rate SLO: 99.9% success rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.001
        for: 5m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error rate exceeds SLO target"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      # Latency SLO: P99 < 1s
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1.0
        for: 5m
        labels:
          severity: warning
          slo: latency

      # Error budget burn rate
      - alert: ErrorBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) / (1 - 0.999) > 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than sustainable"
```
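The 14.4 multiplier in the burn-rate rule comes from simple arithmetic: burn rate is the observed error rate divided by the budgeted rate (1 - SLO), and a budget consumed at burn rate B lasts period / B. A minimal sketch (the 30-day SLO window is an assumption; adjust to your own period):

```python
# Burn rate B = observed error rate / (1 - SLO).
# A budget consumed at rate B is exhausted in period / B.
PERIOD_HOURS = 30 * 24  # assumed 30-day SLO window = 720 hours

def hours_to_exhaustion(burn_rate: float, period_hours: float = PERIOD_HOURS) -> float:
    """Hours until the error budget runs out at a constant burn rate."""
    return period_hours / burn_rate

print(hours_to_exhaustion(14.4))  # 50.0 hours: budget gone in about two days
print(hours_to_exhaustion(1.0))   # 720.0 hours: budget lasts the full window
```

A sustained 14.4x burn therefore justifies a page, while a 1x burn is exactly on budget.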

Dashboard Design

```markdown
## Service Health Dashboard Layout

### Row 1: Golden Signals (4 panels)
| Traffic (req/s) | Error Rate (%) | Latency P95 (ms) | Saturation (%) |
|-----------------|----------------|------------------|----------------|

### Row 2: SLO Status (3 panels)
| Availability SLO | Latency SLO | Error Budget Remaining |
|------------------|-------------|------------------------|

### Row 3: Infrastructure (4 panels)
| CPU Usage | Memory Usage | Disk I/O | Network I/O |
|-----------|--------------|----------|-------------|

### Row 4: Dependencies (variable)
| Database Latency | Cache Hit Rate | Queue Depth | External API Status |
|------------------|----------------|-------------|---------------------|
```

Configuration

| Parameter        | Description                                          | Default      |
|------------------|------------------------------------------------------|--------------|
| metrics_backend  | Metrics storage (prometheus, datadog, cloudwatch)    | prometheus   |
| log_backend      | Log aggregation (loki, elasticsearch, cloudwatch)    | loki         |
| tracing_backend  | Distributed tracing (jaeger, tempo, zipkin)          | tempo        |
| dashboard_tool   | Dashboard platform (grafana, datadog, custom)        | grafana      |
| alerting_tool    | Alert management (alertmanager, pagerduty, opsgenie) | alertmanager |
| slo_target       | Default SLO availability target                      | 99.9%        |

Best Practices

  1. Monitor the four golden signals for every service: traffic, errors, latency, and saturation. These four metrics cover 90% of operational monitoring needs. Traffic shows demand, errors show failures, latency shows user experience, and saturation shows resource utilization. Start with these four for every new service before adding domain-specific metrics. A dashboard without these four is incomplete.

  2. Alert on SLO burn rate, not raw thresholds. A static alert at "error rate > 1%" fires regardless of whether the error budget can absorb it. SLO-based alerting uses burn rate: how fast the error budget is being consumed relative to the SLO target. A 14.4x burn rate alert means a 30-day error budget will be exhausted in roughly two days (720 hours / 14.4 = 50 hours), which is genuinely urgent. A brief 2x burn rate may be normal and does not warrant a page.

  3. Centralize logs with structured formatting and correlation IDs. Every log line should be structured JSON with a consistent schema: timestamp, level, service, request ID, message, and relevant context. Propagate a correlation ID (request ID) across all services in a request chain. This enables querying all logs for a single user request across 5 services with one filter: request_id = "abc123".

  4. Build dashboards for specific audiences, not generic catch-alls. A single "system health" dashboard that shows 50 panels serves no one well. Create role-specific dashboards: an on-call dashboard with alerts and SLO status, a product dashboard with business metrics, a capacity dashboard with resource trends. Each dashboard should answer the questions its audience actually asks without requiring scrolling or panel hopping.

  5. Implement distributed tracing for all inter-service communication. Without traces, debugging a slow request that spans 5 services requires correlating logs across all five. With traces, one trace view shows the entire request flow with per-service timing. Use OpenTelemetry SDKs to instrument applications automatically. The setup cost is minimal compared to the debugging time saved.
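Practice 3 above can be sketched in a few lines. This is an illustrative stand-in rather than any specific logging library's API; the field names are assumptions to align with your own log schema:

```python
import json
import time
import uuid

def make_log(service: str, level: str, message: str, request_id: str, **context) -> str:
    """Render one structured log line; every service in the chain reuses request_id."""
    record = {
        "timestamp": time.time(),
        "level": level,
        "service": service,
        "request_id": request_id,
        "message": message,
        **context,
    }
    return json.dumps(record)

# The edge service generates the correlation ID once and forwards it downstream,
# typically in an HTTP header such as X-Request-ID.
request_id = str(uuid.uuid4())
print(make_log("api-gateway", "info", "request received", request_id, path="/orders"))
print(make_log("orders-service", "info", "order created", request_id, order_id=42))
```

In Loki or Elasticsearch, retrieving the whole request chain is then a single field filter on request_id.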

Common Issues

Alert fatigue causes on-call engineers to ignore real alerts. When the team receives hundreds of alerts daily, attention degrades and real incidents are missed. Audit every alert: delete alerts that have not led to action in 3 months, increase thresholds on overly sensitive alerts, and consolidate related alerts into summary notifications. Target fewer than 5 actionable pages per on-call shift.

Dashboards show healthy metrics while users report problems. Infrastructure metrics (CPU, memory) can be green while users experience errors. This happens when monitoring measures system health but not user experience. Add synthetic monitoring (automated user journeys), real user monitoring (browser performance), and business metrics (conversion rate, order completion) alongside infrastructure metrics.

Monitoring costs grow faster than infrastructure costs. High-cardinality metrics (unique request IDs, user IDs as label values), verbose log levels in production, and long retention periods drive monitoring costs upward. Reduce cardinality by aggregating metrics at the service level rather than per-request. Set log levels to info in production (not debug). Implement tiered retention: 7 days hot, 30 days warm, 90 days cold/archived.
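The cardinality point is easy to quantify: each unique combination of label values becomes its own time series, so the series count is the product of distinct values per label. A rough sketch (the label sets are hypothetical):

```python
def series_count(label_values: dict) -> int:
    """Number of time series a metric can generate: product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(set(values))
    return total

# Putting user_id on the metric multiplies series by the user count...
per_user = {"service": ["api"], "status": ["200", "500"],
            "user_id": [f"u{i}" for i in range(10_000)]}
# ...while aggregating at the service level keeps cardinality flat.
aggregated = {"service": ["api"], "status": ["200", "500"]}

print(series_count(per_user))    # 20000
print(series_count(aggregated))  # 2
```

Dropping one high-cardinality label here cuts storage and query cost by four orders of magnitude for the same service-level signal.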
