
Performance Monitor Architect

Your agent for designing and implementing application performance monitoring β€” covering metrics collection, dashboard design, anomaly detection, alerting, and performance optimization insights.

When to Use This Agent

Choose Performance Monitor Architect when:

  • Designing performance monitoring strategies for applications and services
  • Setting up metrics collection, dashboards, and alerting pipelines
  • Implementing APM (Application Performance Monitoring) solutions
  • Analyzing performance data to identify bottlenecks and optimization targets
  • Building custom monitoring tools or integrating with existing platforms

Consider alternatives when:

  • You need infrastructure monitoring (servers, networks) β€” use an SRE agent
  • You need to fix performance issues β€” use a performance engineering agent
  • You need logging and error tracking β€” use an observability agent

Quick Start

```yaml
# .claude/agents/performance-monitor.yml
name: Performance Monitor Architect
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
description: APM architect for metrics design, dashboard creation, anomaly detection, and performance optimization
```

Example invocation:

```shell
claude "Design a comprehensive monitoring setup for our microservices — we need request latency tracking, error rate dashboards, throughput metrics, and anomaly-based alerting"
```

Core Concepts

Monitoring Pillars

| Pillar | What to Measure | Tools |
|---|---|---|
| Metrics | Latency, throughput, error rate, saturation | Prometheus, Datadog, CloudWatch |
| Logs | Structured events, errors, audit trails | ELK Stack, Loki, Splunk |
| Traces | Request flow across services | Jaeger, Zipkin, AWS X-Ray |
| Profiles | CPU, memory, I/O allocation by function | Pyroscope, Pprof, async-profiler |

Golden Signals

Latency:    Time to serve a request
            β”œβ”€β”€ p50, p95, p99 percentiles
            └── Differentiate success vs error latency

Traffic:    Demand on the system
            β”œβ”€β”€ Requests per second
            └── Concurrent connections

Errors:     Rate of failed requests
            β”œβ”€β”€ HTTP 5xx rate
            └── Business logic error rate

Saturation: How full the system is
            β”œβ”€β”€ CPU, memory, disk utilization
            └── Queue depth, connection pool usage
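The latency percentiles above can be computed from raw samples with the nearest-rank method; a minimal pure-Python sketch, not tied to any particular APM library (the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: ceil(p/100 * n), 1-indexed.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [12, 15, 11, 14, 250, 13, 16, 12, 14, 900]  # ms
print(percentile(latencies, 50))  # 14 — typical request
print(percentile(latencies, 99))  # 900 — the outlier an average would hide
```

Note the contrast with the mean of the same samples (about 126 ms), which describes no real request.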

Configuration

| Parameter | Description | Default |
|---|---|---|
| `monitoring_platform` | APM tool (`prometheus`, `datadog`, `newrelic`, `cloudwatch`) | `prometheus` |
| `dashboard_tool` | Visualization (`grafana`, `datadog`, `cloudwatch-dashboards`) | `grafana` |
| `alert_channel` | Alert destination (`pagerduty`, `opsgenie`, `slack`) | `pagerduty` |
| `retention_period` | Metrics retention duration | `30d` |
| `sampling_rate` | Trace sampling rate for high-traffic services | `10%` |
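These parameters might be expressed alongside the Quick Start agent definition like so (a hypothetical fragment; the key names simply mirror the table above):

```yaml
# .claude/agents/performance-monitor.yml — hypothetical configuration keys
monitoring_platform: prometheus
dashboard_tool: grafana
alert_channel: pagerduty
retention_period: 30d
sampling_rate: 0.10   # 10% head-based trace sampling
```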

Best Practices

  1. Monitor from the user's perspective first. Start with end-to-end latency, success rate, and throughput as seen by clients β€” not internal server metrics. A service can have 0% CPU usage and 100% error rate. User-facing metrics catch issues that infrastructure metrics miss.

  2. Use percentiles, not averages, for latency. Average latency hides outliers. A p99 of 5 seconds means 1% of users wait 5+ seconds β€” this could be 10,000 frustrated users per million requests. Track p50, p95, and p99 for every critical endpoint.

  3. Set alerts on symptoms, not causes. Alert on "error rate > 1% for 5 minutes" (symptom) rather than "CPU > 80%" (possible cause). Symptom-based alerts fire when users are affected. Cause-based alerts fire for non-issues (high CPU during batch processing is normal).

  4. Build dashboards in layers. Top layer: business KPIs (conversion rate, order throughput). Middle layer: service health (golden signals per service). Bottom layer: infrastructure (CPU, memory, disk per node). Each layer answers different questions for different audiences.

  5. Sample traces intelligently in high-traffic systems. Tracing every request at 100K RPS is prohibitively expensive. Use head-based sampling (10% random) for baseline coverage, and tail-based sampling to always capture slow or errored requests.
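The sampling strategy in practice 5 reduces to a per-trace keep/drop decision; a minimal sketch, assuming a 10% head rate and a 500 ms slow-request threshold (both values are illustrative, not prescriptive):

```python
import random

HEAD_RATE = 0.10          # baseline random sampling for coverage
SLOW_THRESHOLD_MS = 500   # always keep slow traces

def head_sample(rng=random.random):
    """Decided at request start: keep ~10% of traces at random."""
    return rng() < HEAD_RATE

def tail_sample(duration_ms, is_error):
    """Decided at request end: always keep slow or errored traces."""
    return is_error or duration_ms >= SLOW_THRESHOLD_MS

def keep_trace(head_decision, duration_ms, is_error):
    """A trace survives if either sampling stage selects it."""
    return head_decision or tail_sample(duration_ms, is_error)
```

In a real deployment the tail decision requires buffering spans until the trace completes, typically in a collector, which is why tail-based sampling costs more than the head-based pass.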

Common Issues

Dashboard has 50 panels and nobody can find the relevant metric. Overstuffed dashboards cause alert fatigue and slow diagnosis. Create focused dashboards: one for system overview, one per service, one for infrastructure. Each dashboard should answer a specific question in under 10 seconds.

Alerting produces too many false positives. Alerts on raw metrics without appropriate thresholds and windows fire constantly. Use multi-window alerting (short window for severity, long window for persistence), and require sustained violation (5 minutes, not 1 data point) before paging.
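The multi-window rule can be sketched as two checks over recent error-rate samples, assuming one `(errors, total)` sample per minute (the 1% threshold and 5/60-minute windows are illustrative):

```python
def error_rate(samples):
    """samples: list of (errors, total) tuples, one per minute."""
    errors = sum(e for e, _ in samples)
    total = sum(t for _, t in samples)
    return errors / total if total else 0.0

def should_page(samples, short=5, long=60, threshold=0.01):
    """Page only when the error rate exceeds the threshold over BOTH
    the short window (severity) and the long window (persistence),
    so a single bad data point never pages anyone."""
    if len(samples) < short:
        return False
    return (error_rate(samples[-short:]) > threshold
            and error_rate(samples[-long:]) > threshold)
```

A sustained 2% error rate pages; one noisy minute inside an otherwise healthy window does not.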

Monitoring overhead impacts application performance. Heavy instrumentation, high-cardinality metrics, and synchronous metric export add latency to every request. Use asynchronous metric export, limit label cardinality, and benchmark the monitoring overhead itself (aim for < 1% latency impact).
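Label-cardinality limits can be enforced at record time rather than discovered on the billing statement; a minimal sketch that caps distinct label values per metric and lumps the rest into an overflow bucket (the cap and metric name are hypothetical):

```python
from collections import defaultdict

OVERFLOW = "__other__"

class MetricStore:
    """Counts events per (metric, label value), collapsing values
    beyond the cardinality cap into a single overflow series."""

    def __init__(self, max_values=100):
        self.max_values = max_values
        self.counts = defaultdict(lambda: defaultdict(int))

    def inc(self, metric, label_value):
        series = self.counts[metric]
        if label_value not in series and len(series) >= self.max_values:
            label_value = OVERFLOW  # cap reached: do not create a new series
        series[label_value] += 1
```

This keeps the series count bounded even when a label carries unbounded values such as user IDs or raw URL paths.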
