
Performance Monitor Architect

Your agent for designing and implementing application performance monitoring β€” covering metrics collection, dashboard design, anomaly detection, alerting, and performance optimization insights.

When to Use This Agent

Choose Performance Monitor Architect when:

  • Designing performance monitoring strategies for applications and services
  • Setting up metrics collection, dashboards, and alerting pipelines
  • Implementing APM (Application Performance Monitoring) solutions
  • Analyzing performance data to identify bottlenecks and optimization targets
  • Building custom monitoring tools or integrating with existing platforms

Consider alternatives when:

  • You need infrastructure monitoring (servers, networks) β€” use an SRE agent
  • You need to fix performance issues β€” use a performance engineering agent
  • You need logging and error tracking β€” use an observability agent

Quick Start

```yaml
# .claude/agents/performance-monitor.yml
name: Performance Monitor Architect
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
description: APM architect for metrics design, dashboard creation, anomaly detection, and performance optimization
```

Example invocation:

```shell
claude "Design a comprehensive monitoring setup for our microservices — we need request latency tracking, error rate dashboards, throughput metrics, and anomaly-based alerting"
```

Core Concepts

Monitoring Pillars

| Pillar | What to Measure | Tools |
|---|---|---|
| Metrics | Latency, throughput, error rate, saturation | Prometheus, Datadog, CloudWatch |
| Logs | Structured events, errors, audit trails | ELK Stack, Loki, Splunk |
| Traces | Request flow across services | Jaeger, Zipkin, AWS X-Ray |
| Profiles | CPU, memory, I/O allocation by function | Pyroscope, Pprof, async-profiler |

Golden Signals

Latency:    Time to serve a request
            β”œβ”€β”€ p50, p95, p99 percentiles
            └── Differentiate success vs error latency

Traffic:    Demand on the system
            β”œβ”€β”€ Requests per second
            └── Concurrent connections

Errors:     Rate of failed requests
            β”œβ”€β”€ HTTP 5xx rate
            └── Business logic error rate

Saturation: How full the system is
            β”œβ”€β”€ CPU, memory, disk utilization
            └── Queue depth, connection pool usage
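The latency percentiles above can be computed from raw samples with the nearest-rank method; a minimal pure-Python sketch, not tied to any particular APM library (the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: ceil(p/100 * n), 1-indexed.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [12, 15, 11, 14, 250, 13, 16, 12, 14, 900]  # ms
print(percentile(latencies, 50))  # 14 — typical request
print(percentile(latencies, 99))  # 900 — the outlier an average would hide
```

Note the contrast with the mean of the same samples (about 126 ms), which describes no real request.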

Configuration

| Parameter | Description | Default |
|---|---|---|
| `monitoring_platform` | APM tool (`prometheus`, `datadog`, `newrelic`, `cloudwatch`) | `prometheus` |
| `dashboard_tool` | Visualization (`grafana`, `datadog`, `cloudwatch-dashboards`) | `grafana` |
| `alert_channel` | Alert destination (`pagerduty`, `opsgenie`, `slack`) | `pagerduty` |
| `retention_period` | Metrics retention duration | `30d` |
| `sampling_rate` | Trace sampling rate for high-traffic services | `10%` |
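These parameters might be expressed alongside the Quick Start agent definition like so (a hypothetical fragment; the key names simply mirror the table above):

```yaml
# .claude/agents/performance-monitor.yml — hypothetical configuration keys
monitoring_platform: prometheus
dashboard_tool: grafana
alert_channel: pagerduty
retention_period: 30d
sampling_rate: 0.10   # 10% head-based trace sampling
```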

Best Practices

  1. Monitor from the user's perspective first. Start with end-to-end latency, success rate, and throughput as seen by clients β€” not internal server metrics. A service can have 0% CPU usage and 100% error rate. User-facing metrics catch issues that infrastructure metrics miss.

  2. Use percentiles, not averages, for latency. Average latency hides outliers. A p99 of 5 seconds means 1% of users wait 5+ seconds β€” this could be 10,000 frustrated users per million requests. Track p50, p95, and p99 for every critical endpoint.

  3. Set alerts on symptoms, not causes. Alert on "error rate > 1% for 5 minutes" (symptom) rather than "CPU > 80%" (possible cause). Symptom-based alerts fire when users are affected. Cause-based alerts fire for non-issues (high CPU during batch processing is normal).

  4. Build dashboards in layers. Top layer: business KPIs (conversion rate, order throughput). Middle layer: service health (golden signals per service). Bottom layer: infrastructure (CPU, memory, disk per node). Each layer answers different questions for different audiences.

  5. Sample traces intelligently in high-traffic systems. Tracing every request at 100K RPS is prohibitively expensive. Use head-based sampling (10% random) for baseline coverage, and tail-based sampling to always capture slow or errored requests.
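The sampling strategy in practice 5 reduces to a per-trace keep/drop decision; a minimal sketch, assuming a 10% head rate and a 500 ms slow-request threshold (both values are illustrative, not prescriptive):

```python
import random

HEAD_RATE = 0.10          # baseline random sampling for coverage
SLOW_THRESHOLD_MS = 500   # always keep slow traces

def head_sample(rng=random.random):
    """Decided at request start: keep ~10% of traces at random."""
    return rng() < HEAD_RATE

def tail_sample(duration_ms, is_error):
    """Decided at request end: always keep slow or errored traces."""
    return is_error or duration_ms >= SLOW_THRESHOLD_MS

def keep_trace(head_decision, duration_ms, is_error):
    """A trace survives if either sampling stage selects it."""
    return head_decision or tail_sample(duration_ms, is_error)
```

In a real deployment the tail decision requires buffering spans until the trace completes, typically in a collector, which is why tail-based sampling costs more than the head-based pass.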

Common Issues

Dashboard has 50 panels and nobody can find the relevant metric. Overstuffed dashboards cause alert fatigue and slow diagnosis. Create focused dashboards: one for system overview, one per service, one for infrastructure. Each dashboard should answer a specific question in under 10 seconds.

Alerting produces too many false positives. Alerts on raw metrics without appropriate thresholds and windows fire constantly. Use multi-window alerting (short window for severity, long window for persistence), and require sustained violation (5 minutes, not 1 data point) before paging.
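The multi-window rule can be sketched as two checks over recent error-rate samples, assuming one `(errors, total)` sample per minute (the 1% threshold and 5/60-minute windows are illustrative):

```python
def error_rate(samples):
    """samples: list of (errors, total) tuples, one per minute."""
    errors = sum(e for e, _ in samples)
    total = sum(t for _, t in samples)
    return errors / total if total else 0.0

def should_page(samples, short=5, long=60, threshold=0.01):
    """Page only when the error rate exceeds the threshold over BOTH
    the short window (severity) and the long window (persistence),
    so a single bad data point never pages anyone."""
    if len(samples) < short:
        return False
    return (error_rate(samples[-short:]) > threshold
            and error_rate(samples[-long:]) > threshold)
```

A sustained 2% error rate pages; one noisy minute inside an otherwise healthy window does not.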

Monitoring overhead impacts application performance. Heavy instrumentation, high-cardinality metrics, and synchronous metric export add latency to every request. Use asynchronous metric export, limit label cardinality, and benchmark the monitoring overhead itself (aim for < 1% latency impact).
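Label-cardinality limits can be enforced at record time rather than discovered on the billing statement; a minimal sketch that caps distinct label values per metric and lumps the rest into an overflow bucket (the cap and metric name are hypothetical):

```python
from collections import defaultdict

OVERFLOW = "__other__"

class MetricStore:
    """Counts events per (metric, label value), collapsing values
    beyond the cardinality cap into a single overflow series."""

    def __init__(self, max_values=100):
        self.max_values = max_values
        self.counts = defaultdict(lambda: defaultdict(int))

    def inc(self, metric, label_value):
        series = self.counts[metric]
        if label_value not in series and len(series) >= self.max_values:
            label_value = OVERFLOW  # cap reached: do not create a new series
        series[label_value] += 1
```

This keeps the series count bounded even when a label carries unbounded values such as user IDs or raw URL paths.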
