Performance Monitor Architect
Your agent for designing and implementing application performance monitoring: covering metrics collection, dashboard design, anomaly detection, alerting, and performance optimization insights.
When to Use This Agent
Choose Performance Monitor Architect when:
- Designing performance monitoring strategies for applications and services
- Setting up metrics collection, dashboards, and alerting pipelines
- Implementing APM (Application Performance Monitoring) solutions
- Analyzing performance data to identify bottlenecks and optimization targets
- Building custom monitoring tools or integrating with existing platforms
Consider alternatives when:
- You need infrastructure monitoring (servers, networks): use an SRE agent
- You need to fix performance issues: use a performance engineering agent
- You need logging and error tracking: use an observability agent
Quick Start
```yaml
# .claude/agents/performance-monitor.yml
name: Performance Monitor Architect
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
description: APM architect for metrics design, dashboard creation, anomaly detection, and performance optimization
```
Example invocation:
```bash
claude "Design a comprehensive monitoring setup for our microservices: we need request latency tracking, error rate dashboards, throughput metrics, and anomaly-based alerting"
```
Core Concepts
Monitoring Pillars
| Pillar | What to Measure | Tools |
|---|---|---|
| Metrics | Latency, throughput, error rate, saturation | Prometheus, Datadog, CloudWatch |
| Logs | Structured events, errors, audit trails | ELK Stack, Loki, Splunk |
| Traces | Request flow across services | Jaeger, Zipkin, AWS X-Ray |
| Profiles | CPU, memory, I/O allocation by function | Pyroscope, Pprof, async-profiler |
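To make the metrics pillar concrete, here is a toy in-process registry showing what a metrics client collects (counters and latency observations keyed by name and labels). This is an illustrative sketch only; a real service would use a client library such as prometheus_client, and every name below is made up for the example.

```python
from collections import defaultdict


class MetricsRegistry:
    """Toy registry illustrating the metrics pillar: counters + histograms."""

    def __init__(self):
        self.counters = defaultdict(int)     # (name, labels) -> count
        self.histograms = defaultdict(list)  # (name, labels) -> observations

    @staticmethod
    def _key(name, labels):
        # Sort labels so {"a": 1, "b": 2} and {"b": 2, "a": 1} map to one series
        return (name, tuple(sorted(labels.items())))

    def inc(self, name, **labels):
        self.counters[self._key(name, labels)] += 1

    def observe(self, name, value, **labels):
        self.histograms[self._key(name, labels)].append(value)


registry = MetricsRegistry()
registry.inc("http_requests_total", method="GET", status="200")
registry.inc("http_requests_total", method="GET", status="500")
registry.observe("request_latency_seconds", 0.042, endpoint="/orders")

key = ("http_requests_total", (("method", "GET"), ("status", "200")))
print(registry.counters[key])  # → 1
```

Real clients add the missing pieces this sketch omits: histogram bucketing, an export endpoint, and thread safety.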
Golden Signals
```
Latency: Time to serve a request
├── p50, p95, p99 percentiles
└── Differentiate success vs error latency

Traffic: Demand on the system
├── Requests per second
└── Concurrent connections

Errors: Rate of failed requests
├── HTTP 5xx rate
└── Business logic error rate

Saturation: How full the system is
├── CPU, memory, disk utilization
└── Queue depth, connection pool usage
```
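The signals above can be derived directly from raw request records. The sketch below computes latency percentiles and error rate over simulated data; the nearest-rank percentile function and the sample records are assumptions for illustration, not any particular library's API.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile over a pre-sorted list (illustrative sketch)."""
    if not sorted_values:
        return None
    k = max(0, math.ceil(p / 100 * len(sorted_values)) - 1)
    return sorted_values[k]


# Simulated request records: (latency_seconds, http_status)
requests = [(0.020, 200)] * 90 + [(0.100, 200)] * 9 + [(2.500, 500)]

latencies = sorted(lat for lat, _ in requests)
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
error_rate = sum(1 for _, s in requests if s >= 500) / len(requests)

print(p50, p95, p99, error_rate)  # → 0.02 0.1 0.1 0.01
```

Note how the single 2.5 s outlier is invisible even at p99 here but would noticeably skew an average, which is why both tails and rates are tracked.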
Configuration
| Parameter | Description | Default |
|---|---|---|
| monitoring_platform | APM tool (prometheus, datadog, newrelic, cloudwatch) | prometheus |
| dashboard_tool | Visualization (grafana, datadog, cloudwatch-dashboards) | grafana |
| alert_channel | Alert destination (pagerduty, opsgenie, slack) | pagerduty |
| retention_period | Metrics retention duration | 30d |
| sampling_rate | Trace sampling rate for high-traffic services | 10% |
Best Practices
- Monitor from the user's perspective first. Start with end-to-end latency, success rate, and throughput as seen by clients, not internal server metrics. A service can have 0% CPU usage and a 100% error rate. User-facing metrics catch issues that infrastructure metrics miss.
- Use percentiles, not averages, for latency. Average latency hides outliers. A p99 of 5 seconds means 1% of users wait 5+ seconds; that could be 10,000 frustrated users per million requests. Track p50, p95, and p99 for every critical endpoint.
- Set alerts on symptoms, not causes. Alert on "error rate > 1% for 5 minutes" (symptom) rather than "CPU > 80%" (possible cause). Symptom-based alerts fire when users are affected. Cause-based alerts fire for non-issues (high CPU during batch processing is normal).
- Build dashboards in layers. Top layer: business KPIs (conversion rate, order throughput). Middle layer: service health (golden signals per service). Bottom layer: infrastructure (CPU, memory, disk per node). Each layer answers different questions for different audiences.
- Sample traces intelligently in high-traffic systems. Tracing every request at 100K RPS is prohibitively expensive. Use head-based sampling (10% random) for baseline coverage, and tail-based sampling to always capture slow or errored requests.
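The combined sampling policy from the last practice can be sketched as a single keep/drop decision. This is a simplification: real tail-based sampling runs in a collector after the full trace completes (e.g. an OpenTelemetry tail-sampling processor), and the threshold and rate below are assumed values.

```python
import random

SLOW_THRESHOLD_S = 1.0    # assumed latency cutoff for "always keep"
HEAD_SAMPLE_RATE = 0.10   # 10% random baseline coverage


def keep_trace(duration_s, had_error, rng=random.random):
    """Keep slow or errored traces unconditionally (tail-based policy);
    otherwise fall back to random head-based sampling."""
    if had_error or duration_s >= SLOW_THRESHOLD_S:
        return True
    return rng() < HEAD_SAMPLE_RATE


# Errored and slow requests are always retained:
print(keep_trace(0.05, had_error=True))   # → True
print(keep_trace(2.30, had_error=False))  # → True
```

Fast, successful requests are kept roughly 10% of the time, which keeps baseline coverage while guaranteeing that every interesting trace survives.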
Common Issues
Dashboard has 50 panels and nobody can find the relevant metric. Overstuffed dashboards cause alert fatigue and slow diagnosis. Create focused dashboards: one for system overview, one per service, one for infrastructure. Each dashboard should answer a specific question in under 10 seconds.
Alerting produces too many false positives. Alerts on raw metrics without appropriate thresholds and windows fire constantly. Use multi-window alerting (short window for severity, long window for persistence), and require sustained violation (5 minutes, not 1 data point) before paging.
Monitoring overhead impacts application performance. Heavy instrumentation, high-cardinality metrics, and synchronous metric export add latency to every request. Use asynchronous metric export, limit label cardinality, and benchmark the monitoring overhead itself (aim for < 1% latency impact).
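The multi-window rule for reducing false positives can be sketched as follows; the window contents and 1% threshold are illustrative assumptions, not a specific alerting tool's API.

```python
def breaches(error_rates, threshold):
    """True only if every sample in the window exceeds the threshold."""
    return bool(error_rates) and all(r > threshold for r in error_rates)


def should_page(short_window, long_window, threshold=0.01):
    """Page only when the violation is both severe right now (short window)
    and sustained over time (long window), so a single noisy datapoint
    never pages on its own."""
    return breaches(short_window, threshold) and breaches(long_window, threshold)


# One spike followed by recovery in the long window does not page:
print(should_page([0.05], [0.05, 0.0, 0.0, 0.0, 0.0]))       # → False
# A violation sustained across both windows pages:
print(should_page([0.05], [0.05, 0.04, 0.06, 0.05, 0.05]))   # → True
```

Production systems typically express the same idea as burn-rate alerts over paired windows (e.g. 5 minutes and 1 hour) rather than raw sample lists.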