SRE Engineer Partner
Your agent for building and maintaining highly reliable systems through SLI/SLO management, error budgets, capacity planning, incident response, and toil reduction.
When to Use This Agent
Choose SRE Engineer Partner when:
- Defining and implementing SLIs, SLOs, and error budgets for services
- Building observability stacks (metrics, logs, traces, dashboards)
- Designing incident response processes and on-call rotations
- Automating toil and operational runbooks
- Performing capacity planning, load testing, or reliability reviews
Consider alternatives when:
- You need infrastructure provisioning → use a Terraform or cloud engineer agent
- You're focused on CI/CD pipeline design → use a GitOps agent
- You need security hardening → use a security engineer agent
Quick Start
# .claude/agents/sre-engineer.yml
name: SRE Engineer Partner
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
description: SRE agent for reliability engineering, SLO management, incident response, and operational excellence
Example invocation:
claude "Define SLIs and SLOs for our user-facing API service: it handles 10K RPM with a p99 latency target of 500ms and 99.9% availability"
Core Concepts
SRE Hierarchy of Reliability
| Level | Focus | Activities |
|---|---|---|
| Monitoring | Know when things break | Alerting, dashboards, anomaly detection |
| Incident Response | Fix things fast | Runbooks, on-call, escalation, postmortems |
| Postmortem Culture | Learn from failures | Blameless reviews, action items, trends |
| Capacity Planning | Stay ahead of demand | Load testing, traffic forecasting, scaling |
| Toil Reduction | Automate the repetitive | Runbook automation, self-healing, tooling |
| SLO-Driven | Spend error budget wisely | Error budget policies, release gates |
SLI/SLO Framework
SLI (Service Level Indicator)
└─ Measurable metric (e.g., request latency, error rate)
SLO (Service Level Objective)
└─ Target for SLI (e.g., p99 latency < 500ms for 99.9% of requests)
Error Budget
└─ Allowed unreliability = 1 - SLO (e.g., 0.1% = 43.2 min/month)
Error Budget Policy
└─ Actions when budget is spent (freeze deploys, focus on reliability)
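The budget arithmetic above is worth making concrete. A few lines of Python (function name is illustrative) show how the 43.2 min/month figure falls out of a 99.9% SLO:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a rolling window."""
    return (1.0 - slo) * window_days * 24 * 60

# 99.9% availability over 30 days leaves 43.2 minutes of error budget;
# each extra nine cuts the budget by 10x.
print(round(error_budget_minutes(0.999), 1))
print(round(error_budget_minutes(0.9999), 1))
```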
Configuration
| Parameter | Description | Default |
|---|---|---|
| slo_target | Default SLO availability target | 99.9% |
| error_budget_window | Rolling window for error budget calculation | 30d |
| monitoring_stack | Observability tools (prometheus, datadog, newrelic) | prometheus |
| incident_tool | Incident management (pagerduty, opsgenie, rootly) | pagerduty |
| on_call_rotation | Rotation schedule pattern | weekly |
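A hypothetical configuration pulling these parameters together might look like the following (key names mirror the table; exact placement depends on how your agent setup reads configuration):

```yaml
# illustrative values only
slo_target: "99.9%"
error_budget_window: 30d
monitoring_stack: prometheus
incident_tool: pagerduty
on_call_rotation: weekly
```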
Best Practices
- Define SLOs based on user experience, not infrastructure metrics. CPU utilization and disk usage don't tell you if users are happy. Measure what users care about: request success rate, latency percentiles, and data freshness. An SLO of "99.9% of requests complete successfully within 500ms" is actionable; "CPU below 80%" is not.
- Use error budgets to balance reliability and velocity. When you have budget remaining, ship features aggressively. When budget is depleted, pause feature work and invest in reliability. This removes the subjective "is it reliable enough?" debate and replaces it with data-driven decisions.
- Automate incident response, not just detection. Alerts that page humans for known-fixable problems are toil. Build self-healing for predictable failures (restart crashed pods, scale on traffic spikes, failover on health check failures) and reserve human intervention for novel incidents.
- Run blameless postmortems for every significant incident. Focus on systemic causes and contributing factors, not individual blame. Document what happened, why detection/response took the time it did, and what specific changes will prevent recurrence. Track action item completion rigorously.
- Measure and reduce toil systematically. Track how on-call engineers spend their time. Any repetitive, automatable task that scales linearly with service growth is toil. Set a target (e.g., toil < 30% of on-call time) and dedicate engineering capacity to automation projects that reduce it.
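The error-budget practice above can be sketched as a small decision helper. The thresholds and function names here are illustrative, not a prescribed policy:

```python
def remaining_budget_fraction(slo: float, observed_availability: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = 1.0 - slo                     # total allowed unreliability
    spent = 1.0 - observed_availability    # unreliability actually observed
    return (budget - spent) / budget

def release_policy(slo: float, observed_availability: float) -> str:
    """Map remaining budget to a coarse release-velocity action."""
    remaining = remaining_budget_fraction(slo, observed_availability)
    if remaining <= 0:
        return "freeze"    # budget exhausted: reliability work only
    if remaining < 0.25:
        return "caution"   # slow down: extra review, no risky launches
    return "ship"          # healthy budget: normal release velocity

print(release_policy(0.999, 0.9995))  # half the budget left -> "ship"
```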
Common Issues
Alert fatigue causes on-call engineers to ignore pages. This happens when alerting thresholds are too sensitive or non-actionable alerts aren't suppressed. Audit every alert: if it doesn't require human action, it's not an alert; it's a log. Implement alert deduplication, grouping, and routing. Aim for fewer than 2 pages per on-call shift.
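Deduplication can be as simple as keying alerts on their identity and suppressing repeats inside a time window. A minimal sketch (production stacks such as Prometheus Alertmanager implement this for you; class and field names here are illustrative):

```python
from datetime import datetime, timedelta

class AlertDeduplicator:
    """Suppress repeat pages for the same (service, alert) within a window."""

    def __init__(self, window: timedelta = timedelta(minutes=30)):
        self.window = window
        self.last_paged: dict[tuple[str, str], datetime] = {}

    def should_page(self, service: str, alert: str, now: datetime) -> bool:
        key = (service, alert)
        last = self.last_paged.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: log it, don't page
        self.last_paged[key] = now
        return True

dedup = AlertDeduplicator()
t0 = datetime(2024, 1, 1, 12, 0)
print(dedup.should_page("api", "HighErrorRate", t0))                         # True
print(dedup.should_page("api", "HighErrorRate", t0 + timedelta(minutes=5)))  # False
```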
SLOs exist on paper but nobody uses them for decisions. SLOs become decoration when there's no error budget policy and no integration with release processes. Wire error budget status into your CI/CD pipeline as a deploy gate, display burn rate on team dashboards, and review error budget trends in sprint planning.
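The dashboard number for this is burn rate: how many times faster than "sustainable" the budget is being consumed. A sketch of the calculation (the 14.4 threshold is the conventional fast-burn alert level for a 99.9% SLO):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent.
    1.0 means the budget lasts exactly the window; 14.4 on a 99.9% SLO
    means a 30-day budget burns in about 2 days."""
    return error_rate / (1.0 - slo)

# A 1.44% error rate against a 99.9% SLO is a fast burn: page someone.
print(round(burn_rate(0.0144, 0.999), 1))
```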
Capacity planning is reactive instead of proactive. Teams discover capacity problems during traffic spikes instead of before them. Run regular load tests against production-like environments, forecast traffic 3-6 months ahead using historical trends, and set capacity alerts at 70% utilization so you have time to scale before hitting limits.
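A first-pass forecast can be a linear fit over historical peak utilization, projecting when it crosses the 70% alert line. This is an illustrative sketch under a constant-growth assumption; real forecasting should also account for seasonality and launch events:

```python
def months_until_threshold(history: list[float], threshold: float = 0.70):
    """Least-squares fit of monthly peak utilization (0..1) and projection of
    when it crosses `threshold`. Returns months from now, or None if flat."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    # least-squares slope: utilization growth per month
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # flat or declining: no projected crossing
    current = history[-1]
    if current >= threshold:
        return 0.0
    return (threshold - current) / slope

# Utilization grew ~2 points/month with 10 points of headroom -> ~5 months.
print(round(months_until_threshold([0.52, 0.54, 0.56, 0.58, 0.60]), 1))
```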