devops infrastructure · v1.0.0 · MIT

SRE Engineer Partner

Your agent for building and maintaining highly reliable systems through SLI/SLO management, error budgets, capacity planning, incident response, and toil reduction.

When to Use This Agent

Choose SRE Engineer Partner when:

  • Defining and implementing SLIs, SLOs, and error budgets for services
  • Building observability stacks (metrics, logs, traces, dashboards)
  • Designing incident response processes and on-call rotations
  • Automating toil and operational runbooks
  • Performing capacity planning, load testing, or reliability reviews

Consider alternatives when:

  • You need infrastructure provisioning — use a Terraform or cloud engineer agent
  • You're focused on CI/CD pipeline design — use a GitOps agent
  • You need security hardening — use a security engineer agent

Quick Start

```yaml
# .claude/agents/sre-engineer.yml
name: SRE Engineer Partner
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
description: SRE agent for reliability engineering, SLO management, incident response, and operational excellence
```

Example invocation:

```bash
claude "Define SLIs and SLOs for our user-facing API service — it handles 10K RPM with a p99 latency target of 500ms and 99.9% availability"
```

Core Concepts

SRE Hierarchy of Reliability

| Level | Focus | Activities |
|---|---|---|
| Monitoring | Know when things break | Alerting, dashboards, anomaly detection |
| Incident Response | Fix things fast | Runbooks, on-call, escalation, postmortems |
| Postmortem Culture | Learn from failures | Blameless reviews, action items, trends |
| Capacity Planning | Stay ahead of demand | Load testing, traffic forecasting, scaling |
| Toil Reduction | Automate the repetitive | Runbook automation, self-healing, tooling |
| SLO-Driven | Spend error budget wisely | Error budget policies, release gates |

SLI/SLO Framework

SLI (Service Level Indicator)
  └─ Measurable metric (e.g., request latency, error rate)

SLO (Service Level Objective)
  └─ Target for SLI (e.g., p99 latency < 500ms for 99.9% of requests)

Error Budget
  └─ Allowed unreliability = 1 - SLO (e.g., 0.1% = 43.2 min/month)

Error Budget Policy
  └─ Actions when budget is spent (freeze deploys, focus on reliability)
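The arithmetic behind the framework above is worth making concrete. A minimal sketch (the function name is illustrative, not part of any tool) that derives the "43.2 min/month" figure from a 99.9% SLO over a 30-day window:

```python
# Sketch: deriving an error budget from an SLO target.
# The 99.9% target and 30-day window mirror the framework above.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for a given SLO over a rolling window."""
    budget_fraction = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return round(budget_fraction * window_days * 24 * 60, 2)

print(error_budget_minutes(0.999))   # 43.2 min/month, as in the diagram
print(error_budget_minutes(0.9999))  # 4.32 min/month -- each extra nine is 10x harder
```

Note how each additional nine shrinks the budget tenfold, which is why SLO targets should be set deliberately rather than defaulting to "as many nines as possible."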

Configuration

| Parameter | Description | Default |
|---|---|---|
| `slo_target` | Default SLO availability target | 99.9% |
| `error_budget_window` | Rolling window for error budget calculation | 30d |
| `monitoring_stack` | Observability tools (prometheus, datadog, newrelic) | prometheus |
| `incident_tool` | Incident management (pagerduty, opsgenie, rootly) | pagerduty |
| `on_call_rotation` | Rotation schedule pattern | weekly |

Best Practices

  1. Define SLOs based on user experience, not infrastructure metrics. CPU utilization and disk usage don't tell you if users are happy. Measure what users care about: request success rate, latency percentiles, and data freshness. An SLO of "99.9% of requests complete successfully within 500ms" is actionable; "CPU below 80%" is not.

  2. Use error budgets to balance reliability and velocity. When you have budget remaining, ship features aggressively. When budget is depleted, pause feature work and invest in reliability. This removes the subjective "is it reliable enough?" debate and replaces it with data-driven decisions.

  3. Automate incident response, not just detection. Alerts that page humans for known-fixable problems are toil. Build self-healing for predictable failures (restart crashed pods, scale on traffic spikes, failover on health check failures) and reserve human intervention for novel incidents.

  4. Run blameless postmortems for every significant incident. Focus on systemic causes and contributing factors, not individual blame. Document what happened, why detection/response took the time it did, and what specific changes will prevent recurrence. Track action item completion rigorously.

  5. Measure and reduce toil systematically. Track how on-call engineers spend their time. Any repetitive, automatable task that scales linearly with service growth is toil. Set a target (e.g., toil < 30% of on-call time) and dedicate engineering capacity to automation projects that reduce it.
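The toil target in practice 5 can be tracked with a simple share-of-time calculation. A minimal sketch, where the task categories, hours, and the 30% threshold are illustrative sample data:

```python
# Sketch: tracking toil as a share of on-call time (practice 5).
# Task names and hours are illustrative sample data, not a real log.

oncall_hours = {
    "incident response": 6.0,      # novel work -- not toil
    "manual cert rotation": 3.0,   # repetitive, automatable -> toil
    "rerunning stuck jobs": 5.0,   # repetitive, automatable -> toil
    "postmortem writing": 2.0,     # not toil
}
toil_tasks = {"manual cert rotation", "rerunning stuck jobs"}

total = sum(oncall_hours.values())
toil = sum(h for task, h in oncall_hours.items() if task in toil_tasks)
toil_pct = 100 * toil / total

print(f"toil: {toil_pct:.0f}% of on-call time")
if toil_pct > 30:
    print("over target -- schedule automation work for the top toil task")
```

Reviewing this number per on-call shift makes the "dedicate engineering capacity to automation" decision a routine check rather than a judgment call.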

Common Issues

Alert fatigue causes on-call engineers to ignore pages. This happens when alerting thresholds are too sensitive or non-actionable alerts aren't suppressed. Audit every alert: if it doesn't require human action, it's not an alert — it's a log. Implement alert deduplication, grouping, and routing. Aim for fewer than 2 pages per on-call shift.
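Deduplication and grouping are usually configured in the monitoring stack itself (e.g. Alertmanager or PagerDuty rules), but the logic is easy to see in isolation. A sketch with an assumed alert shape and sample data:

```python
# Sketch: deduplicating and grouping alerts before paging.
# The alert dict shape and sample data are assumptions for illustration;
# real stacks express this as Alertmanager/PagerDuty grouping rules.

from collections import defaultdict

alerts = [
    {"service": "api", "name": "HighErrorRate", "pod": "api-1"},
    {"service": "api", "name": "HighErrorRate", "pod": "api-2"},
    {"service": "db",  "name": "DiskPressure",  "pod": "db-0"},
    {"service": "api", "name": "HighErrorRate", "pod": "api-1"},  # duplicate
]

# Dedupe on (service, alert name, pod), then group per service+name so a
# fleet-wide failure becomes one page instead of one page per pod.
unique = {(a["service"], a["name"], a["pod"]) for a in alerts}
groups = defaultdict(list)
for service, name, pod in sorted(unique):
    groups[(service, name)].append(pod)

for (service, name), pods in groups.items():
    print(f"PAGE {service}/{name}: {len(pods)} pod(s) affected ({', '.join(pods)})")
```

Four raw alerts collapse to two pages, and each page carries the affected-pod context the responder needs.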

SLOs exist on paper but nobody uses them for decisions. SLOs become decoration when there's no error budget policy and no integration with release processes. Wire error budget status into your CI/CD pipeline as a deploy gate, display burn rate on team dashboards, and review error budget trends in sprint planning.
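A deploy gate like the one described can be a small script in the pipeline. This sketch stubs out the budget query (`remaining_budget_fraction` and the 10% threshold are illustrative assumptions); in practice that value would come from your monitoring stack:

```python
# Sketch: an error-budget deploy gate for CI/CD.
# remaining_budget_fraction is a stub -- replace with a real query
# against your monitoring stack (e.g. a Prometheus burn-rate rule).
import sys

def remaining_budget_fraction() -> float:
    """Fraction of the rolling-window error budget still unspent (stubbed)."""
    return 0.15

def deploy_gate(min_budget: float = 0.10) -> bool:
    """Block deploys once less than `min_budget` of the budget remains."""
    remaining = remaining_budget_fraction()
    if remaining < min_budget:
        print(f"DENY: only {remaining:.0%} of error budget left")
        return False
    print(f"ALLOW: {remaining:.0%} of error budget remaining")
    return True

if __name__ == "__main__":
    sys.exit(0 if deploy_gate() else 1)
```

Exiting non-zero lets any CI system treat a depleted budget exactly like a failed test, which is what turns the SLO from decoration into a decision.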

Capacity planning is reactive instead of proactive. Teams discover capacity problems during traffic spikes instead of before them. Run regular load tests against production-like environments, forecast traffic 3-6 months ahead using historical trends, and set capacity alerts at 70% utilization so you have time to scale before hitting limits.
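The forecasting step can start very simply. A dependency-free sketch using average month-over-month growth (the traffic figures and 2000 RPS capacity are illustrative sample data, and a least-squares fit would work equally well):

```python
# Sketch: naive linear traffic forecast against a 70% capacity alert line.
# Monthly peaks and total capacity are illustrative sample data.

monthly_peak_rps = [800, 850, 910, 960, 1020, 1080]  # last 6 months
capacity_rps = 2000
alert_threshold = 0.70 * capacity_rps  # page well before saturation

# Model growth as the average month-over-month delta.
deltas = [b - a for a, b in zip(monthly_peak_rps, monthly_peak_rps[1:])]
growth = sum(deltas) / len(deltas)

for months_ahead in range(1, 13):
    projected = monthly_peak_rps[-1] + growth * months_ahead
    if projected >= alert_threshold:
        print(f"Projected to cross 70% of capacity in ~{months_ahead} month(s)")
        break
```

Even a crude projection like this converts "we hit a wall during the spike" into "we have N months to add capacity," which is the point of the 70% alert line.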
