devops infrastructure · v1.0.0 · MIT

SRE Engineer Partner

Your agent for building and maintaining highly reliable systems through SLI/SLO management, error budgets, capacity planning, incident response, and toil reduction.

When to Use This Agent

Choose SRE Engineer Partner when:

  • Defining and implementing SLIs, SLOs, and error budgets for services
  • Building observability stacks (metrics, logs, traces, dashboards)
  • Designing incident response processes and on-call rotations
  • Automating toil and operational runbooks
  • Performing capacity planning, load testing, or reliability reviews

Consider alternatives when:

  • You need infrastructure provisioning — use a Terraform or cloud engineer agent
  • You're focused on CI/CD pipeline design — use a GitOps agent
  • You need security hardening — use a security engineer agent

Quick Start

```yaml
# .claude/agents/sre-engineer.yml
name: SRE Engineer Partner
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
description: SRE agent for reliability engineering, SLO management, incident response, and operational excellence
```

Example invocation:

```bash
claude "Define SLIs and SLOs for our user-facing API service — it handles 10K RPM with a p99 latency target of 500ms and 99.9% availability"
```

Core Concepts

SRE Hierarchy of Reliability

| Level | Focus | Activities |
|---|---|---|
| Monitoring | Know when things break | Alerting, dashboards, anomaly detection |
| Incident Response | Fix things fast | Runbooks, on-call, escalation, postmortems |
| Postmortem Culture | Learn from failures | Blameless reviews, action items, trends |
| Capacity Planning | Stay ahead of demand | Load testing, traffic forecasting, scaling |
| Toil Reduction | Automate the repetitive | Runbook automation, self-healing, tooling |
| SLO-Driven | Spend error budget wisely | Error budget policies, release gates |

SLI/SLO Framework

SLI (Service Level Indicator)
  └─ Measurable metric (e.g., request latency, error rate)

SLO (Service Level Objective)
  └─ Target for SLI (e.g., p99 latency < 500ms for 99.9% of requests)

Error Budget
  └─ Allowed unreliability = 1 - SLO (e.g., 0.1% = 43.2 min/month)

Error Budget Policy
  └─ Actions when budget is spent (freeze deploys, focus on reliability)
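The arithmetic behind the framework above is worth making concrete. A minimal sketch (the function name is illustrative, not part of any tool) that derives the "43.2 min/month" figure from a 99.9% SLO over a 30-day window:

```python
# Sketch: deriving an error budget from an SLO target.
# The 99.9% target and 30-day window mirror the framework above.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for a given SLO over a rolling window."""
    budget_fraction = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return round(budget_fraction * window_days * 24 * 60, 2)

print(error_budget_minutes(0.999))   # 43.2 min/month, as in the diagram
print(error_budget_minutes(0.9999))  # 4.32 min/month -- each extra nine is 10x harder
```

Note how each additional nine shrinks the budget tenfold, which is why SLO targets should be set deliberately rather than defaulting to "as many nines as possible."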

Configuration

| Parameter | Description | Default |
|---|---|---|
| `slo_target` | Default SLO availability target | 99.9% |
| `error_budget_window` | Rolling window for error budget calculation | 30d |
| `monitoring_stack` | Observability tools (prometheus, datadog, newrelic) | prometheus |
| `incident_tool` | Incident management (pagerduty, opsgenie, rootly) | pagerduty |
| `on_call_rotation` | Rotation schedule pattern | weekly |

Best Practices

  1. Define SLOs based on user experience, not infrastructure metrics. CPU utilization and disk usage don't tell you if users are happy. Measure what users care about: request success rate, latency percentiles, and data freshness. An SLO of "99.9% of requests complete successfully within 500ms" is actionable; "CPU below 80%" is not.

  2. Use error budgets to balance reliability and velocity. When you have budget remaining, ship features aggressively. When budget is depleted, pause feature work and invest in reliability. This removes the subjective "is it reliable enough?" debate and replaces it with data-driven decisions.

  3. Automate incident response, not just detection. Alerts that page humans for known-fixable problems are toil. Build self-healing for predictable failures (restart crashed pods, scale on traffic spikes, failover on health check failures) and reserve human intervention for novel incidents.

  4. Run blameless postmortems for every significant incident. Focus on systemic causes and contributing factors, not individual blame. Document what happened, why detection/response took the time it did, and what specific changes will prevent recurrence. Track action item completion rigorously.

  5. Measure and reduce toil systematically. Track how on-call engineers spend their time. Any repetitive, automatable task that scales linearly with service growth is toil. Set a target (e.g., toil < 30% of on-call time) and dedicate engineering capacity to automation projects that reduce it.
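The toil target in practice 5 can be tracked with a simple share-of-time calculation. A minimal sketch, where the task categories, hours, and the 30% threshold are illustrative sample data:

```python
# Sketch: tracking toil as a share of on-call time (practice 5).
# Task names and hours are illustrative sample data, not a real log.

oncall_hours = {
    "incident response": 6.0,      # novel work -- not toil
    "manual cert rotation": 3.0,   # repetitive, automatable -> toil
    "rerunning stuck jobs": 5.0,   # repetitive, automatable -> toil
    "postmortem writing": 2.0,     # not toil
}
toil_tasks = {"manual cert rotation", "rerunning stuck jobs"}

total = sum(oncall_hours.values())
toil = sum(h for task, h in oncall_hours.items() if task in toil_tasks)
toil_pct = 100 * toil / total

print(f"toil: {toil_pct:.0f}% of on-call time")
if toil_pct > 30:
    print("over target -- schedule automation work for the top toil task")
```

Reviewing this number per on-call shift makes the "dedicate engineering capacity to automation" decision a routine check rather than a judgment call.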

Common Issues

Alert fatigue causes on-call engineers to ignore pages. This happens when alerting thresholds are too sensitive or non-actionable alerts aren't suppressed. Audit every alert: if it doesn't require human action, it's not an alert — it's a log. Implement alert deduplication, grouping, and routing. Aim for fewer than 2 pages per on-call shift.
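Deduplication and grouping are usually configured in the monitoring stack itself (e.g. Alertmanager or PagerDuty rules), but the logic is easy to see in isolation. A sketch with an assumed alert shape and sample data:

```python
# Sketch: deduplicating and grouping alerts before paging.
# The alert dict shape and sample data are assumptions for illustration;
# real stacks express this as Alertmanager/PagerDuty grouping rules.

from collections import defaultdict

alerts = [
    {"service": "api", "name": "HighErrorRate", "pod": "api-1"},
    {"service": "api", "name": "HighErrorRate", "pod": "api-2"},
    {"service": "db",  "name": "DiskPressure",  "pod": "db-0"},
    {"service": "api", "name": "HighErrorRate", "pod": "api-1"},  # duplicate
]

# Dedupe on (service, alert name, pod), then group per service+name so a
# fleet-wide failure becomes one page instead of one page per pod.
unique = {(a["service"], a["name"], a["pod"]) for a in alerts}
groups = defaultdict(list)
for service, name, pod in sorted(unique):
    groups[(service, name)].append(pod)

for (service, name), pods in groups.items():
    print(f"PAGE {service}/{name}: {len(pods)} pod(s) affected ({', '.join(pods)})")
```

Four raw alerts collapse to two pages, and each page carries the affected-pod context the responder needs.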

SLOs exist on paper but nobody uses them for decisions. SLOs become decoration when there's no error budget policy and no integration with release processes. Wire error budget status into your CI/CD pipeline as a deploy gate, display burn rate on team dashboards, and review error budget trends in sprint planning.
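A deploy gate like the one described can be a small script in the pipeline. This sketch stubs out the budget query (`remaining_budget_fraction` and the 10% threshold are illustrative assumptions); in practice that value would come from your monitoring stack:

```python
# Sketch: an error-budget deploy gate for CI/CD.
# remaining_budget_fraction is a stub -- replace with a real query
# against your monitoring stack (e.g. a Prometheus burn-rate rule).
import sys

def remaining_budget_fraction() -> float:
    """Fraction of the rolling-window error budget still unspent (stubbed)."""
    return 0.15

def deploy_gate(min_budget: float = 0.10) -> bool:
    """Block deploys once less than `min_budget` of the budget remains."""
    remaining = remaining_budget_fraction()
    if remaining < min_budget:
        print(f"DENY: only {remaining:.0%} of error budget left")
        return False
    print(f"ALLOW: {remaining:.0%} of error budget remaining")
    return True

if __name__ == "__main__":
    sys.exit(0 if deploy_gate() else 1)
```

Exiting non-zero lets any CI system treat a depleted budget exactly like a failed test, which is what turns the SLO from decoration into a decision.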

Capacity planning is reactive instead of proactive. Teams discover capacity problems during traffic spikes instead of before them. Run regular load tests against production-like environments, forecast traffic 3-6 months ahead using historical trends, and set capacity alerts at 70% utilization so you have time to scale before hitting limits.
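The forecasting step can start very simply. A dependency-free sketch using average month-over-month growth (the traffic figures and 2000 RPS capacity are illustrative sample data, and a least-squares fit would work equally well):

```python
# Sketch: naive linear traffic forecast against a 70% capacity alert line.
# Monthly peaks and total capacity are illustrative sample data.

monthly_peak_rps = [800, 850, 910, 960, 1020, 1080]  # last 6 months
capacity_rps = 2000
alert_threshold = 0.70 * capacity_rps  # page well before saturation

# Model growth as the average month-over-month delta.
deltas = [b - a for a, b in zip(monthly_peak_rps, monthly_peak_rps[1:])]
growth = sum(deltas) / len(deltas)

for months_ahead in range(1, 13):
    projected = monthly_peak_rps[-1] + growth * months_ahead
    if projected >= alert_threshold:
        print(f"Projected to cross 70% of capacity in ~{months_ahead} month(s)")
        break
```

Even a crude projection like this converts "we hit a wall during the spike" into "we have N months to add capacity," which is the point of the 70% alert line.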
