
SLO Implement Command

Define and implement Service Level Objectives (SLOs) with corresponding SLIs, error budgets, and alerting rules. Generates monitoring configurations for Prometheus, Grafana, Datadog, or custom metrics systems to track service reliability.

Command | Community | monitoring | v1.0.0 | MIT

Command

/slo-implement

Description

Helps you define meaningful SLOs for your services, generates the monitoring configuration to track them, and sets up error budget alerts. Follows Google SRE best practices for reliability engineering.

Behavior

  1. Analyze the service to understand user-facing interactions
  2. Propose SLIs (Service Level Indicators) aligned with user experience
  3. Set SLO targets with justification
  4. Calculate error budgets and burn rate alerts
  5. Generate monitoring configuration (Prometheus rules, Grafana dashboards)

SLI Types

| Type         | Measures                             | Example                                  |
|--------------|--------------------------------------|------------------------------------------|
| Availability | Successful requests / total requests | 99.9% of requests return non-5xx         |
| Latency      | Requests faster than threshold       | 95% of requests complete in under 200 ms |
| Correctness  | Correct results / total results      | 99.99% of calculations are accurate      |
| Freshness    | Data updated within threshold        | 99% of data updated within 1 minute      |
| Throughput   | Processed items / expected items     | 99.5% of queue items processed           |
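Every SLI type in the table reduces to the same shape: a count of good events divided by a count of total events, compared against a target. The following is a minimal sketch of that ratio, not part of the command itself; `SliSample`, `sli_ratio`, and `meets_target` are hypothetical names, and treating a window with no traffic as fully good is an assumption you may want to change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliSample:
    """Event counts for one measurement window (hypothetical structure)."""
    good: int
    total: int


def sli_ratio(sample: SliSample) -> float:
    """Return good/total. Assumption: a window with no events counts as 1.0."""
    if sample.total == 0:
        return 1.0
    return sample.good / sample.total


def meets_target(sample: SliSample, target: float) -> bool:
    """True when the measured ratio is at or above the SLO target (0..1)."""
    return sli_ratio(sample) >= target


# Illustrative numbers only: 99,950 non-5xx responses out of 100,000 requests,
# and 94,000 of 100,000 requests finishing under the latency threshold.
availability = SliSample(good=99_950, total=100_000)
latency = SliSample(good=94_000, total=100_000)

print(meets_target(availability, 0.999))  # availability target of 99.9%
print(meets_target(latency, 0.95))        # latency target of 95%
```

With these made-up counts the availability sample passes its 99.9% target and the latency sample misses its 95% target.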

Output Format

SLO Definition Document

    service: payment-api
    slos:
      - name: availability
        description: "Payment API returns successful responses"
        sli:
          type: availability
          good_events: "http_requests_total{status!~'5..'}"
          total_events: "http_requests_total"
        target: 99.95%
        window: 30d
        error_budget: 0.05%  # ~21.6 minutes per 30-day window
        consequences:
          budget_exhausted: "Freeze deployments, focus on reliability"
          50_percent_remaining: "Increase monitoring, limit risky changes"

      - name: latency
        description: "Payment API responds quickly"
        sli:
          type: latency
          threshold: 500ms
          good_events: "http_request_duration_seconds_bucket{le='0.5'}"
          # Use the histogram's own _count series so good and total events
          # come from the same metric.
          total_events: "http_request_duration_seconds_count"
        target: 99%
        window: 30d
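The `~21.6 minutes` figure in the error budget comment follows directly from the target and the window. A short sketch of that arithmetic, with hypothetical function names, assuming the budget is expressed as full-outage time:

```python
def error_budget_fraction(target_percent: float) -> float:
    """Fraction of events (or time) allowed to fail, e.g. 99.95 -> 0.0005."""
    return (100.0 - target_percent) / 100.0


def error_budget_minutes(target_percent: float, window_days: int) -> float:
    """Error budget expressed as minutes of total outage within the window."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(target_percent) * window_minutes


# 99.95% over 30 days: 0.0005 * 43,200 minutes
print(f"{error_budget_minutes(99.95, 30):.1f}")  # prints 21.6

# 99% over 30 days: 0.01 * 43,200 minutes
print(f"{error_budget_minutes(99.0, 30):.1f}")   # prints 432.0
```

For a request-based SLI the same fraction applies to request counts rather than minutes, so the time figure is only an intuition aid.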

Prometheus Recording Rules

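    groups:
      - name: slo_payment_api
        interval: 30s
        rules:
          # Availability SLI (5-minute rate)
          - record: slo:payment_api:availability:ratio_rate5m
            expr: |
              sum(rate(http_requests_total{service="payment-api",status!~"5.."}[5m]))
              /
              sum(rate(http_requests_total{service="payment-api"}[5m]))

          # Availability SLI over the full 30-day SLO window
          # (required by the error budget rule below)
          - record: slo:payment_api:availability:ratio_rate30d
            expr: |
              sum(rate(http_requests_total{service="payment-api",status!~"5.."}[30d]))
              /
              sum(rate(http_requests_total{service="payment-api"}[30d]))

          # Error budget remaining (30-day window): 1 = untouched, 0 = exhausted
          - record: slo:payment_api:availability:error_budget_remaining
            expr: |
              1 - (
                (1 - slo:payment_api:availability:ratio_rate30d)
                /
                (1 - 0.9995)
              )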
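To make the `error_budget_remaining` expression easier to sanity-check, here is the same formula as a plain function. This is a sketch with a hypothetical name, mirroring the PromQL above; the inputs are illustrative, not measured values.

```python
def error_budget_remaining(observed_ratio: float, target: float) -> float:
    """Mirror of the recording rule: 1 - (observed error rate / allowed error rate).

    Returns 1.0 when no budget is spent, 0.0 when it is exactly exhausted,
    and a negative number when the SLO has been missed.
    """
    allowed_error = 1.0 - target
    observed_error = 1.0 - observed_ratio
    return 1.0 - observed_error / allowed_error


TARGET = 0.9995  # the 99.95% availability target from the SLO definition

for ratio in (1.0, 0.99975, 0.9995, 0.999):
    print(ratio, round(error_budget_remaining(ratio, TARGET), 3))
```

A 30-day availability of 99.975% leaves roughly half the budget, 99.95% leaves none, and 99.9% overspends it by about one full budget.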

Burn Rate Alerts

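    groups:
      - name: slo_alerts_payment_api
        rules:
          # Error-rate series used by the alerts (one per lookback window)
          - record: slo:payment_api:availability:error_rate5m
            expr: |
              sum(rate(http_requests_total{service="payment-api",status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{service="payment-api"}[5m]))
          - record: slo:payment_api:availability:error_rate1h
            expr: |
              sum(rate(http_requests_total{service="payment-api",status=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{service="payment-api"}[1h]))
          - record: slo:payment_api:availability:error_rate6h
            expr: |
              sum(rate(http_requests_total{service="payment-api",status=~"5.."}[6h]))
              /
              sum(rate(http_requests_total{service="payment-api"}[6h]))

          # Fast burn: 14.4x budget consumption
          # (about 2% of the 30-day budget per hour; pages on-call)
          - alert: PaymentAPIHighErrorBurnRate
            expr: |
              slo:payment_api:availability:error_rate5m > (14.4 * 0.0005)
              and
              slo:payment_api:availability:error_rate1h > (14.4 * 0.0005)
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Payment API burning error budget 14.4x faster than sustainable"
              # $value is the 5m error rate (left-hand side of the expression)
              current_error_rate: "{{ $value | humanizePercentage }}"

          # Slow burn: 3x budget consumption
          # (budget exhausted in about 10 days if sustained; opens a ticket)
          - alert: PaymentAPIElevatedErrorRate
            expr: |
              slo:payment_api:availability:error_rate6h > (3 * 0.0005)
            for: 30m
            labels:
              severity: warning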
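The constants in the alert expressions (`14.4 * 0.0005` and `3 * 0.0005`) are burn-rate multipliers applied to the allowed error rate. The sketch below shows how those thresholds, and the time a sustained burn would take to exhaust the budget, can be derived. Function names are hypothetical, and the 30-day window is taken from the SLO definition.

```python
def burn_rate_threshold(burn_rate: float, target: float) -> float:
    """Error rate at which the budget is consumed `burn_rate` times too fast."""
    return burn_rate * (1.0 - target)


def hours_to_exhaustion(burn_rate: float, window_days: int) -> float:
    """Hours until a full budget is gone if the burn rate is sustained."""
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate: float, lookback_hours: float, window_days: int) -> float:
    """Fraction of the window's budget spent during one alert lookback period."""
    return burn_rate * lookback_hours / (window_days * 24)


TARGET = 0.9995
WINDOW_DAYS = 30

for name, rate, lookback in (("fast", 14.4, 1), ("slow", 3.0, 6)):
    print(
        name,
        round(burn_rate_threshold(rate, TARGET), 4),        # alert threshold
        round(hours_to_exhaustion(rate, WINDOW_DAYS), 1),   # hours until empty
        round(budget_consumed(rate, lookback, WINDOW_DAYS), 3),
    )
```

Under these assumptions the fast-burn alert fires at an error rate of about 0.72% and corresponds to spending roughly 2% of the budget in one hour; the slow-burn alert fires at about 0.15%, which would empty the budget in about 10 days.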

Rules

  1. SLOs should reflect user experience, not system internals
  2. Start conservative - it's easier to tighten SLOs than loosen them
  3. Every SLO needs an error budget policy defining actions when budget is low
  4. Max 3-5 SLOs per service to maintain focus
  5. Review SLOs quarterly and adjust based on actual performance
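Rules 3 and 4 are mechanical enough to check automatically. The following is a rough sketch of such a check, not something the command is documented to do; `lint_slo_document` is a hypothetical helper, and it assumes the SLO definition has already been parsed into a dictionary with an error budget policy stored under `consequences`.

```python
MAX_SLOS_PER_SERVICE = 5  # upper bound from rule 4


def lint_slo_document(doc: dict) -> list:
    """Return human-readable problems found in a parsed SLO definition."""
    problems = []
    slos = doc.get("slos", [])

    # Rule 4: keep the number of SLOs per service small.
    if len(slos) > MAX_SLOS_PER_SERVICE:
        problems.append(
            f"{len(slos)} SLOs defined; keep it to {MAX_SLOS_PER_SERVICE} or fewer"
        )

    # Rule 3: every SLO needs an error budget policy.
    for slo in slos:
        if not slo.get("consequences"):
            name = slo.get("name", "<unnamed>")
            problems.append(f"{name}: missing error budget policy")

    return problems


# Hypothetical parsed document: one SLO with a policy, one without.
example = {
    "service": "payment-api",
    "slos": [
        {"name": "availability", "consequences": {"budget_exhausted": "Freeze deployments"}},
        {"name": "latency"},
    ],
}

for problem in lint_slo_document(example):
    print(problem)
```

Here the check flags only the `latency` entry, because it has no `consequences` block.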

Examples

    # Define SLOs interactively for a service
    /slo-implement payment-api

    # Generate Prometheus rules
    /slo-implement payment-api --format prometheus

    # Generate Datadog monitors
    /slo-implement payment-api --format datadog