
IT Operations Toolkit

Boost productivity with this toolkit for managing infrastructure, monitoring, and incident response. Includes structured workflows, validation checks, and reusable patterns for development.

Skill · Cliptics · development · v1.0.0 · MIT

IT Operations Skill

A Claude Code skill for managing IT infrastructure operations — covering monitoring, incident response, capacity planning, change management, SLA tracking, and operational runbook creation.

When to Use This Skill

Choose this skill when:

  • Setting up monitoring and alerting for production services
  • Creating incident response runbooks and procedures
  • Planning capacity for growing infrastructure
  • Implementing change management processes
  • Tracking SLAs, SLOs, and error budgets
  • Automating operational tasks and health checks

Consider alternatives when:

  • You need infrastructure provisioning (use a Terraform/IaC skill)
  • You need application code debugging (use a debugging skill)
  • You need security operations (use a security skill)

Quick Start

```bash
# Add to your Claude Code project
claude mcp add it-operations

# Create a monitoring setup
claude "set up monitoring for our Node.js API with health checks and alerting"

# Create incident response runbook
claude "create an incident response runbook for database outages"
```

```javascript
// Health check endpoint
app.get('/health', async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    database: await checkDatabase(),
    redis: await checkRedis(),
    memory: process.memoryUsage(),
  };
  const healthy = checks.database.ok && checks.redis.ok;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});
```
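The `checkDatabase` and `checkRedis` helpers referenced above are not defined in the snippet. A minimal sketch of one way to implement them, assuming each dependency exposes some async probe (e.g. a `SELECT 1` query or a Redis `PING`), is to wrap the probe with a timeout and normalize the result to the `{ ok }` shape the endpoint expects:

```javascript
// Hypothetical helper: races an async probe against a timeout and
// returns the { ok, latencyMs } shape the health endpoint checks.
async function checkDependency(name, probe, timeoutMs = 2000) {
  const start = Date.now();
  try {
    await Promise.race([
      probe(),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error(`${name} check timed out`)), timeoutMs)
      ),
    ]);
    return { ok: true, latencyMs: Date.now() - start };
  } catch (err) {
    return { ok: false, latencyMs: Date.now() - start, error: err.message };
  }
}

// The probes below are placeholders; in practice they would call e.g.
// pool.query('SELECT 1') or redisClient.ping().
const checkDatabase = () => checkDependency('database', async () => 'ok');
const checkRedis = () => checkDependency('redis', async () => 'PONG');
```

The timeout matters: a hung dependency should surface as `degraded` quickly rather than stalling the health endpoint itself.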

Core Concepts

SRE Metrics

| Metric | Definition | Target |
|---|---|---|
| SLA | Service Level Agreement (contract) | 99.9% uptime |
| SLO | Service Level Objective (target) | 99.95% uptime |
| SLI | Service Level Indicator (measurement) | Successful requests / total |
| Error Budget | Allowed downtime before SLO breach | 0.05% = 21.9 min/month |
| MTTD | Mean Time To Detect | < 5 minutes |
| MTTR | Mean Time To Recover | < 30 minutes |
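The error-budget figure follows directly from the SLO. A quick sketch of the arithmetic (the 43,800-minute window is an average month, 365 days / 12, which is what yields 21.9 minutes):

```javascript
// Error budget: the time a service may be down over a window
// before breaching its SLO.
function errorBudgetMinutes(sloPercent, windowMinutes = 43800) {
  return ((100 - sloPercent) / 100) * windowMinutes;
}

console.log(errorBudgetMinutes(99.95)); // ~21.9 minutes/month
console.log(errorBudgetMinutes(99.9));  // ~43.8 minutes/month
```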

Incident Response Levels

| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Service down, all users affected | 15 minutes | Database outage, API 500s |
| SEV2 | Major feature broken | 30 minutes | Payment processing failure |
| SEV3 | Minor feature degraded | 4 hours | Slow search, UI glitches |
| SEV4 | Cosmetic or low-impact | Next sprint | Typo, minor UI inconsistency |

Monitoring Stack

Structured logging example:

```json
{
  "level": "error",
  "message": "Database connection failed",
  "service": "user-api",
  "host": "prod-api-01",
  "error": { "code": "ECONNREFUSED", "attempts": 3 },
  "timestamp": "2026-03-13T10:15:30Z",
  "trace_id": "abc123"
}
```
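A minimal logger that emits the JSON shape above might look like this (field names are taken from the sample; in practice `trace_id` would come from your tracing context rather than being passed manually):

```javascript
// Minimal structured JSON logger matching the log shape above.
function log(level, message, fields = {}) {
  const entry = {
    level,
    message,
    timestamp: new Date().toISOString(),
    ...fields, // service, host, error, trace_id, etc.
  };
  process.stdout.write(JSON.stringify(entry) + '\n');
  return entry; // returned so callers/tests can inspect the entry
}

log('error', 'Database connection failed', {
  service: 'user-api',
  host: 'prod-api-01',
  error: { code: 'ECONNREFUSED', attempts: 3 },
  trace_id: 'abc123',
});
```

One JSON object per line keeps the output ingestible by any log aggregator without a custom parser.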

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| monitoring_tool | string | "datadog" | Monitoring: datadog, prometheus, cloudwatch |
| alerting_channel | string | "slack" | Alert destination: slack, pagerduty, email |
| health_check_interval | number | 30 | Health check frequency in seconds |
| log_format | string | "json" | Log format: json, text |
| slo_target | number | 99.9 | Service level objective percentage |
| error_budget_window | string | "30d" | Error budget calculation window |
| incident_severity_levels | number | 4 | Number of severity levels |
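The defaults above can be expressed as a config object with light validation (parameter names mirror the table; the merge-and-validate pattern is an assumption, not part of the skill itself):

```javascript
// Defaults mirroring the configuration table; user overrides merge on top.
const DEFAULTS = {
  monitoring_tool: 'datadog',     // datadog | prometheus | cloudwatch
  alerting_channel: 'slack',      // slack | pagerduty | email
  health_check_interval: 30,      // seconds
  log_format: 'json',             // json | text
  slo_target: 99.9,               // percent
  error_budget_window: '30d',
  incident_severity_levels: 4,
};

function loadConfig(overrides = {}) {
  const cfg = { ...DEFAULTS, ...overrides };
  if (cfg.slo_target <= 0 || cfg.slo_target >= 100) {
    throw new Error('slo_target must be between 0 and 100 (exclusive)');
  }
  return cfg;
}
```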

Best Practices

  1. Monitor the four golden signals — track latency, traffic, errors, and saturation for every service; these four metrics capture the health of any system from the user's perspective.

  2. Create runbooks for every recurring incident — document the symptoms, diagnosis steps, and resolution procedures; runbooks reduce MTTR and enable on-call engineers to resolve issues without expert knowledge.

  3. Set alerts on SLOs, not raw metrics — alerting on "error rate > 1%" is noisy; alerting on "error budget burn rate exceeds 10x normal" captures real problems while ignoring healthy spikes.

  4. Practice incident response before you need it — run game days and tabletop exercises to test runbooks and on-call procedures; the worst time to discover a gap is during a real incident.

  5. Track operational metrics over time — measure MTTD, MTTR, and incident frequency monthly; improving these metrics systematically is more effective than reacting to individual incidents.
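Best practice 3 (alert on error-budget burn rate, not raw metrics) can be sketched as a simple calculation, assuming you can sample the observed error rate over a short window:

```javascript
// Burn rate: how fast the error budget is being consumed, relative to
// the rate that would exactly exhaust it over the full window.
// For a 99.9% SLO the budget rate is 0.001 (0.1% of requests may fail).
function burnRate(observedErrorRate, sloTarget = 99.9) {
  const budgetRate = (100 - sloTarget) / 100;
  return observedErrorRate / budgetRate;
}

// Alert only when the budget is burning far faster than sustainable.
function shouldAlert(observedErrorRate, sloTarget = 99.9, threshold = 10) {
  return burnRate(observedErrorRate, sloTarget) > threshold;
}
```

A 2% error rate against a 99.9% SLO is a 20x burn (alert); a 0.1% error rate is a 1x burn, exactly on budget, so no page is sent even though errors are nonzero.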

Common Issues

Alert fatigue from too many notifications — Reduce alerts to actionable items only. Remove alerts that fire frequently but require no action. Group related alerts and set appropriate thresholds based on historical data.

Runbooks become outdated as systems change — Link runbooks to the code/config they reference. When infrastructure changes, update the runbook in the same PR. Add a "last verified" date to each runbook.

Monitoring gaps discovered during incidents — After every incident, conduct a blameless postmortem and add monitoring for the failure mode that was missed. Build monitoring improvements into the postmortem action items.
