IT Operations Toolkit
Boost productivity when managing infrastructure, monitoring, and incidents. Includes structured workflows, validation checks, and reusable patterns for operations work.
IT Operations Skill
A Claude Code skill for managing IT infrastructure operations — covering monitoring, incident response, capacity planning, change management, SLA tracking, and operational runbook creation.
When to Use This Skill
Choose this skill when:
- Setting up monitoring and alerting for production services
- Creating incident response runbooks and procedures
- Planning capacity for growing infrastructure
- Implementing change management processes
- Tracking SLAs, SLOs, and error budgets
- Automating operational tasks and health checks
Consider alternatives when:
- You need infrastructure provisioning (use a Terraform/IaC skill)
- You need application code debugging (use a debugging skill)
- You need security operations (use a security skill)
Quick Start
```bash
# Add to your Claude Code project
claude mcp add it-operations

# Create a monitoring setup
claude "set up monitoring for our Node.js API with health checks and alerting"

# Create an incident response runbook
claude "create an incident response runbook for database outages"
```
```javascript
// Health check endpoint
app.get('/health', async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    database: await checkDatabase(),
    redis: await checkRedis(),
    memory: process.memoryUsage(),
  };
  const healthy = checks.database.ok && checks.redis.ok;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});
```
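The endpoint above assumes `checkDatabase` and `checkRedis` helpers. A minimal sketch of one way to write them, wrapping each probe in a timeout so a hung dependency cannot stall the `/health` endpoint itself (the `db` and `redis` client objects here are placeholders for your real clients):

```javascript
// Run a probe with a deadline; return { ok } instead of throwing so the
// health endpoint can report per-dependency status.
async function withTimeout(probe, ms = 2000) {
  let timer;
  try {
    await Promise.race([
      probe(),
      new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('timeout')), ms);
      }),
    ]);
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err.message };
  } finally {
    clearTimeout(timer);
  }
}

// Example probes — replace db/redis with your actual client instances.
const checkDatabase = () => withTimeout(() => db.query('SELECT 1'));
const checkRedis = () => withTimeout(() => redis.ping());
```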
Core Concepts
SRE Metrics
| Metric | Definition | Target |
|---|---|---|
| SLA | Service Level Agreement (contract) | 99.9% uptime |
| SLO | Service Level Objective (target) | 99.95% uptime |
| SLI | Service Level Indicator (measurement) | Successful requests / total |
| Error Budget | Allowed downtime before SLO breach | 0.05% = 21.9 min/month |
| MTTD | Mean Time To Detect | < 5 minutes |
| MTTR | Mean Time To Recover | < 30 minutes |
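The error-budget arithmetic behind the table can be sketched as a small helper (names are illustrative; the table's 21.9 min/month figure uses an average month of about 30.4 days, while a flat 30-day window gives roughly 21.6 minutes):

```javascript
// Convert an SLO percentage into an error budget (allowed downtime)
// over a rolling window of the given length in days.
function errorBudgetMinutes(sloPercent, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60;        // minutes in the window
  const budgetFraction = (100 - sloPercent) / 100;  // allowed failure fraction
  return totalMinutes * budgetFraction;
}

errorBudgetMinutes(99.95);        // ≈ 21.6 minutes over a 30-day window
errorBudgetMinutes(99.9, 30.44);  // ≈ 43.8 minutes over an average month
```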
Incident Response Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Service down, all users affected | 15 minutes | Database outage, API 500s |
| SEV2 | Major feature broken | 30 minutes | Payment processing failure |
| SEV3 | Minor feature degraded | 4 hours | Slow search, UI glitches |
| SEV4 | Cosmetic or low-impact | Next sprint | Typo, minor UI inconsistency |
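A severity policy like the table above can be encoded as data so alert routing stays consistent with the documented response times. This is a hypothetical sketch; the field names and paging rules are assumptions, not part of the skill's API:

```javascript
// Each severity maps to a maximum acknowledgement time (minutes) and a
// paging decision, mirroring the severity table.
const SEVERITY_POLICY = {
  SEV1: { ackMinutes: 15, page: true },
  SEV2: { ackMinutes: 30, page: true },
  SEV3: { ackMinutes: 240, page: false },  // 4 hours, business-hours queue
  SEV4: { ackMinutes: null, page: false }, // scheduled into the next sprint
};

function responsePlan(severity) {
  const policy = SEVERITY_POLICY[severity];
  if (!policy) throw new Error(`Unknown severity: ${severity}`);
  return policy;
}
```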
Monitoring Stack
Structured logging example:

```json
{
  "level": "error",
  "message": "Database connection failed",
  "service": "user-api",
  "host": "prod-api-01",
  "error": { "code": "ECONNREFUSED", "attempts": 3 },
  "timestamp": "2026-03-13T10:15:30Z",
  "trace_id": "abc123"
}
```
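A minimal logger producing entries shaped like the example above could look like this. The field names (`service`, `host`, `trace_id`) follow the sample log line; everything else is an illustrative sketch, not the skill's actual logger:

```javascript
// Factory returning a logger bound to a service and host; each call
// emits one JSON line with a timestamp plus any extra fields.
function makeLogger(service, host) {
  return (level, message, extra = {}) => {
    const entry = {
      level,
      message,
      service,
      host,
      timestamp: new Date().toISOString(),
      ...extra, // e.g. { error: {...}, trace_id: 'abc123' }
    };
    console.log(JSON.stringify(entry));
    return entry;
  };
}

const log = makeLogger('user-api', 'prod-api-01');
log('error', 'Database connection failed', {
  error: { code: 'ECONNREFUSED', attempts: 3 },
  trace_id: 'abc123',
});
```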
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| monitoring_tool | string | "datadog" | Monitoring backend: datadog, prometheus, cloudwatch |
| alerting_channel | string | "slack" | Alert destination: slack, pagerduty, email |
| health_check_interval | number | 30 | Health check frequency in seconds |
| log_format | string | "json" | Log format: json, text |
| slo_target | number | 99.9 | Service level objective percentage |
| error_budget_window | string | "30d" | Error budget calculation window |
| incident_severity_levels | number | 4 | Number of severity levels |
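The defaults from the table can be merged with user overrides and sanity-checked. This is a sketch under the assumption that the skill accepts a plain options object; the validation rules are illustrative:

```javascript
// Defaults taken from the configuration table above.
const DEFAULTS = {
  monitoring_tool: 'datadog',
  alerting_channel: 'slack',
  health_check_interval: 30,
  log_format: 'json',
  slo_target: 99.9,
  error_budget_window: '30d',
  incident_severity_levels: 4,
};

function loadConfig(overrides = {}) {
  const config = { ...DEFAULTS, ...overrides };
  if (config.slo_target <= 0 || config.slo_target >= 100) {
    throw new Error('slo_target must be between 0 and 100 (exclusive)');
  }
  if (!['json', 'text'].includes(config.log_format)) {
    throw new Error(`Unsupported log_format: ${config.log_format}`);
  }
  return config;
}
```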
Best Practices
- Monitor the four golden signals — track latency, traffic, errors, and saturation for every service; these four metrics capture the health of any system from the user's perspective.
- Create runbooks for every recurring incident — document the symptoms, diagnosis steps, and resolution procedures; runbooks reduce MTTR and enable on-call engineers to resolve issues without expert knowledge.
- Set alerts on SLOs, not raw metrics — alerting on "error rate > 1%" is noisy; alerting on "error budget burn rate exceeds 10x normal" captures real problems while ignoring healthy spikes.
- Practice incident response before you need it — run game days and tabletop exercises to test runbooks and on-call procedures; the worst time to discover a gap is during a real incident.
- Track operational metrics over time — measure MTTD, MTTR, and incident frequency monthly; improving these metrics systematically is more effective than reacting to individual incidents.
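The burn-rate idea from the SLO alerting practice above can be sketched as follows: burn rate is the observed error fraction divided by the error fraction the budget allows, so a value of 1 means the budget would be exactly exhausted over the window, and a sustained value around 10x is what typically warrants a page. The function name and thresholds are illustrative:

```javascript
// Burn rate = observed error fraction / budgeted error fraction.
// e.g. with a 99.9% SLO, 50 errors in 10,000 requests burns at ~5x.
function burnRate(errorCount, requestCount, sloPercent) {
  const allowed = (100 - sloPercent) / 100; // budgeted error fraction
  const observed = errorCount / requestCount;
  return observed / allowed;
}
```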
Common Issues
Alert fatigue from too many notifications — Reduce alerts to actionable items only. Remove alerts that fire frequently but require no action. Group related alerts and set appropriate thresholds based on historical data.
Runbooks become outdated as systems change — Link runbooks to the code/config they reference. When infrastructure changes, update the runbook in the same PR. Add a "last verified" date to each runbook.
Monitoring gaps discovered during incidents — After every incident, conduct a blameless postmortem and add monitoring for the failure mode that was missed. Build monitoring improvements into the postmortem action items.