IT Operations Toolkit
Boost productivity when managing infrastructure, monitoring, and incidents. Includes structured workflows, validation checks, and reusable patterns for operations work.
IT Operations Skill
A Claude Code skill for managing IT infrastructure operations — covering monitoring, incident response, capacity planning, change management, SLA tracking, and operational runbook creation.
When to Use This Skill
Choose this skill when:
- Setting up monitoring and alerting for production services
- Creating incident response runbooks and procedures
- Planning capacity for growing infrastructure
- Implementing change management processes
- Tracking SLAs, SLOs, and error budgets
- Automating operational tasks and health checks
Consider alternatives when:
- You need infrastructure provisioning (use a Terraform/IaC skill)
- You need application code debugging (use a debugging skill)
- You need security operations (use a security skill)
Quick Start
```bash
# Add to your Claude Code project
claude mcp add it-operations

# Create a monitoring setup
claude "set up monitoring for our Node.js API with health checks and alerting"

# Create an incident response runbook
claude "create an incident response runbook for database outages"
```
```javascript
// Health check endpoint
app.get('/health', async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    database: await checkDatabase(),
    redis: await checkRedis(),
    memory: process.memoryUsage(),
  };
  const healthy = checks.database.ok && checks.redis.ok;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});
```
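The endpoint above assumes `checkDatabase` and `checkRedis` helpers. A minimal sketch of one way to write them, wrapping each probe in a timeout so a hung dependency cannot stall the `/health` endpoint itself (the `db` and `redis` client objects here are placeholders for your real clients):

```javascript
// Run a probe with a deadline; return { ok } instead of throwing so the
// health endpoint can report per-dependency status.
async function withTimeout(probe, ms = 2000) {
  let timer;
  try {
    await Promise.race([
      probe(),
      new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('timeout')), ms);
      }),
    ]);
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err.message };
  } finally {
    clearTimeout(timer);
  }
}

// Example probes — replace db/redis with your actual client instances.
const checkDatabase = () => withTimeout(() => db.query('SELECT 1'));
const checkRedis = () => withTimeout(() => redis.ping());
```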
Core Concepts
SRE Metrics
| Metric | Definition | Target |
|---|---|---|
| SLA | Service Level Agreement (contract) | 99.9% uptime |
| SLO | Service Level Objective (target) | 99.95% uptime |
| SLI | Service Level Indicator (measurement) | Successful requests / total |
| Error Budget | Allowed downtime before SLO breach | 0.05% = 21.9 min/month |
| MTTD | Mean Time To Detect | < 5 minutes |
| MTTR | Mean Time To Recover | < 30 minutes |
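The error-budget arithmetic behind the table can be sketched as a small helper (names are illustrative; the table's 21.9 min/month figure uses an average month of about 30.4 days, while a flat 30-day window gives roughly 21.6 minutes):

```javascript
// Convert an SLO percentage into an error budget (allowed downtime)
// over a rolling window of the given length in days.
function errorBudgetMinutes(sloPercent, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60;        // minutes in the window
  const budgetFraction = (100 - sloPercent) / 100;  // allowed failure fraction
  return totalMinutes * budgetFraction;
}

errorBudgetMinutes(99.95);        // ≈ 21.6 minutes over a 30-day window
errorBudgetMinutes(99.9, 30.44);  // ≈ 43.8 minutes over an average month
```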
Incident Response Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Service down, all users affected | 15 minutes | Database outage, API 500s |
| SEV2 | Major feature broken | 30 minutes | Payment processing failure |
| SEV3 | Minor feature degraded | 4 hours | Slow search, UI glitches |
| SEV4 | Cosmetic or low-impact | Next sprint | Typo, minor UI inconsistency |
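A severity policy like the table above can be encoded as data so alert routing stays consistent with the documented response times. This is a hypothetical sketch; the field names and paging rules are assumptions, not part of the skill's API:

```javascript
// Each severity maps to a maximum acknowledgement time (minutes) and a
// paging decision, mirroring the severity table.
const SEVERITY_POLICY = {
  SEV1: { ackMinutes: 15, page: true },
  SEV2: { ackMinutes: 30, page: true },
  SEV3: { ackMinutes: 240, page: false },  // 4 hours, business-hours queue
  SEV4: { ackMinutes: null, page: false }, // scheduled into the next sprint
};

function responsePlan(severity) {
  const policy = SEVERITY_POLICY[severity];
  if (!policy) throw new Error(`Unknown severity: ${severity}`);
  return policy;
}
```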
Monitoring Stack
Structured logging example:

```json
{
  "level": "error",
  "message": "Database connection failed",
  "service": "user-api",
  "host": "prod-api-01",
  "error": { "code": "ECONNREFUSED", "attempts": 3 },
  "timestamp": "2026-03-13T10:15:30Z",
  "trace_id": "abc123"
}
```
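A minimal logger producing entries shaped like the example above could look like this. The field names (`service`, `host`, `trace_id`) follow the sample log line; everything else is an illustrative sketch, not the skill's actual logger:

```javascript
// Factory returning a logger bound to a service and host; each call
// emits one JSON line with a timestamp plus any extra fields.
function makeLogger(service, host) {
  return (level, message, extra = {}) => {
    const entry = {
      level,
      message,
      service,
      host,
      timestamp: new Date().toISOString(),
      ...extra, // e.g. { error: {...}, trace_id: 'abc123' }
    };
    console.log(JSON.stringify(entry));
    return entry;
  };
}

const log = makeLogger('user-api', 'prod-api-01');
log('error', 'Database connection failed', {
  error: { code: 'ECONNREFUSED', attempts: 3 },
  trace_id: 'abc123',
});
```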
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| monitoring_tool | string | "datadog" | Monitoring backend: datadog, prometheus, cloudwatch |
| alerting_channel | string | "slack" | Alert destination: slack, pagerduty, email |
| health_check_interval | number | 30 | Health check frequency in seconds |
| log_format | string | "json" | Log format: json, text |
| slo_target | number | 99.9 | Service level objective percentage |
| error_budget_window | string | "30d" | Error budget calculation window |
| incident_severity_levels | number | 4 | Number of severity levels |
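The defaults from the table can be merged with user overrides and sanity-checked. This is a sketch under the assumption that the skill accepts a plain options object; the validation rules are illustrative:

```javascript
// Defaults taken from the configuration table above.
const DEFAULTS = {
  monitoring_tool: 'datadog',
  alerting_channel: 'slack',
  health_check_interval: 30,
  log_format: 'json',
  slo_target: 99.9,
  error_budget_window: '30d',
  incident_severity_levels: 4,
};

function loadConfig(overrides = {}) {
  const config = { ...DEFAULTS, ...overrides };
  if (config.slo_target <= 0 || config.slo_target >= 100) {
    throw new Error('slo_target must be between 0 and 100 (exclusive)');
  }
  if (!['json', 'text'].includes(config.log_format)) {
    throw new Error(`Unsupported log_format: ${config.log_format}`);
  }
  return config;
}
```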
Best Practices
- Monitor the four golden signals — track latency, traffic, errors, and saturation for every service; these four metrics capture the health of any system from the user's perspective.
- Create runbooks for every recurring incident — document the symptoms, diagnosis steps, and resolution procedures; runbooks reduce MTTR and enable on-call engineers to resolve issues without expert knowledge.
- Set alerts on SLOs, not raw metrics — alerting on "error rate > 1%" is noisy; alerting on "error budget burn rate exceeds 10x normal" captures real problems while ignoring healthy spikes.
- Practice incident response before you need it — run game days and tabletop exercises to test runbooks and on-call procedures; the worst time to discover a gap is during a real incident.
- Track operational metrics over time — measure MTTD, MTTR, and incident frequency monthly; improving these metrics systematically is more effective than reacting to individual incidents.
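The burn-rate idea from the SLO alerting practice above can be sketched as follows: burn rate is the observed error fraction divided by the error fraction the budget allows, so a value of 1 means the budget would be exactly exhausted over the window, and a sustained value around 10x is what typically warrants a page. The function name and thresholds are illustrative:

```javascript
// Burn rate = observed error fraction / budgeted error fraction.
// e.g. with a 99.9% SLO, 50 errors in 10,000 requests burns at ~5x.
function burnRate(errorCount, requestCount, sloPercent) {
  const allowed = (100 - sloPercent) / 100; // budgeted error fraction
  const observed = errorCount / requestCount;
  return observed / allowed;
}
```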
Common Issues
Alert fatigue from too many notifications — Reduce alerts to actionable items only. Remove alerts that fire frequently but require no action. Group related alerts and set appropriate thresholds based on historical data.
Runbooks become outdated as systems change — Link runbooks to the code/config they reference. When infrastructure changes, update the runbook in the same PR. Add a "last verified" date to each runbook.
Monitoring gaps discovered during incidents — After every incident, conduct a blameless postmortem and add monitoring for the failure mode that was missed. Build monitoring improvements into the postmortem action items.