# Specialist DevOps Troubleshooter
A DevOps troubleshooting specialist focused on rapid incident response and debugging, with expertise in log analysis, container debugging, network diagnostics, and CI/CD pipeline issue resolution.
## When to Use This Agent
Choose Specialist DevOps Troubleshooter when:
- Production systems are down and you need rapid diagnosis
- CI/CD pipelines are failing with unclear error messages
- Container or Kubernetes workloads are crashing or misbehaving
- Network connectivity issues between services need investigation
- Log analysis reveals anomalies that need correlation
Consider alternatives when:
- Designing new infrastructure (use a cloud architect agent)
- Implementing DevOps practices from scratch (use DevOps Expert Consultant)
- Investigating application-level bugs (use a debugging agent)
## Quick Start

```yaml
# .claude/agents/specialist-devops-troubleshooter.yml
name: Specialist DevOps Troubleshooter
description: Rapid DevOps incident diagnosis and resolution
model: claude-sonnet
tools:
  - Read
  - Bash
  - Glob
  - Grep
```
Example invocation:
```bash
claude "Our Kubernetes pods are in CrashLoopBackOff and the CI/CD pipeline is stuck — diagnose both issues and suggest fixes"
```
## Core Concepts

### Troubleshooting Decision Tree
| Symptom | First Check | Common Cause | Resolution |
|---|---|---|---|
| Pod `CrashLoopBackOff` | `kubectl logs <pod>` | Config error, missing secret | Fix config, update secret |
| Service unreachable | `kubectl get svc,ep` | Missing endpoints, wrong port | Fix selector labels |
| Pipeline timeout | CI job logs | Resource limits, hung test | Increase timeout, fix test |
| High CPU/memory | `kubectl top pods` | Memory leak, missing limits | Set limits, fix leak |
| DNS resolution fails | `nslookup` from pod | CoreDNS issue, wrong service name | Restart CoreDNS, fix name |
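The first checks in this tree can be bundled into a quick triage pass. A minimal sketch, assuming `kubectl` access; pod, service, and namespace names are placeholders:

```bash
# Quick first-pass triage for a crashing pod and its service.
# Sketch only -- pod, service, and namespace names are placeholders.
triage_pod() {
  local pod=$1 ns=${2:-default}
  echo "== Last logs for $pod =="
  # Prefer logs from the previous (crashed) container instance.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null ||
    kubectl logs "$pod" -n "$ns" --tail=50
  echo "== Recent events in $ns =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -n 10
}

triage_service() {
  local svc=$1 ns=${2:-default}
  # Empty ENDPOINTS output usually means the selector matches no pods.
  kubectl get endpoints "$svc" -n "$ns"
}
```

Run `triage_pod <pod> <namespace>` as the first step for a `CrashLoopBackOff`, and `triage_service <service>` when a service is unreachable.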
### Kubernetes Debugging Commands

```bash
# Pod diagnostics
kubectl describe pod <pod-name> -n <namespace>     # Events, conditions
kubectl logs <pod-name> -n <namespace> --previous  # Previous crash logs
kubectl exec -it <pod-name> -- sh                  # Shell into running pod
kubectl get events --sort-by='.lastTimestamp' -n <namespace>

# Service connectivity
kubectl get endpoints <service-name>               # Verify endpoints exist
kubectl run debug --image=busybox --rm -it -- wget -qO- http://<service>:<port>/health

# Resource issues
kubectl top pods -n <namespace> --sort-by=memory
kubectl describe node <node-name> | grep -A5 "Allocated resources"

# Network debugging
kubectl exec -it <pod> -- nslookup <service-name>
kubectl exec -it <pod> -- curl -v http://<service>:<port>
```
### Log Analysis Patterns

```bash
# Find errors in structured logs (JSON)
kubectl logs <pod> | jq 'select(.level == "error") | {time, message, error}'

# Correlate logs across services by request ID
kubectl logs -l app=api-gateway | grep "req-12345"
kubectl logs -l app=payment-service | grep "req-12345"
kubectl logs -l app=order-service | grep "req-12345"

# Find spike patterns in log timestamps
kubectl logs <pod> --since=1h | \
  awk '{print $1}' | \
  cut -d: -f1,2 | \
  sort | uniq -c | sort -rn | head

# Datadog log query
# service:payment-service status:error @http.status_code:500
# | stats count by @error.kind
```
## Configuration

| Parameter | Description | Default |
|---|---|---|
| `platform` | Infrastructure platform (`kubernetes`, `ecs`, `vms`) | Auto-detect |
| `log_backend` | Log aggregation tool (`elk`, `datadog`, `cloudwatch`) | Auto-detect |
| `monitoring` | Metrics platform (`prometheus`, `datadog`, `cloudwatch`) | Auto-detect |
| `urgency` | Troubleshooting urgency (`immediate`, `standard`) | `standard` |
| `escalation` | Auto-escalation rules | None |
| `runbook_path` | Path to runbook documentation | Auto-detect |
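The table does not show where these parameters live in the agent file; assuming they sit under a `config:` key alongside the Quick Start definition, a populated example might look like the following sketch:

```yaml
# Hypothetical placement -- the exact schema for these parameters
# is not documented here, so treat this as an illustrative sketch.
name: Specialist DevOps Troubleshooter
config:
  platform: kubernetes       # or ecs, vms (default: auto-detect)
  log_backend: datadog       # or elk, cloudwatch
  monitoring: prometheus     # or datadog, cloudwatch
  urgency: immediate         # or standard (the default)
  escalation: none
  runbook_path: docs/runbooks/
```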
## Best Practices

- Follow a systematic diagnosis order: symptoms → logs → metrics → traces. Start with what is observable (error messages, HTTP status codes). Then check application logs for error context. Then correlate with infrastructure metrics (CPU, memory, network). Finally, use distributed traces to follow the request path. Jumping straight to code investigation without gathering observability data wastes time.
- Reproduce the issue in a controlled environment before applying fixes. A fix applied to production without understanding the root cause may mask the real problem or introduce new issues. If the issue can be reproduced in staging, debug there. If it is production-only (scale, data, configuration), collect comprehensive diagnostic data before applying any fix. Document the reproduction steps for the post-mortem.
- Always check "what changed?" first. Most production incidents are caused by recent changes: deployments, configuration updates, infrastructure modifications, or traffic pattern shifts. Check deployment history (`kubectl rollout history`), recent config changes, and traffic patterns. If the issue started at 2:15 PM and a deployment happened at 2:10 PM, start investigating that deployment.
- Use `kubectl debug` for containers that crash before you can exec into them. When a container crashes immediately, `kubectl exec` fails because there is no running container. Use `kubectl debug <pod> --image=busybox --target=<container>` to attach an ephemeral debug container that shares the pod's network namespace (and, with `--target`, the target container's process namespace). This lets you inspect the environment even when the main container is not running.
- Document the diagnosis path, not just the fix. When resolving an incident, record: what symptoms were observed, what diagnostic commands were run, what data was collected, what the root cause was, and what fix was applied. This diagnosis path becomes a runbook for future similar incidents. Without it, the next on-call engineer starts from scratch when the same issue recurs.
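The "what changed?" check can be scripted as a first-responder sweep. A minimal sketch; deployment and namespace names are placeholders:

```bash
# First-responder "what changed?" sweep for a deployment.
# Sketch only -- deployment and namespace names are placeholders.
whatchanged() {
  local deploy=$1 ns=${2:-default}
  echo "== Rollout history =="
  kubectl rollout history "deployment/$deploy" -n "$ns"
  echo "== ConfigMaps by creation time =="
  kubectl get configmaps -n "$ns" --sort-by='.metadata.creationTimestamp'
  echo "== Recent events =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -n 20
}

# Usage: whatchanged api-gateway production
```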
## Common Issues
**Pods stuck in `Pending` state without clear error messages.** Check `kubectl describe pod` for scheduling events. Common causes: insufficient cluster resources (node CPU/memory exhausted), node affinity or taints preventing scheduling, an unbound PersistentVolumeClaim, or resource requests exceeding any node's capacity. Check `kubectl describe node` for resource allocation and `kubectl get pv,pvc` for volume issues.
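These checks can be gathered in one pass. A minimal sketch with placeholder names:

```bash
# Gather the usual suspects for a Pending pod in one pass.
# Sketch only -- pod and namespace names are placeholders.
pending_pod_report() {
  local pod=$1 ns=${2:-default}
  echo "== Scheduling events =="
  kubectl describe pod "$pod" -n "$ns" | grep -A10 'Events:' || true
  echo "== Node capacity =="
  kubectl describe nodes | grep -A5 'Allocated resources' || true
  echo "== Volume bindings =="
  kubectl get pv,pvc -n "$ns"
}

# Usage: pending_pod_report my-pod staging
```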
**CI/CD pipeline fails intermittently with no code changes.** Flaky pipelines are usually caused by external dependencies: Docker Hub rate limits, npm registry timeouts, flaky test infrastructure, or runner resource exhaustion. Check runner resource utilization, add retry logic for network-dependent steps, cache dependencies to reduce external calls, and isolate flaky tests. Track pipeline reliability as a metric and investigate any drop below 95%.
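Retry logic for network-dependent steps can be a small shell wrapper. A sketch; the attempt count and backoff values are illustrative:

```bash
# Retry a command up to N times with linear backoff.
# Sketch only -- tune attempts and backoff for your pipeline.
retry() {
  local attempts=$1; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts: $*" >&2
      return 1
    fi
    sleep "$n"          # linear backoff: 1s, 2s, 3s...
    n=$(( n + 1 ))
  done
}

# Usage in a CI step, e.g.:
#   retry 3 docker pull node:20-alpine
#   retry 5 npm ci
```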
**Services cannot communicate after network policy changes.** Kubernetes NetworkPolicies deny all non-matching traffic once any policy selects a pod. A new policy that allows ingress on port 8080 but does not allow egress for DNS (UDP port 53) breaks service discovery. Always include DNS egress rules in network policies. Test policies in a non-production namespace first. Use `kubectl describe networkpolicy` and network-policy visualization tools to verify rules.
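A policy that opens application ingress while keeping DNS egress working might look like the following sketch; the namespace, labels, and ports are placeholders:

```bash
# Apply a NetworkPolicy that allows ingress on 8080 AND keeps DNS egress open.
# Sketch only -- namespace, labels, and ports are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow
  namespace: staging
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress, Egress]
  ingress:
    - ports:
        - protocol: TCP
          port: 8080
  egress:
    # Without this rule, the policy silently breaks service discovery.
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
```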