# Specialist DevOps Troubleshooter
A DevOps troubleshooting specialist focused on rapid incident response and debugging, with expertise in log analysis, container debugging, network diagnostics, and CI/CD pipeline issue resolution.
## When to Use This Agent
Choose Specialist DevOps Troubleshooter when:
- Production systems are down and you need rapid diagnosis
- CI/CD pipelines are failing with unclear error messages
- Container or Kubernetes workloads are crashing or misbehaving
- Network connectivity issues between services need investigation
- Log analysis reveals anomalies that need correlation
Consider alternatives when:
- Designing new infrastructure (use a cloud architect agent)
- Implementing DevOps practices from scratch (use DevOps Expert Consultant)
- Investigating application-level bugs (use a debugging agent)
## Quick Start

```yaml
# .claude/agents/specialist-devops-troubleshooter.yml
name: Specialist DevOps Troubleshooter
description: Rapid DevOps incident diagnosis and resolution
model: claude-sonnet
tools:
  - Read
  - Bash
  - Glob
  - Grep
```
Example invocation:
```bash
claude "Our Kubernetes pods are in CrashLoopBackOff and the CI/CD pipeline is stuck — diagnose both issues and suggest fixes"
```
## Core Concepts

### Troubleshooting Decision Tree
| Symptom | First Check | Common Cause | Resolution |
|---|---|---|---|
| Pod `CrashLoopBackOff` | `kubectl logs <pod>` | Config error, missing secret | Fix config, update secret |
| Service unreachable | `kubectl get svc,ep` | Missing endpoints, wrong port | Fix selector labels |
| Pipeline timeout | CI job logs | Resource limits, hung test | Increase timeout, fix test |
| High CPU/memory | `kubectl top pods` | Memory leak, missing limits | Set limits, fix leak |
| DNS resolution fails | `nslookup` from pod | CoreDNS issue, wrong service name | Restart CoreDNS, fix name |
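The first checks in this tree can be bundled into a quick triage pass. A minimal sketch, assuming `kubectl` access; pod, service, and namespace names are placeholders:

```bash
# Quick first-pass triage for a crashing pod and its service.
# Sketch only -- pod, service, and namespace names are placeholders.
triage_pod() {
  local pod=$1 ns=${2:-default}
  echo "== Last logs for $pod =="
  # Prefer logs from the previous (crashed) container instance.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null ||
    kubectl logs "$pod" -n "$ns" --tail=50
  echo "== Recent events in $ns =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -n 10
}

triage_service() {
  local svc=$1 ns=${2:-default}
  # Empty ENDPOINTS output usually means the selector matches no pods.
  kubectl get endpoints "$svc" -n "$ns"
}
```

Run `triage_pod <pod> <namespace>` as the first step for a `CrashLoopBackOff`, and `triage_service <service>` when a service is unreachable.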
### Kubernetes Debugging Commands

```bash
# Pod diagnostics
kubectl describe pod <pod-name> -n <namespace>     # Events, conditions
kubectl logs <pod-name> -n <namespace> --previous  # Previous crash logs
kubectl exec -it <pod-name> -- sh                  # Shell into running pod
kubectl get events --sort-by='.lastTimestamp' -n <namespace>

# Service connectivity
kubectl get endpoints <service-name>               # Verify endpoints exist
kubectl run debug --image=busybox --rm -it -- wget -qO- http://<service>:<port>/health

# Resource issues
kubectl top pods -n <namespace> --sort-by=memory
kubectl describe node <node-name> | grep -A5 "Allocated resources"

# Network debugging
kubectl exec -it <pod> -- nslookup <service-name>
kubectl exec -it <pod> -- curl -v http://<service>:<port>
```
### Log Analysis Patterns

```bash
# Find errors in structured logs (JSON)
kubectl logs <pod> | jq 'select(.level == "error") | {time, message, error}'

# Correlate logs across services by request ID
kubectl logs -l app=api-gateway | grep "req-12345"
kubectl logs -l app=payment-service | grep "req-12345"
kubectl logs -l app=order-service | grep "req-12345"

# Find spike patterns in log timestamps
kubectl logs <pod> --since=1h | \
  awk '{print $1}' | \
  cut -d: -f1,2 | \
  sort | uniq -c | sort -rn | head

# Datadog log query
# service:payment-service status:error @http.status_code:500
# | stats count by @error.kind
```
## Configuration

| Parameter | Description | Default |
|---|---|---|
| `platform` | Infrastructure platform (`kubernetes`, `ecs`, `vms`) | Auto-detect |
| `log_backend` | Log aggregation tool (`elk`, `datadog`, `cloudwatch`) | Auto-detect |
| `monitoring` | Metrics platform (`prometheus`, `datadog`, `cloudwatch`) | Auto-detect |
| `urgency` | Troubleshooting urgency (`immediate`, `standard`) | `standard` |
| `escalation` | Auto-escalation rules | None |
| `runbook_path` | Path to runbook documentation | Auto-detect |
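The table does not show where these parameters live in the agent file; assuming they sit under a `config:` key alongside the Quick Start definition, a populated example might look like the following sketch:

```yaml
# Hypothetical placement -- the exact schema for these parameters
# is not documented here, so treat this as an illustrative sketch.
name: Specialist DevOps Troubleshooter
config:
  platform: kubernetes       # or ecs, vms (default: auto-detect)
  log_backend: datadog       # or elk, cloudwatch
  monitoring: prometheus     # or datadog, cloudwatch
  urgency: immediate         # or standard (the default)
  escalation: none
  runbook_path: docs/runbooks/
```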
## Best Practices

- Follow a systematic diagnosis order: symptoms → logs → metrics → traces. Start with what is observable (error messages, HTTP status codes). Then check application logs for error context. Then correlate with infrastructure metrics (CPU, memory, network). Finally, use distributed traces to follow the request path. Jumping straight to code investigation without gathering observability data wastes time.
- Reproduce the issue in a controlled environment before applying fixes. A fix applied to production without understanding the root cause may mask the real problem or introduce new issues. If the issue can be reproduced in staging, debug there. If it is production-only (scale, data, configuration), collect comprehensive diagnostic data before applying any fix. Document the reproduction steps for the post-mortem.
- Always check "what changed?" first. Most production incidents are caused by recent changes: deployments, configuration updates, infrastructure modifications, or traffic pattern shifts. Check deployment history (`kubectl rollout history`), recent config changes, and traffic patterns. If the issue started at 2:15 PM and a deployment happened at 2:10 PM, start investigating that deployment.
- Use `kubectl debug` for containers that crash before you can exec into them. When a container crashes immediately, `kubectl exec` fails because there is no running container. Use `kubectl debug <pod> --image=busybox --target=<container>` to attach an ephemeral debug container that shares the pod's network namespace (and, with `--target`, the target container's process namespace). This lets you inspect the environment even when the main container is not running.
- Document the diagnosis path, not just the fix. When resolving an incident, record: what symptoms were observed, what diagnostic commands were run, what data was collected, what the root cause was, and what fix was applied. This diagnosis path becomes a runbook for future similar incidents. Without it, the next on-call engineer starts from scratch when the same issue recurs.
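The "what changed?" check can be scripted as a first-responder sweep. A minimal sketch; deployment and namespace names are placeholders:

```bash
# First-responder "what changed?" sweep for a deployment.
# Sketch only -- deployment and namespace names are placeholders.
whatchanged() {
  local deploy=$1 ns=${2:-default}
  echo "== Rollout history =="
  kubectl rollout history "deployment/$deploy" -n "$ns"
  echo "== ConfigMaps by creation time =="
  kubectl get configmaps -n "$ns" --sort-by='.metadata.creationTimestamp'
  echo "== Recent events =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -n 20
}

# Usage: whatchanged api-gateway production
```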
## Common Issues
**Pods stuck in `Pending` state without clear error messages.** Check `kubectl describe pod` for scheduling events. Common causes: insufficient cluster resources (node CPU/memory exhausted), node affinity or taints preventing scheduling, an unbound PersistentVolumeClaim, or resource requests exceeding any node's capacity. Check `kubectl describe node` for resource allocation and `kubectl get pv,pvc` for volume issues.
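These checks can be gathered in one pass. A minimal sketch with placeholder names:

```bash
# Gather the usual suspects for a Pending pod in one pass.
# Sketch only -- pod and namespace names are placeholders.
pending_pod_report() {
  local pod=$1 ns=${2:-default}
  echo "== Scheduling events =="
  kubectl describe pod "$pod" -n "$ns" | grep -A10 'Events:' || true
  echo "== Node capacity =="
  kubectl describe nodes | grep -A5 'Allocated resources' || true
  echo "== Volume bindings =="
  kubectl get pv,pvc -n "$ns"
}

# Usage: pending_pod_report my-pod staging
```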
**CI/CD pipeline fails intermittently with no code changes.** Flaky pipelines are usually caused by external dependencies: Docker Hub rate limits, npm registry timeouts, flaky test infrastructure, or runner resource exhaustion. Check runner resource utilization, add retry logic for network-dependent steps, cache dependencies to reduce external calls, and isolate flaky tests. Track pipeline reliability as a metric and investigate any drop below 95%.
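Retry logic for network-dependent steps can be a small shell wrapper. A sketch; the attempt count and backoff values are illustrative:

```bash
# Retry a command up to N times with linear backoff.
# Sketch only -- tune attempts and backoff for your pipeline.
retry() {
  local attempts=$1; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts: $*" >&2
      return 1
    fi
    sleep "$n"          # linear backoff: 1s, 2s, 3s...
    n=$(( n + 1 ))
  done
}

# Usage in a CI step, e.g.:
#   retry 3 docker pull node:20-alpine
#   retry 5 npm ci
```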
**Services cannot communicate after network policy changes.** Kubernetes NetworkPolicies deny all non-matching traffic once any policy selects a pod. A new policy that allows ingress on port 8080 but does not allow egress for DNS (UDP port 53) breaks service discovery. Always include DNS egress rules in network policies. Test policies in a non-production namespace first. Use `kubectl describe networkpolicy` and network-policy visualization tools to verify rules.
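A policy that opens application ingress while keeping DNS egress working might look like the following sketch; the namespace, labels, and ports are placeholders:

```bash
# Apply a NetworkPolicy that allows ingress on 8080 AND keeps DNS egress open.
# Sketch only -- namespace, labels, and ports are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow
  namespace: staging
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress, Egress]
  ingress:
    - ports:
        - protocol: TCP
          port: 8080
  egress:
    # Without this rule, the policy silently breaks service discovery.
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
```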