AgentCliptics · devops infrastructure · v1.0.0 · MIT

Specialist DevOps Troubleshooter

A DevOps troubleshooting specialist focused on rapid incident response and debugging, with expertise in log analysis, container debugging, network diagnostics, and CI/CD pipeline issue resolution.

When to Use This Agent

Choose Specialist DevOps Troubleshooter when:

  • Production systems are down and you need rapid diagnosis
  • CI/CD pipelines are failing with unclear error messages
  • Container or Kubernetes workloads are crashing or misbehaving
  • Network connectivity issues between services need investigation
  • Log analysis reveals anomalies that need correlation

Consider alternatives when:

  • Designing new infrastructure (use a cloud architect agent)
  • Implementing DevOps practices from scratch (use DevOps Expert Consultant)
  • Investigating application-level bugs (use a debugging agent)

Quick Start

```yaml
# .claude/agents/specialist-devops-troubleshooter.yml
name: Specialist DevOps Troubleshooter
description: Rapid DevOps incident diagnosis and resolution
model: claude-sonnet
tools:
  - Read
  - Bash
  - Glob
  - Grep
```

Example invocation:

```shell
claude "Our Kubernetes pods are in CrashLoopBackOff and the CI/CD pipeline is stuck — diagnose both issues and suggest fixes"
```

Core Concepts

Troubleshooting Decision Tree

| Symptom | First Check | Common Cause | Resolution |
|---|---|---|---|
| Pod CrashLoopBackOff | `kubectl logs <pod>` | Config error, missing secret | Fix config, update secret |
| Service unreachable | `kubectl get svc,ep` | Missing endpoints, wrong port | Fix selector labels |
| Pipeline timeout | CI job logs | Resource limits, hung test | Increase timeout, fix test |
| High CPU/Memory | `kubectl top pods` | Memory leak, missing limits | Set limits, fix leak |
| DNS resolution fails | `nslookup` from pod | CoreDNS issue, wrong service name | Restart CoreDNS, fix name |

Kubernetes Debugging Commands

```shell
# Pod diagnostics
kubectl describe pod <pod-name> -n <namespace>       # Events, conditions
kubectl logs <pod-name> -n <namespace> --previous    # Previous crash logs
kubectl exec -it <pod-name> -- sh                    # Shell into running pod
kubectl get events --sort-by='.lastTimestamp' -n <namespace>

# Service connectivity
kubectl get endpoints <service-name>                 # Verify endpoints exist
kubectl run debug --image=busybox --rm -it -- wget -qO- http://<service>:<port>/health

# Resource issues
kubectl top pods -n <namespace> --sort-by=memory
kubectl describe node <node-name> | grep -A5 "Allocated resources"

# Network debugging
kubectl exec -it <pod> -- nslookup <service-name>
kubectl exec -it <pod> -- curl -v http://<service>:<port>
```

Log Analysis Patterns

```shell
# Find errors in structured logs (JSON)
kubectl logs <pod> | jq 'select(.level == "error") | {time, message, error}'

# Correlate logs across services by request ID
kubectl logs -l app=api-gateway | grep "req-12345"
kubectl logs -l app=payment-service | grep "req-12345"
kubectl logs -l app=order-service | grep "req-12345"

# Find spike patterns in log timestamps
kubectl logs <pod> --since=1h | \
  awk '{print $1}' | \
  cut -d: -f1,2 | \
  sort | uniq -c | sort -rn | head

# Datadog log query
# service:payment-service status:error @http.status_code:500
# | stats count by @error.kind
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| `platform` | Infrastructure platform (`kubernetes`, `ecs`, `vms`) | Auto-detect |
| `log_backend` | Log aggregation tool (`elk`, `datadog`, `cloudwatch`) | Auto-detect |
| `monitoring` | Metrics platform (`prometheus`, `datadog`, `cloudwatch`) | Auto-detect |
| `urgency` | Troubleshooting urgency (`immediate`, `standard`) | `standard` |
| `escalation` | Auto-escalation rules | None |
| `runbook_path` | Path to runbook documentation | Auto-detect |
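
As a sketch, these parameters could be pinned explicitly instead of auto-detected. The exact schema and values below are assumptions for illustration, not a documented format:

```yaml
# Hypothetical explicit configuration (schema and values assumed)
platform: kubernetes
log_backend: datadog
monitoring: prometheus
urgency: immediate
escalation: page-oncall-after-15m   # illustrative rule name
runbook_path: docs/runbooks/
```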

Best Practices

  1. Follow a systematic diagnosis order: symptoms → logs → metrics → traces. Start with what is observable (error messages, HTTP status codes). Then check application logs for error context. Then correlate with infrastructure metrics (CPU, memory, network). Finally, use distributed traces to follow the request path. Jumping straight to code investigation without gathering observability data wastes time.

  2. Reproduce the issue in a controlled environment before applying fixes. A fix applied to production without understanding the root cause may mask the real problem or introduce new issues. If the issue can be reproduced in staging, debug there. If it is production-only (scale, data, configuration), collect comprehensive diagnostic data before applying any fix. Document the reproduction steps for the post-mortem.

  3. Always check "what changed?" first. Most production incidents are caused by recent changes: deployments, configuration updates, infrastructure modifications, or traffic pattern shifts. Check deployment history (kubectl rollout history), recent config changes, and traffic patterns. If the issue started at 2:15 PM and a deployment happened at 2:10 PM, start investigating that deployment.
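
A quick change-audit pass might look like the following; the deployment, namespace, and ConfigMap names are placeholders:

```shell
# What changed recently? (names and namespaces are illustrative)
kubectl rollout history deployment/payment-service -n prod       # recent revisions
kubectl get events -n prod --sort-by='.lastTimestamp' | tail -20 # recent cluster events
kubectl describe configmap payment-config -n prod                # current config state

# If a suspect revision exists, roll back while you investigate:
kubectl rollout undo deployment/payment-service -n prod --to-revision=3
```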

  4. Use kubectl debug for containers that crash before you can exec into them. When a container crashes immediately, kubectl exec fails because there is no running container. Use kubectl debug <pod> --image=busybox --target=<container> to attach a debug container that shares the pod's network namespace and storage. This lets you inspect the environment even when the main container is not running.
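
A minimal sketch of that workflow, assuming a cluster with ephemeral containers enabled (pod and container names are placeholders):

```shell
# Attach a debug container sharing the crashing container's namespaces
kubectl debug payment-7d9f -n prod --image=busybox --target=payment -it -- sh

# Inside the debug shell, inspect the environment the main container saw:
# env | sort            # check expected env vars and secrets are present
# nslookup payment-db   # verify DNS resolution from the pod's network
```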

  5. Document the diagnosis path, not just the fix. When resolving an incident, record: what symptoms were observed, what diagnostic commands were run, what data was collected, what the root cause was, and what fix was applied. This diagnosis path becomes a runbook for future similar incidents. Without it, the next on-call engineer starts from scratch when the same issue recurs.

Common Issues

Pods stuck in Pending state without clear error messages. Check kubectl describe pod for scheduling events. Common causes: insufficient cluster resources (node CPU/memory exhausted), node affinity/taint preventing scheduling, PersistentVolumeClaim not bound, or resource requests exceeding any node's capacity. Check kubectl describe node for resource allocation and kubectl get pv,pvc for volume issues.
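
The triage sequence above can be sketched as a short command run; pod and namespace names are placeholders:

```shell
kubectl describe pod web-0 -n prod | sed -n '/Events:/,$p'    # scheduling events only
kubectl describe nodes | grep -A5 "Allocated resources"       # per-node capacity
kubectl get pv,pvc -n prod                                    # volume binding status
kubectl get pod web-0 -n prod -o jsonpath='{.spec.containers[*].resources}'  # requests vs capacity
```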

CI/CD pipeline fails intermittently with no code changes. Flaky pipelines are usually caused by external dependencies: Docker Hub rate limits, npm registry timeouts, flaky test infrastructure, or runner resource exhaustion. Check runner resource utilization, add retry logic for network-dependent steps, cache dependencies to reduce external calls, and isolate flaky tests. Track pipeline reliability as a metric and investigate any drop below 95%.
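
The retry logic for network-dependent steps can be sketched as a small wrapper; the `retry` helper and the `docker pull` example are illustrative, not part of the template:

```shell
# retry <max> <command...>: re-run a command until it succeeds or
# <max> attempts are exhausted, with a short linear backoff.
retry() {
  max=$1; shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "retry: failed after $max attempts: $*" >&2
      return 1
    fi
    sleep "$n"        # linear backoff: 1s, 2s, 3s, ...
    n=$((n + 1))
  done
}

# Example: guard a network-dependent CI step against transient failures
# retry 3 docker pull alpine:3.19
```

Use this only around steps that are safe to repeat; a step with side effects (a deploy, a database migration) needs idempotency before it can be retried.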

Services cannot communicate after network policy changes. Kubernetes NetworkPolicies default to denying traffic when applied. A new policy that allows ingress on port 8080 but does not allow egress for DNS (port 53 UDP) breaks service discovery. Always include DNS egress rules in network policies. Test policies in a non-production namespace first. Use kubectl describe networkpolicy and network policy visualization tools to verify rules.
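
A sketch of a policy that allows ingress on port 8080 and explicitly permits DNS egress; the names and selectors are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-allow          # illustrative name
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - ports:
        - port: 8080
  egress:
    - ports:                   # DNS egress so service discovery keeps working
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```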
