Expert DevOps Bot

Enterprise-grade agent for DevOps, infrastructure, and deployment. Includes structured workflows, validation checks, and reusable patterns for development teams.

Agent · Cliptics · development team · v1.0.0 · MIT

Expert DevOps Bot

An agent that follows the DevOps Infinity Loop principle, guiding teams through continuous integration, delivery, and improvement across the entire software development lifecycle with emphasis on automation and collaboration.

When to Use This Agent

Choose DevOps Bot when:

  • Setting up CI/CD pipelines for automated build, test, and deploy
  • Containerizing applications with Docker and orchestrating with Kubernetes
  • Implementing infrastructure as code (Terraform, Pulumi, CloudFormation)
  • Configuring monitoring, alerting, and observability stacks
  • Designing release strategies (canary, blue-green, feature flags)

Consider alternatives when:

  • Writing application code without infrastructure concerns (use a dev agent)
  • Doing security-specific work (use a security engineering agent)
  • Optimizing database performance (use a DBA agent)

Quick Start

```yaml
# .claude/agents/expert-devops-bot.yml
name: DevOps Bot
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a DevOps expert following the Infinity Loop principle.
  Ensure continuous integration, delivery, and improvement.
  Automate everything possible, build reliable infrastructure,
  and maintain production observability. Favor battle-tested
  tools over cutting-edge ones.
```

Example invocation:

```bash
claude --agent expert-devops-bot "Set up a complete CI/CD pipeline for our Node.js monorepo: build, lint, test, Docker image build, push to ECR, and deploy to EKS with canary rollout and auto-rollback."
```

Core Concepts

DevOps Infinity Loop

    Plan → Code → Build → Test
     ↑                      ↓
  Improve              Release
     ↑                      ↓
   Learn ← Monitor ← Operate ← Deploy

Infrastructure as Code Layers

| Layer | Tool | Manages |
|---|---|---|
| Cloud resources | Terraform / Pulumi | VPCs, databases, compute |
| Configuration | Ansible / Chef | OS packages, configs |
| Containers | Dockerfile | Application packaging |
| Orchestration | Kubernetes manifests | Deployment, scaling, networking |
| CI/CD | GitHub Actions / GitLab CI | Build, test, deploy automation |
| Secrets | Vault / AWS Secrets Manager | Credentials, API keys |

Deployment Strategies

Rolling Update:  V1 V1 V1 β†’ V1 V1 V2 β†’ V1 V2 V2 β†’ V2 V2 V2
Blue-Green:      [V1 V1 V1] β†’ [V2 V2 V2]  (instant switch)
Canary:          V1 V1 V1 β†’ V1 V1 V2(5%) β†’ V1 V2 V2(50%) β†’ V2 V2 V2
Feature Flag:    V2(flag-off) V2(flag-off) β†’ V2(flag-on) V2(flag-off)
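As a concrete instance of the first strategy, Kubernetes controls a rolling update's pace with `maxSurge` and `maxUnavailable`. A minimal sketch (image name and labels hypothetical) that never drops below desired capacity:

```yaml
# Sketch: rolling update that adds one new pod at a time and only
# removes an old pod once its replacement is ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during rollout
      maxUnavailable: 0    # never go below the desired replica count
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: registry.example.com/web:v2
```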

Configuration

| Parameter | Description | Default |
|---|---|---|
| `cloud_provider` | Primary cloud platform | AWS |
| `iac_tool` | Infrastructure as code tool | Terraform |
| `ci_platform` | CI/CD platform | GitHub Actions |
| `container_runtime` | Container platform | Docker |
| `orchestrator` | Container orchestration | Kubernetes |
| `monitoring_stack` | Observability tools | Prometheus + Grafana |
| `deploy_strategy` | Default deployment strategy | Rolling update |

Best Practices

  1. Automate infrastructure provisioning with code, never with console clicks. Every infrastructure component should be defined in Terraform, Pulumi, or CloudFormation. Console-created resources create configuration drift, can't be reviewed, can't be rolled back, and can't be replicated across environments. Infrastructure as code enables: version control, peer review, automated testing, and reproducible environments.
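To illustrate, here is a minimal CloudFormation sketch (resource and output names hypothetical) that defines a versioned bucket in code rather than via console clicks, so it can be reviewed, rolled back, and replicated:

```yaml
# Hypothetical minimal CloudFormation template: one S3 bucket
# defined as code instead of created in the console.
AWSTemplateFormatVersion: "2010-09-09"
Description: Artifact bucket managed as code.
Resources:
  ArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled
Outputs:
  BucketName:
    Value: !Ref ArtifactBucket
```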

  2. Build immutable artifacts once, deploy everywhere. The Docker image that passes tests in CI should be the exact same image deployed to staging and production. Never rebuild for different environments. Use environment variables and configuration injection to adapt the same artifact to different environments. Rebuilding creates the possibility that production runs different code than what was tested.
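One way to enforce this in CI is to tag the image by commit SHA and have every deploy job reference that exact tag. A hedged GitHub Actions sketch (registry, repository, and deploy script are placeholders):

```yaml
# Hypothetical workflow: build once, deploy the identical artifact
# to staging and then production. Only configuration differs.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myorg/app:${{ github.sha }} .
      - run: docker push myorg/app:${{ github.sha }}
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Same image that passed CI; no rebuild for this environment.
      - run: ./deploy.sh staging myorg/app:${{ github.sha }}
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh production myorg/app:${{ github.sha }}
```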

  3. Implement monitoring before you need it, not after an incident. Set up metrics collection, log aggregation, and alerting as part of the initial deployment, not after the first production incident. Track the four golden signals: latency, traffic, errors, and saturation. Alert on symptoms (high error rate) not causes (high CPU), because symptoms directly affect users.
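A symptom-based alert might look like the following Prometheus rule sketch (the `http_requests_total` metric name and 5% threshold are assumptions; adjust to your instrumentation):

```yaml
# Hypothetical Prometheus alerting rule: fires on a symptom
# (user-facing error rate), not a cause (high CPU).
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests are failing"
```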

  4. Use canary deployments for any change that could affect users. Route a small percentage of traffic (5-10%) to the new version, monitor error rates and latency for a defined period, then gradually increase traffic. Automate rollback when metrics exceed thresholds. Canary deployments catch issues with real traffic patterns that synthetic tests miss, while limiting the blast radius.
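One tool that automates these traffic steps is Argo Rollouts. An abbreviated sketch of the progression described above (weights and pause durations illustrative; pod template omitted for brevity):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5            # 5% of traffic to the new version
        - pause: {duration: 10m}  # watch error rate and latency
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100          # full rollout if metrics hold
```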

  5. Keep secrets out of code, out of CI config, out of Docker images. Use a secrets management service (Vault, AWS Secrets Manager) and inject secrets at runtime via environment variables. Never store secrets in git repositories, Dockerfiles, or CI pipeline configurations. Rotate secrets regularly and audit access. A leaked secret in a Docker image layer persists in container registries indefinitely.
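Runtime injection can look like this Kubernetes fragment (secret and key names hypothetical): the credential lives in the cluster's secret store, never in the image or the repo.

```yaml
# Hypothetical container spec: the secret is injected as an
# environment variable at runtime, not baked into an image layer.
containers:
  - name: app
    image: myorg/app:abc123
    env:
      - name: DATABASE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: app-secrets
            key: db-password
```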

Common Issues

Infrastructure changes cause unexpected downtime. Apply the same rigor to infrastructure changes as to code changes: review, test, stage, and deploy incrementally. Use Terraform plan to preview changes before applying. Never apply infrastructure changes directly to productionβ€”test in a staging environment first. For critical infrastructure, implement change windows with explicit approval processes.
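The plan/review/apply discipline can be wired into CI. A hedged GitHub Actions sketch, assuming a `production` environment configured with required reviewers in the repository settings:

```yaml
# Hypothetical workflow: every change is previewed with terraform plan;
# apply runs the saved plan only after explicit approval.
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform init -input=false
      - run: terraform plan -input=false -out=tfplan
      - uses: actions/upload-artifact@v4
        with: {name: tfplan, path: tfplan}
  apply:
    needs: plan
    runs-on: ubuntu-latest
    environment: production   # gate: requires reviewer approval
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with: {name: tfplan}
      - run: terraform init -input=false
      - run: terraform apply -input=false tfplan
```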

CI pipelines are slow, frustrating developers. Profile the pipeline to find bottlenecks. Common optimizations: cache dependencies (Docker layer caching, npm/pip cache), parallelize independent steps, use smaller base images, skip unnecessary steps for non-critical branches, and run tests selectively based on changed files. A pipeline under 5 minutes encourages frequent commits; over 20 minutes encourages batching.
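Two of those optimizations, dependency caching and parallel jobs, can be sketched for a Node.js project (job names and scripts are assumptions):

```yaml
# Hypothetical workflow: lint and test run in parallel, and
# setup-node restores the npm cache between runs.
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm        # reuses ~/.npm across workflow runs
      - run: npm ci
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm test
```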

Alert fatigue from too many non-actionable alerts. Every alert should have a clear action: what to check, what might be wrong, and how to fix it. Remove alerts that fire frequently but don't indicate real problems. Use alert severity levels: page on-call for critical (user-facing impact), send to Slack for warning (degraded but functional), log for informational (notable but not urgent). If an alert fires more than three times without action, reconfigure or remove it.
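The severity split maps naturally onto Alertmanager routing. A minimal sketch (receiver names hypothetical; receiver integrations omitted):

```yaml
# Hypothetical Alertmanager routing: page only on critical,
# send warnings to chat, let everything else hit the default.
route:
  receiver: default
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall
    - matchers: ['severity="warning"']
      receiver: slack-alerts
receivers:
  - name: default
  - name: pagerduty-oncall
  - name: slack-alerts
```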
