Expert DevOps Bot

Enterprise-grade agent for DevOps, infrastructure, and deployment. Includes structured workflows, validation checks, and reusable patterns for development teams.

Agent · Cliptics · development team · v1.0.0 · MIT

Expert DevOps Bot

An agent that follows the DevOps Infinity Loop principle, guiding teams through continuous integration, delivery, and improvement across the entire software development lifecycle with emphasis on automation and collaboration.

When to Use This Agent

Choose DevOps Bot when:

  • Setting up CI/CD pipelines for automated build, test, and deploy
  • Containerizing applications with Docker and orchestrating with Kubernetes
  • Implementing infrastructure as code (Terraform, Pulumi, CloudFormation)
  • Configuring monitoring, alerting, and observability stacks
  • Designing release strategies (canary, blue-green, feature flags)

Consider alternatives when:

  • Writing application code without infrastructure concerns (use a dev agent)
  • Doing security-specific work (use a security engineering agent)
  • Optimizing database performance (use a DBA agent)

Quick Start

```yaml
# .claude/agents/expert-devops-bot.yml
name: DevOps Bot
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a DevOps expert following the Infinity Loop principle.
  Ensure continuous integration, delivery, and improvement.
  Automate everything possible, build reliable infrastructure,
  and maintain production observability. Favor battle-tested
  tools over cutting-edge ones.
```

Example invocation:

```bash
claude --agent expert-devops-bot "Set up a complete CI/CD pipeline for our Node.js monorepo: build, lint, test, Docker image build, push to ECR, and deploy to EKS with canary rollout and auto-rollback."
```

Core Concepts

DevOps Infinity Loop

    Plan → Code → Build → Test
     ↑                      ↓
  Improve              Release
     ↑                      ↓
   Learn ← Monitor ← Operate ← Deploy

Infrastructure as Code Layers

| Layer | Tool | Manages |
|---|---|---|
| Cloud resources | Terraform / Pulumi | VPCs, databases, compute |
| Configuration | Ansible / Chef | OS packages, configs |
| Containers | Dockerfile | Application packaging |
| Orchestration | Kubernetes manifests | Deployment, scaling, networking |
| CI/CD | GitHub Actions / GitLab CI | Build, test, deploy automation |
| Secrets | Vault / AWS Secrets Manager | Credentials, API keys |

Deployment Strategies

Rolling Update:  V1 V1 V1 β†’ V1 V1 V2 β†’ V1 V2 V2 β†’ V2 V2 V2
Blue-Green:      [V1 V1 V1] β†’ [V2 V2 V2]  (instant switch)
Canary:          V1 V1 V1 β†’ V1 V1 V2(5%) β†’ V1 V2 V2(50%) β†’ V2 V2 V2
Feature Flag:    V2(flag-off) V2(flag-off) β†’ V2(flag-on) V2(flag-off)
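As a concrete instance of the first strategy, Kubernetes controls a rolling update's pace with `maxSurge` and `maxUnavailable`. A minimal sketch (image name and labels hypothetical) that never drops below desired capacity:

```yaml
# Sketch: rolling update that adds one new pod at a time and only
# removes an old pod once its replacement is ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during rollout
      maxUnavailable: 0    # never go below the desired replica count
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: registry.example.com/web:v2
```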

Configuration

| Parameter | Description | Default |
|---|---|---|
| `cloud_provider` | Primary cloud platform | AWS |
| `iac_tool` | Infrastructure as code tool | Terraform |
| `ci_platform` | CI/CD platform | GitHub Actions |
| `container_runtime` | Container platform | Docker |
| `orchestrator` | Container orchestration | Kubernetes |
| `monitoring_stack` | Observability tools | Prometheus + Grafana |
| `deploy_strategy` | Default deployment strategy | Rolling update |

Best Practices

  1. Automate infrastructure provisioning with code, never with console clicks. Every infrastructure component should be defined in Terraform, Pulumi, or CloudFormation. Console-created resources create configuration drift, can't be reviewed, can't be rolled back, and can't be replicated across environments. Infrastructure as code enables: version control, peer review, automated testing, and reproducible environments.
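To illustrate, here is a minimal CloudFormation sketch (resource and output names hypothetical) that defines a versioned bucket in code rather than via console clicks, so it can be reviewed, rolled back, and replicated:

```yaml
# Hypothetical minimal CloudFormation template: one S3 bucket
# defined as code instead of created in the console.
AWSTemplateFormatVersion: "2010-09-09"
Description: Artifact bucket managed as code.
Resources:
  ArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled
Outputs:
  BucketName:
    Value: !Ref ArtifactBucket
```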

  2. Build immutable artifacts once, deploy everywhere. The Docker image that passes tests in CI should be the exact same image deployed to staging and production. Never rebuild for different environments. Use environment variables and configuration injection to adapt the same artifact to different environments. Rebuilding creates the possibility that production runs different code than what was tested.
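One way to enforce this in CI is to tag the image by commit SHA and have every deploy job reference that exact tag. A hedged GitHub Actions sketch (registry, repository, and deploy script are placeholders):

```yaml
# Hypothetical workflow: build once, deploy the identical artifact
# to staging and then production. Only configuration differs.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myorg/app:${{ github.sha }} .
      - run: docker push myorg/app:${{ github.sha }}
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Same image that passed CI; no rebuild for this environment.
      - run: ./deploy.sh staging myorg/app:${{ github.sha }}
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh production myorg/app:${{ github.sha }}
```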

  3. Implement monitoring before you need it, not after an incident. Set up metrics collection, log aggregation, and alerting as part of the initial deployment, not after the first production incident. Track the four golden signals: latency, traffic, errors, and saturation. Alert on symptoms (high error rate) not causes (high CPU), because symptoms directly affect users.
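A symptom-based alert might look like the following Prometheus rule sketch (the `http_requests_total` metric name and 5% threshold are assumptions; adjust to your instrumentation):

```yaml
# Hypothetical Prometheus alerting rule: fires on a symptom
# (user-facing error rate), not a cause (high CPU).
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests are failing"
```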

  4. Use canary deployments for any change that could affect users. Route a small percentage of traffic (5-10%) to the new version, monitor error rates and latency for a defined period, then gradually increase traffic. Automate rollback when metrics exceed thresholds. Canary deployments catch issues with real traffic patterns that synthetic tests miss, while limiting the blast radius.
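One tool that automates these traffic steps is Argo Rollouts. An abbreviated sketch of the progression described above (weights and pause durations illustrative; pod template omitted for brevity):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5            # 5% of traffic to the new version
        - pause: {duration: 10m}  # watch error rate and latency
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100          # full rollout if metrics hold
```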

  5. Keep secrets out of code, out of CI config, out of Docker images. Use a secrets management service (Vault, AWS Secrets Manager) and inject secrets at runtime via environment variables. Never store secrets in git repositories, Dockerfiles, or CI pipeline configurations. Rotate secrets regularly and audit access. A leaked secret in a Docker image layer persists in container registries indefinitely.
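Runtime injection can look like this Kubernetes fragment (secret and key names hypothetical): the credential lives in the cluster's secret store, never in the image or the repo.

```yaml
# Hypothetical container spec: the secret is injected as an
# environment variable at runtime, not baked into an image layer.
containers:
  - name: app
    image: myorg/app:abc123
    env:
      - name: DATABASE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: app-secrets
            key: db-password
```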

Common Issues

Infrastructure changes cause unexpected downtime. Apply the same rigor to infrastructure changes as to code changes: review, test, stage, and deploy incrementally. Use Terraform plan to preview changes before applying. Never apply infrastructure changes directly to productionβ€”test in a staging environment first. For critical infrastructure, implement change windows with explicit approval processes.
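The plan/review/apply discipline can be wired into CI. A hedged GitHub Actions sketch, assuming a `production` environment configured with required reviewers in the repository settings:

```yaml
# Hypothetical workflow: every change is previewed with terraform plan;
# apply runs the saved plan only after explicit approval.
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform init -input=false
      - run: terraform plan -input=false -out=tfplan
      - uses: actions/upload-artifact@v4
        with: {name: tfplan, path: tfplan}
  apply:
    needs: plan
    runs-on: ubuntu-latest
    environment: production   # gate: requires reviewer approval
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with: {name: tfplan}
      - run: terraform init -input=false
      - run: terraform apply -input=false tfplan
```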

CI pipelines are slow, frustrating developers. Profile the pipeline to find bottlenecks. Common optimizations: cache dependencies (Docker layer caching, npm/pip cache), parallelize independent steps, use smaller base images, skip unnecessary steps for non-critical branches, and run tests selectively based on changed files. A pipeline under 5 minutes encourages frequent commits; over 20 minutes encourages batching.
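Two of those optimizations, dependency caching and parallel jobs, can be sketched for a Node.js project (job names and scripts are assumptions):

```yaml
# Hypothetical workflow: lint and test run in parallel, and
# setup-node restores the npm cache between runs.
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm        # reuses ~/.npm across workflow runs
      - run: npm ci
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm test
```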

Alert fatigue from too many non-actionable alerts. Every alert should have a clear action: what to check, what might be wrong, and how to fix it. Remove alerts that fire frequently but don't indicate real problems. Use alert severity levels: page on-call for critical (user-facing impact), send to Slack for warning (degraded but functional), log for informational (notable but not urgent). If an alert fires more than three times without action, reconfigure or remove it.
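The severity split maps naturally onto Alertmanager routing. A minimal sketch (receiver names hypothetical; receiver integrations omitted):

```yaml
# Hypothetical Alertmanager routing: page only on critical,
# send warnings to chat, let everything else hit the default.
route:
  receiver: default
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall
    - matchers: ['severity="warning"']
      receiver: slack-alerts
receivers:
  - name: default
  - name: pagerduty-oncall
  - name: slack-alerts
```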
