
Specialist SE Ally

A GitOps and CI/CD specialist agent focused on building reliable deployment pipelines, debugging deployment failures, and ensuring every code change deploys safely and automatically with proper monitoring and rollback capabilities.

When to Use This Agent

Choose SE Ally when:

  • Building or debugging CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
  • Implementing GitOps workflows with ArgoCD or Flux
  • Troubleshooting failed deployments and rollback procedures
  • Setting up infrastructure as code deployment automation
  • Configuring deployment monitoring, health checks, and alerts

Consider alternatives when:

  • Writing application code without deployment concerns (use a dev agent)
  • Designing system architecture (use an architecture agent)
  • Managing cloud infrastructure without deployment automation (use a DevOps agent)

Quick Start

```yaml
# .claude/agents/specialist-se-ally.yml
name: SE Ally
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a GitOps and CI/CD specialist. Build reliable deployment
  pipelines that make every commit deploy safely. Focus on automation,
  monitoring, and rapid recovery. When investigating failures, triage
  systematically before applying fixes.
```

Example invocation:

```shell
claude --agent specialist-se-ally "Our GitHub Actions deployment pipeline fails intermittently on the Docker build step. Investigate the workflow, identify the root cause, and fix the reliability issue."
```

Core Concepts

Deployment Pipeline Stages

```
Code Push → Build → Test → Scan → Deploy Staging → Test → Deploy Prod
    │         │       │      │          │             │         │
  Lint     Compile  Unit   SAST     Smoke tests   Integration Canary
  Format   Docker   Integ  DAST     Health check  E2E tests   Blue/green
  Validate Cache    Perf   License  Rollback test Approval    Monitor
```
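These stages map naturally onto CI jobs. A minimal GitHub Actions sketch is shown below; the job layout, image name, and `scripts/deploy.sh` / `scripts/smoke-test.sh` helpers are illustrative assumptions, not part of this template:

```yaml
# .github/workflows/deploy.yml -- illustrative sketch, not a complete pipeline
name: deploy
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint test                          # lint + unit tests before building
      - run: docker build -t myapp:${{ github.sha }} .
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: ./scripts/deploy.sh staging             # hypothetical deploy script
      - run: ./scripts/smoke-test.sh staging         # smoke tests gate promotion
  deploy-prod:
    needs: deploy-staging
    environment: production                          # human approval gate lives here
    runs-on: ubuntu-latest
    steps:
      - run: ./scripts/deploy.sh prod
```

Note that the approval gate is attached to the `production` environment rather than baked into the execution steps, matching the best practices below.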

GitOps Workflow

| Component | Tool | Responsibility |
| --- | --- | --- |
| Source of truth | Git repository | All config and manifests |
| Sync agent | ArgoCD / Flux | Apply desired state to cluster |
| Image builder | GitHub Actions / CI | Build and tag container images |
| Config management | Kustomize / Helm | Environment-specific config |
| Secrets | Sealed Secrets / SOPS | Encrypted secrets in Git |
| Monitoring | Prometheus + Grafana | Deployment health tracking |
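Tying these components together, an ArgoCD `Application` declares which Git path the sync agent reconciles into which cluster namespace. A sketch (the repo URL, paths, and app name are assumptions):

```yaml
# Hypothetical ArgoCD Application: the Git repo is the source of truth,
# and the sync agent continuously applies whatever the manifests declare.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config   # assumed config repo
    targetRevision: main
    path: overlays/production                           # Kustomize overlay path
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true          # delete cluster resources removed from Git
```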

Failure Triage Checklist

```markdown
## Deployment Failure Triage

1. **What changed?** (git log, config diff, dependency update)
2. **When did it start?** (last successful deploy timestamp)
3. **What's the blast radius?** (one service, one env, all envs)
4. **Is rollback possible?** (check rollback procedure)
5. **Is it intermittent?** (check last 5 pipeline runs)
6. **What do logs show?** (build logs, deploy logs, app logs)
7. **Is infrastructure healthy?** (nodes, DNS, certs, registry)
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `ci_platform` | CI/CD platform | GitHub Actions |
| `gitops_tool` | GitOps reconciliation tool | ArgoCD |
| `container_registry` | Docker image registry | ghcr.io |
| `deploy_strategy` | Deployment strategy | Rolling update |
| `health_check_path` | Application health endpoint | `/health` |
| `rollback_on_failure` | Automatic rollback trigger | `true` |
| `approval_required` | Production deployment gate | `true` |
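One plausible way to wire these parameters into the agent definition is a `config:` block alongside the Quick Start fields; this layout is an assumption, not a documented schema:

```yaml
# Hypothetical extension of the Quick Start config; the `config:` key
# and value formats mirror the table above and are assumptions.
config:
  ci_platform: github-actions
  gitops_tool: argocd
  container_registry: ghcr.io
  deploy_strategy: rolling-update
  health_check_path: /health
  rollback_on_failure: true
  approval_required: true
```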

Best Practices

  1. Make deployments boring through automation. Every manual step in a deployment process is an opportunity for human error. Automate build, test, scan, deploy, and monitoring end-to-end. The gold standard is that deploying to production feels the same as deploying to staging—push to a branch and watch the pipeline work. Reserve human approval gates for production promotion decisions, not execution steps.

  2. Build rollback into the deployment process, not as an afterthought. Every deployment should know how to undo itself. For Kubernetes, keep the previous ReplicaSet warm during canary periods. For database changes, ship backward-compatible migrations. Test rollback in staging regularly. When a production issue occurs, the response should be "roll back" (seconds) not "debug and fix" (minutes to hours).
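The "keep the previous ReplicaSet warm" idea can be sketched with Argo Rollouts; using Argo Rollouts here is an assumption (the same principle works with plain Deployments plus `kubectl rollout undo`), and the selector/template are omitted for brevity:

```yaml
# Canary sketch: old ReplicaSets stay available for instant rollback.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 4
  revisionHistoryLimit: 5          # keep recent ReplicaSets warm for rollback
  strategy:
    canary:
      steps:
        - setWeight: 20            # send 20% of traffic to the new version
        - pause: {duration: 5m}    # watch metrics before promoting
        - setWeight: 100
```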

  3. Cache aggressively in CI pipelines. Docker layer caching, dependency caching (node_modules, pip packages), and build artifact caching can reduce pipeline runtime from 20 minutes to 3 minutes. Faster pipelines mean faster feedback, more frequent deployments, and lower developer friction. Configure cache keys based on lockfile hashes so caches invalidate when dependencies actually change.
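For example, a GitHub Actions dependency cache keyed on the lockfile hash (paths assume an npm project):

```yaml
# Cache invalidates only when package-lock.json actually changes;
# restore-keys allows partial reuse of a stale cache on a miss.
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: npm-${{ runner.os }}-
```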

  4. Use environment promotion, not environment-specific branches. Build one artifact, promote it through environments. The same Docker image that passes staging tests should deploy to production—not a rebuild from the same commit. Environment-specific configuration lives in environment overlays (Kustomize) or values files (Helm), not in the application code or container image.
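With Kustomize, promotion means pinning the already-tested image tag in each environment overlay; the directory layout and names below are hypothetical:

```yaml
# overlays/production/kustomization.yaml -- the same image is promoted
# through environments; only environment config differs per overlay.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: replica-count.yaml       # prod-only settings live here
images:
  - name: myapp
    newName: ghcr.io/example/myapp
    newTag: "sha-abc1234"          # tag that already passed staging tests
```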

  5. Monitor the deployment, not just the application. Track pipeline duration, success rate, deployment frequency, and mean time to recovery (DORA metrics). Alert when pipeline duration increases by 50% (cache or infrastructure regression), when success rate drops below 95% (flaky tests or infrastructure issues), and when deployment frequency decreases (process friction). These metrics tell you whether your CI/CD system is healthy.
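The success-rate alert above might look like this as a Prometheus rule; the metric name `pipeline_runs_total` is an assumption and depends on which CI exporter you run:

```yaml
# Sketch: alert when pipeline success rate over the last day drops below 95%.
groups:
  - name: cicd-health
    rules:
      - alert: PipelineSuccessRateLow
        expr: |
          sum(rate(pipeline_runs_total{status="success"}[1d]))
            / sum(rate(pipeline_runs_total[1d])) < 0.95
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "CI success rate below 95% over the last day"
```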

Common Issues

CI pipeline fails intermittently with the same code. Flaky failures usually come from: flaky tests (timing-dependent, order-dependent, or external-dependency-dependent), resource limits (runner out of memory or disk), rate limiting (Docker Hub pulls, npm registry), or infrastructure issues (runner connectivity, DNS resolution). Categorize failures, fix the most common flaky test first, and add retry logic only for genuinely transient issues like network timeouts.
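Retry logic scoped to a genuinely transient step might look like the following; `nick-fields/retry` is a popular third-party action (an assumption worth verifying before adopting):

```yaml
# Retry only the flaky registry pull, not the whole job -- broad retries
# hide real failures.
- uses: nick-fields/retry@v3
  with:
    max_attempts: 3
    timeout_minutes: 10
    command: docker pull ghcr.io/example/base:latest   # transient registry errors
```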

Deployments succeed but the application is unhealthy. Add post-deployment health checks that verify the application is actually serving requests, not just that the container started. Check readiness probes, run smoke tests against the deployed instance, and verify key metrics (error rate, latency, throughput) are within normal ranges. Automate rollback when post-deployment checks fail.
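A readiness probe against the health endpoint (matching the `/health` default in the configuration table) keeps traffic off pods that started but aren't serving; the port is an assumption:

```yaml
# Pod spec fragment: traffic only reaches pods whose health endpoint
# responds; pair this with a post-deploy smoke test for full coverage.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```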

GitOps sync drift between desired and actual state. ArgoCD or Flux may show "out of sync" when cluster state doesn't match the Git repository. Common causes: manual kubectl changes that bypass Git, CRDs or operators that modify resources, and auto-scaling that changes replica counts. Enforce Git as the single source of truth by preventing manual cluster changes, excluding auto-managed fields from sync, and using ArgoCD's self-heal feature.
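Self-heal and field exclusion can be combined in the ArgoCD `Application` spec; this fragment is a sketch of those two settings:

```yaml
# Application spec fragment: self-heal reverts manual drift, and the
# replica count is excluded from diffing so autoscaling isn't fought.
spec:
  syncPolicy:
    automated:
      selfHeal: true         # revert manual kubectl changes automatically
      prune: true
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas     # let the HPA own replica counts
```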
