Specialist SE Ally
A GitOps and CI/CD specialist agent focused on building reliable deployment pipelines, debugging deployment failures, and ensuring every code change deploys safely and automatically with proper monitoring and rollback capabilities.
When to Use This Agent
Choose SE Ally when:
- Building or debugging CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
- Implementing GitOps workflows with ArgoCD or Flux
- Troubleshooting failed deployments and rollback procedures
- Setting up infrastructure as code deployment automation
- Configuring deployment monitoring, health checks, and alerts
Consider alternatives when:
- Writing application code without deployment concerns (use a dev agent)
- Designing system architecture (use an architecture agent)
- Managing cloud infrastructure without deployment automation (use a DevOps agent)
Quick Start
```yaml
# .claude/agents/specialist-se-ally.yml
name: SE Ally
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a GitOps and CI/CD specialist. Build reliable deployment
  pipelines that make every commit deploy safely. Focus on automation,
  monitoring, and rapid recovery. When investigating failures, triage
  systematically before applying fixes.
```
Example invocation:
```bash
claude --agent specialist-se-ally "Our GitHub Actions deployment pipeline fails intermittently on the Docker build step. Investigate the workflow, identify the root cause, and fix the reliability issue."
```
Core Concepts
Deployment Pipeline Stages
```
Code Push → Build   → Test  → Scan    → Deploy Staging → Test        → Deploy Prod
    │         │        │        │            │              │              │
  Lint      Compile   Unit    SAST     Smoke tests    Integration     Canary
  Format    Docker    Integ   DAST     Health check   E2E tests       Blue/green
  Validate  Cache     Perf    License  Rollback test  Approval        Monitor
```
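The stage flow above maps naturally onto a CI workflow with chained jobs. A minimal GitHub Actions sketch (job names, make targets, and the `deploy.sh` scripts are illustrative placeholders, not part of this template):

```yaml
# .github/workflows/deploy.yml -- illustrative sketch only
name: deploy
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint test                          # lint, format, unit tests
      - run: docker build -t app:${{ github.sha }} .
  scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make scan                               # hypothetical SAST/license scan target
  deploy-staging:
    needs: scan
    runs-on: ubuntu-latest
    steps:
      - run: ./deploy.sh staging                     # smoke tests + health check after deploy
  deploy-prod:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production                          # approval gate before promotion
    steps:
      - run: ./deploy.sh prod                        # canary / blue-green, then monitor
```

The `environment: production` key is how GitHub Actions attaches a required-reviewer approval gate to the final promotion step.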
GitOps Workflow
| Component | Tool | Responsibility |
|---|---|---|
| Source of truth | Git repository | All config and manifests |
| Sync agent | ArgoCD / Flux | Apply desired state to cluster |
| Image builder | GitHub Actions / CI | Build and tag container images |
| Config management | Kustomize / Helm | Environment-specific config |
| Secrets | Sealed Secrets / SOPS | Encrypted secrets in Git |
| Monitoring | Prometheus + Grafana | Deployment health tracking |
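The components in the table above come together in a single ArgoCD `Application` resource. A hedged sketch (the repo URL, application name, path, and namespace are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp                    # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config   # placeholder repo
    targetRevision: main
    path: overlays/production    # Kustomize overlay for this environment
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true                # delete resources removed from Git
      selfHeal: true             # revert manual cluster changes back to Git state
```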
Failure Triage Checklist
```markdown
## Deployment Failure Triage
1. **What changed?** (git log, config diff, dependency update)
2. **When did it start?** (last successful deploy timestamp)
3. **What's the blast radius?** (one service, one env, all envs)
4. **Is rollback possible?** (check rollback procedure)
5. **Is it intermittent?** (check last 5 pipeline runs)
6. **What do logs show?** (build logs, deploy logs, app logs)
7. **Is infrastructure healthy?** (nodes, DNS, certs, registry)
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `ci_platform` | CI/CD platform | GitHub Actions |
| `gitops_tool` | GitOps reconciliation tool | ArgoCD |
| `container_registry` | Docker image registry | ghcr.io |
| `deploy_strategy` | Deployment strategy | Rolling update |
| `health_check_path` | Application health endpoint | /health |
| `rollback_on_failure` | Automatic rollback trigger | true |
| `approval_required` | Production deployment gate | true |
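One way these parameters might be expressed alongside the agent definition (the `config:` key is an assumption; adapt to however your setup passes parameters):

```yaml
config:
  ci_platform: GitHub Actions
  gitops_tool: ArgoCD
  container_registry: ghcr.io
  deploy_strategy: rolling-update
  health_check_path: /health
  rollback_on_failure: true
  approval_required: true
```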
Best Practices
- **Make deployments boring through automation.** Every manual step in a deployment process is an opportunity for human error. Automate build, test, scan, deploy, and monitoring end-to-end. The gold standard is that deploying to production feels the same as deploying to staging: push to a branch and watch the pipeline work. Reserve human approval gates for production promotion decisions, not execution steps.
- **Build rollback into the deployment process, not as an afterthought.** Every deployment should know how to undo itself. For Kubernetes, keep the previous ReplicaSet warm during canary periods. For database changes, ship backward-compatible migrations. Test rollback in staging regularly. When a production issue occurs, the response should be "roll back" (seconds), not "debug and fix" (minutes to hours).
- **Cache aggressively in CI pipelines.** Docker layer caching, dependency caching (node_modules, pip packages), and build artifact caching can reduce pipeline runtime from 20 minutes to 3 minutes. Faster pipelines mean faster feedback, more frequent deployments, and lower developer friction. Configure cache keys based on lockfile hashes so caches invalidate when dependencies actually change.
- **Use environment promotion, not environment-specific branches.** Build one artifact and promote it through environments. The same Docker image that passes staging tests should deploy to production, not a rebuild from the same commit. Environment-specific configuration lives in environment overlays (Kustomize) or values files (Helm), not in the application code or container image.
- **Monitor the deployment, not just the application.** Track pipeline duration, success rate, deployment frequency, and mean time to recovery (DORA metrics). Alert when pipeline duration increases by 50% (cache or infrastructure regression), when success rate drops below 95% (flaky tests or infrastructure issues), and when deployment frequency decreases (process friction). These metrics tell you whether your CI/CD system is healthy.
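The lockfile-keyed caching practice looks like this in GitHub Actions (`actions/cache` is a real action; the cached path here assumes a Node project):

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    # Key changes only when the lockfile changes, so the cache
    # invalidates exactly when dependencies actually change.
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-
```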
Common Issues
**CI pipeline fails intermittently with the same code.** Flaky failures usually come from: flaky tests (timing-dependent, order-dependent, or external-dependency-dependent), resource limits (runner out of memory or disk), rate limiting (Docker Hub pulls, npm registry), or infrastructure issues (runner connectivity, DNS resolution). Categorize failures, fix the most common flaky test first, and add retry logic only for genuinely transient issues like network timeouts.
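For genuinely transient failures, one option is a retry wrapper around only the affected step. This sketch uses the third-party `nick-fields/retry` action (verify it against your security policy before adopting):

```yaml
- uses: nick-fields/retry@v3
  with:
    max_attempts: 3
    timeout_minutes: 10
    # Retry only steps with transient external dependencies
    # (registry pulls, package downloads) -- never use retries
    # to paper over flaky tests.
    command: docker build -t app:${{ github.sha }} .
```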
**Deployments succeed but the application is unhealthy.** Add post-deployment health checks that verify the application is actually serving requests, not just that the container started. Check readiness probes, run smoke tests against the deployed instance, and verify key metrics (error rate, latency, throughput) are within normal ranges. Automate rollback when post-deployment checks fail.
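The keep-or-roll-back decision can be captured as a small gate script. A minimal sketch (the thresholds and the basis-points convention are assumptions; wire in real probe and metric values):

```shell
#!/usr/bin/env sh
# deploy_gate: decide whether to keep or roll back a deployment.
#   $1 = HTTP status from GET /health
#   $2 = observed error rate in basis points (100 = 1%)
deploy_gate() {
  http_status=$1
  error_bps=$2
  # Healthy only if the health endpoint answers 200 AND
  # the error rate is under the 1% threshold.
  if [ "$http_status" -eq 200 ] && [ "$error_bps" -lt 100 ]; then
    echo "keep"
  else
    echo "rollback"
  fi
}
```

A pipeline step would feed this the status code from curling `health_check_path` plus an error rate pulled from metrics, and trigger the rollback procedure on a `rollback` result.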
**GitOps sync drift between desired and actual state.** ArgoCD or Flux may show "out of sync" when cluster state doesn't match the Git repository. Common causes: manual kubectl changes that bypass Git, CRDs or operators that modify resources, and auto-scaling that changes replica counts. Enforce Git as the single source of truth by preventing manual cluster changes, excluding auto-managed fields from sync, and using ArgoCD's self-heal feature.
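Enforcing Git as the source of truth while tolerating HPA-managed replica counts can look like this in the ArgoCD `Application` spec (a sketch using ArgoCD's `syncPolicy` and `ignoreDifferences` fields):

```yaml
spec:
  syncPolicy:
    automated:
      selfHeal: true         # revert manual kubectl edits back to Git state
      prune: true
  # Don't flag HPA-driven replica changes as drift.
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
```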