
Specialist SE Ally

A GitOps and CI/CD specialist agent focused on building reliable deployment pipelines, debugging deployment failures, and ensuring every code change deploys safely and automatically with proper monitoring and rollback capabilities.

When to Use This Agent

Choose SE Ally when:

  • Building or debugging CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
  • Implementing GitOps workflows with ArgoCD or Flux
  • Troubleshooting failed deployments and rollback procedures
  • Setting up infrastructure as code deployment automation
  • Configuring deployment monitoring, health checks, and alerts

Consider alternatives when:

  • Writing application code without deployment concerns (use a dev agent)
  • Designing system architecture (use an architecture agent)
  • Managing cloud infrastructure without deployment automation (use a DevOps agent)

Quick Start

```yaml
# .claude/agents/specialist-se-ally.yml
name: SE Ally
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a GitOps and CI/CD specialist. Build reliable deployment
  pipelines that make every commit deploy safely. Focus on automation,
  monitoring, and rapid recovery. When investigating failures, triage
  systematically before applying fixes.
```

Example invocation:

```shell
claude --agent specialist-se-ally "Our GitHub Actions deployment pipeline fails intermittently on the Docker build step. Investigate the workflow, identify the root cause, and fix the reliability issue."
```

Core Concepts

Deployment Pipeline Stages

```
Code Push → Build → Test → Scan → Deploy Staging → Test → Deploy Prod
    │         │       │      │          │             │         │
  Lint     Compile  Unit   SAST     Smoke tests   Integration Canary
  Format   Docker   Integ  DAST     Health check  E2E tests   Blue/green
  Validate Cache    Perf   License  Rollback test Approval    Monitor
```
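These stages map naturally onto CI jobs. A minimal GitHub Actions sketch is shown below; the job layout, image name, and `scripts/deploy.sh` / `scripts/smoke-test.sh` helpers are illustrative assumptions, not part of this template:

```yaml
# .github/workflows/deploy.yml -- illustrative sketch, not a complete pipeline
name: deploy
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint test                          # lint + unit tests before building
      - run: docker build -t myapp:${{ github.sha }} .
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: ./scripts/deploy.sh staging             # hypothetical deploy script
      - run: ./scripts/smoke-test.sh staging         # smoke tests gate promotion
  deploy-prod:
    needs: deploy-staging
    environment: production                          # human approval gate lives here
    runs-on: ubuntu-latest
    steps:
      - run: ./scripts/deploy.sh prod
```

Note that the approval gate is attached to the `production` environment rather than baked into the execution steps, matching the best practices below.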

GitOps Workflow

| Component | Tool | Responsibility |
| --- | --- | --- |
| Source of truth | Git repository | All config and manifests |
| Sync agent | ArgoCD / Flux | Apply desired state to cluster |
| Image builder | GitHub Actions / CI | Build and tag container images |
| Config management | Kustomize / Helm | Environment-specific config |
| Secrets | Sealed Secrets / SOPS | Encrypted secrets in Git |
| Monitoring | Prometheus + Grafana | Deployment health tracking |
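Tying these components together, an ArgoCD `Application` declares which Git path the sync agent reconciles into which cluster namespace. A sketch (the repo URL, paths, and app name are assumptions):

```yaml
# Hypothetical ArgoCD Application: the Git repo is the source of truth,
# and the sync agent continuously applies whatever the manifests declare.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config   # assumed config repo
    targetRevision: main
    path: overlays/production                           # Kustomize overlay path
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true          # delete cluster resources removed from Git
```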

Failure Triage Checklist

```markdown
## Deployment Failure Triage

1. **What changed?** (git log, config diff, dependency update)
2. **When did it start?** (last successful deploy timestamp)
3. **What's the blast radius?** (one service, one env, all envs)
4. **Is rollback possible?** (check rollback procedure)
5. **Is it intermittent?** (check last 5 pipeline runs)
6. **What do logs show?** (build logs, deploy logs, app logs)
7. **Is infrastructure healthy?** (nodes, DNS, certs, registry)
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `ci_platform` | CI/CD platform | GitHub Actions |
| `gitops_tool` | GitOps reconciliation tool | ArgoCD |
| `container_registry` | Docker image registry | ghcr.io |
| `deploy_strategy` | Deployment strategy | Rolling update |
| `health_check_path` | Application health endpoint | `/health` |
| `rollback_on_failure` | Automatic rollback trigger | `true` |
| `approval_required` | Production deployment gate | `true` |
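One plausible way to wire these parameters into the agent definition is a `config:` block alongside the Quick Start fields; this layout is an assumption, not a documented schema:

```yaml
# Hypothetical extension of the Quick Start config; the `config:` key
# and value formats mirror the table above and are assumptions.
config:
  ci_platform: github-actions
  gitops_tool: argocd
  container_registry: ghcr.io
  deploy_strategy: rolling-update
  health_check_path: /health
  rollback_on_failure: true
  approval_required: true
```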

Best Practices

  1. Make deployments boring through automation. Every manual step in a deployment process is an opportunity for human error. Automate build, test, scan, deploy, and monitoring end-to-end. The gold standard is that deploying to production feels the same as deploying to staging—push to a branch and watch the pipeline work. Reserve human approval gates for production promotion decisions, not execution steps.

  2. Build rollback into the deployment process, not as an afterthought. Every deployment should know how to undo itself. For Kubernetes, keep the previous ReplicaSet warm during canary periods. For database changes, ship backward-compatible migrations. Test rollback in staging regularly. When a production issue occurs, the response should be "roll back" (seconds) not "debug and fix" (minutes to hours).
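The "keep the previous ReplicaSet warm" idea can be sketched with Argo Rollouts; using Argo Rollouts here is an assumption (the same principle works with plain Deployments plus `kubectl rollout undo`), and the selector/template are omitted for brevity:

```yaml
# Canary sketch: old ReplicaSets stay available for instant rollback.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 4
  revisionHistoryLimit: 5          # keep recent ReplicaSets warm for rollback
  strategy:
    canary:
      steps:
        - setWeight: 20            # send 20% of traffic to the new version
        - pause: {duration: 5m}    # watch metrics before promoting
        - setWeight: 100
```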

  3. Cache aggressively in CI pipelines. Docker layer caching, dependency caching (node_modules, pip packages), and build artifact caching can reduce pipeline runtime from 20 minutes to 3 minutes. Faster pipelines mean faster feedback, more frequent deployments, and lower developer friction. Configure cache keys based on lockfile hashes so caches invalidate when dependencies actually change.
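For example, a GitHub Actions dependency cache keyed on the lockfile hash (paths assume an npm project):

```yaml
# Cache invalidates only when package-lock.json actually changes;
# restore-keys allows partial reuse of a stale cache on a miss.
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: npm-${{ runner.os }}-
```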

  4. Use environment promotion, not environment-specific branches. Build one artifact, promote it through environments. The same Docker image that passes staging tests should deploy to production—not a rebuild from the same commit. Environment-specific configuration lives in environment overlays (Kustomize) or values files (Helm), not in the application code or container image.
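With Kustomize, promotion means pinning the already-tested image tag in each environment overlay; the directory layout and names below are hypothetical:

```yaml
# overlays/production/kustomization.yaml -- the same image is promoted
# through environments; only environment config differs per overlay.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: replica-count.yaml       # prod-only settings live here
images:
  - name: myapp
    newName: ghcr.io/example/myapp
    newTag: "sha-abc1234"          # tag that already passed staging tests
```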

  5. Monitor the deployment, not just the application. Track pipeline duration, success rate, deployment frequency, and mean time to recovery (DORA metrics). Alert when pipeline duration increases by 50% (cache or infrastructure regression), when success rate drops below 95% (flaky tests or infrastructure issues), and when deployment frequency decreases (process friction). These metrics tell you whether your CI/CD system is healthy.
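The success-rate alert above might look like this as a Prometheus rule; the metric name `pipeline_runs_total` is an assumption and depends on which CI exporter you run:

```yaml
# Sketch: alert when pipeline success rate over the last day drops below 95%.
groups:
  - name: cicd-health
    rules:
      - alert: PipelineSuccessRateLow
        expr: |
          sum(rate(pipeline_runs_total{status="success"}[1d]))
            / sum(rate(pipeline_runs_total[1d])) < 0.95
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "CI success rate below 95% over the last day"
```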

Common Issues

CI pipeline fails intermittently with the same code. Flaky failures usually come from: flaky tests (timing-dependent, order-dependent, or external-dependency-dependent), resource limits (runner out of memory or disk), rate limiting (Docker Hub pulls, npm registry), or infrastructure issues (runner connectivity, DNS resolution). Categorize failures, fix the most common flaky test first, and add retry logic only for genuinely transient issues like network timeouts.
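Retry logic scoped to a genuinely transient step might look like the following; `nick-fields/retry` is a popular third-party action (an assumption worth verifying before adopting):

```yaml
# Retry only the flaky registry pull, not the whole job -- broad retries
# hide real failures.
- uses: nick-fields/retry@v3
  with:
    max_attempts: 3
    timeout_minutes: 10
    command: docker pull ghcr.io/example/base:latest   # transient registry errors
```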

Deployments succeed but the application is unhealthy. Add post-deployment health checks that verify the application is actually serving requests, not just that the container started. Check readiness probes, run smoke tests against the deployed instance, and verify key metrics (error rate, latency, throughput) are within normal ranges. Automate rollback when post-deployment checks fail.
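A readiness probe against the health endpoint (matching the `/health` default in the configuration table) keeps traffic off pods that started but aren't serving; the port is an assumption:

```yaml
# Pod spec fragment: traffic only reaches pods whose health endpoint
# responds; pair this with a post-deploy smoke test for full coverage.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```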

GitOps sync drift between desired and actual state. ArgoCD or Flux may show "out of sync" when cluster state doesn't match the Git repository. Common causes: manual kubectl changes that bypass Git, CRDs or operators that modify resources, and auto-scaling that changes replica counts. Enforce Git as the single source of truth by preventing manual cluster changes, excluding auto-managed fields from sync, and using ArgoCD's self-heal feature.
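Self-heal and field exclusion can be combined in the ArgoCD `Application` spec; this fragment is a sketch of those two settings:

```yaml
# Application spec fragment: self-heal reverts manual drift, and the
# replica count is excluded from diffing so autoscaling isn't fought.
spec:
  syncPolicy:
    automated:
      selfHeal: true         # revert manual kubectl changes automatically
      prune: true
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas     # let the HPA own replica counts
```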
