Senior DevOps Smart
A comprehensive skill for senior DevOps engineers covering CI/CD pipeline design, infrastructure-as-code, container orchestration, monitoring/alerting, and site reliability engineering practices.
When to Use This Skill
Choose this skill when:
- Designing CI/CD pipelines with multi-stage builds and automated testing
- Writing infrastructure-as-code with Terraform, Pulumi, or CloudFormation
- Setting up Kubernetes clusters, Helm charts, and deployment strategies
- Implementing monitoring, alerting, and observability with Prometheus/Grafana
- Establishing SLOs, error budgets, and incident response procedures
Consider alternatives when:
- Working only within a specific cloud → use that cloud's skill (AWS, GCP, Azure)
- Need application code patterns → use a framework-specific skill
- Building data pipelines → use a data engineering skill
- Setting up development environments only → use a Docker skill
Quick Start
```yaml
# .github/workflows/ci-cd.yml — Production-grade CI/CD pipeline
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm test -- --coverage
      - uses: actions/upload-artifact@v4
        with: { name: coverage, path: coverage/ }

  build:
    needs: test
    runs-on: ubuntu-latest
    outputs:
      # Use `version` (a single tag), not `tags`: with two tag types below,
      # `tags` is multiline and would break the helm --set in deploy.
      image_tag: ${{ steps.meta.outputs.version }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/${{ github.repository }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: |
          helm upgrade --install app ./charts/app \
            --set image.tag=${{ needs.build.outputs.image_tag }} \
            --namespace production --wait --timeout 300s
```
Core Concepts
Deployment Strategy Comparison
| Strategy | Downtime | Rollback Speed | Resource Cost | Risk |
|---|---|---|---|---|
| Rolling Update | None | Medium (minutes) | 1.3x baseline | Low |
| Blue-Green | None | Fast (seconds) | 2x baseline | Very Low |
| Canary | None | Fast (seconds) | 1.1x baseline | Very Low |
| Recreate | Brief | Slow (redeploy) | 1x baseline | High |
| A/B Testing | None | Fast (routing) | 1.5x baseline | Low |
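As a concrete reference point, the Rolling Update row above maps onto a Kubernetes Deployment strategy like the following sketch (the name, replica count, image, and probe path are illustrative, not from this skill):

```yaml
# Rolling update: the ~1.3x resource cost comes from maxSurge extra pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                  # illustrative name
spec:
  replicas: 10
  selector:
    matchLabels: { app: app }
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 30%          # up to 3 extra pods during rollout (~1.3x baseline)
      maxUnavailable: 0      # zero downtime: never drop below desired count
  template:
    metadata:
      labels: { app: app }
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:latest
          readinessProbe:    # gate traffic until the new pod is healthy
            httpGet: { path: /healthz, port: 8080 }
```

With `maxUnavailable: 0`, every old pod is replaced only after its successor passes the readiness probe, which is what makes the rollback in the table "medium" speed: undoing the rollout replays the same pod-by-pod cycle.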
Terraform Infrastructure Pattern
```hcl
# modules/service/main.tf — Reusable service module
resource "aws_ecs_service" "app" {
  name            = var.service_name
  cluster         = var.cluster_id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.desired_count

  # Note: these are top-level arguments on aws_ecs_service,
  # not a nested deployment_configuration block.
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = var.service_name
    container_port   = var.container_port
  }

  lifecycle {
    ignore_changes = [desired_count] # managed by auto-scaling
  }
}

resource "aws_appautoscaling_target" "app" {
  max_capacity       = var.max_count
  min_capacity       = var.min_count
  resource_id        = "service/${var.cluster_name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}
```
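A sketch of how the module above might be consumed from an environment root. The module path, cluster resource, and counts are placeholder assumptions about repo layout, not part of the module itself:

```hcl
# envs/production/main.tf — instantiate the reusable service module
module "api" {
  source = "../../modules/service"   # hypothetical repo layout

  service_name   = "api"
  cluster_id     = aws_ecs_cluster.main.id
  cluster_name   = aws_ecs_cluster.main.name
  container_port = 8080

  desired_count = 3   # initial count; auto-scaling takes over afterwards
  min_count     = 3
  max_count     = 12
}
```

Because the module sets `ignore_changes = [desired_count]`, later `terraform apply` runs will not fight the auto-scaler over the live count.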
Monitoring and Alerting
```yaml
# Prometheus alerting rules
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by service so the vectors match;
        # dividing raw series would fail on the differing status labels.
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: critical
```
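The `HighErrorRate` expression is just a ratio of 5xx request rate to total request rate. A minimal Python sketch of the same decision, useful for unit-testing alert thresholds outside Prometheus (the sample counts are hypothetical, not a Prometheus client):

```python
def error_rate(counts_by_status: dict) -> float:
    """Fraction of requests with a 5xx status, mirroring the PromQL ratio."""
    total = sum(counts_by_status.values())
    errors = sum(v for s, v in counts_by_status.items() if s.startswith("5"))
    return errors / total if total else 0.0


def should_alert(counts_by_status: dict, threshold: float = 0.05) -> bool:
    """Fire when the error rate exceeds the 5% threshold from the rule above."""
    return error_rate(counts_by_status) > threshold


# Example: 40 errors out of 1000 requests -> 4%, below the 5% threshold
print(should_alert({"200": 900.0, "404": 60.0, "500": 40.0}))  # False
```

The `for: 5m` clause in the rule adds what this sketch omits: the ratio must stay above the threshold for five minutes before the alert fires, which filters out single-scrape blips.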
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `deployStrategy` | string | `'rolling'` | Deployment strategy: `rolling`, `blue-green`, or `canary` |
| `iacTool` | string | `'terraform'` | IaC tool: `terraform`, `pulumi`, or `cloudformation` |
| `containerOrchestrator` | string | `'kubernetes'` | Orchestrator: `kubernetes`, `ecs`, or `nomad` |
| `monitoringStack` | string | `'prometheus'` | Monitoring: `prometheus`, `datadog`, or `newrelic` |
| `alertChannels` | string[] | `['slack', 'pagerduty']` | Alert notification channels |
| `sloTarget` | number | `99.9` | Service Level Objective availability percentage |
Best Practices
- Define SLOs before building monitoring — Start with what matters to users (availability, latency, error rate), set measurable targets, then build alerts that fire when error budgets are being consumed too quickly.
- Use GitOps for infrastructure changes — All infrastructure modifications should flow through version control with code review, automated testing, and audit trails. Direct console changes create drift and are impossible to reproduce or roll back.
- Build deployment pipelines that are faster than manual deploys — If CI/CD is slower than `git push && ssh server && git pull && restart`, developers will bypass it. Target under 10 minutes from merge to production for most services.
- Implement progressive delivery with automated rollback — Canary deployments that automatically roll back when error rates spike catch issues before they affect all users. Combine with feature flags for instant kill switches.
- Practice chaos engineering in non-production first — Randomly terminate pods, inject latency, and simulate AZ failures in staging. When the team is comfortable, graduate to production chaos experiments during business hours with an abort button ready.
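The SLO guidance above has simple arithmetic behind it. A sketch of an error-budget calculator for the 99.9% `sloTarget` from Configuration (the 30-day window is an assumed default, not part of this skill):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return round(total_minutes * (1 - slo_percent / 100), 2)


def burn_rate(downtime_minutes: float, slo_percent: float,
              window_days: int = 30) -> float:
    """Fraction of the budget consumed; 1.0 means the budget is exhausted."""
    return downtime_minutes / error_budget_minutes(slo_percent, window_days)


# 99.9% availability over 30 days allows 43.2 minutes of downtime
print(error_budget_minutes(99.9))  # 43.2
```

This is why "fire when error budgets are being consumed too quickly" beats raw uptime alerts: a service that burns half its monthly budget in one afternoon deserves a page even though it is still technically within SLO.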
Common Issues
Terraform state conflicts in team environments — Multiple engineers running terraform apply simultaneously corrupt state. Use remote state backends (S3 + DynamoDB lock) and CI/CD-only applies. Engineers run plan locally; only the pipeline runs apply.
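A minimal sketch of the remote state backend described above; the bucket, key, region, and table names are placeholders:

```hcl
# backend.tf — S3 remote state with DynamoDB locking
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # placeholder bucket name
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"           # placeholder lock table
    encrypt        = true
  }
}
```

With the lock table in place, a second concurrent `apply` blocks on the lock instead of writing over the first run's state.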
Alert fatigue from too many non-actionable alerts — Every alert should have a runbook and require human action. If an alert fires and the response is "ignore it," delete or tune the alert. Review alert frequency monthly and suppress alerts that fire more than daily without action.
Container images work locally but fail in production — Environment differences (architecture, kernel version, missing env vars) cause runtime failures. Use multi-stage builds with explicit base images, scan for vulnerabilities, and test images in a staging environment that mirrors production configuration.
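A sketch of the multi-stage build with explicit, pinned base images mentioned above; the Node version, paths, and entrypoint are illustrative assumptions:

```dockerfile
# Build stage: pinned, explicit base image — no floating "latest"
FROM node:20.11-bookworm-slim AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: same pinned base family, smaller attack surface
FROM node:20.11-bookworm-slim
ENV NODE_ENV=production
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
USER node
CMD ["node", "dist/server.js"]
```

Pinning the same base in both stages removes one class of "works locally" surprises: the runtime kernel and libc match what the build was tested against.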