
Skill · Cliptics · development · v1.0.0 · MIT

Senior DevOps Smart

A comprehensive skill for senior DevOps engineers covering CI/CD pipeline design, infrastructure-as-code, container orchestration, monitoring/alerting, and site reliability engineering practices.

When to Use This Skill

Choose this skill when:

  • Designing CI/CD pipelines with multi-stage builds and automated testing
  • Writing infrastructure-as-code with Terraform, Pulumi, or CloudFormation
  • Setting up Kubernetes clusters, Helm charts, and deployment strategies
  • Implementing monitoring, alerting, and observability with Prometheus/Grafana
  • Establishing SLOs, error budgets, and incident response procedures

Consider alternatives when:

  • Working only within a specific cloud → use that cloud's skill (AWS, GCP, Azure)
  • Need application code patterns → use a framework-specific skill
  • Building data pipelines → use a data engineering skill
  • Setting up development environments only → use a Docker skill

Quick Start

```yaml
# .github/workflows/ci-cd.yml — Production-grade CI/CD pipeline
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm test -- --coverage
      - uses: actions/upload-artifact@v4
        with: { name: coverage, path: coverage/ }

  build:
    needs: test
    runs-on: ubuntu-latest
    outputs:
      # 'version' is the single primary tag; 'tags' can be multi-line,
      # which would break the --set flag in the deploy job below
      image_tag: ${{ steps.meta.outputs.version }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/${{ github.repository }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: |
          helm upgrade --install app ./charts/app \
            --set image.tag=${{ needs.build.outputs.image_tag }} \
            --namespace production --wait --timeout 300s
```

Core Concepts

Deployment Strategy Comparison

| Strategy | Downtime | Rollback Speed | Resource Cost | Risk |
|---|---|---|---|---|
| Rolling Update | None | Medium (minutes) | 1.3x baseline | Low |
| Blue-Green | None | Fast (seconds) | 2x baseline | Very Low |
| Canary | None | Fast (seconds) | 1.1x baseline | Very Low |
| Recreate | Brief | Slow (redeploy) | 1x baseline | High |
| A/B Testing | None | Fast (routing) | 1.5x baseline | Low |
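To make the rolling-update row concrete, here is a minimal Kubernetes Deployment sketch (names, image, and port are placeholders) that caps surge capacity near the ~1.3x figure in the table:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                # illustrative name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 30%        # up to 3 extra pods during rollout → ~1.3x baseline
      maxUnavailable: 0    # never drop below desired capacity
  selector:
    matchLabels: { app: app }
  template:
    metadata:
      labels: { app: app }
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:sha-abc123   # placeholder tag
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
```

With `maxUnavailable: 0`, old pods are only removed once their replacements pass the readiness probe, which is what keeps downtime at "None" at the cost of temporary extra capacity.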

Terraform Infrastructure Pattern

```hcl
# modules/service/main.tf — Reusable service module
resource "aws_ecs_service" "app" {
  name            = var.service_name
  cluster         = var.cluster_id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.desired_count

  # Note: the AWS provider exposes these as top-level arguments,
  # not a nested "deployment_configuration" block
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = var.service_name
    container_port   = var.container_port
  }

  lifecycle {
    ignore_changes = [desired_count] # managed by auto-scaling
  }
}

resource "aws_appautoscaling_target" "app" {
  max_capacity       = var.max_count
  min_capacity       = var.min_count
  resource_id        = "service/${var.cluster_name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}
```
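A hypothetical root-module invocation of the module above; the input names mirror the variables the module references, and the cluster resource and values are illustrative:

```hcl
module "api" {
  source = "./modules/service"

  service_name   = "api"
  cluster_id     = aws_ecs_cluster.main.id
  cluster_name   = aws_ecs_cluster.main.name
  desired_count  = 3
  min_count      = 3
  max_count      = 12
  container_port = 8080
}
```

Keeping all per-service knobs as module inputs means adding a new service is a short block like this rather than a copy of the full resource set.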

Monitoring and Alerting

```yaml
# Prometheus alerting rules
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: critical
```
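Routing those `severity` labels to notification channels might look like this Alertmanager sketch (receiver names, keys, and channel are placeholders): warnings go to Slack, critical alerts page on-call.

```yaml
route:
  receiver: slack-warnings
  group_by: [alertname, service]
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: slack-warnings
    slack_configs:
      - api_url: <slack-webhook-url>
        channel: '#alerts'
```

Grouping by `alertname` and `service` collapses a burst of identical firings into one notification per service, which helps with the alert-fatigue problem discussed below.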

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `deployStrategy` | string | `'rolling'` | Deployment strategy: rolling, blue-green, or canary |
| `iacTool` | string | `'terraform'` | IaC tool: terraform, pulumi, or cloudformation |
| `containerOrchestrator` | string | `'kubernetes'` | Orchestrator: kubernetes, ecs, or nomad |
| `monitoringStack` | string | `'prometheus'` | Monitoring: prometheus, datadog, or newrelic |
| `alertChannels` | string[] | `['slack', 'pagerduty']` | Alert notification channels |
| `sloTarget` | number | `99.9` | Service Level Objective availability percentage |

Best Practices

  1. Define SLOs before building monitoring — Start with what matters to users (availability, latency, error rate), set measurable targets, then build alerts that fire when error budgets are being consumed too quickly.

  2. Use GitOps for infrastructure changes — All infrastructure modifications should flow through version control with code review, automated testing, and audit trails. Direct console changes create drift and are impossible to reproduce or roll back.

  3. Build deployment pipelines that are faster than manual deploys — If CI/CD is slower than `git push && ssh server && pull && restart`, developers will bypass it. Target under 10 minutes from merge to production for most services.

  4. Implement progressive delivery with automated rollback — Canary deployments that automatically roll back when error rates spike catch issues before they affect all users. Combine with feature flags for instant kill switches.

  5. Practice chaos engineering in non-production first — Randomly terminate pods, inject latency, and simulate AZ failures in staging. When the team is comfortable, graduate to production chaos experiments during business hours with an abort button ready.
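Practices 1 and 4 both hinge on error-budget arithmetic. A minimal sketch (the helper names are hypothetical, and a 30-day window is assumed) of how an SLO target such as the 99.9 default above translates into an allowed-downtime budget and a burn rate:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total allowed downtime in minutes for an availability SLO over a window."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def burn_rate(errors: int, requests: int, slo_percent: float) -> float:
    """Budget consumption speed: 1.0 means exactly on budget, >1 means burning early."""
    allowed_error_ratio = 1 - slo_percent / 100
    observed_error_ratio = errors / requests
    return observed_error_ratio / allowed_error_ratio


# 99.9% availability over 30 days allows ~43.2 minutes of downtime
print(error_budget_minutes(99.9))
# 30 errors in 10,000 requests against a 99.9% SLO burns budget at 3x
print(burn_rate(errors=30, requests=10_000, slo_percent=99.9))
```

Alerting on burn rate rather than raw error rate is what makes the "consumed too quickly" framing in practice 1 actionable: a sustained burn rate above 1 means the budget will be exhausted before the window ends.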

Common Issues

Terraform state conflicts in team environments — Multiple engineers running terraform apply simultaneously corrupt state. Use remote state backends (S3 + DynamoDB lock) and CI/CD-only applies. Engineers run plan locally; only the pipeline runs apply.
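One way to sketch that backend setup (bucket, key, region, and table names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "envs/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"   # lock table; prevents concurrent applies
    encrypt        = true
  }
}
```

The DynamoDB table must have a string partition key named `LockID` for state locking to work; with locking in place, a second concurrent `apply` fails fast instead of corrupting state.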

Alert fatigue from too many non-actionable alerts — Every alert should have a runbook and require human action. If an alert fires and the response is "ignore it," delete or tune the alert. Review alert frequency monthly and suppress alerts that fire more than daily without action.

Container images work locally but fail in production — Environment differences (architecture, kernel version, missing env vars) cause runtime failures. Use multi-stage builds with explicit base images, scan for vulnerabilities, and test images in a staging environment that mirrors production configuration.
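A minimal multi-stage Dockerfile sketch along those lines (Node is only an example runtime; pin whatever base image your service actually uses):

```dockerfile
# Build stage — pin an explicit base image, never :latest
FROM node:20.11-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage — same pinned base, only production artifacts, non-root user
FROM node:20.11-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
USER node
CMD ["node", "dist/server.js"]
```

Pinning the base tag in both stages removes one class of "works locally" drift: everyone, including CI, builds against the same kernel-independent userland.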
