
Skill · Cliptics · development · v1.0.0 · MIT

Senior DevOps Smart

A comprehensive skill for senior DevOps engineers covering CI/CD pipeline design, infrastructure-as-code, container orchestration, monitoring/alerting, and site reliability engineering practices.

When to Use This Skill

Choose this skill when:

  • Designing CI/CD pipelines with multi-stage builds and automated testing
  • Writing infrastructure-as-code with Terraform, Pulumi, or CloudFormation
  • Setting up Kubernetes clusters, Helm charts, and deployment strategies
  • Implementing monitoring, alerting, and observability with Prometheus/Grafana
  • Establishing SLOs, error budgets, and incident response procedures

Consider alternatives when:

  • Working only within a specific cloud → use that cloud's skill (AWS, GCP, Azure)
  • Need application code patterns → use a framework-specific skill
  • Building data pipelines → use a data engineering skill
  • Setting up development environments only → use a Docker skill

Quick Start

```yaml
# .github/workflows/ci-cd.yml — Production-grade CI/CD pipeline
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm test -- --coverage
      - uses: actions/upload-artifact@v4
        with: { name: coverage, path: coverage/ }

  build:
    needs: test
    runs-on: ubuntu-latest
    outputs:
      # 'version' is the single primary tag; 'tags' can be multi-line,
      # which would break the --set flag in the deploy job below
      image_tag: ${{ steps.meta.outputs.version }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/${{ github.repository }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: |
          helm upgrade --install app ./charts/app \
            --set image.tag=${{ needs.build.outputs.image_tag }} \
            --namespace production --wait --timeout 300s
```

Core Concepts

Deployment Strategy Comparison

| Strategy | Downtime | Rollback Speed | Resource Cost | Risk |
|---|---|---|---|---|
| Rolling Update | None | Medium (minutes) | 1.3x baseline | Low |
| Blue-Green | None | Fast (seconds) | 2x baseline | Very Low |
| Canary | None | Fast (seconds) | 1.1x baseline | Very Low |
| Recreate | Brief | Slow (redeploy) | 1x baseline | High |
| A/B Testing | None | Fast (routing) | 1.5x baseline | Low |
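To make the rolling-update row concrete, here is a minimal Kubernetes Deployment sketch (names, image, and port are placeholders) that caps surge capacity near the ~1.3x figure in the table:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                # illustrative name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 30%        # up to 3 extra pods during rollout → ~1.3x baseline
      maxUnavailable: 0    # never drop below desired capacity
  selector:
    matchLabels: { app: app }
  template:
    metadata:
      labels: { app: app }
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:sha-abc123   # placeholder tag
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
```

With `maxUnavailable: 0`, old pods are only removed once their replacements pass the readiness probe, which is what keeps downtime at "None" at the cost of temporary extra capacity.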

Terraform Infrastructure Pattern

```hcl
# modules/service/main.tf — Reusable service module
resource "aws_ecs_service" "app" {
  name            = var.service_name
  cluster         = var.cluster_id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.desired_count

  # Note: the AWS provider exposes these as top-level arguments,
  # not a nested "deployment_configuration" block
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = var.service_name
    container_port   = var.container_port
  }

  lifecycle {
    ignore_changes = [desired_count] # managed by auto-scaling
  }
}

resource "aws_appautoscaling_target" "app" {
  max_capacity       = var.max_count
  min_capacity       = var.min_count
  resource_id        = "service/${var.cluster_name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}
```
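A hypothetical root-module invocation of the module above; the input names mirror the variables the module references, and the cluster resource and values are illustrative:

```hcl
module "api" {
  source = "./modules/service"

  service_name   = "api"
  cluster_id     = aws_ecs_cluster.main.id
  cluster_name   = aws_ecs_cluster.main.name
  desired_count  = 3
  min_count      = 3
  max_count      = 12
  container_port = 8080
}
```

Keeping all per-service knobs as module inputs means adding a new service is a short block like this rather than a copy of the full resource set.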

Monitoring and Alerting

```yaml
# Prometheus alerting rules
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: critical
```
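Routing those `severity` labels to notification channels might look like this Alertmanager sketch (receiver names, keys, and channel are placeholders): warnings go to Slack, critical alerts page on-call.

```yaml
route:
  receiver: slack-warnings
  group_by: [alertname, service]
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: slack-warnings
    slack_configs:
      - api_url: <slack-webhook-url>
        channel: '#alerts'
```

Grouping by `alertname` and `service` collapses a burst of identical firings into one notification per service, which helps with the alert-fatigue problem discussed below.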

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `deployStrategy` | string | `'rolling'` | Deployment strategy: rolling, blue-green, or canary |
| `iacTool` | string | `'terraform'` | IaC tool: terraform, pulumi, or cloudformation |
| `containerOrchestrator` | string | `'kubernetes'` | Orchestrator: kubernetes, ecs, or nomad |
| `monitoringStack` | string | `'prometheus'` | Monitoring: prometheus, datadog, or newrelic |
| `alertChannels` | string[] | `['slack', 'pagerduty']` | Alert notification channels |
| `sloTarget` | number | `99.9` | Service Level Objective availability percentage |

Best Practices

  1. Define SLOs before building monitoring — Start with what matters to users (availability, latency, error rate), set measurable targets, then build alerts that fire when error budgets are being consumed too quickly.

  2. Use GitOps for infrastructure changes — All infrastructure modifications should flow through version control with code review, automated testing, and audit trails. Direct console changes create drift and are impossible to reproduce or roll back.

  3. Build deployment pipelines that are faster than manual deploys — If CI/CD is slower than `git push && ssh server && pull && restart`, developers will bypass it. Target under 10 minutes from merge to production for most services.

  4. Implement progressive delivery with automated rollback — Canary deployments that automatically roll back when error rates spike catch issues before they affect all users. Combine with feature flags for instant kill switches.

  5. Practice chaos engineering in non-production first — Randomly terminate pods, inject latency, and simulate AZ failures in staging. When the team is comfortable, graduate to production chaos experiments during business hours with an abort button ready.
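Practices 1 and 4 both hinge on error-budget arithmetic. A minimal sketch (the helper names are hypothetical, and a 30-day window is assumed) of how an SLO target such as the 99.9 default above translates into an allowed-downtime budget and a burn rate:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total allowed downtime in minutes for an availability SLO over a window."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def burn_rate(errors: int, requests: int, slo_percent: float) -> float:
    """Budget consumption speed: 1.0 means exactly on budget, >1 means burning early."""
    allowed_error_ratio = 1 - slo_percent / 100
    observed_error_ratio = errors / requests
    return observed_error_ratio / allowed_error_ratio


# 99.9% availability over 30 days allows ~43.2 minutes of downtime
print(error_budget_minutes(99.9))
# 30 errors in 10,000 requests against a 99.9% SLO burns budget at 3x
print(burn_rate(errors=30, requests=10_000, slo_percent=99.9))
```

Alerting on burn rate rather than raw error rate is what makes the "consumed too quickly" framing in practice 1 actionable: a sustained burn rate above 1 means the budget will be exhausted before the window ends.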

Common Issues

Terraform state conflicts in team environments — Multiple engineers running terraform apply simultaneously corrupt state. Use remote state backends (S3 + DynamoDB lock) and CI/CD-only applies. Engineers run plan locally; only the pipeline runs apply.
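One way to sketch that backend setup (bucket, key, region, and table names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "envs/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"   # lock table; prevents concurrent applies
    encrypt        = true
  }
}
```

The DynamoDB table must have a string partition key named `LockID` for state locking to work; with locking in place, a second concurrent `apply` fails fast instead of corrupting state.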

Alert fatigue from too many non-actionable alerts — Every alert should have a runbook and require human action. If an alert fires and the response is "ignore it," delete or tune the alert. Review alert frequency monthly and suppress alerts that fire more than daily without action.

Container images work locally but fail in production — Environment differences (architecture, kernel version, missing env vars) cause runtime failures. Use multi-stage builds with explicit base images, scan for vulnerabilities, and test images in a staging environment that mirrors production configuration.
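A minimal multi-stage Dockerfile sketch along those lines (Node is only an example runtime; pin whatever base image your service actually uses):

```dockerfile
# Build stage — pin an explicit base image, never :latest
FROM node:20.11-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage — same pinned base, only production artifacts, non-root user
FROM node:20.11-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
USER node
CMD ["node", "dist/server.js"]
```

Pinning the base tag in both stages removes one class of "works locally" drift: everyone, including CI, builds against the same kernel-independent userland.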
