Kubernetes Specialist Mentor

By AgentCliptics · devops infrastructure · v1.0.0 · MIT

Comprehensive agent for designing, deploying, and managing production Kubernetes clusters. Includes structured workflows, validation checks, and reusable patterns for DevOps infrastructure.

Kubernetes Specialist Mentor

A senior Kubernetes specialist with deep expertise in designing, deploying, and managing production Kubernetes clusters, covering cluster architecture, workload orchestration, security hardening, and operational excellence.

When to Use This Agent

Choose Kubernetes Specialist Mentor when:

  • Designing Kubernetes cluster architecture for production workloads
  • Writing and debugging Kubernetes manifests, Helm charts, or Kustomize configs
  • Implementing Kubernetes security (RBAC, network policies, pod security)
  • Troubleshooting pod failures, networking issues, or resource constraints
  • Setting up Kubernetes observability and autoscaling

Consider alternatives when:

  • Choosing between Kubernetes and simpler alternatives (use Guide Cloud Architect)
  • Building CI/CD pipelines (use DevOps Expert Consultant)
  • Managing cloud infrastructure beyond Kubernetes (use cloud-specific agents)

Quick Start

```yaml
# .claude/agents/kubernetes-specialist-mentor.yml
name: Kubernetes Specialist Mentor
description: Design and manage production Kubernetes clusters
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
```

Example invocation:

claude "Design a production-ready Kubernetes deployment for our microservices with HPA, PDB, network policies, and proper resource management"

Core Concepts

Production Deployment Manifest

```yaml
# deployment.yaml: production-ready workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
    version: v2.4.1
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
        version: v2.4.1
    spec:
      serviceAccountName: payment-service
      securityContext:
        runAsNonRoot: true
        fsGroup: 1000
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:v2.4.1
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: payment-db
                  key: host
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-service
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-service
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

Network Policy

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-service-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: payment-db
      ports:
        - protocol: TCP
          port: 5432
    - to:  # DNS
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| cluster_type | Managed Kubernetes (eks, aks, gke, self-managed) | Auto-detect |
| manifest_format | Manifest tooling (raw-yaml, helm, kustomize) | kustomize |
| ingress_controller | Ingress solution (nginx, traefik, istio) | nginx |
| service_mesh | Service mesh (istio, linkerd, none) | none |
| monitoring | Monitoring stack (prometheus, datadog) | prometheus |
| security_baseline | Security standard (pod-security-standards, custom) | restricted |

Best Practices

  1. Always set resource requests AND limits on every container. Requests determine scheduling (which node has capacity). Limits prevent containers from consuming unbounded resources. Set requests based on observed P50 usage and limits at 2x requests. Without requests, the scheduler cannot make informed decisions. Without limits, one misbehaving pod can starve all others on the node.

  2. Use PodDisruptionBudgets for every production deployment. PDBs prevent Kubernetes from draining too many pods during node maintenance, cluster upgrades, or autoscaler scale-down. Set minAvailable to at least 2 for any service that needs high availability. Without a PDB, a node drain can take down all replicas of a service simultaneously.

  3. Implement readiness and liveness probes with different characteristics. Readiness probes determine when a pod can receive traffic: configure them with short intervals (5-10s) and check application-level health (can the app process requests?). Liveness probes determine when a pod needs to be restarted: configure them with longer intervals (15-30s) and check only basic process health. Confusing the two causes either traffic routed to unhealthy pods or unnecessary restarts.

  4. Use topology spread constraints to distribute pods across failure domains. Default Kubernetes scheduling may place all replicas on the same node or availability zone. Use topologySpreadConstraints to ensure pods are spread across zones. This prevents a single zone failure from taking down all replicas. For critical services, use whenUnsatisfiable: DoNotSchedule to enforce the spread.

  5. Apply network policies to every namespace in production. Default Kubernetes networking allows all pods to communicate with all other pods. This means a compromised pod can reach any service in the cluster. Apply deny-all default policies to each namespace, then explicitly allow only the traffic patterns your services need. Test policies in a staging namespace first to avoid blocking legitimate traffic.
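The deny-all default described in practice 5 can be sketched as a single manifest per namespace (a minimal example; the namespace name is illustrative):

```yaml
# default-deny.yaml: baseline policy that blocks all ingress and egress
# for every pod in the namespace; traffic is then re-enabled by adding
# narrowly scoped allow policies such as the payment-service one above.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}        # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Because NetworkPolicies are additive, this baseline combines cleanly with later allow rules: a pod may communicate as soon as any policy permits that traffic.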

Common Issues

Pods are OOMKilled repeatedly despite setting memory limits. The container's actual memory usage exceeds the limit, causing the kernel to kill it. Common causes: JVM heap not matching container limits (set -Xmx to 75% of the container limit), memory leaks in the application, or limits set too low for the workload. Check kubectl describe pod for OOMKilled events and increase limits or fix the memory leak.
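For the JVM case, one way to keep the heap inside the container limit is to pass the flag through the environment, sketched below (values are illustrative and assume an image whose JVM honors JAVA_TOOL_OPTIONS):

```yaml
# Container-spec fragment: cap the JVM heap at roughly 75% of the 512Mi
# memory limit, leaving headroom for metaspace, thread stacks, and
# off-heap buffers that also count against the cgroup limit.
containers:
  - name: payment-service
    image: registry.example.com/payment-service:v2.4.1
    resources:
      limits:
        memory: 512Mi
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-Xmx384m"   # ~75% of the 512Mi container limit
```

On modern JVMs, `-XX:MaxRAMPercentage=75` achieves the same effect relative to the detected container limit, which avoids hard-coding the heap size.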

HPA scaling is too slow to handle traffic spikes. HPA evaluates metrics every 15 seconds by default and scales gradually. A sudden traffic spike overwhelms existing pods before new ones are ready. Mitigation: set minReplicas high enough to handle expected baseline traffic, use KEDA for event-driven scaling that reacts faster, and implement request queuing in the application to buffer during scale-up.
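One mitigation that stays within the HPA API is tuning the `autoscaling/v2` behavior stanza so scale-up reacts immediately while scale-down stays conservative (a sketch; the numbers are illustrative, not recommendations):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to spikes immediately
      policies:
        - type: Percent
          value: 100                   # allow doubling replicas per period
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

This does not change the metrics evaluation interval, so KEDA or a higher minReplicas is still the right tool when spikes outpace pod startup time.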

Services cannot resolve each other's DNS names after deploying to a new namespace. Kubernetes DNS uses the format <service-name>.<namespace>.svc.cluster.local. Services in the same namespace can use just the service name, but cross-namespace communication requires the full DNS name. Check that services reference the correct namespace in their configuration. Verify CoreDNS is running with kubectl get pods -n kube-system -l k8s-app=kube-dns.
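The cross-namespace case can be wired in explicitly with the fully qualified name (service and namespace names here are hypothetical):

```yaml
# Pod-spec fragment for a workload in the "staging" namespace that must
# reach a database Service living in "production": the short name
# "payment-db" only resolves inside production, so the FQDN is used.
env:
  - name: DB_HOST
    value: payment-db.production.svc.cluster.local
  - name: CACHE_HOST
    value: redis    # same-namespace Service: the short name resolves fine
```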
