Kubernetes Specialist Mentor
Comprehensive agent for designing, deploying, and managing production Kubernetes clusters. Includes structured workflows, validation checks, and reusable patterns for DevOps infrastructure.
A senior Kubernetes specialist with deep expertise in designing, deploying, and managing production Kubernetes clusters, covering cluster architecture, workload orchestration, security hardening, and operational excellence.
When to Use This Agent
Choose Kubernetes Specialist Mentor when:
- Designing Kubernetes cluster architecture for production workloads
- Writing and debugging Kubernetes manifests, Helm charts, or Kustomize configs
- Implementing Kubernetes security (RBAC, network policies, pod security)
- Troubleshooting pod failures, networking issues, or resource constraints
- Setting up Kubernetes observability and autoscaling
Consider alternatives when:
- Choosing between Kubernetes and simpler alternatives (use Guide Cloud Architect)
- Building CI/CD pipelines (use DevOps Expert Consultant)
- Managing cloud infrastructure beyond Kubernetes (use cloud-specific agents)
Quick Start
```yaml
# .claude/agents/kubernetes-specialist-mentor.yml
name: Kubernetes Specialist Mentor
description: Design and manage production Kubernetes clusters
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
```
Example invocation:
```shell
claude "Design a production-ready Kubernetes deployment for our microservices with HPA, PDB, network policies, and proper resource management"
```
Core Concepts
Production Deployment Manifest
```yaml
# deployment.yaml - production-ready workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
    version: v2.4.1
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
        version: v2.4.1
    spec:
      serviceAccountName: payment-service
      securityContext:
        runAsNonRoot: true
        fsGroup: 1000
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:v2.4.1
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: payment-db
                  key: host
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-service
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-service
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Network Policy
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-service-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: payment-db
      ports:
        - protocol: TCP
          port: 5432
    - to:  # DNS
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `cluster_type` | Managed Kubernetes (eks, aks, gke, self-managed) | Auto-detect |
| `manifest_format` | Manifest tooling (raw-yaml, helm, kustomize) | kustomize |
| `ingress_controller` | Ingress solution (nginx, traefik, istio) | nginx |
| `service_mesh` | Service mesh (istio, linkerd, none) | none |
| `monitoring` | Monitoring stack (prometheus, datadog) | prometheus |
| `security_baseline` | Security standard (pod-security-standards, custom) | restricted |
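Since `manifest_format` defaults to kustomize, a minimal production overlay might look like the following sketch (directory layout and file names are illustrative, not prescribed by this template):

```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base          # shared Deployment, Service, NetworkPolicy manifests
patches:
  - path: replica-count.yaml
    target:
      kind: Deployment
      name: payment-service
images:
  - name: registry.example.com/payment-service
    newTag: v2.4.1      # pin the image tag per environment
```

Rendering the overlay with `kubectl kustomize overlays/production` before applying lets you review the final manifests.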
Best Practices
- Always set resource requests AND limits on every container. Requests determine scheduling (which node has capacity); limits prevent containers from consuming unbounded resources. Set requests based on observed P50 usage and limits at roughly 2x requests. Without requests, the scheduler cannot make informed decisions. Without limits, one misbehaving pod can starve all others on the node.
- Use PodDisruptionBudgets for every production deployment. PDBs prevent Kubernetes from draining too many pods during node maintenance, cluster upgrades, or autoscaler scale-down. Set `minAvailable` to at least 2 for any service that needs high availability. Without a PDB, a node drain can take down all replicas of a service simultaneously.
- Implement readiness and liveness probes with different characteristics. Readiness probes determine when a pod can receive traffic: configure them with short intervals (5-10s) and check application-level health (can it process requests?). Liveness probes determine when a pod needs to be restarted: configure them with longer intervals (15-30s) and check basic process health. Getting these confused causes either traffic to unhealthy pods or unnecessary restarts.
- Use topology spread constraints to distribute pods across failure domains. Default Kubernetes scheduling may place all replicas on the same node or availability zone. Use `topologySpreadConstraints` to spread pods across zones so that a single zone failure cannot take down all replicas. For critical services, use `whenUnsatisfiable: DoNotSchedule` to enforce the spread.
- Apply network policies to every namespace in production. Default Kubernetes networking allows all pods to communicate with all other pods, so a compromised pod can reach any service in the cluster. Apply deny-all default policies to each namespace, then explicitly allow only the traffic patterns your services need. Test policies in a staging namespace first to avoid blocking legitimate traffic.
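The deny-all baseline described in the last point can be a single policy per namespace; a minimal sketch (the namespace name is illustrative):

```yaml
# default-deny.yaml - block all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}        # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

With this in place, traffic flows only where an additional policy (like the payment-service example above) explicitly allows it, including the UDP 53 egress needed for DNS.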
Common Issues
Pods are OOMKilled repeatedly despite setting memory limits. The container's actual memory usage exceeds the limit, causing the kernel to kill it. Common causes: a JVM heap that does not match the container limit (set `-Xmx` to about 75% of the container limit), memory leaks in the application, or limits set too low for the workload. Check `kubectl describe pod` for `OOMKilled` events and increase the limit or fix the leak.
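For JVM workloads, one way to keep the heap aligned with the container limit is a percentage-based flag rather than a hard-coded `-Xmx`; a sketch (container name, image, and the 75% value are illustrative):

```yaml
# Container fragment: cap the JVM heap relative to the container memory limit
containers:
  - name: payment-service
    image: registry.example.com/payment-service:v2.4.1
    resources:
      limits:
        memory: 512Mi
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"   # heap capped at ~75% of the 512Mi limit
```

`JAVA_TOOL_OPTIONS` is picked up automatically by the JVM, and `MaxRAMPercentage` scales with the container's cgroup memory limit, so the heap stays correct even if the limit changes.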
HPA scaling is too slow to handle traffic spikes. The HPA evaluates metrics every 15 seconds by default and scales gradually, so a sudden spike can overwhelm existing pods before new ones are ready. Mitigations: set `minReplicas` high enough to handle expected baseline traffic, use KEDA for event-driven scaling that reacts faster, and implement request queuing in the application to buffer during scale-up.
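If you stay with the built-in HPA, the `behavior` field in `autoscaling/v2` can make scale-up more aggressive while keeping scale-down conservative; a sketch with illustrative values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0       # react immediately to rising load
      policies:
        - type: Percent
          value: 100                      # allow doubling the replica count
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300     # scale down slowly to avoid flapping
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The asymmetric stabilization windows mean spikes are answered quickly while brief lulls do not trigger premature scale-down.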
Services cannot resolve each other's DNS names after deploying to a new namespace. Kubernetes DNS uses the form `<service-name>.<namespace>.svc.cluster.local`. Services in the same namespace can use just the service name, but cross-namespace communication requires the full DNS name. Check that services reference the correct namespace in their configuration, and verify CoreDNS is running with `kubectl get pods -n kube-system -l k8s-app=kube-dns`.
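In practice this usually surfaces in connection strings; a fragment showing the cross-namespace form (the variable name, database, and credentials-free URL are illustrative):

```yaml
# Container env fragment: reference a service in another namespace by FQDN
env:
  - name: PAYMENT_DB_URL
    # within the same namespace, "payment-db:5432" would suffice;
    # across namespaces the namespace-qualified name is required
    value: "postgres://payment-db.production.svc.cluster.local:5432/payments"
```

The shorter form `payment-db.production` also resolves, since `svc.cluster.local` is in the pod's DNS search path, but the fully qualified name avoids ambiguity.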