Kubernetes Specialist Mentor
Comprehensive agent for designing, deploying, and managing production Kubernetes clusters. Includes structured workflows, validation checks, and reusable patterns for DevOps infrastructure.
A senior Kubernetes specialist with deep expertise in designing, deploying, and managing production Kubernetes clusters, covering cluster architecture, workload orchestration, security hardening, and operational excellence.
When to Use This Agent
Choose Kubernetes Specialist Mentor when:
- Designing Kubernetes cluster architecture for production workloads
- Writing and debugging Kubernetes manifests, Helm charts, or Kustomize configs
- Implementing Kubernetes security (RBAC, network policies, pod security)
- Troubleshooting pod failures, networking issues, or resource constraints
- Setting up Kubernetes observability and autoscaling
Consider alternatives when:
- Choosing between Kubernetes and simpler alternatives (use Guide Cloud Architect)
- Building CI/CD pipelines (use DevOps Expert Consultant)
- Managing cloud infrastructure beyond Kubernetes (use cloud-specific agents)
Quick Start
```yaml
# .claude/agents/kubernetes-specialist-mentor.yml
name: Kubernetes Specialist Mentor
description: Design and manage production Kubernetes clusters
model: claude-sonnet
tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
```
Example invocation:
```shell
claude "Design a production-ready Kubernetes deployment for our microservices with HPA, PDB, network policies, and proper resource management"
```
Core Concepts
Production Deployment Manifest
```yaml
# deployment.yaml - production-ready workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
    version: v2.4.1
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
        version: v2.4.1
    spec:
      serviceAccountName: payment-service
      securityContext:
        runAsNonRoot: true
        fsGroup: 1000
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:v2.4.1
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: payment-db
                  key: host
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-service
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-service
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Network Policy
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-service-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: payment-db
      ports:
        - protocol: TCP
          port: 5432
    - to:  # DNS
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| `cluster_type` | Managed Kubernetes (eks, aks, gke, self-managed) | Auto-detect |
| `manifest_format` | Manifest tooling (raw-yaml, helm, kustomize) | kustomize |
| `ingress_controller` | Ingress solution (nginx, traefik, istio) | nginx |
| `service_mesh` | Service mesh (istio, linkerd, none) | none |
| `monitoring` | Monitoring stack (prometheus, datadog) | prometheus |
| `security_baseline` | Security standard (pod-security-standards, custom) | restricted |
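Since `manifest_format` defaults to kustomize, a minimal production overlay might look like the following sketch (directory layout and file names are illustrative, not prescribed by this template):

```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base          # shared Deployment, Service, NetworkPolicy manifests
patches:
  - path: replica-count.yaml
    target:
      kind: Deployment
      name: payment-service
images:
  - name: registry.example.com/payment-service
    newTag: v2.4.1      # pin the image tag per environment
```

Rendering the overlay with `kubectl kustomize overlays/production` before applying lets you review the final manifests.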
Best Practices
- Always set resource requests AND limits on every container. Requests determine scheduling (which node has capacity); limits prevent containers from consuming unbounded resources. Set requests based on observed P50 usage and limits at roughly 2x requests. Without requests, the scheduler cannot make informed decisions. Without limits, one misbehaving pod can starve all others on the node.
- Use PodDisruptionBudgets for every production deployment. PDBs prevent Kubernetes from draining too many pods during node maintenance, cluster upgrades, or autoscaler scale-down. Set `minAvailable` to at least 2 for any service that needs high availability. Without a PDB, a node drain can take down all replicas of a service simultaneously.
- Implement readiness and liveness probes with different characteristics. Readiness probes determine when a pod can receive traffic: configure them with short intervals (5-10s) and check application-level health (can it process requests?). Liveness probes determine when a pod needs to be restarted: configure them with longer intervals (15-30s) and check basic process health. Getting these confused causes either traffic to unhealthy pods or unnecessary restarts.
- Use topology spread constraints to distribute pods across failure domains. Default Kubernetes scheduling may place all replicas on the same node or availability zone. Use `topologySpreadConstraints` to spread pods across zones so that a single zone failure cannot take down all replicas. For critical services, use `whenUnsatisfiable: DoNotSchedule` to enforce the spread.
- Apply network policies to every namespace in production. Default Kubernetes networking allows all pods to communicate with all other pods, so a compromised pod can reach any service in the cluster. Apply deny-all default policies to each namespace, then explicitly allow only the traffic patterns your services need. Test policies in a staging namespace first to avoid blocking legitimate traffic.
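The deny-all baseline described in the last point can be a single policy per namespace; a minimal sketch (the namespace name is illustrative):

```yaml
# default-deny.yaml - block all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}        # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

With this in place, traffic flows only where an additional policy (like the payment-service example above) explicitly allows it, including the UDP 53 egress needed for DNS.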
Common Issues
Pods are OOMKilled repeatedly despite setting memory limits. The container's actual memory usage exceeds the limit, causing the kernel to kill it. Common causes: a JVM heap that does not match the container limit (set `-Xmx` to about 75% of the container limit), memory leaks in the application, or limits set too low for the workload. Check `kubectl describe pod` for `OOMKilled` events and increase the limit or fix the leak.
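For JVM workloads, one way to keep the heap aligned with the container limit is a percentage-based flag rather than a hard-coded `-Xmx`; a sketch (container name, image, and the 75% value are illustrative):

```yaml
# Container fragment: cap the JVM heap relative to the container memory limit
containers:
  - name: payment-service
    image: registry.example.com/payment-service:v2.4.1
    resources:
      limits:
        memory: 512Mi
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"   # heap capped at ~75% of the 512Mi limit
```

`JAVA_TOOL_OPTIONS` is picked up automatically by the JVM, and `MaxRAMPercentage` scales with the container's cgroup memory limit, so the heap stays correct even if the limit changes.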
HPA scaling is too slow to handle traffic spikes. The HPA evaluates metrics every 15 seconds by default and scales gradually, so a sudden spike can overwhelm existing pods before new ones are ready. Mitigations: set `minReplicas` high enough to handle expected baseline traffic, use KEDA for event-driven scaling that reacts faster, and implement request queuing in the application to buffer during scale-up.
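If you stay with the built-in HPA, the `behavior` field in `autoscaling/v2` can make scale-up more aggressive while keeping scale-down conservative; a sketch with illustrative values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0       # react immediately to rising load
      policies:
        - type: Percent
          value: 100                      # allow doubling the replica count
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300     # scale down slowly to avoid flapping
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The asymmetric stabilization windows mean spikes are answered quickly while brief lulls do not trigger premature scale-down.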
Services cannot resolve each other's DNS names after deploying to a new namespace. Kubernetes DNS uses the form `<service-name>.<namespace>.svc.cluster.local`. Services in the same namespace can use just the service name, but cross-namespace communication requires the full DNS name. Check that services reference the correct namespace in their configuration, and verify CoreDNS is running with `kubectl get pods -n kube-system -l k8s-app=kube-dns`.
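In practice this usually surfaces in connection strings; a fragment showing the cross-namespace form (the variable name, database, and credentials-free URL are illustrative):

```yaml
# Container env fragment: reference a service in another namespace by FQDN
env:
  - name: PAYMENT_DB_URL
    # within the same namespace, "payment-db:5432" would suffice;
    # across namespaces the namespace-qualified name is required
    value: "postgres://payment-db.production.svc.cluster.local:5432/payments"
```

The shorter form `payment-db.production` also resolves, since `svc.cluster.local` is in the pod's DNS search path, but the fully qualified name avoids ambiguity.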