Kubernetes looks elegant in tutorials, but production tells a different story. Here are 15 real-world scenarios that separate junior operators from battle-tested engineers—with actionable solutions for each.

⚔️ Deep Dive Debugging

1. Pod stuck in CrashLoopBackOff, no logs, no errors

How do you debug beyond kubectl logs and describe?

  • Check previous container logs: kubectl logs <pod> --previous - the crashed instance's output is often only visible there
  • Attach an ephemeral debug container: kubectl debug -it <pod> --image=busybox --target=<container> gives you a shell alongside the failing container (see the sketch after this list)
  • Inspect the pod's events: kubectl get events --field-selector involvedObject.name=<pod>
  • Check init containers: Often the culprit is a failing init container that runs before your main container
  • Review container entrypoint: The process might exit immediately—test the image locally with docker run -it <image> sh
  • Check resource limits: OOMKilled containers don't always leave logs—check kubectl describe pod for Last State: OOMKilled
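
Putting those checks together, a minimal triage sequence might look like this (ephemeral debug containers are enabled by default since Kubernetes v1.23; placeholders are yours to fill in):
kubectl logs <pod> --previous                                    # output from the crashed attempt
kubectl describe pod <pod> | grep -A5 "Last State"               # exit code, OOMKilled, restart reason
kubectl debug -it <pod> --image=busybox --target=<container>     # shell next to the failing process
kubectl get events --field-selector involvedObject.name=<pod> --sort-by=.lastTimestamp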

2. StatefulSet pod won't reattach its PVC after a node crash

How do you recover without recreating storage?

  • Check PV status: kubectl get pv - look for Released or Failed state
  • Force-delete the stuck pod: kubectl delete pod <pod> --force --grace-period=0
  • Check VolumeAttachment objects: kubectl get volumeattachments - stale attachments can block reattachment
  • Delete stale VolumeAttachment: kubectl delete volumeattachment <name> so the CSI controller can re-attach (see the recovery sketch after this list)
  • Verify CSI driver health: Check the CSI driver pods in kube-system namespace
  • Last resort - patch the PV: Remove the claimRef to allow rebinding: kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'
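
A recovery sketch for the steps above, assuming the PVC and the underlying volume are healthy and only the attachment is stale:
kubectl get volumeattachments | grep <pv-name>                    # find the attachment still bound to the dead node
kubectl delete volumeattachment <attachment-name>                 # clear it so the CSI controller can re-attach
kubectl delete pod <pod> --force --grace-period=0                 # let the StatefulSet recreate the pod
kubectl get events --field-selector involvedObject.name=<pod> -w  # watch attach/mount progress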

3. Pods are Pending, Cluster Autoscaler won't scale up

Top 3 debugging steps:

  • Step 1 - Check Autoscaler logs: kubectl logs -n kube-system -l app=cluster-autoscaler - look for "scale up" decisions and why they were rejected (triage commands are consolidated after this list)
  • Step 2 - Verify node group constraints: Check if max nodes limit is reached, if the required instance type is available in your region, and if there are sufficient IAM permissions
  • Step 3 - Examine pod scheduling requirements: Check for nodeSelector, affinity rules, or taints that no node group can satisfy. Run kubectl describe pod and look at the Events section for scheduling failures
  • Common culprits: PodDisruptionBudget blocking scale-down (which blocks scale-up cycles), pods with local storage that can't be rescheduled, and insufficient quota in cloud provider
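
A quick triage sequence for the steps above; the label selector assumes the standard app=cluster-autoscaler deployment label:
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=200 | grep -iE "scale[- ]?up|skipped|no node group"
kubectl describe pod <pending-pod> | sed -n '/Events:/,$p'                  # scheduling failure reasons
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml      # per-node-group health and limits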

4. NetworkPolicy blocks cross-namespace traffic

How do you design least-privilege rules and test them safely?

  • Start with audit mode: Use tools like Cilium's policy audit mode to see what would be blocked before enforcing
  • Design pattern - Default deny + explicit allow:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  • Allow cross-namespace traffic explicitly:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: frontend   # added automatically to every namespace since v1.21
      podSelector:
        matchLabels:
          app: web
  • Testing safely: Deploy a debug pod and use nc -zv <service> <port> to test connectivity before and after policy changes (see the sketch after this list)
  • Use policy visualization: Tools like Cilium Hubble or Calico's policy board help visualize traffic flows
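
A connectivity-test sketch for the policies above; the service name, port, and debug image are assumptions for illustration. Note the test pod carries the app=web label, since that is what the allow rule matches:
kubectl -n frontend run np-test --rm -it --labels=app=web --image=nicolaka/netshoot -- \
  nc -zv api.backend.svc.cluster.local 8080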

5. Service must connect to external DB via VPN inside the cluster

How do you architect it for HA + security?

  • Option 1 - Sidecar VPN container: Run a VPN client as a sidecar in pods that need DB access. Good for isolation but duplicates connections.
  • Option 2 - Dedicated VPN gateway pods: Deploy a DaemonSet or Deployment of VPN gateway pods with a Service. Route traffic through these gateways using NetworkPolicy.
  • Option 3 - Node-level VPN (recommended for HA):
# Use a DaemonSet with hostNetwork: true
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vpn-gateway
spec:
  selector:
    matchLabels:
      app: vpn-gateway
  template:
    metadata:
      labels:
        app: vpn-gateway        # must match spec.selector.matchLabels
    spec:
      hostNetwork: true
      containers:
      - name: vpn
        image: your-vpn-image
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]
  • Security considerations: Use Kubernetes Secrets for VPN credentials, implement NetworkPolicy to restrict which pods can reach the VPN gateway (see the sketch after this list), and enable mTLS if supported by your VPN solution
  • HA architecture: Run multiple VPN pods across availability zones, use a headless Service with client-side load balancing, implement health checks that verify actual DB connectivity
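
For Option 2 (dedicated gateway pods), a least-privilege ingress rule might look like this; the vpn namespace and the db-access label are assumptions for illustration:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-vpn-gateway
  namespace: vpn
spec:
  podSelector:
    matchLabels:
      app: vpn-gateway
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector: {}        # any namespace...
      podSelector:
        matchLabels:
          db-access: "true"        # ...but only pods explicitly labeled for DB access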

🧱 Security + Architecture

6. Running a multi-tenant EKS cluster

How do you isolate workloads with RBAC, quotas, and network segmentation?

  • Namespace isolation: One namespace per tenant with strict RBAC - a namespaced Role plus a RoleBinding (sketched after this list)
# Role limited to tenant namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: tenant-a
  name: tenant-role
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments", "services"]
  verbs: ["get", "list", "create", "update", "delete"]
  • Resource quotas per tenant:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
  • Network segmentation: Default-deny NetworkPolicy per namespace, explicit allow rules only for necessary cross-tenant communication
  • Additional hardening: Use PodSecurityAdmission to enforce restricted policies, implement OPA/Gatekeeper for custom policies, consider dedicated node pools per tenant for sensitive workloads
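
The Role above needs a RoleBinding to take effect. A sketch, assuming tenant users are mapped to a group such as tenant-a-devs through your identity provider (on EKS, via the aws-auth ConfigMap or access entries):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-binding
  namespace: tenant-a
subjects:
- kind: Group
  name: tenant-a-devs
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-role
  apiGroup: rbac.authorization.k8s.io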

7. Kubelet keeps restarting on one node

Where do you look first – systemd, container runtime, or cgroups?

  • Step 1 - Check systemd status: systemctl status kubelet and journalctl -u kubelet -f
  • Step 2 - Look for OOM kills: dmesg | grep -i "oom\|killed" - kubelet might be getting OOM killed
  • Step 3 - Verify container runtime: systemctl status containerd (or docker), then crictl ps to check if runtime is responsive
  • Step 4 - Check cgroup configuration: Ensure the cgroup driver matches between kubelet and the container runtime - both should use systemd or cgroupfs (see the check after this list)
  • Step 5 - Disk pressure: df -h on /var/lib/kubelet and /var/lib/containerd
  • Common causes: Certificate expiration, API server connectivity issues, corrupted kubelet state (try removing /var/lib/kubelet/cpu_manager_state)
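
A sketch of the cgroup-driver check from Step 4, assuming a kubeadm-provisioned node running containerd (file paths vary by distro):
grep cgroupDriver /var/lib/kubelet/config.yaml         # kubelet side - usually "systemd"
grep SystemdCgroup /etc/containerd/config.toml         # containerd side - should read "SystemdCgroup = true"
journalctl -u kubelet --since "1 hour ago" | grep -iE "cgroup|oom|failed"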

8. Critical pod got evicted due to node pressure

Explain QoS classes and eviction policies:

  • QoS Classes (eviction order):
  • BestEffort (evicted first): No requests or limits set
  • Burstable (evicted second): Requests < Limits, or only some containers have limits
  • Guaranteed (evicted last): Requests = Limits for ALL containers
# Guaranteed QoS example
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "500m"
  • Protect critical pods: Assign a high-value PriorityClass - both kubelet eviction and scheduler preemption favor keeping higher-priority pods:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workload
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Critical workloads that should not be evicted"
  • Eviction thresholds: Configure kubelet flags like --eviction-hard and --eviction-soft (or the equivalent KubeletConfiguration fields) to set memory/disk pressure thresholds - see the example after this list
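
A minimal KubeletConfiguration sketch for those thresholds; the values are illustrative, not recommendations:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"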

9. A rolling update caused downtime

What went wrong in your readiness/startup probe or deployment config?

  • Common culprits:
  • Missing readiness probe: Traffic routed before app is ready
  • Aggressive probe settings: initialDelaySeconds too short for slow-starting apps
  • Wrong endpoint: Probe hitting a heavy endpoint that times out
# Proper probe configuration
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
startupProbe:  # For slow-starting apps
  httpGet:
    path: /health/started
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
  • Deployment settings to review:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # Never remove healthy pods
      maxSurge: 1            # Add one new pod at a time
  minReadySeconds: 30        # Wait before marking as available
  • Don't forget preStop hooks: Delay shutdown briefly so load balancers and endpoint lists stop routing traffic before the container receives SIGTERM
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]

10. Ingress Controller fails under load

How do you debug and scale routing efficiently?

  • Debug steps:
  • Check controller logs: kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
  • Monitor connection metrics: Look for nginx_ingress_controller_nginx_process_connections
  • Check for config reload storms: Frequent Ingress changes cause reloads
  • Scaling strategies:
  • Horizontal scaling: Increase replicas and use pod anti-affinity for distribution (see the sketch after this list)
  • Tune worker processes: Set worker-processes: "auto" in ConfigMap
  • Enable keep-alive: Reduce connection overhead with upstream keep-alive
# Ingress ConfigMap tuning
data:
  worker-processes: "auto"
  max-worker-connections: "65536"
  upstream-keepalive-connections: "200"
  keep-alive: "75"
  • Consider splitting: Use separate Ingress controllers for different traffic classes (internal vs external)
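
A horizontal-scaling sketch for the controller Deployment; the label assumes the standard ingress-nginx Helm chart labels:
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: ingress-nginx
            topologyKey: kubernetes.io/hostname    # at most one controller pod per node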

⚙️ Performance + Reliability

11. Istio sidecar consumes more CPU than your app

How do you profile and optimize mesh performance?

  • Profile first: Use istioctl proxy-config to inspect Envoy configuration
  • Check configuration size: istioctl proxy-status shows config sync status
  • Optimization strategies:
  • Limit sidecar scope:
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: my-app
spec:
  egress:
  - hosts:
    - "./*"                    # Only local namespace
    - "istio-system/*"         # Control plane
  • Tune resource limits:
# In IstioOperator
spec:
  values:
    global:
      proxy:
        resources:
          requests:
            cpu: 50m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
  • Disable unnecessary features: Turn off access logging and reduce the tracing sampling rate (see the sketch after this list)
  • Consider sidecar-less: Istio ambient mesh removes sidecars for L4 traffic
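
A sketch of trimming telemetry via IstioOperator mesh config; treat the sampling value as an example, and note that recent Istio versions prefer the Telemetry API for sampling:
spec:
  meshConfig:
    accessLogFile: ""              # empty string disables Envoy access logs
    defaultConfig:
      tracing:
        sampling: 1.0              # percentage of requests traced (1% here)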

12. etcd is slowing down control plane ops

Root causes + how do you tune it safely?

  • Common root causes:
  • Disk I/O latency (etcd is very sensitive to disk performance)
  • Too many objects (secrets, configmaps, events)
  • Network latency between etcd members
  • Large object sizes (secrets with large data)
  • Diagnosis:
# Check etcd metrics
etcdctl endpoint status --write-out=table
etcdctl endpoint health

# Key metrics to watch
- etcd_disk_wal_fsync_duration_seconds (should be < 10ms)
- etcd_disk_backend_commit_duration_seconds
- etcd_server_slow_apply_total
  • Safe tuning:
  • Use SSD/NVMe storage (IOPS > 3000, latency < 1ms)
  • Increase --quota-backend-bytes if hitting space quota
  • Enable auto-compaction: --auto-compaction-retention=1
  • Regular defragmentation during maintenance windows (see the sketch after this list)
  • Reduce event TTL to limit event accumulation
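
A maintenance sketch, run inside a window (assumes etcdctl v3 with endpoints and certificates already exported; jq is used only to pull the current revision):
etcdctl endpoint status --write-out=table                                   # note DB size and revision first
rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compaction "$rev"                                                   # drop superseded revisions
etcdctl defrag                                                              # reclaim the space compaction freed (repeat per member)
etcdctl alarm list && etcdctl alarm disarm                                  # clear a NOSPACE alarm if the quota was hit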

13. You must enforce images from a trusted internal registry only

Gatekeeper, Kyverno, or custom Admission Webhook – what's your move?

  • Recommendation: Kyverno for simplicity, Gatekeeper for complex policies
  • Kyverno approach (simpler):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
  - name: validate-registries
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Images must come from internal registry"
      pattern:
        spec:
          containers:
          - image: "registry.internal.com/*"
          =(initContainers):        # optional anchor - enforced only when initContainers are present
          - image: "registry.internal.com/*"
  • Gatekeeper approach (more flexible):
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: repo-is-internal
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    repos:
    - "registry.internal.com/"
  • Custom webhook: Only if you need integration with external systems or custom logic that policy engines can't handle
  • Pro tip: Also implement image signature verification with Cosign + Kyverno for supply chain security (see the sketch below)
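
A hedged sketch of a Kyverno verifyImages rule; the public key is a placeholder for your own Cosign key:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-cosign-signature
    match:
      any:
      - resources:
          kinds:
          - Pod
    verifyImages:
    - imageReferences:
      - "registry.internal.com/*"
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              <your-cosign-public-key>
              -----END PUBLIC KEY-----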

14. Pods stuck in ContainerCreating forever

CNI attach delay? OverlayFS corruption? Walk me through your root-cause process:

  • Step 1 - Describe the pod: kubectl describe pod <pod> - check Events section for specific errors
  • Step 2 - Check CNI:
# Check CNI plugin pods
kubectl get pods -n kube-system -l k8s-app=calico-node  # or your CNI

# Check CNI logs
kubectl logs -n kube-system -l k8s-app=calico-node

# Verify CNI config exists
ls /etc/cni/net.d/
  • Step 3 - Check container runtime:
# Containerd
crictl ps
crictl logs <container-id>
journalctl -u containerd

# Check for stuck sandbox containers
crictl pods | grep NotReady
  • Step 4 - Check storage:
# Overlay filesystem issues
df -h /var/lib/containerd
mount | grep overlay

# Clean up if corrupted
crictl rmi --prune
  • Step 5 - Check for resource exhaustion: IP addresses exhausted in the CNI subnet, too many pods on the node, or the inotify watch limit reached (see the checks below)
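
One way to check for those exhaustion cases (placeholders like <node> are yours to fill in; run the sysctl on the affected node):
kubectl get pods -A --field-selector spec.nodeName=<node> | wc -l      # pod count vs the node's max-pods
kubectl get node <node> -o jsonpath='{.spec.podCIDR}{"\n"}'            # pod CIDR size vs pods scheduled
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances       # inotify limits on the node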

15. Random DNS failures in Pods

How do you debug CoreDNS, kube-proxy, and conntrack interactions?

  • Step 1 - Test DNS from pod:
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default
kubectl run -it --rm debug --image=busybox -- nslookup google.com
  • Step 2 - Check CoreDNS:
# CoreDNS pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check CoreDNS metrics
kubectl top pods -n kube-system -l k8s-app=kube-dns
  • Step 3 - Check kube-proxy:
# Verify kube-proxy mode
kubectl get cm -n kube-system kube-proxy -o yaml | grep mode

# Check iptables/ipvs rules
iptables -t nat -L -n | grep -i kube-dns
ipvsadm -ln | grep <kube-dns-cluster-ip>
  • Step 4 - Conntrack issues (common culprit):
# Check conntrack table
conntrack -S
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Increase if hitting limits
sysctl -w net.netfilter.nf_conntrack_max=524288
  • Step 5 - The race condition fix: Add DNS options to the pod spec to avoid the conntrack race on parallel A/AAAA lookups (glibc resolvers only - musl-based images such as Alpine ignore these options):
dnsConfig:
  options:
  - name: single-request-reopen
  - name: ndots
    value: "2"
  • Pro tip: Use NodeLocal DNSCache to reduce CoreDNS load and avoid conntrack issues for DNS traffic

Summary

These 15 scenarios represent the reality of running Kubernetes in production. The key takeaways:

  • Always gather data first: Events, logs, metrics, and node status before making changes
  • Layer your defenses: RBAC + NetworkPolicy + Resource Quotas + Pod Security
  • Design for failure: Proper QoS classes, priority classes, and PodDisruptionBudgets
  • Monitor the control plane: etcd and kubelet health are critical
  • Test before enforce: Audit mode for policies, canary deployments for changes

Production Kubernetes requires a systematic debugging approach and deep understanding of the underlying components. Master these scenarios, and you'll be ready for whatever your clusters throw at you.