Kubernetes in Real Life: 15 Critical Scenarios and Solutions
Kubernetes looks elegant in tutorials, but production tells a different story. Here are 15 real-world scenarios that separate junior operators from battle-tested engineers—with actionable solutions for each.
⚔️ Deep Dive Debugging
1. Pod stuck in CrashLoopBackOff, no logs, no errors
How do you debug beyond kubectl logs and describe?
- Check previous container logs: `kubectl logs <pod> --previous` - the container might crash before writing logs
- Exec into the container with a debug shell: `kubectl debug -it <pod> --image=busybox --target=<container>`
- Inspect events at node level: `kubectl get events --field-selector involvedObject.name=<pod>`
- Check init containers: Often the culprit is a failing init container that runs before your main container
- Review container entrypoint: The process might exit immediately; test the image locally with `docker run -it <image> sh`
- Check resource limits: OOMKilled containers don't always leave logs; check `kubectl describe pod` for `Last State: OOMKilled` (see the sketch below)
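- Quick OOM check (sketch): The jsonpath queries below assume the standard pod status fields and print the last terminated reason and exit code; empty output means the container has not terminated before, and exit code 137 usually means a SIGKILL from the OOM killer even when the reason field is missing:
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}{"\n"}'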
2. StatefulSet pod won't reattach its PVC after a node crash
How do you recover without recreating storage?
- Check PV status: `kubectl get pv` - look for a `Released` or `Failed` state
- Force-delete the stuck pod: `kubectl delete pod <pod> --force --grace-period=0`
- Check VolumeAttachment objects: `kubectl get volumeattachments` - stale attachments can block reattachment (see the sketch below)
- Delete stale VolumeAttachment: `kubectl delete volumeattachment <name>`
- Verify CSI driver health: Check the CSI driver pods in the `kube-system` namespace
- Last resort - patch the PV: Remove the `claimRef` to allow rebinding: `kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'`
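- Spotting the stale attachment (sketch): A custom-columns query over the standard VolumeAttachment fields shows which node each attachment still points at; an entry for the crashed node is the likely blocker:
kubectl get volumeattachments -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached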
3. Pods are Pending, Cluster Autoscaler won't scale up
Top 3 debugging steps:
- Step 1 - Check Autoscaler logs: `kubectl logs -n kube-system -l app=cluster-autoscaler` - look for "scale up" decisions and why they were rejected (see the sketch below)
- Step 2 - Verify node group constraints: Check if the max nodes limit is reached, if the required instance type is available in your region, and if there are sufficient IAM permissions
- Step 3 - Examine pod scheduling requirements: Check for nodeSelector, affinity rules, or taints that no node group can satisfy. Run `kubectl describe pod` and look at the Events section for scheduling failures
- Common culprits: PodDisruptionBudgets blocking scale-down (which blocks scale-up cycles), pods with local storage that can't be rescheduled, and insufficient quota at the cloud provider
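- Status ConfigMap (sketch): Many Cluster Autoscaler installs also publish a status ConfigMap with per-node-group health and backoff reasons; the name below is the common default and may differ in your deployment:
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml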
4. NetworkPolicy blocks cross-namespace traffic
How do you design least-privilege rules and test them safely?
- Start with audit mode: Use tools like Cilium's policy audit mode to see what would be blocked before enforcing
- Design pattern - Default deny + explicit allow:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
- Allow cross-namespace traffic explicitly:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
      podSelector:
        matchLabels:
          app: web
- Testing safely: Deploy a debug pod and use `nc -zv <service> <port>` to test connectivity before and after policy changes (see the sketch below)
- Use policy visualization: Tools like Cilium Hubble or Calico's policy board help visualize traffic flows
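- Connectivity test (sketch): The service name, port, and image below are placeholders (any image with a full nc/ncat works); run it once with the allowed label and once without to confirm both the allow and the deny path:
# Pod carrying the allowed label - should connect once the policy is in place
kubectl run nettest -n frontend --rm -it --restart=Never --labels="app=web" \
  --image=nicolaka/netshoot -- nc -zv api.backend.svc.cluster.local 8080
# Pod without the label - should be blocked
kubectl run nettest-deny -n frontend --rm -it --restart=Never \
  --image=nicolaka/netshoot -- nc -zv api.backend.svc.cluster.local 8080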
5. Service must connect to external DB via VPN inside the cluster
How do you architect it for HA + security?
- Option 1 - Sidecar VPN container: Run a VPN client as a sidecar in pods that need DB access. Good for isolation but duplicates connections.
- Option 2 - Dedicated VPN gateway pods: Deploy a DaemonSet or Deployment of VPN gateway pods with a Service. Route traffic through these gateways using NetworkPolicy.
- Option 3 - Node-level VPN (recommended for HA):
# Use a DaemonSet with hostNetwork: true
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vpn-gateway
spec:
  selector:
    matchLabels:
      app: vpn-gateway
  template:
    metadata:
      labels:
        app: vpn-gateway   # required so the selector matches the pods
    spec:
      hostNetwork: true
      containers:
      - name: vpn
        image: your-vpn-image
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]
- Security considerations: Use Kubernetes Secrets for VPN credentials, implement NetworkPolicy to restrict which pods can reach the VPN gateway, enable mTLS if supported by your VPN solution
- HA architecture: Run multiple VPN pods across availability zones, use a headless Service with client-side load balancing, implement health checks that verify actual DB connectivity
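- Headless Service for the gateway (sketch): Pairs with the client-side load balancing above; the port assumes OpenVPN's default 1194/UDP and should match your VPN solution:
apiVersion: v1
kind: Service
metadata:
  name: vpn-gateway
spec:
  clusterIP: None          # headless - clients resolve individual gateway pod IPs
  selector:
    app: vpn-gateway
  ports:
  - name: vpn
    port: 1194
    protocol: UDP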
🧱 Security + Architecture
6. Running a multi-tenant EKS cluster
How do you isolate workloads with RBAC, quotas, and network segmentation?
- Namespace isolation: One namespace per tenant with strict RBAC
# Role limited to tenant namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: tenant-a
  name: tenant-role
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments", "services"]
  verbs: ["get", "list", "create", "update", "delete"]
- Resource quotas per tenant:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
- Network segmentation: Default-deny NetworkPolicy per namespace, explicit allow rules only for necessary cross-tenant communication
- Additional hardening: Use PodSecurityAdmission to enforce restricted policies, implement OPA/Gatekeeper for custom policies, consider dedicated node pools per tenant for sensitive workloads
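- Bind the tenant Role (sketch): The Role above has no effect until it is bound; the group name here is a hypothetical IAM-mapped group:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-binding
  namespace: tenant-a
subjects:
- kind: Group
  name: tenant-a-devs        # hypothetical group mapped via aws-auth / EKS access entries
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-role
  apiGroup: rbac.authorization.k8s.io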
7. Kubelet keeps restarting on one node
Where do you look first – systemd, container runtime, or cgroups?
- Step 1 - Check systemd status: `systemctl status kubelet` and `journalctl -u kubelet -f`
- Step 2 - Look for OOM kills: `dmesg | grep -i "oom\|killed"` - kubelet might be getting OOM killed
- Step 3 - Verify container runtime: `systemctl status containerd` (or docker), then `crictl ps` to check if the runtime is responsive
- Step 4 - Check cgroup configuration: Ensure the cgroup driver matches between kubelet and the container runtime (both should use `systemd` or `cgroupfs`; see the sketch below)
- Step 5 - Disk pressure: `df -h` on `/var/lib/kubelet` and `/var/lib/containerd`
- Common causes: Certificate expiration, API server connectivity issues, corrupted kubelet state (try removing `/var/lib/kubelet/cpu_manager_state`)
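- Comparing cgroup drivers (sketch): The paths below are the usual defaults for a kubeadm-style node running containerd and may differ on your distro:
# Kubelet side - should report cgroupDriver: systemd (or cgroupfs)
grep -i cgroupDriver /var/lib/kubelet/config.yaml
# Containerd side - SystemdCgroup must agree with the kubelet setting
grep -i SystemdCgroup /etc/containerd/config.toml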
8. Critical pod got evicted due to node pressure
Explain QoS classes and eviction policies:
- QoS Classes (eviction order):
- BestEffort (evicted first): No requests or limits set
- Burstable (evicted second): Requests < Limits, or only some containers have limits
- Guaranteed (evicted last): Requests = Limits for ALL containers
# Guaranteed QoS example
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "500m"
- Protect critical pods: Use a PriorityClass with `preemptionPolicy: Never` or a high priority value
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workload
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Critical workloads that should not be evicted"
- Eviction thresholds: Configure kubelet flags like `--eviction-hard` and `--eviction-soft` to set memory/disk pressure thresholds (see the sketch below)
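- Config-file equivalent (sketch): The same thresholds can live in the kubelet configuration file; the values below are illustrative, not recommendations:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"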
9. A rolling update caused downtime
What went wrong in your readiness/startup probe or deployment config?
- Common culprits:
- Missing readiness probe: Traffic routed before app is ready
- Aggressive probe settings: `initialDelaySeconds` too short for slow-starting apps
- Wrong endpoint: Probe hitting a heavy endpoint that times out
# Proper probe configuration
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
startupProbe: # For slow-starting apps
  httpGet:
    path: /health/started
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
- Deployment settings to review:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # Never remove healthy pods
      maxSurge: 1 # Add one new pod at a time
  minReadySeconds: 30 # Wait before marking as available
- Don't forget preStop hooks: Allow graceful shutdown before SIGTERM
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]
10. Ingress Controller fails under load
How do you debug and scale routing efficiently?
- Debug steps:
- Check controller logs: `kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx`
- Monitor connection metrics: Look for `nginx_ingress_controller_nginx_process_connections`
- Check for config reload storms: Frequent Ingress changes cause reloads
- Scaling strategies:
- Horizontal scaling: Increase replicas and use pod anti-affinity for distribution
- Tune worker processes: Set `worker-processes: "auto"` in the ConfigMap
- Enable keep-alive: Reduce connection overhead with upstream keep-alive
# Ingress ConfigMap tuning
data:
  worker-processes: "auto"
  max-worker-connections: "65536"
  upstream-keepalive-connections: "200"
  keep-alive: "75"
- Consider splitting: Use separate Ingress controllers for different traffic classes (internal vs external)
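- Autoscaling the controller (sketch): An illustrative HPA for the horizontal-scaling point; the names assume a standard ingress-nginx install and a working metrics pipeline:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress-nginx-controller
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70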
⚙️ Performance + Reliability
11. Istio sidecar consumes more CPU than your app
How do you profile and optimize mesh performance?
- Profile first: Use `istioctl proxy-config` to inspect Envoy configuration
- Check configuration size: `istioctl proxy-status` shows config sync status
- Optimization strategies:
- Limit sidecar scope:
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: my-app
spec:
  egress:
  - hosts:
    - "./*" # Only local namespace
    - "istio-system/*" # Control plane
- Tune resource limits:
# In IstioOperator
spec:
  values:
    global:
      proxy:
        resources:
          requests:
            cpu: 50m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
- Disable unnecessary features: Turn off access logging, reduce tracing sampling rate
- Consider sidecar-less: Istio ambient mesh removes sidecars for L4 traffic
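- Trimming telemetry (sketch): One way to act on the "disable unnecessary features" point via IstioOperator; field names and defaults vary across Istio versions, so treat this as illustrative:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: ""        # disable Envoy access logging
    defaultConfig:
      tracing:
        sampling: 1.0        # percentage of requests sampled - keep it low in production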
12. etcd is slowing down control plane ops
Root causes + how do you tune it safely?
- Common root causes:
- Disk I/O latency (etcd is very sensitive to disk performance)
- Too many objects (secrets, configmaps, events)
- Network latency between etcd members
- Large object sizes (secrets with large data)
- Diagnosis:
# Check etcd metrics
etcdctl endpoint status --write-out=table
etcdctl endpoint health
# Key metrics to watch
- etcd_disk_wal_fsync_duration_seconds (should be < 10ms)
- etcd_disk_backend_commit_duration_seconds
- etcd_server_slow_apply_total
- Safe tuning:
- Use SSD/NVMe storage (IOPS > 3000, latency < 1ms)
- Increase `--quota-backend-bytes` if hitting the space quota
- Enable auto-compaction: `--auto-compaction-retention=1`
- Regular defragmentation during maintenance windows (see the sketch below)
- Reduce event TTL to limit event accumulation
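- Defragmentation run (sketch): One member at a time, during a maintenance window; endpoints and TLS flags depend on your setup:
etcdctl --endpoints=https://<member>:2379 defrag
# If the space quota was exceeded, clear the NOSPACE alarm afterwards
etcdctl alarm list
etcdctl alarm disarm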
13. You must enforce images from a trusted internal registry only
Gatekeeper, Kyverno, or custom Admission Webhook – what's your move?
- Recommendation: Kyverno for simplicity, Gatekeeper for complex policies
- Kyverno approach (simpler):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
  - name: validate-registries
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Images must come from internal registry"
      pattern:
        spec:
          containers:
          - image: "registry.internal.com/*"
          =(initContainers):   # conditional anchor: only checked when initContainers exist
          - image: "registry.internal.com/*"
- Gatekeeper approach (more flexible):
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: repo-is-internal
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    repos:
    - "registry.internal.com/"
- Custom webhook: Only if you need integration with external systems or custom logic that policy engines can't handle
- Pro tip: Also implement image signature verification with Cosign + Kyverno for supply chain security
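- Signature verification (sketch): Building on the pro tip above, a Kyverno verifyImages rule; the registry and key are placeholders and the attestor layout differs between Kyverno versions:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-cosign-signature
    match:
      any:
      - resources:
          kinds:
          - Pod
    verifyImages:
    - imageReferences:
      - "registry.internal.com/*"
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              <your-cosign-public-key>
              -----END PUBLIC KEY-----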
14. Pods stuck in ContainerCreating forever
CNI attach delay? OverlayFS corruption? Walk me through your root-cause process:
- Step 1 - Describe the pod: `kubectl describe pod <pod>` - check the Events section for specific errors
- Step 2 - Check CNI:
# Check CNI plugin pods
kubectl get pods -n kube-system -l k8s-app=calico-node # or your CNI
# Check CNI logs
kubectl logs -n kube-system -l k8s-app=calico-node
# Verify CNI config exists
ls /etc/cni/net.d/
- Step 3 - Check container runtime:
# Containerd
crictl ps
crictl logs <container-id>
journalctl -u containerd
# Check for stuck sandbox containers
crictl pods | grep NotReady
- Step 4 - Check storage:
# Overlay filesystem issues
df -h /var/lib/containerd
mount | grep overlay
# Clean up if corrupted
crictl rmi --prune
- Step 5 - Check for resource exhaustion: IP addresses exhausted in CNI subnet, too many pods on node, inotify watch limit reached
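- Exhaustion quick checks (sketch): the node name below is a placeholder:
# inotify limits - too low a value breaks kubelet/CNI file watches
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
# Pod capacity vs. what's already scheduled on the node
kubectl get node <node> -o jsonpath='{.status.allocatable.pods}{"\n"}'
kubectl describe node <node> | grep -A8 "Non-terminated Pods"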
15. Random DNS failures in Pods
How do you debug CoreDNS, kube-proxy, and conntrack interactions?
- Step 1 - Test DNS from pod:
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes
kubectl run -it --rm debug --image=busybox -- nslookup google.com
- Step 2 - Check CoreDNS:
# CoreDNS pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check CoreDNS metrics
kubectl top pods -n kube-system -l k8s-app=kube-dns
- Step 3 - Check kube-proxy:
# Verify kube-proxy mode
kubectl get cm -n kube-system kube-proxy -o yaml | grep mode
# Check iptables/ipvs rules
iptables -t nat -L | grep coredns
ipvsadm -ln | grep <coredns-ip>
- Step 4 - Conntrack issues (common culprit):
# Check conntrack table
conntrack -S
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Increase if hitting limits
sysctl -w net.netfilter.nf_conntrack_max=524288
- Step 5 - The race condition fix: Add resolver options to the pod spec to work around the conntrack race triggered by parallel A/AAAA lookups over the same UDP socket:
dnsConfig:
  options:
  - name: single-request-reopen
  - name: ndots
    value: "2"
- Pro tip: Use NodeLocal DNSCache to reduce CoreDNS load and avoid conntrack issues for DNS traffic
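- Quantify the flakiness (sketch): A throwaway busybox pod running repeated lookups gives you a failure rate to compare before and after any fix:
# 50 sequential lookups; any "fail" lines indicate intermittent resolution problems
kubectl run dnstest --rm -it --restart=Never --image=busybox -- sh -c \
  'for i in $(seq 1 50); do nslookup kubernetes.default.svc.cluster.local >/dev/null 2>&1 || echo "fail $i"; done; echo finished'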
Summary
These 15 scenarios represent the reality of running Kubernetes in production. The key takeaways:
- Always gather data first: Events, logs, metrics, and node status before making changes
- Layer your defenses: RBAC + NetworkPolicy + Resource Quotas + Pod Security
- Design for failure: Proper QoS classes, priority classes, and PodDisruptionBudgets
- Monitor the control plane: etcd and kubelet health are critical
- Test before enforce: Audit mode for policies, canary deployments for changes
Production Kubernetes requires a systematic debugging approach and deep understanding of the underlying components. Master these scenarios, and you'll be ready for whatever your clusters throw at you.