Kubernetes in Real Life: 15 Critical Scenarios and Solutions
Kubernetes looks elegant in tutorials, but production tells a different story. Here are 15 real-world scenarios that separate junior operators from battle-tested engineers—with actionable solutions for each.
⚔️ Deep Dive Debugging
1. Pod stuck in CrashLoopBackOff, no logs, no errors
How do you debug beyond kubectl logs and describe?
- Check previous container logs: `kubectl logs <pod> --previous` - the container might crash before writing logs
- Exec into the container with a debug shell: `kubectl debug -it <pod> --image=busybox --target=<container>`
- Inspect events at node level: `kubectl get events --field-selector involvedObject.name=<pod>`
- Check init containers: Often the culprit is a failing init container that runs before your main container
- Review container entrypoint: The process might exit immediately; test the image locally with `docker run -it <image> sh`
- Check resource limits: OOMKilled containers don't always leave logs; check `kubectl describe pod` for `Last State: OOMKilled` (see the sketch below)
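- Quick OOM check (sketch): The jsonpath queries below assume the standard pod status fields and print the last terminated reason and exit code; empty output means the container has not terminated before, and exit code 137 usually means a SIGKILL from the OOM killer even when the reason field is missing:
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}'
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}{"\n"}'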
2. StatefulSet pod won't reattach its PVC after a node crash
How do you recover without recreating storage?
- Check PV status: `kubectl get pv` - look for a `Released` or `Failed` state
- Force-delete the stuck pod: `kubectl delete pod <pod> --force --grace-period=0`
- Check VolumeAttachment objects: `kubectl get volumeattachments` - stale attachments can block reattachment (see the sketch below)
- Delete stale VolumeAttachment: `kubectl delete volumeattachment <name>`
- Verify CSI driver health: Check the CSI driver pods in the `kube-system` namespace
- Last resort - patch the PV: Remove the `claimRef` to allow rebinding: `kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'`
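- Spotting the stale attachment (sketch): A custom-columns query over the standard VolumeAttachment fields shows which node each attachment still points at; an entry for the crashed node is the likely blocker:
kubectl get volumeattachments -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached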
3. Pods are Pending, Cluster Autoscaler won't scale up
Top 3 debugging steps:
- Step 1 - Check Autoscaler logs: `kubectl logs -n kube-system -l app=cluster-autoscaler` - look for "scale up" decisions and why they were rejected (see the sketch below)
- Step 2 - Verify node group constraints: Check if the max nodes limit is reached, if the required instance type is available in your region, and if there are sufficient IAM permissions
- Step 3 - Examine pod scheduling requirements: Check for nodeSelector, affinity rules, or taints that no node group can satisfy. Run `kubectl describe pod` and look at the Events section for scheduling failures
- Common culprits: PodDisruptionBudgets blocking scale-down (which blocks scale-up cycles), pods with local storage that can't be rescheduled, and insufficient quota at the cloud provider
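- Status ConfigMap (sketch): Many Cluster Autoscaler installs also publish a status ConfigMap with per-node-group health and backoff reasons; the name below is the common default and may differ in your deployment:
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml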
4. NetworkPolicy blocks cross-namespace traffic
How do you design least-privilege rules and test them safely?
- Start with audit mode: Use tools like Cilium's policy audit mode to see what would be blocked before enforcing
- Design pattern - Default deny + explicit allow:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
- Allow cross-namespace traffic explicitly:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
      podSelector:
        matchLabels:
          app: web
- Testing safely: Deploy a debug pod and use `nc -zv <service> <port>` to test connectivity before and after policy changes (see the sketch below)
- Use policy visualization: Tools like Cilium Hubble or Calico's policy board help visualize traffic flows
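- Connectivity test (sketch): The service name, port, and image below are placeholders (any image with a full nc/ncat works); run it once with the allowed label and once without to confirm both the allow and the deny path:
# Pod carrying the allowed label - should connect once the policy is in place
kubectl run nettest -n frontend --rm -it --restart=Never --labels="app=web" \
  --image=nicolaka/netshoot -- nc -zv api.backend.svc.cluster.local 8080
# Pod without the label - should be blocked
kubectl run nettest-deny -n frontend --rm -it --restart=Never \
  --image=nicolaka/netshoot -- nc -zv api.backend.svc.cluster.local 8080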
5. Service must connect to external DB via VPN inside the cluster
How do you architect it for HA + security?
- Option 1 - Sidecar VPN container: Run a VPN client as a sidecar in pods that need DB access. Good for isolation but duplicates connections.
- Option 2 - Dedicated VPN gateway pods: Deploy a DaemonSet or Deployment of VPN gateway pods with a Service. Route traffic through these gateways using NetworkPolicy.
- Option 3 - Node-level VPN (recommended for HA):
# Use a DaemonSet with hostNetwork: true
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vpn-gateway
spec:
  selector:
    matchLabels:
      app: vpn-gateway
  template:
    metadata:
      labels:
        app: vpn-gateway   # required so the selector matches the pods
    spec:
      hostNetwork: true
      containers:
      - name: vpn
        image: your-vpn-image
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]
- Security considerations: Use Kubernetes Secrets for VPN credentials, implement NetworkPolicy to restrict which pods can reach the VPN gateway, enable mTLS if supported by your VPN solution
- HA architecture: Run multiple VPN pods across availability zones, use a headless Service with client-side load balancing, implement health checks that verify actual DB connectivity
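- Headless Service for the gateway (sketch): Pairs with the client-side load balancing above; the port assumes OpenVPN's default 1194/UDP and should match your VPN solution:
apiVersion: v1
kind: Service
metadata:
  name: vpn-gateway
spec:
  clusterIP: None          # headless - clients resolve individual gateway pod IPs
  selector:
    app: vpn-gateway
  ports:
  - name: vpn
    port: 1194
    protocol: UDP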
🧱 Security + Architecture
6. Running a multi-tenant EKS cluster
How do you isolate workloads with RBAC, quotas, and network segmentation?
- Namespace isolation: One namespace per tenant with strict RBAC
# Role limited to tenant namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: tenant-a
  name: tenant-role
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments", "services"]
  verbs: ["get", "list", "create", "update", "delete"]
- Resource quotas per tenant:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
- Network segmentation: Default-deny NetworkPolicy per namespace, explicit allow rules only for necessary cross-tenant communication
- Additional hardening: Use PodSecurityAdmission to enforce restricted policies, implement OPA/Gatekeeper for custom policies, consider dedicated node pools per tenant for sensitive workloads
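- Bind the tenant Role (sketch): The Role above has no effect until it is bound; the group name here is a hypothetical IAM-mapped group:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-binding
  namespace: tenant-a
subjects:
- kind: Group
  name: tenant-a-devs        # hypothetical group mapped via aws-auth / EKS access entries
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-role
  apiGroup: rbac.authorization.k8s.io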
7. Kubelet keeps restarting on one node
Where do you look first – systemd, container runtime, or cgroups?
- Step 1 - Check systemd status: `systemctl status kubelet` and `journalctl -u kubelet -f`
- Step 2 - Look for OOM kills: `dmesg | grep -i "oom\|killed"` - kubelet might be getting OOM killed
- Step 3 - Verify container runtime: `systemctl status containerd` (or docker), then `crictl ps` to check if the runtime is responsive
- Step 4 - Check cgroup configuration: Ensure the cgroup driver matches between kubelet and the container runtime (both should use `systemd` or `cgroupfs`; see the sketch below)
- Step 5 - Disk pressure: `df -h` on `/var/lib/kubelet` and `/var/lib/containerd`
- Common causes: Certificate expiration, API server connectivity issues, corrupted kubelet state (try removing `/var/lib/kubelet/cpu_manager_state`)
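- Comparing cgroup drivers (sketch): The paths below are the usual defaults for a kubeadm-style node running containerd and may differ on your distro:
# Kubelet side - should report cgroupDriver: systemd (or cgroupfs)
grep -i cgroupDriver /var/lib/kubelet/config.yaml
# Containerd side - SystemdCgroup must agree with the kubelet setting
grep -i SystemdCgroup /etc/containerd/config.toml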
8. Critical pod got evicted due to node pressure
Explain QoS classes and eviction policies:
- QoS Classes (eviction order):
- BestEffort (evicted first): No requests or limits set
- Burstable (evicted second): Requests < Limits, or only some containers have limits
- Guaranteed (evicted last): Requests = Limits for ALL containers
# Guaranteed QoS example
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "500m"
- Protect critical pods: Use a PriorityClass with `preemptionPolicy: Never` or a high priority value
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workload
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Critical workloads that should not be evicted"
- Eviction thresholds: Configure kubelet flags like `--eviction-hard` and `--eviction-soft` to set memory/disk pressure thresholds (see the sketch below)
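- Config-file equivalent (sketch): The same thresholds can live in the kubelet configuration file; the values below are illustrative, not recommendations:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"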
9. A rolling update caused downtime
What went wrong in your readiness/startup probe or deployment config?
- Common culprits:
- Missing readiness probe: Traffic routed before app is ready
- Aggressive probe settings: `initialDelaySeconds` too short for slow-starting apps
- Wrong endpoint: Probe hitting a heavy endpoint that times out
# Proper probe configuration
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
startupProbe: # For slow-starting apps
  httpGet:
    path: /health/started
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
- Deployment settings to review:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # Never remove healthy pods
      maxSurge: 1 # Add one new pod at a time
  minReadySeconds: 30 # Wait before marking as available
- Don't forget preStop hooks: Allow graceful shutdown before SIGTERM
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]
10. Ingress Controller fails under load
How do you debug and scale routing efficiently?
- Debug steps:
- Check controller logs: `kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx`
- Monitor connection metrics: Look for `nginx_ingress_controller_nginx_process_connections`
- Check for config reload storms: Frequent Ingress changes cause reloads
- Scaling strategies:
- Horizontal scaling: Increase replicas and use pod anti-affinity for distribution
- Tune worker processes: Set `worker-processes: "auto"` in the ConfigMap
- Enable keep-alive: Reduce connection overhead with upstream keep-alive
# Ingress ConfigMap tuning
data:
  worker-processes: "auto"
  max-worker-connections: "65536"
  upstream-keepalive-connections: "200"
  keep-alive: "75"
- Consider splitting: Use separate Ingress controllers for different traffic classes (internal vs external)
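- Autoscaling the controller (sketch): An illustrative HPA for the horizontal-scaling point; the names assume a standard ingress-nginx install and a working metrics pipeline:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress-nginx-controller
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70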
⚙️ Performance + Reliability
11. Istio sidecar consumes more CPU than your app
How do you profile and optimize mesh performance?
- Profile first: Use `istioctl proxy-config` to inspect Envoy configuration
- Check configuration size: `istioctl proxy-status` shows config sync status
- Optimization strategies:
- Limit sidecar scope:
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: my-app
spec:
  egress:
  - hosts:
    - "./*" # Only local namespace
    - "istio-system/*" # Control plane
- Tune resource limits:
# In IstioOperator
spec:
  values:
    global:
      proxy:
        resources:
          requests:
            cpu: 50m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
- Disable unnecessary features: Turn off access logging, reduce tracing sampling rate
- Consider sidecar-less: Istio ambient mesh removes sidecars for L4 traffic
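- Trimming telemetry (sketch): One way to act on the "disable unnecessary features" point via IstioOperator; field names and defaults vary across Istio versions, so treat this as illustrative:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: ""        # disable Envoy access logging
    defaultConfig:
      tracing:
        sampling: 1.0        # percentage of requests sampled - keep it low in production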
12. etcd is slowing down control plane ops
Root causes + how do you tune it safely?
- Common root causes:
- Disk I/O latency (etcd is very sensitive to disk performance)
- Too many objects (secrets, configmaps, events)
- Network latency between etcd members
- Large object sizes (secrets with large data)
- Diagnosis:
# Check etcd metrics
etcdctl endpoint status --write-out=table
etcdctl endpoint health
# Key metrics to watch
- etcd_disk_wal_fsync_duration_seconds (should be < 10ms)
- etcd_disk_backend_commit_duration_seconds
- etcd_server_slow_apply_total
- Safe tuning:
- Use SSD/NVMe storage (IOPS > 3000, latency < 1ms)
- Increase `--quota-backend-bytes` if hitting the space quota
- Enable auto-compaction: `--auto-compaction-retention=1`
- Regular defragmentation during maintenance windows (see the sketch below)
- Reduce event TTL to limit event accumulation
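- Defragmentation run (sketch): One member at a time, during a maintenance window; endpoints and TLS flags depend on your setup:
etcdctl --endpoints=https://<member>:2379 defrag
# If the space quota was exceeded, clear the NOSPACE alarm afterwards
etcdctl alarm list
etcdctl alarm disarm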
13. You must enforce images from a trusted internal registry only
Gatekeeper, Kyverno, or custom Admission Webhook – what's your move?
- Recommendation: Kyverno for simplicity, Gatekeeper for complex policies
- Kyverno approach (simpler):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
  - name: validate-registries
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Images must come from internal registry"
      pattern:
        spec:
          containers:
          - image: "registry.internal.com/*"
          =(initContainers):   # conditional anchor: only checked when initContainers exist
          - image: "registry.internal.com/*"
- Gatekeeper approach (more flexible):
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: repo-is-internal
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    repos:
    - "registry.internal.com/"
- Custom webhook: Only if you need integration with external systems or custom logic that policy engines can't handle
- Pro tip: Also implement image signature verification with Cosign + Kyverno for supply chain security
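- Signature verification (sketch): Building on the pro tip above, a Kyverno verifyImages rule; the registry and key are placeholders and the attestor layout differs between Kyverno versions:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-cosign-signature
    match:
      any:
      - resources:
          kinds:
          - Pod
    verifyImages:
    - imageReferences:
      - "registry.internal.com/*"
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              <your-cosign-public-key>
              -----END PUBLIC KEY-----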
14. Pods stuck in ContainerCreating forever
CNI attach delay? OverlayFS corruption? Walk me through your root-cause process:
- Step 1 - Describe the pod: `kubectl describe pod <pod>` - check the Events section for specific errors
- Step 2 - Check CNI:
# Check CNI plugin pods
kubectl get pods -n kube-system -l k8s-app=calico-node # or your CNI
# Check CNI logs
kubectl logs -n kube-system -l k8s-app=calico-node
# Verify CNI config exists
ls /etc/cni/net.d/
- Step 3 - Check container runtime:
# Containerd
crictl ps
crictl logs <container-id>
journalctl -u containerd
# Check for stuck sandbox containers
crictl pods | grep NotReady
- Step 4 - Check storage:
# Overlay filesystem issues
df -h /var/lib/containerd
mount | grep overlay
# Clean up if corrupted
crictl rmi --prune
- Step 5 - Check for resource exhaustion: IP addresses exhausted in CNI subnet, too many pods on node, inotify watch limit reached
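- Exhaustion quick checks (sketch): the node name below is a placeholder:
# inotify limits - too low a value breaks kubelet/CNI file watches
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
# Pod capacity vs. what's already scheduled on the node
kubectl get node <node> -o jsonpath='{.status.allocatable.pods}{"\n"}'
kubectl describe node <node> | grep -A8 "Non-terminated Pods"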
15. Random DNS failures in Pods
How do you debug CoreDNS, kube-proxy, and conntrack interactions?
- Step 1 - Test DNS from pod:
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes
kubectl run -it --rm debug --image=busybox -- nslookup google.com
- Step 2 - Check CoreDNS:
# CoreDNS pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check CoreDNS metrics
kubectl top pods -n kube-system -l k8s-app=kube-dns
- Step 3 - Check kube-proxy:
# Verify kube-proxy mode
kubectl get cm -n kube-system kube-proxy -o yaml | grep mode
# Check iptables/ipvs rules
iptables -t nat -L | grep coredns
ipvsadm -ln | grep <coredns-ip>
- Step 4 - Conntrack issues (common culprit):
# Check conntrack table
conntrack -S
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Increase if hitting limits
sysctl -w net.netfilter.nf_conntrack_max=524288
- Step 5 - The race condition fix: Add resolver options to the pod spec to work around the conntrack race triggered by parallel A/AAAA lookups over the same UDP socket:
dnsConfig:
  options:
  - name: single-request-reopen
  - name: ndots
    value: "2"
- Pro tip: Use NodeLocal DNSCache to reduce CoreDNS load and avoid conntrack issues for DNS traffic
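- Quantify the flakiness (sketch): A throwaway busybox pod running repeated lookups gives you a failure rate to compare before and after any fix:
# 50 sequential lookups; any "fail" lines indicate intermittent resolution problems
kubectl run dnstest --rm -it --restart=Never --image=busybox -- sh -c \
  'for i in $(seq 1 50); do nslookup kubernetes.default.svc.cluster.local >/dev/null 2>&1 || echo "fail $i"; done; echo finished'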
Summary
These 15 scenarios represent the reality of running Kubernetes in production. The key takeaways:
- Always gather data first: Events, logs, metrics, and node status before making changes
- Layer your defenses: RBAC + NetworkPolicy + Resource Quotas + Pod Security
- Design for failure: Proper QoS classes, priority classes, and PodDisruptionBudgets
- Monitor the control plane: etcd and kubelet health are critical
- Test before enforce: Audit mode for policies, canary deployments for changes
Production Kubernetes requires a systematic debugging approach and deep understanding of the underlying components. Master these scenarios, and you'll be ready for whatever your clusters throw at you.