Kubernetes in Real Life: 15 Critical Scenarios Every Engineer Faces
Kubernetes tutorials make everything look elegant - deploy a YAML, and magic happens. But production? That is a different story. After years of managing clusters for enterprises, we have collected the 15 scenarios that truly separate beginners from battle-tested engineers. Here is what you will actually face, and how to solve it.
Debugging Nightmares
1. Pod Stuck in CrashLoopBackOff with No Logs
This is the silent killer of Kubernetes debugging. Your pod keeps restarting, but there is nothing in the logs. Here is where to look:
- Check the previous container's logs - the current restart may not have produced output yet, but kubectl logs --previous shows the last crashed instance
- Look at init containers - they often fail silently before your main container even starts
- Check for OOMKilled - memory-killed containers do not always leave logs behind
- Test the image locally - sometimes the entrypoint exits immediately due to missing env vars or configs
- Use ephemeral debug containers - attach a busybox shell with kubectl debug to inspect the pod environment
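A minimal debugging sequence covering the checks above; the pod and container names are placeholders:

```bash
# Logs from the previous (crashed) container instance
kubectl logs my-pod --previous --container my-app

# Events, exit codes, and OOMKilled reasons show up in the description
kubectl describe pod my-pod

# Structured view of the last termination - look for reason: OOMKilled
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Init container logs often hold the real failure
kubectl logs my-pod -c my-init-container

# Ephemeral debug container (enabled by default since Kubernetes 1.23)
kubectl debug -it my-pod --image=busybox --target=my-app
```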
2. StatefulSet Pod Will Not Reattach Its Storage After Node Crash
Your node went down, and now the pod is stuck waiting for its PVC. The data is safe, but Kubernetes cannot reconnect it. Common fixes:
- Force-delete the stuck pod - sometimes Kubernetes needs a push to let go
- Check VolumeAttachment objects - stale attachments from the dead node can block new ones
- Verify CSI driver health - the storage driver pods in kube-system might need attention
- Last resort: patch the PV - remove the claimRef to allow rebinding
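A sketch of that sequence; all names are placeholders, and the claimRef patch really is a last resort because it detaches the PV from its original claim:

```bash
# Force-delete the pod stuck on the dead node
kubectl delete pod web-0 --grace-period=0 --force

# Find and remove stale attachments still pointing at the crashed node
kubectl get volumeattachments | grep <dead-node-name>
kubectl delete volumeattachment <stale-attachment-name>

# Check the CSI driver pods (namespace and labels depend on your driver)
kubectl -n kube-system get pods | grep csi

# Last resort: clear the claimRef so the PV can be bound again
kubectl patch pv <pv-name> -p '{"spec":{"claimRef":null}}'
```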
3. Pods Pending, But Cluster Autoscaler Will Not Scale Up
You have pending pods, available budget, but no new nodes. Here is the debugging checklist:
- Check autoscaler logs - they will tell you exactly why it rejected the scale-up
- Verify node group limits - you might have hit max nodes without realizing it
- Examine pod requirements - nodeSelector, affinity rules, or taints that no node group can satisfy
- Look for PodDisruptionBudgets - overly strict PDBs mainly block scale-down and rebalancing, which can stall the autoscaler's overall reconciliation loop
- Check cloud quotas - your provider might be out of capacity in that region
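Commands that usually reveal the reason; the autoscaler label and status ConfigMap name below are common defaults and depend on how it was installed:

```bash
# Scheduler events explain exactly why the pod is unschedulable
kubectl describe pod <pending-pod>

# The autoscaler logs state why each node group was rejected for scale-up
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=200

# The status ConfigMap summarizes node group sizes, health, and limits
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
```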
4. Pods Stuck in ContainerCreating Forever
The pod was scheduled, but it never starts. This is usually a CNI or storage issue:
- CNI plugin problems - check if Calico, Cilium, or your CNI pods are healthy
- IP address exhaustion - the subnet might be full
- Container runtime issues - containerd might be stuck or overloaded
- Storage mount failures - PVC binding issues or CSI driver problems
- Overlay filesystem corruption - sometimes you need to clean up and restart containerd
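Where to start looking; CNI pod names and runtime service names vary by distribution, so treat the ones below as examples:

```bash
# The Events section usually names the failing step (CNI, volume mount, image pull)
kubectl describe pod <stuck-pod>

# Is the CNI healthy on the node where the pod landed?
kubectl -n kube-system get pods -o wide | grep -E 'calico|cilium|flannel'

# On the node itself: is the container runtime responsive?
systemctl status containerd
crictl pods

# Rough check for IP or pod-capacity exhaustion on the node
kubectl describe node <node-name> | grep -A 6 'Allocatable'
```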
5. Random DNS Failures in Pods
Some requests work, others timeout randomly. DNS issues are notoriously hard to debug:
- Check CoreDNS pods - they might be overloaded or crashing
- Look at conntrack limits - connection tracking table exhaustion causes random failures
- UDP race condition - add single-request-reopen to your dnsConfig
- Consider NodeLocal DNSCache - a per-node cache that avoids the conntrack races and takes most of the load off CoreDNS
- Check ndots setting - high values cause excessive DNS queries
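Two of those fixes are plain pod-spec settings; a minimal sketch, with an illustrative image and values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
  dnsConfig:
    options:
      - name: single-request-reopen   # work around the UDP conntrack race
      - name: ndots
        value: "2"                    # fewer search-domain expansions per lookup
```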
Security and Architecture
6. Running a Multi-Tenant Cluster
Multiple teams, one cluster. How do you keep everyone isolated and safe?
- Namespace per tenant - the foundation of isolation
- RBAC roles limited to namespace - teams can only see their own resources
- Resource quotas - prevent any single tenant from consuming all resources
- NetworkPolicy default-deny - no cross-namespace traffic unless explicitly allowed
- PodSecurityAdmission - enforce restricted security policies per namespace
- For sensitive workloads - consider dedicated node pools per tenant
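A minimal per-tenant baseline combining several of these controls, assuming a tenant named team-a (names and quota values are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    tenant: team-a
    pod-security.kubernetes.io/enforce: restricted   # PodSecurityAdmission
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: team-a
spec:
  podSelector: {}                      # every pod in the namespace
  policyTypes: ["Ingress", "Egress"]   # no rules listed = deny everything
```

Namespace-scoped RBAC Roles and RoleBindings for the team complete the baseline.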
7. NetworkPolicy Blocking Cross-Namespace Traffic
You enabled NetworkPolicies and now everything is broken. Here is how to design them right:
- Start with audit mode - tools like Cilium let you see what would be blocked
- Default deny first - then explicitly allow what is needed
- Use namespace labels - allow traffic from namespaces, not individual pods
- Test with netcat - deploy a debug pod and verify connectivity
- Visualize with Hubble - see actual traffic flows in real-time
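Once default-deny is in place, a policy like this readmits traffic from an entire namespace selected by label; the namespace, labels, and port are examples:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend-namespace
  namespace: backend
spec:
  podSelector: {}              # applies to every pod in "backend"
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: frontend   # any namespace labeled team=frontend
      ports:
        - protocol: TCP
          port: 8080
```

A throwaway debug pod in the frontend namespace running nc -zv against the backend service is the quickest way to verify the rule works.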
8. Enforcing Trusted Container Images Only
You need to ensure only approved images from your internal registry can run. Your options:
- Kyverno - simple YAML-based policies, great for most teams
- OPA Gatekeeper - more powerful but steeper learning curve
- Custom Admission Webhook - only if you need external system integration
- Pro tip: Add image signing - use Cosign plus policy engine for supply chain security
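For the Kyverno route, a policy of roughly this shape does the job; the registry hostname is a placeholder, and initContainers and ephemeralContainers need matching rules as well:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: internal-registry-only
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must be pulled from registry.internal.example.com"
        pattern:
          spec:
            containers:
              - image: "registry.internal.example.com/*"
```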
9. Connecting to External Database via VPN
Your app needs to reach a database that is only accessible through VPN. Architecture options:
- Sidecar VPN container - good isolation, but duplicates connections
- Dedicated VPN gateway pods - centralized, easier to manage
- Node-level VPN (recommended) - a DaemonSet with hostNetwork gives every node its own tunnel, so there is no single point of failure
- Security must-haves - store credentials in Secrets, restrict access with NetworkPolicy, enable mTLS
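A skeleton of the node-level option - a DaemonSet running a VPN client on the host network with credentials from a Secret; the image, namespace, and Secret name are placeholders:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vpn-client
  namespace: vpn
spec:
  selector:
    matchLabels: {app: vpn-client}
  template:
    metadata:
      labels: {app: vpn-client}
    spec:
      hostNetwork: true                                # tunnel usable by all pods on the node
      containers:
        - name: vpn
          image: registry.example.com/vpn-client:1.0   # placeholder VPN client image
          securityContext:
            capabilities:
              add: ["NET_ADMIN"]                       # required to create the tunnel interface
          volumeMounts:
            - name: vpn-credentials
              mountPath: /etc/vpn
              readOnly: true
      volumes:
        - name: vpn-credentials
          secret:
            secretName: vpn-credentials
```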
Performance and Reliability
10. Critical Pod Got Evicted Due to Node Pressure
Your most important pod was killed because the node ran out of resources. Understanding QoS classes is key:
- BestEffort (evicted first) - no requests or limits set
- Burstable (evicted second) - requests set but lower than limits, or set on only some containers
- Guaranteed (evicted last) - requests equal limits for all containers
- For critical workloads - always use Guaranteed QoS plus high PriorityClass
- Configure eviction thresholds - tune kubelet memory and disk pressure settings
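What Guaranteed QoS plus a high PriorityClass looks like in a manifest; the class name, image, and sizes are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 1000000
globalDefault: false
---
apiVersion: v1
kind: Pod
metadata:
  name: payments
spec:
  priorityClassName: business-critical
  containers:
    - name: app
      image: registry.example.com/payments:1.0   # placeholder image
      resources:
        requests: {cpu: "1", memory: 2Gi}
        limits:   {cpu: "1", memory: 2Gi}        # requests == limits for every container -> Guaranteed
```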
11. Rolling Update Caused Downtime
You deployed a new version and users saw errors. What went wrong?
- Missing readiness probe - traffic was sent before the app was ready
- Probe too aggressive - initialDelaySeconds too short for slow-starting apps
- Wrong probe endpoint - using a heavy endpoint that times out
- Deployment settings matter - set maxUnavailable to 0, add minReadySeconds
- Do not forget preStop hooks - give pods time for graceful shutdown
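A Deployment fragment that combines those settings; the endpoint path, timings, and image are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  minReadySeconds: 10                # pod must stay Ready this long before counting as available
  strategy:
    rollingUpdate:
      maxUnavailable: 0              # never drop below the desired replica count
      maxSurge: 1
  selector:
    matchLabels: {app: api}
  template:
    metadata:
      labels: {app: api}
    spec:
      containers:
        - name: api
          image: registry.example.com/api:2.0   # placeholder image
          readinessProbe:
            httpGet: {path: /healthz, port: 8080}
            initialDelaySeconds: 15
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "10"]   # give endpoints time to drain before shutdown
```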
12. Ingress Controller Fails Under Load
Traffic spike hit and your ingress became the bottleneck. How to fix it:
- Check for config reload storms - frequent Ingress changes cause performance drops
- Scale horizontally - more replicas with anti-affinity
- Tune worker processes - set worker-processes to auto in the controller ConfigMap so it matches the CPU count
- Enable keep-alive - reduce connection overhead
- Consider splitting - separate controllers for internal vs external traffic
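For ingress-nginx, these knobs live in the controller ConfigMap; the keys below are ingress-nginx options and the values are starting points to verify against your controller version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  worker-processes: "auto"               # one worker per CPU core
  keep-alive: "75"                       # client keep-alive timeout (seconds)
  keep-alive-requests: "1000"            # requests allowed per client connection
  upstream-keepalive-connections: "320"  # pooled connections to upstream pods
```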
13. Istio Sidecar Consumes More CPU Than Your App
Service mesh overhead is eating your resources. Optimization strategies:
- Limit sidecar scope - configure egress to only needed namespaces
- Tune resource limits - sidecars do not need as much as defaults suggest
- Disable unused features - turn off access logging, reduce tracing sampling
- Consider ambient mesh - Istio's sidecar-less mode, which moves L4 handling into a shared node-level proxy
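The single biggest win is usually a namespace-level Sidecar resource that limits egress configuration to what the workloads actually call; the namespaces here are examples:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: payments        # applies to all sidecars in this namespace
spec:
  egress:
    - hosts:
        - "./*"              # services in the same namespace
        - "istio-system/*"   # control plane and shared gateways
```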
14. etcd Slowing Down Control Plane
API server responses are slow, and everything feels laggy. etcd is usually the culprit:
- Disk I/O is critical - etcd needs SSDs with sub-millisecond latency
- Too many objects - secrets, configmaps, and events accumulate over time
- Network latency between members - keep etcd nodes in the same availability zone
- Enable auto-compaction - prevents database from growing indefinitely
- Schedule regular defragmentation - but only during maintenance windows
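Useful checks and maintenance steps, assuming etcdctl v3 with the cluster's client certificates already configured in the environment:

```bash
# DB size, latency, and leader status for every member
etcdctl endpoint status --cluster --write-out=table

# Disk health: the etcd_disk_wal_fsync_duration_seconds histogram on the
# /metrics endpoint should stay in the low single-digit milliseconds

# Bound keyspace history with the etcd server flags
#   --auto-compaction-mode=periodic --auto-compaction-retention=1h

# Defragment one member at a time, during a maintenance window
etcdctl defrag --endpoints=https://<member-ip>:2379
```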
15. Kubelet Keeps Restarting on One Node
One node is misbehaving while others are fine. Where to look:
- Check systemd status - kubelet logs via journalctl
- Look for OOM kills - kubelet itself might be memory-killed
- Verify container runtime - containerd might be unresponsive
- Check cgroup driver - must match between kubelet and runtime
- Disk pressure - /var/lib/kubelet running out of space
- Certificate expiration - often overlooked but common cause
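A node-level checklist run over SSH; the paths below are kubeadm defaults and may differ on managed distributions:

```bash
# Why is kubelet restarting? Exit codes and OOM kills show up here
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago" | grep -iE 'error|oom|fatal'

# Is the container runtime answering?
crictl info > /dev/null && echo "runtime OK"

# Cgroup driver must match between kubelet and containerd
grep -i cgroupDriver /var/lib/kubelet/config.yaml
grep -i SystemdCgroup /etc/containerd/config.toml

# Disk pressure on the kubelet root directory
df -h /var/lib/kubelet

# Kubelet client certificate expiry - the often overlooked cause
openssl x509 -enddate -noout -in /var/lib/kubelet/pki/kubelet-client-current.pem
```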
Key Takeaways
After troubleshooting hundreds of production incidents, here is what we have learned:
- Gather data first - events, logs, metrics, and node status before making changes
- Layer your defenses - RBAC plus NetworkPolicy plus Resource Quotas plus Pod Security
- Design for failure - proper QoS classes, priority classes, and PodDisruptionBudgets
- Monitor the control plane - etcd and kubelet health are critical indicators
- Test before enforce - audit mode for policies, canary deployments for changes
Production Kubernetes is a journey, not a destination. These 15 scenarios are the foundation - master them, and you will be ready for whatever your clusters throw at you.