Ingress Controller Single Point of Failure
We had HA for every service. Except the one thing that routes traffic to all of them.
The setup:
- 20 services with 3+ replicas each
- Multi-AZ deployment
- Pod disruption budgets
- NGINX Ingress Controller: 1 replica
The incident:
- Ingress controller pod OOMKilled
- 30 seconds to reschedule
- All external traffic: 502 Bad Gateway
- Every. Single. Service. Affected.
Why just 1 replica?
- "It's just infrastructure, it never fails"
- Default Helm chart value: 1
- Nobody changed it
The fix:
controller:
replicaCount: 3
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: "kubernetes.io/hostname"
Also:
- PodDisruptionBudget with minAvailable: 2
- Proper resource requests/limits
- HPA for traffic spikes
Lesson: Your ingress controller IS your availability. Treat it that way.