We had HA for every service. Except the one thing that routes traffic to all of them.

The setup:

  • 20 services with 3+ replicas each
  • Multi-AZ deployment
  • Pod disruption budgets
  • NGINX Ingress Controller: 1 replica

The incident:

  • Ingress controller pod OOMKilled
  • 30 seconds to reschedule
  • All external traffic: 502 Bad Gateway
  • Every. Single. Service. Affected.

Why just 1 replica?

  • "It's just infrastructure, it never fails"
  • Default Helm chart value: 1
  • Nobody changed it

The fix:

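controller:
  replicaCount: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        # Selector assumes the chart's default labels; without one the rule matches no pods.
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
              app.kubernetes.io/component: controller
          topologyKey: "kubernetes.io/hostname"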
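
The hostname key only keeps replicas off the same node; it does not by itself spread them across zones. For a multi-AZ cluster like ours, a zone-level spread constraint could be added alongside it. This is a minimal sketch, assuming the community ingress-nginx chart (which exposes `controller.topologySpreadConstraints`) and its default pod labels; adjust the selector if your release uses different labels.

```yaml
controller:
  topologySpreadConstraints:
    # Spread controller pods evenly across availability zones.
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      # ScheduleAnyway keeps pods schedulable if a zone is unavailable.
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          # Assumed: the chart's default labels.
          app.kubernetes.io/name: ingress-nginx
          app.kubernetes.io/component: controller
```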

Also:

  • PodDisruptionBudget with minAvailable: 2
  • Resource requests/limits sized from observed usage, so the controller isn't OOMKilled under normal load
  • HPA for traffic spikes
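
As a rough sketch, the three items above might map to Helm values like these. It assumes the community ingress-nginx chart, where `controller.minAvailable`, `controller.resources`, and `controller.autoscaling` are available. The CPU, memory, and replica numbers are placeholders, not measurements from this incident; size them from your own metrics.

```yaml
controller:
  # PodDisruptionBudget: keep at least 2 controller pods up during
  # voluntary disruptions such as node drains.
  minAvailable: 2
  resources:
    requests:
      cpu: 200m        # placeholder value
      memory: 256Mi    # placeholder value
    limits:
      memory: 512Mi    # placeholder; leave headroom above observed peak usage
  autoscaling:
    enabled: true
    minReplicas: 3     # the floor once the HPA manages the replica count
    maxReplicas: 6     # placeholder value
    targetCPUUtilizationPercentage: 70
```

As far as we can tell, once autoscaling is enabled in that chart the HPA owns the replica count, so `minReplicas` rather than `replicaCount` is what keeps three pods running. A PodDisruptionBudget only covers voluntary evictions; it would not have prevented the OOMKill itself, which is what the memory limit and the extra replicas are for.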

Lesson: Your ingress controller IS your availability. Treat it that way.

