High Availability That Wasn't
We had 3 replicas across 3 nodes. We thought we were highly available. We weren't.
The setup:
- 3 replicas of each service
- 3 Kubernetes nodes
- Pod anti-affinity: spread across nodes ✅
- All nodes: us-east-1a ❌
The incident:
- AWS us-east-1a partial outage
- All 3 nodes affected
- All pods evicted
- No capacity in the zone
- Complete service outage
Why it happened:
- Cheaper instances in us-east-1a
- Auto-provisioner defaulted to single AZ
- Nobody checked node distribution
The fix:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
Lesson: Replicas without zone distribution is not high availability.