Spot Instance Black Friday Disaster
"Spot instances save up to 90%!" Yes, they can. Until all 50 of them get reclaimed during a traffic spike.
Black Friday timeline:
- 06:00 - Traffic starts climbing
- 08:00 - Auto-scaling adds spot instances
- 10:00 - All systems nominal
- 11:30 - AWS reclaims 50 spot instances simultaneously
- 11:31 - Auto-scaling tries to launch on-demand replacements (at roughly 10x the spot price)
- 11:32 - Insufficient capacity in our AZ
- 11:33 - Partial outage begins and lasts 15 minutes
The problem:
We were 100% spot for cost savings. No baseline of on-demand instances. When AWS needed the capacity back, we had no fallback.
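To make the exposure concrete, here is a rough back-of-the-envelope sketch. The function name and the simplification are ours: it assumes every spot instance is reclaimed at the same moment and ignores any replacement capacity that arrives later.

```python
def surviving_instances(fleet_size: int, on_demand_percent: int) -> int:
    """Instances still running if every spot instance is reclaimed at once.

    Hypothetical helper for illustration only; integer math keeps the
    result exact.
    """
    if not 0 <= on_demand_percent <= 100:
        raise ValueError("on_demand_percent must be between 0 and 100")
    return fleet_size * on_demand_percent // 100


# Our setup on the day: 50 instances, all spot.
all_spot = surviving_instances(fleet_size=50, on_demand_percent=0)

# Same fleet size with a 30% on-demand share, for comparison.
with_baseline = surviving_instances(fleet_size=50, on_demand_percent=30)

print(all_spot, with_baseline)
```

With no on-demand share, nothing is guaranteed to stay up. Any non-zero share leaves a floor that a mass reclaim cannot take away.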
What we fixed:
- Baseline: 30% on-demand for guaranteed capacity
- Spot fleet with multiple instance types
- Multi-AZ to reduce capacity constraints
- Graceful degradation when capacity is limited
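For readers who want something concrete, the first two fixes can be sketched as an EC2 Auto Scaling mixed instances policy. This is a minimal sketch, not our exact configuration: the launch template name, the instance types, and the choice of the "capacity-optimized" allocation strategy are assumptions for illustration. The dictionary keys follow the shape of the `MixedInstancesPolicy` parameter in the Auto Scaling API.

```python
from typing import Any, Dict, List


def build_mixed_instances_policy(
    launch_template_name: str,
    instance_types: List[str],
    on_demand_percent: int = 30,
) -> Dict[str, Any]:
    """Build a MixedInstancesPolicy-shaped dict (illustrative helper)."""
    if not instance_types:
        raise ValueError("at least one instance type is required")
    if not 0 <= on_demand_percent <= 100:
        raise ValueError("on_demand_percent must be between 0 and 100")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Several instance types, so one exhausted spot pool
            # does not take out the whole fleet.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Share of the group's capacity that must be on-demand.
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": on_demand_percent,
            # Assumption: prefer spot pools with the most spare capacity.
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


# Hypothetical template name and instance types.
policy = build_mixed_instances_policy(
    launch_template_name="web-tier",
    instance_types=["m5.large", "m5a.large", "m6i.large"],
)
```

A dict like this could be passed as `MixedInstancesPolicy` to boto3's `create_auto_scaling_group`, with `VPCZoneIdentifier` listing subnets in more than one AZ to cover the multi-AZ fix. Graceful degradation lives in the application and is not shown here.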
Lesson: Spot for batch jobs? Perfect. Spot for production traffic without an on-demand fallback? You're gambling.