"Spot instances save 90%!" — Yes, they do. Until all 50 of them get reclaimed during a traffic spike.

Black Friday timeline:

  • 06:00 - Traffic starts climbing
  • 08:00 - Auto-scaling adds spot instances
  • 10:00 - All systems nominal
  • 11:30 - AWS reclaims 50 spot instances simultaneously
  • 11:31 - Auto-scaling tries to launch on-demand replacements (at roughly 10x the spot price)
  • 11:32 - On-demand launches fail: insufficient capacity in our AZ
  • 11:33 - Partial outage begins and lasts 15 minutes

The problem:

We ran 100% spot to save money, with no baseline of on-demand instances. When AWS needed the capacity back, the whole fleet went at once and we had no fallback.
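
A quick back-of-the-envelope check makes the exposure concrete. This is only a sketch: the function name is made up for illustration, and it assumes the worst case where every spot instance is reclaimed at the same moment, as happened at 11:30.

```python
def surviving_capacity(fleet_size: int, on_demand_percent: int) -> int:
    """Instances still running if every spot instance is reclaimed at once.

    Hypothetical helper for illustration; assumes on-demand instances
    are never reclaimed and spot instances all disappear together.
    """
    if not 0 <= on_demand_percent <= 100:
        raise ValueError("on_demand_percent must be between 0 and 100")
    # Only the on-demand share of the fleet survives a full spot reclaim.
    return fleet_size * on_demand_percent // 100


# Our Black Friday setup: 50 instances, all spot.
print(surviving_capacity(50, 0))   # 0
# The same fleet with a 30% on-demand baseline.
print(surviving_capacity(50, 30))  # 15
```

With no baseline, a full reclaim leaves nothing serving traffic. With 30% on-demand, 15 of the 50 instances keep running while replacements launch.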

What we fixed:

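  • Baseline: 30% on-demand, so part of the fleet can never be reclaimed
  • Spot fleet spread across multiple instance types, so one pool being reclaimed doesn't take everything
  • Multi-AZ, so a capacity shortage in one zone doesn't block new launches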
  • Graceful degradation when capacity is limited
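
The first three fixes can be sketched as an EC2 Auto Scaling mixed instances policy. The snippet below only builds the request dictionary and does not call AWS. The group name, launch template name, instance types, subnet IDs, and sizes are placeholders, not our real values, and the "capacity-optimized" allocation strategy is an assumption rather than something this post confirms we used.

```python
from typing import Any


def build_asg_request(
    name: str,
    launch_template: str,
    instance_types: list[str],
    subnet_ids: list[str],
    min_size: int,
    max_size: int,
    on_demand_percent: int = 30,
) -> dict[str, Any]:
    """Build keyword arguments for boto3's create_auto_scaling_group.

    Hypothetical helper: it assembles the request only, so it runs
    without AWS credentials.
    """
    if len(instance_types) < 2:
        raise ValueError("use several instance types to diversify spot pools")
    if len(subnet_ids) < 2:
        # Cannot verify zones from IDs; the caller must pick subnets in different AZs.
        raise ValueError("use at least two subnets, each in a different AZ")
    return {
        "AutoScalingGroupName": name,
        "MinSize": min_size,
        "MaxSize": max_size,
        # Comma-separated subnet IDs; one subnet per AZ gives multi-AZ placement.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template,
                    "Version": "$Latest",
                },
                # Each override is one more instance type the group may launch.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 0,
                # 30 means 30% on-demand and 70% spot.
                "OnDemandPercentageAboveBaseCapacity": on_demand_percent,
                # Assumption: prefer the spot pools with the most spare capacity.
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }


request = build_asg_request(
    name="web-asg",                      # placeholder
    launch_template="web-template",      # placeholder
    instance_types=["m5.large", "m5a.large", "m6i.large"],
    subnet_ids=["subnet-aaa", "subnet-bbb", "subnet-ccc"],  # placeholders
    min_size=10,
    max_size=60,
)
print(request["MixedInstancesPolicy"]["InstancesDistribution"])
```

With credentials configured, the dictionary could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**request)`. Graceful degradation is application-specific and is not covered by this sketch.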

Lesson: Spot for batch jobs? Perfect. Spot for production traffic without a fallback? You're gambling.

