Our rate limiter stored its state in Redis. Redis went down. Rate limiting stopped. A DDoS came through.

The timeline:

  • 14:00 - Redis master went down for maintenance
  • 14:01 - Rate limiter couldn't read/write state
  • 14:01 - Fail-open: all requests allowed
  • 14:02 - 500K requests/second hit the API
  • 14:03 - API servers overwhelmed
  • 14:05 - Complete outage

The design flaw:

We chose "fail-open" (allow everything) over "fail-closed" (block everything) as the behavior when rate limiter storage fails.
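
To make the two policies concrete, here is a minimal sketch of the decision point. It is not our production code: StoreUnavailable, allow_request, store_check and broken_store are hypothetical names used only for illustration, and the real limiter talked to Redis rather than to a plain callable.

```python
class StoreUnavailable(Exception):
    """Hypothetical error raised when the shared state backend is unreachable."""


def allow_request(store_check, client_id, fail_open):
    """Return True if the request may proceed.

    store_check is a hypothetical callable that consults shared state and
    returns True/False, or raises StoreUnavailable when the backend is down.
    """
    try:
        return store_check(client_id)
    except StoreUnavailable:
        # This branch is the whole design decision:
        # fail_open=True  -> allow everything while storage is down
        # fail_open=False -> block everything while storage is down
        return fail_open


def broken_store(client_id):
    """Simulates the outage: every lookup fails."""
    raise StoreUnavailable("state backend unreachable")
```

With broken_store standing in for the outage, allow_request(broken_store, "client-1", fail_open=True) lets every request through, which is the behavior seen at 14:01 in the timeline.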

Why we made that choice:

"We don't want Redis issues to block legitimate users."

What we should have done:

  • In-memory fallback rate limiter
  • Conservative limits when storage is unavailable
  • Redis Sentinel for automatic failover
  • Rate limiting at multiple layers (CDN, API Gateway, App)
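
The first two items can be sketched together: a per-process token bucket that takes over, with deliberately conservative limits, whenever the shared store is unreachable. This is an illustrative sketch under assumptions, not the implementation we deployed. TokenBucket, FallbackLimiter, StoreUnavailable and shared_check are hypothetical names, and the rate and capacity values are placeholders.

```python
import time


class StoreUnavailable(Exception):
    """Hypothetical error raised when the shared state backend is unreachable."""


class TokenBucket:
    """Simple in-memory token bucket for a single client."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill according to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class FallbackLimiter:
    """Uses the shared store when it works, local buckets when it does not."""

    def __init__(self, shared_check, fallback_rate, fallback_capacity,
                 clock=time.monotonic):
        self.shared_check = shared_check          # hypothetical shared-store check
        self.fallback_rate = fallback_rate        # assumed conservative value
        self.fallback_capacity = fallback_capacity
        self.clock = clock
        # Unbounded dict for brevity; a real limiter would need eviction.
        self.buckets = {}

    def allow(self, client_id):
        try:
            return self.shared_check(client_id)
        except StoreUnavailable:
            bucket = self.buckets.get(client_id)
            if bucket is None:
                bucket = TokenBucket(self.fallback_rate,
                                     self.fallback_capacity, self.clock)
                self.buckets[client_id] = bucket
            return bucket.allow()
```

Because each process keeps its own buckets, the effective limit during an outage is roughly the per-process limit multiplied by the number of API servers. That is one reason to set the fallback limits lower than the normal ones.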

Lesson: Security controls should fail closed, not open. When in doubt, deny, or at least degrade to a conservative limit rather than to no limit at all.

