Rate Limiter Redis Crash
Our rate limiter stored its state in Redis. Redis went down. Rate limiting stopped, and a DDoS came through unthrottled.
The timeline:
- 14:00 - Redis master went down for maintenance
- 14:01 - Rate limiter couldn't read/write state
- 14:01 - Fail-open: all requests allowed
- 14:02 - 500K requests/second hit the API
- 14:03 - API servers overwhelmed
- 14:05 - Complete outage
The design flaw:
For the case where rate limiter storage fails, we chose "fail-open" (allow everything) instead of "fail-closed" (block everything).
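To make the difference concrete, here is a minimal sketch in Python. It is illustrative only, not our production code: `is_allowed`, `FailureMode`, `StorageUnavailable`, `DownStore`, and `store.increment` are all hypothetical names, and the store is assumed to return the client's request count for the current window or raise when Redis is unreachable.

```python
from enum import Enum


class StorageUnavailable(Exception):
    """Raised by the (hypothetical) store when Redis cannot be reached."""


class FailureMode(Enum):
    FAIL_OPEN = "open"      # allow the request when state is unavailable
    FAIL_CLOSED = "closed"  # reject the request when state is unavailable


def is_allowed(client_id, store, limit, mode):
    """Return True if the request may proceed.

    `store.increment(client_id)` is an assumed interface: it returns the
    client's request count in the current window, or raises
    StorageUnavailable.
    """
    try:
        count = store.increment(client_id)
    except StorageUnavailable:
        # This single branch is the whole design decision.
        return mode is FailureMode.FAIL_OPEN
    return count <= limit


class DownStore:
    """Stand-in for a store whose backend is unreachable."""

    def increment(self, client_id):
        raise StorageUnavailable("redis unreachable")
```

With `DownStore`, `FAIL_OPEN` lets every request through regardless of `limit`, which is what happened at 14:01. `FAIL_CLOSED` rejects every request instead.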
Why we made that choice:
"We don't want Redis issues to block legitimate users."
What we should have done:
- In-memory fallback rate limiter
- Conservative limits when storage is unavailable
- Redis Sentinel for automatic failover
- Rate limiting at multiple layers (CDN, API Gateway, App)
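The first two items can be sketched together: when the shared store is unreachable, fall back to a per-process counter with a stricter limit. This is a rough sketch under assumptions, not a drop-in implementation. `LocalFixedWindow`, `RateLimiter`, `StorageUnavailable`, `DownStore`, and `store.increment` are hypothetical names, and a fixed-window counter was picked only because it is short.

```python
import time


class StorageUnavailable(Exception):
    """Raised by the (hypothetical) store when Redis cannot be reached."""


class LocalFixedWindow:
    """Per-process fixed-window counter, used only while shared storage is down."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = {}  # client_id -> (window_index, count)

    def allow(self, client_id):
        window_index = int(self.clock() // self.window)
        index, count = self.counts.get(client_id, (window_index, 0))
        if index != window_index:
            count = 0  # a new window started, so reset the counter
        count += 1
        self.counts[client_id] = (window_index, count)
        return count <= self.limit


class RateLimiter:
    """Uses the shared store normally and the local fallback on storage errors."""

    def __init__(self, store, limit, fallback):
        self.store = store        # assumed interface: increment(client_id) -> int
        self.limit = limit
        self.fallback = fallback  # e.g. LocalFixedWindow with a lower limit

    def allow(self, client_id):
        try:
            return self.store.increment(client_id) <= self.limit
        except StorageUnavailable:
            # Degrade to conservative local limits instead of allowing everything.
            return self.fallback.allow(client_id)


class DownStore:
    """Stand-in for a store whose backend is unreachable."""

    def increment(self, client_id):
        raise StorageUnavailable("redis unreachable")
```

Each app server counts on its own in fallback mode, so a client can get up to the fallback limit on every server. That is the reason to set the fallback limit well below the normal one. The sketch is not thread-safe and never evicts old entries from `counts`. A real version would need a lock and some cleanup.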
Lesson: Security controls should fail closed, not open. When in doubt, deny. If denying everything is too costly, degrade to conservative limits, never to no limits.