Multi-Layer Caching Failure
Added L1, L2, and L3 caches for "better performance." Now we have 3 places where data can be wrong instead of 1.
Our brilliant architecture:
- L1: In-process cache (Guava)
- L2: Local Redis
- L3: Distributed Redis cluster
- Source of truth: PostgreSQL
What went wrong:
- L3 cache invalidated correctly
- L2 cache still had old data (different TTL)
- L1 cache on server A had old data
- L1 cache on server B had new data
- Users saw different data on each refresh
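A minimal sketch of one way this split can happen, using plain dicts in place of Guava and Redis. The `Layer` class, the `read` helper, and the `user:1` key are made up for illustration, and TTLs are left out to keep the sequence easy to follow:

```python
class Layer:
    """A toy cache layer: a named dict standing in for Guava or Redis."""
    def __init__(self, name):
        self.name = name
        self.data = {}

def read(key, layers, db):
    """Read-through lookup: the first layer holding the key wins."""
    for i, layer in enumerate(layers):
        if key in layer.data:
            value = layer.data[key]
            break
    else:
        # Every layer missed, so fall back to the source of truth.
        i = len(layers)
        value = db[key]
    # Backfill the faster layers that missed on the way down.
    for layer in layers[:i]:
        layer.data[key] = value
    return value

db = {"user:1": "old"}                    # stands in for PostgreSQL
l3 = Layer("L3 distributed Redis")        # shared by both servers
l1_a, l2_a = Layer("L1 server A"), Layer("L2 server A")
l1_b, l2_b = Layer("L1 server B"), Layer("L2 server B")
server_a = [l1_a, l2_a, l3]
server_b = [l1_b, l2_b, l3]

# Server A serves the key once, so its L1 and L2 (and the shared L3) are warm.
read("user:1", server_a, db)

# A write lands in the database and only the L3 entry is invalidated.
db["user:1"] = "new"
l3.data.pop("user:1", None)

result_a = read("user:1", server_a, db)   # answered by server A's stale L1
result_b = read("user:1", server_b, db)   # misses every layer, reads the database
print(result_a, result_b)
```

A load balancer alternating between the two servers would then show a user a different value on each refresh.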
Debugging hell:
"Which cache layer has the bad data?" became a 2-hour investigation every time.
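One thing that might shorten that investigation is a probe that checks a single key in every layer and compares it with the source of truth. The sketch below is only an assumption about what such a tool could look like: the `probe` function is hypothetical, and plain dicts stand in for the real Guava, Redis, and PostgreSQL clients.

```python
def probe(key, layers, source_of_truth):
    """Return one (layer name, cached value, status) row per layer."""
    expected = source_of_truth.get(key)
    report = []
    for name, store in layers:
        if key not in store:
            status = "MISS"
        elif store[key] == expected:
            status = "OK"
        else:
            status = "STALE"
        report.append((name, store.get(key), status))
    return report

# Hypothetical snapshot of one server's view of the key.
layers = [
    ("L1 (in-process, server A)", {"user:1": "old"}),
    ("L2 (local Redis, server A)", {"user:1": "old"}),
    ("L3 (distributed Redis)", {}),
]
postgres = {"user:1": "new"}

report = probe("user:1", layers, postgres)
for name, value, status in report:
    print(f"{name:28} {status:6} {value!r}")
```

Running the same probe on each server gives a per-layer picture in seconds rather than hours, as long as every layer exposes some way to look up a key.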
What we learned:
- Each cache layer multiplies invalidation complexity
- TTLs must be coordinated across layers (inner cache TTL < outer cache TTL)
- Need observability into every layer
- Sometimes one cache is enough
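The TTL point can be made concrete with a little arithmetic. This sketch assumes expiry is TTL-only, and that each layer starts a fresh TTL whenever it refills from the next layer out. Under those assumptions a value can survive for roughly the sum of all TTLs after the database changes, because each layer may refill just before the layer behind it expires. The function names and TTL values are invented for the example.

```python
def check_ttls(ttls):
    """ttls: list of (layer name, TTL in seconds), innermost layer first.

    Returns a description of every adjacent pair where the inner TTL
    is not shorter than the outer TTL.
    """
    problems = []
    for (inner, inner_ttl), (outer, outer_ttl) in zip(ttls, ttls[1:]):
        if inner_ttl >= outer_ttl:
            problems.append(
                f"{inner} TTL ({inner_ttl}s) should be shorter than "
                f"{outer} TTL ({outer_ttl}s)"
            )
    return problems

def worst_case_staleness(ttls):
    """Upper bound on staleness in seconds when layers refill from each other."""
    return sum(ttl for _, ttl in ttls)

# Hypothetical values, not the ones from our incident.
coordinated = [("L1", 30), ("L2", 120), ("L3", 600)]
uncoordinated = [("L1", 300), ("L2", 120), ("L3", 600)]

print(check_ttls(coordinated))            # no problems reported
print(check_ttls(uncoordinated))          # flags the L1/L2 pair
print(worst_case_staleness(coordinated))  # 30 + 120 + 600 = 750 seconds
```

Even with coordinated TTLs the bound grows with every layer added, which is another argument for the last bullet.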
Lesson: More cache layers ≠ better performance. More layers = more places to debug when things go wrong.