Partial Failure Hell
In microservices, everything can partially fail. One of 8 services started failing half of its requests. The result was worse than a complete outage.
The scenario:
- User flow: 8 services in sequence
- Service 6 (recommendations): 50% failure rate
- No circuit breaker, no fallback
- 30-second timeout on each attempt, retries included
What users experienced:
- 50% of requests: Success (eventually)
- 50% of requests: 90+ seconds of stacked timeouts, then failure
- Users retry → more load
- Thread pools exhausted across all services
- Everything degrades
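Why the thread pools drain follows from Little's law: average requests in flight = arrival rate x average latency. The sketch below works through that arithmetic. The 100 requests/second arrival rate and the 200 ms healthy latency are assumptions chosen for illustration, not figures from the incident; only the 50% split and the 90-second wait come from the scenario above.

```python
def in_flight(arrival_rate_per_s, avg_latency_s):
    """Little's law: average concurrent requests = arrival rate x average latency."""
    return arrival_rate_per_s * avg_latency_s


ARRIVAL_RATE = 100      # requests/second (assumed for illustration)
HEALTHY_LATENCY = 0.2   # seconds for a normal request (assumed)
STUCK_LATENCY = 90      # seconds spent waiting on stacked timeouts (from the scenario)
STUCK_FRACTION = 0.5    # share of requests that hit the failing path (from the scenario)

# Concurrency when every request completes at the healthy latency.
healthy = in_flight(ARRIVAL_RATE, HEALTHY_LATENCY)

# Concurrency when half the requests hold a thread for the full 90 seconds.
avg_latency = STUCK_FRACTION * STUCK_LATENCY + (1 - STUCK_FRACTION) * HEALTHY_LATENCY
degraded = in_flight(ARRIVAL_RATE, avg_latency)

print(f"healthy: ~{healthy:.0f} in flight, degraded: ~{degraded:.0f} in flight")
```

Under these assumed numbers, steady-state concurrency goes from about 20 requests to about 4,510, before counting user retries. Any pool sized for normal traffic is exhausted long before that, which is how one flaky service stalls its callers too.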
The worst part:
- Dashboards showed "Service 6: 50% healthy"
- User-facing experience: 0% usable
- No clear owner (whose fault is a partial failure?)
The fix:
- Circuit breakers on all outbound calls
- Graceful degradation (show page without recommendations)
- Timeouts measured in milliseconds, not seconds
- Bulkhead pattern to isolate failures
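The four fixes above can be sketched together in a few dozen lines. This is a minimal illustration, not the code used in the incident: CircuitBreaker, Bulkhead, call_with_protection, and fetch_recommendations are hypothetical names, and the thresholds (3 failures, 30-second cooldown, 10 concurrent calls, 200 ms timeout) are assumed values. A production system would more likely use an established resilience library.

```python
import threading
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows trial calls after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed
        self._lock = threading.Lock()

    def allow(self):
        with self._lock:
            if self.opened_at is None:
                return True
            # Half-open: once the cooldown has passed, let calls through again.
            return self.clock() - self.opened_at >= self.reset_after_s

    def record(self, ok):
        with self._lock:
            if ok:
                self.failures = 0
                self.opened_at = None
            else:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = self.clock()  # open, or re-open after a failed trial


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot drain shared threads."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_enter(self):
        return self._slots.acquire(blocking=False)

    def leave(self):
        self._slots.release()


def call_with_protection(fetch, fallback, breaker, bulkhead, timeout_s=0.2):
    """Call fetch(timeout_s); degrade to fallback() instead of blocking the caller."""
    if not breaker.allow():
        return fallback()      # circuit open: fail fast
    if not bulkhead.try_enter():
        return fallback()      # bulkhead full: shed load
    try:
        result = fetch(timeout_s)
    except Exception:
        breaker.record(False)
        return fallback()      # graceful degradation
    else:
        breaker.record(True)
        return result
    finally:
        bulkhead.leave()


# Usage: a stand-in for the failing recommendations call (hypothetical).
def fetch_recommendations(timeout_s):
    raise TimeoutError(f"no answer within {timeout_s}s")


breaker = CircuitBreaker()
bulkhead = Bulkhead(max_concurrent=10)
recommendations = call_with_protection(fetch_recommendations, lambda: [], breaker, bulkhead)
print(recommendations)  # empty list: the page renders without recommendations
```

The point of the design is that a failing dependency costs the caller at most one short timeout, and after a few failures it costs nothing at all, because the breaker skips the call entirely. The real fetch function must honor timeout_s itself (for example by passing it to its HTTP client); the wrapper does not interrupt a call that ignores it.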
Lesson: In distributed systems, design for partial failure first.