Partial Failure Hell

LESSON #28

In microservices, anything can fail partially. One of our eight services started failing half of its requests, and the result was worse than a complete outage.

The scenario:

User flow: 8 services in sequence
Service 6 (recommendations): 50% failure rate
No circuit breaker, no fallback
30-second timeout per retry

What users experienced:

50% of requests: Success (eventually)
50% of requests: 90+ second timeout
Users retry → more load
Thread pools exhausted across all services
Everything degrades
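
A rough way to see why thread pools ran out is Little's law: average threads in use equals request rate times average time a request holds a thread. The sketch below applies it to the 50/50 split above. The request rate, pool size, and 200 ms healthy latency are hypothetical numbers chosen for illustration; only the 90-second figure comes from the incident.

```python
def threads_in_use(requests_per_second, avg_latency_seconds):
    """Little's law: average concurrency = arrival rate x time in system."""
    return requests_per_second * avg_latency_seconds


RATE = 100              # hypothetical: requests per second
POOL_SIZE = 200         # hypothetical: worker threads available
HEALTHY_LATENCY = 0.2   # hypothetical: seconds for a successful request
STUCK_LATENCY = 90      # from the incident: seconds before a request times out

# Normal operation: every request finishes quickly.
healthy = threads_in_use(RATE, HEALTHY_LATENCY)

# Partial failure: half the requests are fast, half hold a thread for 90 s.
degraded_latency = 0.5 * HEALTHY_LATENCY + 0.5 * STUCK_LATENCY
degraded = threads_in_use(RATE, degraded_latency)

print(f"healthy:  about {round(healthy)} threads busy")
print(f"degraded: about {round(degraded)} threads needed, pool has {POOL_SIZE}")
```

Under these assumed numbers the demand for threads grows by more than two orders of magnitude, and that is before user retries add load. Every upstream service that waits on the slow call holds its own threads for the same 90 seconds, which is how one failing dependency spreads.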

The worst part:

Dashboards showed "Service 6: 50% healthy"
User-facing experience: 0% usable
No clear owner (whose fault is a partial failure?)

The fix:

Circuit breakers on all outbound calls
Graceful degradation (show page without recommendations)
Timeouts measured in milliseconds, not seconds
Bulkhead pattern to isolate failures
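
The four items above can be combined around a single outbound call. What follows is a minimal sketch under stated assumptions, not the implementation used in the incident: `CircuitBreaker`, `Bulkhead`, `guarded_call`, and `render_page` are hypothetical names, the thresholds are arbitrary, and the wrapped function is assumed to enforce its own short timeout (for example, a few hundred milliseconds on the HTTP client). The breaker here is deliberately simple and not thread-safe; a production version would need a lock and would limit half-open trial calls.

```python
import threading
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; allows a trial call
    once `reset_after` seconds have passed. Simplified: not thread-safe."""

    def __init__(self, max_failures=3, reset_after=10.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let calls through again once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot occupy every thread."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_enter(self):
        # Non-blocking: a full bulkhead rejects immediately instead of queueing.
        return self._slots.acquire(blocking=False)

    def leave(self):
        self._slots.release()


def guarded_call(fn, breaker, bulkhead, fallback):
    """Run fn() behind both guards; return `fallback` rather than wait or raise."""
    if not breaker.allow() or not bulkhead.try_enter():
        return fallback
    try:
        result = fn()  # fn is assumed to apply its own short timeout
    except Exception:
        breaker.record(ok=False)
        return fallback
    else:
        breaker.record(ok=True)
        return result
    finally:
        bulkhead.leave()


def render_page(fetch_recommendations, breaker, bulkhead):
    """Graceful degradation: the page renders with an empty recommendations list."""
    recs = guarded_call(fetch_recommendations, breaker, bulkhead, fallback=[])
    return {"content": "main page", "recommendations": recs}
```

With this shape, a failing recommendations service costs each request at most one short timeout until the breaker opens, and nothing after that until the cool-down expires. The bulkhead bounds how many threads the dependency can hold in the meantime.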

Lesson: In distributed systems, design for partial failure first.

Tags: #Microservices #Reliability #ErrorHandling #Distributed

© GRAF CLOUDS 2024 All Rights Reserved