In microservices, everything can partially fail. One of 8 services began failing half of its requests. The result was worse than a complete outage.

The scenario:

  • User flow: 8 services in sequence
  • Service 6 (recommendations): 50% failure rate
  • No circuit breaker, no fallback
  • 30-second timeout per retry

What users experienced:

  • 50% of requests: Success (eventually)
  • 50% of requests: 90+ second wait (30-second timeouts stacking up across retries)
  • Users retry → more load
  • Thread pools exhausted across all services
  • Everything degrades
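
Why a 50% failure rate drains every thread pool can be sketched with Little's law (threads in use = arrival rate x average time each request holds a thread). The 50% failure rate and the 90-second wait come from the incident above; the arrival rate and the healthy latency are hypothetical numbers chosen only for illustration, and the sketch assumes successful requests finish at healthy latency.

```python
def threads_needed(arrival_rate_per_s: float, mean_hold_s: float) -> float:
    """Little's law: average concurrency = arrival rate * mean hold time."""
    return arrival_rate_per_s * mean_hold_s


ARRIVAL_RATE = 100.0      # requests per second (hypothetical)
HEALTHY_LATENCY_S = 0.2   # seconds per request when all is well (hypothetical)
FAILURE_RATE = 0.5        # from the incident: half of the requests fail
STUCK_S = 90.0            # from the incident: failing requests hang ~90 s

# Healthy system: every request holds a thread for 0.2 s.
healthy = threads_needed(ARRIVAL_RATE, HEALTHY_LATENCY_S)

# Degraded system: half of the requests hold a thread for 90 s.
mean_hold = (1 - FAILURE_RATE) * HEALTHY_LATENCY_S + FAILURE_RATE * STUCK_S
degraded = threads_needed(ARRIVAL_RATE, mean_hold)

print(f"healthy:  ~{healthy:.0f} threads busy")
print(f"degraded: ~{degraded:.0f} threads busy")
```

Under these assumed numbers, the busy-thread count goes from about 20 to about 4,510, before counting the extra load from users retrying. Any pool sized for normal traffic is exhausted, including pools in upstream services that are merely waiting on Service 6.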

The worst part:

  • Dashboards showed "Service 6: 50% healthy"
  • User-facing experience: 0% usable
  • No clear owner (whose fault is partial failure?)

The fix:

  • Circuit breakers on all outbound calls
  • Graceful degradation (show page without recommendations)
  • Timeouts measured in milliseconds, not seconds
  • Bulkhead pattern to isolate failures
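
The first two items in the fix can be sketched together: a circuit breaker that stops calling a failing dependency, plus a fallback that lets the page render without recommendations. This is a minimal illustration, not the implementation used in the incident. The class, the thresholds, and the fetch_recommendations and render_page names are all hypothetical, and the breaker is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal count-based circuit breaker (illustrative, not thread-safe)."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after_s = reset_after_s          # how long to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cool-down, report "not open" so a trial call can go through
        # (the half-open state). A failed trial re-opens the breaker in call().
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()          # fail fast: do not touch the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()          # degrade gracefully on this request too
        self.failures = 0              # a success closes the breaker
        self.opened_at = None
        return result


def fetch_recommendations() -> list:
    # Hypothetical stand-in for the call to the recommendations service.
    raise TimeoutError("recommendations service timed out")


def render_page(breaker: CircuitBreaker) -> dict:
    # Graceful degradation: an empty list means "show the page without recommendations".
    recs = breaker.call(fetch_recommendations, fallback=lambda: [])
    return {"title": "Product page", "recommendations": recs}


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=3, reset_after_s=5.0)
    for _ in range(5):
        print(render_page(breaker), "breaker open:", breaker.is_open())
```

In this sketch the breaker opens after three consecutive failures, so later requests skip the failing service entirely instead of each holding a thread until the timeout expires. A production setup would typically pair this with short per-call timeouts and a per-dependency concurrency limit (the bulkhead).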

Lesson: In distributed systems, design for partial failure first.

