We scaled horizontally to 50 instances. The bottleneck just moved.

Phase 1: Application bottleneck

  • 5 app servers, 100% CPU
  • Solution: Scale to 20 servers
  • Result: App CPU 30%

Phase 2: Database bottleneck

  • Database: 100% CPU, 5000 connections
  • Solution: Connection pooling, read replicas
  • Result: Database CPU 60%
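
To illustrate the connection pooling fix in Phase 2, here is a minimal sketch of a fixed-size pool. The `ConnectionPool` class and its parameters are hypothetical, not the pooler actually used, and an in-memory SQLite database stands in for the real one. The point is only that callers wait for a free connection instead of opening a new one, so the database never sees more than `size` connections per process.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: at most `size` connections ever exist."""

    def __init__(self, size, connect):
        self._idle = queue.Queue(maxsize=size)
        # Open all connections up front so the total is capped at `size`.
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=1.0):
        # Wait for a free connection; raises queue.Empty if none frees up in time.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)


# Assumption: SQLite in memory is a stand-in for the production database.
pool = ConnectionPool(
    size=2,
    connect=lambda: sqlite3.connect(":memory:", check_same_thread=False),
)

with pool.connection() as conn:
    answer = conn.execute("SELECT 1 + 1").fetchone()[0]
```

With a pool like this on each app server, the connection count is bounded by servers times pool size, instead of growing with request volume.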

Phase 3: Load balancer bottleneck

  • Single ALB hitting packet limit
  • Solution: Multiple ALBs with DNS round-robin
  • Result: Traffic distributed

Phase 4: Message queue bottleneck

  • RabbitMQ single node saturated
  • Solution: Clustered queue, partitioned topics
  • Result: Messages flowing

The lesson:

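Amdahl's Law: speedup is limited by the sequential part of the work.
With a fraction p of the work parallelized across n instances:

    speedup = 1 / ((1 - p) + p / n)

The same idea applies across components. The slowest one sets the ceiling:

    System throughput = min(
        app_capacity,
        db_capacity,
        network_capacity,
        queue_capacity,
        external_api_capacity
    )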
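
The min() relationship above can be sketched in a few lines. The capacity numbers below are made up for illustration (requests per second), and `bottleneck` is a hypothetical helper, but the pattern matches what happened in each phase: scaling the slowest component raises throughput only until the next-slowest component takes over.

```python
def bottleneck(capacities):
    """Return (component, throughput) for the lowest-capacity component."""
    name = min(capacities, key=capacities.get)
    return name, capacities[name]


# Hypothetical capacities in requests per second, not measured values.
capacities = {
    "app": 2000,
    "db": 5000,
    "network": 8000,
    "queue": 6000,
    "external_api": 10000,
}

before = bottleneck(capacities)

# Quadruple app capacity (for example, 5 servers -> 20 servers).
capacities["app"] *= 4

after = bottleneck(capacities)

print(before)  # ('app', 2000)
print(after)   # ('db', 5000)
```

App capacity went up 4x, but system throughput went up only 2.5x, because the database became the new limit.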

What we learned:

  • Identify the bottleneck BEFORE scaling
  • Load test the entire path
  • Monitor all components, not just apps
  • Sometimes vertical scaling is cheaper

Lesson: Scaling one component moves the bottleneck to the next. Plan for end-to-end capacity.
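
One way to plan end-to-end is to size every component against the same target before scaling any of them. The sketch below is a rough starting point only: the per-instance capacities and the `instances_needed` helper are hypothetical, and it assumes capacity grows linearly with instance count, which real systems rarely achieve. Load testing the full path is still needed to confirm the numbers.

```python
import math


def instances_needed(target_rps, per_instance_rps):
    """Instances each component needs to carry target_rps, assuming linear scaling."""
    return {
        name: math.ceil(target_rps / rps)
        for name, rps in per_instance_rps.items()
    }


# Hypothetical per-instance capacities in requests per second.
per_instance_rps = {
    "app": 400,
    "db_replica": 2500,
    "load_balancer": 6000,
    "queue_node": 3000,
}

plan = instances_needed(10000, per_instance_rps)
print(plan)
# {'app': 25, 'db_replica': 4, 'load_balancer': 2, 'queue_node': 4}
```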

