Scaling from 1K to 10K requests per second wasn't 10x harder. It was 100x harder.

What worked at 1K RPS:

  • Synchronous service-to-service calls
  • Database joins for complex queries
  • Logs to CloudWatch without sampling
  • Simple round-robin load balancing

What broke at 10K RPS:

  • Database connection limits hit
  • Synchronous calls created cascading timeouts
  • CloudWatch costs exploded (10x logs = 10x cost)
  • Hot keys in caching layer
  • Network socket exhaustion

New patterns needed:

  • Event-driven instead of request-response
  • Read replicas and connection pooling
  • Log sampling (keep 1% of debug-level logs)
  • Rate limiting at edge
  • Consistent hashing for cache distribution
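
Of the patterns above, consistent hashing is the one that is easiest to get subtly wrong, so here is a minimal sketch of the idea. It is not the implementation we ran in production: the node names (cache-a and so on), the HashRing class, and the choice of 100 virtual nodes per server are all assumptions made for illustration. It shows how keys map onto a ring of cache nodes, and that adding a node moves only a fraction of the keys.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per physical node (assumed value)
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # MD5 is used here only for its even spread, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Place several points per node so load spreads more evenly.
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def get(self, key):
        if not self._points:
            raise LookupError("ring has no nodes")
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical keys and node names, for demonstration only.
keys = [f"user:{n}" for n in range(10_000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.get(k) for k in keys}

ring.add("cache-d")  # scale out by one node
after = {k: ring.get(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved to the new node")
```

Going from three nodes to four, the expected share of keys that move is about a quarter, and every key that moves lands on the new node. With plain hash(key) % N, most keys would change owner and the cache would go cold all at once. Note that this spreads keys across nodes; it does not by itself fix a single hot key, which still lands on one node.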

The insight:

At each order of magnitude, you're solving different problems. An architecture that works at 1K RPS won't work at 10K, and what works at 10K won't work at 100K.

Lesson: Scale testing isn't optional. Your 10x traffic day will find every weakness.

