Our Kubernetes cluster autoscaled perfectly. Too perfectly.

The incident:

  • Traffic spike at 10 AM
  • HPA scaled pods: 5 → 50
  • Each pod: 20 database connections
  • Total connections: 100 → 1,000
  • PostgreSQL max_connections: 200
  • 💥 "too many connections" errors

The cascade:

  • New pods can't connect to database
  • Health checks fail
  • Pods restart
  • Connection storm on restart
  • Entire cluster thrashing

The fix:

  • PgBouncer as connection pooler
  • 1,000 app connections → 100 database connections
  • HPA max replicas capped at sustainable level
  • Readiness probe waits for DB connection

Lesson: Autoscaling doesn't know about downstream limits. You must enforce them.


← Zurück zu Erfahrungsberichte