Autoscaling That Killed the Database
Our Kubernetes cluster autoscaled perfectly. Too perfectly.
The incident:
- Traffic spike at 10 AM
- HPA scaled pods: 5 → 50 (spec sketched below)
- Each pod: 20 database connections
- Total connections: 100 → 1,000
- PostgreSQL max_connections: 200
- 💥 "too many connections" errors
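The HPA itself wasn't misconfigured. A spec like the following (names and thresholds are assumptions, not our exact manifest) will happily drive that 5 → 50 scale-out, because it only watches pod CPU and knows nothing about database capacity:

```yaml
# Sketch of an HPA that scales purely on pod CPU. It has no idea that every
# new replica opens 20 more PostgreSQL connections. Names and numbers assumed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                # hypothetical Deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 5
  maxReplicas: 50          # 50 pods x 20 connections = 1,000, vs max_connections = 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```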
The cascade:
- New pods can't connect to the database
- Health checks fail (probe sketch below)
- Kubernetes restarts the pods
- Each restart triggers a fresh connection storm
- The entire cluster thrashes
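Restarts like these mean the liveness probe is effectively tied to database health. Here's a sketch of the kind of probe wiring that produces this cascade, assuming a /healthz endpoint that also pings the database (path, port, and thresholds are illustrative):

```yaml
# Hypothetical probe wiring that turns a full connection pool into a restart
# storm: if /healthz also checks the database, an exhausted pool makes the
# kubelet declare the pod dead, restart it, and the restart then opens yet
# another batch of connections.
livenessProbe:
  httpGet:
    path: /healthz        # assumed endpoint that also checks DB connectivity
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```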
The fix:
- PgBouncer as a connection pooler (config sketch below)
- 1,000 app-side connections multiplexed down to 100 database connections
- HPA max replicas capped at a level the database can actually sustain
- Readiness probe waits for a database connection before the pod takes traffic (probe sketch below)
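The PgBouncer piece is mostly configuration. A minimal sketch, assuming a single database and transaction pooling; hostnames, credentials, and the ConfigMap layout are placeholders, not our production setup:

```yaml
# Sketch: pgbouncer.ini served from a ConfigMap. App pods connect to PgBouncer
# on port 6432; PgBouncer multiplexes up to 1,000 client connections onto at
# most 100 server connections to PostgreSQL.
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    appdb = host=postgres.internal port=5432 dbname=appdb

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    ; app side: up to 50 pods x 20 connections each
    max_client_conn = 1000
    ; database side: stays well under PostgreSQL max_connections = 200
    default_pool_size = 100
```

The pool_mode = transaction setting is what makes the 10:1 multiplexing possible; the trade-off is that session-level features such as session prepared statements no longer work reliably.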
Lesson: Autoscaling doesn't know about downstream limits. You must enforce them.
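Concretely, enforcing them means capping maxReplicas so that pods × per-pod connections stays inside what PgBouncer will accept, and gating readiness (not liveness) on the database. Here's a sketch of the probe side, assuming an app /readyz endpoint that verifies its connection through PgBouncer:

```yaml
# Sketch of the readiness half of the fix (path, port, thresholds assumed).
# Readiness failures only take the pod out of the Service; they do not restart
# it, so a slow or saturated database no longer triggers a restart storm.
readinessProbe:
  httpGet:
    path: /readyz         # assumed endpoint that checks the DB connection via PgBouncer
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
```

If a liveness probe is kept at all, it should check only the process itself, never the database.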