ConfigMap Typo Outage
Production down for 2 hours. Root cause: a single typo.
The change:
# ConfigMap update
data:
  DATABASE_CONNECITON_TIMEOUT: "30000"  # Typo: CONNECITON
What happened:
- App looked for DATABASE_CONNECTION_TIMEOUT
- Didn't find it (typo)
- No default value set
- Timeout = null → interpreted as 0
- Every database query timed out immediately
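The chain above can be reproduced in a few lines. This is only a sketch: the write-up does not say which language or database client the app used, so Python and the hypothetical `read_timeout_ms` helper are assumptions chosen for illustration.

```python
import os


def read_timeout_ms(env=os.environ):
    """Hypothetical loader that asks for the correctly spelled key."""
    raw = env.get("DATABASE_CONNECTION_TIMEOUT")  # None when the key is absent
    # A permissive "or 0" fallback silently turns the missing value into 0.
    return int(raw or 0)


# The ConfigMap only provided the misspelled key.
env = {"DATABASE_CONNECITON_TIMEOUT": "30000"}
timeout_ms = read_timeout_ms(env)
print(timeout_ms)  # prints 0
```

Nothing here raises an error, which is why the problem only showed up once queries started failing.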
Why it wasn't caught:
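- ConfigMap data is free-form key/value strings, with no schema for which keys must exist
- Kubernetes checks only that key names are syntactically valid, not that the app expects them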
- PR review missed the typo
- Integration tests used a different config
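To make the gap concrete: nothing in the pipeline compared the ConfigMap's keys with the names the app reads. A minimal sketch of such a comparison follows, assuming the expected keys are kept in one list. `EXPECTED_KEYS` and `check_configmap_keys` are hypothetical names, and parsing the manifest into a dict is left out.

```python
import difflib

# Assumption: the keys the app reads are listed in one place.
EXPECTED_KEYS = {"DATABASE_CONNECTION_TIMEOUT"}


def check_configmap_keys(data, expected=EXPECTED_KEYS):
    """Return a list of problems found in a ConfigMap's `data` mapping."""
    problems = []
    # Keys present in the ConfigMap that the app does not know about.
    for key in sorted(set(data) - set(expected)):
        close = difflib.get_close_matches(key, sorted(expected), n=1)
        hint = f" (did you mean {close[0]}?)" if close else ""
        problems.append(f"unknown key {key}{hint}")
    # Keys the app needs that the ConfigMap does not provide.
    for key in sorted(set(expected) - set(data)):
        problems.append(f"missing key {key}")
    return problems


problems = check_configmap_keys({"DATABASE_CONNECITON_TIMEOUT": "30000"})
for problem in problems:
    print(problem)
```

Run against the change that caused the outage, this reports the misspelled key as unknown, suggests the intended name, and reports the intended name as missing.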
The fixes:
- App validates required env vars at startup
- Fail fast if critical config is missing
- Schema validation for ConfigMaps in CI
- Canary deployments to catch issues early
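The first two fixes can be sketched as a startup check. This is an illustration under assumptions, not the app's actual code: `REQUIRED_VARS`, `MissingConfigError`, `load_required_config`, and `parse_timeout_ms` are hypothetical names.

```python
import os

# Assumption: the list of required variables is maintained with the app.
REQUIRED_VARS = ["DATABASE_CONNECTION_TIMEOUT"]


class MissingConfigError(RuntimeError):
    """Raised at startup when required configuration is absent or invalid."""


def load_required_config(env=os.environ, required=REQUIRED_VARS):
    """Return the required settings, or raise if any is missing or empty."""
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise MissingConfigError("missing required env vars: " + ", ".join(missing))
    return {name: env[name] for name in required}


def parse_timeout_ms(config):
    """Reject non-positive timeouts instead of passing them to the client."""
    timeout_ms = int(config["DATABASE_CONNECTION_TIMEOUT"])
    if timeout_ms <= 0:
        raise MissingConfigError("DATABASE_CONNECTION_TIMEOUT must be positive")
    return timeout_ms


try:
    load_required_config({"DATABASE_CONNECITON_TIMEOUT": "30000"})
except MissingConfigError as err:
    print(f"refusing to start: {err}")
```

Calling this before the app accepts traffic turns the typo into an immediate, named error at startup rather than a silent zero timeout.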
Lesson: ConfigMaps are not code, but treat them as if they were.