Production down for 2 hours. Root cause: A single typo.

The change:

    # ConfigMap update
    data:
      DATABASE_CONNECITON_TIMEOUT: "30000"  # Typo: CONNECITON

What happened:

  • App looked for DATABASE_CONNECTION_TIMEOUT
  • Didn't find it (typo)
  • No default value set
  • Timeout = null → interpreted as 0
  • Every database query timed out immediately
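
The application code is not shown here, so the following is only a sketch of how a lookup like this can collapse to zero. It assumes a Python-style environment lookup; `read_timeout_ms` and `env_from_configmap` are hypothetical names used for illustration.

```python
def read_timeout_ms(env):
    """Naive lookup: a missing key silently becomes a 0 ms timeout."""
    raw = env.get("DATABASE_CONNECTION_TIMEOUT")  # None when the key is misspelled
    # No default and no validation, so the null value collapses to 0.
    return int(raw) if raw is not None else 0


# The ConfigMap supplied the misspelled key, so the expected key is absent.
env_from_configmap = {"DATABASE_CONNECITON_TIMEOUT": "30000"}
timeout_ms = read_timeout_ms(env_from_configmap)
```

With the misspelled key in place, `timeout_ms` ends up as 0 and nothing raises an error, which matches the silent failure described in the list above.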

Why it wasn't caught:

  • ConfigMap data is free-form YAML key/value pairs, with no schema for the keys
  • Kubernetes doesn't check key names against what the app expects
  • PR review missed the typo
  • Integration tests used different config
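
Because nothing in the pipeline knew which keys the app expects, the misspelled key looked as valid as any other. Below is a minimal sketch of the kind of key check that was missing, assuming the ConfigMap's `data` section has already been parsed into a dict. `EXPECTED_KEYS` and `find_unknown_keys` are hypothetical; a real allow-list would come from the app's own config definition.

```python
import difflib

# Hypothetical allow-list of keys the application actually reads.
EXPECTED_KEYS = frozenset({"DATABASE_CONNECTION_TIMEOUT"})


def find_unknown_keys(data, expected=EXPECTED_KEYS):
    """Map each unexpected key to its closest expected key, or None if no close match."""
    problems = {}
    for key in data:
        if key not in expected:
            close = difflib.get_close_matches(key, sorted(expected), n=1)
            problems[key] = close[0] if close else None
    return problems


# The data section from the change in this incident, already parsed.
configmap_data = {"DATABASE_CONNECITON_TIMEOUT": "30000"}
problems = find_unknown_keys(configmap_data)
```

Here `problems` maps the misspelled key to `DATABASE_CONNECTION_TIMEOUT`, which is enough for a reviewer or a pipeline step to spot the typo.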

The fixes:

  • App validates required env vars at startup
  • Fail fast if critical config is missing
  • Schema validation for ConfigMaps in CI
  • Canary deployments to catch issues early
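
The first two fixes can be sketched as follows. This is an illustration under stated assumptions, not the actual application code: `REQUIRED_VARS`, `ConfigError`, `load_timeout_ms`, and `main` are hypothetical names, and the rule that the timeout must be a positive integer is an assumption.

```python
import os
import sys

# Hypothetical list of settings the app cannot run without.
REQUIRED_VARS = ("DATABASE_CONNECTION_TIMEOUT",)


class ConfigError(Exception):
    """Raised when required configuration is missing or invalid."""


def load_timeout_ms(env):
    """Return the timeout in ms, or raise ConfigError instead of defaulting to 0."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ConfigError("missing required env vars: " + ", ".join(missing))
    try:
        timeout_ms = int(env["DATABASE_CONNECTION_TIMEOUT"])
    except ValueError as exc:
        raise ConfigError("DATABASE_CONNECTION_TIMEOUT must be an integer") from exc
    if timeout_ms <= 0:
        raise ConfigError("DATABASE_CONNECTION_TIMEOUT must be positive")
    return timeout_ms


def main():
    """Entry point sketch: validate config before doing any other work."""
    try:
        timeout_ms = load_timeout_ms(os.environ)
    except ConfigError as err:
        sys.exit(f"config error: {err}")  # exits with a non-zero status
    print(f"database timeout: {timeout_ms} ms")
```

If `main()` runs at process start, a misspelled key makes the new pods exit immediately with a clear message, so they never serve traffic with a zero timeout.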

Lesson: ConfigMaps are not code—treat them like they are.

