Pod Eviction Cascade
Node ran out of disk. All pods evicted. New pods scheduled on same node. Evicted again. Repeat forever.
The timeline:
- 2:00 AM: Node disk reaches 90%
- 2:05 AM: Kubelet hits its disk-pressure threshold and starts evicting pods (default thresholds below)
- 2:06 AM: Evicted pods rescheduled to... the same node, which now had the most available resources
- 2:07 AM: More logs, more disk usage
- 2:08 AM: Evicted again
- 2:09 AM: Alert fires (finally)
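For context, the timing lines up with the kubelet's default hard-eviction thresholds: disk-pressure eviction kicks in once node filesystem free space drops below 10%, i.e. around 90% used. As a KubeletConfiguration sketch showing the upstream defaults (not necessarily what this cluster ran):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Upstream default hard-eviction thresholds for disk
evictionHard:
  nodefs.available: "10%"    # node filesystem: logs, emptyDir, writable layers
  imagefs.available: "15%"   # image filesystem, if separate
```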
Root cause:
- Application logging to the container's writable filesystem (which counts against node ephemeral storage)
- No log rotation configured
- No ephemeral-storage requests or limits set on the pods
- The node.kubernetes.io/disk-pressure taint only applies while the node is under pressure; once evictions freed space it cleared, and the scheduler sent pods straight back
The fix:

```yaml
resources:
  limits:
    ephemeral-storage: "2Gi"
  requests:
    ephemeral-storage: "1Gi"
```
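In context, that stanza goes under each container in the pod template. A minimal sketch (the deployment and image names are hypothetical, not from this incident):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app              # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example-app:1.0   # hypothetical image
          resources:
            requests:
              ephemeral-storage: "1Gi"
            limits:
              ephemeral-storage: "2Gi"
```

With the limit in place, a container that blows past 2Gi of ephemeral storage gets evicted on its own instead of dragging the whole node into disk pressure.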
Plus:
- Log to stdout/stderr (collected by Fluentd) instead of files inside the container
- Configure container log rotation on the kubelet (first sketch below)
- Monitor node disk usage and alert at 70% (alert rule sketch below)
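Sketches for the last two items. Log rotation is a kubelet setting; the values here are illustrative, not tuned for this cluster:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Rotate each container's stdout/stderr log before it can fill the node disk
containerLogMaxSize: "10Mi"   # rotate at 10 MiB per container log
containerLogMaxFiles: 5       # keep at most 5 rotated files per container
```

For the 70% disk alert, assuming a Prometheus + node_exporter setup (adapt to whatever monitoring stack is actually in place), a rule along these lines:

```yaml
groups:
  - name: node-disk
    rules:
      - alert: NodeDiskUsageHigh
        # Fires when any real filesystem on a node stays above 70% used for 10 minutes
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.70
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node filesystem on {{ $labels.instance }} above 70% used"
```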
Lesson: Set ephemeral-storage limits. Always.