You're paged at 2am. The api service keeps flapping: it restarts, serves traffic for a minute, dies, and restarts again. The PagerDuty alert reads "api restart count > 10 in 1h". Your job is to find out why the container is dying and apply a fix that keeps it running.
The app itself looks healthy: the logs show it booting, answering requests, and then being terminated mid-flight. The last line of the log output is the single word Killed, with no Python traceback. The process did not exit on its own; it was killed by the kernel. Something outside the process is killing it.
Three containers on the host:
- api: your service, exited with an unusual code, restart policy always, restart count 12
- nginx-proxy: healthy reverse proxy, up 2h
- redis: healthy cache, up 2h
Images: myapi:v2.1.0 (current), myapi:v2.0.0 (previous stable).
$ docker ps -a
CONTAINER ID   IMAGE            STATUS                       NAMES
11223344aabb   myapi:v2.1.0     Exited (137) 3 minutes ago   api
2233445566bb   nginx:alpine     Up 2 hours                   nginx-proxy
334455667788   redis:7-alpine   Up 2 hours                   redis
$ docker logs api
...
INFO: 172.17.0.1:54234 - "POST /api/process" 200 OK
Killed
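The exit status in the docker ps output is worth decoding before going further. By shell convention, a status above 128 usually means the process was terminated by a signal numbered status minus 128. The sketch below applies that arithmetic; the function name signal_from_exit_code is hypothetical, and an application can in principle return such a status itself, so treat the result as a strong hint rather than proof.

```shell
# Decode an exit status into the signal that most likely ended the process.
# Convention: status > 128 means "terminated by signal (status - 128)".
# signal_from_exit_code is a hypothetical helper name, not a docker command.
signal_from_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0   # the process exited on its own; no signal involved
  fi
}

signal_from_exit_code 137   # prints 9, the number of SIGKILL
```

Signal 9 cannot be caught or handled by the process, which is consistent with the missing traceback.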
From here you should reach for docker inspect api to see why the kernel killed it — the State block carries the answer.
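As a rough sketch of reading that State block from a script, the helper below pulls two fields with docker inspect --format and turns them into a one-word verdict. State.ExitCode and State.OOMKilled are standard inspect fields; the function name oom_verdict and the verdict strings are made up for illustration.

```shell
# Classify why a container stopped, using two fields from docker inspect.
# oom_verdict and its output strings are hypothetical, for illustration only.
oom_verdict() {
  name="$1"
  # Prints e.g. "137 true": the exit code, a space, then the OOM flag.
  state=$(docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' "$name") || return 1
  exit_code=${state% *}    # text before the space
  oom_killed=${state#* }   # text after the space
  if [ "$oom_killed" = "true" ]; then
    echo "oom-killed"
  elif [ "$exit_code" = "137" ]; then
    # SIGKILL from somewhere else, e.g. docker kill or a manual kill -9
    echo "sigkill-not-oom"
  else
    echo "other"
  fi
}
```

Calling oom_verdict api on the host would print one of the three verdicts. To see the limit the container was running under, docker inspect --format '{{.HostConfig.Memory}}' api reports it in bytes, with 0 meaning no limit is set.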
You've solved it when:
- You have established that api was killed by the kernel OOM killer (exit 137, State.OOMKilled=true).
- api is running again with a larger memory limit (or another fix that keeps memory under the limit), and api.status == running.
Constraints:
- Use the docker CLI only. No kubectl, terraform, or cloud CLIs.
- Do not roll back to v2.0.0 as the fix; you need to make v2.1.0 survive. Rolling back is a separate question and not the acceptance criterion here.
Follow-up questions:
- Would --restart=on-failure:3 (instead of always) have surfaced this alert sooner, or hidden it longer?
- Which docker stats api number, in absolute terms, would have been a leading indicator for this OOM before it fired?
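One possible shape for the "larger memory limit" fix, using only the docker CLI, is sketched here. It raises the limit in place with docker update, starts the container, and reports its status. The function name raise_memory_limit is hypothetical, and the limit passed to it is an assumption you would need to choose from the container's observed memory use; nothing in the scenario states the right value.

```shell
# Raise the memory limit on an existing container, start it, and print
# its resulting status. raise_memory_limit is a hypothetical helper.
raise_memory_limit() {
  name="$1"
  limit="$2"   # assumed value such as 1g; derive it from observed usage
  # Set --memory-swap together with --memory so the new limit cannot
  # exceed a swap limit left over from the previous configuration.
  docker update --memory "$limit" --memory-swap "$limit" "$name" >/dev/null || return 1
  docker start "$name" >/dev/null || return 1
  docker inspect --format '{{.State.Status}}' "$name"
}
```

Running raise_memory_limit api 1g should print running if the container came up. Because the original failure took about a minute to appear, a status of running right after start is not conclusive; checking State.Status and RestartCount again after a few minutes of traffic is a safer confirmation.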