Debug a container killed by the OOM killer (exit 137)
Medium

Problem

You're paged at 2am. The api service keeps flapping: it restarts, serves traffic for a minute, dies, and restarts again. The PagerDuty alert reads "api restart count > 10 in 1h". Your job is to find why the container is dying and apply a fix that keeps it running.

The app itself is healthy: logs show it boots, answers requests, then is terminated mid-flight. The last line of the log output is the word Killed, with no Python traceback, so the process didn't exit on its own; it was killed by a signal. Something outside the process is killing it.

Initial setup

Three containers on the host:

  • api — your service, exited with an unusual code, restart policy always, restart count 12
  • nginx-proxy — healthy reverse-proxy, up 2h
  • redis — healthy cache, up 2h

Images: myapi:v2.1.0 (current), myapi:v2.0.0 (previous stable).

Example interaction

$ docker ps -a
CONTAINER ID   IMAGE             STATUS                       NAMES
11223344aabb   myapi:v2.1.0      Exited (137) 3 minutes ago   api
2233445566bb   nginx:alpine      Up 2 hours                   nginx-proxy
334455667788   redis:7-alpine    Up 2 hours                   redis

$ docker logs api
...
INFO:     172.17.0.1:54234 - "POST /api/process" 200 OK
Killed

From here you should reach for docker inspect api to see why the kernel killed it — the State block carries the answer.
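
A minimal sketch of that inspection step, assuming the container is named api as above. The helper names decode_exit and oom_summary are made up for illustration; the --format template reads the State.ExitCode, State.OOMKilled, and HostConfig.Memory fields from docker inspect output. By shell convention, an exit code above 128 means the process was terminated by signal (code - 128), so 137 maps to signal 9, SIGKILL.

```shell
#!/bin/sh
# Hypothetical helper: turn an exit code into the signal behind it.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
decode_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo "signal $((code - 128))"
    else
        echo "exit status $code"
    fi
}

# Hypothetical helper: print exit code, OOM flag, and memory limit in bytes
# for one container. HostConfig.Memory is 0 when no limit is set.
oom_summary() {
    docker inspect \
        --format 'exit={{.State.ExitCode}} oom={{.State.OOMKilled}} limit={{.HostConfig.Memory}}' \
        "$1"
}

decode_exit 137      # prints "signal 9" (SIGKILL)
# oom_summary api    # run this on the host that owns the container
```

If the summary reports oom=true next to a non-zero limit, the kill came from the container's own memory cgroup rather than from host-wide memory pressure.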

Acceptance

You've solved it when:

  • You've identified that api was killed by the kernel OOM killer (exit 137 = 128 + 9, i.e. SIGKILL, with State.OOMKilled=true).
  • You've linked the kill to the container's memory limit (256 MB).
  • You've restarted api with a larger memory limit (or another fix that keeps memory under the limit), and api.status == running.
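
One possible way to meet the last criterion with the docker CLI alone is sketched below. It assumes an in-place docker update is acceptable (recreating the container with docker run --memory is the alternative) and uses 512m purely as a placeholder value; the right number depends on what v2.1.0 actually needs. The helper names raise_memory_limit and container_status are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: raise a container's memory limit in place, then start it.
# --memory-swap is set to the same value so the new limit does not conflict
# with a previously stored swap limit; equal values also mean no swap access.
raise_memory_limit() {
    name=$1
    limit=$2    # e.g. 512m; a placeholder, size it from observed usage
    docker update --memory "$limit" --memory-swap "$limit" "$name" &&
        docker start "$name"
}

# Hypothetical helper: report the container's current status string.
container_status() {
    docker inspect --format '{{.State.Status}}' "$1"
}

# Usage on the host (not executed here):
#   raise_memory_limit api 512m
#   container_status api    # the acceptance check wants "running"
```

Raising the limit only buys headroom. If memory use in v2.1.0 grows without bound, the container will hit the new ceiling too, so it is worth watching usage for a few minutes after the restart.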

Constraints

  • Tools: docker CLI only. No kubectl, terraform, or cloud CLIs.
  • Time limit: 20 minutes (realistic incident-response budget).
  • Don't downgrade to v2.0.0 as the fix — you need to make v2.1.0 survive. Rolling back is a separate question and not the acceptance criterion here.

Follow-up

  1. Would setting --restart=on-failure:3 (instead of always) have surfaced this alert sooner, or hidden it longer?
  2. What docker stats api number, in absolute terms, would have been a leading indicator for this OOM before it fired?
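
For the second question, a rough sketch of how a usage sample could be taken and compared against a threshold. The 80 percent threshold is an arbitrary assumption, not an answer key, and mem_over and mem_sample are hypothetical helper names; MemUsage and MemPerc are placeholders accepted by docker stats --format.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a MemPerc value such as "87.50%"
# is at or above a threshold percentage.
mem_over() {
    perc=$(printf '%s' "$1" | tr -d '%')    # drop the percent sign
    awk -v p="$perc" -v t="$2" 'BEGIN { exit !(p + 0 >= t + 0) }'
}

# Hypothetical helper: one-shot sample of usage/limit and percentage.
mem_sample() {
    docker stats --no-stream --format '{{.MemUsage}} {{.MemPerc}}' "$1"
}

# Example with a made-up reading; 80 is an assumed warning threshold.
mem_over "87.50%" 80 && echo "warn: close to the limit"
# mem_sample api    # run on the host to get a real reading
```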