Cgroup OOM-Kill: A Capped Service Keeps Dying on a Healthy Host
Medium

Problem

The api service on api-host keeps dying. You see its python3 process in ps, holding steady for a while, then it's gone — and a moment later it's back with a new PID (something keeps relaunching it). There is NO Python traceback anywhere in the app's own logs: the process isn't crashing, something is killing it.

The box itself looks healthy. Find out what is killing the service and prove it — then say how you'd stop it.

Initial setup

  • Host: api-host, Alpine, cgroup v2 enabled.
  • Service: python3 -m uvicorn api:app runs under a 256 MB memory cgroup.
  • The host has plenty of RAM free; the disk is fine.
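
On cgroup v2, the limit and the kill counter for a cgroup are exposed as small text files (memory.max and memory.events) in that cgroup's directory. The sketch below shows one way to read them. The directory passed to read_cgroup is hypothetical, since this brief does not say where the service's cgroup lives, and the sample counter values are illustrative rather than taken from api-host.

```python
from pathlib import Path


def parse_flat_keyed(text):
    """Parse a cgroup v2 flat-keyed file such as memory.events into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def parse_limit(text):
    """memory.max holds a byte count, or the literal 'max' when unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_cgroup(cgroup_dir):
    """Read the limit and event counters from a cgroup v2 directory.

    cgroup_dir is an assumption: pass whatever directory under
    /sys/fs/cgroup actually holds the service.
    """
    base = Path(cgroup_dir)
    limit = parse_limit((base / "memory.max").read_text())
    events = parse_flat_keyed((base / "memory.events").read_text())
    return limit, events


# Illustrative file contents, not captured from the lab host.
SAMPLE_MAX = "268435456\n"
SAMPLE_EVENTS = "low 0\nhigh 0\nmax 37\noom 3\noom_kill 3\n"

limit_bytes = parse_limit(SAMPLE_MAX)
events = parse_flat_keyed(SAMPLE_EVENTS)
print(limit_bytes // (1024 * 1024), "MiB limit,", events["oom_kill"], "OOM kills")
```

A non-zero oom_kill counter that keeps climbing while the host has free RAM points at the cgroup limit rather than the machine.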

Acceptance

You've solved it when:

  • You've shown via dmesg that the kernel cgroup OOM-killer killed the service ("Memory cgroup out of memory: Killed process 4711 (python3)"), with the "memory: usage 262144kB, limit 262144kB" line proving usage hit the cgroup cap (262144 kB is exactly 256 MB).
  • You've ruled OUT global memory pressure with free: the host has ~1.4 GB available, so this is the per-cgroup limit, not the host.
  • You've named the fix: raise the cgroup memory limit (cgroup v2 memory.max / docker update --memory / systemd MemoryMax=), or shrink the workload so peak RSS fits under the cap.
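
The dmesg evidence in the first criterion can also be checked mechanically. The following is a minimal sketch, assuming the two kernel log lines appear in the form quoted above; the summarize helper and its regular expressions are illustrative, and real kernel lines usually carry extra fields after the parts matched here.

```python
import re

# Patterns cover only the fragments quoted in the acceptance criteria.
KILL_RE = re.compile(
    r"Memory cgroup out of memory: Killed process (\d+) \(([^)]+)\)"
)
USAGE_RE = re.compile(r"memory: usage (\d+)kB, limit (\d+)kB")


def summarize(log_text):
    """Return the killed PID, command, and whether usage reached the limit."""
    kill = KILL_RE.search(log_text)
    usage = USAGE_RE.search(log_text)
    if kill is None or usage is None:
        return None
    used_kb = int(usage.group(1))
    limit_kb = int(usage.group(2))
    return {
        "pid": int(kill.group(1)),
        "comm": kill.group(2),
        "usage_kb": used_kb,
        "limit_kb": limit_kb,
        "at_limit": used_kb >= limit_kb,
    }


# The two lines quoted in the acceptance criteria, used as sample input.
SAMPLE = (
    "memory: usage 262144kB, limit 262144kB\n"
    "Memory cgroup out of memory: Killed process 4711 (python3)\n"
)

result = summarize(SAMPLE)
print(result)
```

On the host, the same function could be fed the text that dmesg prints. A result with at_limit set, next to a free report showing plenty of available memory, is the combination the criteria ask for.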