Zombie Apocalypse: A Parent That Never Reaps
Medium

Problem

jobs-host has started failing with "fork: Resource temporarily unavailable" whenever cron or a login shell tries to spawn a process, yet the box is nearly idle. CPU is quiet, there's plenty of free RAM, and the disk has room. Monitoring shows the process count climbing all day even though almost nothing is doing work.

Something is leaking process-table slots. Find what's accumulating, figure out which process is responsible, and identify the correct fix.
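
As a rough illustration of the "find what's accumulating, then find who owns it" step, the sketch below tallies zombie processes by parent PID from Linux /proc/<pid>/stat entries. It is a minimal sketch, not part of the lab: the function names (parse_stat_line, zombies_by_parent, read_stat_lines) are hypothetical, and it assumes a Linux-style /proc is mounted.

```python
import glob
from collections import Counter


def parse_stat_line(line):
    """Return (pid, comm, state, ppid) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    fields = line[right + 1:].split()
    # After comm, the first field is the state and the second is the PPID.
    return pid, comm, fields[0], int(fields[1])


def zombies_by_parent(stat_lines):
    """Count processes in state Z, keyed by their parent PID."""
    counts = Counter()
    for line in stat_lines:
        _pid, _comm, state, ppid = parse_stat_line(line)
        if state == "Z":
            counts[ppid] += 1
    return counts


def read_stat_lines():
    """Read every /proc/<pid>/stat that is still readable."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as fh:
                lines.append(fh.read())
        except OSError:
            continue  # the process exited between listing and reading
    return lines


if __name__ == "__main__":
    for ppid, count in zombies_by_parent(read_stat_lines()).most_common():
        print(f"PPID {ppid}: {count} zombie(s)")
```

A single parent PID with a large and growing count is the process to investigate; the zombies themselves only point back to it.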

Initial setup

  • Host: jobs-host, Alpine, 1 vCPU.
  • A deploy supervisor runs short jobs all day, alongside the usual init, cron, and shell processes.
  • The process table keeps growing; new fork()s are starting to fail.

Acceptance

You've solved it when:

  • You've used top to find the cluster of processes in STAT Z (zombies / [python3] defunct) that carry 0% CPU and 0 memory. They're not a CPU/RAM hog; they're consuming PID slots.
  • You've noticed every zombie shares the same parent (PPID 820) and traced it to python3 /srv/deploy/runner.py, the supervisor that spawns job children and never reaps them.
  • You've ruled out the red herrings (free shows ample RAM, df shows the disk is fine) and named the correct fix: make the PARENT reap its children (restart runner.py, or fix it to call .wait() / handle SIGCHLD). You must NOT try to kill the zombies. They're already dead, so killing them does nothing.
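
The "make the parent reap its children" fix can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual contents of runner.py: it assumes the supervisor launches jobs with subprocess.Popen, and the names reap_finished and run_jobs are hypothetical. The point it illustrates is that the parent must keep collecting exit statuses (here via Popen.poll(), which waits on the child without blocking) so that finished children leave the process table.

```python
import subprocess
import sys
import time


def reap_finished(children):
    """Split children into (exit_codes, still_running).

    Popen.poll() does a non-blocking wait on the child; once it returns an
    exit code, the child has been reaped and its process-table entry freed.
    """
    exit_codes, still_running = [], []
    for child in children:
        code = child.poll()
        if code is None:
            still_running.append(child)
        else:
            exit_codes.append(code)
    return exit_codes, still_running


def run_jobs(commands, poll_interval=0.05):
    """Start every command, then keep reaping until none are left."""
    running = [subprocess.Popen(cmd) for cmd in commands]
    finished = []
    while running:
        codes, running = reap_finished(running)
        finished.extend(codes)
        if running:
            time.sleep(poll_interval)
    return finished


if __name__ == "__main__":
    # Hypothetical stand-in jobs: three short-lived Python children.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(3)]
    print(run_jobs(jobs))
```

A long-running supervisor would call something like reap_finished on every pass of its main loop rather than only at shutdown. Handling SIGCHLD is an alternative trigger for the same reaping step.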