jobs-host started throwing "fork: Resource temporarily unavailable" when cron and login shells try to spawn processes, but the box is nearly idle: CPU is quiet, there's plenty of free RAM, and the disk has room. Monitoring shows the process count climbing all day even though almost nothing is doing work.
Something is leaking process-table slots. Find what's accumulating, figure out which process is responsible, and identify the correct fix.
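One way to see what's accumulating and who owns it, without relying on top's display, is to read procfs directly. The following is a minimal Python sketch that assumes a Linux /proc filesystem (as on Alpine). It groups processes in state Z by parent PID and prints each parent's command line. The helper names (read_stat, cmdline, zombies_by_parent) are hypothetical and not part of any tool installed on jobs-host.

```python
import os

def read_stat(pid):
    """Return (comm, state, ppid) from /proc/<pid>/stat, or None if the process is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # comm sits in parentheses and may itself contain spaces or ')',
    # so locate it by the first '(' and the LAST ')'.
    comm = data[data.index("(") + 1:data.rindex(")")]
    state, ppid = data[data.rindex(")") + 2:].split()[:2]
    return comm, state, int(ppid)

def cmdline(pid):
    """Full command line of pid (NUL-separated in procfs), or '?' if unreadable."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            raw = f.read()
    except OSError:
        return "?"
    return " ".join(a.decode(errors="replace") for a in raw.split(b"\0") if a) or "?"

def zombies_by_parent():
    """Map parent PID -> list of child PIDs that are in state Z (zombies)."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        info = read_stat(int(entry))
        if info is not None and info[1] == "Z":
            result.setdefault(info[2], []).append(int(entry))
    return result

if __name__ == "__main__":
    # Largest cluster first: the parent at the top is the one not reaping.
    for ppid, kids in sorted(zombies_by_parent().items(), key=lambda kv: -len(kv[1])):
        print(f"PPID {ppid}: {len(kids)} zombie(s), parent = {cmdline(ppid)}")
```

Grouping by PPID is the useful step: the zombies themselves have empty command lines and no resources, so the only actionable information is which live parent is failing to collect them.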
jobs-host, Alpine, 1 vCPU. A deploy supervisor runs short jobs all day, plus the usual init/cron/shell. fork()s are starting to fail.
You've solved it when:
- You used top to find the cluster of processes in STAT Z (zombies, shown as [python3] defunct) that carry 0% CPU and 0 memory. They're not a CPU/RAM hog; they're consuming PID slots.
- You followed the zombies' parent (PPID 820) and traced it to python3 /srv/deploy/runner.py, the supervisor that spawns job children and never reaps them.
- You ruled out memory and disk (free shows ample RAM, df shows the disk has room) and identified the fix on the parent: restart runner.py (its zombies are then reparented to init, which reaps them), or fix it to .wait()/handle SIGCHLD. You must NOT try to kill the zombies: they're already dead, so killing them does nothing.
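The code-side fix comes down to reaping every child the supervisor creates. runner.py's source isn't shown here, so this is a hypothetical sketch of a fork-based supervisor in Python, assuming jobs are started with os.fork/os.execvp. It installs a SIGCHLD handler that collects exited children with os.waitpid(-1, os.WNOHANG). The names spawn_job, install_reaper and reap_children are illustrative, not taken from runner.py.

```python
import os
import signal

# Hypothetical sketch of a job supervisor that reaps its children.
# runner.py's real code is not shown; these function names are illustrative.

def reap_children(signum, frame):
    """SIGCHLD handler: collect every child that has already exited."""
    # Several exits can be coalesced into a single SIGCHLD delivery,
    # so keep calling waitpid until there is nothing left to collect.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children remain, but none has exited yet

def install_reaper():
    """Register the reaper; Python only allows signal.signal() in the main thread."""
    signal.signal(signal.SIGCHLD, reap_children)

def spawn_job(argv):
    """Fork and exec one short job; the supervisor does not block on it."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # reached only if exec failed
    return pid
```

If the jobs are started with subprocess instead, calling .wait() on each Popen (or using subprocess.run, which waits for you) gives the same result without a signal handler. Pick one approach per process: a catch-all waitpid(-1) reaper can collect a child before Popen.wait() sees its real exit status.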