Pager just went off: web is returning HTTP 500s on every endpoint that reads data. The health check (/healthz, which doesn't touch the database) still returns 200, so web's own process is clearly alive.
The reflex is to restart web. Resist it. Figure out what actually broke before you touch anything — the 500s are a symptom, and the thing throwing them is not necessarily the thing that's broken.
- web is Up and answering /healthz with 200.
- web's logs show psycopg2.OperationalError ... Connection refused to the database on port 5432.
- db is not in the running set.

$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
NAMES STATUS
web Up 2 hours
db Exited (137) 3 minutes ago
$ docker inspect db --format '{{.State.OOMKilled}} / {{.State.ExitCode}} / {{.HostConfig.Memory}}'
true / 137 / 268435456
Exit code 137 is 128 + 9 — the process was killed by SIGKILL, and State.OOMKilled=true says the kill came from the cgroup out-of-memory killer: db blew past its 256 MiB (268435456-byte) memory cap and the kernel terminated it. The app was never broken — its database died of memory pressure.
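The arithmetic above is easy to check mechanically. This is a minimal sketch, assuming a POSIX shell; decode_exit and bytes_to_mib are hypothetical helper names, not part of the docker CLI.

```shell
# Hypothetical helpers (not part of docker) for reading the inspect output.

# decode_exit CODE: by shell convention, an exit code above 128 means
# "killed by signal CODE - 128"; anything else is a plain exit status.
decode_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo "signal $((code - 128))"
    else
        echo "exit $code"
    fi
}

# bytes_to_mib BYTES: convert a HostConfig.Memory value to whole MiB.
bytes_to_mib() {
    echo "$(( $1 / 1024 / 1024 ))"
}

decode_exit 137          # prints: signal 9
bytes_to_mib 268435456   # prints: 256
```

Signal 9 is SIGKILL. The exit code alone does not say who sent it, which is why State.OOMKilled is the field that pins the kill on the memory cap.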
You've solved it when:
- You can show that db was OOM-killed (State.OOMKilled=true) because its container memory limit was too tight: this is a database problem, not a web problem.
- You do not restart web as the fix (it's healthy; restarting it would not bring db back).
- You bring db back with more memory headroom so it doesn't get OOM-killed again:
  docker update --memory 1g --memory-swap 1g db
  (or remove the cap / lower shared_buffers / work_mem), then
  docker start db.
- db.status == running and web's DB-backed endpoints return 200 again; the Connection refused errors stop.
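The recovery steps in this checklist can be sketched as one shell function. Treat it as an illustration under stated assumptions: the container is named db as in the docker ps output, the 1g values are the ones suggested above, and revive_db is a hypothetical name.

```shell
# Sketch of the recovery sequence. Only db is touched; web is left alone.
# Assumes the container is named db, as in the docker ps output.
revive_db() {
    # Raise the cap on the stopped container. Setting --memory-swap equal
    # to --memory means the container gets no swap on top of its RAM cap.
    docker update --memory 1g --memory-swap 1g db || return 1
    docker start db || return 1
    # Confirm the resulting state instead of trusting the start command.
    state=$(docker inspect db --format '{{.State.Status}}') || return 1
    [ "$state" = "running" ]
}

# Usage (not run here):
#   revive_db && echo "db is running"
```

A running state right after start only proves the container came up. Re-check it after the database has taken some load, and confirm that web's DB-backed endpoints have stopped returning 500s.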
Constraints:
- Use the docker CLI only.
- Do not delete db or its data. Do not rebuild images.
- web's code and image are fine; the fix does not touch web.

Follow-up questions:
- If db had been started with --restart=on-failure, how would docker ps -a have looked, and why can a too-tight memory limit turn --restart into a crash-loop rather than a recovery?
- State.OOMKilled is true here because the container's cgroup limit was hit. When could you see OOMKilled=false even though memory was the cause, and where would you look instead?
- What would have caught this earlier: a lint that checks shared_buffers against --memory, or an alert on container_memory_failcnt?
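To make the lint idea concrete, here is one possible shape for it. Everything in it is an assumption for illustration: the function name lint_shared_buffers, the byte-valued inputs, and especially the 40% default threshold, which is not a value taken from this incident.

```shell
# Hypothetical lint: warn when shared_buffers takes too large a share of
# the container's --memory cap. The 40% default is an assumed threshold.
lint_shared_buffers() {
    shared_buffers_bytes=$1
    memory_limit_bytes=$2
    max_percent=${3:-40}
    # HostConfig.Memory is 0 when no cap is set, so there is nothing to compare.
    if [ "$memory_limit_bytes" -eq 0 ]; then
        echo "ok: no memory cap"
        return 0
    fi
    percent=$(( shared_buffers_bytes * 100 / memory_limit_bytes ))
    if [ "$percent" -gt "$max_percent" ]; then
        echo "warn: shared_buffers is ${percent}% of the memory cap"
        return 1
    fi
    echo "ok: shared_buffers is ${percent}% of the memory cap"
}

# Usage (not run here), with an assumed 128 MiB shared_buffers against
# the 256 MiB cap from the inspect output:
#   lint_shared_buffers 134217728 268435456
```

The second argument can come straight from docker inspect db --format '{{.HostConfig.Memory}}'. The shared_buffers value has to be read from the database's own configuration, which this scenario does not show.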