App is throwing 500s — but the app is fine, the database was OOM-killed
Hard

Problem

Pager just went off: web is returning HTTP 500s on every endpoint that reads data. The health check (/healthz, which doesn't touch the database) still returns 200, so web's own process is clearly alive.

The reflex is to restart web. Resist it. Figure out what actually broke before you touch anything — the 500s are a symptom, and the thing throwing them is not necessarily the thing that's broken.

What you can observe right now

  • web is Up and answering /healthz with 200.
  • Every DB-backed endpoint returns 500, and web's logs show repeated psycopg2.OperationalError ... Connection refused to the database on port 5432.
  • db is not in the running set.

Example interaction

$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
NAMES   STATUS
web     Up 2 hours
db      Exited (137) 3 minutes ago

$ docker inspect db --format '{{.State.OOMKilled}} / {{.State.ExitCode}} / {{.HostConfig.Memory}}'
true / 137 / 268435456

Exit code 137 is 128 + 9 — the process was killed by SIGKILL, and State.OOMKilled=true says the kill came from the cgroup out-of-memory killer: db blew past its 256 MiB (268435456-byte) memory cap and the kernel terminated it. The app was never broken — its database died of memory pressure.
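
The arithmetic above can be checked in the shell. This is a minimal sketch; decode_exit and bytes_to_mib are hypothetical helper names, not docker features. It relies only on the convention that an exit code above 128 means "killed by signal (code - 128)", and on 1 MiB being 1024 * 1024 bytes.

```shell
# Hypothetical helper: interpret a container exit code.
# Codes above 128 mean the process was killed by signal (code - 128);
# signal 9 is SIGKILL.
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with status $code"
  fi
}

# Hypothetical helper: convert a byte count such as HostConfig.Memory to MiB.
bytes_to_mib() {
  echo "$(( $1 / 1024 / 1024 )) MiB"
}

decode_exit 137          # prints: killed by signal 9
bytes_to_mib 268435456   # prints: 256 MiB
```

Exit code 137 on its own only says "SIGKILL"; it is the State.OOMKilled field from docker inspect that attributes the kill to the memory limit.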

Acceptance

You've solved it when:

  1. You've stated the root cause correctly: db was OOM-killed (exit 137, State.OOMKilled=true) because its container memory limit was too tight. This is a database problem, not a web problem.
  2. You've not restarted web as the fix. It's healthy, and restarting it changes nothing: the 500s return the instant it serves a DB request.
  3. You bring db back with more memory headroom so it doesn't immediately re-OOM, e.g. docker update --memory 1g --memory-swap 1g db (or remove the cap / lower shared_buffers/work_mem), then docker start db.
  4. After the fix: db.status == running, web's DB-backed endpoints stop returning 500, and the Connection refused errors stop.
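
One way the recovery in steps 3 and 4 might be scripted is sketched below. The run and recover_db functions and the DRY_RUN switch are hypothetical conveniences, not part of docker; the 1g limit is the example value from step 3, not a sizing recommendation. With DRY_RUN=1 the script only prints the docker commands it would run, so the sequence can be reviewed before it touches the container.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: raise the memory cap on a stopped container,
# start it, then report its state. It never touches web.
recover_db() {
  name=${1:-db}
  limit=${2:-1g}   # assumed headroom; size it to the real workload
  # Raise --memory and --memory-swap together so the swap limit
  # is never lower than the new memory limit.
  run docker update --memory "$limit" --memory-swap "$limit" "$name" &&
  run docker start "$name" &&
  run docker inspect "$name" --format '{{.State.Status}} / OOMKilled={{.State.OOMKilled}}'
}

# Preview only: prints the three docker commands without running them.
DRY_RUN=1 recover_db db 1g
```

Dropping DRY_RUN=1 runs the commands for real. To confirm the last acceptance point from web's side, docker logs --since 1m web should no longer show new Connection refused lines.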

Constraints

  • Tools: docker CLI only.
  • Do not delete db or its data. Do not rebuild images.
  • web's code and image are fine — the fix does not touch web.

Follow-up

  1. If db had been started with --restart=on-failure, how would docker ps -a have looked, and why can a too-tight memory limit turn --restart into a crash-loop rather than a recovery?
  2. State.OOMKilled is true here because the container's cgroup limit fired. When would exit 137 on a DB show OOMKilled=false even though memory was the cause, and where would you look instead?
  3. What deploy-time guardrail catches "Postgres with a 256 MiB cap" before prod: a minimum-memory admission check, a lint comparing shared_buffers against --memory, or an alert on container_memory_failcnt?