Support is escalating: customers say their CSV exports never arrive. The site itself is completely fine — pages load, the dashboard works, submitting an export returns the usual "your export is being prepared" (HTTP 202 Accepted). It just... never finishes. No error page, no 5xx, nothing in the web logs that looks wrong.
This is an async-only outage. The web tier accepts the work and hands it off to a queue; something downstream is supposed to pick it up and isn't. Find what stopped consuming.
web is Up, serving 200s/202s normally.
redis (the broker) is Up.
Exports are pushed onto the celery queue and never drain.
$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
NAMES     STATUS
web       Up 5 hours
redis     Up 5 hours
worker    Exited (137) 22 minutes ago
$ docker inspect worker --format '{{.State.OOMKilled}} / {{.State.ExitCode}}'
true / 137
# In real life you'd confirm the backlog is climbing:
$ docker exec redis redis-cli LLEN celery
(integer) 1487
worker is Exited (137) — 128 + 9, a SIGKILL, with State.OOMKilled=true: the worker blew past its memory cap and the kernel killed it. Note its logs just stop mid-stream after receiving task 7c9d — no "Warm shutdown", no clean exit line. A hard kill leaves no shutdown trace; the silence is the signal. While it's been gone, web has kept accepting exports and pushing them onto the celery queue with nothing on the other end.
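The 128 + 9 arithmetic above can be sketched as a small shell helper. This is an illustration, not part of the lab: the function name describe_exit is made up here, and it relies only on the common convention that an exit status above 128 means "terminated by signal (status - 128)". On Linux, signal 9 is SIGKILL.

```shell
#!/bin/sh
# Hypothetical helper: interpret a container exit status.
# Convention: status > 128 means the process was terminated by
# signal number (status - 128).
describe_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        echo "killed by signal $sig"
    elif [ "$code" -eq 0 ]; then
        echo "clean exit"
    else
        echo "exited with error $code"
    fi
}

describe_exit 137   # prints: killed by signal 9
```

The exit status alone only says the process received a SIGKILL; it is State.OOMKilled=true that attributes the kill to the memory cap.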
You've solved it when:
The worker container is dead (Exited (137), OOM-killed), so nothing is consuming the queue; this is a worker outage, not a web or Redis problem.
web and redis are both healthy and not treated as the cause (web keeps enqueuing into a queue no one drains; redis is fine).
The worker is brought back with docker start worker (and, since it was OOM-killed, its memory limit is raised: docker update --memory 512m --memory-swap 512m worker).
worker.status == running and it begins draining the backlog (celery queue length falls toward 0).
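The success criteria above could be exercised with something like the following sketch. It uses only the docker and redis-cli commands already shown in this scenario; the function names (queue_depth, recover_worker, is_draining) and the 10-second sampling interval are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only. Container names (worker, redis), the queue key (celery)
# and the 512m limit are the ones used in this scenario.

# Print the current length of the celery list as a bare number.
# redis-cli may print "(integer) N" or just "N", so keep digits only.
queue_depth() {
    docker exec redis redis-cli LLEN celery | tr -cd '0-9'
}

# Raise the memory cap, start the worker, and confirm it is running.
recover_worker() {
    docker update --memory 512m --memory-swap 512m worker >/dev/null || return 1
    docker start worker >/dev/null || return 1
    status=$(docker inspect worker --format '{{.State.Status}}')
    [ "$status" = "running" ]
}

# Succeed if the backlog shrank between two samples taken $1 seconds apart.
is_draining() {
    before=$(queue_depth)
    sleep "${1:-10}"
    after=$(queue_depth)
    echo "celery backlog: $before -> $after"
    [ "$after" -lt "$before" ]
}

# Usage (commented out so that sourcing this file has no side effects):
# recover_worker && is_draining 10
```

A single pair of samples is a weak test: web is still enqueuing, so a short window can show a flat or rising count even while the worker is consuming. Sampling a few times gives a clearer trend.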
docker CLI only.
web is healthy; the fix does not touch it.
A SIGKILLed worker never runs a retries_exhausted callback; in-flight jobs can be silently lost. What broker-side feature (Celery acks_late, Sidekiq super_fetch) turns a silent kill into at-least-once delivery?
The worker was Up right until it wasn't; there is no "degraded" state in docker ps. What signal should have paged before the backlog hit 1,487: queue depth, consumer count, or oldest-job age?
[...] --restart=on-failure [...] docker ps -a?
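As one possible starting point for the paging question, here is a minimal sketch of a check that combines two of the candidate signals: queue depth and whether a consumer exists. The function name should_page and the MAX_DEPTH threshold of 100 are hypothetical; oldest-job age is left out because LLEN does not report it.

```shell
#!/bin/sh
# Hypothetical threshold for illustration; tune to your normal backlog.
MAX_DEPTH=${MAX_DEPTH:-100}

# Decide whether to page from two inputs: queue depth and worker state.
# Prints the reason and returns 0 when a page is warranted, 1 otherwise.
should_page() {
    depth=$1
    worker_status=$2
    if [ "$worker_status" != "running" ]; then
        echo "PAGE: no consumer (worker is $worker_status), backlog $depth"
        return 0
    fi
    if [ "$depth" -gt "$MAX_DEPTH" ]; then
        echo "PAGE: backlog $depth exceeds $MAX_DEPTH"
        return 0
    fi
    return 1
}

# Usage against the containers in this scenario
# (commented out so that sourcing this file has no side effects):
# depth=$(docker exec redis redis-cli LLEN celery | tr -cd '0-9')
# state=$(docker inspect worker --format '{{.State.Status}}')
# should_page "$depth" "$state"
```

In this incident the consumer check would have fired as soon as the worker exited, well before the depth threshold mattered.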