The site is up, but nothing async works: a dead worker and a growing backlog

Problem

Support is escalating: customers say their CSV exports never arrive. The site itself is completely fine — pages load, the dashboard works, submitting an export returns the usual "your export is being prepared" (HTTP 202 Accepted). It just... never finishes. No error page, no 5xx, nothing in the web logs that looks wrong.

This is an async-only outage. The web tier accepts the work and hands it off to a queue; something downstream is supposed to pick it up and isn't. Find what stopped consuming.

What you can observe right now

web is Up, serving 200s/202s normally.
redis (the broker) is Up.
Export jobs are piling up in the celery queue and never drain.
There are no errors on the web tier; the trap is to go hunting there anyway.

Example interaction

$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
NAMES    STATUS
web      Up 5 hours
redis    Up 5 hours
worker   Exited (137) 22 minutes ago

$ docker inspect worker --format '{{.State.OOMKilled}} / {{.State.ExitCode}}'
true / 137

# In real life you'd confirm the backlog is climbing:
$ docker exec redis redis-cli LLEN celery
(integer) 1487

worker is Exited (137) — 128 + 9, a SIGKILL, with State.OOMKilled=true: the worker blew past its memory cap and the kernel killed it. Note its logs just stop mid-stream after receiving task 7c9d — no "Warm shutdown", no clean exit line. A hard kill leaves no shutdown trace; the silence is the signal. While it's been gone, web has kept accepting exports and pushing them onto the celery queue with nothing on the other end.
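The 128 + 9 arithmetic above can be checked from any shell. The following is a minimal sketch, assuming a POSIX-style shell; decode_exit is a hypothetical helper name, not part of the scenario. It relies on the convention that a status above 128 means "terminated by signal (status - 128)".

```shell
# decode_exit is a hypothetical helper: it maps a container exit status to the
# signal that ended the process, using the shell's 128 + N convention.
decode_exit() {
  status=$1
  if [ "$status" -gt 128 ]; then
    # kill -l N prints the name of signal number N (9 is KILL)
    echo "signal $(kill -l "$((status - 128))")"
  else
    echo "exit $status"
  fi
}

decode_exit 137   # the status reported for the worker above
```

Exit status 137 on its own only says "SIGKILL"; a docker kill, or a docker stop that timed out, produces the same number. State.OOMKilled=true is what attributes the kill to the memory cap.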

Acceptance

You've solved it when:

You've stated the root cause: the worker container is dead (Exited (137), OOM-killed), so nothing is consuming the queue. This is a worker outage, not a web or Redis problem.

You've confirmed web and redis are both healthy and have not treated either one as the fix (restarting web just keeps enqueuing into a queue no one drains; redis is fine).

You bring the worker back with docker start worker and, since it OOM-died, give it headroom (ideally before starting it) so it doesn't re-die under the backlog it now has to chew through, e.g. docker update --memory 512m --memory-swap 512m worker.

After the fix: worker.status == running and it begins draining the backlog (the celery queue length falls toward 0).
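The recovery and the "is it draining?" check can be sketched as a few shell functions built only on the docker CLI. This is a sketch under stated assumptions, not a prescribed procedure: the container names worker and redis and the celery list key come from the example interaction; 512m is the example figure, not a measured requirement; the function names and the 10-second sampling interval are hypothetical.

```shell
# Hypothetical helpers; names and defaults are illustrative only.

# Raise the memory cap first, then start the stopped container, so the worker
# does not come back up under the same limit that got it killed.
revive_worker() {
  mem=${1:-512m}   # assumed headroom; size it from the real export workload
  docker update --memory "$mem" --memory-swap "$mem" worker &&
    docker start worker
}

# Print the current length of the celery list. LLEN is read-only, so no queued
# job is touched. tr keeps only the digits, whether redis-cli prints
# "(integer) 1487" or a bare "1487".
queue_depth() {
  docker exec redis redis-cli LLEN celery | tr -cd '0-9'
}

# Succeed only if the worker is running AND the backlog shrank between two samples.
worker_is_draining() {
  interval=${1:-10}   # seconds between samples (assumed value)
  [ "$(docker inspect worker --format '{{.State.Status}}')" = "running" ] || return 1
  before=$(queue_depth)
  sleep "$interval"
  after=$(queue_depth)
  echo "backlog: $before -> $after"
  [ "$after" -lt "$before" ]
}
```

Typical use would be revive_worker followed by worker_is_draining. Comparing two samples matters because web is still enqueuing: a single LLEN reading cannot distinguish a draining queue from a growing one.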

Constraints

Tools: docker CLI only.
Do not flush Redis or delete queued jobs: the backlog is real work that must be processed, not discarded.

web is healthy; the fix does not touch it.

Follow-up

A hard-killed worker leaves no shutdown log and (for Sidekiq) fires no retries_exhausted callback, so in-flight jobs can be silently lost. What acknowledgement or recovery feature (Celery acks_late, Sidekiq super_fetch) turns a silent kill into at-least-once delivery?

The worker was Up right until it wasn't; there is no "degraded" state in docker ps. What signal should have paged before the backlog hit 1,487: queue depth, consumer count, or oldest-job age?

The worker OOM-died chewing a large export. Would --restart=on-failure have recovered it, or crash-looped it against the same job? And how would you tell the difference in docker ps -a?