Two outages at once, one cause: the shared Redis is a single point of failure

Problem

Two pages fired in the same minute and it looks like two separate fires:

The web team sees the dashboard throwing 500s, with logs full of redis ... Connection refused.

The data team sees the background worker stop processing: emails and notifications stalled, the worker logging `Cannot connect to redis://redis:6379`.

It is tempting to split up and debug two outages. Don't. When two services that don't talk to each other fail at the same moment and name the same host in their errors, that is a tell: they share something. Find the one thing both depend on.

What you can observe right now

web is Up, but every request that touches the cache 500s with redis.exceptions.ConnectionError: Error 111 connecting to redis:6379.

worker is Up, but its consumer loop is stuck retrying Cannot connect to redis://redis:6379/0.

The same token, redis:6379, appears in the logs of both otherwise unrelated services.
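Spotting the shared token by eye works for two services; a small helper makes the same check repeatable. The following is a minimal sketch, not part of the exercise tooling: the function names extract_targets and shared_targets, the file names web.log and worker.log, and the host:port pattern are all assumptions made for illustration.

```shell
# extract_targets: read log text on stdin and print each distinct
# host:port token once. Heuristic only: a name starting with a letter,
# a colon, then a 2-5 digit port. Some timestamps may slip through as noise.
extract_targets() {
  grep -oE '[A-Za-z][A-Za-z0-9._-]*:[0-9]{2,5}' | sort -u
}

# shared_targets FILE_A FILE_B: print the tokens that occur in both files.
shared_targets() {
  targets_a=$(extract_targets < "$1")
  # -F: fixed strings, -x: whole-line match, one pattern per line of $targets_a
  extract_targets < "$2" | grep -Fx -- "$targets_a"
}

# Hypothetical usage against the two containers from this exercise:
#   docker logs --since 10m web    > web.log    2>&1
#   docker logs --since 10m worker > worker.log 2>&1
#   shared_targets web.log worker.log
```

Any token this prints is a candidate shared dependency, worth checking before anyone splits into two debugging efforts.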

Example interaction

$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
NAMES    STATUS
web      Up 8 hours
worker   Up 8 hours
redis    Exited (137) 90 seconds ago

$ docker inspect redis --format '{{.State.OOMKilled}} / {{.State.ExitCode}}'
true / 137

$ docker network inspect appnet
[
    {
        "Name": "appnet",
        "Containers": {
            "web":    { "Name": "web",    "EndpointState": "connected", "ipv4_address": "172.32.0.10" },
            "worker": { "Name": "worker", "EndpointState": "connected", "ipv4_address": "172.32.0.20" }
        }
    }
]

Note that redis is absent from the network's active members, because an exited container drops its endpoint, while web and worker are still attached. redis is Exited (137), meaning it was killed with SIGKILL, and State.OOMKilled=true: one dead container. Both web (which uses it as a cache) and worker (which uses it as a broker) point at the same redis:6379. That single Redis is a shared dependency and a single point of failure: when it died, the blast radius hit two tiers at once. Two symptoms, one root cause.
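The reading of 137 as SIGKILL follows the usual convention that an exit code above 128 is 128 plus the signal number. A minimal sketch of that arithmetic, with signal_from_exit as a hypothetical helper name:

```shell
# signal_from_exit CODE: for exit codes above 128, print the signal
# number (CODE - 128); otherwise print "none".
signal_from_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo "none"
  fi
}

# Example: signal_from_exit 137 prints 9, the number of SIGKILL.
```

The exit code alone only says the process was killed; it is State.OOMKilled=true that attributes the kill to memory pressure rather than to a manual docker kill.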

Acceptance

You've solved it when:

You've correctly identified that the two symptoms have one shared root cause: the single redis container is down (Exited (137), OOM-killed), and both web (cache) and worker (broker) depend on it. These are not two independent outages.

You've shown how you proved it was shared (the same redis:6379 in both services' logs, and/or docker network inspect or each container's Redis target showing the same single node).

You restore the shared dependency: docker start redis. Since it OOM-died with no memory cap, the real fix is to set maxmemory plus an eviction policy, and ideally to split the cache Redis from the broker Redis so one death can't take down two tiers.

After the fix: redis.status == running, web's cache 500s stop, and worker's consumer reconnects.
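The restore and the memory cap described above can be sketched with the docker CLI alone. This is a sketch under stated assumptions, not the reference solution: the function names are hypothetical, 256mb and allkeys-lru are placeholder values rather than figures from the incident, and it assumes redis-cli is available inside the container and that no password is set.

```shell
# restore_redis: start the existing container (container and data are kept).
restore_redis() {
  docker start redis
}

# cap_redis: set a memory ceiling and an eviction policy on the running
# server. 256mb and allkeys-lru are placeholders chosen for illustration.
cap_redis() {
  docker exec redis redis-cli CONFIG SET maxmemory 256mb
  docker exec redis redis-cli CONFIG SET maxmemory-policy allkeys-lru
}

# redis_running: succeed only if Docker reports the container as running.
redis_running() {
  [ "$(docker inspect redis --format '{{.State.Status}}')" = "running" ]
}

# Hypothetical usage:
#   restore_redis && cap_redis && redis_running && echo "redis is back"
```

Two caveats apply. CONFIG SET changes the running server only, so the values are lost on the next restart unless they also go into the Redis configuration. And an eviction policy such as allkeys-lru may discard queued jobs when the same instance is also the broker, which is one argument for splitting cache from broker.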

Constraints

Tools: docker CLI only.
Don't restart web or worker as the fix. They are healthy clients of a dead dependency; they recover when Redis comes back.

Don't delete the Redis data/container.

Follow-up

Raidbots had this exact incident: one Redis for the queue and login sessions. After restoring service, what is the single highest-leverage change (maxmemory plus eviction, a restart policy, a replica, or splitting cache from broker), and why?

web and worker were both Up the whole time; docker ps never showed them as unhealthy. What does that teach you about alerting on symptoms (5xx, queue depth) versus alerting on the dependency itself?

node-redis surfaces a dropped established connection as SocketClosedUnexpectedlyError: Socket closed unexpectedly rather than ECONNREFUSED. Why does "was connected, then dropped" look different from "never connected", and which one tells you the container died mid-flight?
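For the question about alerting on the dependency itself, it may help to see what such a check could look like. The following is one possible sketch, not a recommended monitoring design: dependency_state and check_dependencies are hypothetical names, and the check only reads container state, so it says nothing about whether Redis is actually answering commands.

```shell
# dependency_state NAME: print "NAME <status>" for a container, or
# "NAME missing" if docker inspect fails (for example, no such container).
dependency_state() {
  status=$(docker inspect "$1" --format '{{.State.Status}}' 2>/dev/null) || status=missing
  echo "$1 $status"
}

# check_dependencies NAME...: print every container's state and return
# non-zero if any of them is not running.
check_dependencies() {
  rc=0
  for name in "$@"; do
    line=$(dependency_state "$name")
    echo "$line"
    case "$line" in
      *" running") ;;
      *) rc=1 ;;
    esac
  done
  return "$rc"
}

# Hypothetical usage:
#   check_dependencies web worker redis || echo "a dependency is down"
```

Run during this incident, a check like this would report redis as exited while web and worker still show running, which is the contrast the question is pointing at.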