Two outages at once, one cause: the shared Redis is a single point of failure
Hard

Problem

Two pages fired in the same minute and it looks like two separate fires:

  • The web team sees the dashboard throwing 500s, with logs full of redis ... Connection refused.
  • The data team sees the background worker stop processing: emails and notifications are stalled, and the worker logs `Cannot connect to redis://redis:6379`.

It is tempting to split up and debug two outages. Don't. Two services that don't talk to each other, failing in the same minute with errors that name the same host, is a tell: they share something. Find the one thing both depend on.

What you can observe right now

  • web is Up, but every request that touches the cache returns a 500 with redis.exceptions.ConnectionError: Error 111 connecting to redis:6379.
  • worker is Up, but its consumer loop is stuck retrying Cannot connect to redis://redis:6379/0.
  • The same token, redis:6379, appears in the logs of both otherwise unrelated services.
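
One way to make that shared token stand out is to pull every host:port pair out of each service's logs and intersect the two sets. The following is only a sketch: the helper names extract_targets and shared_targets are hypothetical, the regex is a rough heuristic (it would also match clock times such as 12:34), and it assumes the logs were first saved to files with docker logs.

```shell
# Hypothetical helpers; assumes the logs were saved first, for example:
#   docker logs web    > web.log    2>&1
#   docker logs worker > worker.log 2>&1

# Print each distinct host:port token read from stdin, sorted.
extract_targets() {
  grep -oE '[A-Za-z0-9_.-]+:[0-9]{2,5}' | sort -u
}

# Print the host:port tokens that occur in both log files.
shared_targets() {
  first=$(extract_targets < "$1")
  # Nothing found in the first file means nothing can be shared.
  [ -n "$first" ] || return 0
  # -F: fixed strings, -x: whole-line match, one pattern per line of $first
  extract_targets < "$2" | grep -Fx -e "$first"
}
```

Run as shared_targets web.log worker.log. With the two log lines quoted above, the only token common to both files is redis:6379.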

Example interaction

$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
NAMES    STATUS
web      Up 8 hours
worker   Up 8 hours
redis    Exited (137) 90 seconds ago

$ docker inspect redis --format '{{.State.OOMKilled}} / {{.State.ExitCode}}'
true / 137

$ docker network inspect appnet
[
    {
        "Name": "appnet",
        "Containers": {
            "web":    { "Name": "web",    "EndpointState": "connected", "ipv4_address": "172.32.0.10" },
            "worker": { "Name": "worker", "EndpointState": "connected", "ipv4_address": "172.32.0.20" }
        }
    }
]

Note that redis is absent from the network's active members, because an exited container drops its endpoint, while web and worker are still attached. redis shows Exited (137): exit code 137 is 128 + 9, meaning the process was killed with SIGKILL, and State.OOMKilled=true says the kill came from the out-of-memory killer. That is one dead container. Both web (which uses it as a cache) and worker (which uses it as a broker) point at the same redis:6379. That single Redis is a shared dependency and a single point of failure: when it died, the blast radius hit two tiers at once. Two symptoms, one root cause.
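
The exit status can be decoded mechanically. By convention, a status above 128 means the process died from a signal whose number is the status minus 128, so 137 maps to signal 9 (SIGKILL). The helper below is a hypothetical illustration (explain_exit is not a docker command) that turns the two fields printed by docker inspect into a short sentence.

```shell
# Hypothetical helper: explain a container's exit from two inspect fields.
# Possible usage (fields separated by a space so they split into two args):
#   explain_exit $(docker inspect redis --format '{{.State.OOMKilled}} {{.State.ExitCode}}')
explain_exit() {
  oom=$1
  code=$2
  if [ "$code" -gt 128 ]; then
    # Status 128 + N means "terminated by signal N".
    reason="killed by signal $((code - 128))"
  else
    reason="exited with status $code"
  fi
  if [ "$oom" = "true" ]; then
    reason="$reason (OOM-killed)"
  fi
  echo "$reason"
}
```

Calling explain_exit true 137 prints "killed by signal 9 (OOM-killed)". A 137 with OOMKilled=false would point elsewhere, for example a manual docker kill.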

Acceptance

You've solved it when:

  1. You've correctly identified that the two symptoms have one shared root cause, not two independent outages: the single redis container is down (Exited (137), OOM-killed), and both web (cache) and worker (broker) depend on it.
  2. You've shown how you proved it was shared: the same redis:6379 appears in both services' logs, and/or docker network inspect or each container's configured Redis target shows that both point at the one node.
  3. You restore the shared dependency with docker start redis. Because it OOM-died with no memory cap, the real fix is to set maxmemory plus an eviction policy, and ideally to split the cache Redis from the broker Redis so one death can't take down two tiers.
  4. After the fix, redis's State.Status is running, web's cache 500s stop, and worker's consumer reconnects.
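
The restore and hardening steps could be scripted roughly as follows, using only the docker CLI. This is a sketch under assumptions, not the reference solution: restore_redis and cap_redis_memory are hypothetical helper names, the 256mb and allkeys-lru defaults are placeholders rather than recommendations, and it assumes the container ships redis-cli and needs no password.

```shell
# Hypothetical helper: start the container and wait until Docker reports it running.
restore_redis() {
  name=${1:-redis}
  docker start "$name" >/dev/null || return 1
  tries=0
  while [ "$tries" -lt 10 ]; do
    status=$(docker inspect "$name" --format '{{.State.Status}}')
    if [ "$status" = "running" ]; then
      echo "$name is running"
      return 0
    fi
    tries=$((tries + 1))
    sleep 1
  done
  echo "$name did not reach running (last status: $status)" >&2
  return 1
}

# Hypothetical helper: cap Redis memory so it evicts keys or rejects writes
# instead of growing until it is OOM-killed.
# Defaults (256mb, allkeys-lru) are placeholders; size them for your workload.
# CONFIG SET changes the live server only; it does not survive a restart
# unless the setting is also written to the Redis configuration.
cap_redis_memory() {
  name=${1:-redis}
  docker exec "$name" redis-cli CONFIG SET maxmemory "${2:-256mb}" &&
    docker exec "$name" redis-cli CONFIG SET maxmemory-policy "${3:-allkeys-lru}"
}
```

One design caveat: on an instance that doubles as a broker, an allkeys-* policy is free to evict queued jobs along with cache entries, which is part of the argument for splitting the cache Redis from the broker Redis.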

Constraints

  • Tools: docker CLI only.
  • Don't restart web or worker as the fix: they are healthy clients of a dead dependency, and they recover when Redis comes back.
  • Don't delete the Redis data/container.

Follow-up

  1. Raidbots had this exact incident: one Redis for the queue and login sessions. After restoring service, what is the single highest-leverage change (maxmemory plus eviction, a restart policy, a replica, or splitting cache from broker), and why?
  2. web and worker were both Up the whole time; docker ps never showed them as unhealthy. What does that teach you about alerting on symptoms (5xx, queue depth) versus alerting on the dependency itself?
  3. node-redis surfaces a dropped established connection as SocketClosedUnexpectedlyError: Socket closed unexpectedly rather than ECONNREFUSED. Why does "was connected, then dropped" look different from "never connected", and which one tells you the container died mid-flight?