Unwind a cascading failure: shared volume + lost network endpoint

Problem

Your data-pipeline deployment has partially collapsed and the symptoms don't line up into a single story. You need to figure out what actually happened, in what order, and bring every service back to healthy without losing the in-flight batch on the shared volume.

Observed symptoms, all visible right now on the host:

- web is up and answers health checks, but every downstream call to /api/process returns Connection refused on api:9000.
- api has exited with code 137 (128 + 9, i.e. SIGKILL), but docker inspect will tell you the kernel did not OOM-kill it; the exit code came from a signal sent by the container runtime itself.
- worker is up and keeps logging _"api endpoint unreachable on app-net"_, even though api has been gone for 4 minutes.
- api's last log line reads Killed right after it started loading /data/ingest/batch-042.parquet (a 1.8 GB file from the shared volume).

There are three distinct problems layered here. Fixing only one of them will leave the system broken in a different way, and a naive docker start api will reproduce the original crash. Your job is to untangle them, apply a fix that addresses root cause (not just symptom), and restore the pipeline.

Initial setup

- web (up), api (exited 137), worker (up), db (up): all four on network app-net (10.42.0.0/24).
- api's edge to app-net is marked disconnected in the graph; every other container's edge is connected.
- The shared-data volume is mounted at /data on web, api, and worker. It contains ingest/batch-042.parquet (~1.8 GB).
- api was started with no memory limit on the container (HostConfig.Memory == 0). It was still killed for memory reasons: Docker's daemon killed it after it blew past the host's soft memory-pressure threshold for its cgroup, not through the classic container memory-limit path, so State.OOMKilled is false even though the kill was memory-driven.
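As a starting point for diagnosis, the exit code can be decoded before trusting State.OOMKilled. The sketch below assumes a POSIX shell; signal_from_exit and describe_exit are hypothetical helper names, not docker subcommands. It relies only on the shell convention that an exit status above 128 means "terminated by signal (status - 128)".

```shell
# Decode an exit status: values above 128 mean "killed by signal (status - 128)".
# signal_from_exit is a hypothetical helper name used for illustration.
signal_from_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "signal $((code - 128))"
  else
    echo "exit status $code"
  fi
}

# Print the three fields that matter in this scenario for one container:
# exit code, the OOMKilled flag, and the configured memory limit (0 = none).
describe_exit() {
  docker inspect --format \
    '{{.State.ExitCode}} {{.State.OOMKilled}} {{.HostConfig.Memory}}' "$1"
}
```

Running signal_from_exit 137 reports signal 9, which is SIGKILL. A SIGKILL combined with OOMKilled false and a memory limit of 0 is the combination this scenario asks you to explain, so describe_exit api is a reasonable first command.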

Example interaction

$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
NAMES     STATUS
web       Up 30 minutes
api       Exited (137) 4 minutes ago
worker    Up 30 minutes
db        Up 30 minutes

$ docker inspect api --format '{{.State.OOMKilled}} / {{.HostConfig.Memory}}'
false / 0

$ docker network inspect app-net --format '{{range $k, $v := .Containers}}{{$v.Name}}:{{$v.IPv4Address}} {{end}}'
web:10.42.0.10/24 worker:10.42.0.12/24 db:10.42.0.20/24

Notice api is not in app-net's member list — it was kicked off the network when it died, and nothing has re-attached it. That's why worker keeps logging "endpoint unreachable" and web keeps getting connection-refused: the DNS name api no longer resolves on this network.
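The membership check can be wrapped in a small helper so it is easy to repeat before and after a fix. This is a minimal sketch assuming a POSIX shell; on_network is a hypothetical name, and it only reads the same .Containers list shown in the docker network inspect output above.

```shell
# on_network NETWORK CONTAINER
# Succeeds (exit 0) if CONTAINER appears in NETWORK's member list.
# on_network is a hypothetical helper name used for illustration.
on_network() {
  docker network inspect "$1" \
    --format '{{range .Containers}}{{println .Name}}{{end}}' \
    | grep -Fqx -- "$2"
}
```

A possible use: `on_network app-net api || docker network connect app-net api`. Whether connecting is enough on its own depends on the other two problems, so treat it as one step of the fix rather than the whole fix.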

Acceptance

You've solved it when all of the following are true:

You've identified and stated all three root causes:

- api died loading batch-042.parquet because the process grew past available memory (the kill was memory-pressure driven, even though classic OOMKilled was false).
- api was detached from app-net when it exited, so even restarting it naively won't fix DNS for web and worker without re-attaching.
- The 1.8 GB parquet file on shared-data/ingest/batch-042.parquet will crash api again the moment it restarts, unless the load is bounded or the file is handled differently.
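The third root cause comes down to a comparison between the file size and the memory api is allowed to use. The sketch below assumes a POSIX shell. memory_guard and guard_container are hypothetical names, the headroom factor of 3 is an assumed multiplier (a parquet file often needs more memory once decoded than it occupies on disk, but the real ratio has to be measured), and the busybox image is an assumption used only to read the file size through a read-only mount.

```shell
# memory_guard FILE_BYTES LIMIT_BYTES [FACTOR]
# Passes only if a limit is set and is at least FACTOR times the file size.
# FACTOR defaults to 3: an assumed headroom multiplier, not a measured one.
memory_guard() {
  file_bytes="$1"; limit_bytes="$2"; factor="${3:-3}"
  if [ "$limit_bytes" -eq 0 ]; then
    echo "FAIL: no memory limit set"; return 1
  fi
  if [ "$limit_bytes" -lt $((file_bytes * factor)) ]; then
    echo "FAIL: limit $limit_bytes < $factor x file size $file_bytes"; return 1
  fi
  echo "OK"
}

# guard_container CONTAINER VOLUME PATH_IN_VOLUME
# Reads the container's memory limit and the file size, then applies the guard.
# The volume is mounted read-only, so nothing under /data is modified.
guard_container() {
  limit="$(docker inspect --format '{{.HostConfig.Memory}}' "$1")" || return 1
  size="$(docker run --rm -v "$2":/data:ro busybox stat -c %s "/data/$3")" || return 1
  memory_guard "$size" "$limit"
}
```

In this scenario `guard_container api shared-data ingest/batch-042.parquet` would fail on the first check, because HostConfig.Memory is 0. The same comparison could run at deploy time, before the container ever starts.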

You've applied a fix that:

- Gives api a memory bound it can handle (or changes how it consumes the file) so a restart doesn't re-crash.
- Re-attaches api to app-net so web and worker can resolve it by name.
- Preserves shared-data and its contents. Don't delete the volume or the batch file.
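One possible ordering of those steps, as a sketch rather than the reference solution: bound memory first, re-attach the network second, start the container last. restore_api is a hypothetical helper name and the 4g default is an assumed value that the scenario does not specify; the cap has to be large enough for the load and small enough to fit on the host. If no such value exists, the way api reads the file has to change instead, which is outside what these commands can do.

```shell
# restore_api [MEMORY]
# Hypothetical helper: applies the three steps in order and stops at the
# first failure. The 4g default is an assumption, not a value from the brief.
restore_api() {
  mem="${1:-4g}"

  # 1. Bound memory. Setting --memory-swap equal to --memory leaves no
  #    additional swap on top of the memory limit.
  docker update --memory "$mem" --memory-swap "$mem" api || return 1

  # 2. Re-attach api to the network so its name resolves for web and worker.
  docker network connect app-net api || return 1

  # 3. Only now start the container. db and the shared-data volume are
  #    never touched.
  docker start api
}
```

On a real engine, docker network connect may refuse if the endpoint is still registered for the container, so it is worth checking membership first instead of assuming the call will succeed.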

After your fix: api.status == running, api is a member of app-net, and web's downstream calls to api:9000 stop erroring.
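The first two post-fix conditions can be checked from the container's own side with a single docker inspect call. This is a minimal sketch assuming a POSIX shell; check_api is a hypothetical name, and it reads the container's status plus the network names listed under .NetworkSettings.Networks.

```shell
# check_api: verify that api is running and lists app-net among its networks.
# check_api is a hypothetical helper name used for illustration.
check_api() {
  out="$(docker inspect --format \
    '{{.State.Status}}{{range $k, $v := .NetworkSettings.Networks}} {{$k}}{{end}}' \
    api)" || return 1

  # Split the output into words: the first is the status, the rest are networks.
  set -- $out
  status="${1:-unknown}"
  if [ "$status" != "running" ]; then
    echo "FAIL: api status is $status"; return 1
  fi
  shift

  for net in "$@"; do
    if [ "$net" = "app-net" ]; then
      echo "OK: api running on app-net"; return 0
    fi
  done
  echo "FAIL: api is not attached to app-net"; return 1
}
```

The third condition, that web's calls to api:9000 stop erroring, is not covered here. Something like `docker logs --since 1m web` is one way to watch for it with the docker CLI alone.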

Constraints

- Tools: docker CLI only.
- Time limit: 30 minutes. This is intentionally a staff-level problem; rushing a fix usually breaks something else.
- shared-data is ground truth. Do not remove the volume, do not delete files from /data, and do not edit the parquet file.
- db is healthy and must stay healthy. Don't restart it.

Follow-up

- If api had been started with --restart=always, how would this incident have looked different in docker ps -a, and would it have been easier or harder to diagnose?
- The graph's on_network edge for api was marked disconnected the instant the container exited. That's a cascade effect, not something Docker itself tracks as a property of a stopped container. What surface should your observability stack use to see this cascade in real production?
- Sketch a deploy-time guardrail that would have caught the "container without a memory limit consuming a multi-GB file" pattern before it reached prod. docker run --memory-swappiness=0? A cgroup OOM-score adjustment? An init-container that validates the file's size against the container's memory cap?