Your data-pipeline deployment has partially collapsed and the symptoms don't line up into a single story. You need to figure out what actually happened, in what order, and bring every service back to healthy without losing the in-flight batch on the shared volume.
Observed symptoms, all visible right now on the host:
- web is up and answers health checks, but every downstream call to /api/process returns Connection refused on api:9000.
- api is exited (137), but docker inspect will tell you the …
- worker is up and keeps logging "api endpoint unreachable on …"; api has been gone for 4 minutes.
- api's last log line reads Killed, right after it started loading /data/ingest/batch-042.parquet (a 1.8 GB file from the shared volume).
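The 137 exit status in the symptoms can be decoded with plain shell arithmetic. By the usual convention, an exit status above 128 means the process was terminated by a signal whose number is the status minus 128. The sketch below is only an illustration; the helper name `signal_from_exit` is hypothetical.

```shell
#!/bin/sh
# signal_from_exit is a hypothetical helper name, not part of the docker CLI.
# Convention: an exit status above 128 means "terminated by signal (status - 128)".
signal_from_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        # Status 0..128: the process exited on its own, no signal involved.
        echo 0
    fi
}

# Example with the status from the symptoms: 137 - 128 = 9, i.e. SIGKILL.
signal_from_exit 137
```

SIGKILL on its own does not prove a memory kill (a manual docker kill yields the same status), which is why the Killed log line and the memory facts still need to be checked.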
There are three distinct problems layered here. Fixing only one of them will leave the system broken in a different way, and a naive docker start api will reproduce the original crash. Your job is to untangle them, apply a fix that addresses root cause (not just symptom), and restore the pipeline.
- web (up), api (exited 137), worker (up), db (up): all four on network app-net (10.42.0.0/24).
- api's edge to app-net is marked disconnected in the graph; every other container's edge is connected.
- shared-data volume is mounted at /data on web, api, and worker. Contains ingest/batch-042.parquet (~1.8 GB).
- api was started with no memory limit set on the container (HostConfig.Memory == 0). But api was OOM-killed by Docker's daemon after it blew past the host's per-cgroup soft memory pressure, not by the classic MemoryLimit path, so State.OOMKilled is false even though the kill was memory-driven.
$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
NAMES     STATUS
web       Up 30 minutes
api       Exited (137) 4 minutes ago
worker    Up 30 minutes
db        Up 30 minutes
$ docker inspect api --format '{{.State.OOMKilled}} / {{.HostConfig.Memory}}'
false / 0
$ docker network inspect app-net --format '{{range $k, $v := .Containers}}{{$v.Name}}:{{$v.IPv4Address}} {{end}}'
web:10.42.0.10/24 worker:10.42.0.12/24 db:10.42.0.20/24
Notice api is not in app-net's member list — it was kicked off the network when it died, and nothing has re-attached it. That's why worker keeps logging "endpoint unreachable" and web keeps getting connection-refused: the DNS name api no longer resolves on this network.
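The membership check described above can be scripted instead of read by eye. The sketch below assumes the `name:ip/prefix` output format produced by the inspect template in the transcript; `is_attached` is a hypothetical helper name.

```shell
#!/bin/sh
# is_attached is a hypothetical helper: it reads space-separated
# "name:ip/prefix" pairs on stdin and succeeds only if the given
# container name is one of them.
is_attached() {
    name=$1
    tr ' ' '\n' | grep -q "^${name}:"
}

# Intended use against the live daemon (not run here):
#   docker network inspect app-net \
#     --format '{{range $k, $v := .Containers}}{{$v.Name}}:{{$v.IPv4Address}} {{end}}' \
#     | is_attached api || echo "api is not attached to app-net"

# Example with the member list shown in the transcript.
members='web:10.42.0.10/24 worker:10.42.0.12/24 db:10.42.0.20/24'
if echo "$members" | is_attached api; then
    echo "api attached"
else
    echo "api missing"
fi
```

With the transcript's member list this prints `api missing`.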
You've solved it when all of the following are true:
- api died loading batch-042.parquet because the process grew past available memory (the kill was memory-pressure driven, even though classic OOMKilled was false).
- api was detached from app-net when it exited — so even restarting it naively won't fix DNS for web and worker without re-attaching.
- The 1.8 GB parquet file on shared-data/ingest/batch-042.parquet will crash api again the moment it restarts, unless the load is bounded or the file is handled differently.
- Gives api a memory bound it can handle (or changes how it consumes the file) so a restart doesn't re-crash.
- Re-attaches api to app-net so web and worker can resolve it by name.
- Preserves shared-data and its contents — don't delete the volume or the batch file.
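One possible shape for such a fix, using only the docker CLI, is sketched below. It follows the scenario's premise that api has to be re-attached explicitly. The `4g` cap is a placeholder, not a measured value, and `restore_api` and `verify_api` are hypothetical helper names. Whether a cap alone is enough depends on how api reads the file, which the scenario leaves open.

```shell
#!/bin/sh
# restore_api / verify_api are hypothetical helpers; MEM_CAP is an assumed
# placeholder and should be replaced with a value the host can really spare.
MEM_CAP=${MEM_CAP:-4g}

restore_api() {
    # 1. Bound memory on the existing (stopped) container. Setting
    #    --memory-swap equal to --memory leaves no swap beyond the cap.
    docker update --memory "$MEM_CAP" --memory-swap "$MEM_CAP" api || return 1

    # 2. Re-attach api to app-net. Any error is ignored here so that an
    #    "already attached" response does not abort the sketch; a stricter
    #    script would inspect the error message instead.
    docker network connect app-net api 2>/dev/null || true

    # 3. Start the same container. The shared-data volume and db are
    #    never touched by any of these commands.
    docker start api
}

verify_api() {
    # api must report "running" ...
    [ "$(docker inspect api --format '{{.State.Status}}')" = "running" ] || return 1
    # ... and must appear in app-net's member list.
    docker network inspect app-net --format '{{range .Containers}}{{.Name}} {{end}}' \
        | tr ' ' '\n' | grep -qx api
}
```

A run would then be `restore_api && verify_api`, followed by a real request from web to api:9000 to confirm that downstream calls have stopped erroring.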
- api.status == running, api is a member of app-net, and web's downstream calls to api:9000 stop erroring.
- docker CLI only.
- shared-data is ground truth. Do not remove the volume, do not … /data, and do not edit the parquet file.
- db is healthy and must stay healthy. Don't restart it.
- If api had been started with --restart=always, how would this have looked in docker ps -a, and would it have been easier or harder to diagnose?
- … on_network edge for api was marked disconnected …
- … docker run --memory-swappiness=0? A cgroup OOM-score adjustment? An init-container that validates the file's size against the container's memory cap?
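The last idea, a pre-flight check that compares the file's size with the container's memory cap, could look roughly like the sketch below. The helper name `fits_in_memory` and the 2x headroom factor are assumptions for illustration only. A parquet file usually takes more memory once decoded than it does on disk, so a real factor would have to be measured.

```shell
#!/bin/sh
# fits_in_memory FILE_BYTES LIMIT_BYTES [FACTOR]
# Hypothetical helper: succeeds if FILE_BYTES * FACTOR <= LIMIT_BYTES.
# A LIMIT_BYTES of 0 mirrors HostConfig.Memory == 0 (no cap set) and is
# rejected with status 2, since there is nothing to validate against.
fits_in_memory() {
    file_bytes=$1
    limit_bytes=$2
    factor=${3:-2}
    [ "$limit_bytes" -gt 0 ] || return 2
    [ $((file_bytes * factor)) -le "$limit_bytes" ]
}

# In a real check the inputs might come from:
#   wc -c < /data/ingest/batch-042.parquet
#   docker inspect api --format '{{.HostConfig.Memory}}'

# Example: a 1.8 GB file against a 2 GiB cap with the default 2x headroom.
file_bytes=1800000000
limit_bytes=$((2 * 1024 * 1024 * 1024))
if fits_in_memory "$file_bytes" "$limit_bytes"; then
    echo "ok to load"
else
    echo "refuse to load: file too large for memory cap"
fi
```

With these example numbers the check refuses the load, because 3.6 GB of assumed working memory exceeds the 2 GiB cap.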