storage-host is crawling: Postgres queries that normally take milliseconds are taking many seconds, and load is up. The on-call already knows "it's I/O" from top, and iostat -x shows the data volume sdb at ~98% %util — which looks saturated, so someone is about to throttle the workload or provision a bigger, faster volume.
Before you do: is sdb actually busy (overloaded by load), or is it dying (slow at servicing each I/O)? Those need opposite fixes. Prove which one, and say what you'd actually do.
storage-host, Debian 12, 4 vCPU, two disks: sda (OS) and sdb (the Postgres data volume).
You've solved it when:
Confirm with top that the CPU is mostly in iowait (high %wa, idle CPU), so it's I/O, not compute.
Run iostat -x and read sdb's columns together: yes, %util is ~98.6, but
aqu-sz is ~1.4 (a tiny queue) and throughput is
near-zero (rkB/s ~152) while r_await is ~485 ms. A huge await with a
small queue and almost no throughput means the device is slow at
servicing each I/O, not overloaded by work. %util near 100 here is
the trap (per-I/O retries keep the spindle "busy"). sda is healthy for
contrast.
Check dmesg: sdb is reporting Medium Error / Unrecovered read error (UNC) and
blk_update_request: I/O error, dev sdb. These are hardware media errors, not a
workload problem.
State the fix: replace the failing disk (swap out sdb / fail it out of the
array and rebuild redundancy onto a healthy disk). NOT throttle the
workload, NOT add I/O capacity / a bigger volume, NOT reboot, NOT kill
the Postgres backends (they are D-state victims of the slow disk).
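
The "busy vs. dying" reasoning in the checklist can be sketched as a small decision rule. This is a rough illustration, not a diagnostic tool: the thresholds (SLOW_AWAIT_MS, SMALL_QUEUE, LOW_THROUGHPUT_KB_S, the 90% cutoff) and the names DeviceStats, classify, implied_iops, and has_media_errors are assumptions made up for this example, and the Little's-law estimate assumes a read-dominated workload, since aqu-sz covers all requests while r_await covers reads only. The log patterns are just the strings quoted above.

```python
import re
from dataclasses import dataclass


@dataclass
class DeviceStats:
    """One device row from `iostat -x`, reduced to the columns used above."""
    util_pct: float    # %util
    aqu_sz: float      # aqu-sz, average number of requests in flight
    r_await_ms: float  # r_await, average read latency in milliseconds
    rkb_s: float       # rkB/s, read throughput


# Illustrative thresholds (assumptions, tune for your hardware).
SLOW_AWAIT_MS = 100.0
SMALL_QUEUE = 4.0
LOW_THROUGHPUT_KB_S = 1024.0


def implied_iops(stats: DeviceStats) -> float:
    """Little's law: completions per second ~= requests in flight / latency."""
    return stats.aqu_sz / (stats.r_await_ms / 1000.0)


def classify(stats: DeviceStats) -> str:
    """Separate 'overloaded by work' from 'slow at servicing each I/O'."""
    if stats.util_pct < 90.0:
        return "not-pegged"
    slow_per_io = stats.r_await_ms >= SLOW_AWAIT_MS
    little_work = (stats.aqu_sz <= SMALL_QUEUE
                   and stats.rkb_s <= LOW_THROUGHPUT_KB_S)
    # High %util with high latency but a tiny queue and little throughput:
    # the device itself is slow, so more capacity or throttling won't help.
    if slow_per_io and little_work:
        return "slow-device"
    return "saturated"


# Patterns taken from the kernel messages quoted in the checklist.
MEDIA_ERROR = re.compile(r"Medium Error|Unrecovered read error|I/O error, dev ")


def has_media_errors(kernel_log: str, dev: str) -> bool:
    """True if any log line mentions `dev` together with a media-error pattern."""
    dev_word = re.compile(rf"\b{re.escape(dev)}\b")
    return any(dev_word.search(line) and MEDIA_ERROR.search(line)
               for line in kernel_log.splitlines())


sdb = DeviceStats(util_pct=98.6, aqu_sz=1.4, r_await_ms=485.0, rkb_s=152.0)
print(classify(sdb))                # slow-device
print(round(implied_iops(sdb), 1))  # 2.9
```

With the numbers from this incident, the estimate comes out at roughly 3 completed I/Os per second: a device pinned at ~98% %util while finishing almost nothing, which is what "slow at servicing each I/O" looks like. In practice you would feed has_media_errors the output of dmesg and treat a match on sdb as the confirming evidence.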