iostat Says 100% Util — But Is the Disk Busy, or Dying?
Hard

Problem

storage-host is crawling: Postgres queries that normally take milliseconds are taking many seconds, and load is up. The on-call already knows "it's I/O" from top, and iostat -x shows the data volume sdb at ~98% %util — which looks saturated, so someone is about to throttle the workload or provision a bigger, faster volume.

Before you do: is sdb actually busy (overloaded by load), or is it dying (slow at servicing each I/O)? Those need opposite fixes. Prove which one, and say what you'd actually do.

Initial setup

  • Host: storage-host, Debian 12, 4 vCPU, two disks: sda (OS) and sdb (the Postgres data volume).

Acceptance

You've solved it when:

  • You've confirmed from top that the CPU is mostly in iowait (high %wa, with little user or system time), so the bottleneck is I/O, not compute.
  • You've run iostat -x and read sdb's columns together: %util is ~98.6, but aqu-sz is only ~1.4 (a tiny queue), throughput is near zero (rkB/s ~152), and r_await is ~485 ms. A huge await with a small queue and almost no throughput means the device is slow at servicing each I/O, not overloaded by work. %util near 100 is the trap here: per-I/O retries keep the spindle "busy" while almost no data moves. sda is healthy for contrast.
  • You've confirmed in dmesg that the device is failing: sdb is logging Medium Error / Unrecovered read error (UNC) and blk_update_request: I/O error, dev sdb. These are hardware media errors, not a symptom of workload.
  • You've named the fix: the disk is failing, so replace it (migrate the data off sdb, or fail it out of the array and rebuild redundancy onto a healthy disk). Do NOT throttle the workload, do NOT add I/O capacity or a bigger volume, do NOT reboot, and do NOT kill the Postgres backends (they are D-state victims of the slow disk).
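
The reasoning in the iostat and dmesg checks above can be sketched as a small Python helper. This is an illustration only, not part of the exercise: the names DeviceSample, classify, implied_iops and media_error_lines are hypothetical, and the three threshold constants are assumed cutoffs chosen for this example rather than values taken from iostat or any standard. The Little's law estimate also assumes the queue is dominated by reads, which the sdb figures suggest but do not prove.

```python
from dataclasses import dataclass


@dataclass
class DeviceSample:
    """One device row from `iostat -x`; fields mirror the columns named above."""
    util_pct: float    # %util
    aqu_sz: float      # aqu-sz, average number of requests in flight
    r_await_ms: float  # r_await, average read latency in milliseconds
    rkb_s: float       # rkB/s, read throughput


# Assumed cutoffs for illustration only; tune them for the device class.
HIGH_UTIL_PCT = 90.0
HIGH_AWAIT_MS = 100.0
SMALL_QUEUE = 4.0
LOW_THROUGHPUT_KB_S = 1024.0

# Substrings taken from the dmesg messages quoted in the acceptance criteria.
MEDIA_ERROR_MARKERS = ("Medium Error", "Unrecovered read error", "I/O error")


def implied_iops(sample: DeviceSample) -> float:
    """Little's law: queue size = completion rate * latency.

    Rearranged, completion rate = queue size / latency (in seconds).
    Assumes reads dominate the queue.
    """
    if sample.r_await_ms <= 0:
        return 0.0
    return sample.aqu_sz / (sample.r_await_ms / 1000.0)


def classify(sample: DeviceSample) -> str:
    """Read %util together with queue depth, latency and throughput."""
    if sample.util_pct < HIGH_UTIL_PCT:
        return "not saturated"
    slow_per_io = (
        sample.r_await_ms >= HIGH_AWAIT_MS
        and sample.aqu_sz <= SMALL_QUEUE
        and sample.rkb_s <= LOW_THROUGHPUT_KB_S
    )
    # High %util with a short queue and little data moved points at the
    # device itself; high %util with a deep queue points at the workload.
    return "slow device" if slow_per_io else "overloaded"


def media_error_lines(dmesg_text: str, dev: str = "sdb") -> list[str]:
    """Return kernel log lines that mention `dev` and a media-error marker."""
    return [
        line
        for line in dmesg_text.splitlines()
        if dev in line and any(marker in line for marker in MEDIA_ERROR_MARKERS)
    ]


if __name__ == "__main__":
    sdb = DeviceSample(util_pct=98.6, aqu_sz=1.4, r_await_ms=485.0, rkb_s=152.0)
    print(classify(sdb), round(implied_iops(sdb), 1))
```

With the sdb figures above, classify returns "slow device", and the Little's law estimate comes out at roughly 2.9 completed reads per second, which is why ~98.6 %util coexists with ~152 kB/s. The classification is only a hint; the dmesg media errors are what confirm a failing disk.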