app-db slows to a crawl every weekday mid-morning: queries that are normally instant take seconds, and load is up. The on-call sees the disk "maxed out" and is about to file for a faster/bigger volume or more vCPUs.
Before spending money: the disk is busy — but doing what, and driven by whom? Reads or writes? Find the offender and say what you'd actually do.
app-db, Debian 12, 4 vCPU, two volumes: vda (OS) and vdb

You've solved it when you have:
Confirmed with top that the CPU is mostly in iowait (high %wa, otherwise idle CPU), so it's I/O, not compute.
Run vmstat and seen that bi (blocks/KiB read IN from disk) is high while bo (written out) is low: the system is
READING from disk hard, not writing. (This is the opposite of a
write-saturation / VACUUM-style problem.)
Run iostat -x and localised it to vdb: high r/s and rkB/s (~297000 kB/s, ~290 MB/s), %util ~99, aqu-sz ~86,
but w/s and w_await are low. The saturation is entirely on the
read path; vda is idle.
Used ps (and /etc/cron.d/orders-backup) to find the reader: a pg_dump backup (with its Postgres COPY backend in D state) reading the
whole orders database, scheduled at 09:00 on weekdays, i.e. during
business hours, and now overrunning. The ordinary backends in D state are its
victims.
Proposed a fix that targets the backup job itself: ionice/rate-limit it
(ionice -c3 / --rate), or run it from a read replica / standby.
Said what you would NOT do: NOT add vCPUs (the CPU is idle, waiting in %wa), NOT add RAM, NOT
chase a writer / VACUUM (writes are low), NOT kill -9 the D-state
backend (it's uninterruptible until its I/O completes), NOT reboot.
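
The checks above can be sketched as three small stdin filters. This is a rough sketch, not a finished tool: the function names (vmstat_direction, iostat_busy, d_state) are hypothetical, the "bi > bo" rule and the 90% threshold are illustrative assumptions, and the iostat filter assumes the column names printed by a recent sysstat (rkB/s, wkB/s, %util), found by header name rather than by position.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, thresholds are illustrative.

# vmstat_direction: reads `vmstat 1 N` output on stdin and reports whether
# the sampled I/O is mostly reads (bi) or writes (bo), plus average %wa.
vmstat_direction() {
    awk '
        # Column-name header row: remember where bi, bo and wa live.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows start with a number. The first one is the average
        # since boot, so it is skipped.
        $1 ~ /^[0-9]+$/ && ("bi" in col) {
            if (++n == 1) next
            bi += $(col["bi"]); bo += $(col["bo"]); wa += $(col["wa"])
        }
        END {
            if (n < 2) { print "no samples"; exit 1 }
            s = n - 1
            printf "avg bi=%d bo=%d wa=%d%% -> %s\n", bi / s, bo / s, wa / s,
                   (bi > bo ? "read-heavy" : "write-heavy")
        }'
}

# iostat_busy [MIN_UTIL]: reads `iostat -dx` output on stdin and prints the
# devices whose %util is at or above MIN_UTIL (default 90), with their
# read and write throughput side by side.
iostat_busy() {
    awk -v min="${1:-90}" '
        # Header row: map column names to positions.
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) col[$i] = i; nf = NF; next }
        # Device rows have the same field count as the header.
        nf && NF == nf && $(col["%util"]) + 0 >= min {
            printf "%s util=%s%% rkB/s=%s wkB/s=%s\n",
                   $1, $(col["%util"]), $(col["rkB/s"]), $(col["wkB/s"])
        }'
}

# d_state: reads `ps -eo pid,stat,...` output on stdin and keeps the header
# plus every process whose state (second column) starts with D.
d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on the host (LC_ALL=C keeps decimal points as dots):
#   LC_ALL=C vmstat 1 5      | vmstat_direction
#   LC_ALL=C iostat -dxy 1 3 | iostat_busy 90
#   ps -eo pid,stat,wchan:32,args | d_state
#   cat /etc/cron.d/orders-backup
```

The filters take their input on stdin so they can be run against saved output as well as live commands. On a box like app-db, the expectation would be a read-heavy verdict from the first, only vdb listed by the second, and the backup's backend among the D-state processes from the third.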