CPU Pegged, App Idle: a Context-Switch Storm
Hard

Problem

pg-host (4 vCPU) shows 100% CPU and throughput has collapsed — queries that took milliseconds now take seconds. The on-call's reflex is "CPU is maxed, scale up the cores." But the CPU is busy doing the wrong thing. Work out what, and why adding cores would make it worse.

Initial setup

  • Host: pg-host, Debian 12, 4 vCPU, a Postgres database under a
traffic spike.

Acceptance

You've solved it when:

  • You've read top: the CPU is pegged, but almost all of it is sy (system /
kernel) time: %sy ~84, %us ~10. The box is busy in the kernel, not running application code.
  • You've run vmstat and seen the smoking gun: cs (context
switches/s) ~180,000 and a run queue r ~38 (far above 4 cores), while si/so are 0 (not swap) and wa is 0 (not I/O). This is a context-switch / lock-contention storm.
  • You've used ps/top to see dozens of Postgres backends all STAT R,
each at modest CPU, all contending — and found the root cause in postgresql.conf: max_connections = 600 with no connection pooler, so the spike put hundreds of active backends on 4 cores.
  • You've named the fix: cap concurrency to roughly the core count. Put
pgbouncer in front (or lower max_connections) so only a handful of backends run at once. The fix is NOT to add vCPUs (more cores worsen the contention), NOT to kill backends one by one (they're victims), and NOT to reboot.
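
The vmstat reasoning in the acceptance list can be sketched as a small check. This is a minimal illustration, not part of the exercise: the function names, the thresholds (wa >= 20, run queue above twice the core count, 10,000 context switches/s per core) and the filler columns in the sample line are all assumptions. Only r, si, so, cs, us, sy and wa come from the scenario above.

```python
from dataclasses import dataclass

# Header as printed by procps vmstat; newer versions may append extra columns,
# which is why the parser looks fields up by name rather than by position.
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
# Sample line: r, si, so, cs, us, sy, wa follow the scenario; every other
# value is a made-up placeholder so the line has the right shape.
LINE = "38 0 0 512000 20480 1024000 0 0 0 12 9500 180000 10 84 6 0 0"


@dataclass
class VmstatSample:
    r: int   # processes waiting to run (run queue)
    si: int  # memory swapped in per second
    so: int  # memory swapped out per second
    cs: int  # context switches per second
    us: int  # % CPU in user code
    sy: int  # % CPU in kernel code
    wa: int  # % CPU waiting on I/O


def parse_vmstat_line(header: str, line: str) -> VmstatSample:
    """Pair column names with values and keep only the fields we reason about."""
    row = dict(zip(header.split(), (int(v) for v in line.split())))
    wanted = ("r", "si", "so", "cs", "us", "sy", "wa")
    return VmstatSample(**{name: row[name] for name in wanted})


def classify(sample: VmstatSample, cores: int, cs_per_core_limit: int = 10_000) -> str:
    """Rule out swap and I/O first, then test for a context-switch storm.

    All thresholds are illustrative assumptions, not tuned values.
    """
    if sample.si or sample.so:
        return "swapping"
    if sample.wa >= 20:
        return "io-wait"
    kernel_bound = sample.sy > sample.us
    oversubscribed = sample.r > 2 * cores
    switching_hard = sample.cs > cs_per_core_limit * cores
    if kernel_bound and oversubscribed and switching_hard:
        return "context-switch storm"
    return "inconclusive"


if __name__ == "__main__":
    sample = parse_vmstat_line(HEADER, LINE)
    print(classify(sample, cores=4))
```

On a real host you would feed it a line from something like vmstat 1 5 instead of the hard-coded sample. The order of the checks mirrors the acceptance criteria: swap and I/O are eliminated first, so a storm is only reported when the kernel-heavy CPU, the long run queue and the high switch rate all show up together.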