USE on Linux: CPU, memory, disk, network, and PSI

Crux: How to collect utilization, saturation, and error signals for every Linux resource, and why Pressure Stall Information (PSI, kernel 4.20+) is the modern saturation signal that replaces load averages.

A Stripe webhook worker stalled for 11 minutes in 2024. Free-RAM dashboards showed plenty of available memory. The actual cause was kernel memory pressure at 90% — the kernel was thrashing reclaim. PSI caught the next incident in 30 seconds. Free-RAM dashboards still showed nothing.

The USE resource checklist for Linux

Brendan Gregg’s original USE checklist for Linux covers roughly 30 resource × U/S/E cells. In node_exporter + Prometheus, the key ones are:

Resource | Utilization metric | Saturation metric | Error metric
--- | --- | --- | ---
CPU | 1 - idle fraction of node_cpu_seconds_total (mode="idle") | node_pressure_cpu_waiting_seconds_total (PSI) or vmstat r column | MCE errors (rare)
Memory | 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes | node_pressure_memory_waiting_seconds_total (PSI) or vmstat si/so | ECC errors, node_memory_HardwareCorrupted_bytes
Disk I/O | rate of node_disk_io_time_seconds_total (%util) | node_disk_io_time_weighted_seconds_total (queue depth) + PSI io | node_disk_read_errors_total, EIO
Network | node_network_transmit_bytes_total / link speed | node_network_transmit_drop_total, TCP retransmits | node_network_receive_errs_total, CRC errors
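The memory utilization cell is the same arithmetic node_exporter's gauges are built on: both values come from /proc/meminfo. A minimal sketch, assuming meminfo-style text as input; the helper names and the sample values are illustrative, not taken from a real host:

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # Most fields are reported in kB; a few are plain counts.
            fields[name.strip()] = int(parts[0])
    return fields


def memory_utilization(fields: dict[str, int]) -> float:
    """Return 1 - MemAvailable / MemTotal as a fraction between 0 and 1."""
    return 1.0 - fields["MemAvailable"] / fields["MemTotal"]


# Illustrative sample, not real host data.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:          500000 kB
MemAvailable:    4000000 kB
HardwareCorrupted:     0 kB
"""

if __name__ == "__main__":
    print(f"memory utilization: {memory_utilization(parse_meminfo(SAMPLE)):.0%}")
```

On a live Linux host the same functions could be fed the contents of /proc/meminfo instead of the sample string.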

PSI: Pressure Stall Information

PSI was merged into Linux 4.20 (December 2018) by Johannes Weiner at Facebook, solving a long-standing observability gap: pre-PSI you had to infer pressure from load averages, vmstat’s r column, and iowait — none of which scale correctly across many cores or capture transient saturation.

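PSI exposes at /proc/pressure/{cpu,memory,io} a two-line report per resource. Each line carries three running averages and a cumulative total in microseconds. For cpu, the full line only appears on newer kernels (5.13+) and is not meaningful at the system level.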

some avg10=5.20 avg60=3.11 avg300=1.02 total=1234567
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
  • some: the percentage of wall-clock time in which at least one task was stalled waiting for this resource, averaged over the last 10 s, 60 s, and 300 s.
  • full: the percentage of wall-clock time in which all non-idle tasks were stalled at once. This is the critical signal: if full stays above 0 for memory or I/O for more than 10 seconds, you almost certainly have a crunch even if free RAM looks plentiful (the kernel is thrashing reclaim).
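The report format above is simple enough to parse by hand. A minimal sketch, assuming the two-line layout shown in the sample; parse_pressure and read_pressure are hypothetical helper names, not part of any library:

```python
def parse_pressure(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text into {"some": {...}, "full": {...}}.

    Each inner dict maps avg10, avg60, avg300, and total to a float.
    """
    report: dict[str, dict[str, float]] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, pairs = parts[0], parts[1:]
        values: dict[str, float] = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            values[key] = float(raw)
        report[kind] = values
    return report


def read_pressure(resource: str) -> dict[str, dict[str, float]]:
    """Read /proc/pressure/<resource>; needs Linux 4.20+ with PSI enabled."""
    with open(f"/proc/pressure/{resource}", encoding="ascii") as handle:
        return parse_pressure(handle.read())


# The sample report from the text above.
SAMPLE = (
    "some avg10=5.20 avg60=3.11 avg300=1.02 total=1234567\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)

if __name__ == "__main__":
    report = parse_pressure(SAMPLE)
    print("some avg10:", report["some"]["avg10"])
    print("full avg10:", report["full"]["avg10"])
```

Calling read_pressure("memory") on a host without PSI raises FileNotFoundError, which is itself a useful check that the kernel supports it.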

PSI overhead: sub-1% (kernel-side counters, no per-task tracing).

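node_exporter exposes the some counters as node_pressure_cpu_waiting_seconds_total, node_pressure_memory_waiting_seconds_total, and node_pressure_io_waiting_seconds_total, and the full counters as node_pressure_memory_stalled_seconds_total and node_pressure_io_stalled_seconds_total.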

Why PSI beats load average

Load average mixes runnable and uninterruptible tasks, is not normalised by core count, and is an exponentially damped moving average whose shortest window is 1 minute, so a transient 5-second CPU crunch barely registers. On a 64-core host, a load average of 10 is fine; on a 2-core host it is a disaster. PSI normalises by wall-clock time and is independent of core count.
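To see why a short crunch barely moves the load average, here is a simplified model of the 1-minute figure: one sample roughly every 5 seconds, blended in with an exponential decay. This is a sketch under that assumption; the real kernel uses fixed-point arithmetic, and the function names and the 8-core scenario are illustrative.

```python
import math

SAMPLE_INTERVAL_S = 5.0  # the kernel samples active tasks about every 5 s


def update_load(load: float, active_tasks: int, window_s: float = 60.0) -> float:
    """Blend one sample of active tasks into an exponentially damped average."""
    decay = math.exp(-SAMPLE_INTERVAL_S / window_s)
    return load * decay + active_tasks * (1.0 - decay)


def simulate(samples: list[int], window_s: float = 60.0) -> list[float]:
    """Return the load average after each sample, starting from an idle host."""
    load = 0.0
    history = []
    for active in samples:
        load = update_load(load, active, window_s)
        history.append(load)
    return history


if __name__ == "__main__":
    # Idle host, then 16 active tasks seen at a single sample, then idle again.
    history = simulate([0, 0, 16, 0, 0])
    print([round(value, 2) for value in history])
```

In this model a burst of 16 runnable tasks on an 8-core host, twice the core count, lifts the 1-minute figure to only about 1.3, far below any threshold tied to the number of cores.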

Production alert thresholds:

  • rate(node_pressure_cpu_waiting_seconds_total[5m]) > 0.1 (10% of time stalled on CPU) → warning
  • rate(node_pressure_memory_stalled_seconds_total[5m]) > 0 for 10 min → memory crunch, page immediately
  • rate(node_pressure_io_stalled_seconds_total[5m]) > 0 for 5 min → I/O crunch
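The rate() expressions above reduce to a simple ratio: seconds of stall accumulated divided by seconds elapsed. A minimal sketch of that arithmetic against the raw total field from /proc/pressure (microseconds), with the thresholds from the list above; the function names are hypothetical, and the "for 10 min" and "for 5 min" duration conditions are deliberately left to the alerting system:

```python
CPU_SOME_WARN = 0.10  # 10% of wall-clock time stalled on CPU


def stall_fraction(total_start_us: int, total_end_us: int, elapsed_s: float) -> float:
    """Fraction of the interval spent stalled, from two PSI total readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (total_end_us - total_start_us) / 1_000_000 / elapsed_s


def evaluate(cpu_some: float, memory_full: float, io_full: float) -> list[str]:
    """Map stall fractions to the alert conditions listed above."""
    alerts = []
    if cpu_some > CPU_SOME_WARN:
        alerts.append("cpu: warning")
    if memory_full > 0:
        alerts.append("memory: crunch")
    if io_full > 0:
        alerts.append("io: crunch")
    return alerts


if __name__ == "__main__":
    # 45 s of CPU stall accumulated over a 5-minute window.
    cpu = stall_fraction(1_000_000, 46_000_000, 300.0)
    print(f"cpu some: {cpu:.0%}", evaluate(cpu, 0.0, 0.0))
```

Working from the cumulative total rather than avg10 or avg60 lets you pick your own window, which is the same reason the Prometheus rules use rate() over counters.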

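Kubernetes is moving in the same direction: recent releases can expose PSI metrics through the kubelet on cgroup v2 nodes (the KubeletPSI feature gate, alpha in 1.33), although kubelet eviction still relies on its classic memory and disk thresholds rather than on PSI.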

How USE signals look under failure

CPU-bound workload: Utilization climbs toward 100%, saturation (run-queue length) rises above the core count. This is the correct signal that scaling out will help. CPU per request stays flat — it is volume, not regression.

Memory leak: Utilization climbs slowly with si/so (swap-in / swap-out) eventually non-zero. PSI memory full goes from 0 to non-zero — the kernel is thrashing reclaim to free pages.

Disk-bound workload: %util near 100% on one or more disks, queue depth growing, PSI io some rising. Like the memory stall that hit the Stripe worker above, this is visible in PSI and invisible in free-RAM dashboards.

Network-bound: TCP retransmits rising (node_netstat_Tcp_RetransSegs), NIC drops rising, bandwidth at negotiated maximum.
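"Bandwidth at negotiated maximum" can be checked by comparing the byte counter's growth with the link speed. A minimal sketch, assuming two readings of a transmit byte counter and a known link speed in bits per second; link_utilization is a hypothetical helper and the numbers are illustrative:

```python
def link_utilization(
    bytes_start: int, bytes_end: int, elapsed_s: float, link_speed_bps: float
) -> float:
    """Fraction of the link's capacity used over the interval."""
    if elapsed_s <= 0 or link_speed_bps <= 0:
        raise ValueError("elapsed_s and link_speed_bps must be positive")
    bits_per_second = (bytes_end - bytes_start) * 8 / elapsed_s
    return bits_per_second / link_speed_bps


if __name__ == "__main__":
    # 11.25 GB sent in 100 s on a 1 Gbit/s link.
    used = link_utilization(0, 11_250_000_000, 100.0, 1_000_000_000)
    print(f"link utilization: {used:.0%}")
```

Utilization alone does not prove the network is the bottleneck; read it together with the drop and retransmit counters named above.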

Why this works

Why does PSI memory full fire even when free RAM looks plentiful? Because the kernel’s reclaim system runs in the background to keep a free-page reserve, and when that reclaim cannot keep pace with allocation, it starts blocking allocating tasks. This manifests as PSI memory full — tasks stalled waiting for the kernel to free pages — while MemAvailable still shows hundreds of megabytes. The free-RAM metric measures what is available right now; PSI measures what the kernel had to do to provide it.

Quiz

A pod's CPU utilization is 40% but real user requests are experiencing delays. What PSI signal confirms the CPU is actually causing stalls?

Quiz

PSI memory 'full' is non-zero for 15 seconds. What does this mean, even if MemAvailable shows 500 MB free?

Recall before you leave
  1. What is PSI's 'some' vs 'full' distinction, and which is more severe?
  2. Why is PSI cpu some > 20% a better alert than CPU utilization > 80%?
  3. Name the four Linux resources in the USE checklist and one node_exporter metric for each's saturation.
Recap

The USE method on Linux covers four main resource groups: CPU, memory, disk I/O, and network — each with a utilization metric (average % busy), a saturation metric (queue depth or stall time), and an error metric (ECC, EIO, NIC drops). Before kernel 4.20, saturation had to be inferred from load averages and vmstat columns, which scale poorly. Linux’s Pressure Stall Information (PSI), introduced by Johannes Weiner at Facebook and merged December 2018, exposes at /proc/pressure/{cpu,memory,io} the percentage of wall-clock time during which tasks were stalled on each resource. PSI some means at least one task stalled; PSI full means all non-idle tasks stalled — if full is non-zero for more than 10 seconds on memory or I/O, you have a crunch even if free-RAM looks plentiful. node_exporter exposes PSI as node_pressure_* metrics; alerting on their rate is the modern replacement for load-average thresholds.
