
USE on Linux: CPU, memory, disk, network, and PSI

How to collect utilization, saturation, and error signals for every Linux resource — and why Pressure Stall Information (PSI, kernel 4.20+) is the modern saturation signal that replaces load averages.


A Stripe webhook worker stalled for 11 minutes in 2024. Free-RAM dashboards showed plenty of available memory. The actual cause was kernel memory pressure at 90% — the kernel was thrashing reclaim. PSI caught the next incident in 30 seconds. Free-RAM dashboards still showed nothing.

The USE resource checklist for Linux

What makes USE practical is that it gives you a finite sweep: for every resource, check exactly three cells and move on. When you are debugging an incident, that structure prevents both tunnel vision (checking only CPU) and paralysis (checking everything at once). In node_exporter + Prometheus, the key cells from Brendan Gregg’s original ~30-resource checklist are:

Resource	Utilization metric	Saturation metric	Error metric
CPU	1 - rate(node_cpu_seconds_total{mode="idle"}), averaged across cores	node_pressure_cpu_waiting_seconds_total (PSI) or vmstat r column	MCE errors (rare)
Memory	1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes	node_pressure_memory_waiting_seconds_total (PSI) or vmstat si/so	ECC errors, node_memory_HardwareCorrupted_bytes
Disk I/O	rate(node_disk_io_time_seconds_total) (%util)	rate(node_disk_io_time_weighted_seconds_total) (average queue depth) + PSI io	node_disk_read_errors_total (where your node_exporter build exports it), EIO in kernel logs
Network	rate(node_network_transmit_bytes_total) / link speed	node_network_transmit_drop_total, TCP retransmits	node_network_receive_errs_total, CRC errors
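As a rough illustration of the CPU utilization cell, the sketch below computes "1 minus idle" from two samples of the aggregate cpu line in /proc/stat, which is the same counter family node_cpu_seconds_total is built from. The function name and the sample lines are made up for the example; on a real host you would read /proc/stat twice, a few seconds apart.

```python
def cpu_utilization(prev_stat_line: str, curr_stat_line: str) -> float:
    """Non-idle fraction of CPU time between two aggregate 'cpu' lines."""

    def parse(line: str):
        fields = line.split()
        if fields[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
        # Columns: user nice system idle iowait irq softirq steal.
        # Guest columns are already counted inside user/nice, so they are
        # skipped here to avoid double counting.
        ticks = [int(x) for x in fields[1:9]]
        return sum(ticks), ticks[3]

    prev_total, prev_idle = parse(prev_stat_line)
    curr_total, curr_idle = parse(curr_stat_line)
    delta_total = curr_total - prev_total
    if delta_total <= 0:
        return 0.0
    # Only 'idle' is subtracted, mirroring the mode="idle" rule in the table.
    return 1.0 - (curr_idle - prev_idle) / delta_total


# Hypothetical samples, not taken from a real machine.
before = "cpu 1000 0 500 8000 400 50 50 0 0 0"
after = "cpu 1300 0 600 8500 450 75 75 0 0 0"
print(f"{cpu_utilization(before, after):.2f}")
```

Note that iowait counts as "busy" here because only idle is subtracted; whether that is the right choice depends on what you want the utilization number to mean.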

PSI: Pressure Stall Information

PSI was merged into Linux 4.20 (December 2018) by Johannes Weiner at Facebook, solving a long-standing observability gap: pre-PSI you had to infer pressure from load averages, vmstat’s r column, and iowait — none of which scale correctly across many cores or capture transient saturation.

PSI exposes a two-line report per resource at /proc/pressure/{cpu,memory,io}. Each line carries three running averages and a cumulative total:

some avg10=5.20 avg60=3.11 avg300=1.02 total=1234567
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

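some: the share of wall-clock time in which at least one task was stalled waiting for this resource. avg10, avg60, and avg300 are percentages over the last 10 s, 60 s, and 300 s; total is the cumulative stall time in microseconds.
full: the share of wall-clock time in which all non-idle tasks were stalled at once. This is the critical signal: if full stays above 0 for memory or I/O for more than 10 seconds, you almost certainly have a crunch even if free-RAM looks plentiful (the kernel is thrashing reclaim).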
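A minimal sketch of reading this format, assuming the two-line layout shown above. The helper name parse_pressure is hypothetical; it takes the file contents as a string so it can be tried on any machine, including kernels without PSI.

```python
from typing import Dict, Union

Number = Union[int, float]


def parse_pressure(text: str) -> Dict[str, Dict[str, Number]]:
    """Parse the contents of a /proc/pressure/* file into nested dicts."""
    report: Dict[str, Dict[str, Number]] = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()  # kind is "some" or "full"
        values: Dict[str, Number] = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # avg* fields are percentages; total is microseconds of stall.
            values[key] = int(raw) if key == "total" else float(raw)
        report[kind] = values
    return report


sample = """some avg10=5.20 avg60=3.11 avg300=1.02 total=1234567
full avg10=0.00 avg60=0.00 avg300=0.00 total=0"""

report = parse_pressure(sample)
print(report["some"]["avg10"], report["full"]["total"])
```

On a host with PSI enabled, the same function should work on the text returned by reading /proc/pressure/memory or its cpu and io siblings.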

PSI overhead: sub-1% (kernel-side counters, no per-task tracing).

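node_exporter exposes the some counters as node_pressure_cpu_waiting_seconds_total, node_pressure_memory_waiting_seconds_total, and node_pressure_io_waiting_seconds_total, and the full counters as node_pressure_memory_stalled_seconds_total and node_pressure_io_stalled_seconds_total.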

Why PSI beats load average

Load average mixes runnable and uninterruptible tasks, scales with core count, and is exponentially damped over 1, 5, and 15 minutes, so a transient 5-second CPU crunch barely registers even in the 1-minute figure. On a 64-core host, a load average of 10 is fine; on a 2-core host it is a disaster. PSI reports the share of wall-clock time lost to stalls, so the same threshold means the same thing at any core count.
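To see how much the damping hides, the sketch below approximates the 1-minute load average as an exponentially damped average updated every 5 seconds. It is a floating-point approximation for illustration only (the kernel uses fixed-point arithmetic), and the function name and the 32-task spike are assumptions made for the example.

```python
import math
from typing import Iterable


def load1_after(samples: Iterable[float], interval_s: float = 5.0,
                window_s: float = 60.0, start: float = 0.0) -> float:
    """Apply exponential damping to a series of active-task counts."""
    decay = math.exp(-interval_s / window_s)
    load = start
    for active_tasks in samples:
        # Each update keeps most of the old value and blends in a little
        # of the new sample.
        load = load * decay + active_tasks * (1.0 - decay)
    return load


# One 5-second sample with 32 runnable tasks on an otherwise idle host.
spike = load1_after([32])
print(round(spike, 2))

# The same spike followed by 55 idle seconds (11 further updates).
faded = load1_after([32] + [0] * 11)
print(round(faded, 2))
```

A single update moves the figure by only about 8% of the spike's height, which is why a short crunch that users clearly feel can leave the load average looking unremarkable.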

Load average is a core-count-sensitive, spike-hiding proxy that you must guess a threshold for; PSI measures wall-clock stall time directly, so the same alert rule holds on any host. That is why PSI replaces load average as the saturation signal.

Production alert thresholds:

rate(node_pressure_cpu_waiting_seconds_total[5m]) > 0.1 (10% of time stalled on CPU) → warning
rate(node_pressure_memory_stalled_seconds_total[5m]) > 0 for 10 min → memory crunch, page immediately
rate(node_pressure_io_stalled_seconds_total[5m]) > 0 for 5 min → I/O crunch

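Kubernetes is converging on the same signal: recent kubelet releases can report PSI metrics for nodes, pods, and containers (behind the KubeletPSI feature gate), although eviction decisions still rely on the kubelet's own memory and disk thresholds.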
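The alert expressions above take the rate of a cumulative stall counter, which works out to "seconds stalled per second of wall-clock time". The sketch below does the same arithmetic by hand on two readings of the total field from /proc/pressure/cpu. The function names and the sample readings are hypothetical; the 0.1 default mirrors the CPU warning threshold listed above.

```python
def stall_fraction(total_start_us: int, total_end_us: int,
                   elapsed_s: float) -> float:
    """Fraction of wall-clock time stalled between two PSI 'total' readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    # PSI totals are cumulative microseconds of stall time.
    stalled_s = (total_end_us - total_start_us) / 1_000_000
    return stalled_s / elapsed_s


def cpu_warning(fraction: float, threshold: float = 0.1) -> bool:
    """True when stall time exceeds the warning threshold (10% by default)."""
    return fraction > threshold


# Hypothetical readings: 45 s of 'some' CPU stall over a 300 s window.
fraction = stall_fraction(1_000_000, 46_000_000, 300.0)
print(f"{fraction:.2f}", cpu_warning(fraction))
```

Because the result is a fraction of wall-clock time, the same threshold can be reused on a 2-core and a 64-core host without rescaling.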

How USE signals look under failure

CPU-bound workload: Utilization climbs toward 100%, saturation (run-queue length) rises above the core count. This is the correct signal that scaling out will help. CPU per request stays flat — it is volume, not regression.

Memory leak: Utilization climbs slowly, and si/so (swap-in / swap-out) eventually become non-zero. PSI memory full goes from 0 to non-zero because the kernel is thrashing reclaim to free pages. This is the pattern behind the Stripe worker stall described at the top: visible in PSI, invisible in free-RAM dashboards.

Disk-bound workload: %util near 100% on one or more disks, queue depth growing, PSI io some rising.

Network-bound: TCP retransmits rising (node_netstat_Tcp_RetransSegs), NIC drops rising, bandwidth at negotiated maximum.
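The four patterns above can be written down as a small triage function. This is only a sketch: the UseSnapshot fields, the function name, and every threshold are illustrative assumptions, not recommended production values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UseSnapshot:
    cpu_util: float           # 0..1, non-idle fraction
    run_queue: int            # vmstat r column
    cores: int
    swap_pages_per_s: float   # vmstat si + so
    psi_memory_full: float    # 0..1, fraction of time fully stalled
    disk_util: float          # 0..1, busiest disk
    psi_io_some: float        # 0..1
    tcp_retrans_per_s: float
    nic_drops_per_s: float


def suspects(s: UseSnapshot) -> List[str]:
    """Return the resources whose failure pattern matches the snapshot."""
    found = []
    # Thresholds below are placeholders chosen for the example.
    if s.cpu_util > 0.9 and s.run_queue > s.cores:
        found.append("cpu")
    if s.psi_memory_full > 0 or s.swap_pages_per_s > 0:
        found.append("memory")
    if s.disk_util > 0.9 and s.psi_io_some > 0.1:
        found.append("disk")
    if s.tcp_retrans_per_s > 0 and s.nic_drops_per_s > 0:
        found.append("network")
    return found


# Hypothetical leak: moderate CPU, but reclaim stalls and swap activity.
leak = UseSnapshot(cpu_util=0.4, run_queue=2, cores=8,
                   swap_pages_per_s=120.0, psi_memory_full=0.05,
                   disk_util=0.3, psi_io_some=0.02,
                   tcp_retrans_per_s=0.0, nic_drops_per_s=0.0)
print(suspects(leak))
```

The point of the structure is the sweep itself: every resource gets checked, so a quiet CPU does not end the investigation early.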

The USE method is a 4×3 sweep: for every resource, read one Utilization, one Saturation, and one Error signal — then map each cell to a concrete Linux tool. Saturation (run-queue, swap, iowait, backlog) is the signal free-resource dashboards miss; PSI is its modern, core-count-independent form.

Why this works

Why does PSI memory full fire even when free RAM looks plentiful? Because the kernel’s reclaim system runs in the background to keep a free-page reserve, and when that reclaim cannot keep pace with allocation, it starts blocking allocating tasks. This manifests as PSI memory full — tasks stalled waiting for the kernel to free pages — while MemAvailable still shows hundreds of megabytes. The free-RAM metric measures what is available right now; PSI measures what the kernel had to do to provide it.

Quiz

A pod's CPU utilization is 40% but real user requests are experiencing delays. What PSI signal confirms the CPU is actually causing stalls?

Quiz

PSI memory 'full' is non-zero for 15 seconds. What does this mean, even if MemAvailable shows 500 MB free?

Recall before you leave

01
What is PSI's 'some' vs 'full' distinction, and which is more severe?
02
Why is PSI cpu some > 20% a better alert than CPU utilization > 80%?
03
Name the four Linux resources in the USE checklist and one node_exporter saturation metric for each.

Recap

The USE method on Linux covers four main resource groups (CPU, memory, disk I/O, and network), each with a utilization metric (average % busy), a saturation metric (queue depth or stall time), and an error metric (ECC, EIO, NIC errors). Before kernel 4.20, saturation had to be inferred from load averages and vmstat columns, which scale poorly. Linux’s Pressure Stall Information (PSI), introduced by Johannes Weiner at Facebook and merged December 2018, exposes at /proc/pressure/{cpu,memory,io} the percentage of wall-clock time during which tasks were stalled on each resource. PSI some means at least one task stalled; PSI full means all non-idle tasks stalled. If full is non-zero for more than 10 seconds on memory or I/O, you have a crunch even if free-RAM looks plentiful. node_exporter exposes PSI as node_pressure_* metrics; alerting on their rate is the modern replacement for load-average thresholds. Now when you see a service latency spike with CPU utilization only at 40%, you know to check PSI cpu some first, because stall time, not busy time, is what users feel.


