WFI silently halted our PMU counter on Cortex-R52

TL;DR. When you measure interrupt-period jitter via PMCCNTR_EL0 on Cortex-R52, an idle thread that issues wfi will silently halt the PMU cycle counter. Your medians collapse from wall-clock cycles to CPU-active-only cycles, and any apples-to-apples comparison with an RTOS whose idle is a busy loop becomes meaningless. Use a tight nop busy loop instead. Also: don’t put MMIO probes in your idle hot loop. Draining the outstanding bus transaction delays the next exception entry by a variable amount, which showed up as a ~130-cycle IQR in our data.

Setup

We were comparing the LightOS 100 Hz tick on Cortex-R52 silicon against ThreadX 6.4 on the same chip, same compiler, same tick rate. The metric was the cycle delta between consecutive timer-FIQ entries captured via PMCCNTR_EL0 (free-running with PMCR_EL0.D=0), batched in a 16K-sample ring, dumped over JTAG.
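To make the metric concrete, here is a minimal host-side sketch of how the raw captures can be turned into per-tick deltas. It assumes each ring entry is a 32-bit snapshot of the cycle counter; the function name `tick_deltas` is ours for illustration and is not part of the LightOS harness. At 600 MHz a 32-bit counter wraps roughly every 7.2 seconds, so a 16K-sample capture at 100 Hz spans many wraps and the subtraction has to be modular.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert raw cycle-counter captures (one per timer-FIQ entry) into
// per-tick deltas. Assumption: captures are 32-bit counter snapshots,
// and at most one counter wrap happens between consecutive captures.
std::vector<uint32_t> tick_deltas(const std::vector<uint32_t>& captures) {
    std::vector<uint32_t> deltas;
    if (captures.size() < 2) {
        return deltas;  // need at least two entries to form one delta
    }
    deltas.reserve(captures.size() - 1);
    for (std::size_t i = 1; i < captures.size(); ++i) {
        // Unsigned subtraction is modulo 2^32, so a single wrap is handled.
        deltas.push_back(static_cast<uint32_t>(captures[i] - captures[i - 1]));
    }
    return deltas;
}
```

A ring of N captures yields N − 1 deltas; those deltas are the samples that the statistics in this post describe.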

ThreadX measured a 1-cycle IQR over 16,384 samples, with max-min = 17 cycles. LightOS measured a 130-cycle IQR, with max-min ≈ 250 cycles. Same chip. Same handler entry path. The IQR differs by two orders of magnitude.
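For readers who want to reproduce the summary numbers from their own captures, the following is a rough sketch of the three statistics quoted above. The quartile rule used here (index `q * (n - 1)`, truncated, on the sorted samples) is an assumption; other quartile definitions give slightly different IQRs on small sample sets. `JitterStats` and `summarize` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct JitterStats {
    uint64_t median;
    uint64_t iqr;    // Q3 - Q1
    uint64_t range;  // max - min
};

// Summarize per-tick cycle deltas. The vector is taken by value so it can
// be sorted locally without disturbing the caller's capture order.
JitterStats summarize(std::vector<uint64_t> deltas) {
    assert(!deltas.empty());
    std::sort(deltas.begin(), deltas.end());
    const std::size_t n = deltas.size();
    // Assumed quartile rule: element at index floor((n - 1) * num / den).
    auto at = [&](std::size_t num, std::size_t den) {
        return deltas[(n - 1) * num / den];
    };
    const uint64_t q1 = at(1, 4);
    const uint64_t med = at(1, 2);
    const uint64_t q3 = at(3, 4);
    return {med, q3 - q1, deltas.back() - deltas.front()};
}
```

Reporting the median alongside the IQR matters: the IQR alone cannot tell you whether the counter is measuring the quantity you think it is.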

The wrong hypothesis

The first explanation we reached for was “the C++ handler dispatch adds variable latency in front of the PMU read site.” We moved the PMU capture earlier, into hand-written ARM assembly using the banked r8_fiq registers, the same instruction sequence the ThreadX handler uses. It had zero effect on the measurement. Hypothesis falsified.

The actual cause

The LightOS idle thread polled six MMIO registers per loop iteration for live diagnostics — GIC distributor state, timer interrupt status, GIC active/pending bits, the highest-pending-interrupt registers. Innocent enough on a desktop. On embedded silicon, every one of those reads goes out over the AXI bus and leaves the bus with an in-flight transaction.

When a timer FIQ then arrives, the exception entry has to wait for the bus to drain before the CPU can take the interrupt. The drain time depends on which transaction was outstanding, what the interconnect was doing, and whether the response came from a near or far device. That variable-latency drain was the entire 130-cycle IQR.

Removing the MMIO probes from the idle loop should fix it. We tried the obvious thing first: replace the loop body with wfi.

The trap: WFI clock-gates the core

After the wfi change, the IQR collapsed beautifully — 1 cycle, just like ThreadX. But the median also collapsed, from 6,000,002 cycles per tick (which is wall-clock 10 ms at 600 MHz) to roughly 10,000 cycles per tick.

That number — 10,000 — is the CPU-active cycles per tick. The CPU spent 99.83% of every 10 ms period asleep, woke up to take the FIQ, ran the handler, and went back to sleep. PMCCNTR_EL0 only counted while the CPU was running.
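The arithmetic behind those figures is short enough to check by hand. As a sketch using only the numbers quoted above (600 MHz core clock, 100 Hz tick, roughly 10,000 active cycles per tick), with constant and function names chosen for illustration:

```cpp
#include <cstdint>

constexpr uint64_t kCpuHz = 600000000ULL;  // 600 MHz core clock
constexpr uint64_t kTickHz = 100;          // 100 Hz periodic tick
constexpr uint64_t kCyclesPerTick = kCpuHz / kTickHz;

// Fraction of the tick period during which the counter did not advance,
// given how many cycles it actually counted per period.
constexpr double asleep_fraction(uint64_t active_cycles, uint64_t period_cycles) {
    return 1.0 - static_cast<double>(active_cycles) /
                     static_cast<double>(period_cycles);
}

// 10 ms of wall clock at 600 MHz is 6,000,000 cycles.
static_assert(kCyclesPerTick == 6000000ULL, "expected wall-clock period");
```

With 10,000 counted cycles out of 6,000,000, `asleep_fraction` evaluates to about 0.9983, which is the 99.83% figure above.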

Cortex-R52 implements WFI by clock-gating the core. PMCCNTR is clocked from the same source. So WFI silently turns “wall-clock cycle counter” into “CPU-active cycle counter”. Both are valid quantities but they are not the same quantity, and you cannot compare an RTOS that uses wfi in idle against one that uses a busy loop and call the result an apples-to-apples comparison.

The Arm Architecture Reference Manual does call this out, but only implicitly: WFI is a hint to enter low-power state; an implementation may stop the clocks; PMU counters are clocked by the same domain. You won’t find a single sentence saying “WFI invalidates your PMCCNTR-based jitter measurement.” We had to learn it from the data.

The fix: a single nop

The right idle loop is the same one ThreadX uses — a tight busy loop with a single nop:

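extern "C" void idle_thread_entry() {
    // Spin forever: no MMIO reads, no wfi, so the core stays clocked.
    while (true) {
        // The "memory" clobber keeps the compiler from reordering memory
        // accesses across the nop.
        asm volatile("nop" ::: "memory");
    }
}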

No MMIO. No WFI. The core stays clocked, PMCCNTR keeps counting wall clock, and the bus has nothing in flight when a FIQ arrives.

After the fix, on the same silicon, same compiler, same tick rate:

metric             LightOS m01 FIQ    ThreadX 6.4 FIQ
n                  16,384             16,384
median (cyc)       6,000,002          6,000,002
IQR (cyc)          1                  1
max − min (cyc)    17                 17
stdev (cyc)        1                  1

Identical on every statistic in the table.

Lessons for anyone porting an RTOS or measuring ISR jitter on R52

Look at the competitor’s idle path before publishing a comparison number. If theirs is a busy loop and yours uses WFI, you are not measuring the same thing.
Don’t put MMIO in the idle hot loop. Even one read per iteration leaves an outstanding bus transaction that perturbs the next exception’s entry latency. If you need live diagnostics, run them off a separate thread that yields immediately, or off a timer-driven snapshot.
Don’t reach for wfi to “fix” jitter without measuring the median too. A tighter IQR on a different metric is not progress.
PMCCNTR_EL0 is wall-clock if and only if the core is clocked. On R52 that means: no wfi, no wfe, no clock-gating power-management. Every benchmark methodology document should state this assumption explicitly.
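One cheap guard that follows from these lessons is to check the measured median against the expected wall-clock period before trusting any jitter number. The sketch below is our own illustration, not part of the published harness; the function name and the tolerance parameter are hypothetical, and a sensible tolerance depends on your timer and clock setup.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the measured median tick period is within
// `tolerance_cycles` of the wall-clock period cpu_hz / tick_hz.
// A median far below that suggests the counter stopped while the core
// was idle (for example, because the idle loop executed wfi).
bool counts_wall_clock(uint64_t median_cycles, uint64_t cpu_hz,
                       uint64_t tick_hz, uint64_t tolerance_cycles) {
    assert(tick_hz != 0);
    const uint64_t expected = cpu_hz / tick_hz;
    const uint64_t diff = median_cycles > expected
                              ? median_cycles - expected
                              : expected - median_cycles;
    return diff <= tolerance_cycles;
}
```

With the numbers from this post, a median of 6,000,002 passes at 600 MHz and 100 Hz with a tolerance of 100 cycles, while the 10,000-cycle median from the wfi run fails immediately.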

What about energy?

Yes — busy-loop idle costs more power than wfi idle. For a deterministic-jitter benchmark you accept that cost so the numbers mean something. For a shipping product, the right answer is usually tickless idle: extend the timer to the next deadline, then wfi, and accept that your “jitter” measurement on a tickless system is a different (more useful) quantity than periodic-tick jitter.
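As a rough illustration of the tickless pattern, the sketch below picks the earliest pending deadline, arms the timer for it, and only then sleeps. Everything hardware-specific is an assumption: `IdleHooks`, `arm_timer_at`, and `wait_for_interrupt` are hypothetical stand-ins for whatever your timer driver and wfi wrapper look like, and a real implementation also has to deal with interrupt masking and races between the deadline check and the sleep.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical hardware hooks, injected so the logic can run off-target.
struct IdleHooks {
    std::function<void(uint64_t)> arm_timer_at;  // absolute wake-up time
    std::function<void()> wait_for_interrupt;    // would execute wfi on target
};

// One pass of a tickless idle loop. Returns true if the core slept.
bool tickless_idle_once(uint64_t now, const std::vector<uint64_t>& deadlines,
                        const IdleHooks& hooks) {
    if (deadlines.empty()) {
        return false;  // nothing scheduled; leave the policy to the caller
    }
    const uint64_t next = *std::min_element(deadlines.begin(), deadlines.end());
    if (next <= now) {
        return false;  // something is already due; run it instead of sleeping
    }
    hooks.arm_timer_at(next);    // extend the timer to the next deadline
    hooks.wait_for_interrupt();  // then sleep until it (or another IRQ) fires
    return true;
}
```

On a system built this way the interval between timer interrupts is no longer constant, which is why the periodic-tick jitter metric in this post does not carry over directly.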

The benchmarks in this post used periodic tick because the customer specified periodic tick. We will publish a tickless-idle measurement separately.

Reproduction

The full harness, capture scripts, and raw .bin files are open under the LightOS measurements tree. The before-and-after data for this post is at the 2026-05-01 tag, against build SHAs de52cf7 (idle-thread fix) and c62ad28 (determinism report).

See the validation guide for the full table. Detailed release notes are available to LightOS customers on request.