WFI silently halted our PMU counter on Cortex-R52
TL;DR. When you measure interrupt-period jitter via PMCCNTR_EL0
on Cortex-R52, an idle thread that issues wfi will silently halt the
PMU counter. Your medians collapse from wall-clock cycles to
CPU-active-only cycles, and any apples-to-apples comparison with an
RTOS whose idle is a busy loop becomes meaningless. Use a tight nop
busy loop instead. Also: don’t put MMIO probes in your idle hot loop —
the bus-drain perturbs the next exception entry by ~130 cycles.
Setup
We were comparing the LightOS 100 Hz tick on Cortex-R52 silicon
against ThreadX 6.4 on the same chip, same compiler, same tick rate.
The metric was the cycle delta between consecutive timer-FIQ entries
captured via PMCCNTR_EL0 (free-running with PMCR_EL0.D=0),
batched in a 16K-sample ring, dumped over JTAG.
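A minimal sketch of that capture path, not the LightOS or ThreadX handler itself. It assumes the AArch32 CP15 encoding of the cycle counter (the register this post calls PMCCNTR_EL0) and assumes the counter has already been enabled. The names `read_cycle_counter`, `SampleRing`, and `g_fake_cycles` are illustrative, and the non-Arm branch is a hypothetical stand-in so the delta logic can be exercised off-target.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

#if !(defined(__arm__) && !defined(__aarch64__))
// Hypothetical host-side stand-in for the hardware counter.
static std::uint32_t g_fake_cycles = 0;
#endif

// Read the 32-bit view of the PMU cycle counter.
inline std::uint32_t read_cycle_counter() {
#if defined(__arm__) && !defined(__aarch64__)
    std::uint32_t v;
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));  // PMCCNTR
    return v;
#else
    return g_fake_cycles;
#endif
}

constexpr std::size_t kRingSize = 16384;  // 16K samples, as in the post

struct SampleRing {
    std::array<std::uint32_t, kRingSize> delta{};
    std::size_t count = 0;   // total deltas recorded so far
    std::uint32_t last = 0;  // counter value at the previous FIQ entry
    bool primed = false;     // false until the first sample is seen

    // Call once per timer-FIQ entry with the freshly read counter value.
    void record(std::uint32_t now) {
        if (primed) {
            // Unsigned subtraction is wrap-safe modulo 2^32.
            delta[count % kRingSize] = now - last;
            ++count;
        }
        last = now;
        primed = true;
    }
};
```

The wrap-safe subtraction only holds while one tick period is shorter than 2^32 cycles, which is comfortably true for a 6,000,002-cycle tick.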
ThreadX measured a 1-cycle IQR over 16,384 samples, with max-min = 17 cycles. LightOS measured a 130-cycle IQR, with max-min ≈ 250 cycles. Same chip. Same handler entry path. The IQRs differ by two orders of magnitude.
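For readers who want to reproduce the summary numbers from a dumped ring, here is one way to compute them. It is a sketch under assumptions: `JitterStats` and `summarize` are hypothetical names, and the quartiles use a simple floor-index rule on the sorted samples, so other quantile conventions can differ by a cycle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct JitterStats {
    std::uint32_t median;
    std::uint32_t iqr;    // Q3 - Q1
    std::uint32_t range;  // max - min
};

// Summarize per-tick cycle deltas. Takes the vector by value so the
// caller's capture order is left untouched by the sort.
inline JitterStats summarize(std::vector<std::uint32_t> deltas) {
    assert(!deltas.empty());
    std::sort(deltas.begin(), deltas.end());
    const std::size_t n = deltas.size();
    // Quantile num/den, picked at index floor((n - 1) * num / den).
    auto at = [&](std::size_t num, std::size_t den) -> std::uint32_t {
        return deltas[(n - 1) * num / den];
    };
    return JitterStats{at(1, 2), at(3, 4) - at(1, 4),
                       deltas.back() - deltas.front()};
}
```

Reporting the median alongside the IQR matters here: the IQR alone cannot tell a wall-clock counter from one that stopped during idle.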
The wrong hypothesis
The first explanation we reached for was “the C++ handler dispatch
adds variable-latency in front of the PMU read site.” We moved the
PMU capture earlier, into hand-written ARM assembly, with banked
r8_fiq registers — same instruction sequence the ThreadX handler
uses. Zero effect on the measurement. Hypothesis falsified.
The actual cause
The LightOS idle thread polled six MMIO registers per loop iteration for live diagnostics — GIC distributor state, timer interrupt status, GIC active/pending bits, the highest-pending-interrupt registers. Innocent enough on a desktop. On embedded silicon, every one of those reads goes out over the AXI bus and leaves the bus with an in-flight transaction.
When a timer FIQ then arrives, the exception entry has to wait for the bus to drain before the CPU can take the interrupt. The drain time depends on which transaction was outstanding, what the interconnect was doing, and whether the response came from a near or far device. That variable-latency drain was the entire 130-cycle IQR.
Removing the MMIO probes from the idle loop should fix it. We tried
the obvious thing first: replace the loop body with wfi.
The trap: WFI clock-gates the core
After the wfi change, the IQR collapsed beautifully — 1 cycle, just
like ThreadX. But the median also collapsed, from 6,000,002
cycles per tick (which is wall-clock 10 ms at 600 MHz) to roughly
10,000 cycles per tick.
That number — 10,000 — is the CPU-active cycles per tick. The CPU
spent 99.83% of every 10 ms period asleep, woke up to take the FIQ,
ran the handler, and went back to sleep. PMCCNTR_EL0 only counted
while the CPU was running.
Cortex-R52 implements WFI by clock-gating the core. PMCCNTR is
clocked from the same source. So WFI silently turns “wall-clock cycle
counter” into “CPU-active cycle counter”. Both are valid quantities
but they are not the same quantity, and you cannot compare an RTOS
that uses wfi in idle against one that uses a busy loop and call
the result an apples-to-apples comparison.
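One cheap guard is to compare the measured median against the cycle count a wall-clock counter must produce. The sketch below uses this post's figures (600 MHz core clock, 100 Hz tick); the function names and the 1000 ppm default tolerance are assumptions chosen for illustration.

```cpp
#include <cstdint>

// Cycles a free-running counter should accumulate per tick,
// e.g. 600 MHz / 100 Hz = 6,000,000.
constexpr std::uint64_t expected_cycles_per_tick(std::uint64_t core_hz,
                                                 std::uint64_t tick_hz) {
    return core_hz / tick_hz;
}

// True when the measured median is within tolerance_ppm of the
// expected wall-clock value; false suggests the counter stopped
// for part of the period.
constexpr bool counts_wall_clock(std::uint64_t median_cycles,
                                 std::uint64_t core_hz,
                                 std::uint64_t tick_hz,
                                 std::uint64_t tolerance_ppm = 1000) {
    const std::uint64_t expected = expected_cycles_per_tick(core_hz, tick_hz);
    const std::uint64_t diff = median_cycles > expected
                                   ? median_cycles - expected
                                   : expected - median_cycles;
    return diff * 1'000'000 <= expected * tolerance_ppm;
}

// Busy-loop idle: the median matches wall clock.
static_assert(counts_wall_clock(6'000'002, 600'000'000, 100));
// wfi idle: roughly 10,000 counted cycles per tick fails the check.
static_assert(!counts_wall_clock(10'000, 600'000'000, 100));
```

The second case also gives the active fraction directly: 10,000 / 6,000,002 is about 0.17%, which is the 99.83% asleep figure quoted above.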
The Arm Architecture Reference Manual does call this out, but only implicitly: WFI is a hint to enter low-power state; an implementation may stop the clocks; PMU counters are clocked by the same domain. You won’t find a single sentence saying “WFI invalidates your PMCCNTR-based jitter measurement.” We had to learn it from the data.
The fix: a single nop
The right idle loop is the same one ThreadX uses — a tight busy loop
with a single nop:
    extern "C" void idle_thread_entry() {
        while (true) {
            // One nop per iteration; the "memory" clobber is a compiler barrier.
            asm volatile("nop" ::: "memory");
        }
    }
No MMIO. No WFI. The core stays clocked, PMCCNTR keeps counting wall clock, and the bus has nothing in flight when a FIQ arrives.
After the fix, on the same silicon, same compiler, same tick rate:
| metric | LightOS m01 FIQ | ThreadX 6.4 FIQ |
|---|---|---|
| n | 16,384 | 16,384 |
| median (cyc) | 6,000,002 | 6,000,002 |
| IQR (cyc) | 1 | 1 |
| max − min (cyc) | 17 | 17 |
| stdev (cyc) | 1 | 1 |
Identical on every summary statistic.
Lessons for anyone porting an RTOS or measuring ISR jitter on R52
- Look at the competitor’s idle path before publishing a comparison number. If theirs is a busy loop and yours uses WFI, you are not measuring the same thing.
- Don’t put MMIO in the idle hot loop. Even one read per iteration leaves an outstanding bus transaction that perturbs the next exception’s entry latency. If you need live diagnostics, run them off a separate thread that yields immediately, or off a timer-driven snapshot.
- Don’t reach for wfi to “fix” jitter without measuring the median too. A tighter IQR on a different metric is not progress.
- PMCCNTR_EL0 is wall-clock if and only if the core is clocked. On R52 that means: no wfi, no wfe, no clock-gating power management. Every benchmark methodology document should state this assumption explicitly.
What about energy?
Yes — busy-loop idle costs more power than wfi idle. For a
deterministic-jitter benchmark you accept that cost so the numbers
mean something. For a shipping product, the right answer is usually
tickless idle: extend the timer to the next deadline, then wfi,
and accept that your “jitter” measurement on a tickless system is a
different (more useful) quantity than periodic-tick jitter.
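The shape of a tickless idle pass can be sketched as follows. This is only an outline under stated assumptions: `IdleHooks`, `program_timer`, `wait_for_interrupt`, and `tickless_idle_once` are hypothetical names, not LightOS or ThreadX APIs, and a real port would also have to handle deadlines that have already passed and the window between arming the timer and sleeping.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical platform hooks; a real port would supply these.
struct IdleHooks {
    void (*program_timer)(std::uint64_t deadline);  // arm a one-shot compare
    void (*wait_for_interrupt)();                   // e.g. wfi on target
};

// One pass of a tickless idle loop: arm the timer for the earliest
// pending deadline, then sleep until an interrupt arrives.
// Returns false when there is no deadline to wait for.
inline bool tickless_idle_once(const IdleHooks& hooks,
                               const std::uint64_t* deadlines,
                               std::size_t n) {
    if (n == 0) {
        return false;
    }
    std::uint64_t earliest = deadlines[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (deadlines[i] < earliest) {
            earliest = deadlines[i];
        }
    }
    hooks.program_timer(earliest);
    hooks.wait_for_interrupt();
    return true;
}
```

Because the core sleeps between deadlines, a PMCCNTR-based delta on such a system counts active cycles only, so any jitter figure taken this way needs a different time base or a different definition.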
The benchmarks in this post used periodic tick because the customer specified periodic tick. We will publish a tickless-idle measurement separately.
Reproduction
The full harness, capture scripts, and raw .bin files are open
under the LightOS measurements tree. The before-and-after data for
this post is at the 2026-05-01 tag against build SHA de52cf7
(idle-thread fix) and c62ad28 (determinism report).
See the validation guide for the full table. Detailed release notes are available to LightOS customers on request.