May 9, 2026

How Small Can a Measured Region Be Before perf Counters Lie?

Profiling code with hardware performance counters introduces overhead that can completely dwarf the actual measurement. We measured the cost on AMD Zen 4 and two Intel generations: rdpmc becomes reliable above ~12K instructions on every platform, while the kernel-mediated ioctl path needs ~100K on AMD and ~400K on Intel. Below those thresholds, you are mostly measuring the instrumentation.

I started looking at this while working on a paper on cache coherency. The key measurement there is the cost of a single cache-line transfer, which means benchmarks at L1 / L2 cache sizes (a few KB at most). Prior work covers transfer latency well (Hackenberg et al., 2009; Molka et al., 2015); I wanted to see what hardware performance counters could add at that scale. perf stat is (obviously) too coarse for workloads that small: it measures the whole process. The counters need to be in the code itself. I wondered whether even the standard in-code path was too coarse for tight loops at L1/L2 scale.

On Linux, perf_event_open hands you a counter that can be started and stopped around any code region. The catch is that starting and stopping have a cost of their own. With perf stat this overhead disappears into the noise of a full program; in-code measurements have no such cushion, and around a tight loop the bracket itself can be larger than what is being measured.

Two paths exist for interacting with hardware performance counters from within code. The first goes through ioctl calls, asking the kernel to enable and disable counters. This is the standard path, and what most (in-code) profiling tools default to, including perf stat itself. The second reads counter values directly from user space using the rdpmc instruction, bypassing the kernel entirely. Setting aside that it is x86-only, that sounds like a clear win.

However, the perf_event_open man page is careful not to promise that:

Using rdpmc is not necessarily faster than other methods for reading event values.

The only way to know how much each actually costs, and at what measured region size each becomes trustworthy, is to measure it.

What the Counter Actually Sees

Before any numbers, a quick note on what “overhead of start/stop” actually means: the counter does not see the enable or disable operations themselves. It sees the asymmetric window between them.

On the ioctl path, the performance counters start counting at some point inside the syscall handler. Everything that follows — the rest of the handler, the syscall return path (sysret/iret, the KPTI page-table swap, retpoline / IBRS tail, restoring user state) — executes with the counter already running. Then the measured region itself runs. Then ioctl is invoked to disable counting: the full syscall entry sequence (KPTI swap, stack switch, dispatch into the handler) runs with the counter still on, until the disable point inside the handler is reached. Consequently, what the counter records, beyond the body, is the tail of enable, the head of disable, and a thin slice of user-space scaffolding inside the bracket.
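
In user-space terms, the bracket looks like this. A minimal sketch with the raw Linux API (our harness uses perf-cpp; this is only the mechanism underneath, with error handling omitted):

```cpp
#include <cstdint>
#include <cstdio>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

// glibc ships no perf_event_open() wrapper, so invoke the syscall directly.
static int perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                           int group_fd, unsigned long flags) {
  return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
  perf_event_attr attr{};
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_HARDWARE;
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;
  attr.disabled = 1;  // created stopped; enabled explicitly below
  // Kernel-mode instructions are counted too (perf_event_paranoid = -1 in our
  // setup), which is how the enable tail and disable head end up in the result.

  const int fd = perf_event_open(&attr, 0 /* this thread */, -1 /* any CPU */, -1, 0);
  if (fd < 0) return 1;

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);   // kernel round-trip; its tail is counted

  // ... measured region ...

  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  // kernel round-trip; its head is counted

  std::uint64_t count = 0;
  read(fd, &count, sizeof(count));
  std::printf("instructions: %llu\n", static_cast<unsigned long long>(count));
  close(fd);
}
```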

On the rdpmc path, the bracket reduces to a few user-space reads. No kernel round-trip, no privilege transition; only the read instructions themselves and whatever harness surrounds them. This is the entire reason the two paths differ in cost.
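
For comparison, the user-space read that the rdpmc path boils down to is the sequence documented in the perf_event_open(2) man page: mmap the event's first page, read the counter index and offset under a seqlock, and issue rdpmc. A sketch for x86-64 with GCC/Clang (not perf-cpp's actual implementation, just the mechanism it has to build on):

```cpp
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/perf_event.h>

// Raw rdpmc: returns the current value of hardware counter `idx`.
static inline std::uint64_t rdpmc(std::uint32_t idx) {
  std::uint32_t lo, hi;
  asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
  return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// User-space read of a perf event, following the perf_event_open(2) man page.
// `pc` is the event fd's first page:
//   mmap(nullptr, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0)
static std::uint64_t read_counter(const perf_event_mmap_page* pc) {
  std::uint64_t count;
  std::uint32_t seq;
  do {
    seq = pc->lock;                        // seqlock: retry if the kernel races with us
    __sync_synchronize();
    const std::uint32_t idx = pc->index;   // hardware counter number + 1; 0 = not readable
    count = pc->offset;
    if (pc->cap_user_rdpmc && idx) {
      std::uint64_t pmc = rdpmc(idx - 1);
      pmc <<= 64 - pc->pmc_width;          // counters are narrower than 64 bits:
      count += static_cast<std::uint64_t>( // sign-extend before accumulating
          static_cast<std::int64_t>(pmc) >> (64 - pc->pmc_width));
    }
    __sync_synchronize();
  } while (pc->lock != seq);
  return count;
}
```

Bracketing a region is then two such reads and a subtraction; no syscall, no privilege transition.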

Methodology

We ran a random-access benchmark across working set sizes from 1 KB to 32 MB, using perf-cpp to drive both counter paths. Each iteration accesses one cache line at a shuffled index: exactly 6 instructions, 2 memory loads, and 1 branch per iteration, regardless of working set size. Those fixed, known baselines are the point: any excess in the measured counts comes from measurement overhead, not the workload.
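
The real harness lives in the companion repo's main.cpp; the following is only a sketch of the loop's shape (names like CacheLine and touch_random_lines are illustrative, and the exact 6/2/1 per-iteration counts depend on how the compiler lowers the loop):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// One element per 64-byte cache line.
struct alignas(64) CacheLine { std::uint64_t value; };

std::uint64_t touch_random_lines(std::size_t working_set_bytes) {
  const std::size_t n = working_set_bytes / sizeof(CacheLine);
  std::vector<CacheLine> data(n);
  std::vector<std::uint32_t> order(n);
  std::iota(order.begin(), order.end(), 0u);
  std::shuffle(order.begin(), order.end(), std::mt19937{42});  // randomize the access order

  // Measured region: per iteration, one load of the shuffled index, one load of
  // the cache line it names, an add, and the loop branch.
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    sum += data[order[i]].value;
  }
  return sum;  // returned so the compiler cannot drop the loop
}
```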

We picked three machines to separate vendor effects from micro-architecture ones: AMD Zen 4, Intel Skylake (desktop i7), and Intel Sapphire Rapids (Xeon server). The two Intel chips are eight years apart — if the ioctl overhead were purely a micro-architectural cost, they should diverge. perf-cpp exposes the ioctl path through its EventCounter (docs) interface and the rdpmc path through LiveEventCounter (docs).

We made the setup as constant as possible across the three machines: each benchmark pinned to a single core (using taskset), turbo disabled, and kernel.perf_event_paranoid = -1; kernels were 6.8.0-101 (Skylake), 6.8.0-106 (Sapphire Rapids), and 6.17.0-23 (Zen 4).

One note: Spectre / KPTI mitigations were left at distribution defaults. We did not toggle them, so any claim about them below is based on the cross-platform pattern, not on direct A/B tests.

And one more: each platform was tested on one machine. The instruction-count constants should be deterministic (they come from kernel code paths), but cycle numbers can shift with silicon variation, BIOS settings, and kernel version.

We measured four counters: instructions, cycles, L1-dcache-loads, and branches. Instructions, loads, and branches have constant per-iteration baselines (6, 2, and 1) and carry the main analysis. Branches reproduce the instruction thresholds exactly, so we leave them out of the per-platform walkthrough. Cycles have no fixed baseline (cache-miss latency grows with the working set), but the same qualitative pattern shows up there too, and we use them below as a cross-check on the kernel-path argument.

Code and plots. Benchmark source, all plots (including cycles and branches), and supporting files live in the companion repo: see main.cpp for the random-access benchmark and plots/ for the figures.


Anatomy on AMD Zen 4

We start with Zen 4 in detail; the shape we see here repeats on the Intel chips with different magnitudes.

Instructions per cache line, AMD Zen 4

Instructions. At 1 KB of working set (16 iterations or accessed cache lines), ioctl reports 193 instructions per iteration; the true number is 6 (the red dashed line). That works out to roughly 3,000 extra instructions per start/stop bracket, dominating at small working sets. Reminder: this is the overhead of the tail of enable and the head of disable.

rdpmc adds only about 400 instructions per bracket at the same working set size. It settles within 5% of the true baseline at 128 KB (2,048 iterations, ~12,000 total instructions executed). ioctl does not reach that same threshold until 16,384 iterations — 1 MB of randomly accessed data, or ~98,000 total instructions.

L1-dcache-loads per cache line, AMD Zen 4

L1-dcache-loads. The same pattern holds for memory loads, with a true baseline of 2 per iteration. At 1 KB, ioctl reports 70 loads per iteration. Both turnaround points match the instruction thresholds: rdpmc at 128 KB (~4K total loads), ioctl at 1 MB (~33K total loads).

So on Zen 4 the ioctl bracket costs ~7.5× more than rdpmc per call (3,000 vs. 400 instructions), which translates directly into needing ~8× more workload to dilute that overhead to ≤5%.
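
To make the dilution explicit: the relative error at a given working set size is just the per-bracket overhead divided by the true work, overhead / (6 × iterations). A few lines reproduce the 1 MB threshold from the ~3,000-instruction constant:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  const double overhead = 3000.0;  // Zen 4 ioctl bracket, in instructions (from above)
  const double baseline = 6.0;     // true instructions per iteration
  // Working sets double from 1 KB; each iteration touches one 64-byte cache line.
  for (std::size_t kb = 1; kb <= 2048; kb *= 2) {
    const double iters = kb * 1024.0 / 64.0;
    const double error = overhead / (baseline * iters);
    std::printf("%5zu KB  %8.0f iterations  error %5.1f%%\n", kb, iters, 100.0 * error);
  }
  // 512 KB ->  8,192 iterations -> ~6.1%  (still above 5%)
  // 1 MB   -> 16,384 iterations -> ~3.1%  (first working set at or below 5%)
}
```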


Intel: Skylake and Sapphire Rapids

Both Intel chips share the same rdpmc behaviour as Zen 4, but the ioctl overhead is dramatically higher.

Instructions per cache line, Intel Skylake

On Skylake, ioctl reports 705 instructions per iteration at 16 iterations, roughly 4× the Zen 4 number. The overhead works out to about 11,200 extra instructions per start/stop bracket. rdpmc adds the same ~400 instructions as on Zen 4 and stabilizes at exactly the same threshold: ~12,000 total instructions (128 KB accessed data). ioctl, by contrast, doesn’t settle within 5% until ~393K total instructions (4 MB).

Instructions per cache line, Intel Sapphire Rapids

Sapphire Rapids produces nearly the same numbers. At 1 KB, ioctl reports 679 instructions per iteration, ~10,800 instructions of overhead per bracket. The instruction threshold in our random-access benchmark is identical to Skylake: 4 MB for ioctl, 128 KB for rdpmc.

Two Intel chips (eight years apart) show essentially the same ioctl instruction overhead: very likely not a coincidence. Cycles, in contrast, tell a different story: the same ioctl bracket consumes ~28,100 cycles on Skylake but only ~8,525 cycles on Sapphire Rapids.

Cycles per cache line, Intel Sapphire Rapids & Skylake; AMD Zen 4

It is the same kernel work, but different silicon is executing it. Hardware-level optimizations such as deeper buffers and better branch predictors on Sapphire Rapids let the same ~11K-instruction kernel path retire in roughly a third of the cycles. This is exactly the signature of kernel-path cost: the count of overhead instructions is constant across generations (determined by the syscall entry/exit code, not by the hardware substrate), while the cycle count varies with the hardware executing those instructions. The kernel-path account fits both observations; a micro-architectural one fits neither.

The most plausible source of those ~11K instructions is the syscall entry/exit path under Spectre / KPTI mitigations — KPTI page-table swaps on enter and exit, retpoline / IBRS tails on indirect branches. We frame this as consistent with the data rather than proven, since we did not run with mitigations=off.

L1-dcache-loads per cache line, Intel Skylake
L1-dcache-loads per cache line, Intel Sapphire Rapids

The L1-dcache-loads results follow the instruction picture, with one minor twist. At 1 KB, ioctl reports 161 loads/iter on Skylake and 157 on Sapphire Rapids against a true value of 2. rdpmc again converges at 128 KB (~4K total loads) on both. For ioctl, Skylake mirrors its instruction threshold at 4 MB (~131K total loads), but Sapphire Rapids converges earlier, at 2 MB (~66K loads). The kernel path is instruction-heavy, not load-heavy, so Sapphire Rapids’ load count crosses the ≤5% line slightly sooner than its instruction count does.


What the Bracket Actually Costs

The thresholds in the next section are all derived from a small set of underlying constants: the counts added by a single enable-to-disable window, on each platform, in each path. Using the small-WSS rows (where the per-iteration overhead is large enough to read off cleanly), the per-bracket cost is:

| Platform | Path | Instructions | L1 loads | Branches | Cycles |
|---|---|---|---|---|---|
| AMD Zen 4 | ioctl | ~3,000 | ~1,085 | ~665 | ~5,460 |
| AMD Zen 4 | rdpmc | ~390 | ~170 | ~70 | ~450 |
| Intel Skylake | ioctl | ~11,200 | ~2,545 | ~1,760 | ~28,100 |
| Intel Skylake | rdpmc | ~405 | ~115 | ~65 | ~1,680 |
| Intel Sapphire Rapids | ioctl | ~10,800 | ~2,475 | ~1,710 | ~8,525 |
| Intel Sapphire Rapids | rdpmc | ~405 | ~125 | ~70 | ~1,165 |

The Intel ioctl bracket is roughly 2–4× larger than AMD’s in instructions, loads, and branches, and all three scale together, meaning the kernel is doing more work on the syscall path on Intel. rdpmc, by contrast, is small and roughly constant across platforms (~400 instructions, ~70 branches, ~120–170 loads). Zen 4’s rdpmc read sequence issues slightly more loads than Intel’s, but the differences are small enough to be a footnote.

Cycles are the exception. The two Intel generations show identical kernel-path instruction counts but a factor-of-three cycle gap, exactly the pattern one would expect from kernel work running on different silicon.


Side by Side: Threshold Tables

The thresholds below are the workload size at which the per-iteration overhead from the constants above dilutes to ≤5% and ≤1% of the true baseline.

Instructions (baseline: 6/iteration, ≤5% error = ≤6.30, ≤1% error = ≤6.06):

| Method | Machine | Measured at 16 iterations | ≤5% error from | ≤1% error from |
|---|---|---|---|---|
| rdpmc | AMD Zen 4 | ~31 instr./iter | ~12K instructions | ~49K instructions |
| rdpmc | Intel Skylake | ~31 instr./iter | ~12K instructions | ~49K instructions |
| rdpmc | Intel Sapphire Rapids | ~31 instr./iter | ~12K instructions | ~49K instructions |
| ioctl | AMD Zen 4 | ~193 instr./iter | ~98K instructions | ~393K instructions |
| ioctl | Intel Skylake | ~705 instr./iter | ~393K instructions | ~1.3% error at ~3.1M instructions |
| ioctl | Intel Sapphire Rapids | ~679 instr./iter | ~393K instructions | ~1.8% error at ~3.1M instructions |

L1-dcache-loads (baseline: 2/iteration, ≤5% error = ≤2.10, ≤1% error = ≤2.02):

| Method | Machine | Measured at 16 iterations | ≤5% error from | ≤1% error from |
|---|---|---|---|---|
| rdpmc | AMD Zen 4 | ~13 loads/iter | ~4K loads | ~16K loads |
| rdpmc | Intel Skylake | ~10 loads/iter | ~4K loads | ~16K loads |
| rdpmc | Intel Sapphire Rapids | ~11 loads/iter | ~4K loads | ~16K loads |
| ioctl | AMD Zen 4 | ~70 loads/iter | ~33K loads | ~131K loads |
| ioctl | Intel Skylake | ~161 loads/iter | ~131K loads | ~1M loads |
| ioctl | Intel Sapphire Rapids | ~157 loads/iter | ~66K loads | ~1.4% error at ~1M loads |

Two conclusions:

rdpmc’s thresholds are the same on every platform. All three machines converge at 128 KB (≤5%) and 512 KB (≤1%) for both instructions and loads; the gap between the two thresholds is a clean 4×, identical across vendors and generations. The per-bracket cost varies by a small amount between vendors, but not enough to move the threshold.

ioctl requires ~4× more work on Intel than AMD to reach ≤5%. Not because Intel’s hardware is slower, but because its kernel path is longer, confirmed by the per-bracket instruction count (~11K vs ~3K) and backed by the Skylake-vs-Sapphire-Rapids cycle gap, which puts the cost in the kernel rather than in the silicon. Spectre / KPTI mitigations on the syscall path are the most plausible explanation; we did not isolate them experimentally.


When to Use Which

Yes, rdpmc is cheaper. But rdpmc is not always usable.

How many events do you need at once? rdpmc reads physical hardware counters directly, so it cannot multiplex. If you request more events than the hardware exposes simultaneously (typically 4–8 general-purpose counters), ioctl-based measurements rotate events in and out and scale the results; rdpmc has no such fallback. For a handful of events this is no constraint, but if you need many events at once, you’ll have to fall back to ioctl.
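
For reference, the scaling that ioctl-based multiplexing relies on is visible in the raw API: opening the event with time_enabled / time_running in the read format lets you extrapolate a rotated counter. A sketch of that read follows; this is exactly the fallback rdpmc cannot offer, since it reads the physical counter directly.

```cpp
#include <cstdint>
#include <unistd.h>
#include <linux/perf_event.h>

// Layout returned by read() when the event was opened with
//   attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING;
struct ReadFormat {
  std::uint64_t value;         // counts accumulated while the event was on the PMU
  std::uint64_t time_enabled;  // ns the event was enabled
  std::uint64_t time_running;  // ns it actually occupied a hardware counter
};

// Extrapolate a multiplexed count to what a dedicated counter would have seen.
std::uint64_t read_scaled(int fd) {
  ReadFormat rf{};
  read(fd, &rf, sizeof(rf));
  if (rf.time_running == 0) return 0;  // never scheduled onto the PMU
  return static_cast<std::uint64_t>(
      static_cast<double>(rf.value) * rf.time_enabled / rf.time_running);
}
```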

How small is your measured region? Below ~12K instructions / ~4K loads, neither path is reliable on any platform; restructure the benchmark. From ~12K instructions upward, rdpmc is trustworthy everywhere, while ioctl only becomes trustworthy at ~98K instructions on AMD and ~393K on Intel. Above ~393K instructions, ioctl is acceptable on every platform, except for ≤1% accuracy, which Intel ioctl never quite reaches in our data.

Bottom line: prefer rdpmc when your event count fits in the hardware slots and your region is small; fall back to ioctl when you need multiplexing or your region is large enough to amortize the kernel cost.


Practical Takeaways

The thresholds aren’t about the number of iterations. They’re about the total volume of work in the measured region.

If the workload falls below the thresholds, there are two options. The first is to restructure your benchmark so the measured region contains enough work — running the core loop multiple times and normalizing the result afterward is a common approach, and most perf APIs support passing the operation count directly so the division happens automatically. The second is to prefer rdpmc-based reads where your tooling supports it: perf-cpp’s LiveEventCounter interface exposes this directly, and the numbers above translate to an 8× reduction in required workload size on AMD and a 32× reduction on Intel, if you stay within the hardware counter limits.
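
A sketch of the first option, with hypothetical start_counters / stop_counters / total_count placeholders standing in for whichever bracket and read path you use:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Stand-in for the region that is too small to measure on its own.
static std::uint64_t measured_region() {
  volatile std::uint64_t x = 0;
  for (int i = 0; i < 256; ++i) x += i;
  return x;
}

int main() {
  // Sized so repeats * (work per region) clears the relevant threshold
  // (~12K instructions for rdpmc, ~100K on AMD / ~400K on Intel for ioctl).
  constexpr std::size_t kRepeats = 64;

  // start_counters();             // hypothetical: ioctl enable or a first rdpmc read
  std::uint64_t sink = 0;
  for (std::size_t r = 0; r < kRepeats; ++r) {
    sink += measured_region();
  }
  // stop_counters();              // hypothetical: ioctl disable or a second rdpmc read

  // Normalize afterwards: one bracket's overhead is now diluted across kRepeats regions.
  // per_region = total_count / static_cast<double>(kRepeats);
  std::printf("%llu\n", static_cast<unsigned long long>(sink));  // keep the loop observable
}
```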