<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en_us"><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="/https://jmuehlig.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="/https://jmuehlig.github.io/" rel="alternate" type="text/html" hreflang="en_us" /><updated>2026-05-09T10:40:06-02:00</updated><id>/https://jmuehlig.github.io/feed.xml</id><title type="html">jmuehlig.github.io</title><author><name>jan</name></author><entry><title type="html">How Small Can a Measured Region Be Before perf Counters Lie?</title><link href="/https://jmuehlig.github.io/when-can-you-trust-your-hardware-counters/" rel="alternate" type="text/html" title="How Small Can a Measured Region Be Before perf Counters Lie?" /><published>2026-05-09T00:00:00-02:00</published><updated>2026-05-09T00:00:00-02:00</updated><id>/https://jmuehlig.github.io/when-can-you-trust-your-hardware-counters</id><content type="html" xml:base="/https://jmuehlig.github.io/when-can-you-trust-your-hardware-counters/"><![CDATA[<p>I started looking at this while working on a paper on cache coherency. 
The key measurement there is the cost of a single cache-line transfer, which means benchmarks with working sets at L1 / L2 cache sizes (from a few KB up to roughly a MB). 
Prior work covers transfer <em>latency</em> well (<a href="https://dl.acm.org/doi/pdf/10.1145/1669112.1669165">Hackenberg et al., 2009</a>; <a href="https://tu-dresden.de/zih/forschung/ressourcen/dateien/abgeschlossene-projekte/benchit/2015_ICPP_authors_version.pdf">Molka et al., 2015</a>); I wanted to see what hardware performance counters could add at that scale. 
<a href="https://man7.org/linux/man-pages/man1/perf-stat.1.html"><em>perf stat</em></a> is (obviously) too coarse for workloads that small: it measures the whole process. 
The counters need to be in the code itself. 
I wondered whether even the standard in-code path was too coarse for tight loops at L1/L2 scale.</p>

<p>On Linux, <a href="https://man7.org/linux/man-pages/man2/perf_event_open.2.html"><code>perf_event_open</code></a> starts and stops counters around any code region. 
The catch is that starting and stopping has its own cost. With <em>perf stat</em> this overhead disappears into the noise of a full program; in-code measurements have no such cushion, and around a tight loop the bracket itself can be larger than what is being measured.</p>

<p>Two paths exist for interacting with hardware performance counters from within code.
The first goes through <code>ioctl</code> calls, asking the kernel to enable and disable counters. 
This is the standard path, and the one most in-code profiling tools default to; <em>perf stat</em> itself uses it under the hood. 
The second reads counter values directly from <strong>user space</strong> using the <a href="https://www.felixcloutier.com/x86/rdpmc"><code>rdpmc</code></a> instruction, bypassing the kernel entirely.
Setting aside that it is <em>x86</em>-only, that sounds like a clear win.</p>

<p>However, the <code>perf_event_open</code> <a href="https://man7.org/linux/man-pages/man2/perf_event_open.2.html">man page</a> is careful not to promise that:</p>

<blockquote>
  <p>Using <code>rdpmc</code> is not necessarily faster than other methods for reading event values.</p>
</blockquote>

<p>The only way to know how much each actually costs, and at what measured region size each becomes <em>trustworthy</em>, is to measure it.</p>

<h3 id="what-the-counter-actually-sees">What the Counter Actually Sees</h3>

<p>Before any numbers, a quick note on what “overhead of start/stop” actually means: the counter does not see the <code>enable</code> or <code>disable</code> operations themselves.
It sees the asymmetric <em>window</em> between them.</p>

<p>On the <code>ioctl</code> path, the performance counters start counting at some point inside the syscall handler.
Everything that follows — the rest of the handler, the syscall return path (sysret/iret, the <a href="https://breaking-bits.gitbook.io/breaking-bits/exploit-development/linux-kernel-exploit-development/kernel-page-table-isolation-kpti">KPTI page-table swap</a>, <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/retpoline-branch-target-injection-mitigation.html">retpoline</a> / <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html">IBRS</a> tail, restoring user state) — executes with the counter already running.
Then the measured region itself runs.
Then <code>ioctl</code> is invoked to disable counting: the full syscall entry sequence (KPTI swap, stack switch, dispatch into the handler) runs with the counter still on, until the disable point inside the handler is reached.
Consequently, what the counter records, beyond the body, is the <em>tail of enable</em>, the <em>head of disable</em>, and a thin slice of user-space scaffolding inside the bracket.</p>

<p>On the <code>rdpmc</code> path, the bracket reduces to a few user-space reads. 
No kernel round-trip, no privilege transition; only the read instructions themselves and whatever harness surrounds them.
This is the entire reason the two paths differ in cost.</p>

<h3 id="methodology">Methodology</h3>

<p>We ran a random-access benchmark across working set sizes from 1 KB to 32 MB, using <a href="https://github.com/jmuehlig/perf-cpp">perf-cpp</a> to drive both counter paths.
Each iteration accesses one cache line at a shuffled index: exactly <strong>6 instructions</strong>, <strong>2 memory loads</strong>, and <strong>1 branch</strong> per iteration, regardless of working set size.
Those fixed, known baselines are the point: any excess in the measured counts comes from measurement overhead, not the workload.</p>

<p>We picked three machines to separate vendor effects from micro-architecture ones: <strong>AMD Zen 4</strong>, <strong>Intel Skylake</strong> (desktop i7), and <strong>Intel Sapphire Rapids</strong> (Xeon server).
The two Intel chips are eight years apart — if the <code>ioctl</code> overhead were purely a micro-architectural cost, they should diverge.
perf-cpp exposes the <code>ioctl</code> path through its <code>EventCounter</code> <a href="https://jmuehlig.github.io/perf-cpp/recording/">(docs)</a> interface and the <code>rdpmc</code> path through <code>LiveEventCounter</code> <a href="https://jmuehlig.github.io/perf-cpp/recording-live-events/">(docs)</a>.</p>

<p>We made the setup as constant as possible across the three machines: each benchmark pinned to a single core (using <code>taskset</code>), turbo disabled, and <code>kernel.perf_event_paranoid = -1</code>; kernels were 6.8.0-101 (Skylake), 6.8.0-106 (Sapphire Rapids), and 6.17.0-23 (Zen 4).</p>

<p>One note: Spectre / KPTI mitigations were left at distribution defaults. 
We did <em>not</em> <a href="https://fosspost.org/disable-cpu-mitigations-on-linux">toggle</a> them, so any claim about them below is based on the cross-platform pattern, not on direct A/B tests.</p>

<p>And one more: each platform was tested on one machine. The instruction-count constants should be deterministic (they come from kernel code paths), but cycle numbers can shift with silicon variation, BIOS settings, and kernel version.</p>

<p>We measured four counters: <code>instructions</code>, <code>cycles</code>, <code>L1-dcache-loads</code>, and <code>branches</code>. Instructions, loads, and branches have constant per-iteration baselines (6, 2, and 1) and carry the main analysis. Branches reproduce the instruction thresholds exactly, so we leave them out of the per-platform walkthrough. Cycles have no fixed baseline (cache-miss latency grows with the working set), but the same qualitative pattern shows up there too, and we use them below as a cross-check on the kernel-path argument.</p>

<p><strong>Code and plots.</strong> Benchmark source, all plots (including cycles and branches), and supporting files live in the <a href="https://github.com/jmuehlig/blog-resource/tree/main/02-measure-profiling">companion repo</a>: see <a href="https://github.com/jmuehlig/blog-resource/blob/main/02-measure-profiling/src/main.cpp"><code>main.cpp</code></a> for the random-access benchmark and <a href="https://github.com/jmuehlig/blog-resource/tree/main/02-measure-profiling/plots"><code>plots/</code></a> for the figures.</p>

<hr />

<h3 id="anatomy-on-amd-zen-4">Anatomy on AMD Zen 4</h3>

<p>We start with Zen 4 in detail; the shape we see here repeats on the Intel chips with different magnitudes.</p>

<p><img src="/assets/images/02/zen4-random_access_instructions.svg" alt="Instructions per cache line, AMD Zen 4" /></p>

<p><strong>Instructions.</strong> At 1 KB of working set (16 iterations or accessed cache lines), <code>ioctl</code> reports 193 instructions per iteration; the true number is 6 (the red dashed line).
That works out to roughly <strong>3,000 extra instructions</strong> per start/stop bracket, dominating at small working sets.
Reminder: this is the overhead of the <em>tail of enable</em> and the <em>head of disable</em>.</p>

<p><code>rdpmc</code> adds only about <strong>400 instructions</strong> per bracket at the same working set size. 
It settles within 5% of the true baseline at 128 KB (2,048 iterations, ~12,000 total instructions executed). 
<code>ioctl</code> does not reach that same threshold until 16,384 iterations — 1 MB of randomly accessed data, or ~98,000 total instructions.</p>

<p><img src="/assets/images/02/zen4-random_access_l1_dcache_loads.svg" alt="L1-dcache-loads per cache line, AMD Zen 4" /></p>

<p><strong>L1-dcache-loads.</strong> The same pattern holds for memory loads, with a true baseline of 2 per iteration.
At 1 KB, <code>ioctl</code> reports 70 loads per iteration. 
Both turnaround points match the instruction thresholds: <code>rdpmc</code> at 128 KB (~4K total loads), <code>ioctl</code> at 1 MB (~33K total loads).</p>

<p>So on Zen 4 the <code>ioctl</code> bracket costs ~7.5× more than <code>rdpmc</code> per call (3,000 vs. 400 instructions), which translates directly into needing ~8× more workload to dilute that overhead to ≤5%.</p>

<hr />

<h3 id="intel-skylake-and-sapphire-rapids">Intel: Skylake and Sapphire Rapids</h3>

<p>Both Intel chips share the same <code>rdpmc</code> behaviour as Zen 4, but the <code>ioctl</code> overhead is dramatically higher.</p>

<p><img src="/assets/images/02/skylake-random_access_instructions.svg" alt="Instructions per cache line, Intel Skylake" /></p>

<p>On Skylake, <code>ioctl</code> reports <strong>705 instructions</strong> per iteration at 16 iterations, roughly 4× the Zen 4 number.
The overhead works out to about <strong>11,200 extra instructions</strong> per start/stop bracket.
<code>rdpmc</code> adds the same ~400 instructions as on Zen 4 and stabilizes at exactly the same threshold: ~12,000 total instructions (128 KB accessed data).
<code>ioctl</code>, by contrast, doesn’t settle within 5% until ~393K total instructions (4 MB).</p>

<p><img src="/assets/images/02/sapphire-rapids-random_access_instructions.svg" alt="Instructions per cache line, Intel Sapphire Rapids" /></p>

<p>Sapphire Rapids produces nearly the same numbers.
At 1 KB, <code>ioctl</code> reports <strong>679 instructions</strong> per iteration, ~10,800 instructions of overhead per bracket.
The instruction threshold in our random-access benchmark is identical to Skylake: 4 MB for <code>ioctl</code>, 128 KB for <code>rdpmc</code>.</p>

<p>Two Intel chips (eight years apart) show essentially the same <code>ioctl</code> <em>instruction</em> overhead: very likely not a coincidence.
<strong>Cycles</strong>, in contrast, tell a different story: the same <code>ioctl</code> bracket consumes <strong>~28,100 cycles on Skylake</strong> but only <strong>~8,525 cycles on Sapphire Rapids</strong>.</p>

<p><img src="/assets/images/02/skylake-sapphire-rapids-zen4-random_access_cycles_1KB.svg" alt="Cycles per cache line, Intel Sapphire Rapids &amp; Skylake; AMD Zen4" /></p>

<p>It is the same kernel work, but different silicon is executing it.
Hardware-level optimizations such as deeper buffers and better branch predictors on Sapphire Rapids let the same ~11K-instruction kernel path retire in roughly a third of the cycles.
This is exactly the signature of kernel-path cost: the <em>count</em> of overhead instructions is constant across generations (determined by the syscall entry/exit code, not by the hardware substrate), while the <em>cycle</em> count varies with the hardware executing those instructions.
The kernel-path account fits both observations; a micro-architectural one fits neither.</p>

<p>The most plausible source of those ~11K instructions is the syscall entry/exit path under Spectre / KPTI mitigations — KPTI page-table swaps on enter and exit, retpoline / IBRS tails on indirect branches.
We frame this as <strong>consistent with the data rather than proven</strong>, since we did not run with <code>mitigations=off</code>.</p>

<p><img src="/assets/images/02/skylake-random_access_l1_dcache_loads.svg" alt="L1-dcache-loads per cache line, Intel Skylake" />
<img src="/assets/images/02/sapphire-rapids-random_access_l1_dcache_loads.svg" alt="L1-dcache-loads per cache line, Intel Sapphire Rapids" /></p>

<p>The L1-dcache-loads results follow the instruction picture, with one minor twist.
At 1 KB, <code>ioctl</code> reports 161 loads/iter on Skylake and 157 on Sapphire Rapids against a true value of 2.
<code>rdpmc</code> again converges at 128 KB (~4K total loads) on both.
For <code>ioctl</code>, Skylake mirrors its instruction threshold at 4 MB (~131K total loads), but Sapphire Rapids converges earlier, at 2 MB (~66K loads).
The kernel path is instruction-heavy, not load-heavy, so Sapphire Rapids’ load count crosses the ≤5% line slightly sooner than its instruction count does.</p>

<hr />

<h3 id="what-the-bracket-actually-costs">What the Bracket Actually Costs</h3>

<p>The thresholds in the next section are all derived from a small set of underlying constants: the counters added by a single <code>enable</code>-to-<code>disable</code> window, on each platform, in each path.
Using the small-WSS rows (where the per-iteration overhead is large enough to read off cleanly), the per-bracket cost is:</p>

<table>
  <thead>
    <tr>
      <th>Platform</th>
      <th>Path</th>
      <th>Instructions</th>
      <th>L1 loads</th>
      <th>Branches</th>
      <th>Cycles</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>AMD Zen 4</td>
      <td><code>ioctl</code></td>
      <td>~3,000</td>
      <td>~1,085</td>
      <td>~665</td>
      <td>~5,460</td>
    </tr>
    <tr>
      <td>AMD Zen 4</td>
      <td><code>rdpmc</code></td>
      <td>~390</td>
      <td>~170</td>
      <td>~70</td>
      <td>~450</td>
    </tr>
    <tr>
      <td>Intel Skylake</td>
      <td><code>ioctl</code></td>
      <td>~11,200</td>
      <td>~2,545</td>
      <td>~1,760</td>
      <td>~28,100</td>
    </tr>
    <tr>
      <td>Intel Skylake</td>
      <td><code>rdpmc</code></td>
      <td>~405</td>
      <td>~115</td>
      <td>~65</td>
      <td>~1,680</td>
    </tr>
    <tr>
      <td>Intel Sapphire Rapids</td>
      <td><code>ioctl</code></td>
      <td>~10,800</td>
      <td>~2,475</td>
      <td>~1,710</td>
      <td>~8,525</td>
    </tr>
    <tr>
      <td>Intel Sapphire Rapids</td>
      <td><code>rdpmc</code></td>
      <td>~405</td>
      <td>~125</td>
      <td>~70</td>
      <td>~1,165</td>
    </tr>
  </tbody>
</table>

<p>The Intel <code>ioctl</code> bracket is ~1.5–4× larger than AMD’s in every counter dimension; instructions, loads, and branches all scale together, meaning the kernel is doing more work on the syscall path on Intel. <code>rdpmc</code>, by contrast, is small and roughly constant across platforms (~400 instructions, ~70 branches, ~120–170 loads). Zen 4’s read sequence reads slightly more data than Intel’s, but the differences are small enough to be a footnote.</p>

<p>Cycles are the exception. The two Intel generations show identical kernel-path instruction counts but a factor-of-three cycle gap, exactly the pattern one would expect from kernel work running on different silicon.</p>

<hr />

<h3 id="side-by-side-threshold-tables">Side by Side: Threshold Tables</h3>

<p>The thresholds below are the workload size at which the per-iteration overhead from the constants above dilutes to ≤5% and ≤1% of the true baseline.</p>

<p><strong>Instructions</strong> (baseline: 6/iteration, ≤5% error = ≤6.30, ≤1% error = ≤6.06):</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Machine</th>
      <th>Measured at 16 iterations</th>
      <th>≤5% error from</th>
      <th>≤1% error from</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>rdpmc</code></td>
      <td>AMD Zen 4</td>
      <td>~31 instr./iter</td>
      <td><strong>~12K instructions</strong></td>
      <td><strong>~49K instructions</strong></td>
    </tr>
    <tr>
      <td><code>rdpmc</code></td>
      <td>Intel Skylake</td>
      <td>~31 instr./iter</td>
      <td><strong>~12K instructions</strong></td>
      <td><strong>~49K instructions</strong></td>
    </tr>
    <tr>
      <td><code>rdpmc</code></td>
      <td>Intel Sapphire Rapids</td>
      <td>~31 instr./iter</td>
      <td><strong>~12K instructions</strong></td>
      <td><strong>~49K instructions</strong></td>
    </tr>
    <tr>
      <td><code>ioctl</code></td>
      <td>AMD Zen 4</td>
      <td>~193 instr./iter</td>
      <td><strong>~98K instructions</strong></td>
      <td><strong>~393K instructions</strong></td>
    </tr>
    <tr>
      <td><code>ioctl</code></td>
      <td>Intel Skylake</td>
      <td>~705 instr./iter</td>
      <td><strong>~393K instructions</strong></td>
      <td><em>~1.3% error at ~3.1M instructions</em></td>
    </tr>
    <tr>
      <td><code>ioctl</code></td>
      <td>Intel Sapphire Rapids</td>
      <td>~679 instr./iter</td>
      <td><strong>~393K instructions</strong></td>
      <td><em>~1.8% error at ~3.1M instructions</em></td>
    </tr>
  </tbody>
</table>

<p><strong>L1-dcache-loads</strong> (baseline: 2/iteration, ≤5% error = ≤2.10, ≤1% error = ≤2.02):</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Machine</th>
      <th>Measured at 16 iterations</th>
      <th>≤5% error from</th>
      <th>≤1% error from</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>rdpmc</code></td>
      <td>AMD Zen 4</td>
      <td>~13 loads/iter</td>
      <td><strong>~4K loads</strong></td>
      <td><strong>~16K loads</strong></td>
    </tr>
    <tr>
      <td><code>rdpmc</code></td>
      <td>Intel Skylake</td>
      <td>~10 loads/iter</td>
      <td><strong>~4K loads</strong></td>
      <td><strong>~16K loads</strong></td>
    </tr>
    <tr>
      <td><code>rdpmc</code></td>
      <td>Intel Sapphire Rapids</td>
      <td>~11 loads/iter</td>
      <td><strong>~4K loads</strong></td>
      <td><strong>~16K loads</strong></td>
    </tr>
    <tr>
      <td><code>ioctl</code></td>
      <td>AMD Zen 4</td>
      <td>~70 loads/iter</td>
      <td><strong>~33K loads</strong></td>
      <td><strong>~131K loads</strong></td>
    </tr>
    <tr>
      <td><code>ioctl</code></td>
      <td>Intel Skylake</td>
      <td>~161 loads/iter</td>
      <td><strong>~131K loads</strong></td>
      <td><strong>~1M loads</strong></td>
    </tr>
    <tr>
      <td><code>ioctl</code></td>
      <td>Intel Sapphire Rapids</td>
      <td>~157 loads/iter</td>
      <td><strong>~66K loads</strong></td>
      <td><em>~1.4% error at ~1M loads</em></td>
    </tr>
  </tbody>
</table>

<p>Two conclusions:</p>

<p><strong><code>rdpmc</code>’s thresholds are the same on every platform.</strong> All three machines converge at 128 KB (≤5%) and 512 KB (≤1%) for both instructions and loads. A clean 4× gap between the two thresholds, identical across vendors and generations. The per-bracket cost varies by a small amount between vendors, but not enough to move the threshold.</p>

<p><strong><code>ioctl</code> requires ~4× more work on Intel than AMD to reach ≤5%.</strong> Not because Intel’s hardware is slower, but because its kernel path is longer, confirmed by the per-bracket instruction count (~11K vs ~3K) and backed by the Skylake-vs-Sapphire-Rapids cycle gap, which puts the cost in the kernel rather than in the silicon. <a href="https://meltdownattack.com/">Spectre</a> / KPTI mitigations on the syscall path are the most plausible explanation; we did not isolate them experimentally.</p>

<hr />

<h3 id="when-to-use-which">When to Use Which</h3>

<p>Yes, <code>rdpmc</code> is cheaper. But <code>rdpmc</code> is not always usable.</p>

<p><strong>How many events do you need at once?</strong>
<code>rdpmc</code> reads physical hardware counters directly, so it cannot <a href="https://perfwiki.github.io/main/tutorial/#multiplexing-and-scaling-events">multiplex</a>.
If you request more events than the hardware exposes simultaneously (typically 4–8 general-purpose counters), <code>ioctl</code>-based measurements rotate events in and out and scale the results; <code>rdpmc</code> has no such fallback.
For a handful of events this is no constraint, but if you need many events at once, you’ll have to fall back to <code>ioctl</code>.</p>

<p><strong>How small is your measured region?</strong>
Below ~12K instructions / ~4K loads, neither path is reliable on any platform; restructure the benchmark.
Between those thresholds and ~98K instructions, <code>rdpmc</code> is trustworthy on every platform; <code>ioctl</code> is trustworthy only on AMD.
Above ~393K instructions, <code>ioctl</code> becomes acceptable on Intel too, except for ≤1% accuracy, which Intel <code>ioctl</code> never quite reaches in our data.</p>

<p><strong>Bottom line:</strong> prefer <code>rdpmc</code> when your event count fits in the hardware slots and your region is small; fall back to <code>ioctl</code> when you need multiplexing or your region is large enough to amortize the kernel cost.</p>

<hr />

<h3 id="practical-takeaways">Practical Takeaways</h3>

<p>The thresholds aren’t about the number of iterations. 
They’re about the total volume of work in the measured region.</p>

<p>If the workload falls below the thresholds, there are two options.
The first is to restructure your benchmark so the measured region contains enough work — running the core loop multiple times and normalizing the result afterward is a common approach, and most perf APIs support passing the operation count directly so the division happens automatically.
The second is to prefer <code>rdpmc</code>-based reads where your tooling supports it: <a href="https://github.com/jmuehlig/perf-cpp">perf-cpp’s</a> <a href="https://jmuehlig.github.io/perf-cpp/recording-live-events/"><code>LiveEventCounter</code> interface</a> exposes this directly, and the numbers above translate to an 8× reduction in required workload size on AMD and a 32× reduction on Intel, if you stay within the hardware counter limits.</p>]]></content><author><name>jan</name></author><category term="perf" /><category term="cpp" /><category term="hardware performance counter" /><category term="intel" /><category term="amd" /><summary type="html"><![CDATA[I started looking at this while working on a paper on cache coherency. The key measurement there is the cost of a single cache-line transfer, which means benchmarks at L1 / L2 cache sizes (a few KB at most). Prior work covers transfer latency well (Hackenberg et al., 2009; Molka et al., 2015); I wanted to see what hardware performance counters could add at that scale. perf stat is (obviously) too coarse for workloads that small: it measures the whole process. The counters need to be in the code itself. I wondered whether even the standard in-code path was too coarse for tight loops at L1/L2 scale.]]></summary></entry><entry><title type="html">Profiling Specific Code Segments of Applications</title><link href="/https://jmuehlig.github.io/profiling-specific-code-segments-of-applications/" rel="alternate" type="text/html" title="Profiling Specific Code Segments of Applications" /><published>2024-11-13T00:00:00-02:00</published><updated>2024-11-13T00:00:00-02:00</updated><id>/https://jmuehlig.github.io/profiling-specific-code-segments-of-applications</id><content type="html" xml:base="/https://jmuehlig.github.io/profiling-specific-code-segments-of-applications/"><![CDATA[<p>Understanding the interaction between software and hardware has become increasingly essential for building high-performance applications. 
The architecture of modern hardware systems has grown significantly more complex, with deep memory hierarchies and CPUs that rely on out-of-order execution and sophisticated branch prediction.</p>

<p><a href="https://perfwiki.github.io/main/">Linux Perf</a>, <a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html">Intel VTune</a>, and <a href="https://www.amd.com/en/developer/uprof.html">AMD μProf</a> are helpful tools for understanding how applications use system resources. 
However, as these tools are typically designed as external applications, they profile the entire program, making it difficult to focus on specific code segments like particular functions.
This limitation is particularly challenging when analyzing micro-benchmarks, where the measured code may represent only a fraction of the overall runtime, or when distinguishing between different phases of an application’s execution.</p>

<h2 id="counting-hardware-events">Counting Hardware Events</h2>
<p>At their core, these tools leverage <em>Performance Monitoring Units</em> (PMUs)–specialized components designed to track hardware events like <em>cache misses</em> and <em>branch mispredictions</em>.
Although these tools are far more powerful, this discussion will focus on the essentials of hardware event counting.</p>

<h3 id="scenario-random-access-pattern">Scenario: Random Access Pattern</h3>
<p>Consider a random access micro-benchmark designed to access a set of <em>cache lines</em> in a random sequence—a scenario that typically baffles the data prefetcher (<a href="https://github.com/jmuehlig/blog-resource/tree/main/01-profiling-specific-code-segments">see the full source code</a>).
The benchmark employs two distinct arrays: one holding the data and another containing indices that establish the random access pattern. 
After initializing these arrays, we execute the micro-benchmark by sequentially scanning through the indices array and accessing data from the data array, a method that generally leads to approximately <strong>one cache miss per access</strong> within the contiguous data array.</p>

<h3 id="perf-stat">Perf Stat</h3>
<p>To observe the underlying hardware dynamics, we utilize the <a href="https://perfwiki.github.io/main/tutorial/#counting-with-perf-stat"><code>perf stat</code> command</a>, which quantifies low-level hardware events such as <em>L1 data cache</em> accesses and references during the execution of the micro-benchmark:</p>

<div class="language-bash highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre>perf stat -e instructions,cycles,L1-dcache-loads,L1-dcache-load-misses -- ./random-access-bench --size 16777216
</pre></div>
</div>
</div>

<p>After running, <code>perf stat</code> displays the results on the command line, in combination with metrics such as <em>instructions per cycle</em>:</p>

<div class="language-bash highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre>Performance counter stats for './random-access-bench --size 16777216':

    3,697,089,032      instructions            #    0.63  insn per cycle            
    5,879,736,227      cycles                                                                
    1,186,826,319      L1-dcache-loads                                                       
      103,262,784      L1-dcache-load-misses   #    8.70% of all L1-dcache accesses 

      1.202831289 seconds time elapsed

      0.799309000 seconds user
      0.403155000 seconds sys
</pre></div>
</div>
</div>

<p>Zooming into details, the results reveal <code>103,262,784</code> <em>L1d</em> misses for <code>16,777,216</code> items, which translates to \(\frac{103,262,784}{16,777,216} \approx 6\) misses per item. 
This number significantly surpasses the <strong>anticipated single cache miss</strong> per item.
The source of this discrepancy lies in the comprehensive scope of the <code>perf stat</code> command, which records events throughout the entire runtime of the benchmark. 
This includes the initialization stage of the benchmark where both the data and pattern arrays are allocated and filled.
Ideally, however, profiling should be confined to the specific segment of the code that interacts directly with the data array to achieve more accurate metrics.</p>

<p>One effective strategy for more control over profiling is to start and stop hardware counters at specific code segments using file descriptors. 
This technique is well-documented in the <a href="https://man7.org/linux/man-pages/man1/perf-stat.1.html"><code>perf stat</code> man page</a>. 
Pramod Kumbhar provides a practical guide to implementing this technique on <a href="https://pramodkumbhar.com/2024/04/linux-perf-measuring-specific-code-sections-with-pause-resume-apis/">his blog</a>, though some might find the approach somewhat cumbersome to implement.</p>

<h2 id="controlling-performance-counters-from-c-applications">Controlling Performance Counters from C++ Applications</h2>
<p>Another strategy for achieving refined control over PMUs is to leverage the <em>perf subsystem</em> directly from C and C++ applications through the <a href="https://man7.org/linux/man-pages/man2/perf_event_open.2.html"><code>perf_event_open</code> system call</a>. 
Given the complexity of this interface, various libraries have been developed to simplify interaction by embedding the <code>perf_event_open</code> system call into their framework. 
Notable examples include <a href="https://github.com/icl-utk-edu/papi">PAPI</a>, <a href="https://github.com/viktorleis/perfevent">PerfEvent</a>, and <a href="https://github.com/jmuehlig/perf-cpp">perf-cpp</a>, each designed to offer a more accessible gateway to these advanced functionalities.</p>

<p>This article will specifically explore <a href="https://github.com/jmuehlig/perf-cpp">perf-cpp</a> and demonstrate practical examples of how to activate and deactivate hardware performance counters for targeted code segments. 
The <code>perf::EventCounter</code> class in <em>perf-cpp</em> allows users to define which events to measure and provides <code>start()</code> and <code>stop()</code> methods to manage the counters.
Below is a code snippet that sets up the <code>EventCounter</code> and focuses the measurement on the desired code segment:</p>

<div class="language-cpp highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre><span style="color:#579">#include</span> <span style="color:#B44;font-weight:bold">&lt;perfcpp/event_counter.h&gt;</span>

<span style="color:#777">/// Initialize the hardware event counter</span>
<span style="color:#088;font-weight:bold">auto</span> counters = perf::CounterDefinition{};
<span style="color:#088;font-weight:bold">auto</span> event_counter = perf::EventCounter{ counters };

<span style="color:#777">/// Specify hardware events to count</span>
event_counter.add({<span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20">instructions</span><span style="color:#710">&quot;</span></span>, <span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20">cycles</span><span style="color:#710">&quot;</span></span>, <span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20">cache-references</span><span style="color:#710">&quot;</span></span>, <span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20">cache-misses</span><span style="color:#710">&quot;</span></span>});

<span style="color:#777">/// Setup benchmark here (this will not be measured)</span>
<span style="color:#080;font-weight:bold">struct</span> <span style="color:#088;font-weight:bold">alignas</span>(<span style="color:#00D">64</span>U) cache_line { std::uint64_t value; };
<span style="color:#088;font-weight:bold">auto</span> data = std::vector&lt;cache_line&gt;{};
<span style="color:#088;font-weight:bold">auto</span> indices = std::vector&lt;std::uint64_t&gt;{};
<span style="color:#777">/// Fill both vectors here...</span>

<span style="color:#088;font-weight:bold">auto</span> sum = <span style="color:#00D">0</span>ULL;

<span style="color:#777">/// Run the workload and count hardware events</span>
event_counter.start();
<span style="color:#080;font-weight:bold">for</span> (<span style="color:#088;font-weight:bold">const</span> <span style="color:#088;font-weight:bold">auto</span> index : indices) {
    sum += data[index].value; <span style="color:#777">// &lt;-- critical memory access</span>
}
<span style="color:#080;font-weight:bold">asm</span> <span style="color:#088;font-weight:bold">volatile</span>(<span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#710">&quot;</span></span> : : <span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20">r,m</span><span style="color:#710">&quot;</span></span>(sum) : <span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20">memory</span><span style="color:#710">&quot;</span></span>); <span style="color:#777">// Ensure the compiler will not optimize sum away</span>
event_counter.stop();
</pre></div>
</div>
</div>
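
<p>The snippet above leaves the setup of the two vectors open. One way to fill them, sketched here under the assumption of <code>2^24</code> cache lines visited exactly once in a uniformly shuffled order (the access count used for the per-access numbers below), is:</p>

<div class="language-cpp highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre><span style="color:#579">#include</span> <span style="color:#B44;font-weight:bold">&lt;algorithm&gt;</span>
<span style="color:#579">#include</span> <span style="color:#B44;font-weight:bold">&lt;numeric&gt;</span>
<span style="color:#579">#include</span> <span style="color:#B44;font-weight:bold">&lt;random&gt;</span>

<span style="color:#777">/// Hypothetical setup: 2^24 = 16,777,216 cache lines.</span>
constexpr auto count = std::size_t{1} &lt;&lt; 24U;
data.resize(count);

<span style="color:#777">/// Visit every cache line exactly once, in a random order.</span>
indices.resize(count);
std::iota(indices.begin(), indices.end(), 0U);
std::shuffle(indices.begin(), indices.end(), std::mt19937_64{42U});
</pre></div>
</div>
</div>

<p>Shuffling a permutation, rather than drawing random indices, guarantees that every cache line is touched exactly once, so the per-access averages reported later are well defined.</p>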

<p>Once the <code>EventCounter</code> is initialized and the events of interest are added, we set up the benchmark by filling the data and index arrays. 
Enclosing the workload we wish to measure between <code>start()</code> and <code>stop()</code> calls restricts the measurement to that particular code segment. 
Upon stopping the counter, the <code>EventCounter</code> can be queried to obtain the measured events:</p>

<div class="language-cpp highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre><span style="color:#088;font-weight:bold">const</span> <span style="color:#088;font-weight:bold">auto</span> result = event_counter.result();

<span style="color:#777">/// Print the performance counters.</span>
<span style="color:#080;font-weight:bold">for</span> (<span style="color:#088;font-weight:bold">const</span> <span style="color:#088;font-weight:bold">auto</span> [name, value] : result)
{
    std::cout &lt;&lt; value &lt;&lt; <span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20"> </span><span style="color:#710">&quot;</span></span> &lt;&lt; name &lt;&lt; <span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20"> (</span><span style="color:#710">&quot;</span></span> &lt;&lt; value / <span style="color:#00D">16777216</span> &lt;&lt; <span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20"> per access)</span><span style="color:#710">&quot;</span></span> &lt;&lt; std::endl; <span style="color:#777">// 16,777,216 = 2^24 accesses</span>
}
</pre></div>
</div>
</div>

<p>The output reflects only the activity during the benchmark, excluding the setup phase in which the data and index arrays are allocated and filled:</p>

<div class="language-bash highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre>102,284,667 instructions            (6.09664 per access)
992,091,716 cycles                  (59.1333 per access)
 34,227,532 L1-dcache-loads         (2.04012 per access)
 18,944,008 L1-dcache-load-misses   (1.12915 per access)
</pre></div>
</div>
</div>

<p>The results are far easier to interpret than those obtained from the <code>perf stat</code> command.
We observe two <em>L1d cache loads</em> per access: one for the randomly accessed cache line and another for the index read from the index array.
Additionally, there are approximately <code>1.13</code> <em>cache misses</em> per access: one for each data cache line and <code>0.125</code> for the access index, since eight 8-byte indices fit into a single 64-byte cache line of the index array.</p>
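
<p>A quick back-of-the-envelope model reproduces these expectations (a sketch; the 64-byte cache-line size is assumed):</p>

<div class="language-cpp highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre><span style="color:#777">/// Per iteration: one load for the data cache line, one for the index.</span>
constexpr auto expected_loads = <span style="color:#00D">2.0</span>;

<span style="color:#777">/// Every data line misses (random order); only every eighth index load does,</span>
<span style="color:#777">/// because eight 8-byte indices share one 64-byte cache line.</span>
constexpr auto indices_per_cache_line = <span style="color:#00D">64.0</span> / sizeof(std::uint64_t); <span style="color:#777">// 8</span>
constexpr auto expected_misses = <span style="color:#00D">1.0</span> + <span style="color:#00D">1.0</span> / indices_per_cache_line; <span style="color:#777">// 1.125</span>
</pre></div>
</div>
</div>

<p>The measured <code>2.04</code> loads and <code>1.13</code> misses per access sit just above these predictions; the small surplus is plausibly loop and measurement overhead.</p>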

<h2 id="hardware-specific-events">Hardware-specific Events</h2>
<p>While basic performance metrics such as <em>instructions</em>, <em>cycles</em>, and <em>cache misses</em> shed light on the interplay of hardware and software, modern CPUs offer a far broader spectrum of events to monitor.
However, it’s important to note that many of these events are specific to the underlying microarchitecture.
The <em>perf subsystem</em> standardizes only a select group of events universally supported across different processors (<a href="https://github.com/jmuehlig/perf-cpp/blob/dev/docs/counters.md#built-in-events">see a detailed list</a>).
To discover the full range of events available on specific CPUs, one can utilize the <code>perf list</code> command. 
Additionally, Intel provides an extensive catalog of events for various architectures on their <a href="https://perfmon-events.intel.com/">perfmon website</a>.</p>

<p>To use hardware-specific counters within an application, the readable event names need to be translated into raw event codes.
To that end, <a href="https://github.com/wcohen/libpfm4">Libpfm4</a> ships the <code>check_events</code> example tool, which translates event names (as listed by <code>perf list</code>) into codes.</p>

<p>Let us consider the event <code>CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD</code> on the AMD Zen4 architecture as an example.
The event quantifies the CPU cycles stalled due to pending memory requests, which is particularly insightful for assessing the effects of cache misses on modern systems. 
Intel offers analogous events, such as <code>CYCLE_ACTIVITY.STALLS_MEM_ANY</code> on the Cascade Lake architecture, and both <code>EXE_ACTIVITY.BOUND_ON_LOADS</code> and <code>EXE_ACTIVITY.BOUND_ON_STORES</code> on the Sapphire Rapids architecture.</p>

<p>After downloading and compiling <em>Libpfm4</em>, developers can fetch the code for a specific event as shown below:</p>

<div class="language-bash highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre>./examples/check_events CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD

Requested Event: CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD
Actual    Event: amd64_fam19h_zen4::CYCLES_NO_RETIRE:NOT_COMPLETE_MISSING_LOAD:k=1:u=1:e=0:i=0:c=0:h=0:g=0
PMU            : AMD64 Fam19h Zen4
IDX            : 1077936192
Codes          : 0x53a2d6
</pre></div>
</div>
</div>
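
<p>Instead of copying raw codes by hand, an application can also link against <em>Libpfm4</em> and resolve event names at run time via <code>pfm_get_os_event_encoding</code>. A minimal sketch (error handling omitted; the function name <code>resolve_event_code</code> is hypothetical):</p>

<div class="language-cpp highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre><span style="color:#579">#include</span> <span style="color:#B44;font-weight:bold">&lt;cstdint&gt;</span>
<span style="color:#579">#include</span> <span style="color:#B44;font-weight:bold">&lt;perfmon/pfmlib.h&gt;</span>
<span style="color:#579">#include</span> <span style="color:#B44;font-weight:bold">&lt;perfmon/pfmlib_perf_event.h&gt;</span>

std::uint64_t resolve_event_code(const char* event_name)
{
    pfm_initialize(); <span style="color:#777">/// Safe to call more than once.</span>

    auto attr = perf_event_attr{};
    auto arg = pfm_perf_encode_arg_t{};
    arg.attr = &amp;attr;
    arg.size = sizeof(pfm_perf_encode_arg_t);

    <span style="color:#777">/// Fills attr.config with the raw code (e.g., 0x53a2d6 for the event above).</span>
    pfm_get_os_event_encoding(event_name, PFM_PLM0 | PFM_PLM3, PFM_OS_PERF_EVENT, &amp;arg);
    return attr.config;
}
</pre></div>
</div>
</div>

<p>The returned value could then be passed to <code>CounterDefinition::add()</code> in place of a hard-coded constant.</p>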

<p>Incorporating hardware-specific events into an application with <em>perf-cpp</em> would look something like this:</p>

<div class="language-cpp highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre><span style="color:#579">#include</span> <span style="color:#B44;font-weight:bold">&lt;perfcpp/event_counter.h&gt;</span>

<span style="color:#777">/// Initialize the hardware event counter</span>
<span style="color:#088;font-weight:bold">auto</span> counters = perf::CounterDefinition{};
counters.add(<span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20">CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD</span><span style="color:#710">&quot;</span></span>, <span style="color:#02b">0x53a2d6</span>); <span style="color:#777">// &lt;-- Event code from Libpfm4 output</span>
<span style="color:#088;font-weight:bold">auto</span> event_counter = perf::EventCounter{ counters };

<span style="color:#777">/// Specify hardware events to count</span>
event_counter.add({<span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20">cycles</span><span style="color:#710">&quot;</span></span>, <span style="background-color:hsla(0,100%,50%,0.05)"><span style="color:#710">&quot;</span><span style="color:#D20">CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD</span><span style="color:#710">&quot;</span></span>});

<span style="color:#777">/// Setup and execute the benchmark as demonstrated above...</span>
</pre></div>
</div>
</div>

<p>This precise tracking reveals that approximately <code>57</code> of <code>59</code> CPU cycles per access are spent waiting for memory loads to complete, a finding consistent with the hardware being unable to predict the benchmark’s random access pattern and therefore paying the raw memory latency on every access:</p>

<div class="language-bash highlighter-coderay"><div class="CodeRay">
  <div class="code"><pre>992,091,716 cycles                                      (59.1333 per access)
967,301,682 CYCLES_NO_RETIRE.NOT_COMPLETE_MISSING_LOAD  (57.6557 per access)
</pre></div>
</div>
</div>

<p>However, thanks to sophisticated out-of-order execution, which keeps several loads in flight at once, the hardware still hides much of this latency: the full memory latency on the machine used for this benchmark is around <code>700</code> cycles, yet each access costs only about <code>59</code> cycles.</p>

<h2 id="summary">Summary</h2>
<p>Profiling tools play a crucial role in identifying bottlenecks and aiding developers in optimizing their code. 
Yet coarse-grained tools such as <code>perf stat</code> measure the entire process, so the key code segments can be obscured by setup and other extraneous work. 
Libraries like <a href="https://github.com/icl-utk-edu/papi">PAPI</a>, <a href="https://github.com/viktorleis/perfevent">PerfEvent</a>, and <a href="https://github.com/jmuehlig/perf-cpp">perf-cpp</a> offer a solution by allowing direct control over hardware performance counters from within the application itself. 
By leveraging the <em>perf subsystem</em> (more precisely the <code>perf_event_open</code> system call), these tools enable precise measurements of only the code segments that are truly relevant.</p>]]></content><author><name>jan</name></author><category term="perf" /><category term="cpp" /><category term="linux" /><category term="hardware performance counter" /><summary type="html"><![CDATA[Understanding the interaction between software and hardware has become increasingly essential for building high-performance applications. The architecture of modern hardware systems has grown significantly in complexity, including deep memory hierarchies and advanced CPUs with features like out-of-order execution and sophisticated branch prediction mechanisms.]]></summary></entry></feed>