
What are Hardware Performance Counters?

Modern CPUs contain dedicated hardware registers — performance monitoring counters (PMCs) — that track low-level events as your code executes. These counters run in hardware with virtually zero overhead, giving you precise insights that software-only profiling cannot provide.

What do they measure?

Every CPU core has a small number of programmable counters (typically 4–8). Each counter can be configured to count one event type at a time:

  • Instructions retired: how many instructions actually completed
  • CPU cycles: core clock cycles elapsed; unlike wall-clock time, the count depends on the core's current clock frequency
  • Cache misses: misses at each level of the cache hierarchy (L1, L2, L3)
  • Branch mispredictions: wrong guesses by the branch predictor
  • TLB misses: virtual-to-physical address translations that were not found in the TLB and required a page-table walk
  • Memory accesses: loads, stores, prefetches, and where data came from (L1, L2, RAM, remote NUMA node)

Beyond these common events, each CPU generation adds vendor-specific counters. Intel and AMD publish thousands of events per microarchitecture, from micro-op queue stalls to specific cache coherency transitions.

Counting vs. Sampling

There are two fundamentally different ways to use these counters:

Counting reads the counter registers after a region of code has executed. You get totals: "this loop took 4.7 billion cycles and caused 13 million cache misses." Counts are exact (no sampling error) but tell you nothing about which instructions caused those events.
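On Linux, the kernel exposes the counters through the perf_event_open system call. The following is a minimal sketch of counting with that raw interface, not perf-cpp's own API. The helper name count_instructions is hypothetical, and the sketch assumes a Linux system on which the calling process is allowed to open hardware events.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: counts user-space instructions retired by the
// calling thread while `work` runs. Returns std::nullopt if the kernel
// refuses to open the counter (for example because of the
// perf_event_paranoid setting, a sandbox, or a VM without a PMU).
template <typename Work>
std::optional<std::uint64_t> count_instructions(Work&& work) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        // start stopped; enabled explicitly below
    attr.exclude_kernel = 1;  // count user-space execution only
    attr.exclude_hv = 1;

    // pid = 0 and cpu = -1: measure this thread on whichever CPU it runs.
    const int fd = static_cast<int>(
        syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
    if (fd < 0) {
        return std::nullopt;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();  // the region of code being measured
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    // With the default read format, the file yields one 64-bit total.
    std::uint64_t count = 0;
    const bool ok =
        read(fd, &count, sizeof(count)) == static_cast<ssize_t>(sizeof(count));
    close(fd);
    return ok ? std::optional<std::uint64_t>{count} : std::nullopt;
}
```

The result is a single total for the whole region, which matches the trade-off described above: the number is exact, but it carries no information about which instructions contributed to it.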

Sampling captures snapshots at regular intervals. Every N events (e.g., every 50,000 cycles), the CPU raises an interrupt and records context about that moment: the instruction pointer, memory address, timestamp, cache level, and more. This gives you a statistical picture of where events are concentrated, but since only every N-th event triggers a sample, the result is an approximation: not every instruction is observed.
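The statistical nature of sampling can be made concrete with a little arithmetic: if one sample is taken every N events, each sample stands in for roughly N events. The sketch below applies that extrapolation per function. The Sample struct and the estimate_events function are hypothetical names for illustration, and the sketch assumes the samples have already been resolved to symbol names.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified sample record; a real sample would also carry
// the instruction pointer, timestamp, and so on.
struct Sample {
    std::string symbol;  // function the instruction pointer resolved to
};

// Estimates how many events each symbol accounts for, given that one
// sample was recorded every `period` events. Each sample is credited
// with `period` events, so the result is an estimate, not an exact count.
std::map<std::string, std::uint64_t> estimate_events(
    const std::vector<Sample>& samples, std::uint64_t period) {
    std::map<std::string, std::uint64_t> estimate;
    for (const auto& sample : samples) {
        estimate[sample.symbol] += period;
    }
    return estimate;
}
```

For example, with a period of 50,000 cycles, two samples in one function and one sample in another suggest roughly 100,000 and 50,000 cycles respectively. A function that ran for fewer cycles than one period may not appear at all.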

How many counters are available?

Each physical core has a fixed number of counter registers. Typical values:

Architecture     General-purpose counters   Fixed counters
Intel (recent)   4–8                        3–4 (cycles, instructions, ref-cycles)
AMD (Zen 3+)     6                          0

If you request more events than there are physical counters, the kernel multiplexes: it time-shares the counters among the events and extrapolates each count to the full measurement period. Multiplexed counts are therefore estimates, not exact values, so fewer simultaneous events means more accurate data.
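The scaling step is simple arithmetic. The Linux perf interface can report, alongside the raw count, how long an event was enabled and how long it actually occupied a counter; the usual estimate multiplies the raw count by the ratio of the two. The sketch below illustrates that calculation. The CounterReading struct and the scale_multiplexed function are hypothetical names, not part of perf-cpp.

```cpp
#include <cstdint>

// Hypothetical container for one event's reading.
struct CounterReading {
    std::uint64_t raw_count;     // events counted while on a hardware counter
    std::uint64_t time_enabled;  // time the event was requested (ns)
    std::uint64_t time_running;  // time it actually occupied a counter (ns)
};

// Extrapolates a multiplexed count to the full enabled period:
//   estimate = raw_count * time_enabled / time_running
// If the event never ran, there is nothing to extrapolate from, so 0 is
// returned.
double scale_multiplexed(const CounterReading& reading) {
    if (reading.time_running == 0) {
        return 0.0;
    }
    return static_cast<double>(reading.raw_count) *
           static_cast<double>(reading.time_enabled) /
           static_cast<double>(reading.time_running);
}
```

An event that was on a counter for only a quarter of the enabled time has its raw count multiplied by four. The estimate assumes the event rate was the same while the event was off the counter, which is why multiplexed results are less reliable for short or bursty workloads.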

perf-cpp detects your hardware's counter layout automatically and manages multiplexing transparently.

Further reading

Introductory articles on hardware performance counters and how to work with them: