Recording Hardware Events¶
Record hardware performance counters for specific code regions using perf::EventCounter.
Note
EventCounter monitors a single thread. For multi-threaded or multi-core recording, use MultiThreadEventCounter, MultiCoreEventCounter, or MultiProcessEventCounter — see parallel recording.
Tip
See single_thread.cpp for a full working example.
Basic Lifecycle¶
Set up an event counter, wrap your code with start() / stop(), and retrieve the results:
#include <perfcpp/event_counter.hpp>
/// Create the counter and add events.
auto event_counter = perf::EventCounter{};
event_counter.add({"instructions", "cycles", "branches", "cache-misses"});
/// Optionally, open counters ahead of time to exclude setup from measurement.
event_counter.open();
/// Measure.
event_counter.start();
/// ... your code here ...
event_counter.stop();
/// Retrieve results.
const auto result = event_counter.result();
After stop(), you can call start() / stop() again without re-adding events.
Accessing Results¶
/// Query a specific event.
const auto cycles = result.get("cycles");
std::cout << "Took " << cycles.value() << " cycles" << std::endl;
/// Iterate over all results.
for (const auto [name, value] : result)
{
std::cout << name << " = " << value << std::endl;
}
/// Print as formatted table.
std::cout << result.to_string() << std::endl;
/// Export as CSV or JSON — to string or to file.
std::cout << result.to_csv() << std::endl;
std::cout << result.to_json() << std::endl;
result.to_csv("results.csv");
result.to_json("results.json");
Scheduling Events to Hardware Counters¶
Physical hardware counters are limited (typically 4–8 per core). When you request more events than counters, the kernel multiplexes — time-sharing counters and scaling results.
By default, perf-cpp packs events into as few counters as possible.
You can control this via a scheduling hint in add():
| Schedule Mode | Description |
|---|---|
| Schedule::Append | Pack events into any counter, using multiplexing. Default. |
| Schedule::Separate | One event per physical counter, avoiding multiplexing. |
| Schedule::Group | Force all listed events onto the same counter (multiplexed together). |
add() throws if the requested scheduling doesn't fit (e.g., too many events to group).
Binding to a CPU Core or Process¶
By default, events are counted across all cores the thread runs on, for the calling process only.
auto config = perf::Config{};
/// Count only on CPU core 5.
config.cpu_core(5U);
config.cpu_core(perf::CpuCore::Any); /// revert to all cores
/// Monitor a specific process or all processes.
config.process(perf::Process{1337});
config.process(perf::Process::Any);
auto event_counter = perf::EventCounter{ config };
Note
Monitoring other or all processes may require elevated privileges. See the perf paranoid setting.
Tip
Some hardware events (e.g., Intel off-core events) require monitoring all processes on a specific CPU core, as the hardware does not attribute these events to individual processes.
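For such events, combine the two settings shown above, binding to a specific core while monitoring all processes. A sketch using only the configuration calls from this section (the core number is arbitrary):

```cpp
auto config = perf::Config{};
config.cpu_core(5U);                /// count only on core 5 ...
config.process(perf::Process::Any); /// ... across all processes running there
auto event_counter = perf::EventCounter{ config };
```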
Detection of Physical Hardware Counters¶
perf-cpp automatically detects the number of physical counters and multiplexing capabilities on most systems.
Important
If the NMI watchdog is enabled (cat /proc/sys/kernel/nmi_watchdog returns 1), it permanently consumes one hardware counter.
perf-cpp detects this and adjusts automatically.
To reclaim the counter, disable the watchdog via echo 0 > /proc/sys/kernel/nmi_watchdog (requires root).
For unusual hardware where auto-detection fails, specify limits manually:
auto config = perf::Config{};
config.num_physical_counters(2U);
config.num_events_per_physical_counter(1U);
auto event_counter = perf::EventCounter{ config };
Further Configuration¶
| Setting | Default | Description |
|---|---|---|
| include_child_threads(bool) | false | Also monitor child threads spawned by the recording thread. |
| include_kernel(bool) | true | Include events from kernel activity. Disable when only user-space matters or the perf paranoid setting restricts access. |
| include_user(bool) | true | Include events from user-space activity. |
| include_hypervisor(bool) | true | Include events from hypervisor activity. |
| include_idle(bool) | true | Include events during CPU idle periods. |
| include_guest(bool) | true | Include events from guest (VM) activity. |
| include_host(bool) | true | Include events from host activity. |
| pinned(bool) | false | Pin events to the CPU, preventing them from being multiplexed off. |
Troubleshooting¶
Enable debug output to inspect the counter configuration passed to the kernel:
auto config = perf::Config{};
config.debug(true);
auto event_counter = perf::EventCounter{ config };
This is equivalent to perf --debug perf-event-open stat -- sleep 1, which prints the perf_event_open arguments for each counter.
Useful for retrieving event codes or diagnosing why a counter fails to open.
Example: Random vs. Sequential Access¶
This example measures how unpredictable memory access patterns defeat the hardware prefetcher:
#include <random>
#include <iostream>
#include <cstdint>
#include <vector>
#include <algorithm>
#include <numeric> /// std::iota
#include <perfcpp/event_counter.hpp>
/// One cache line per element.
struct alignas(64U) cache_line { std::int64_t value; };
int main()
{
auto event_counter = perf::EventCounter{};
event_counter.add({"instructions", "cycles", "cache-misses", "cycles-per-instruction"});
/// 256 MB of cache lines.
auto cache_lines = std::vector<cache_line>{};
cache_lines.resize((1024U * 1024U * 256U) / sizeof(cache_line));
for (auto i = 0U; i < cache_lines.size(); ++i)
{
cache_lines[i].value = i;
}
/// Shuffle indices for random access.
auto indices = std::vector<std::uint64_t>(cache_lines.size());
std::iota(indices.begin(), indices.end(), 0U);
std::shuffle(indices.begin(), indices.end(), std::mt19937{std::random_device{}()});
/// Measure random access.
event_counter.start();
auto value = 0ULL;
for (const auto index : indices)
{
value += cache_lines[index].value;
}
asm volatile("" : "+r,m"(value) : : "memory");
event_counter.stop();
/// Print per-cache-line results.
const auto result = event_counter.result(cache_lines.size());
for (const auto [name, val] : result)
{
std::cout << val << " " << name << " per cache line" << std::endl;
}
event_counter.close();
}
Random access output — more than one cache miss per line:
7.12 instructions per cache line
57.19 cycles per cache line
1.63 cache-misses per cache line
8.03 cycles-per-instruction per cache line
Sequential access (without shuffling) — the prefetcher eliminates nearly all misses: