
Recording Hardware Events

Record hardware performance counters for specific code regions using perf::EventCounter.

Note

EventCounter monitors a single thread. For multi-threaded or multi-core recording, use MultiThreadEventCounter, MultiCoreEventCounter, or MultiProcessEventCounter — see parallel recording.

Tip

See single_thread.cpp for a full working example.


Basic Lifecycle

Set up an event counter, wrap your code with start() / stop(), and retrieve the results:

#include <perfcpp/event_counter.hpp>

/// Create the counter and add events.
auto event_counter = perf::EventCounter{};
event_counter.add({"instructions", "cycles", "branches", "cache-misses"});

/// Optionally, open counters ahead of time to exclude setup from measurement.
event_counter.open();

/// Measure.
event_counter.start();
/// ... your code here ...
event_counter.stop();

/// Retrieve results.
const auto result = event_counter.result();

After stop(), you can call start() / stop() again without re-adding events.
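For example, the same counter can measure several phases in sequence. This is a sketch using only the calls shown above; `phase_one()` / `phase_two()` are hypothetical workload functions, and you should verify against the library's semantics whether start() resets the counts between measurements:

```cpp
/// Sketch: measuring two phases with one EventCounter instance.
event_counter.start();
phase_one(); /// hypothetical workload function
event_counter.stop();
const auto phase_one_result = event_counter.result();

event_counter.start();
phase_two(); /// hypothetical workload function
event_counter.stop();
const auto phase_two_result = event_counter.result();
```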

/// Release resources explicitly, or let the destructor handle it.
event_counter.close();

Accessing Results

/// Query a specific event.
const auto cycles = result.get("cycles");
std::cout << "Took " << cycles.value() << " cycles" << std::endl;

/// Iterate over all results.
for (const auto [name, value] : result)
{
    std::cout << name << " = " << value << std::endl;
}

/// Print as formatted table.
std::cout << result.to_string() << std::endl;

/// Export as CSV or JSON — to string or to file.
std::cout << result.to_csv() << std::endl;
std::cout << result.to_json() << std::endl;
result.to_csv("results.csv");
result.to_json("results.json");

Scheduling Events to Hardware Counters

Physical hardware counters are limited (typically 4–8 per core). When you request more events than counters, the kernel multiplexes — time-sharing counters and scaling results.

By default, perf-cpp packs events into as few counters as possible. You can control this via a scheduling hint in add():

event_counter.add({"instructions", "cycles", "branches"},
                  perf::EventCounter::Schedule::Separate);

Schedule Mode       Description
Schedule::Append    Pack into any counter, using multiplexing. Default.
Schedule::Separate  One event per physical counter; avoids multiplexing.
Schedule::Group     Force all listed events onto the same counter (multiplexed together).

add() throws if the requested scheduling doesn't fit (e.g., too many events to group).
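For example, events whose values are compared against each other can be grouped so they are always scheduled (and multiplexed) together. This is a sketch; whether a given event name is available depends on your hardware:

```cpp
/// Keep both events on the same counter so their ratio stays meaningful.
event_counter.add({"cache-references", "cache-misses"},
                  perf::EventCounter::Schedule::Group);
```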

Binding to a CPU Core or Process

By default, events are counted across all cores the thread runs on, for the calling process only.

auto config = perf::Config{};

/// Count only on CPU core 5.
config.cpu_core(5U);
config.cpu_core(perf::CpuCore::Any); /// revert to all cores

/// Monitor a specific process or all processes.
config.process(perf::Process{1337});
config.process(perf::Process::Any);

auto event_counter = perf::EventCounter{ config };

Note

Monitoring other or all processes may require elevated privileges. See the perf paranoid setting.

Tip

Some hardware events (e.g., Intel off-core events) require monitoring all processes on a specific CPU core, as the hardware does not attribute these events to individual processes.
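Combining the two settings above, such an event would be recorded for all processes on one core. A sketch, with core 5 chosen arbitrarily:

```cpp
auto config = perf::Config{};
config.cpu_core(5U);                 /// count only on core 5
config.process(perf::Process::Any);  /// attribute events from all processes on that core
auto event_counter = perf::EventCounter{ config };
```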

Detection of Physical Hardware Counters

perf-cpp automatically detects the number of physical counters and multiplexing capabilities on most systems.

Important

If the NMI watchdog is enabled (cat /proc/sys/kernel/nmi_watchdog returns 1), it permanently consumes one hardware counter. perf-cpp detects this and adjusts automatically. To reclaim the counter, disable the watchdog via echo 0 > /proc/sys/kernel/nmi_watchdog (requires root).

For unusual hardware where auto-detection fails, specify limits manually:

auto config = perf::Config{};
config.num_physical_counters(2U);
config.num_events_per_physical_counter(1U);
auto event_counter = perf::EventCounter{ config };

Further Configuration

Setting                       Default  Description
include_child_threads(bool)   false    Also monitor child threads spawned by the recording thread.
include_kernel(bool)          true     Include events from kernel activity. Disable when only user-space matters or perf paranoid restricts access.
include_user(bool)            true     Include events from user-space activity.
include_hypervisor(bool)      true     Include events from hypervisor activity.
include_idle(bool)            true     Include events during CPU idle periods.
include_guest(bool)           true     Include events from guest (VM) activity.
include_host(bool)            true     Include events from host activity.
pinned(bool) false Pin events to the CPU, preventing them from being multiplexed off.
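For instance, to record only user-space activity of the calling thread and any threads it spawns, combine the setters above (a sketch):

```cpp
auto config = perf::Config{};
config.include_child_threads(true);  /// follow threads spawned while recording
config.include_kernel(false);        /// user-space only; also sidesteps perf paranoid limits
auto event_counter = perf::EventCounter{ config };
```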

Troubleshooting

Enable debug output to inspect the counter configuration passed to the kernel:

auto config = perf::Config{};
config.debug(true);
auto event_counter = perf::EventCounter{ config };

This is equivalent to perf --debug perf-event-open stat -- sleep 1, which prints the perf_event_open arguments for each counter. Useful for retrieving event codes or diagnosing why a counter fails to open.


Example: Random vs. Sequential Access

This example measures how unpredictable memory access patterns defeat the hardware prefetcher:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric> /// for std::iota
#include <random>
#include <vector>
#include <perfcpp/event_counter.hpp>

/// One cache line per element.
struct alignas(64U) cache_line { std::int64_t value; };

int main()
{
    auto event_counter = perf::EventCounter{};
    event_counter.add({"instructions", "cycles", "cache-misses", "cycles-per-instruction"});

    /// 256 MB of cache lines.
    auto cache_lines = std::vector<cache_line>{};
    cache_lines.resize((1024U * 1024U * 256U) / sizeof(cache_line));
    for (auto i = 0U; i < cache_lines.size(); ++i)
    {
        cache_lines[i].value = i;
    }

    /// Shuffle indices for random access.
    auto indices = std::vector<std::uint64_t>(cache_lines.size());
    std::iota(indices.begin(), indices.end(), 0U);
    std::shuffle(indices.begin(), indices.end(), std::mt19937{std::random_device{}()});

    /// Measure random access.
    event_counter.start();
    auto value = 0ULL;
    for (const auto index : indices)
    {
        value += cache_lines[index].value;
    }
    asm volatile("" : "+r,m"(value) : : "memory");
    event_counter.stop();

    /// Print per-cache-line results.
    const auto result = event_counter.result(cache_lines.size());
    for (const auto [name, val] : result)
    {
        std::cout << val << " " << name << " per cache line" << std::endl;
    }

    event_counter.close();
}

Random access output — more than one cache miss per line:

7.12 instructions per cache line
57.19 cycles per cache line
1.63 cache-misses per cache line
8.03 cycles-per-instruction per cache line

Sequential access (without shuffling) — the prefetcher eliminates nearly all misses:

6.85 instructions per cache line
8.94 cycles per cache line
0.007 cache-misses per cache line
1.31 cycles-per-instruction per cache line