Analyzing Memory Access Patterns
The Memory Access Analyzer maps sampled memory accesses to individual data structure instances, producing per-attribute statistics (cache hits/misses, TLB performance, latency).
This is useful when multiple instances of the same data structure share identical code but exhibit different access patterns, e.g., different nodes within a tree.
Tip
See the example: memory_access_analyzer.cpp.
Step 1: Describing Data Types
The analyzer needs a description of your data type's layout. For example, given a binary tree node:
class BinaryTreeNode {
  std::uint64_t value;
  BinaryTreeNode* left_child;
  BinaryTreeNode* right_child;
};
Create a perf::analyzer::DataType definition:
#include <perfcpp/analyzer/memory_access.hpp>
auto binary_tree_node = perf::analyzer::DataType{"BinaryTreeNode", sizeof(BinaryTreeNode)};
binary_tree_node.add("value", sizeof(std::uint64_t)); /// Describe the "value" attribute.
binary_tree_node.add("left_child", sizeof(BinaryTreeNode*)); /// Describe the "left_child" attribute.
binary_tree_node.add("right_child", sizeof(BinaryTreeNode*)); /// Describe the "right_child" attribute.
Tip
For accurate size and offset information, use pahole. See Pramod Kumbhar's guide for details.
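The compiler can also check a description at build time. The following is a minimal sketch and not part of perf-cpp: it redeclares the node as a struct (so that offsetof can reach the members) and introduces a hypothetical helper, described_size(). It assumes that the analyzer derives attribute offsets from the order and sizes of the add() calls, which the offsets 0, 8, and 16 in the example output of Step 4 suggest.

```cpp
#include <cstddef>
#include <cstdint>

/// Stand-in for the node type; a struct so that offsetof can access the members.
struct BinaryTreeNode {
  std::uint64_t value;
  BinaryTreeNode* left_child;
  BinaryTreeNode* right_child;
};

/// Hypothetical helper: sum of the attribute sizes passed to add() in the description.
constexpr std::size_t described_size()
{
  return sizeof(std::uint64_t) + sizeof(BinaryTreeNode*) + sizeof(BinaryTreeNode*);
}

/// Each attribute starts where the previous one ends, i.e., no padding in between.
static_assert(offsetof(BinaryTreeNode, value) == 0U);
static_assert(offsetof(BinaryTreeNode, left_child) == sizeof(std::uint64_t));
static_assert(offsetof(BinaryTreeNode, right_child) == sizeof(std::uint64_t) + sizeof(BinaryTreeNode*));

/// The described attributes cover the whole type, i.e., no trailing padding.
static_assert(described_size() == sizeof(BinaryTreeNode));
```

If one of the assertions fails to compile, the type contains padding or the attributes are listed in a different order than in memory, and the description should be compared against the pahole output.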
Step 2: Registering Data Type Instances
Register individual instances so the analyzer can map sampled addresses to specific objects:
auto memory_access_analyzer = perf::analyzer::MemoryAccess{};
/// Expose the data type to the analyzer.
memory_access_analyzer.add(std::move(binary_tree_node));
/// Register each instance by pointer.
for (auto* node : tree->nodes()) {
  memory_access_analyzer.annotate("BinaryTreeNode", node);
}
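To illustrate why instances must be registered, the following sketch shows one way a sampled address could be resolved to an instance and a byte offset within it. It is a conceptual illustration only: InstanceRegistry, InstanceHit, register_instance(), and lookup() are hypothetical names, and nothing here claims to reflect how perf-cpp implements the mapping internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>

/// Hypothetical result type; not part of perf-cpp.
struct InstanceHit {
  std::uintptr_t instance_base; /// Start address of the matching instance.
  std::size_t offset;           /// Byte offset of the access within that instance.
};

/// Hypothetical registry that maps start addresses to instance sizes.
class InstanceRegistry {
public:
  /// Remember the address range [instance, instance + size).
  void register_instance(const void* instance, const std::size_t size)
  {
    instances_[reinterpret_cast<std::uintptr_t>(instance)] = size;
  }

  /// Resolve a sampled address to the registered instance that contains it, if any.
  std::optional<InstanceHit> lookup(const std::uintptr_t address) const
  {
    /// Find the first instance that starts behind the address, then step back by one.
    auto iterator = instances_.upper_bound(address);
    if (iterator == instances_.begin()) {
      return std::nullopt; /// The address lies below all registered instances.
    }
    --iterator;

    const auto offset = static_cast<std::size_t>(address - iterator->first);
    if (offset >= iterator->second) {
      return std::nullopt; /// The address lies behind the end of that instance.
    }
    return InstanceHit{ iterator->first, offset };
  }

private:
  std::map<std::uintptr_t, std::size_t> instances_;
};
```

With the offset at hand, the accessed attribute follows from the layout described in Step 1; for the 24-byte node, an offset of 16 would fall into right_child.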
Step 3: Mapping Samples to Data Type Instances
Sample memory accesses using a memory-capable trigger: perf::MemoryLoads on Intel, perf::IbsOp on AMD (see CPU-specific notes).
perf::HardwareInfo detects the hardware at runtime:
#include <perfcpp/hardware_info.hpp>
#include <perfcpp/sampler.hpp>
auto sampler = perf::Sampler{};
/// Choose a memory-capable trigger, depending on the hardware.
if (perf::HardwareInfo::is_amd_ibs_supported()) {
  sampler.trigger(perf::IbsOp{ /* is_uop = */ true }, perf::Precision::MustHaveZeroSkid, perf::Period{ 4000U });
} else if (perf::HardwareInfo::is_intel()) {
  sampler.trigger(perf::MemoryLoads{ /* no latency filter */ 0U }, perf::Precision::MustHaveZeroSkid, perf::Period{ 2000U });
}
/// Record the memory address, the data source (e.g., L1d or RAM), and latencies.
sampler.values()
  .logical_memory_address(true)
  .data_source(true)
  .data_access_latency(true)
  .instruction_latency(true);
/// On AMD, record the additional fields provided by IBS.
if (perf::HardwareInfo::is_amd()) {
  sampler.values()
    .data_tlb_latency(true)
    .mhb_allocations(true);
}
sampler.start();
/// ... computation here ...
sampler.stop();
/// Map samples to registered data types and instances.
const auto samples = sampler.result();
const auto result = memory_access_analyzer.map(samples);
/// Release resources explicitly, or let the destructor handle it.
sampler.close();
Step 4: Processing the Result
Example output, recorded on an AMD system:
DataType BinaryTreeNode (24B) {
                                 | loads
                                 |       | latency         | cache hits   | RAM hits     | MAB             | TLB
                         samples | count | cache  uOp dTLB |  L1d  L2  L3 | local remote | no alloc. slots | dTLB STLB miss
   0: value (8B)             373 |   373 |   439  612   37 |  154   0   7 |   212      0 |       350    23 |  190    5  178
   8: left_child (8B)        146 |   146 |   720  898   52 |    1   0   5 |   140      0 |       139     7 |   12   18  116
  16: right_child (8B)       528 |   528 |   173  295   11 |  393   1  14 |   120      0 |       501    16 |  415    4  109
}
The columns depend on the recorded values and the hardware. For example, Intel systems report line fill buffer (LFB) hits instead of MAB allocations, and only loads sampled on AMD's Op PMU include dTLB latencies.
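The counts are often easier to compare as ratios. The sketch below is a hand-written illustration that copies the cache and RAM hit counts from the example output above into a hypothetical struct, AttributeLoads, and computes the share of loads served from RAM; it does not use the analyzer's result API.

```cpp
#include <cstdint>

/// Hypothetical container for the per-attribute hit counts; not part of perf-cpp.
struct AttributeLoads {
  std::uint64_t l1d_hits;
  std::uint64_t l2_hits;
  std::uint64_t l3_hits;
  std::uint64_t local_ram_hits;
  std::uint64_t remote_ram_hits;
};

/// Number of sampled loads over all data sources.
constexpr std::uint64_t total_loads(const AttributeLoads& loads)
{
  return loads.l1d_hits + loads.l2_hits + loads.l3_hits + loads.local_ram_hits + loads.remote_ram_hits;
}

/// Fraction of sampled loads served from (local or remote) RAM; 0 if nothing was sampled.
constexpr double ram_share(const AttributeLoads& loads)
{
  const auto total = total_loads(loads);
  return total == 0U ? 0.0 : static_cast<double>(loads.local_ram_hits + loads.remote_ram_hits) / static_cast<double>(total);
}

/// Counts copied from the example output (columns: L1d, L2, L3, local, remote).
constexpr AttributeLoads value_loads{ 154U, 0U, 7U, 212U, 0U };
constexpr AttributeLoads left_child_loads{ 1U, 0U, 5U, 140U, 0U };
constexpr AttributeLoads right_child_loads{ 393U, 1U, 14U, 120U, 0U };
```

For these counts, roughly 96% of the sampled loads of left_child were served from RAM (140 of 146), compared to about 23% for right_child (120 of 528), although both attributes are accessed by the same code.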
For structured export: