Accessing Performance Counters

CFiddle provides easy access to performance counters that can count numerous hardware and software events: cycles, instructions executed, cache misses, TLB misses, etc.

Enabling Performance Counters

Accessing performance counters can be hard for reasons that are out of CFiddle’s control, mostly to do with security – performance counters can leak a lot of information about a system.

The first step in using them is to check if they are enabled:

>>> from cfiddle import *
>>> print(are_perf_counters_available()) 
True

If it returns True, you are good to go. If it returns False, you can enable them by running this shell command as root:

$ echo 0 > /proc/sys/kernel/perf_event_paranoid

You may also need to pass --privileged and/or --cap-add CAP_SYS_ADMIN to your docker command line if you’re using Docker.

Once cfiddle.are_perf_counters_available() returns True, you can start measuring.
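
If the check fails, it can help to look at the kernel setting directly before changing it. The sketch below is not part of CFiddle: `read_paranoid_level` is a hypothetical helper, and it assumes only that the file holds a single integer, as it does on Linux. Lower values are more permissive.

```python
from pathlib import Path

# Path of the kernel setting changed by the shell command above.
PARANOID_PATH = "/proc/sys/kernel/perf_event_paranoid"


def read_paranoid_level(path=PARANOID_PATH):
    """Return the integer stored in `path`, or None if it cannot be read.

    Hypothetical helper for illustration; not part of CFiddle.
    """
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("perf_event_paranoid is not readable on this system")
    else:
        print(f"perf_event_paranoid = {level}")
```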

Taking Performance Counter Measurements

You can measure performance counters by passing the perf_counters argument to cfiddle.run(). The argument takes a list of the performance counters you’d like to measure.

For example, we can measure the number of clock cycles required to execute a loop like so:

>>> from cfiddle import *
>>> results = run(build(code(r"""
... #include "cfiddle.hpp"
... extern "C"
... void foo(int count) {
...    start_measurement();
...    for (int i = 0; i < count; i++) {
...    }
...    end_measurement();
... }""")),
...     function="foo",
...     arguments=arg_map(count=[1000, 10000, 100000]),
...     perf_counters=["CYCLES", "INSTRUCTIONS"])
>>> data = results.as_df()
>>> data["CYCLES_PER_INSTRUCTION"] = data["CYCLES"]/data["INSTRUCTIONS"]
>>> print(data) 
  function   count        ET  CYCLES  INSTRUCTIONS  CYCLES_PER_INSTRUCTION
0      foo    1000  0.000005    8383          5196                1.613356
1      foo   10000  0.000017   71245         50196                1.419336
2      foo  100000  0.000149  707063        500196                1.413572
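
The counts above include a fixed overhead that does not depend on count, so dividing a counter by count overstates the per-iteration cost for short runs. One way to separate the two, sketched here in plain Python with the numbers copied from the table above, is to look at how a counter changes between two runs. `per_iteration` is a hypothetical helper, not a CFiddle function.

```python
# Rows copied from the table above: (count, CYCLES, INSTRUCTIONS).
rows = [
    (1000, 8383, 5196),
    (10000, 71245, 50196),
    (100000, 707063, 500196),
]


def per_iteration(small, large, column):
    """Change in one counter per extra loop iteration between two runs.

    Hypothetical helper for illustration; `column` is 1 for CYCLES
    and 2 for INSTRUCTIONS in the tuples above.
    """
    return (large[column] - small[column]) / (large[0] - small[0])


instructions_per_iter = per_iteration(rows[0], rows[2], 2)
cycles_per_iter = per_iteration(rows[0], rows[2], 1)
# Whatever is left after removing the per-iteration part is fixed overhead.
fixed_instructions = rows[0][2] - instructions_per_iter * rows[0][0]

print(f"instructions per iteration: {instructions_per_iter:.2f}")
print(f"cycles per iteration:       {cycles_per_iter:.2f}")
print(f"fixed instruction overhead: {fixed_instructions:.0f}")
```

For this particular run, each iteration retires 5 instructions and the fixed overhead is 196 instructions; the cycle count per iteration comes out at roughly 7.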

Specifying Performance Counters To Measure

There are two different ways to name performance counters in the perf_counters argument.

The first and simplest is the set of counter names defined by the perf_event_open() interface (see its man page). The second, more flexible mechanism is the set of names accessible via the libpfm4 library.

`perf_event_open()` Names

The perf_event_open() names are meant to be portable across Linux running on different architectures. The descriptions and tables below are summarized from the perf_event_open() man page.

The non-cache-related counters are:

Name	Description
`PERF_COUNT_HW_CPU_CYCLES`	Total cycles. Be wary of what happens during CPU frequency scaling.
`PERF_COUNT_HW_INSTRUCTIONS`	Retired instructions. Be careful, these can be affected by various issues, most notably hardware interrupt counts.
`PERF_COUNT_HW_CACHE_REFERENCES`	Cache accesses. Usually this indicates Last Level Cache accesses but this may vary depending on your CPU. This may include prefetches and coherency messages; again this depends on the design of your CPU.
`PERF_COUNT_HW_CACHE_MISSES`	Cache misses. Usually this indicates Last Level Cache misses; this is intended to be used in conjunction with the `PERF_COUNT_HW_CACHE_REFERENCES` event to calculate cache miss rates.
`PERF_COUNT_HW_BRANCH_INSTRUCTIONS`	Retired branch instructions.
`PERF_COUNT_HW_BRANCH_MISSES`	Mispredicted branch instructions.
`PERF_COUNT_HW_BUS_CYCLES`	Bus cycles, which can be different from total cycles.
`PERF_COUNT_HW_STALLED_CYCLES_FRONTEND`	Stalled cycles during issue.
`PERF_COUNT_HW_STALLED_CYCLES_BACKEND`	Stalled cycles during retirement.
`PERF_COUNT_HW_REF_CPU_CYCLES`	Total cycles; not affected by CPU frequency scaling.
`PERF_COUNT_SW_CPU_CLOCK`	This reports the CPU clock, a high-resolution per-CPU timer.
`PERF_COUNT_SW_TASK_CLOCK`	This reports a clock count specific to the task that is running.
`PERF_COUNT_SW_PAGE_FAULTS`	This reports the number of page faults.
`PERF_COUNT_SW_CONTEXT_SWITCHES`	This counts context switches.
`PERF_COUNT_SW_CPU_MIGRATIONS`	This reports the number of times the process has migrated to a new CPU.
`PERF_COUNT_SW_PAGE_FAULTS_MIN`	This counts the number of minor page faults. These did not require disk I/O to handle.
`PERF_COUNT_SW_PAGE_FAULTS_MAJ`	This counts the number of major page faults. These required disk I/O to handle.
`PERF_COUNT_SW_ALIGNMENT_FAULTS`	This counts the number of alignment faults. These happen when unaligned memory accesses happen; the kernel can handle these but it reduces performance. This happens only on some architectures (never on x86).
`PERF_COUNT_SW_EMULATION_FAULTS`	This counts the number of emulation faults. The kernel sometimes traps on unimplemented instructions and emulates them for user space. This can negatively impact performance.
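
As the table notes, `PERF_COUNT_HW_CACHE_MISSES` is meant to be paired with `PERF_COUNT_HW_CACHE_REFERENCES`. A minimal sketch of that calculation follows. The counts are made-up values for illustration, not measurements, and `miss_rate` is a hypothetical helper.

```python
def miss_rate(misses, references):
    """Fraction of cache references that missed.

    Returns None when no references were counted, which avoids a
    division by zero (for example, if the counter failed to configure).
    """
    if references == 0:
        return None
    return misses / references


# Made-up counts, standing in for one row of measured results.
sample = {
    "PERF_COUNT_HW_CACHE_REFERENCES": 20000,
    "PERF_COUNT_HW_CACHE_MISSES": 500,
}

rate = miss_rate(sample["PERF_COUNT_HW_CACHE_MISSES"],
                 sample["PERF_COUNT_HW_CACHE_REFERENCES"])
print(f"miss rate: {rate:.1%}")
```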

There are a bunch of cache-related counters, too. You can construct their names as PERF_COUNT_HW_CACHE_<cache_identifier>:<access_type>:<result> by picking one value from each column of the table below:

`cache_identifier`	`access_type`	`result`
`L1D` – Level-one data	`READ`	`ACCESS`
`L1I` – Level-one instruction	`WRITE`	`MISS`
`LL` – Last-level cache	`PREFETCH`
`DTLB` – Data TLB
`ITLB` – Instruction TLB
`BPU` – Branch predictor
`NODE` – Local memory accesses

So, for instance, PERF_COUNT_HW_CACHE_L1D:READ:ACCESS will count the number of level-one data cache reads.
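
Since each name is one choice from each column, the names can also be assembled in code. The sketch below uses a hypothetical helper, `cache_counter_name`, which is not part of CFiddle. It only checks spelling against the table above and cannot tell whether a given CPU supports the combination.

```python
# Values taken from the table above.
CACHES = ("L1D", "L1I", "LL", "DTLB", "ITLB", "BPU", "NODE")
ACCESS_TYPES = ("READ", "WRITE", "PREFETCH")
RESULTS = ("ACCESS", "MISS")


def cache_counter_name(cache, access_type, result):
    """Build a PERF_COUNT_HW_CACHE_* name from its three parts.

    Hypothetical helper for illustration. It validates the parts
    against the table only, not against the hardware.
    """
    if cache not in CACHES:
        raise ValueError(f"unknown cache identifier: {cache!r}")
    if access_type not in ACCESS_TYPES:
        raise ValueError(f"unknown access type: {access_type!r}")
    if result not in RESULTS:
        raise ValueError(f"unknown result: {result!r}")
    return f"PERF_COUNT_HW_CACHE_{cache}:{access_type}:{result}"


# Access and miss counters for level-one data cache reads.
counters = [cache_counter_name("L1D", "READ", r) for r in RESULTS]
print(counters)
```

A list built this way could then be passed as the perf_counters argument.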

`libpfm4` Names

You can also pass names from a much larger group of platform-specific counters. Which of these are available depends on the architecture and OS you’re running on. CFiddle uses libpfm4 to parse these names.

You can generate a list of the available values with:

$ showevtinfo

The resulting list can be a bit dizzying. (showevtinfo is an example program from the libpfm4 source distribution and is not installed by default. If you are using the CFiddle docker image or have run the install_prereqs.sh script in the CFiddle distribution, it should be available.)

Here’s a quick orientation on the output:

At the top is a long list:

Supported PMU models:
     [7, netburst, "Pentium4"]
     [8, netburst_p, "Pentium4 (Prescott)"]
...

These are all the Performance Monitoring Units (PMUs) that libpfm4 knows about. A PMU is a generic Linux abstraction, and it may or may not correspond to a piece of hardware.

Next comes a list of the PMUs libpfm4 detected:

Detected PMU models:
     [18, ix86arch, "Intel X86 architectural PMU", 7 events, 1 max encoding, 7 counters, core PMU]
     [51, perf, "perf_events generic PMU", 189 events, 1 max encoding, 0 counters, OS generic PMU]
     [110, rapl, "Intel RAPL", 4 events, 1 max encoding, 3 counters, uncore PMU]
     [114, perf_raw, "perf_events raw PMU", 1 events, 1 max encoding, 0 counters, OS generic PMU]
     [200, skl, "Intel Skylake", 83 events, 2 max encoding, 11 counters, core PMU]
...

In this example we have the generic x86 PMU, the perf PMU that provides the hardware and software counters described in the perf_event_open() section above, the rapl PMU that provides power/energy measurements, the perf_raw PMU, and the micro-architecture-specific Skylake PMU.

Finally, there is the list of events. On my development system there are 284 of them. Here’s the first:

#-----------------------------
IDX      : 37748736
PMU name : ix86arch (Intel X86 architectural PMU)
Name     : UNHALTED_CORE_CYCLES
Equiv    : None
Flags    : None
Desc     : count core clock cycles whenever the clock signal on the specific core is running (not halted)
Code     : 0x3c
Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean)
Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-02 : 0x02 : PMU : [e] : edge level (may require counter-mask >= 1) (boolean)
Modif-03 : 0x03 : PMU : [i] : invert (boolean)
Modif-04 : 0x04 : PMU : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : PMU : [t] : measure any thread (boolean)
Modif-06 : 0x07 : PMU : [intx] : monitor only inside transactional memory region (boolean)
Modif-07 : 0x08 : PMU : [intxcp] : do not count occurrences inside aborted transactional memory region (boolean)
...

Here’s what the fields mean:

IDX – libpfm4’s internal ID for the event.
PMU name – Which PMU provides it.
Name – The event’s name.
Equiv – An alternate name for the event (if any).
Desc – A description.
Flags, Code – Values that specify the event.
Modif-* – Modifiers.
Umask-* – Masks to filter the event. This event doesn’t have any.

The meaning and number of the modifiers vary from PMU to PMU. More information about each PMU is available via man:

$ man libpfm_intel_x86_arch
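
If you need to search the event list rather than read it, the records are regular enough to parse. The sketch below is a minimal parser for one record, using a shortened copy of the output shown above. `parse_event` is a hypothetical helper, and it assumes the "Key : value" layout seen here; other libpfm4 versions may format records differently.

```python
import re

# One record copied from the showevtinfo output above (shortened).
SAMPLE = """\
#-----------------------------
IDX      : 37748736
PMU name : ix86arch (Intel X86 architectural PMU)
Name     : UNHALTED_CORE_CYCLES
Equiv    : None
Flags    : None
Desc     : count core clock cycles whenever the clock signal on the specific core is running (not halted)
Code     : 0x3c
Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean)
Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean)
"""


def parse_event(text):
    """Parse one record into a dict of fields plus a list of modifiers.

    Hypothetical helper for illustration; assumes the layout above.
    """
    event = {"modifiers": []}
    for line in text.splitlines():
        if line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only; values may contain more colons.
        key, value = (part.strip() for part in line.split(":", 1))
        if key.startswith("Modif-"):
            # The modifier name is the text inside square brackets.
            match = re.search(r"\[(\w+)\]", value)
            if match:
                event["modifiers"].append(match.group(1))
        else:
            event[key] = value
    return event


event = parse_event(SAMPLE)
print(event["PMU name"], event["Name"], event["modifiers"])
```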

Performance Counter Pitfalls

Performance counters can be tricky. Here are some potential pitfalls:

Not all combinations of hardware counters can be used at once. The details of this are byzantine. Trial and error is a good approach to figuring out what works.
CFiddle will print errors when performance counter configuration fails, but your experiments will still run. You’ll just get zeros.
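
Because a failed counter shows up as zeros rather than as a failed run, it can be worth checking results before analysing them. The sketch below works on plain dictionaries, one per result row. `suspect_counters` is a hypothetical helper, and the sample rows are made up to show the failure case. A counter can also legitimately read zero, so treat a hit as a prompt to check CFiddle’s error output rather than as proof.

```python
def suspect_counters(rows, counters):
    """Return the counters that read zero in every row.

    Hypothetical helper for illustration; `rows` is a list of dicts
    mapping counter names to counts.
    """
    return [name for name in counters
            if all(row.get(name, 0) == 0 for row in rows)]


# Made-up rows: CYCLES has values, the branch-miss counter is all zeros.
rows = [
    {"count": 1000, "CYCLES": 8383, "PERF_COUNT_HW_BRANCH_MISSES": 0},
    {"count": 10000, "CYCLES": 71245, "PERF_COUNT_HW_BRANCH_MISSES": 0},
]

print(suspect_counters(rows, ["CYCLES", "PERF_COUNT_HW_BRANCH_MISSES"]))
```

If your results are in a pandas DataFrame, `data.to_dict("records")` produces rows in this shape.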
