Homework 3: Hardware Performance Counters

This assignment will familiarize you with various low-level performance counters on x86 processors. We will be using PAPI, a performance instrumentation framework, to help us measure and understand the behavior of CPU caches and the branch predictor. You should perform all measurements on the openlab machines.

NOTE: YOU CANNOT PUBLICLY RELEASE SOLUTIONS TO THIS HOMEWORK. It's ok to show your work to a future employer as a private GitHub/GitLab repo; however, any public release is prohibited.

Hardware performance counters

Note: This overview is partially based on a short answer by John McCalpin. Modern CPUs provide a collection of hardware performance counters that allow you to monitor various microarchitectural events in the CPU cores. The infrastructure for these counters is described in Chapter 18 of Volume 3 of the Intel Architectures Software Developer's Manual. The events that can be counted differ across processor models -- Chapter 19 of the same document briefly describes the events for each model. When HyperThreading is enabled, these counters typically only increment for events pertaining to the specific Logical Processor that is reading the counter. Exceptions are noted in the event descriptions in Chapter 19.

The details vary by processor model, but the core performance counters typically include events related to

  1. Overall instruction fetch/issue/dispatch(execution)/retirement. Some processors have additional events that count "stalls" at various points in the pipeline
  2. Branch execution, conditional branches taken/not taken, and mispredicted conditional branches
  3. Load/store operations, including
    1. Translation Lookaside Buffer accesses/misses, and Page Table Walks
    2. Accesses to the L1 Data Cache, including hits, misses, and fills
    3. Accesses to the L2 Cache
    4. Accesses to the L3 Cache (where applicable)
  4. Floating-point operations can be counted by SIMD width and by operand width (Broadwell and newer cores)

On openlab we're running on the following CPU (at least on circinus-1):

Tue/01.06:/home/aburtsev
circinus-1:1002-/08:05>cat /proc/cpuinfo |grep "model name"
model name	: Intel(R) Xeon(R) CPU           X5680  @ 3.33GHz

I.e., an Intel X5680, which is a Westmere CPU, so we can find all supported performance counters in Section 19.11 of the manual. You will see, however, that the event descriptions are terse, and it can require significant research (or familiarity with the microarchitecture) to make sense of them. Nevertheless, let me encourage you: it's still possible, and it's not as hard as it looks.

Consider an example: DTLB_LOAD_MISSES.ANY counts the number of times execution triggered a page table walk due to a TLB miss (remember, a page walk requires fetching entries from every level of the page table -- 4 or 5 levels on 64-bit systems -- to resolve the physical address behind a virtual address that you use in registers).

PAPI

Since the counters change from one CPU architecture to another, it is convenient to use a generic library that abstracts away low-level details of performance counting on each CPU. One such library is PAPI.

PAPI is a library that provides a high-level interface to low-level performance counters that is uniform across architectures (of course, some details are lost, and once in a while you might need to fall back to measuring a specific counter that is normally abstracted away, but in most cases PAPI is sufficient).

Take a moment to read a paper that provides an introduction to the PAPI library.

Entering the shell environment

For this assignment, we will be using a prepared shell environment on openlab which has PAPI set up for you. On an openlab machine, run the following command:
/home/zhaofel1/Public/cs250p/enter.sh 
You will now be able to use the PAPI library and its commands. To see what you can measure, run `papi_avail`, which lists the performance counters and their availability.

Performing measurements with PAPI

Let's write a simple program that reads memory at random locations produced by a simple pseudo-random number generator. Copy and paste the following sample code into `simple.c`:
#include <stdio.h>
#include <stdlib.h>

unsigned long rd_state = 14695981039346656037lu;
char rd_counter = 0;

// The FNV hash function, used here as a pseudo-random number generator.
unsigned long pseudo_random() {
	rd_state *= 1099511628211;
	rd_state ^= rd_counter++;
	return rd_state;
}

int main(int argc, char **argv) {
	size_t size = 1000000000;
	size_t x = 0;
	char *memory = malloc(size);	// 1 GB buffer that we read at random offsets

	for (size_t i = 0; i < size; ++i) {
		size_t idx = pseudo_random() % size;
		x += memory[idx];
	}

	printf("Sum: %lu\n", x);

	return 0;
}

To add PAPI instrumentation, include `papi.h`, initialize the PAPI library with PAPI_library_init(PAPI_VER_CURRENT);, then surround the code region that you want to measure with PAPI_hl_region_begin and PAPI_hl_region_end calls.
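For instance, the measured loop in simple.c would then look roughly like this (a sketch; the region name "simple" is arbitrary, and error checking of the region calls is omitted for brevity):

	if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
		fprintf(stderr, "Bad PAPI version\n");
		return 1;
	}

	PAPI_hl_region_begin("simple");		// counters start here
	for (size_t i = 0; i < size; ++i) {
		size_t idx = pseudo_random() % size;
		x += memory[idx];
	}
	PAPI_hl_region_end("simple");		// counters stop here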

Behind the scenes, PAPI will use the RDPMC instruction to read the actual platform-specific performance counters at the start and end of our code region, much like the `RDTSC` exercise that we did in Homework 1. Compile the code with

gcc -o simple simple.c -lpapi
To select a set of events to measure, set the PAPI_EVENTS environment variable.

For example, to measure the number of L3 cache accesses and misses:

PAPI_EVENTS=PAPI_L3_TCA,PAPI_L3_TCM ./simple
Simple, right?

The results will be written to the papi_hl_output directory, with previous results automatically moved out of the way (the one without the trailing timestamp is the latest).

Playing with the counters

Let's start with some simple performance measurements to make sure we understand the hardware performance counters exported by PAPI.

Branch mispredictions

Let's start with a couple of examples that help us understand branch prediction on modern CPUs:
// Example 1: Branch always taken
void test_branchalways() {
	size_t size = 1000000000;
	size_t x = 0;

	PAPI_hl_region_begin("branchalways");
	for (size_t i = 0; i < size; ++i) {
		if (pseudo_random()) {
			x++;
		}
	}
	PAPI_hl_region_end("branchalways");

	printf("branchalways done: %lu/%lu\n", x, size);
}

// Example 2: Branch taken 50% of the time
void test_branchhalf() {
	size_t size = 1000000000;
	size_t x = 0;

	PAPI_hl_region_begin("branchhalf");
	for (size_t i = 0; i < size; ++i) {
		if (pseudo_random() % 128 < 64) {
			x++;
		}
	}
	PAPI_hl_region_end("branchhalf");

	printf("branchhalf done: %lu/%lu\n", x, size);
}

int main(int argc, char **argv) {
	int init = PAPI_library_init(PAPI_VER_CURRENT);
	if (init != PAPI_VER_CURRENT) {
		fprintf(stderr, "Bad PAPI version\n");
		return 1;
	}

	test_branchalways();
	test_branchhalf();

	return 0;
}
Here we implement two test functions, test_branchalways() and test_branchhalf(), to test two scenarios: one function always takes the branch, and the other takes it in 50% of cases. Let's compile and run the code. Your goal is to collect three counters: the total number of conditional branches executed by the program, and the numbers of correctly predicted and mispredicted branches.

First, let's take a look at which counters are available through PAPI:

[cs250p:~]$ papi_avail 

We're interested in the following events:

PAPI_BR_CN   0x8000002b  Yes   No   Conditional branch instructions
PAPI_BR_MSP  0x8000002e  Yes   No   Conditional branch instructions mispredicted
PAPI_BR_PRC  0x8000002f  Yes   Yes  Conditional branch instructions correctly predicted

To collect these three counters, you can start your program like this:

[cs250p:~]$ PAPI_EVENTS="PAPI_BR_CN,PAPI_BR_MSP,PAPI_BR_PRC" ./a.out 

Submit: your measurements with an explanation for the counters you see.

Now let's estimate the effect of the mispredicted branches on the performance of the program. To measure how long it takes to execute each test, you can either use the rdtsc() function that you used in HW1 or simply take the number of cycles reported by PAPI (it's reported as "cycles":"XXX").

Now let's use two counters to measure the number of issued and retired (committed) instructions: PAPI_TOT_IIS and PAPI_TOT_INS.
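For example, you can collect both in one run:

[cs250p:~]$ PAPI_EVENTS="PAPI_TOT_IIS,PAPI_TOT_INS" ./a.out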

Submit: the time and instruction measurements with an explanation.

Cache misses

Now let's take a look at a couple of tests that access memory in sequential and random order.
struct cache_line {
	char data[64];
};

// Example 3: Sequential memory read
size_t test_read_seq(struct cache_line *memory, size_t size) {
	size_t x = 0;

	PAPI_hl_region_begin("read_seq");
	uint64_t start = rdtsc();
	for (size_t i = 0; i < size; ++i) {
		x += memory[i].data[0];
	}
	PAPI_hl_region_end("read_seq");

	uint64_t end = rdtsc();
	printf("read_seq done in %lu cycles\n", end - start);
	return x;
}

// Example 4: Random memory read
size_t test_read_rand(struct cache_line *memory, size_t size) {
	size_t x = 0;

	PAPI_hl_region_begin("read_rand");
	uint64_t start = rdtsc();
	for (size_t i = 0; i < size; ++i) {
		size_t idx = pseudo_random() % size;
		x += memory[idx].data[0];
	}
	PAPI_hl_region_end("read_rand");

	uint64_t end = rdtsc();
	printf("read_rand done in %lu cycles\n", end - start);
	return x;
}

Here we have a data structure, struct cache_line, which is 64 bytes in size, i.e., exactly one cache line, and we're using the rdtsc() function from HW1.
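If you don't have your HW1 code handy, a minimal rdtsc() for x86-64 with GCC looks roughly like this (a sketch; your own HW1 version works just as well):

#include <stdint.h>

// Read the CPU's time-stamp counter.
static inline uint64_t rdtsc(void) {
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}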

You can run these tests from your main function like this:

size_t size = 128 * 1024 * 1024;
struct cache_line *region1 = (struct cache_line*) malloc(size*sizeof(struct cache_line));
struct cache_line *region2 = (struct cache_line*) malloc(size*sizeof(struct cache_line));

test_read_seq(region1, size);
test_read_rand(region2, size);
If you run these tests, you will see something like this on the screen:
read_seq done in 3137134583 cycles
read_rand done in 14126511917 cycles

This means that the sequential test spends around 23 cycles per loop iteration (we're accessing 128*1024*1024 cache lines), while the random test spends around 105 cycles per iteration.

If we collect the number of misses for the L1 and L2 caches, we will see something like this:

"regions":[
  {
    "read_seq":{
      "region_count":"1",
      "cycles":"3137157209",
      "PAPI_L1_DCM":"134206944",
      "PAPI_L2_DCM":"5966632"
    }
  },
  {
    "read_rand":{
      "region_count":"1",
      "cycles":"14126501033",
      "PAPI_L1_DCM":"177229228",
      "PAPI_L2_DCM":"137481715"
    }
  }
]

In other words, in the sequential test almost all memory accesses hit L2 (the hardware prefetchers pull the data into the cache ahead of the loads), while in the random test we see 137M L2 misses. Nevertheless, 23 cycles per loop iteration seems a bit high for a workload that hits in L2 (an L2 hit should take around 10 cycles). The reason is that the data still has to stream from DRAM, so the system bottlenecks on the throughput of the memory controller (i.e., the fastest it can ship cache lines to the core is one cache line per 23 cycles).

Submit: What is the memory bandwidth the fully prefetched sequential test achieves?

Unfortunately, there is no reliable way to measure L3 misses on this system, so instead your assignment is to measure the number of TLB misses for both tests.

Submit: Look through available PAPI counters and measure the number of TLB misses on both tests.

Probing in a hash table

For the assignment, let's do some simple measurements of the caching behavior of hash tables. In an open addressing hash table, hash collisions are resolved with probing: when a slot is occupied, we select the next index to try using one of several methods (linear, quadratic, double hashing); see the sketch below.
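For reference, a lookup with linear probing looks roughly like this (a sketch with an assumed entry layout; the provided hashtable.c may differ in its details):

#include <stddef.h>

struct entry {
	unsigned long key;
	unsigned long value;
	int used;	// nonzero if this slot holds a pair
};

// Returns the value stored for key, or -1 if the key is absent.
long lookup(struct entry *table, size_t capacity, unsigned long key) {
	size_t idx = key % capacity;		// initial slot from the hash
	while (table[idx].used) {		// probe until we hit an empty slot
		if (table[idx].key == key)
			return (long)table[idx].value;	// found it
		idx = (idx + 1) % capacity;	// linear probing: try the next slot
	}
	return -1;				// empty slot reached: key not present
}

Note that with linear probing consecutive probes touch adjacent slots, which is the sequential-style access pattern you measured above.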

Let's take a look at a simple hash table implementation hashtable.c.

In this hash table, each key-value pair takes up 32 bytes. The "workload" we perform here is simple: We create a list of random key-value pairs, insert them into the hash table, then look up the keys in a random order.

Submit: Use your knowledge of PAPI counters to understand the performance of the hash table. First, report the number of cycles that are required to compute the hash function. Then, report the number of cycles per lookup for a hash table that fits into the last-level (L3) cache. Compare it with the performance of a hash table that does not fit in L3. Keep the hash table 75% full. Explain the performance numbers you see.

Submit your work

What to submit: take a look at the provided template and fill in the answers to the questions we ask.

Submit your solution through Gradescope (CS250P Computer Architecture). You need to submit main.c and template.md. You can resubmit as many times as you wish.

Updated: November 2021