Homework 3: Hardware Performance Counters

This assignment will familiarize you with various low-level performance counters on x86 processors. We will be using PAPI, a performance instrumentation framework, to help us measure and understand the behavior of CPU caches and the branch predictor. You should perform all measurements on the openlab machines.

NOTE: YOU CANNOT PUBLICLY RELEASE SOLUTIONS TO THIS HOMEWORK. It's ok to show your work to a future employer as a private GitHub/GitLab repo; however, any public release is prohibited.

Hardware performance counters

Note: This overview is partially based on a short answer by John McCalpin. Modern CPUs provide a collection of hardware performance counters that allow you to monitor various microarchitectural events in the CPU cores. The infrastructure for these counters is described in Chapter 18 of Volume 3 of the Intel Architectures Software Developer's Manual. The events that can be counted differ across processor models -- Chapter 19 of the same document briefly describes the events for each model. When HyperThreading is enabled, these counters typically only increment for events pertaining to the specific Logical Processor that is reading the counter. Exceptions are noted in the event descriptions in Chapter 19.

The details vary by processor model, but the core performance counters typically include events related to

  1. Overall instruction fetch/issue/dispatch(execution)/retirement. Some processors have additional events that count "stalls" at various points in the pipeline
  2. Branch execution, conditional branches taken/not taken, and mispredicted conditional branches
  3. Load/store operations, including
    1. Translation Lookaside Buffer accesses/misses, and Page Table Walks
    2. Accesses to the L1 Data Cache, including hits, misses, and fills
    3. Accesses to the L2 Cache
    4. Accesses to the L3 Cache (where applicable)
  4. Floating-point operations can be counted by SIMD width and by operand width (Broadwell and newer cores)

On openlab we're running on the following CPU (at least on circinus-1):

Tue/01.06:/home/aburtsev
circinus-1:1002-/08:05>cat /proc/cpuinfo |grep "model name"
model name	: Intel(R) Xeon(R) CPU           X5680  @ 3.33GHz

I.e., an Intel X5680, which is a Westmere CPU, so we can find all supported performance counters in Section 19.11 of the manual. You will see, however, that the event descriptions are terse, and it can require significant research (or familiarity with the microarchitecture) to make sense of them. Nevertheless, let me encourage you: it's still possible, and it's not as hard as it looks.

Consider an example: DTLB_LOAD_MISSES.ANY counts the number of times execution triggered a page table walk due to a TLB miss (remember, a page walk requires fetching entries from every level of the page table -- 4 or 5 levels on 64-bit systems -- to resolve the physical address behind a virtual address that you use in registers).

PAPI

Since the counters change from one CPU architecture to another, it is convenient to use a generic library that abstracts away low-level details of performance counting on each CPU. One such library is PAPI.

PAPI is a library that provides a high-level interface to low-level performance counters that is uniform across architectures (of course, some details are lost, and once in a while you might need to fall back to measuring a specific counter that is normally abstracted away, but in most cases PAPI is sufficient).

Take a moment to read a paper that provides an introduction to the PAPI library.

Entering the shell environment

For this assignment, we will be using a prepared shell environment on openlab which has PAPI set up for you. On an openlab machine, run the following command:
/home/zhaofel1/Public/cs250p/enter.sh 
You will now be able to use the PAPI library and its commands. To see what you can measure, run `papi_avail`, which lists the performance counters and their availability.

Performing measurements with PAPI

Let's write a simple program that reads memory at random locations produced by a simple pseudo-random number generator. Copy and paste the following sample code into `simple.c`:
#include <stdio.h>
#include <stdlib.h>

unsigned long rd_state = 14695981039346656037lu;
char rd_counter = 0;

// The FNV hash function, used here as a pseudo-random number generator.
unsigned long pseudo_random() {
	rd_state *= 1099511628211;
	rd_state ^= rd_counter++;
	return rd_state;
}

int main(int argc, char **argv) {
	size_t size = 1000000000;
	size_t x = 0;
	char *memory = malloc(size);	// 1 GB buffer that we read at random offsets

	for (size_t i = 0; i < size; ++i) {
		size_t idx = pseudo_random() % size;
		x += memory[idx];
	}

	printf("Sum: %lu\n", x);

	return 0;
}

To add PAPI instrumentation, include `papi.h`, initialize the PAPI library with PAPI_library_init(PAPI_VER_CURRENT);, then surround the code region that you want to measure with PAPI_hl_region_begin and PAPI_hl_region_end calls.
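For instance, the measured loop in simple.c would then look roughly like this (a sketch; the region name "simple" is arbitrary, and error checking of the region calls is omitted for brevity):

	if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
		fprintf(stderr, "Bad PAPI version\n");
		return 1;
	}

	PAPI_hl_region_begin("simple");		// counters start here
	for (size_t i = 0; i < size; ++i) {
		size_t idx = pseudo_random() % size;
		x += memory[idx];
	}
	PAPI_hl_region_end("simple");		// counters stop here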

Behind the scenes, PAPI will use the RDPMC instruction to read the actual platform-specific performance counters at the start and end of our code region, much like the `RDTSC` exercise that we did in Homework 1. Compile the code with

gcc -o simple simple.c -lpapi
To select a set of events to measure, set the PAPI_EVENTS environment variable.

For example, to measure the number of L3 cache accesses and misses:

PAPI_EVENTS=PAPI_L3_TCA,PAPI_L3_TCM ./simple
Simple, right?

The results will be written to the papi_hl_output directory, with previous results automatically moved out of the way (the one without the trailing timestamp is the latest).

Playing with the counters

Let's start with some simple performance measurements to make sure we understand the hardware performance counters exported by PAPI.

Branch mispredictions

Let's start with a couple of examples that help us understand branch prediction on modern CPUs:
// Example 1: Branch always taken
void test_branchalways() {
	size_t size = 1000000000;
	size_t x = 0;

	PAPI_hl_region_begin("branchalways");
	for (size_t i = 0; i < size; ++i) {
		if (pseudo_random()) {
			x++;
		}
	}
	PAPI_hl_region_end("branchalways");

	printf("branchalways done: %lu/%lu\n", x, size);
}

// Example 2: Branch taken 50% of the time
void test_branchhalf() {
	size_t size = 1000000000;
	size_t x = 0;

	PAPI_hl_region_begin("branchhalf");
	for (size_t i = 0; i < size; ++i) {
		if (pseudo_random() % 128 < 64) {
			x++;
		}
	}
	PAPI_hl_region_end("branchhalf");

	printf("branchhalf done: %lu/%lu\n", x, size);
}

int main(int argc, char **argv) {
	int init = PAPI_library_init(PAPI_VER_CURRENT);
	if (init != PAPI_VER_CURRENT) {
		fprintf(stderr, "Bad PAPI version\n");
		return 1;
	}

	test_branchalways();
	test_branchhalf();

	return 0;
}
Here we implement two test functions, test_branchalways() and test_branchhalf(), to test two scenarios: one function always takes the branch, and the other takes it in 50% of cases. Let's compile and run the code. Your goal is to collect three counters: the total number of conditional branches executed by the program, and the numbers of correctly predicted and mispredicted branches.

First, let's take a look at which counters are available through PAPI:

[cs250p:~]$ papi_avail 

We're interested in the following events:

PAPI_BR_CN   0x8000002b  Yes   No   Conditional branch instructions
PAPI_BR_MSP  0x8000002e  Yes   No   Conditional branch instructions mispredicted
PAPI_BR_PRC  0x8000002f  Yes   Yes  Conditional branch instructions correctly predicted

To collect these three counters, you can start your program like this:

[cs250p:~]$ PAPI_EVENTS="PAPI_BR_CN,PAPI_BR_MSP,PAPI_BR_PRC" ./a.out 

Submit: your measurements with an explanation for the counters you see.

Now let's estimate the effect of the mispredicted branches on the performance of the program. To measure how long it takes to execute each test, you can either use the rdtsc() function that you used in HW1 or simply take the number of cycles reported by PAPI (it's reported as "cycles":"XXX").

Now let's use two counters to measure the number of issued and retired (committed) instructions: PAPI_TOT_IIS and PAPI_TOT_INS.
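For example, you can collect both in one run:

[cs250p:~]$ PAPI_EVENTS="PAPI_TOT_IIS,PAPI_TOT_INS" ./a.out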

Submit: the time and instruction measurements with an explanation.

Cache misses

Now let's take a look at a couple of tests that access memory in sequential and random order.
struct cache_line {
	char data[64];
};

// Example 3: Sequential memory read
size_t test_read_seq(struct cache_line *memory, size_t size) {
	size_t x = 0;

	PAPI_hl_region_begin("read_seq");
	uint64_t start = rdtsc();
	for (size_t i = 0; i < size; ++i) {
		x += memory[i].data[0];
	}
	PAPI_hl_region_end("read_seq");

	uint64_t end = rdtsc();
	printf("read_seq done in %lu cycles\n", end - start);
	return x;
}

// Example 4: Random memory read
size_t test_read_rand(struct cache_line *memory, size_t size) {
	size_t x = 0;

	PAPI_hl_region_begin("read_rand");
	uint64_t start = rdtsc();
	for (size_t i = 0; i < size; ++i) {
		size_t idx = pseudo_random() % size;
		x += memory[idx].data[0];
	}
	PAPI_hl_region_end("read_rand");

	uint64_t end = rdtsc();
	printf("read_rand done in %lu cycles\n", end - start);
	return x;
}

Here we have a data structure, struct cache_line, which is 64 bytes in size, i.e., exactly one cache line, and we're using the rdtsc() function from HW1.
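If you don't have your HW1 code handy, a minimal rdtsc() for x86-64 with GCC looks roughly like this (a sketch; your own HW1 version works just as well):

#include <stdint.h>

// Read the CPU's time-stamp counter.
static inline uint64_t rdtsc(void) {
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}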

You can run these tests from your main function like this:

size_t size = 128 * 1024 * 1024;
struct cache_line *region1 = (struct cache_line*) malloc(size*sizeof(struct cache_line));
struct cache_line *region2 = (struct cache_line*) malloc(size*sizeof(struct cache_line));

test_read_seq(region1, size);
test_read_rand(region2, size);
If you run these tests, you will see something like this on the screen:
read_seq done in 3137134583 cycles
read_rand done in 14126511917 cycles

This means that the sequential test spends around 23 cycles per loop iteration (we're accessing 128*1024*1024 cache lines), while the random test spends around 105 cycles per iteration.

If we collect the number of misses for the L1 and L2 caches, we will see something like this:

"regions":[
  {
    "read_seq":{
      "region_count":"1",
      "cycles":"3137157209",
      "PAPI_L1_DCM":"134206944",
      "PAPI_L2_DCM":"5966632"
    }
  },
  {
    "read_rand":{
      "region_count":"1",
      "cycles":"14126501033",
      "PAPI_L1_DCM":"177229228",
      "PAPI_L2_DCM":"137481715"
    }
  }
]

In other words, in the sequential test almost all memory accesses hit L2 (the hardware prefetchers pull the data into the cache ahead of the loads), while in the random test we see 137M L2 misses. Nevertheless, 23 cycles per loop iteration seems a bit high for a workload that hits in L2 (an L2 hit should take around 10 cycles). The reason is that the data still has to stream from DRAM, so the system bottlenecks on the throughput of the memory controller (i.e., the fastest it can ship cache lines to the core is one cache line per 23 cycles).

Submit: What is the memory bandwidth the fully prefetched sequential test achieves?

Unfortunately, there is no reliable way to measure L3 misses on this system, so instead your assignment is to measure the number of TLB misses for both tests.

Submit: Look through available PAPI counters and measure the number of TLB misses on both tests.

Probing in a hash table

For the assignment, let's do some simple measurements of the caching behavior of hash tables. In an open addressing hash table, hash collisions are resolved with probing: when a slot is occupied, we select the next index to try using one of several methods (linear, quadratic, double hashing); see the sketch below.
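For reference, a lookup with linear probing looks roughly like this (a sketch with an assumed entry layout; the provided hashtable.c may differ in its details):

#include <stddef.h>

struct entry {
	unsigned long key;
	unsigned long value;
	int used;	// nonzero if this slot holds a pair
};

// Returns the value stored for key, or -1 if the key is absent.
long lookup(struct entry *table, size_t capacity, unsigned long key) {
	size_t idx = key % capacity;		// initial slot from the hash
	while (table[idx].used) {		// probe until we hit an empty slot
		if (table[idx].key == key)
			return (long)table[idx].value;	// found it
		idx = (idx + 1) % capacity;	// linear probing: try the next slot
	}
	return -1;				// empty slot reached: key not present
}

Note that with linear probing consecutive probes touch adjacent slots, which is the sequential-style access pattern you measured above.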

Let's take a look at a simple hash table implementation hashtable.c.

In this hash table, each key-value pair takes up 32 bytes. The "workload" we perform here is simple: We create a list of random key-value pairs, insert them into the hash table, then look up the keys in a random order.

Submit: Use your knowledge of PAPI counters to understand the performance of the hash table. First, report the number of cycles that are required to compute the hash function. Then, report the number of cycles per lookup for a hash table that fits into the last-level (L3) cache. Compare it with the performance of a hash table that does not fit in L3. Keep the hash table 75% full. Explain the performance numbers you see.

Submit your work

What to submit: take a look at the provided template and fill in the answers to the questions we ask.

Submit your solution through Gradescope (CS250P Computer Architecture). You need to submit main.c and template.md. You can resubmit as many times as you wish.

Updated: November 2021