ICS 45C Spring 2022
Notes and Examples: Behind the Scenes


Background and motivation

Unless you're programming bare machine code using only ones and zeroes, every programming language presents an abstraction over what its target machine actually executes. In a way, every programming language provides you with a sort of virtual machine, even those that don't claim to, and you can understand that language by understanding what outputs and effects to expect from each individual language construct, without necessarily being cognizant of what's actually happening behind the scenes. Even assembly language, which is about as close as human beings normally get to programming raw machine code, abstracts away a fair amount of detail. And, in fact, even the machine instructions provided by your processor are actually an abstraction over even finer-grained implementation details.

One of the keys to understanding a programming language, then, is to understand what the abstraction layer provided by your language looks like. There are lots of questions you need to ask in order to fully understand that abstraction layer: What actually happens when you declare a variable, call a function, or write a loop? How much do those things cost? What details does the language handle for you automatically, and which ones are left for you to manage yourself?

The designers of C++ purposefully chose an abstraction layer that is much closer to the realities of the underlying machine than is typical nowadays, with the goal of allowing programmers to achieve certain aims: greater control over issues that directly affect program performance, the ability to write programs as low-level as those that interface directly with hardware and as high-level as application software, and the ability to target any of a variety of platforms while retaining the ability to achieve a high standard of performance on each one. So we have more fine-grained decisions to make, but that control allows us to achieve things that are sometimes difficult when we don't have that kind of control.

The relative complexity of C++'s abstraction layer places an extra burden on us when we learn and use C++; we have more to learn and understand than we might if we were programming in some other language. And since the complexities are motivated largely by the details of the underlying hardware and platform, we need to understand enough about that hardware and platform to be able to make effective decisions about them. That requires us to know more than we might otherwise need to know about things like how a processor executes instructions, how memory is organized and accessed, and what actually happens behind the scenes when a function is called.

While this course isn't focused directly on issues like these, we do need to understand enough about them for C++ to make sense and for us to use it effectively. You may have taken other courses in our curriculum that cover these kinds of issues in more detail, but since these are not prerequisite courses, I won't assume that you have. But certain details from those courses are included here, not gratuitously, but because they directly support things we'll need to understand about C++.


The Von Neumann architecture

When we compare today's computers to those from many decades ago, it's easy to assume that they're completely different. Today's computers are orders of magnitude faster, provide access to orders of magnitude more memory and external storage, and are capable of achieving things that were inconceivable even at the beginning of your lifetime. Comparing the abilities of even an inexpensive smartphone to what I would see from the first computer I used as a kid in the early 1980s, at the dawn of the personal computing revolution, still amazes me. Strangely enough, though, computers haven't changed as much as you might think. While the components are smaller and faster, the wires connecting them are shorter, they're packed more densely, and they're made quite differently, at a fundamental level, the architecture of computers isn't all that different. Some of the same principles that guided the design of computers in the 1950s still guide their design today.

From a fundamental perspective, a typical computer today still follows what's called the Von Neumann architecture. The Von Neumann architecture describes a computer that has, at its center, a central processing unit (CPU), which orchestrates the execution of programs. Programs are made up of instructions, each of which specifies something that the CPU needs to do. It stores both instructions and data in main memory, which you can think of as one giant array of bytes, with each cell having an address — quite often (though not always) counted in bytes, and quite often numbered sequentially starting from zero (similar to how cells of arrays or lists are indexed in languages like Python or Java). Instructions can modify both data and other instructions in main memory, which allows for programs that are self-modifying or automatically generated.

Near the CPU is a collection of registers, which you can think of as being like little pieces of scratch paper that allow the CPU to have very fast access to a relatively small number (often much fewer than 100) of relatively small values (often 32 or 64 bits each). Accessing a register is a few orders of magnitude faster than accessing main memory, but there aren't very many registers, so they have to be used judiciously in order to see a meaningful benefit. (Achieving performance has much to do with the cost of accessing and manipulating data, which is more expensive the farther away from the CPU it is.)

One of the registers is often called an instruction pointer, whose role is to keep track of the address in main memory where the next instruction to be executed resides. As instructions are executed, this register is updated accordingly; the way to influence a program's control flow is to change where the instruction pointer points.

The CPU, fundamentally, cycles through a sequence of events, which is sometimes called the Von Neumann cycle, for each instruction in a program. The Von Neumann cycle, conceptually, has five steps: fetch the next instruction from main memory, decode it to determine what it's asking for, fetch any operands it needs, execute it, and store its result.

Compared to high-level programming languages, CPU instructions are extremely fine-grained, like "treating the sequence of bytes in registers R1 and R2 as integers, add the integers together and store their sum in the register R3." Most C++ statements compile to several (or even many) CPU instructions, though some compile to just one.
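To make that concrete, here's a deliberately tiny sketch, written in C++ purely as an illustration, of a fetch-decode-execute loop for an imaginary machine with four registers and a made-up ADD instruction like the one just described. Nothing about this machine is real; it exists only to show the shape of the cycle.

    #include <array>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // A made-up instruction format: an opcode and up to three register operands.
    enum class Opcode { Add, Halt };

    struct Instruction
    {
        Opcode opcode;
        int src1;
        int src2;
        int dest;
    };

    int main()
    {
        std::array<int, 4> registers{3, 4, 0, 0};    // R0 through R3
        std::vector<Instruction> memory{             // the "program"
            {Opcode::Add, 0, 1, 2},                  // R2 = R0 + R1
            {Opcode::Add, 2, 2, 3},                  // R3 = R2 + R2
            {Opcode::Halt, 0, 0, 0}
        };

        std::size_t instructionPointer = 0;

        while (true)
        {
            // Fetch the instruction the instruction pointer refers to.
            Instruction instruction = memory[instructionPointer];
            ++instructionPointer;

            // Decode it and execute it.
            if (instruction.opcode == Opcode::Halt)
            {
                break;
            }
            else if (instruction.opcode == Opcode::Add)
            {
                registers[instruction.dest] =
                    registers[instruction.src1] + registers[instruction.src2];
            }
        }

        std::cout << registers[3] << std::endl;      // prints 14
    }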

In short, a program running on a Von Neumann machine is a sequence of instructions that perform gradual modification on a collection of memory. That's it. Everything else is details, albeit important details.


Implementing C++ features at the machine level

The job of a C++ compiler is twofold: to verify that the code you've written is legal C++, and to translate that code into machine code which, when run on the target machine, behaves the way the C++ standard says your program should behave.

So one way to understand C++ is to understand how the abstractions provided by the language map to the underlying realities of the target machine. While we won't make that leap religiously for every feature, it is worth considering a few basic examples, especially if this is the first language you've used that is a close analogue to the underlying machine. If we want to be able to have fine-grained control over the performance of our programs, then we need to have a more thorough understanding of the issues that influence performance.


Implementing variables

We've seen previously that C++ requires us to declare variables before we're permitted to use them, and that their declaration requires us to specify not only a name, but also a type. One reason is so that the static type checker can verify that we're not misusing them (e.g., by storing strings in variables intended to store integers). But another very important reason is that a variable declaration does more than just say that you want a variable; it tells the compiler that memory needs to be allocated to store that variable's value.

The type of the variable is vital, then, because it's the type of the variable that ultimately specifies how much memory you want. For example, we've seen previously that int, short, and long are all examples of integral data types, but that they differ (on the architecture of our ICS 45C VM) in terms of their sizes: an int is 32 bits, a short is 16 bits, and a long is 64 bits. Armed with this knowledge, the compiler can allocate the right amount of memory.

Furthermore, different types may also imply that different CPU instructions are appropriate. For example, adding two 16-bit values might require a different CPU instruction than adding two 64-bit values. Adding two integers will almost certainly require a different CPU instruction than adding two floating-point values. (Remember that we established that CPU instructions are quite fine-grained!) Only if the compiler knows the sizes and types of the values can it emit the instructions that will perform the best — not only in terms of speed, but in terms of giving the correct answer — in a given circumstance.
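To see those sizes for yourself, a short program like the following (a quick illustration, not part of any assignment) asks the compiler directly; the sizes in the comments are the ones described above for the ICS 45C VM, and other platforms are free to make different choices.

    #include <iostream>

    int main()
    {
        // sizeof reports a size in bytes; bytes are 8 bits on essentially
        // every platform you're likely to encounter, so we multiply by 8.
        std::cout << "short: " << sizeof(short) * 8 << " bits" << std::endl;   // 16 on the ICS 45C VM
        std::cout << "int:   " << sizeof(int) * 8 << " bits" << std::endl;     // 32 on the ICS 45C VM
        std::cout << "long:  " << sizeof(long) * 8 << " bits" << std::endl;    // 64 on the ICS 45C VM
    }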


Implementing control structures

We've seen that the Von Neumann cycle, from a simplified perspective, requires fetching an instruction and executing it, then fetching the next instruction and executing it, and so on. But which instruction is the next instruction at any given time, anyway? A program's control flow is the order in which its instructions are executed. High-level languages give us control structures like if statements, loops, and function calls, and we use those for control flow in our programs. But CPU instructions are finer-grained than that, so the higher-level constructs provided by C++ are translated to those finer-grained instructions.

In a C++ function that uses no control structures, the control flow is obvious: One statement is executed, then the next statement is executed, and so on. In fact, the typical CPU default is the right one here; the default behavior on a CPU is to execute the instructions in the order they're stored in memory, so a C++ compiler can simply emit instructions for a straight-line sequence of C++ statements in that same sequence (i.e., the instructions for the first statement, then the instructions for the second, and so on).

Conditionality (e.g., if statements and switch statements) relies largely on a CPU's jump instructions and branch instructions, which are ways of saying things like "The next instruction to be executed is at address X, unconditionally" or "If the register R1 has the value zero, the next instruction to be executed is at address Y; otherwise, continue to the next instruction." The latter of these — conditionally jumping based on a zero or non-zero value — is a strong hint about why C++'s if statement was originally designed the way it was, with an integral conditional expression that is treated as true when the value is non-zero or false when the value is zero. (The more you learn about the underlying machine-level details, the more you'll see how much C++ was influenced by them.)
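As a small, contrived illustration of that design, the two if statements below mean exactly the same thing in C++; the second relies directly on the non-zero-means-true rule.

    #include <iostream>

    int main()
    {
        int count = 3;

        if (count != 0)        // explicit comparison against zero
        {
            std::cout << "non-zero (explicit comparison)" << std::endl;
        }

        if (count)             // equivalent: a non-zero value is treated as true
        {
            std::cout << "non-zero (implicit conversion)" << std::endl;
        }
    }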

Loops, similarly, are reliant on jump and branch instructions. For example, a while loop might be implemented as a test of the loop's condition, a conditional branch that jumps past the loop's body when the condition is false (i.e., zero), the instructions for the body itself, and finally an unconditional jump back to the test.
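One way to visualize that shape without leaving C++ is to rewrite a while loop using labels and goto, which map fairly directly onto the branch and jump instructions a compiler might emit. To be clear, this is only an illustration of the structure of the generated code; it's not how you should write loops, and it isn't literally what any particular compiler produces.

    #include <iostream>

    int main()
    {
        // The loop we mean to write:
        //
        //     int i = 0;
        //     while (i < 5)
        //     {
        //         std::cout << i << std::endl;
        //         ++i;
        //     }

        // Roughly the shape of what might be emitted for it:
        int i = 0;

    test:
        if (!(i < 5))          // conditional branch: if the condition is false...
            goto done;         // ...jump past the loop's body

        std::cout << i << std::endl;
        ++i;

        goto test;             // unconditional jump back to the test

    done:
        ;                      // execution continues here after the loop
    }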

Like if statements, the zero/non-zero behavior of the conditional expressions in loops is influenced directly by the typical CPU branch instruction that lets you branch on a zero or a non-zero value.


What happens when a function is called in C++

One of the most important abstractions provided by C++ is the abstraction of a function. With a very terse bit of syntax, you can call a function, passing it arguments, then get back a result, secure in the knowledge that the program will pick up where it left off after the function is finished running. But just as typical CPUs don't implement control structures directly, they don't implement function calls either, so it becomes necessary to emit instructions that provide the same effect. And it's important to understand that those instructions have a cost, though, as we'll see, there are ways to minimize or eliminate that cost in cases where it's too big of a penalty to pay.

Technically, there is flexibility in the C++ Standard with regard to how function calls are implemented. As with everything else, anything is legal, so long as the observable effect of the function call is preserved. But there are techniques that are fairly standard, and it's good to understand how these techniques work behind the scenes.

Calling conventions

Functions are individual, separate pieces of code, which are typically compiled into individual, separate sequences of CPU instructions. So the first thing that's necessary to give one function the ability to call another is for the two functions to agree on how they'll collaborate with one another. These agreements are called calling conventions.

Calling conventions define exactly how this collaboration will take place, and both functions have to use the same conventions in order to make the collaboration work. It's trickier than it sounds, because there are a few things that need to be handled. From the perspective of two functions, which we'll say are the caller (the function doing the calling) and the callee (the function being called), here are the questions that need to be answered: How does the caller pass arguments to the callee? Where does the callee leave its result so the caller can find it? How does the callee know where in the caller to return control when it finishes? And whose job is it to save and restore the registers that each of them is using?

As it turns out, there isn't a single convention in use everywhere, but there are commonalities that are very important to understand in order to program effectively in C++, so we should focus our attention on them briefly.

The run-time stack and stack pointers

C++ has a function-calling mechanism that is quite similar to the one used in many high-level languages. In particular, there is the explicit notion that function calls unwind, when they're complete, in the opposite of the order they were originally made. For example, if a function a() calls a function b(), and that function b() calls another function c(), when c() returns, control should return to the point within b() just after its call to c(). Put more simply: at any given moment, the function that returns is always the one that was called most recently.
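Here's a tiny demonstration of that ordering, using throwaway functions written only so we can watch the calls unwind; the output shows c() finishing before b(), and b() finishing before a().

    #include <iostream>

    void c()
    {
        std::cout << "c() starts" << std::endl;
        std::cout << "c() ends" << std::endl;
    }

    void b()
    {
        std::cout << "b() starts" << std::endl;
        c();
        std::cout << "b() ends" << std::endl;
    }

    void a()
    {
        std::cout << "a() starts" << std::endl;
        b();
        std::cout << "a() ends" << std::endl;
    }

    int main()
    {
        a();
    }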

As luck would have it, a study of classic data structures tells us how best to organize this information. When you store many values that are best manipulated in a last-in-first-out order, what you need is a stack. And, indeed, that's exactly what is generally used to implement this abstraction; in a running C++ program, there is a run-time stack, which is stored in main memory. When a function is called, information about it is pushed on to the run-time stack, so that the most-recently-called function's information is on the top at any given time. When a function returns, information about it is popped from the run-time stack, revealing the information about the function that called it. At any given time, the address of the top of the run-time stack is stored in a special register called the stack pointer.

So, one way to understand how calling conventions work is to understand who pushes and pops what and when.

A basic calling convention

There is no one calling convention for C++ — and even our compiler supports more than one of them — but what follows is a basic example that captures the essence of how it typically works.

In general, the run-time stack can be said to store activation records, one for each function that is currently active. So in a C++ program that begins in the main() function, let's suppose main() calls a() and that a() calls b(). While b() is running, the run-time stack would look like this:

TOP OF STACK
activation record for b() ← stack pointer
activation record for a()
activation record for main()
BOTTOM OF STACK

The stack pointer would keep track of where the top of the stack is, which would be the top of b()'s activation record. Returning from b() would cause the stack pointer to point to a()'s activation record instead, since b()'s would no longer be necessary. A subsequent call of another function from a() would see a new activation record placed where b()'s was previously. Generally, calls to functions result in the pushing of a new activation record; returns from functions result in the popping of the one on the top.

A given activation record needs to keep track of the data necessary for a function to be called and a result to be returned to the caller. Activation records are laid out somewhat differently in different circumstances — this is one of the ways that calling conventions differ — but an example follows. Let's assume this example is the activation record for b(), which is at the top of the run-time stack:

temporary #n ← stack pointer
temporary #(n-1)
...
temporary #1
local variable #n
local variable #(n-1)
...
local variable #1
saved registers ← frame pointer
return pointer
return value
argument #1
argument #2
...
argument #n

There are a few things depicted here: the arguments that the caller passed to the callee, a slot where the callee will leave its return value, a return pointer (the address of the instruction in the caller where execution should continue once the call is finished), saved copies of registers whose values need to survive the call, the callee's local variables, and temporaries (partial results that the callee pushes and pops while it runs). The frame pointer is another register; it points to a fixed spot within the current activation record, so that arguments and local variables can be found at known offsets from it even as the stack pointer moves up and down.

The act of calling a function, then, is for the caller to push some things on the run-time stack (the arguments, the return pointer, some saved registers) and then for the callee to push others (other saved registers, local variables). While the callee runs, temporaries might be pushed and/or popped. When the callee ends, the process is unwound (i.e., the things that were pushed previously are now popped).
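If it helps to see the idea in code, here is a deliberately simplified model of that push-and-pop discipline, written as ordinary C++ with a std::vector standing in for the run-time stack. Real activation records live in raw memory and are managed by the emitted instructions themselves, not by a library container, so treat this strictly as a conceptual sketch with made-up names.

    #include <iostream>
    #include <string>
    #include <vector>

    // A very rough model of one activation record.
    struct ActivationRecord
    {
        std::string functionName;
        std::vector<int> arguments;
        std::vector<int> localVariables;
        int returnValue;
    };

    int main()
    {
        std::vector<ActivationRecord> runtimeStack;

        // main() calls a(7): push an activation record for a().
        runtimeStack.push_back(ActivationRecord{"a", {7}, {}, 0});

        // a() calls b(7, 10): push an activation record for b().
        runtimeStack.push_back(ActivationRecord{"b", {7, 10}, {}, 0});

        // b() computes its result and returns: store the return value, then pop.
        runtimeStack.back().returnValue = 17;
        std::cout << runtimeStack.back().functionName << " returns "
                  << runtimeStack.back().returnValue << std::endl;
        runtimeStack.pop_back();

        // Control is back in a(), whose record is now on top again.
        std::cout << "top of stack is now "
                  << runtimeStack.back().functionName << std::endl;
    }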

Of course, as you might imagine, there's some work to be done, so the act of calling a function isn't particularly cheap — though "cheap" here is a relative measurement, i.e., if the function is going to do a lot of the work, the cost of the call, relatively speaking, is lower. We'll see later in the course how to avoid this cost for short, simple functions, while still preserving the ability to write short, simple functions (for clarity).


Performance optimizations in the Von Neumann architecture (Optional)

(This section is optional reading, for those interested in more detail about the Von Neumann architecture and how you can use that knowledge to write programs that perform better. It's beyond the scope of this course, but it's germane to the topic, and it may seed your further research and learning going forward.)

We've seen how the Von Neumann architecture has formed the basis of typical computers for decades. Much of the reason why today's computers are so much more capable than their distant predecessors has to do with physics: smaller components that are more tightly packed and can communicate with each other faster. But it would be disingenuous to say that computers haven't changed in any way other than getting smaller; there are things the "vanilla" Von Neumann architecture doesn't do very well, and a lot of work over the past few decades has gone into working around those problems.

While one doesn't need to be an expert in computer architecture to write programs, knowing more about it does improve your ability to optimize a program. A naively-implemented Von Neumann architecture has serious bottlenecks in its design, which are the primary sources of performance problems. To achieve the best results, it helps to understand what those bottlenecks are, the hardware workarounds used to improve them, and where these workarounds require cooperation from a programmer. In this way, you'll improve your ability to write programs that run faster, use less memory, consume less power, and so on.

The memory hierarchy

One key constraint on a CPU's performance is how fast it can get access to the data that it needs. If, for example, an instruction is intended to add two values together, it first has to fetch the two values to be added; depending on where those values are, that may take substantially more or less time. In general, the gating factor is how far away from the CPU the data is. To combat this performance bottleneck, data is arranged in a memory hierarchy, with data that will be needed more often stored closer, and data that will be needed less often stored farther away.

The closest location where data can be stored is in a CPU's registers. Unfortunately, there aren't very many registers, and they aren't very large, so they have to be used judiciously. One of the things a C++ compiler decides is what values are best stored in registers, on the basis of how often (and how soon) you'll be using them.

The CPU has a large block of main memory connected to it, but, unfortunately, access to main memory is orders of magnitude slower than accessing registers. So a middle ground called cache was added in between, which is located closer to the processor and can be accessed faster. Cache is not as fast as registers, but substantially faster than main memory. Typically, caches store values that have been recently accessed, due to a property that is quite common in real programs: recently-accessed values have a greater chance of being accessed again soon than other values do.

It should be pointed out, too, that there are places that you can store data that are even farther away and slower than main memory: disks and networks. In general, the same rules apply: farther-away storage is available in larger amounts than storage that's closer to the processor, but is accessed much more slowly.

C++ programmers need to understand that the memory hierarchy exists specifically because they are the ones who decide where data should be stored and the order in which it is accessed. Should you store data in a file or leave it in main memory? What if you need to store so much data that it doesn't all fit into main memory; what should be stored elsewhere (e.g., in a file on disk)? What if it's too big to fit on disk; which parts should be stored on networked storage?

Caches can have a profound effect on performance. Because of the way data is loaded into caches — in small blocks, as opposed to individual bytes — algorithms that access data sequentially, for example, tend to perform significantly better than algorithms with an equivalent number of data accesses that are farther apart. One simple example is a short program I wrote, which we'll see later this quarter, that creates a two-dimensional table of integers and initializes it in two ways: once by filling in all of the cells in a row before moving on to the next row, and once by filling in all of the cells in a column before moving on to the next column.

The end result was the same in both cases: The two-dimensional table had the same set of values stored in it. But one of them took many times longer to run than the other, because it made poor use of cache. Elements in the same column but in different rows are potentially far away from each other in memory, while adjacent elements in the same row are stored directly next to each other, so initializing entire rows before moving on to subsequent rows has better locality of reference (i.e., we're more often accessing locations in memory that are very near locations we've recently accessed).
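That program isn't shown here, but a sketch of the same kind of experiment looks something like the following; the dimensions and the timing details are chosen just for illustration and aren't necessarily the same as the ones in the program we'll see later.

    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main()
    {
        const std::size_t dimension = 4000;
        std::vector<std::vector<int>> table(dimension, std::vector<int>(dimension));

        auto start = std::chrono::steady_clock::now();

        // Row by row: consecutive accesses touch adjacent memory locations.
        for (std::size_t row = 0; row < dimension; ++row)
        {
            for (std::size_t col = 0; col < dimension; ++col)
            {
                table[row][col] = 1;
            }
        }

        auto middle = std::chrono::steady_clock::now();

        // Column by column: consecutive accesses are far apart in memory.
        for (std::size_t col = 0; col < dimension; ++col)
        {
            for (std::size_t row = 0; row < dimension; ++row)
            {
                table[row][col] = 1;
            }
        }

        auto end = std::chrono::steady_clock::now();

        std::cout << "row by row:       "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(middle - start).count()
                  << "ms" << std::endl;
        std::cout << "column by column: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - middle).count()
                  << "ms" << std::endl;
    }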

Making the best use of available hardware

Individual CPU instructions use only some parts of a CPU at any given time, since the different stages of the Von Neumann cycle do different things. Originally, that meant that large parts of the hardware simply remained dormant in between uses. But it was later discovered that it's possible to execute more than one instruction at a time on a single CPU, as long as the instructions require different parts of the CPU (e.g., they are at different stages of their execution, or they do radically different things).

One technique enabled by this discovery is pipelining, where a processor can be in different stages of executing different instructions simultaneously. Conceptually, we could imagine simultaneously running parts of five instructions, given the five-step Von Neumann cycle, since each step in the cycle uses different hardware: while one instruction is having its result stored, the next can be executing, the one after that can be having its operands fetched, and so on.

This is a great idea that can dramatically speed up a CPU's ability to execute a program, not unlike using many people arranged in an assembly line to build a car — each doing a part of the job in sequence — rather than a single person. But, like an assembly line in a factory, there are some ways that pipelines are limited, and it's important to know about them if you're aiming to achieve the best performance possible.

First of all, pipelines can stall, because some instructions can't begin until the instructions immediately preceding them have finished. For example, if instruction 1 stores its result in register R1 and instruction 2 uses that value, instruction 2 can't start until instruction 1 is finished executing. The deeper a pipeline is — and, in practice, they're often quite a lot deeper than just five stages — the greater the cost. Compilers do a lot to work around this limitation — remember that C++ compilers can generate any code that has the same observable effect as what you wrote, so they can reorder instructions to avoid pipeline stalls, for example, as long as they have the same result — but we sometimes have to write our code more carefully to avoid these problems.
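One place this shows up in ordinary C++ is a long chain of operations in which each one depends on the previous one's result. A common illustration (sketched below with made-up function names) is summing a sequence with one accumulator versus two: the second version gives the hardware two independent chains of additions that it can overlap, and on some hardware it runs measurably faster even though both loops perform the same number of additions. Whether you actually see a difference depends on the compiler, the optimization level, and the CPU.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // One accumulator: every addition depends on the result of the one before it.
    double sumOneChain(const std::vector<double>& values)
    {
        double sum = 0.0;

        for (std::size_t i = 0; i < values.size(); ++i)
        {
            sum += values[i];
        }

        return sum;
    }

    // Two accumulators: the two chains of additions are independent of each
    // other, so the hardware has an opportunity to overlap them.
    double sumTwoChains(const std::vector<double>& values)
    {
        double sumA = 0.0;
        double sumB = 0.0;

        for (std::size_t i = 0; i + 1 < values.size(); i += 2)
        {
            sumA += values[i];
            sumB += values[i + 1];
        }

        if (values.size() % 2 != 0)
        {
            sumB += values.back();
        }

        return sumA + sumB;
    }

    int main()
    {
        std::vector<double> values(10'000'000, 1.0);
        std::cout << sumOneChain(values) << " " << sumTwoChains(values) << std::endl;
    }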

When a CPU encounters a branch instruction, there's an immediate problem: What will the next instruction be? The CPU can't be certain until the branch instruction has been executed to the point where it's decided whether the branch will be taken or not (e.g., by fetching a value from a register and comparing it to zero, comparing the values of two registers, or whatever the instruction calls for), so it's not clear which instruction should be the next one in the pipeline to be started. However, CPUs will quite often take a guess, using a technique called branch prediction. Some CPUs allow compilers to encode a reasonable guess (e.g., in a loop, always guess that we'll stay in the loop, since that's true much more often than not in practice); other times, the CPU will use heuristics. Whenever the CPU guesses wrong, it can abandon the partial calculations it was making on the wrong branch, then continue where it should be. This can be costly, but with a good prediction scheme, at least that cost isn't borne often.
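A classic way to watch branch prediction at work is a loop containing a branch whose outcome depends on the data. In the sketch below (an illustration, not a careful benchmark), the same summation runs over the same values twice, once in shuffled order and once sorted; when the data is sorted, the branch becomes highly predictable, and on many machines the sorted pass runs noticeably faster. Some compilers will turn this particular branch into branchless code at higher optimization levels, which makes the difference disappear.

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <random>
    #include <vector>

    long long sumOfLargeValues(const std::vector<int>& values)
    {
        long long sum = 0;

        for (int value : values)
        {
            if (value >= 128)      // the outcome of this branch depends on the data
            {
                sum += value;
            }
        }

        return sum;
    }

    int main()
    {
        std::mt19937 generator(42);
        std::uniform_int_distribution<int> distribution(0, 255);

        std::vector<int> values(20'000'000);
        for (int& value : values)
        {
            value = distribution(generator);
        }

        auto timeOnePass = [&](const char* label)
        {
            auto start = std::chrono::steady_clock::now();
            long long sum = sumOfLargeValues(values);
            auto end = std::chrono::steady_clock::now();

            std::cout << label << ": sum = " << sum << ", "
                      << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                      << "ms" << std::endl;
        };

        timeOnePass("shuffled");                 // branch outcomes are effectively random

        std::sort(values.begin(), values.end());
        timeOnePass("sorted");                   // branch outcomes become highly predictable
    }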

CPUs nowadays also use another technique called out-of-order execution, which is to say that they are permitted to execute instructions in a different order than the order they're written in the program. One reason they do this is to avoid pipeline stalls (e.g., if instruction 1 stores a value, instruction 2 uses it, and the next several instructions 3-10 don't use it and are otherwise independent, the CPU could do better by executing instruction 1, then instructions 3-10, and finally instruction 2). Of course, they're only permitted to do this when the observable result would be the same, so they have to be aware of where the dependencies between the instructions are (i.e., which instructions need the results of which others).

Branch prediction and out-of-order execution sound like low-level details that you'd never need to be concerned with, but they're actually quite important, particularly in programs that run on multi-core CPUs that share caches and/or memory. To write code that makes proper use of more than one core simultaneously, it's sometimes necessary to take steps to manually prevent certain kinds of out-of-order execution; compilers can handle some of it, but, as it turns out, not nearly all of it. The most recent C++ standard includes library support for tools to make this job easier, but it's exacting stuff!
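For the curious, the library support being referred to includes std::atomic and the memory orderings it accepts. The following is a minimal sketch of the common "write the data, then publish a flag" pattern; the release store and acquire load are what prevent the kind of reordering that would let the reading thread see the flag before the data it guards.

    #include <atomic>
    #include <iostream>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    void producer()
    {
        data = 42;                                       // write the data first...
        ready.store(true, std::memory_order_release);    // ...then publish the flag
    }

    void consumer()
    {
        while (!ready.load(std::memory_order_acquire))   // wait until the flag is set
        {
            // spin (a real program would do something less wasteful)
        }

        std::cout << data << std::endl;                  // guaranteed to print 42
    }

    int main()
    {
        std::thread consumerThread{consumer};
        std::thread producerThread{producer};

        producerThread.join();
        consumerThread.join();
    }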