ICS 45C Spring 2022
Notes and Examples: Illuminating the Dark Corners


Program bugs that hide from us

As you've no doubt seen during your programming adventures, it can sometimes be difficult to figure out why our programs don't behave the way we expect them to. Part of this is psychological: When the meaning of the code we wrote differs from what we thought we wrote, it's often hard to see that difference. (This is sometimes because we have a fundamental misunderstanding about how the language or a library works, but we don't realize it. Other times, it's just simple mistake-making and our brain playing tricks on us, but those tricks can be the hardest ones to get past; we see what we think we see.)

Think about the fairly complex calculation you did in Project #1, which involved several uses of trigonometric functions, consideration of things like the difference between latitude and longitude directions (e.g., 30N wasn't the same thing as 30S), and a careful combination of those things to yield a result. If, after implementing that calculation, you got the wrong result, you may have found yourself at what felt like a bit of an impasse; your program was giving you the wrong answer, but when you compared the code you wrote to the formula in the project write-up, you didn't see any difference.

The cause of a program bug is sometimes utterly self-evident the moment you see it. If you've formatted your output improperly, for example, by printing a different number of decimal digits than you wanted, you'll know exactly where to go to fix it. But as programs become more complex, and most especially when the things that programs do are invisible (i.e., they don't show up in the program's output, but are things that happen behind the scenes), it can be difficult to diagnose and fix their bugs. If all you know is "My great-circle distance was wrong in Project #1," but you have no other information, you have nothing definitive that leads you from the symptom back to the cause.

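If you did get stuck on Project #1 with an incorrect result from your formula, the best next step you could have taken would have been to add temporary output that prints the intermediate results of the calculation, so that you could compare each of them against a value you worked out by hand.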

There's a pretty good chance that this would lead to a more specific symptom. Instead of knowing that the entire calculation was incorrect, you could instead find out which part of the calculation was incorrect. And now you'd have a better idea what you should be focusing your energies on investigating. Finally, once you've figured out the problem and fixed it, you should remove that temporary "debug" output; it's served its purpose already, and when you have too much of it strewn throughout a program, it ceases to make sense.
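As a rough sketch of what that temporary output might look like, here is a great-circle calculation that can print its intermediate values. The haversine formula used here is an assumption standing in for the Project #1 formula, and the names toRadians, greatCircleDistance, and the debug parameter are hypothetical.

```cpp
#include <cmath>
#include <iostream>

// Hypothetical helper: converts degrees to radians.
double toRadians(double degrees)
{
    const double pi = std::acos(-1.0);
    return degrees * pi / 180.0;
}

// Haversine great-circle distance, standing in for the Project #1
// calculation (an assumption; use the formula from your own write-up).
// Latitudes and longitudes are signed degrees (south and west negative).
// When debug is true, the intermediate values are written to std::cerr.
double greatCircleDistance(
    double lat1, double lon1, double lat2, double lon2,
    double radius, bool debug)
{
    double dlat = toRadians(lat2 - lat1);
    double dlon = toRadians(lon2 - lon1);

    double a = std::sin(dlat / 2) * std::sin(dlat / 2)
        + std::cos(toRadians(lat1)) * std::cos(toRadians(lat2))
        * std::sin(dlon / 2) * std::sin(dlon / 2);
    double c = 2 * std::atan2(std::sqrt(a), std::sqrt(1 - a));

    if (debug)
    {
        // Temporary output: remove it once the bug has been found.
        std::cerr << "DEBUG dlat=" << dlat << " dlon=" << dlon
                  << " a=" << a << " c=" << c << std::endl;
    }

    return radius * c;
}
```

Calling greatCircleDistance with debug set to true and comparing each printed value against a hand calculation narrows the problem down to one part of the formula; writing to std::cerr keeps the temporary output separate from the program's real output.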

So, in short, the art of debugging a program sometimes revolves around making things visible that are normally invisible. Visible bugs are usually the easiest ones to fix, because they have a tangible symptom; it's the invisible ones that can be the biggest challenge, because half the battle is making them visible. And it would be nice if we could do that without having to add temporary output to our programs; for thornier problems, it would sometimes be better if we could walk around inside of our programs with a flashlight as they run and see these things for ourselves.

For that, we'll need additional tools, though the good news is that the ICS 45C VM has those tools installed already. This set of notes is aimed at helping you to understand how to use them.


The C++ standard and the meaning of "undefined behavior"

The version of the C++ standard that these notes refer to is called C++17 (in reference to 2017, the year of its completion); a newer version, C++20, has since been published. To be clear, when we talk about the "C++ standard," we're actually talking about a written document that is developed by a large committee of interested parties (compiler implementers, professional software engineers who use C++, researchers and academics, teachers and trainers, and so on) with the goal of forming a complete agreement about three things:

The overarching goal of the document is that there be a single understanding of what C++ is. That understanding becomes important in light of the fact that there are multiple compilers that all aim to implement the same language; three popular alternatives today are Clang (which we use on the ICS 45C VM), GCC, and Microsoft Visual C++. Similarly, there are a number of implementations of the C++ Standard Library. A C++ program conforming to the standard should compile on all three of these compilers alongside any of these library implementations and generate the same output for a given input, except where the standard explicitly allows them to be different (e.g., when a difference in the size of the int type might lead to a different result).

Because of the size and complexity of the language, the C++17 standard weighs in at over 1,600 dense, carefully-worded pages, with a decided focus on minimizing the ambiguity of natural language and making the definitions absolutely clear. The document is not what you would call a "fun read" and isn't a great way to learn C++ if you don't already know anything about it, but if you want to know some esoteric detail of how C++ is supposed to work, the standard can probably answer your question if you study it rigorously enough.

With as much detail, as much complexity, and as many features as exist in C++, there are naturally scenarios that nobody thought of, despite a large committee of experts spending years going through the standard with a fine-tooth comb. When issues like these are discovered, they are usually ironed out in a future version of the standard; for example, C++17 addressed some issues left behind in C++14 and C++11.

However, there is another set of scenarios for which the C++ standard is intentionally silent. Rather than specifying what should happen in these cases, the standard says "In these cases, anything can happen." One simple example is dereferencing a null pointer. While it's not uncommon for a program to crash when you dereference a null pointer, that's not set in stone; the C++ standard does not specify what happens, which means a C++ compiler can generate a program that does anything it wants in this case and still be considered a legal, conforming C++ compiler. And, to be clear, I really do mean "anything it wants," including doing nothing, crashing, using values from unallocated memory and continuing on as if nothing was wrong, and so on.

At first blush, that sounds like a strange way to write a language standard. Why not nail down the behavior for every scenario you've thought of, especially for something as straightforward as dereferencing a null pointer?

Undefined behavior

Recall from the Course Introduction notes that C++ was designed to meet certain overarching goals, the most notable of which are performance-related. Above all else, C++ was intended to provide tools for writing programs that use as few resources — time, memory, and so on — as possible. There's nobility in that goal, of course, but it's not as simple as it sounds, particularly because C++ programs can be compiled for a wide variety of processors and operating systems, from the beefiest multicore server CPUs to lightweight embedded processors. What's fast on one platform might be significantly slower on another, so the specifics of how each language feature is to be implemented are not specified in the standard. Every implementation of C++ implements pointers, but they're not all implemented the same way; a null pointer isn't necessarily a bunch of zero bits, for example, because that wouldn't necessarily be the right choice for every platform (e.g., a particular platform might see "zero" as a perfectly valid address).

This is why the standard remains silent on specifically what happens when you do things that compile but are incorrect at run-time, such as dereferencing a null pointer or accessing an array element beyond its boundaries. Everyone can agree that a program that exhibits these characteristics is incorrect, but to specify precisely how those programs will fail also precludes implementers from making choices that would be more appropriate — and more performant — in the cases where those programs succeed. One of the foremost design goals of C++ is what's sometimes called the zero-overhead principle, which implies that features shouldn't have a cost unless they're used. In light of that, it's best not to charge someone an unavoidable penalty when they don't make a mistake, just so they have a softer landing when they do; better to provide tools that allow someone to fashion themselves a softer landing if they're willing and able to pay the cost, but not to make those tools the default.

(I should point out that there are certainly reasonable arguments why a programming language should make these kinds of things clearer, and that they should provide good tools for diagnosing and fixing mistakes, but, for better or worse, C++ decidedly swings toward performance at the cost of pretty much everything else. The way C++ behaves is not a bug, so much as it's a conscious choice; designing a programming language requires making trade-offs.)
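One example of a "softer landing" that you opt into is the at() member function of std::vector, which checks its index on every call, whereas operator[] does not. The sketch below assumes you're willing to use std::vector in place of a raw array; the helper name tryRead is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: reads v[index] with bounds checking.
// Returns true and stores the element in result when index is valid;
// returns false (leaving result untouched) when it is not.
bool tryRead(const std::vector<int>& v, std::size_t index, int& result)
{
    try
    {
        // at() checks the index on every call and throws
        // std::out_of_range if it's too large. v[index] would skip
        // that check, and an invalid index would be undefined behavior.
        result = v.at(index);
        return true;
    }
    catch (const std::out_of_range&)
    {
        return false;
    }
}
```

The check costs a little time on every access, including the ones that were never going to fail, which is why it isn't the default behavior of operator[].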


Making problems visible when they're obscured by default

Undefined behavior in a C++ program can make it difficult to realize that you have a problem, much less diagnose what it is. For example, consider this function from the Single-Dimension Arrays notes that we saw previously.

void zeroFill(int* a, unsigned int size)
{
    for (unsigned int i = 0; i < size; ++i)
    {
        a[i] = 0;
    }
}

Is this function correct? For the most part, yes, but only if you give it reasonable inputs. Suppose you wrote the following function separately.

void foo()
{
    int* a = new int[50];
    zeroFill(a, 60);
    delete[] a;
}

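This foo() function definitely has a problem, but the compiler won't say anything about it, as the problem is simply beyond the reach of what our compiler is able to reason about; the type system doesn't provide it with enough information. So, what's the problem? The array that a points to has 50 cells, but we've asked zeroFill() to fill 60 of them, so it will write into cells 50 through 59, which don't exist.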

Sadly, the compiler has nothing whatsoever to say about all of this, because the type system doesn't allow us to specify that there's a subtle requirement limiting the values that can be given to zeroFill()'s parameters: size can't be larger than the number of elements that a points to. (And, what's more, the type of a doesn't specify that number of elements, anyway!) Without any way to check this, the compiler dutifully compiles this code with no errors or warnings; nothing is technically illegal from its perspective. But accessing array cells outside of their boundaries is undefined behavior. So what do you suppose happens if you run the program? Let's write a quick main() function and try it.

int main()
{
    foo();
    return 0;
}

I tried compiling and running this on the ICS 45C VM. The result might surprise you: Nothing adverse happened! The program not only compiled successfully, but it also ran to completion with no errors or other effects. It neither crashed nor reported an error; it just started and ended.

Interestingly, things were different when I tried the same thing with a statically-allocated array instead. I replaced the foo() function with this instead.

void foo()
{
    int a[50];
    zeroFill(a, 60);
}

Then when I ran the program on the ICS 45C VM, it crashed with this error message instead.

./run: line 43:  2713 Segmentation fault      (core dumped) $SCRIPT_DIR/out/bin/a.out.$WHAT_TO_RUN

This leads to a couple of interesting questions that we'll consider in turn.

Why were the two versions different?

One thing you should be curious about is why the two versions of the program behaved so differently. It's certainly true that they're both incorrect, and they're even incorrect in more or less the same way: They attempt to write 60 integers into an array of size 50. (More specifically, they attempt to write into cells of the array indexed 50 through 59, when those cells do not exist.) Despite the fact that this is considered undefined behavior in C++, undefined doesn't necessarily mean that it does nothing; it means that different C++ implementations may do different things. But our implementation does something particular: It fills in 60 integers (240 bytes) with zeroes, including 10 integers (40 bytes) that lie just beyond the end of the array.

In the version with a dynamically-allocated array, what lies just beyond the end of the array? There's a pretty good chance that it's unallocated memory, since the array is the only thing we ever dynamically allocated. (There are sometimes data structures that surround dynamically-allocated objects, so we may have actually corrupted something, but since we never dynamically allocated anything else, there was no opportunity for the corruption to cause problems afterward. And, of course, in a longer-running program with more dynamic allocation, we could have potentially written over whatever values had been allocated just beyond our array, which would have become a problem later when we tried to use those values.)

Meanwhile, in the version with a statically-allocated array, the array was on the run-time stack. This means that we not only filled the entire array with zeroes, but we also filled in 40 additional bytes on the run-time stack — which likely contained things like our local variables, the return address from our function, old values of registers, and so on — with zeroes. Subsequently, with our run-time stack corrupted, things went wrong in a way that caused the program to crash, probably as soon as we tried to use the now-erroneous values on the stack (such as when one of our functions returned).

Part of what makes bugs like this difficult to diagnose is that they aren't always visible. And when they are visible, they don't always exhibit the same symptom. This is a tough situation to be in, but, as we'll see, there are tools that can help in cases like this.

The reason these bugs are so important to find and fix, sooner rather than later, is that they can result in outcomes much worse than programs that simply crash. For example, overwriting the run-time stack might result in an activation record whose return address now points somewhere else, which might cause a function to return to code other than the function that called it; carefully-constructed malicious input might exploit a bug to overwrite the run-time stack in a way that causes the program to jump to code that's included in that input, meaning that a malicious user can cause a program to do whatever the user wants, with whatever access rights the program has. Imagine a program that takes its input from the Internet via a socket, or a program that unzips a Zip archive that's been deliberately constructed to trigger a bug in its unzip function so that unzipping it causes files to be deleted rather than created. So, we want our memory-related issues to be flushed out and fixed as early as possible, because they represent a very deep risk indeed.
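Both versions of foo() went wrong because the size was passed separately from the array, so nothing stopped the two from disagreeing. Here is a minimal sketch of one way to make them travel together; the name zeroFillAll is hypothetical, and the technique only applies to arrays whose size is known at compile time.

```cpp
#include <cstddef>

// Hypothetical variant of zeroFill: the parameter is a reference to an
// array of exactly N ints, and the compiler deduces N from the argument,
// so the caller can't pass a size that disagrees with the array.
template <std::size_t N>
void zeroFillAll(int (&a)[N])
{
    for (std::size_t i = 0; i < N; ++i)
    {
        a[i] = 0;
    }
}
```

Given int a[50], the call zeroFillAll(a) fills exactly 50 cells. A pointer returned by new int[50] can't be passed to this function at all, because its type (int*) doesn't carry the size; for dynamically-allocated arrays, the size still has to be tracked some other way.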

What does "segmentation fault" mean?

A segmentation fault is an attempt to access memory that our program is not allowed to access. There are actually many possible causes for a segmentation fault in a C++ program running on the ICS 45C VM, though the simplest one is dereferencing a null pointer. (That's one possibility that explains what the crashing version of our program above was doing behind the scenes. We corrupted our run-time stack by writing zeroes on top of things located outside of the array, which means that we may have overwritten the return address of our function with a null address. When an attempt was then made to return from that function, we were trying to jump to the instruction located at null, leading us to a location in memory that we weren't allowed to access.)

Unfortunately, the error message doesn't tell us much information that we can actually use. What we saw when our program crashed was this.

./run: line 43:  2713 Segmentation fault      (core dumped) $SCRIPT_DIR/out/bin/a.out.$WHAT_TO_RUN

But this is actually a lot less useful than you might think. I ran the program using the ./run script in my project directory on the ICS 45C VM, so the line 43 that's being reported by the error message isn't a line of code in my C++ program; it's a line of code in the ./run script. The 2713 is even less useful: It's something called a process identifier or pid, a unique number given by the operating system to each invocation of a program. (So if we ran the same program again, it would have a different pid.)

Why we need to illuminate the dark corners

Now imagine that you'd run a much larger program and saw that same cryptic error message.

./run: line 43:  2713 Segmentation fault      (core dumped) $SCRIPT_DIR/out/bin/a.out.$WHAT_TO_RUN

If this is our only clue about what went wrong, we're in for a difficult time trying to debug it. All we know is that maybe we've got a null pointer somewhere that we didn't expect, or some kind of memory corruption somewhere, or maybe a pointer that we never initialized but tried to use anyway. But the key is that we'll have almost no idea where or why, especially if the program didn't generate any other output before it crashed — so we'd have no idea about how far it had progressed before things went awry.

Imagine instead that undefined behavior occurs in a large program and that it doesn't crash because of it. Instead, you've silently corrupted some data, but the program continues merrily executing as though nothing was wrong. Now you may not even realize you have a problem until much later, when you have strange output that you don't understand, but with the symptom so far removed from the cause, you'll have a very difficult time trying to work out what happened.

Fortunately, all is not lost. Even though C++ programs lack the usual set of safety nets to which you might have become accustomed during your work in other programming languages — C++ programs can have undefined behavior and continue without crashing or notifying you — there are tools that can monitor the behavior of your C++ programs while they run. While those tools come at the cost of performance, since the monitoring requires additional time and bookkeeping, that tradeoff is often perfectly acceptable when you're developing a program and want to diagnose problems. Some tools can report problems to you in detail as they occur; others can allow you to "pause" your program and ask questions about it, then move it forward gradually and ask more questions, so you can see things, such as the values of variables, that would otherwise be invisible because they're not part of your program's output. Now that we're dynamically allocating memory, managing arrays, and so on, we'll need these tools; there are simply too many mistakes we can make that will turn into perplexing results otherwise. We need to be able to illuminate the dark corners.

To support our work, the ICS 45C VM has some of these tools installed and available, and it's high time we learned a little bit about how to use them.


Valgrind (Memcheck)

The first tool that will be handy for us is called Valgrind. Valgrind is actually a whole collection of tools for monitoring a program and watching for different kinds of issues that can be indicative of problems; however, we'll only be using one of Valgrind's tools, which is called Memcheck. Memcheck monitors a C++ program while it runs (watching as memory allocation and deallocation happens, as pointers are followed, arrays are indexed, and so on) and reports various errors as they're detected. Some examples of things that Memcheck can detect are writes outside the boundaries of a dynamically-allocated array, memory leaks, uses of uninitialized values, and deallocating the same memory more than once.

How to run Memcheck and assess its output

All of the project templates on the ICS 45C VM already support the Memcheck tool, so you can run your programs under Memcheck by using the same ./run script you use normally; the only difference is that you need to include an additional parameter to the script. For example, where you might normally use this command to run your application:

./run app

you would instead use this slightly longer command:

./run --memcheck app

When you use the --memcheck parameter, you're asking for the program to run, but for Memcheck to watch its progress step by step and report on any problems it finds. So let's take a look at what it would say about this program that we saw previously, with one small tweak added so that it'll generate some output.

#include <iostream>

void zeroFill(int* a, unsigned int size)
{
    for (unsigned int i = 0; i < size; ++i)
    {
        a[i] = 0;
    }
}

void foo()
{
    int* a = new int[50];
    zeroFill(a, 60);
    delete[] a;
}

int main()
{
    std::cout << "Hello Boo!" << std::endl;
    foo();
    return 0;
}

When I ran this program under Memcheck on the ICS 45C VM, here's what I saw.

==3151== Memcheck, a memory error detector
==3151== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3151== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==3151== Command: /home/ics45c/projects/darkcorners/out/bin/a.out.app
==3151==
Hello Boo!
==3151== Invalid write of size 4
==3151==    at 0x401167: zeroFill(int*, unsigned int) (main.cpp:7)
==3151==    by 0x4011A5: foo() (main.cpp:14)
==3151==    by 0x401233: main (main.cpp:21)
==3151==  Address 0x52b1d48 is 0 bytes after a block of size 200 alloc'd
==3151==    at 0x483774F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3151==    by 0x4909A69: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libc++.so.1.0)
==3151==    by 0x401193: foo() (main.cpp:13)
==3151==    by 0x401233: main (main.cpp:21)
==3151==
==3151==
==3151== HEAP SUMMARY:
==3151==     in use at exit: 0 bytes in 0 blocks
==3151==   total heap usage: 2 allocs, 2 frees, 72,904 bytes allocated
==3151==
==3151== All heap blocks were freed -- no leaks are possible
==3151==
==3151== For counts of detected and suppressed errors, rerun with: -v
==3151== ERROR SUMMARY: 10 errors from 1 contexts (suppressed: 0 from 0)

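It looks like there's a lot to unpack there, but it's not as complicated as it looks, once you know what you're looking at. What you'll see is a short preamble from Memcheck, then your program's own output interleaved with any errors that Memcheck detects as they happen, and finally a heap summary and an error summary once the program ends.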

So, let's examine what we saw from Memcheck more closely when we ran the program above.

The first thing we see after the preamble is the line of output emanating from our code: Hello Boo!. That's because this is the first thing our program does, so there haven't been any memory-related errors yet. (It's not a general rule that you'll see your output first and then Memcheck's; you'll see things as they happen.)

After that, we see a memory-related error.

==3151== Invalid write of size 4
==3151==    at 0x401167: zeroFill(int*, unsigned int) (main.cpp:7)
==3151==    by 0x4011A5: foo() (main.cpp:14)
==3151==    by 0x401233: main (main.cpp:21)
==3151==  Address 0x52b1d48 is 0 bytes after a block of size 200 alloc'd
==3151==    at 0x483774F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3151==    by 0x4909A69: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libc++.so.1.0)
==3151==    by 0x401193: foo() (main.cpp:13)
==3151==    by 0x401233: main (main.cpp:21)

While that's a little bit intimidating, it's actually got a lot of useful information in it. The error itself is "Invalid write of size 4," which means that Memcheck detected that we were writing four bytes somewhere that we shouldn't have been writing anything. This was followed by a backtrace, which is a description of where we were in our program when the problem occurred. Ours consists of these three lines:

==3151==    at 0x401167: zeroFill(int*, unsigned int) (main.cpp:7)
==3151==    by 0x4011A5: foo() (main.cpp:14)
==3151==    by 0x401233: main (main.cpp:21)

The first of these is the line of code where the problem occurred, namely main.cpp:7 (i.e., line 7 in the file called main.cpp), which is in the function zeroFill(int*, unsigned int). The second of these is the line of code from which zeroFill(int*, unsigned int) was called, namely on line 14 of main.cpp in the function foo(). The third of these is the line of code from which foo() was called, namely on line 21 of main.cpp in the main function.

(Note that everywhere you see something that's written in the style 0x4011A5, a hexadecimal number prefixed with 0x, you're seeing a memory address. You'll rarely need to read those addresses specifically.)

After that, we see a little more information that's helpful: We're being told what's invalid about the write (i.e., why Memcheck thought this was a problem).

==3151==  Address 0x52b1d48 is 0 bytes after a block of size 200 alloc'd
==3151==    at 0x483774F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3151==    by 0x4909A69: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libc++.so.1.0)
==3151==    by 0x401193: foo() (main.cpp:13)
==3151==    by 0x401233: main (main.cpp:21)

The key information is in the first line: The address where we were writing our four bytes is "0 bytes after a block of size 200 alloc'd", which means it is immediately following a block of 200 bytes that was allocated properly. If we're curious, we're also being told where that 200 bytes was allocated — on line 13 of main.cpp in our foo() function. (Where it says malloc and operator new in our backtrace is where foo() calls into the C++ Standard Library to perform the allocation. Even though those C++ Standard Library functions aren't our functions, they're still functions, so they still show up in the backtrace.) Here's what we allocated on line 13 of main.cpp:

    int* a = new int[50];

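And, gradually, the picture is becoming clearer and clearer. Where do all of these numbers come from? Why four bytes? Why 200? On the ICS 45C VM, an int is four bytes, so writing a single int is a write of size 4, and an array of 50 ints occupies 50 × 4 = 200 bytes.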

Suddenly, the error message becomes clear: We're writing to the cell just beyond the end of our array. And, indeed, when we look at line 7 of main.cpp, where our error emanated from, here's what we see.

    a[i] = 0;

This is a write of a four-byte int into a cell in the array. We know i must be 50, because the address being written is 0 bytes after the end of the 200-byte block, which puts it 200 bytes from the start of the array, and 200 / 4 = 50. Of course, that particular line isn't where the bug is, but now that we understand the symptom, we would be able to work our way back to the bug, which is that we're filling the array beyond its boundary.

After our program ended, Memcheck then printed some summary information.

==3151== HEAP SUMMARY:
==3151==     in use at exit: 0 bytes in 0 blocks
==3151==   total heap usage: 2 allocs, 2 frees, 72,904 bytes allocated
==3151==
==3151== All heap blocks were freed -- no leaks are possible
==3151==
==3151== For counts of detected and suppressed errors, rerun with: -v
==3151== ERROR SUMMARY: 10 errors from 1 contexts (suppressed: 0 from 0)

The heap summary tells us the overall state of the heap when the program ended. In our case, there was no memory "in use at exit" (i.e., there was nothing that got allocated and was not subsequently deallocated). In total, there were two memory allocations ("2 allocs") that were both deallocated ("2 frees"); the total size of those memory allocations was 72,904 bytes. (That might seem strange. Our array was 200 bytes; where did the other 72,704 bytes come from? The answer is that the C++ Standard Library does some allocation behind the scenes.)

Finally, we see in the error summary that there were a total of ten errors. That may seem strange; why did we only see one of them? It's because it was the same error emanating from the same line of code ten times in a row — we wrote to cells indexed 50 through 59 in an array of size 50 — so Memcheck only reported the error once, but counted it ten times.
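The arithmetic behind Memcheck's numbers can be checked mechanically. This is only a sketch: the function names are hypothetical, and the four-byte element size is taken from the "size 4" in Memcheck's report rather than guaranteed by C++.

```cpp
#include <cstddef>

// Index of the array cell that starts at the given byte offset from the
// beginning of the block, for cells of elementSize bytes each.
constexpr std::size_t cellIndexAt(std::size_t byteOffset, std::size_t elementSize)
{
    return byteOffset / elementSize;
}

// Number of writes that land outside an array of `capacity` cells when a
// loop writes cells 0 through requested - 1.
constexpr std::size_t writesPastEnd(std::size_t capacity, std::size_t requested)
{
    return requested > capacity ? requested - capacity : 0;
}

// Assumes a four-byte int, as reported by Memcheck's "size 4".
static_assert(cellIndexAt(200, 4) == 50, "first invalid write is to a[50]");
static_assert(writesPastEnd(50, 60) == 10, "cells 50 through 59");
```

The first check matches "0 bytes after a block of size 200," and the second matches the count of ten errors in the error summary.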

Memory leaks

Suppose that we modified our program above so that the foo() function looked like this instead.

void foo()
{
    int* a = new int[50];
    zeroFill(a, 60);
}

In particular, we've removed the deletion of the dynamically-allocated array. As we've seen, that will result in a memory leak, because we now have 200 bytes of memory that we haven't deleted, but to which we no longer have a pointer; we can't get to it anymore, so we can't ever delete it. This, too, is a problem with no immediately visible symptom, and if we ran our program without a tool like Memcheck, we might never know we had this problem. But Memcheck will make this problem visible, as well, by reporting it in its heap summary. If we ran this new program under Memcheck, we would see a heap summary that looks like this.

==3336== HEAP SUMMARY:
==3336==     in use at exit: 200 bytes in 1 blocks
==3336==   total heap usage: 3 allocs, 2 frees, 73,928 bytes allocated
==3336==
==3336== 200 bytes in 1 blocks are definitely lost in loss record 1 of 1
==3336==    at 0x483774F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3336==    by 0x4909A69: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libc++.so.1.0)
==3336==    by 0x4012A3: foo() (main.cpp:13)
==3336==    by 0x401363: main (main.cpp:20)
==3336==
==3336== LEAK SUMMARY:
==3336==    definitely lost: 200 bytes in 1 blocks
==3336==    indirectly lost: 0 bytes in 0 blocks
==3336==      possibly lost: 0 bytes in 0 blocks
==3336==    still reachable: 0 bytes in 0 blocks
==3336==         suppressed: 0 bytes in 0 blocks
==3336==
==3336== For counts of detected and suppressed errors, rerun with: -v
==3336== ERROR SUMMARY: 11 errors from 2 contexts (suppressed: 0 from 0)

Now we see that there is one block of 200 bytes "in use at exit"; that's our array of 50 integers. We're also shown where that allocation occurred: On line 13 of main.cpp in our foo() function.

Also, now that there are memory leaks, we see something called a leak summary, which counts the total number of bytes and blocks that were still allocated when the program ended, while also categorizing those blocks of memory into one of five types: definitely lost, indirectly lost, possibly lost, still reachable, and suppressed.

For the most part, you won't need to think too carefully about these categorizations, other than to know that you don't need to worry about suppressed leaks, and that everything else reported is something you should be concerned about and will need to fix.
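For comparison, here is a sketch of a version of foo() that addresses both of the problems Memcheck has reported so far: the size passed to zeroFill() matches the allocation, and the array is deallocated before the function returns. The name fooFixed is hypothetical, and it returns the sum of the cells only so that a caller has something to observe.

```cpp
void zeroFill(int* a, unsigned int size)
{
    for (unsigned int i = 0; i < size; ++i)
    {
        a[i] = 0;
    }
}

// Hypothetical corrected version of foo(). It returns the sum of the
// array's cells only so that its effect can be observed by a caller.
int fooFixed()
{
    const unsigned int size = 50;

    int* a = new int[size];
    zeroFill(a, size);      // same size as the allocation: no writes past the end

    int sum = 0;
    for (unsigned int i = 0; i < size; ++i)
    {
        sum += a[i];
    }

    delete[] a;             // matches new[]: the array is not leaked
    return sum;
}
```

Using one named constant for both the allocation and the fill means the two sizes can't drift apart when one of them is edited later.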

Other kinds of memory-related errors that Memcheck can detect

Memcheck monitors various kinds of memory usage and reports on the things it sees that are indicative of program errors. We've seen already that it can report on accessing arrays outside of their boundaries and memory leaks; what else can it find?

Use of uninitialized values

One very useful thing that Memcheck can catch is the use of values that have never been initialized. Remember that using an uninitialized value is considered undefined behavior in C++, which means that it's technically legal (in the sense that a program can compile with this problem) but that you can't count on what the outcome will be. In some circumstances, compilers will warn you about these things, such as in this example.

void foo()
{
    int a;
    std::cout << a << std::endl;
}

The compiler on the ICS 45C VM will generate a warning in this case (and, in fact, since our warnings are configured to become errors, this program won't compile successfully on the ICS 45C VM), even though this is technically a legal C++ program. But our compiler won't catch every instance of this kind of thing. Consider, instead, this example.

void foo()
{
    int a[50];
    std::cout << a[0] << std::endl;
}

We've statically allocated an array of 50 integers, never set any of their values, but then tried to obtain the first one and print it. This program both compiles and runs on the ICS 45C VM, but there's no guarantee, generally, about what this program will do. It turns out, when I tried it, that I got the output 0, but that's not something we can count on. And, what's worse, we're not going to know we have a problem; the program merrily prints out an integer value and moves on.

Using Memcheck, though, the problem gets reported to us as an error.

==3514== Conditional jump or move depends on uninitialised value(s)
==3514==    at 0x5105296: vfprintf (vfprintf.c:1637)
==3514==    by 0x51D2CA8: __vsnprintf_chk (vsnprintf_chk.c:63)
==3514==    by 0x48F0981: ??? (in /usr/lib/x86_64-linux-gnu/libc++.so.1.0)
==3514==    by 0x48F0756: std::__1::num_put<char, std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> > >::do_put(std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> >, std::__1::ios_base&, char, long) const (in /usr/lib/x86_64-linux-gnu/libc++.so.1.0)
==3514==    by 0x48CF1B6: std::__1::basic_ostream<char, std::__1::char_traits<char> >::operator<<(int) (in /usr/lib/x86_64-linux-gnu/libc++.so.1.0)
==3514==    by 0x4013AF: foo() (main.cpp:13)
==3514==    by 0x4013F3: main (main.cpp:20)
==3514==  Uninitialised value was created by a stack allocation
==3514==    at 0x401394: foo() (main.cpp:12)

Again, this is a bit of an intimidating error message when you first see it, but now that we've seen one of these before, we have an idea how we might be able to read it.

Note that line 12 of main.cpp is this one:

    int a[50];

and that line 13 is this one:

    std::cout << a[0] << std::endl;

Putting all of this together, we can reach a reasonable conclusion about what happened. On line 13, we used a value from the array of 50 integers allocated on line 12, but that value had never been initialized. It was used by something in the C++ Standard Library, which is consistent with what we're seeing, because what we did with the value was try to write it to std::cout.
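One way to make this particular error go away is to give every element of the array a value at the point where it's declared. What follows is a minimal sketch of that idea, not the only possible fix; the helper function firstElement is a name I've made up for illustration. The empty braces value-initialize the array, which sets all 50 integers to zero.

```cpp
#include <iostream>

// Hypothetical helper for illustration: returns the first element of a
// local array whose elements have all been given a value.
int firstElement()
{
    int a[50]{};    // empty braces value-initialize every element to 0
    return a[0];    // reading a[0] is now well-defined
}

// Same shape as foo() above, but printing an initialized value.
void foo()
{
    std::cout << firstElement() << std::endl;
}
```

With this change, there's no uninitialized value left for Memcheck to complain about, because every element was written before any element was read.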

Deallocation when it's not allowed

A key rule we've seen in C++ is that we must dynamically deallocate everything that we dynamically allocate; what's more, we must only deallocate it once. If we forget to deallocate something, Memcheck will report it as a memory leak. But what happens if we deallocate the same memory twice?

void foo(int* a)
{
    delete a;
    delete a;
}

Suppose that we call foo() and we pass it a pointer to a dynamically-allocated integer. What happens then? This is technically undefined behavior in C++, which means that anything can happen. In practice, I've seen this cause program crashes, and I've also seen this be silently ignored. When I tried this on the ICS 45C VM, it silently succeeded; no errors or warnings from the compiler, and no program crash. When I ran this under Memcheck, on the other hand, the problem became evident.

==3638== Invalid free() / delete / delete[] / realloc()
==3638==    at 0x483897B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3638==    by 0x401447: foo(int*) (main.cpp:13)
==3638==    by 0x401472: main (main.cpp:21)
==3638==  Address 0x52b1db0 is 0 bytes inside a block of size 4 free'd
==3638==    at 0x483897B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3638==    by 0x401429: foo(int*) (main.cpp:13)
==3638==    by 0x401472: main (main.cpp:21)
==3638==  Block was alloc'd at
==3638==    at 0x483774F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3638==    by 0x4909A69: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libc++.so.1.0)
==3638==    by 0x40146A: main (main.cpp:20)

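Generally, the error message here indicates an invalid deletion. A backtrace tells us where the invalid deletion occurred. We're also told a couple of other useful things: where the same block of memory had already been freed, and where that block was originally allocated (on line 20 of main.cpp, within main).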
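One common defensive habit, sketched below under my own assumptions, is to reset a pointer to nullptr immediately after deleting through it. This works because applying delete to a null pointer is defined to do nothing. The names deleteAndClear and deleteTwiceSafely are made up for this example, and the technique only protects the one pointer variable that gets reset; any other copies of the pointer would still refer to freed memory.

```cpp
// Hypothetical helper (not part of the notes): deletes the integer, then
// resets the caller's pointer so that a repeated call does nothing.
void deleteAndClear(int*& p)
{
    delete p;       // applying delete to a null pointer has no effect
    p = nullptr;    // the caller's pointer no longer refers to freed memory
}

// Calls the helper twice on the same pointer variable and reports whether
// the pointer ended up null.
bool deleteTwiceSafely()
{
    int* a = new int{42};
    deleteAndClear(a);
    deleteAndClear(a);      // second call sees nullptr, so nothing is freed
    return a == nullptr;
}
```

Taking the pointer by reference is what allows the helper to change the caller's copy; passing it by value, as foo() does above, would leave the caller holding a pointer to memory that's already been deallocated.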

Note that this isn't the only kind of deallocation you aren't allowed to do. Another potential mistake is deleting stack-allocated memory.

void foo(int* a)
{
    delete a;
}

void bar()
{
    int x;
    foo(&x);
}

When I tried this on the ICS 45C VM without Memcheck, it compiled and ran, but I got a mysterious-looking error message with no useful explanation.

munmap_chunk(): invalid pointer
./run: line 43:  3681 Aborted                 (core dumped) $SCRIPT_DIR/out/bin/a.out.$WHAT_TO_RUN

On the other hand, when I ran the same code under Memcheck, I got a much more useful error message.

==3691== Invalid free() / delete / delete[] / realloc()
==3691==    at 0x483897B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3691==    by 0x401469: foo(int*) (main.cpp:12)
==3691==    by 0x401480: bar() (main.cpp:18)
==3691==    by 0x4014A3: main (main.cpp:23)
==3691==  Address 0x1fff00031c is on thread 1's stack
==3691==  in frame #2, created by bar() (main.cpp:22)

In particular, I'm being told that the deletion was invalid because the pointer I passed to delete points to something on the run-time stack. Usefully, it even tells me what line of code that object was declared on — line 22 is where I declared the local variable x in bar().
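The underlying rule is that delete is only for objects that came from new. One way to make that rule harder to break is to let a smart pointer own anything that's dynamically allocated, and to leave stack-allocated variables alone entirely. The following is a small sketch of that contrast, assuming a compiler that supports C++14 or later (for std::make_unique); the function names are hypothetical.

```cpp
#include <memory>

// Hypothetical example: the integer is dynamically allocated, and the
// unique_ptr deletes it exactly once when 'owned' goes out of scope.
int heapValue()
{
    std::unique_ptr<int> owned = std::make_unique<int>(42);
    return *owned;
}   // 'owned' is destroyed here, which deletes the integer

// A stack-allocated variable needs no delete at all; it is destroyed
// automatically when the function returns.
int stackValue()
{
    int x = 42;
    return x;
}
```

Neither function contains an explicit delete, so there's no opportunity to delete the wrong thing or to delete the right thing twice.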

What to do if you get overwhelmed with Memcheck output

When I get several or many errors from Memcheck, I quickly feel overwhelmed. One good trick for handling that is to consider only the first error message, figure out and fix the root cause of that error, and then try my testing again. What I've found is that Memcheck will often "over-report" problems, in the sense that one mistake in my code leads to many error messages from Memcheck (not only the primary problem, but many downstream problems caused by the primary one), similar to how one small, innocent mistake in a program can sometimes lead to 100 or more errors from a compiler.

So I'd suggest that you use the same technique, as well. When you run Memcheck and find that it reports many errors, focus on the first one and ignore the others. That will let you focus on one thing at a time, but also will prevent you from spending time running down rabbit holes that have nothing useful at the bottom of them; if the root cause of every one of the first twelve errors is really the same thing, better to go hunting for that root cause once instead of twelve times.

The "clean bill of health" you should be looking for

One thing you're trying to determine when you run Memcheck is whether there are any issues you should be concerned about, so it's worth pointing out what a "clean bill of health" looks like. How do you know there's nothing to worry about?

But wait! There's more!

Indeed, there are other things that Memcheck is capable of finding, but they're mainly not issues that affect our work in this course. If you want to take a deeper look at what Memcheck can do, full details can be found in the Memcheck section of the Valgrind user manual.


The LLDB debugger

Whereas Valgrind's Memcheck tool does a nice job of monitoring a program while it runs and reporting on the problems it's specifically trained to see, sometimes what you want to be able to do is interactively ask questions about a program while it runs. Not all difficult-to-diagnose issues are memory-related issues; sometimes, we have problems in our intricately-woven logic that we need to unravel.

One way to do that is to instrument your program with debug output — print the values you want to see to std::cout, then run your program and see what it says. And that can be a useful technique, but it requires you to know ahead of time what you want to see, to change your program to add the instrumentation, and to bear the risk that this may change a program's behavior enough that the bug you're looking for disappears or changes character, especially when the bug is something more esoteric, such as undefined behavior or an issue that's timing-related.

An alternative is to use a tool designed to let you ask questions more interactively while your program runs. For this purpose, a debugger can be a wonderfully useful tool. The ICS 45C VM includes a debugger called LLDB, which is part of the same set of tools from which our compiler, Clang, arises. (The entire set of tools is called LLVM.)

First, we should clear up a misnomer: A debugger doesn't actually debug anything. You debug programs by using one. The job of a debugger is to make visible the inner workings of your program — the values of variables, the contents of the run-time stack, and so on — along with the ability to pause your program at opportune times and then inch forward slowly, so you can see the effect of individual lines of your code as they run, asking questions about its "paused" state before inching it forward some more. (Some debuggers even let you "rewind" a program backward instead of just pausing and moving forward — a technique that's cleverly called time-travel debugging — though ours doesn't have that particular ability.)

A debugger isn't a detective, but it gives you the tools to be one yourself. And that's the first rule of debugging: When your program is behaving a particular way, there's always a reason for it, so your goal is to gather enough evidence to explain the cause of the problem, rather than just guessing indiscriminately about what it might be. Debuggers help you to gather that evidence, but it'll be up to you to decide what evidence you need. What you have is a symptom, and what you need is a trail of evidence that leads you back to the cause.

Before we can gather that evidence with a debugger, though, we need to know how to use it. How do we specify where we want our program to pause? What questions can we ask about it? How do we let our program run again once we've gotten our answers?

Starting the debugger on the ICS 45C VM

You've seen previously that each of your project directories on the ICS 45C VM contains a ./run script, which you can use to run your program after you've successfully compiled and linked it. Additionally, there is a ./debug script, which instead runs the LLDB debugger. You can debug any of the three programs in each project directory — app, exp, or gtest — by issuing any of these three commands in the Linux shell.

./debug app
./debug exp
./debug gtest

(If you just issue the command ./debug without an argument, it'll default to debugging your app program.)

Unlike when you run your program under Memcheck, launching the debugger doesn't actually start your program; it only starts the debugger and tells it to associate itself with whichever program you want to debug. The reason for this is simple: Being able to use the debugger to ask questions about the current state of a program requires that program to be "paused," so the debugger starts out by allowing you to set things up; if the program just started running, it might end before you had a chance to interject.

What you'll see at startup is something like this.

(lldb) target create "/home/ics45c/projects/example/out/bin/a.out.app"
Current executable set to '/home/ics45c/projects/example/out/bin/a.out.app' (x86_64).
(lldb) 

Interacting with LLDB requires you to type individual commands. Each command is made up of text that you would type on a single line. Whenever you see (lldb), that's a prompt at which you can enter a command. (You see it twice at the beginning because the first command, the one that associates the debugger with the program you wanted to debug, has already been entered for you.)

So, to make use of the debugger, we'll first need to learn some of the commands that we can give to it.

Analyzing the cause of a program crash

Suppose that we start with the following C++ program, written in a file called main.cpp.

#include <iostream>

void zeroFill(int* a, int n)
{
    for (unsigned int i = 0; i < n; ++i)
    {
        a[i] = 0;
    }
}

int main()
{
    int* a = nullptr;
    zeroFill(a, 10);

    return 0;
}

If we compile and run this program on the ICS 45C VM, all we see is the cryptic error message we saw in a previous example.

./run: line 43:  7743 Segmentation fault      (core dumped) $SCRIPT_DIR/out/bin/a.out.$WHAT_TO_RUN

Given that this is a short program, we might be able to infer the cause of the problem by reading the code. But if we couldn't, what could we do next? One good next step might be to use the debugger, because rather than just showing an error message and crashing, it will stop at the point where the crash occurred and let us ask questions about the program's state. Let's try it.

We'd first issue the command in the Linux shell to run the debugger. (Let's suppose that this is the app program in our project directory.)

./debug app

Next, LLDB would start up and we'd be able to type a command. The first command we'll learn is run, which is how you tell the debugger that you want your program to start running. What we would see, from there, is our program starting, followed by details about the ensuing crash.

(lldb) run
Process 7782 launched: '/home/ics45c/projects/debugcrash/out/bin/a.out.app' (x86_64)
Process 7782 stopped
* thread #1, name = 'a.out.app', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
    frame #0: 0x0000000000401137 a.out.app`zeroFill(a=0x0000000000000000, n=10) at main.cpp:8
   5    {
   6        for (unsigned int i = 0; i < n; ++i)
   7        {
-> 8            a[i] = 0;
   9        }
   10   }
   11

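So, what do we see here? As is often the case with the output of tools like this, what we see can be intimidating when we don't know how to read it, but it's actually a lot more useful than it looks. Let's go through it piece by piece.

"Process 7782 launched" tells us that our program started running, and "Process 7782 stopped" tells us that it was subsequently paused.
The "stop reason" is the signal SIGSEGV (i.e., a segmentation fault), which arose from an attempt to access the invalid address 0x0.
The line beginning with "frame #0" tells us that the program was in the function zeroFill, on line 8 of main.cpp, and shows the values of that function's parameters: a was 0x0000000000000000 and n was 10.
The source listing shows the code surrounding that line, with an arrow pointing to the line that was executing when the program stopped.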

From this, then, some things are apparent. The crash occurred because of a segmentation fault, which occurred on line 8, in which we were trying to do this.

a[i] = 0;

We can see that a had the value nullptr at the time of the crash. If we were curious about i's value, we could ask LLDB.

(lldb) print i
(unsigned int) $0 = 0

What we see there are a few things: the type of i (unsigned int), a shorthand name for the value we asked for ($0, which makes it easy for us to ask for the same value again by issuing the command print $0), and the value itself (0).

So, at the time of our crash, i had the value 0. Looking at the loop, that tells us that the first loop iteration crashed; there were no successful ones.

We can also see the values of all of the local variables and parameters in a stack frame at once.

(lldb) frame variable
(int *) a = 0x0000000000000000
(int) n = 10
(unsigned int) i = 0

Of course, we now know that a had the value nullptr, and that this was the immediate cause of our crash. But where did a get its value from? It was passed as a parameter to zeroFill, which means it came from the function that called zeroFill. How can we find out more about that? By looking elsewhere on the run-time stack.

(lldb) thread backtrace
* thread #1, name = 'a.out.app', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
  * frame #0: 0x0000000000401137 a.out.app`zeroFill(a=0x0000000000000000, n=10) at main.cpp:8
    frame #1: 0x0000000000401175 a.out.app`main at main.cpp:16
    frame #2: 0x00007ffff75a309b libc.so.6`__libc_start_main(main=(a.out.app`main at main.cpp:14), argc=1, argv=0x00007fffffffe4f8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe4e8) at libc-start.c:308
    frame #3: 0x000000000040104a a.out.app`_start + 42

When we ask for a "backtrace," we're asking to see all of the frames on the run-time stack. What we see is that zeroFill was called by our main function, which was in turn called by some other code that's part of C's standard library.

Suppose we wanted to know more about what was going on in that main function at the time of the crash. We could find out by switching frames.

(lldb) frame select 1
frame #1: 0x0000000000401175 a.out.app`main at main.cpp:16
   13   int main()
   14   {
   15       int* a = nullptr;
-> 16       zeroFill(a, 10);
   17
   18       return 0;
   19   }

This shows us what line of code within main was executing at the time of the crash (line 16, which called zeroFill), as well as some of the code around that line. We can also interrogate the values of parameters and local variables, the same way we did before.

(lldb) print a
(int *) $1 = 0x0000000000000000

Just by switching frames and printing the values of variables, we can figure out a lot about what's going on at any given time in a program. When analyzing a crash, we can find out the details of the program's state at the time of the crash; that's a powerful thing to be able to do. There's not a lot else we can do with this particular program — since it's already crashed — but there's more to the story, which we can see if we consider another example. But, for now, we can quit our current LLDB session, since we've gotten what we came for.

(lldb) quit
Quitting LLDB will kill one or more processes. Do you really want to proceed: [Y/n] y
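Having traced the null pointer back to main, one plausible fix is to make sure that the pointer refers to an actual array before we call zeroFill. The sketch below shows that idea; the caller sumAfterZeroFill is a hypothetical function I've added so there's something to check, and I've also changed the loop index to a signed int so its type matches that of n.

```cpp
// The same idea as zeroFill above, with a signed loop index that matches
// the type of n.
void zeroFill(int* a, int n)
{
    for (int i = 0; i < n; ++i)
    {
        a[i] = 0;
    }
}

// Hypothetical caller: allocates ten integers, zero-fills them, adds them
// up, and deallocates the array before returning the sum.
int sumAfterZeroFill()
{
    int* a = new int[10];   // a now points to ten integers that really exist
    zeroFill(a, 10);

    int sum = 0;

    for (int i = 0; i < 10; ++i)
    {
        sum += a[i];
    }

    delete[] a;             // arrays allocated with new[] need delete[]
    return sum;
}
```

If zeroFill is doing its job, sumAfterZeroFill returns 0, and running the program under Memcheck should show neither an invalid write nor a leak.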

Breakpoints and stepping

Program crashes aren't the only times when we might like to pause a program and interrogate its current state. We'd also like to be able to pause programs in midstream, ask questions about their state, and then continue where we left off. LLDB can do that, too, but it requires us to have one more fundamental concept: How do we tell LLDB where to pause the program? The answer lies in a feature called breakpoints. Setting a breakpoint tells LLDB that it should pause when it reaches a certain point in a program, so you can issue commands.

Let's consider the following C++ program as an example, which is intended to wait for the user to type two decimal numbers as input, and then calculate the length of the hypotenuse of a right triangle with the two inputs treated as the lengths of its two other sides. (I've left a small bug in the program, which you may easily see, but let's imagine that you didn't notice. We've all had easy bugs elude us when we're tired, distracted, pressured, or otherwise unfit to find them.)

#include <cmath>
#include <iostream>

double hypotenuseLength(double side1Length, double side2Length)
{
    double side1Squared = side1Length * side1Length;
    double side2Squared = side2Length + side2Length;

    return std::sqrt(side1Squared + side2Squared);
}

int main()
{
    double side1Length;
    double side2Length;

    std::cin >> side1Length >> side2Length;
    std::cout << hypotenuseLength(side1Length, side2Length) << std::endl;

    return 0;
}

If we compiled and ran this program, we might have the following interaction with it.

3 4
4.12311

Our knowledge of mathematics tells us that this is the wrong answer; the length of the hypotenuse should be 5. If we glanced at our program and didn't find the bug, we'd have no easy way to figure it out; all the program does is quietly print the wrong answer.

One technique we could use at this point would be to add debug output to our program. Rather than doing that, though, let's see if we can get LLDB to help us find our mistake instead.

A glance at the code for reading input and writing output tells us that the likeliest cause of our problem is a bug in hypotenuseLength, so we'll start with that hypothesis. We might, then, like to see what happens, in detail, when hypotenuseLength runs. To do that, we need to launch the debugger, tell it to pause our program when it gets into that function, and then start our program running.

(lldb) breakpoint set --name hypotenuseLength
Breakpoint 1: where = a.out.app`hypotenuseLength(double, double) + 18 at main.cpp:7, address = 0x00000000004011d2

The breakpoint set command, generally, adds a new breakpoint. You can have as many breakpoints at any given time as you'd like. The breakpoints are numbered, beginning at 1, so we see that this one was numbered 1; subsequent ones we set will be numbered progressively higher. We see, also, that the breakpoint is associated with a function and a line of our source code. (The "address" is just a representation of where, in memory, this code is; that will generally not be something we're concerned with.)

We can set breakpoints on functions or on lines of our source code. A few alternative ways to set the same breakpoint would have been these.

(lldb) b hypotenuseLength
(lldb) breakpoint set --file main.cpp --line 7
(lldb) b main.cpp:7

You can see which breakpoints you've set previously by using the command breakpoint list.

(lldb) breakpoint list
Current breakpoints:
1: name = 'hypotenuseLength', locations = 1
  1.1: where = a.out.app`hypotenuseLength(double, double) + 18 at main.cpp:7, address = a.out.app[0x00000000004011d2], unresolved, hit count = 0

Notice here that, in addition to being told where the breakpoint is, we're also seeing how many times it's been "hit," which can help us track more complicated scenarios. Also, notice that our breakpoint is said to have one location; some breakpoints actually get triggered in more than one place, though we're unlikely to see that happen in this course.
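One situation in which a single breakpoint could plausibly have more than one location is a set of overloaded functions. In the hypothetical code below (the name area is made up for this example), my understanding is that setting a breakpoint by the name area would match both overloads, so breakpoint list would show two locations instead of one. Treat that as something worth trying for yourself rather than as a guarantee.

```cpp
// Two overloads that share the name "area" (hypothetical example).

// Area of a square with the given side length.
double area(double side)
{
    return side * side;
}

// Area of a rectangle with the given width and height.
double area(double width, double height)
{
    return width * height;
}
```

If you only wanted to pause in one of the two, setting the breakpoint by file and line number instead of by name would be one way to say so.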

You can also delete an existing breakpoint using the command breakpoint delete. Knowing its number is the key to choosing the right one. Deleting this breakpoint would be done like this.

(lldb) breakpoint delete 1
1 breakpoints deleted; 0 breakpoint locations disabled.

Of course, now we've lost our breakpoint, so let's put it back and then move on.

(lldb) b hypotenuseLength
Breakpoint 2: where = a.out.app`hypotenuseLength(double, double) + 18 at main.cpp:7, address = 0x00000000004011d2

Now that we've told LLDB where we want our program to be paused, we can start it running. Since the first thing our program does is read input from std::cin, we'll need to type it.

(lldb) run
Process 8481 launched: '/home/ics45c/projects/debugbreak/out/bin/a.out.app' (x86_64)
3 4
Process 8481 stopped
* thread #1, name = 'a.out.app', stop reason = breakpoint 2.1
    frame #0: 0x00000000004011d2 a.out.app`hypotenuseLength(side1Length=3, side2Length=4) at main.cpp:7
   4
   5    double hypotenuseLength(double side1Length, double side2Length)
   6    {
-> 7        double side1Squared = side1Length * side1Length;
   8        double side2Squared = side2Length + side2Length;
   9
   10       return std::sqrt(side1Squared + side2Squared);

As soon as we typed our input, we hit our breakpoint. (The "stop reason" is now listed as "breakpoint 2.1".) Our program has been paused and, at this point, all of the commands we learned about previously can be brought to bear.

(lldb) frame variable
(double) side1Length = 3
(double) side2Length = 4
(double) side1Squared = 4.1641961990551685E-184
(double) side2Squared = 4.9406564584124654E-324

Initially, we see that our two parameters, side1Length and side2Length, have the values 3 and 4, respectively. The two local variables, side1Squared and side2Squared, have much stranger-looking values, but that's because they've yet to be initialized; lines 7 and 8, which initialize them, haven't run yet. (When we break on line 7, we break before line 7 runs.)

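From here, it would be nice to inch our way through this code one line at a time. How we do that is to do something called stepping. There are three kinds of steps we might like to take, generally: stepping over the current line (thread step-over), which runs the entire line, including any functions it calls, and then pauses on the next line; stepping into a function called on the current line (thread step-in), which pauses on the first line of that function; and stepping out of the current function (thread step-out), which runs until the current function returns and then pauses in its caller.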

Which of these kinds of steps we want to take is purely a matter of what questions we want to be able to ask. In our case, there's no code in hypotenuseLength that we'd want to step into — we can feel pretty certain that our bug is not in std::sqrt — so our best bet here is thread step-over.

(lldb) thread step-over
Process 8481 stopped
* thread #1, name = 'a.out.app', stop reason = step over
    frame #0: 0x00000000004011e1 a.out.app`hypotenuseLength(side1Length=3, side2Length=4) at main.cpp:8
   5    double hypotenuseLength(double side1Length, double side2Length)
   6    {
   7        double side1Squared = side1Length * side1Length;
-> 8        double side2Squared = side2Length + side2Length;
   9
   10       return std::sqrt(side1Squared + side2Squared);
   11   }

Our program ran a little further and then paused again. (The "stop reason" is now listed as "step over".) Now we're on the next line of our code — line 8 instead of line 7. And, of course, our variables' values will have changed.

(lldb) frame variable
(double) side1Length = 3
(double) side2Length = 4
(double) side1Squared = 9
(double) side2Squared = 4.9406564584124654E-324

The side1Squared variable now has a value, and it looks to be the right value: the square of side1Length. We can step over one more time to see the effect of the next line.

(lldb) thread step-over
Process 8481 stopped
* thread #1, name = 'a.out.app', stop reason = step over
    frame #0: 0x00000000004011f0 a.out.app`hypotenuseLength(side1Length=3, side2Length=4) at main.cpp:10
   7        double side1Squared = side1Length * side1Length;
   8        double side2Squared = side2Length + side2Length;
   9
-> 10       return std::sqrt(side1Squared + side2Squared);
   11   }
   12
   13
(lldb) frame variable
(double) side1Length = 3
(double) side2Length = 4
(double) side1Squared = 9
(double) side2Squared = 8

At this point, we can look at the value of the variable side2Squared and see that it's 8, but that the value of side2Length is 4. Shouldn't side2Squared be 16 (i.e., 4 times 4)? (This, by the way, is why simple but clear variable names can help. It's easier to see that side2Squared is wrong given its name; if the variable had been called s2, we'd have had a harder time seeing the logic error we've made.)

At that point, we look at how side2Squared got its value and our mistake becomes evident; we should have multiplied instead of added!

    double side2Squared = side2Length + side2Length;

So we've gotten what we needed from LLDB. We can let our program finish, just for fun, by issuing the command continue, which means that we want the program to pick up where it left off and keep running — either until it ends or hits another breakpoint.

(lldb) continue
Process 8481 resuming
4.12311
Process 8481 exited with status = 0 (0x00000000)
(lldb) quit

Note that when the program ends, we're given a "status". That status is the exit code returned by our main function. (In this case, that's 0, because our main function ended by saying return 0;.)

Note, also, that we still had to quit LLDB when our program ended, but this time we weren't warned that a program was still running. Once we said quit, that was it.
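With the mistake identified, a corrected version of hypotenuseLength might look like the sketch below. I've also taken the liberty of introducing a helper function called square, which isn't in the original program, so that there's a function of our own that would be worth stepping into.

```cpp
#include <cmath>

// Hypothetical helper, added so there is a function of our own to step into.
double square(double x)
{
    return x * x;
}

double hypotenuseLength(double side1Length, double side2Length)
{
    // With the program paused on either of the next two lines, the command
    // thread step-in (LLDB's command for stepping into a call) would enter
    // square(), while thread step-over would run the call to completion.
    double side1Squared = square(side1Length);
    double side2Squared = square(side2Length);

    return std::sqrt(side1Squared + side2Squared);
}
```

Given the inputs 3 and 4, this version produces 5, as our knowledge of mathematics says it should. (One caveat: stepping into square assumes the program was built without optimizations that would inline such a small function.)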

More details about LLDB

Once you understand some of the basic principles behind using LLDB, all that's left is learning what commands are available. Those commands are your palette of choices; when you want to know the answer to a question, you'll need to figure out which commands will give you what you're looking for. For example, watchpoints let you pause a program not when you reach a line of code, but when the value of a variable changes. You can evaluate more complex expressions, as opposed to just obtaining the values of variables. You can change the values of variables. You can bail out of a function and return your own chosen value from it, to see how that might change the things that happen downstream. There's more, too.

A good first step is to read through a list of commands that are available. A nicely put-together "cheat sheet" of commands is available at the following link.

You aren't likely to find that you'll need nearly all of them, but it's worth skimming through them to see what's available. Later, when you find yourself wanting to know how to ask a particular question, you might remember that you saw a command that does precisely what you want, and then you can go look it up on the cheat sheet. The more you use LLDB, the less often you'll find yourself needing to look things up.

(One thing you'll notice, too, is that a lot of the commands we've learned have shorthands. For example, thread backtrace can also be done by issuing the shorthand command bt.)
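To make the idea of watchpoints and changing variables a little more concrete, here is a small, hypothetical function (sumUpTo is a name I've made up) in which a local variable changes once per loop iteration. The comments mention two LLDB commands, watchpoint set variable and expression; this is only a sketch of where they might be useful, not a transcript of a real session.

```cpp
// Hypothetical function: 'total' changes once per loop iteration, which
// makes it a natural candidate for a watchpoint.
int sumUpTo(int n)
{
    int total = 0;

    // While paused inside this function, the LLDB command
    //     watchpoint set variable total
    // asks LLDB to pause whenever the value of total changes, and
    //     expression total = 100
    // is one way to overwrite the variable's value before continuing.
    for (int i = 1; i <= n; ++i)
    {
        total += i;
    }

    return total;
}
```

You'd first need a breakpoint somewhere inside sumUpTo, since a local variable like total can only be watched once the program is paused in a frame where it exists.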