ICS 32A Fall 2023
Notes and Examples: Testing


On the need for testing

When we write a program, our central goal is for the program to meet its requirements. Stated simply, a program's requirements are what we want the program to do: what problem we want it to solve, what outputs we want it to yield given various inputs, and so on. As you've worked through Project 1, you'll already have seen how complex a set of requirements can be, even for a program that might seem much simpler than a lot of the ones you use every day. Because Project 1 was designed in a way that would allow us to grade its correctness automatically, it was necessary to codify, in explicit detail, the various inputs the program could accept, as well as the specific formatting of the outputs it should generate in response, even in cases where the input was erroneous, the files and directories being searched couldn't be accessed, and so on. Being able to accept and tame the complexity of programs of non-trivial size is something that separates experienced, professional programmers from early-stage hobbyists and students; this is one of the important underlying themes of a lot of what you'll learn in this course (and in the next few years).

How do we know that our programs meet their requirements?

If our main goal is writing a program that meets its requirements, then we're led to the next obvious question: How do we know that our programs have actually met them? At what point do we conclude that our program is done? Thinking at a smaller scale, at what point do we conclude that a function is done?

While mathematics provides us with proof techniques that might be leveraged to demonstrate completeness and correctness in an absolute sense, the popularity of these kinds of techniques in real-world software development is fairly limited. However, even in the absence of these kinds of formalisms, we can get a long way by testing our program to see how it handles various inputs and whether it generates the correct outputs for them. To be clear, testing is also a somewhat formal activity, rather than one that's approached in a cavalier manner. When testing, we want to apply a methodology similar to the scientific method: before each test, we form a hypothesis about what the program should do; we then run an experiment (the test itself) and observe what actually happens; finally, we compare the actual outcome to the one we hypothesized, investigating further whenever the two differ.

So, what tests should we run? How many should we run? How do we know whether we've done enough?


Test cases

Suppose we wanted to write a function that takes two parameters, a list and an arbitrary "search" value, then returns a list that contains everything in the given list except the search value. A first attempt at implementing the function might look like this.


def remove_from(the_list: list, value) -> list:
    new_list = []

    for element in the_list:
        if element != value:
            new_list.append(element)

    return new_list

But, of course, the key question here is whether our function is complete and correct. Let's use testing to answer that question.


>>> x = [1, 3, 5, 7, 9, 11, 13]
>>> remove_from(x, 3)
    [1, 5, 7, 9, 11, 13]
>>> x
    [1, 3, 5, 7, 9, 11, 13]
>>> remove_from(x, 7)
    [1, 3, 5, 9, 11, 13]
>>> x
    [1, 3, 5, 7, 9, 11, 13]

If you give it a value, it returns a new list with that value removed, but with the original list untouched. So far, so good. What else would you want to verify about it? In general, what you'd want to do is develop a collection of test cases. A test case is a complete scenario that you want to verify: What you would do to set it up, what inputs you would give to remove_from, and what results you would expect afterward. Note that all of these parts are important; it's not a test case unless we know what the expected outcome is!

So, in the case of the remove_from function, what test cases do we need? We've already verified a couple of the obvious ones: removing a value from the middle of a list, and checking that the original list is left unmodified afterward. We'd also want to test removing the first and last elements of a list, since the boundaries of a sequence are a common place for bugs to hide.

But the key in deciding what tests you need is to think carefully. What are the things that might go wrong? What are the aspects of the function's behavior that you haven't thought through carefully enough? Thinking along those lines leads to some more ideas.

Are we done? Almost, but there are a couple of other things we haven't thought of. What if the value we're trying to remove isn't in the list at all? Our current implementation will simply return a list that's equivalent to the one we started with:


>>> x = [1, 3, 5, 7]
>>> remove_from(x, 8)
    [1, 3, 5, 7]

But is that what we want? (One of the positive things that testing does is make us think carefully about situations we hadn't considered yet.) Let's suppose that we instead want the function to raise an exception in this case (i.e., it's an error to remove things that aren't there already). First, we'd need to update our function's implementation accordingly.


def remove_from(the_list: list, value) -> list:
    new_list = []
    found = False

    for element in the_list:
        if element != value:
            new_list.append(element)
        else:
            found = True

    if not found:
        raise ValueError('value not found in list')

    return new_list

Now that we've updated our function, we have a little more work to do. The tests we ran previously might no longer pass; we might have made a change that invalidated one of them. So, we'd need to run those again, to make sure things are still the way we left them. (If that sounds like a task that would best be automated, you're right; we'll come back to that idea shortly, then return to it in more depth, with more full-featured tools, later in this course.)

Then, we'd add some additional test cases to verify the new behavior; let's make sure that it's an error to attempt to remove things that aren't in the list. There are a couple of interesting variants of that idea.
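The two variants worth checking are a non-empty list that doesn't contain the value, and an empty list (which can't contain anything). A sketch of how we might exercise both, with the updated function repeated here only so the example is self-contained:

```python
def remove_from(the_list: list, value) -> list:
    new_list = []
    found = False

    for element in the_list:
        if element != value:
            new_list.append(element)
        else:
            found = True

    if not found:
        raise ValueError('value not found in list')

    return new_list


# Variant 1: the value is absent from a non-empty list.
try:
    remove_from([1, 3, 5, 7], 8)
    print('no exception raised')
except ValueError:
    print('ValueError raised')   # this is the outcome we expect

# Variant 2: removing anything from an empty list.
try:
    remove_from([], 1)
    print('no exception raised')
except ValueError:
    print('ValueError raised')   # expected here, too
```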

Finally, there's one more thing to consider. What if the value we're trying to remove is in the list more than once?


>>> x = [1, 3, 5, 7, 1, 3, 5, 7, 1, 3, 5, 7]
>>> remove_from(x, 5)
    [1, 3, 7, 1, 3, 7, 1, 3, 7]

Our current implementation removes all of the values that match the search value. (We might not have thought about that one way or another; again, testing reveals questions that we need to answer about our own design.) Suppose that we're happy with that choice; if so, we'd verify it with a couple of additional test cases.
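Those additional test cases might look something like the sketch below, written using Python's assert statement (which is covered later in these notes); the definition of remove_from is repeated only so the example is self-contained.

```python
def remove_from(the_list: list, value) -> list:
    new_list = []
    found = False

    for element in the_list:
        if element != value:
            new_list.append(element)
        else:
            found = True

    if not found:
        raise ValueError('value not found in list')

    return new_list


# Duplicates: every matching occurrence is removed, not just the first.
assert remove_from([1, 3, 5, 7, 1, 3, 5, 7], 5) == [1, 3, 7, 1, 3, 7]

# A list made up entirely of the search value becomes empty.
assert remove_from([5, 5, 5], 5) == []
```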

Additionally, now that we realize that our function removes all occurrences of the search value, maybe that suggests that our name should be more specific. The name remove_from doesn't make clear whether we remove duplicate values, but the name remove_all would do a better job of conveying that. (Another thing that testing does is make us consider the usefulness of our designs, because we have to use them in order to test them.) So, we'll update the name of our function accordingly.


def remove_all(the_list: list, value) -> list:
    new_list = []
    found = False

    for element in the_list:
        if element != value:
            new_list.append(element)
        else:
            found = True

    if not found:
        raise ValueError('value not found in list')

    return new_list

All in all, what seems like a pretty simple function — remove a value from a list — requires eight different tests before we feel comfortable that it's working properly. Each of the tests is simple and straightforward, but the combination of those tests is powerful: It covers essentially all of the differing ways that the function needs to behave.

Categories of test cases

As we've seen, our test cases tend to fall into a few broad categories: typical cases (such as removing a value from the middle of a list), boundary cases (such as removing the first or last element, or operating on very short lists), and error cases (such as values that aren't present at all). Categorizing our test cases this way helps us to think about which ones we might still need.

How many test cases are necessary?

It's important to realize that testing is an exercise in quality rather than quantity. What we're trying to do is cover the spectrum of interesting possibilities, which means that multiple tests that are based around the same idea are worth less than multiple tests that differ in some fundamental way from each other.


The assert statement in Python

Now that we know how to think carefully about what test cases we need, the next question is how we automate them. It would be better if we could write these down in a way that would make them easy to run automatically. That way, after every change to our function, we could re-run all of our tests to make sure it still behaves the way we expected — and, if our expectations have changed, we'll be aware of it, and might need to adjust our tests accordingly.

A simple tool for that kind of test automation is Python's assert statement. In its simplest form, the assert statement contains one expression, which is evaluated for its truthiness. If the expression is truthy, nothing happens; if the expression is falsy, an exception called an AssertionError is raised instead.


>>> assert 10 > 8
>>> assert 10 < 8
    Traceback (most recent call last):
      File "<pyshell#1>", line 1, in <module>
        assert 10 < 8
    AssertionError

Additionally, an assert statement can be given a second "parameter" of sorts, which is an error message that will be displayed if it fails.


>>> assert 10 < 8, 'because arithmetic is weird sometimes'
    Traceback (most recent call last):
      File "<pyshell#1>", line 1, in <module>
        assert 10 < 8, 'because arithmetic is weird sometimes'
    AssertionError: because arithmetic is weird sometimes

To "assert" something means "to state a fact or belief." An assert statement in a Python program isn't much different; its job is to let us state something that we believe to be true in the context of our program, with that belief being held so strongly that we want the program to fail if we're wrong about it.

And, indeed, that makes for a nice way to automate the testing of our functions. If we include assert statements after the functions we write in Python scripts, which state our beliefs about how those functions are supposed to behave, then two good things happen:

  1. Those beliefs are documented, which is to say that human readers of our program will be able to use them to better understand the program's meaning.
  2. Those beliefs will be evaluated, because those assert statements will execute when our Python script does, and if any of them fail, the script will fail with an error message before it does anything else — and we can be the ones to write the error message, too.

The code

A complete implementation of the remove_all function, along with its tests (implemented using the assert statement), is below.
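A sketch of what that script might look like follows; the particular set of tests chosen here is one reasonable choice, not the only possible one.

```python
# remove_all.py
# The remove_all function, followed by assert-based tests that document
# and verify our beliefs about how it's supposed to behave.

def remove_all(the_list: list, value) -> list:
    new_list = []
    found = False

    for element in the_list:
        if element != value:
            new_list.append(element)
        else:
            found = True

    if not found:
        raise ValueError('value not found in list')

    return new_list


# A small helper for the error cases: did the call raise a ValueError?
def raises_value_error(the_list: list, value) -> bool:
    try:
        remove_all(the_list, value)
        return False
    except ValueError:
        return True


# Typical cases: the value appears once, in various positions.
assert remove_all([1, 3, 5, 7], 1) == [3, 5, 7], 'failed to remove first element'
assert remove_all([1, 3, 5, 7], 5) == [1, 3, 7], 'failed to remove middle element'
assert remove_all([1, 3, 5, 7], 7) == [1, 3, 5], 'failed to remove last element'

# The original list should be left unmodified.
original = [1, 3, 5, 7]
remove_all(original, 3)
assert original == [1, 3, 5, 7], 'original list was modified'

# Duplicates: all occurrences are removed.
assert remove_all([1, 3, 5, 3, 1], 3) == [1, 5, 1], 'failed to remove all occurrences'
assert remove_all([5, 5, 5], 5) == [], 'failed to empty the list'

# Error cases: removing a missing value raises a ValueError.
assert raises_value_error([1, 3, 5, 7], 8), 'missing value should raise ValueError'
assert raises_value_error([], 1), 'empty list should raise ValueError'
```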

Of note is the fact that it required more code to test the remove_all function than it did to write it. Ratios of as much as 2:1 between testing code and actual code are not uncommon, though, as we'll see later in this course (and in future coursework), the way we design our programs can have a profound impact on how difficult they are to test automatically.