Testing Software

In this lecture we will discuss testing in general, and then discuss how to
perform unit testing in Python (modules and classes are units). The standard
Python library supplies a module named unittest; it defines a class named
TestCase from which we can create subclasses to perform unit testing.

My driver module, which you have imported and used for testing your programs
with bsc.txt files, is a quick and dirty way to do unit tests. The actual
unittest class is more elegant, powerful, and comprehensive; but it is more
heavyweight and requires more work to write than the batch self-checks, when
testing simple code. There are even unit testing frameworks for testing GUIs.

------------------------------------------------------------------------------
Testing

Testing is the process of running software looking for errors (meaning
actively trying to make the program fail by testing it in many -even
unexpected- ways): failure of the program to produce correct output from some
correct input. Once testing shows the presence of a bug, debugging begins
(the process of fixing the errors found during testing).

Professional software testers acquire great skill and intuition at thinking
up "good" inputs on which to test programs. They are valued members of a
product team. For example, Microsoft employs about one tester for each
programmer. Sometimes these testers work in teams separate from the
programmers; at other times a tester will pair up with a programmer: when the
programmer finishes some part of the code, the tester begins testing it while
the programmer proceeds to the next part of the code. If the tester finds any
bugs, the programmer must fix them before continuing.

As you can imagine, programmers often dislike testers, because the latter are
always pointing out mistakes made by the programmers :( But, it is better to
have the mistake pointed out by a coworker than by your boss (or a customer).
No programmer wants to believe that his/her code contains errors; but they
all do contain errors.

Some would argue that the programmer, intimate with the code he/she has
written, is the best person to test it. But having a programmer test his/her
own code might be bad from a psychological point of view: he/she might not
test the code as rigorously, because he/she doesn't really want to find any
errors. Having a separate tester helps address this shortcoming. But even
this approach can cause problems: if a programmer knows an independent tester
will be examining his/her code after it is written, the programmer may become
lazy, writing code carelessly, knowing it is someone else's job to spot
problems. Thus, there is a real tangle of incentives when writing and testing
code.

How Microsoft produces software (an overview accessible to students in this
course) is discussed in a book written by Cusumano and Selby: "Microsoft
Secrets: How the World's Most Powerful Software Company Creates Technology,
Shapes Markets, and Manages People", Free Press, 1995.

In Agile programming methods (which include Extreme Programming, which
includes Pair Programming), programming is test-driven: BEFORE doing any
coding, a programmer or tester develops an extensive suite of tests that the
code must pass. So, the tests are based on the specification of the code to
be written, not the code itself. Only then is the code written, and the
programmer's progress is judged by the number of tests in the suite that the
code passes. Whenever the code is modified, it must re-pass all these tests.
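For example, here is a small sketch of the test-first idea (median is a
hypothetical function, invented just for this illustration). Suppose the
specification says that median(alist) returns the middle value of an
odd-length list. We could write the following tests before writing any code
for median, and then judge our progress by how many of them pass.

  # Tests written from the specification alone: median does not exist yet.
  def test_median():
      assert median([1])             == 1
      assert median([3, 1, 2])       == 2   # order of values shouldn't matter
      assert median([5, 1, 9, 7, 3]) == 5
      print('all median tests passed')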
We will study unit testing below, which works for functions (in modules) and
class units. For this course, to save time, I have provided tests in the form
of batch self-check files; although you have missed something if you haven't
written your own tests (often just in the script; more on this topic below).

There are two general categories of testing. In black-box testing, testers
write test cases based only on the specifications for what the code is
supposed to accomplish; they are not allowed to look at the code itself. In
white-box testing (maybe it would be better to call it transparent-box
testing), testers write test cases based both on knowledge of the
specifications and the code itself: certain kinds of tests might suggest
themselves if the tester examines the code (say, based on the boolean tests
in if/while statements). Of course, black-box tests can be developed before
or while the code is written, but white-box tests can be developed only after
the code is written. One useful form of white-box testing ensures that the
tests "cover" (execute) every line of code: we can use "line profilers" to
find any lines of code executed 0 times and write tests to ensure they are
executed (a sketch using Python's standard trace module appears at the end
of this discussion).

Industry testers often write/use long scripts when they regression test
programs: each time a program is changed, the tester executes the same script
to ensure that no new bugs were introduced (the code must still work as it
always has). Then the script is extended for the new features being tested.
Much of the work in regression testing can be automated: often the result of
such tools is either a message confirming that all tests were passed, or a
list of outputs (and their inputs) that differed between the original program
and the one now being tested.

Finally, integration tests determine whether software components, written and
tested separately (in unit tests), work together correctly in a program. It
is much easier to test/debug each component by itself than in a system
comprising many components; in such systems, even simple bugs can manifest
themselves in hard-to-understand situations. Many features added to
programming languages at the end of the 1990s were designed to simplify
software integration.

A famous quote about testing, by the computer scientist Edsger Dijkstra:
"Testing shows the presence, not the absence of bugs." By this he means that
testing can show the presence of bugs (if the tests fail), but not the
absence of bugs: even if all the tests succeed, there can still be bugs in
the code, just not bugs caught by the tests. If I know exactly what testing
inputs you will use, I can write code that works exactly for those inputs
(and no others), so the code will pass all the tests (see below).
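Returning to white-box coverage for a moment, here is a sketch of the "cover
every line" idea using only Python's standard trace module (dedicated
coverage tools exist too); run_all_my_tests is a hypothetical function
standing in for whatever runs your test suite.

  import trace

  tracer = trace.Trace(count=True, trace=False)  # count lines; don't echo them
  tracer.run('run_all_my_tests()')               # execute the tests under the tracer
  results = tracer.results()
  results.write_results(show_missing=True, coverdir='coverage_output')
  # In the .cover files written to coverage_output, lines prefixed by >>>>>>
  # were never executed: each one needs a new test that reaches it.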
When I discuss debugging in ICS-31, I tell students:

1) Job #1 in debugging is finding the simplest input on which a program
   produces an error.
2) Job #2 in debugging is finding the LOCATION of the error.

At that point, it should be obvious what code is incorrect, and we hope it is
not too difficult to determine the correction. Sometimes the location of the
error is the line in Python that raises an exception (such a line is where
the error is manifest); other times the error appears earlier, but only
becomes apparent (we say the bug becomes "manifest") on that line. In still
other cases a program raises no exceptions but produces an incorrect result
(imagine an incorrect formula that adds instead of multiplies). These errors
are the hardest to debug: I suggest finding the "half-way" point in the
program and printing the intermediate results there (including data
structures) to check whether they are correct. If correct at the half-way
point, just debug the last half of the program; if incorrect at the half-way
point, first debug the first half of the program. Apply this approach
repeatedly/recursively: it is like using "binary searching" to debug a
program.

In Programming Assignment #1, I required you to write code that "traced" your
program, to illustrate how you can instrument your code to help you
understand what it does and help you find possible bugs. Students often avoid
doing this in the hope their program will run correctly the first time, and
thus save themselves the time needed to write instrumented code. If after 3
quarters you still think maybe your program will run correctly the first
time, change your major :)

----------
Correctness by Testing

Instructors in 45C complain that students entering this course don't know how
to think about/create test cases when writing their software. The blame
points at me: when classes got huge, I had to automate my grading tools,
which resulted in the batch self-check system. I provide a sequence of checks
that are large (but still imperfect). Although I tell students to do their
own testing in a script, and only when they have confidence that their
program is correct to use the bsc files to test it, I understand that
students often go straight to the bsc tests (which I think can delay their
debugging and certainly hurts their understanding of how to think about
writing test cases).

I should probably do more of what Alex does in ICS-32: provide only the most
rudimentary tests, and hide the actual tests I will use until I actually
grade the students' code. The downside is that students have to spend more
time (not just solving the problems, but also writing tests), and if they
write weak tests, they won't get good feedback about errors in their code,
and therefore won't spend time debugging it, and will get bad grades. Another
approach some instructors use is to hide the test cases but allow students to
run those tests blindly, with the system reporting back how many tests
failed, but not what those tests were. In this way the student knows his/her
code is incorrect (and at what level) without knowing the test cases on which
it fails.

To show you the weakness of testing (when students know the tests), imagine I
wrote the following tests for a student-written "sort" function (not using
Python's sort function to test the students' code).

  e-->sort([])-->[]
  e-->sort([4, 1, 2, 3])-->[1, 2, 3, 4]
  e-->sort([8, 5, 3, 1, 4])-->[1, 3, 4, 5, 8]

Knowing these three tests, a student could write his/her sort function as

  def sort(alist):
      if alist == []:
          return []
      if alist == [4, 1, 2, 3]:
          return [1, 2, 3, 4]
      if alist == [8, 5, 3, 1, 4]:
          return [1, 3, 4, 5, 8]

which obviously isn't a valid sort function, but passes all the tests! If the
function really tries to sort, these are reasonable tests; but the previous
function doesn't really try to sort: it is designed only to "get the right
answers for the tests". This is why I often change small string/int values in
the tests I actually run for grading, when it is easy to do so: e.g.,
substituting 'Anne' for 'Ann'. Maybe I could include a fourth test, in which
the order of the values in the list isn't predetermined (so the code cannot
check for special inputs).
Such a test is more difficult to write, requiring multiple lines:

  c-->x = [i for i in irange(1,100)]
  c-->random.shuffle(x)
  ==-->sort(x)-->[i for i in irange(1,100)]

To pass these tests, the student could change the sort function to be

  def sort(alist):
      if alist == []:
          return []
      if alist == [4, 1, 2, 3]:
          return [1, 2, 3, 4]
      if alist == [8, 5, 3, 1, 4]:
          return [1, 3, 4, 5, 8]
      else:
          return [i for i in irange(1,len(alist))]  # assumes list with values 1-N

and still pass all the tests. Probably the best test would use a special
function; the 1-line nature of bsc files would make such a function difficult
to write in a bsc file.

  def build_random_sorted(n):
      if n == 0:
          return []
      x = [random.random()]
      for _ in range(n-1):
          x.append(x[-1] + random.random())
      return x

This returns a list of non-decreasing random values: each is the previous
value plus a random amount, so the values never decrease. Calling
build_random_sorted(5) might return

  [0.5969099841860014, 1.3209321937435152, 1.6490822517985229,
   2.4046998993705424, 2.861823100498464]

Then I could write the batch self-check test

  c-->original = build_random_sorted(100)
  c-->shuffled = list(original)
  c-->random.shuffle(shuffled)
  ==-->sort(shuffled)-->original

which finally would be difficult to "spoof" in the ways shown above.
Basically, knowing all the tests to be used on code can encourage students
not to think about their code, and how it must work for all cases, resulting
in less learning by the student and code that may not work in various cases.
Of course, I must balance the time it takes to write your code with the extra
time it would take to come up with good tests, in a class that already
teaches a lot of material, and takes a lot of time to do assignments.

----------
------------------------------------------------------------------------------
The unittest class

To test software, we must write both the tests and the software. Typically a
programmer should understand the problem first, then write the tests based on
this understanding of the problem, and then write the code. Of course, the
programmer can also write the code first, but it is better if the programmer
can continually check the code he/she is writing against the suite of tests
he/she has written: he/she then knows how much progress is being made towards
passing all the tests (although the tests might still be insufficient).

For a first simple example, we will discuss testing a sort function. The
function won't care what it is sorting, so we will test it on lists of
integers. There are two specifications that sorting functions must satisfy:

1) Ordered     : the values in the list appear in non-decreasing order
2) Permutation : the sorted list has the same values as the original list

Why are both these specifications necessary? A function that puts 0s in all
positions in a list is ordered but not a permutation (so isn't sorting the
list). A function that shuffles the values in the list (swaps them randomly)
is a permutation but only rarely would it be ordered (so isn't sorting the
list). While this is a bit of overkill, here is a complete class that tests
the standard list.sort function. This is module sorting1.py in the download
for this lecture.
  import unittest

  class Sorting(unittest.TestCase):

      def setUp(self):
          self.original = [4, 1, 2, 5, 3]  # Could build randomly ordered list
          self.sorted   = list(self.original)
          list.sort(self.sorted)           # test whether this sort function works
                                           # same as self.sorted.sort()

      def test_order(self):
          self.assertTrue(self._is_ordered(), 'List is not in order')

      def test_permutation(self):
          self.assertCountEqual(self.original, self.sorted,
                                'List is not a permutation of the original')

      def _is_ordered(self):
          for i in range(len(self.sorted)-1):
              if self.sorted[i] > self.sorted[i+1]:
                  return False
          return True

Here is an overview of what is happening in this module. First, we import the
unittest module. Then we define the Sorting class, which is a class derived
from unittest.TestCase (a class in unittest). Sorting inherits many methods,
some of which (the assertXXX methods) we will discuss in more detail below.

The standard form of a typical unittest is a setUp method (we can omit this
method, but if it appears it must appear with exactly this name, in the
correct case -upper case "U", lower case everything else-: it overrides a
setUp method defined in TestCase that does nothing), followed by a series of
methods whose names start with "test" (test_order, test_permutation). There
are other special methods we can override, but we don't need to for this
simple example. This class also defines a helper method, _is_ordered, NOT
starting with the word "test".

To run the test that is this class, we will right-click this file (in the
text editor) and select "Run as" and then the "Python unit-test" option
(instead of "Python Run", which we have always chosen before). What Python
does in this case is call unittest.main() automatically. This function finds
all the methods in the class whose names start with "test" and calls those
methods; but first, before calling each method, it calls setUp. (The
Performance class operated similarly: it ran setup code untimed, and then it
timed the real code the specified number of times.) Tests can be
"destructive", because setUp is called before each test. So for this class
Python calls setUp and then runs test_order, and then runs setUp again and
calls test_permutation. It calls the methods (and reports their results) in
alphabetical order (it constructs a list of functions to run and then runs
them in sorted order).

The setUp method creates two attribute names: self.original, which is a
specific 5-list that is not ordered, and self.sorted, which is that same
list; then setUp calls list.sort on self.sorted to sort it: we are testing
this sorting function. We could specify self.original as any list of
comparable values, including creating a random list of values, even one using
the build_random_sorted function discussed above.

So, Python calls setUp and then the test_order method, which calls assertTrue
(a method inherited by Sorting, defined in unittest.TestCase), evaluating
whether the helper method self._is_ordered returns True: if so, this test
passes; if not, the test fails. We will see how failed tests are handled
soon. Then Python calls setUp again, and then the test_permutation method,
which calls assertCountEqual (a method inherited by Sorting, defined in
unittest.TestCase), evaluating whether its first argument has the same
values, appearing the same number of times (what a permutation means), as its
second argument.

At this point the results of the tests appear in the PU: PyUnit tab near the
Console tab (typically at the bottom of Eclipse). The console also shows some
less complete testing information.
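As an aside, Eclipse calls unittest.main() for us. If you run a test module
outside Eclipse -say, from the command line- you can trigger the same
behavior yourself with the standard main-module idiom, placed at the bottom
of the testing file:

  if __name__ == '__main__':
      # Finds every method whose name starts with "test" in this module's
      # TestCase subclasses, calls setUp before each one, and reports the
      # results in the console.
      unittest.main()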
Because the list.sort method is correct, both of these assertions are True.
There are two different ways a test can fail:

1) The code raises an unexpected exception when it shouldn't: see the red x.
2) The code fails a test (some assertion in a testing method): see the blue x.

IMPORTANT: in unittest, test your code with the self.assert... methods, not
just the assert statement we know. Note that if any test raises an unexpected
exception, Python marks the test as failing and moves on to the next test (it
doesn't terminate testing; the batch self-checks operate similarly). In this
way, regardless of exceptions, we can run all the tests independently.

Look at the picture in the unittest.pdf accompanying this lecture. The
heading Sorting1 shows the result of running the test described above. Here
is a key to this picture. All the information is displayed in the PU: PyUnit
tab. To the right of this tab are the following icons:

  Show        : toggle it to show all tests/only failed tests
  Rerun       : rerun all the tests
  Error rerun : rerun only the failed tests (more focus, less time; but what
                if a change causes an old test to fail?)
  Stop run    : stop running the current test (ignore the pencil icon)
  History     : examine recent test runs (restores the appearance at the end
                of that run)

The next line indicates that it has finished all tests: 2 tests out of 2; for
long tests, it will show the testing progress: 1/n, 2/n, ..., n/n. Next it
shows unexpected exceptions (red x): 0, and failed assertions (blue x): 0.
The green line is a progress bar, showing all testing is done: it is green
because all tests succeeded (it turns red if any failed). The next line shows
the total testing time (so fast here it records 0.00). For long tests, this
line will show which test is currently being performed; when testing is
finished it shows the total time. (Interesting side note: you can use this
little timer to perform performance tests on the sort function. You can also
import cProfile and profile the testing.)

Finally, there is a list of all the tests (sortable by any column): each line
is numbered, says whether that line's test was OK or failed, names the test
run, and indicates its file. (Using advanced functions in unittest, it is
possible to run tests in other files: not a topic we will cover.) Eclipse
uses the space to the right of this information to describe failed tests (see
below).

So that is unittest in a nutshell. If you replace line 8 (the call to
list.sort) by

  self.sorted = [1, 0]

and rerun the test, both the test_order and test_permutation methods will
fail (see the Sorting1 Failed picture in the .pdf). Or you can just
comment-out this line and only the test_order method will fail. So be
careful: if you specify the wrong answer in an assertion, the assertion fails
not because the code is incorrect, but because your test is incorrect.

Notice the 2 to the right of the blue x (failed tests) and the red progress
bar. In the list I have highlighted the second failed test
(test_permutation); on the right it shows the line whose assertion failed
(including the error message). It also tries to show the REASON for the
failure (based on assertCountEqual) by showing all the values whose counts
differed (not for 0 and 1, but for 2, 3, and 4).

Here is a table of the most useful assertions and what they test. A last
string argument can be added to each, which will be printed if there is a
failure.
Note that for assertTrue/assertFalse the REASON will just say what the
boolean was; but for assertEqual, if the values aren't equal, the REASON will
show both of the unequal values: generally a failed assert will try to show
all relevant information/values in the error message. These are the main
tools you have to check for correctness.

  Assertion                  | Test
  ---------------------------+----------------------------------------------
  assertTrue(x)              | bool(x) is True
  assertFalse(x)             | bool(x) is False
  assertEqual(a, b)          | a == b
  assertNotEqual(a, b)       | a != b
  assertCountEqual(a, b)     | a and b have the same elements and the same
                             |   number of each, regardless of their order
  assertIs(a, b)             | a is b
  assertIsNot(a, b)          | a is not b
  assertIsNone(x)            | x is None
  assertIsNotNone(x)         | x is not None
  assertIn(a, b)             | a in b
  assertNotIn(a, b)          | a not in b
  assertIsInstance(a, b)     | isinstance(a, b)
  assertNotIsInstance(a, b)  | not isinstance(a, b)
  assertMultiLineEqual(a, b) | a and b are strings, and are equal
  assertSequenceEqual(a, b)  | a and b are sequences, and are equal
  assertListEqual(a, b)      | a and b are lists, and are equal
  assertTupleEqual(a, b)     | a and b are tuples, and are equal
  assertSetEqual(a, b)       | a and b are sets/frozensets, and are equal
  assertDictEqual(a, b)      | a and b are dicts, and are equal

There is one assertion that deals with requiring an exception to be raised.
Calling assertRaises(exception,f,*args,**kargs) calls f(*args,**kargs) and
fails if it doesn't raise the required exception. For example, if f('a',b)
should raise the AssertionError exception, we would check it by
assertRaises(AssertionError,f,'a',b). Also related is
assertRaisesRegex(exception,re,f,*args,**kargs), which does the same thing,
but also checks the exception message against the regular expression re, and
fails if there is no match.

In addition, the following assertions work on regular expressions:

  assertRegex(s, re)         | the regular expression re matches somewhere
                             |   in s (a search, not a full match)
  assertNotRegex(s, re)      | re matches nowhere in s

Finally, these assertions deal with relational quantities:

  Assertion                  | Test
  ---------------------------+----------------------------------------------
  assertAlmostEqual(a, b)    | round(a-b, 7) == 0 (the same to the 7th decimal)
  assertNotAlmostEqual(a, b) | round(a-b, 7) != 0
  assertGreater(a, b)        | a > b
  assertGreaterEqual(a, b)   | a >= b
  assertLess(a, b)           | a < b
  assertLessEqual(a, b)      | a <= b

OK, that is a big laundry list, but here it is in one place.

Before going on to a bigger example: any print functions executed in a test
method appear to the right of the test when that method is selected in the
PyUnit tab, with either the heading ==ERRORS== or ==CAPTURED OUTPUT== (if
there are no errors). It is very useful to put such debugging-print
statements in failing tests, to help us further understand the nature of the
failure.

------------------------------------------------------------------------------
Enhanced Sorting Example (sorting2.py)

In the enhanced version, I wrote three other "sorting" methods that fail in
"interesting" ways. Notice the global name sorter, which is used in the
class, and is bound to the sorting function we want to test. The
test_large_scale method tests 100 random lists, each of size size_to_sort.
The test_order/test_permutation methods now include print statements: look at
the resulting output, compartmentalized for each test (whether it passes or
not). A sketch of this arrangement appears below.
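The actual sorting2.py is in the download for this lecture; here is a minimal
sketch of its overall shape, under the assumption that sorter sorts a list in
place (broken_sort is an invented stand-in for one of the three failing
sorts; the real file's details differ).

  import random
  import unittest

  def broken_sort(alist):
      alist[:] = [0] * len(alist)   # ordered, but not a permutation!

  sorter       = list.sort          # rebind to broken_sort to watch tests fail
  size_to_sort = 50

  class Sorting(unittest.TestCase):

      def test_large_scale(self):
          for _ in range(100):      # 100 random lists, each of size_to_sort
              original = [random.randrange(1000) for _ in range(size_to_sort)]
              result   = list(original)
              sorter(result)        # sort in place with the function under test
              self.assertCountEqual(original, result, 'not a permutation')
              self.assertEqual(result, sorted(original), 'not in order')

  if __name__ == '__main__':
      unittest.main()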
------------------------------------------------------------------------------
Larger example for priority queue

The courselib includes a class named PriorityQueue. You can read the
documentation for this class. To summarize here, we can put values in a
priority queue when it is constructed or by using the add method. The remove
method removes the highest/largest value, so values come out from highest to
lowest. The supporting methods are clear (which removes all values), peek
(which returns the current highest value but doesn't remove it), is_empty
(which returns a boolean: True if there are no values in the priority queue,
False if there is at least one), and size (which returns the number of values
in the priority queue).

The pq module has a test for each of these methods in the priority queue. The
logic is a bit complex (remember, bigger values come out first), but it gives
a more reasonable idea of how classes are tested (compared to just one
function for sorting); a small sketch in the same spirit appears at the end
of this lecture.

The unittest module has many more interesting and advanced functions: there
are many more sophisticated things we can do when testing classes. This
lecture is just an introduction to the topic, which is documented thoroughly
in Section 26.3 of the Python online library documentation.
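Finally, here is the promised sketch of testing a class like PriorityQueue
(assumptions: the module name priorityqueue and a constructor that accepts an
iterable of initial values; check the courselib documentation for the actual
details).

  import unittest
  from priorityqueue import PriorityQueue   # assumed courselib module name

  class PriorityQueueTests(unittest.TestCase):

      def setUp(self):
          # Assumes the constructor accepts an iterable of initial values.
          self.pq = PriorityQueue([3, 1, 4, 1, 5])

      def test_remove_returns_highest_first(self):
          removed = [self.pq.remove() for _ in range(self.pq.size())]
          self.assertEqual(removed, [5, 4, 3, 1, 1])
          self.assertTrue(self.pq.is_empty())

      def test_peek_does_not_remove(self):
          self.assertEqual(self.pq.peek(), 5)
          self.assertEqual(self.pq.size(), 5)

      def test_clear(self):
          self.pq.clear()
          self.assertTrue(self.pq.is_empty())
          self.assertEqual(self.pq.size(), 0)

      def test_add(self):
          self.pq.add(9)
          self.assertEqual(self.pq.peek(), 9)
          self.assertEqual(self.pq.size(), 6)

  if __name__ == '__main__':
      unittest.main()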