Testing Software

In this lecture we will discuss testing in general, and then discuss how to
perform unit testing in Python (modules and classes are units). The standard
Python library supplies a module named unittest; it defines a class named
TestCase from which we can create subclasses to perform unit testing.

My driver module, which you have imported and used for testing your programs
with bsc.txt files, is a quick and dirty way to do unit tests. The actual
unittest class is more elegant, powerful, and comprehensive; but it is more
heavyweight and requires more work to write than the batch self-checks, when
testing simple code. There are even unit testing frameworks for testing GUIs.

------------------------------------------------------------------------------
Testing

Testing is the process of running software looking for errors (meaning
actively trying to make the program fail by testing it in many -even
unexpected- ways): failure of the program to produce correct output from some
correct input. Once testing shows the presence of a bug, debugging begins
(the process of fixing the errors found during testing).

Professional software testers acquire great skill and intuition at thinking
up "good" inputs on which to test programs. They are valued members of a
product team. For example, Microsoft employs about one tester for each
programmer. Sometimes these testers work in teams separate from the
programmers; at other times a tester will pair up with a programmer: when the
programmer finishes some part of the code, the tester begins testing it while
the programmer proceeds to the next part of the code. If the tester finds any
bugs, the programmer must fix them before continuing.

As you can imagine, programmers often dislike testers, because the latter are
always pointing out mistakes made by the programmers :( But, it is better to
have the mistake pointed out by a coworker than by your boss (or a customer).
No programmer wants to believe that his/her code contains errors; but they
all do contain errors.

Some would argue that the programmer, intimate with the code he/she has
written, is the best person to test it. But having a programmer test his/her
own code might be bad from a psychological point of view: he/she might not
test the code as rigorously, because he/she doesn't really want to find any
errors. Having a separate tester helps address this shortcoming. But even
this approach can cause problems: if a programmer knows an independent tester
will be examining his/her code after it is written, the programmer may become
lazy, writing code carelessly, knowing it is someone else's job to spot
problems. Thus, there is a real tangle of incentives when writing and testing
code.

How Microsoft produces software (an overview accessible to students in this
course) is discussed in a book written by Cusumano and Selby: "Microsoft
Secrets: How the World's Most Powerful Software Company Creates Technology,
Shapes Markets, and Manages People", Free Press, 1995.

In Agile programming methods (which include Extreme Programming, which
includes Pair Programming), programming is test-driven: BEFORE doing any
coding, a programmer or tester develops an extensive suite of tests that the
code must pass. So, the tests are based on the specification of the code to
be written, not the code itself. Only then is the code written, and the
programmer's progress is judged by the number of tests in the suite that the
code passes. Whenever the code is modified, it must re-pass all these tests.
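For example, here is a small sketch of the test-first idea (median is a
hypothetical function, invented just for this illustration). Suppose the
specification says that median(alist) returns the middle value of an
odd-length list. We could write the following tests before writing any code
for median, and then judge our progress by how many of them pass.

  # Tests written from the specification alone: median does not exist yet.
  def test_median():
      assert median([1])             == 1
      assert median([3, 1, 2])       == 2   # order of values shouldn't matter
      assert median([5, 1, 9, 7, 3]) == 5
      print('all median tests passed')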
We will study unit testing below, which works for functions (in modules) and
class units. For this course, to save time, I have provided tests in the form
of batch self-check files; although you have missed something if you haven't
written your own tests (often just in the script; more on this topic below).

There are two general categories of testing. In black-box testing, testers
write test cases based only on the specifications for what the code is
supposed to accomplish; they are not allowed to look at the code itself. In
white-box testing (maybe it would be better to call it transparent-box
testing), testers write test cases based both on knowledge of the
specifications and the code itself: certain kinds of tests might suggest
themselves if the tester examines the code (say, based on the boolean tests
in if/while statements). Of course, black-box tests can be developed before
or while the code is written, but white-box tests can be developed only after
the code is written. One useful form of white-box testing ensures that the
tests "cover" (execute) every line of code: we can use "line profilers" to
find any lines of code executed 0 times and write tests to ensure they are
executed (a sketch using Python's standard trace module appears at the end
of this discussion).

Industry testers often write/use long scripts when they regression test
programs: each time a program is changed, the tester executes the same script
to ensure that no new bugs were introduced (the code must still work as it
always has). Then the script is extended for the new features being tested.
Much of the work in regression testing can be automated: often the result of
such tools is either a message confirming that all tests were passed, or a
list of outputs (and their inputs) that differed between the original program
and the one now being tested.

Finally, integration tests determine whether software components, written and
tested separately (in unit tests), work together correctly in a program. It
is much easier to test/debug each component by itself than in a system
comprising many components; in such systems, even simple bugs can manifest
themselves in hard-to-understand situations. Many features added to
programming languages at the end of the 1990s were designed to simplify
software integration.

A famous quote about testing, by the computer scientist Edsger Dijkstra:
"Testing shows the presence, not the absence of bugs." By this he means that
testing can show the presence of bugs (if the tests fail), but not the
absence of bugs: even if all the tests succeed, there can still be bugs in
the code, just not bugs caught by the tests. If I know exactly what testing
inputs you will use, I can write code that works exactly for those inputs
(and no others), so the code will pass all the tests (see below).
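Returning to white-box coverage for a moment, here is a sketch of the "cover
every line" idea using only Python's standard trace module (dedicated
coverage tools exist too); run_all_my_tests is a hypothetical function
standing in for whatever runs your test suite.

  import trace

  tracer = trace.Trace(count=True, trace=False)  # count lines; don't echo them
  tracer.run('run_all_my_tests()')               # execute the tests under the tracer
  results = tracer.results()
  results.write_results(show_missing=True, coverdir='coverage_output')
  # In the .cover files written to coverage_output, lines prefixed by >>>>>>
  # were never executed: each one needs a new test that reaches it.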
When I discuss debugging in ICS-31, I tell students:

1) Job #1 in debugging is finding the simplest input on which a program
   produces an error.
2) Job #2 in debugging is finding the LOCATION of the error.

At that point, it should be obvious what code is incorrect, and we hope it is
not too difficult to determine the correction. Sometimes the location of the
error is the line in Python that raises an exception (such a line is where
the error is manifest); other times the error appears earlier, but only
becomes apparent (we say the bug becomes "manifest") on that line. In still
other cases a program raises no exceptions but produces an incorrect result
(imagine an incorrect formula that adds instead of multiplies). These errors
are the hardest to debug: I suggest finding the "half-way" point in the
program and printing the intermediate results there (including data
structures) to check whether they are correct. If correct at the half-way
point, just debug the last half of the program; if incorrect at the half-way
point, first debug the first half of the program. Apply this approach
repeatedly/recursively: it is like using "binary searching" to debug a
program.

In Programming Assignment #1, I required you to write code that "traced" your
program, to illustrate how you can instrument your code to help you
understand what it does and help you find possible bugs. Students often avoid
doing this in the hope their program will run correctly the first time, and
thus save themselves the time needed to write instrumented code. If after 3
quarters you still think maybe your program will run correctly the first
time, change your major :)

----------
Correctness by Testing

Instructors in 45C complain that students entering this course don't know how
to think about/create test cases when writing their software. The blame
points at me: when classes got huge, I had to automate my grading tools,
which resulted in the batch self-check system. I provide a sequence of checks
that are large (but still imperfect). Although I tell students to do their
own testing in a script, and only when they have confidence that their
program is correct to use the bsc files to test it, I understand that
students often go straight to the bsc tests (which I think can delay their
debugging and certainly hurts their understanding of how to think about
writing test cases).

I should probably do more of what Alex does in ICS-32: provide only the most
rudimentary tests, and hide the actual tests I will use until I actually
grade the students' code. The downside is that students have to spend more
time (not just solving the problems, but also writing tests), and if they
write weak tests, they won't get good feedback about errors in their code,
and therefore won't spend time debugging it, and will get bad grades. Another
approach some instructors use is to hide the test cases but allow students to
run those tests blindly, with the system reporting back how many tests
failed, but not what those tests were. In this way the student knows his/her
code is incorrect (and at what level) without knowing the test cases on which
it fails.

To show you the weakness of testing (when students know the tests), imagine I
wrote the following tests for a student-written "sort" function (not using
Python's sort function to test the students' code).

  e-->sort([])-->[]
  e-->sort([4, 1, 2, 3])-->[1, 2, 3, 4]
  e-->sort([8, 5, 3, 1, 4])-->[1, 3, 4, 5, 8]

Knowing these three tests, a student could write his/her sort function as

  def sort(alist):
      if alist == []:
          return []
      if alist == [4, 1, 2, 3]:
          return [1, 2, 3, 4]
      if alist == [8, 5, 3, 1, 4]:
          return [1, 3, 4, 5, 8]

which obviously isn't a valid sort function, but passes all the tests! If the
function really tries to sort, these are reasonable tests; but the previous
function doesn't really try to sort: it is designed only to "get the right
answers for the tests". This is why I often change small string/int values in
the tests I actually run for grading, when it is easy to do so: e.g.,
substituting 'Anne' for 'Ann'. Maybe I could include a fourth test, in which
the order of the values in the list isn't predetermined (so the code cannot
check for special inputs).
Such a test is more difficult to write, requiring multiple lines:

  c-->x = [i for i in irange(1,100)]
  c-->random.shuffle(x)
  ==-->sort(x)-->[i for i in irange(1,100)]

To pass these tests, the student could change the sort function to be

  def sort(alist):
      if alist == []:
          return []
      if alist == [4, 1, 2, 3]:
          return [1, 2, 3, 4]
      if alist == [8, 5, 3, 1, 4]:
          return [1, 3, 4, 5, 8]
      else:
          return [i for i in irange(1,len(alist))]  # assumes list with values 1-N

and still pass all the tests. Probably the best test would use a special
function; the 1-line nature of bsc files would make such a function difficult
to write in a bsc file.

  def build_random_sorted(n):
      if n == 0:
          return []
      x = [random.random()]
      for _ in range(n-1):
          x.append(x[-1] + random.random())
      return x

This returns a list of non-decreasing random values: each is the previous
value plus a random amount, so the values never decrease. Calling
build_random_sorted(5) might return

  [0.5969099841860014, 1.3209321937435152, 1.6490822517985229,
   2.4046998993705424, 2.861823100498464]

Then I could write the batch self-check test

  c-->original = build_random_sorted(100)
  c-->shuffled = list(original)
  c-->random.shuffle(shuffled)
  ==-->sort(shuffled)-->original

which finally would be difficult to "spoof" in the ways shown above.
Basically, knowing all the tests to be used on code can encourage students
not to think about their code, and how it must work for all cases, resulting
in less learning by the student and code that may not work in various cases.
Of course, I must balance the time it takes to write your code with the extra
time it would take to come up with good tests, in a class that already
teaches a lot of material, and takes a lot of time to do assignments.

----------
------------------------------------------------------------------------------
The unittest class

To test software, we must write both the tests and the software. Typically a
programmer should understand the problem first, then write the tests based on
this understanding of the problem, and then write the code. Of course, the
programmer can also write the code first, but it is better if the programmer
can continually check the code he/she is writing against the suite of tests
he/she has written: he/she then knows how much progress is being made towards
passing all the tests (although the tests might still be insufficient).

For a first simple example, we will discuss testing a sort function. The
function won't care what it is sorting, so we will test it on lists of
integers. There are two specifications that sorting functions must satisfy:

1) Ordered     : the values in the list appear in non-decreasing order
2) Permutation : the sorted list has the same values as the original list

Why are both these specifications necessary? A function that puts 0s in all
positions in a list is ordered but not a permutation (so isn't sorting the
list). A function that shuffles the values in the list (swaps them randomly)
is a permutation but only rarely would it be ordered (so isn't sorting the
list). While this is a bit of overkill, here is a complete class that tests
the standard list.sort function. This is module sorting1.py in the download
for this lecture.
  import unittest

  class Sorting(unittest.TestCase):

      def setUp(self):
          self.original = [4, 1, 2, 5, 3]  # Could build randomly ordered list
          self.sorted   = list(self.original)
          list.sort(self.sorted)           # test whether this sort function works
                                           # same as self.sorted.sort()

      def test_order(self):
          self.assertTrue(self._is_ordered(), 'List is not in order')

      def test_permutation(self):
          self.assertCountEqual(self.original, self.sorted,
                                'List is not a permutation of the original')

      def _is_ordered(self):
          for i in range(len(self.sorted)-1):
              if self.sorted[i] > self.sorted[i+1]:
                  return False
          return True

Here is an overview of what is happening in this module. First, we import the
unittest module. Then we define the Sorting class, which is a class derived
from unittest.TestCase (a class in unittest). Sorting inherits many methods,
some of which (the assertXXX methods) we will discuss in more detail below.

The standard form of a typical unittest is a setUp method (we can omit this
method, but if it appears it must appear with exactly this name, in the
correct case -upper case "U", lower case everything else-: it overrides a
setUp method defined in TestCase that does nothing), followed by a series of
methods whose names start with "test" (test_order, test_permutation). There
are other special methods we can override, but we don't need to for this
simple example. This class also defines a helper method, _is_ordered, NOT
starting with the word "test".

To run the test that is this class, we will right-click this file (in the
text editor) and select "Run as" and then the "Python unit-test" option
(instead of "Python Run", which we have always chosen before). What Python
does in this case is call unittest.main() automatically. This function finds
all the methods in the class whose names start with "test" and calls those
methods; but first, before calling each method, it calls setUp. (The
Performance class operated similarly: it ran setup code untimed, and then it
timed the real code the specified number of times.) Tests can be
"destructive", because setUp is called before each test. So for this class
Python calls setUp and then runs test_order, and then runs setUp again and
calls test_permutation. It calls the methods (and reports their results) in
alphabetical order (it constructs a list of functions to run and then runs
them in sorted order).

The setUp method creates two attribute names: self.original, which is a
specific 5-list that is not ordered, and self.sorted, which is that same
list; then setUp calls list.sort on self.sorted to sort it: we are testing
this sorting function. We could specify self.original as any list of
comparable values, including creating a random list of values, even one using
the build_random_sorted function discussed above.

So, Python calls setUp and then the test_order method, which calls assertTrue
(a method inherited by Sorting, defined in unittest.TestCase), evaluating
whether the helper method self._is_ordered returns True: if so, this test
passes; if not, the test fails. We will see how failed tests are handled
soon. Then Python calls setUp again, and then the test_permutation method,
which calls assertCountEqual (a method inherited by Sorting, defined in
unittest.TestCase), evaluating whether its first argument has the same
values, appearing the same number of times (what a permutation means), as its
second argument.

At this point the results of the tests appear in the PU: PyUnit tab near the
Console tab (typically at the bottom of Eclipse). The console also shows some
less complete testing information.
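As an aside, Eclipse calls unittest.main() for us. If you run a test module
outside Eclipse -say, from the command line- you can trigger the same
behavior yourself with the standard main-module idiom, placed at the bottom
of the testing file:

  if __name__ == '__main__':
      # Finds every method whose name starts with "test" in this module's
      # TestCase subclasses, calls setUp before each one, and reports the
      # results in the console.
      unittest.main()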
Because the list.sort method is correct, both of these assertions are True.
There are two different ways a test can fail:

1) The code raises an unexpected exception when it shouldn't: see the red x.
2) The code fails a test (some assertion in a testing method): see the blue x.

IMPORTANT: in unittest, test your code with the self.assert... methods, not
just the assert statement we know. Note that if any test raises an unexpected
exception, Python marks the test as failing and moves on to the next test (it
doesn't terminate testing; the batch self-checks operate similarly). In this
way, regardless of exceptions, we can run all the tests independently.

Look at the picture in the unittest.pdf accompanying this lecture. The
heading Sorting1 shows the result of running the test described above. Here
is a key to this picture. All the information is displayed in the PU: PyUnit
tab. To the right of this tab are the following icons:

  Show        : toggle it to show all tests/only failed tests
  Rerun       : rerun all the tests
  Error rerun : rerun only the failed tests (more focus, less time; but what
                if a change causes an old test to fail?)
  Stop run    : stop running the current test (ignore the pencil icon)
  History     : examine recent test runs (restores the appearance at the end
                of that run)

The next line indicates that it has finished all tests: 2 tests out of 2; for
long tests, it will show the testing progress: 1/n, 2/n, ..., n/n. Next it
shows unexpected exceptions (red x): 0, and failed assertions (blue x): 0.
The green line is a progress bar, showing all testing is done: it is green
because all tests succeeded (it turns red if any failed). The next line shows
the total testing time (so fast here it records 0.00). For long tests, this
line will show which test is currently being performed; when testing is
finished it shows the total time. (Interesting side note: you can use this
little timer to perform performance tests on the sort function. You can also
import cProfile and profile the testing.)

Finally, there is a list of all the tests (sortable by any column): each line
is numbered, says whether that line's test was OK or failed, names the test
run, and indicates its file. (Using advanced functions in unittest, it is
possible to run tests in other files: not a topic we will cover.) Eclipse
uses the space to the right of this information to describe failed tests (see
below).

So that is unittest in a nutshell. If you replace line 8 (the call to
list.sort) by

  self.sorted = [1, 0]

and rerun the test, both the test_order and test_permutation methods will
fail (see the Sorting1 Failed picture in the .pdf). Or you can just
comment-out this line and only the test_order method will fail. So be
careful: if you specify the wrong answer in an assertion, the assertion fails
not because the code is incorrect, but because your test is incorrect.

Notice the 2 to the right of the blue x (failed tests) and the red progress
bar. In the list I have highlighted the second failed test
(test_permutation); on the right it shows the line whose assertion failed
(including the error message). It also tries to show the REASON for the
failure (based on assertCountEqual) by showing all the values whose counts
differed (not for 0 and 1, but for 2, 3, and 4).

Here is a table of the most useful assertions and what they test. A last
string argument can be added to each, which will be printed if there is a
failure.
Note that for assertTrue/assertFalse the REASON will just say what the
boolean was; but for assertEqual, if the values aren't equal, the REASON will
show both of the unequal values: generally a failed assert will try to show
all relevant information/values in the error message. These are the main
tools you have to check for correctness.

  Assertion                  | Test
  ---------------------------+----------------------------------------------
  assertTrue(x)              | bool(x) is True
  assertFalse(x)             | bool(x) is False
  assertEqual(a, b)          | a == b
  assertNotEqual(a, b)       | a != b
  assertCountEqual(a, b)     | a and b have the same elements and the same
                             |   number of each, regardless of their order
  assertIs(a, b)             | a is b
  assertIsNot(a, b)          | a is not b
  assertIsNone(x)            | x is None
  assertIsNotNone(x)         | x is not None
  assertIn(a, b)             | a in b
  assertNotIn(a, b)          | a not in b
  assertIsInstance(a, b)     | isinstance(a, b)
  assertNotIsInstance(a, b)  | not isinstance(a, b)
  assertMultiLineEqual(a, b) | a and b are strings, and are equal
  assertSequenceEqual(a, b)  | a and b are sequences, and are equal
  assertListEqual(a, b)      | a and b are lists, and are equal
  assertTupleEqual(a, b)     | a and b are tuples, and are equal
  assertSetEqual(a, b)       | a and b are sets/frozensets, and are equal
  assertDictEqual(a, b)      | a and b are dicts, and are equal

There is one assertion that deals with requiring an exception to be raised.
Calling assertRaises(exception,f,*args,**kargs) calls f(*args,**kargs) and
fails if it doesn't raise the required exception. For example, if f('a',b)
should raise the AssertionError exception, we would check it by
assertRaises(AssertionError,f,'a',b). Also related is
assertRaisesRegex(exception,re,f,*args,**kargs), which does the same thing,
but also checks the exception message against the regular expression re, and
fails if there is no match.

In addition, the following assertions work on regular expressions:

  assertRegex(s, re)         | the regular expression re matches somewhere
                             |   in s (a search, not a full match)
  assertNotRegex(s, re)      | re matches nowhere in s

Finally, these assertions deal with relational quantities:

  Assertion                  | Test
  ---------------------------+----------------------------------------------
  assertAlmostEqual(a, b)    | round(a-b, 7) == 0 (the same to the 7th decimal)
  assertNotAlmostEqual(a, b) | round(a-b, 7) != 0
  assertGreater(a, b)        | a > b
  assertGreaterEqual(a, b)   | a >= b
  assertLess(a, b)           | a < b
  assertLessEqual(a, b)      | a <= b

OK, that is a big laundry list, but here it is in one place.

Before going on to a bigger example: any print functions executed in a test
method appear to the right of the test when that method is selected in the
PyUnit tab, with either the heading ==ERRORS== or ==CAPTURED OUTPUT== (if
there are no errors). It is very useful to put such debugging-print
statements in failing tests, to help us further understand the nature of the
failure.

------------------------------------------------------------------------------
Enhanced Sorting Example (sorting2.py)

In the enhanced version, I wrote three other "sorting" methods that fail in
"interesting" ways. Notice the global name sorter, which is used in the
class, and is bound to the sorting function we want to test. The
test_large_scale method tests 100 random lists, each of size size_to_sort.
The test_order/test_permutation methods now include print statements: look at
the resulting output, compartmentalized for each test (whether it passes or
not). A sketch of this arrangement appears below.
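The actual sorting2.py is in the download for this lecture; here is a minimal
sketch of its overall shape, under the assumption that sorter sorts a list in
place (broken_sort is an invented stand-in for one of the three failing
sorts; the real file's details differ).

  import random
  import unittest

  def broken_sort(alist):
      alist[:] = [0] * len(alist)   # ordered, but not a permutation!

  sorter       = list.sort          # rebind to broken_sort to watch tests fail
  size_to_sort = 50

  class Sorting(unittest.TestCase):

      def test_large_scale(self):
          for _ in range(100):      # 100 random lists, each of size_to_sort
              original = [random.randrange(1000) for _ in range(size_to_sort)]
              result   = list(original)
              sorter(result)        # sort in place with the function under test
              self.assertCountEqual(original, result, 'not a permutation')
              self.assertEqual(result, sorted(original), 'not in order')

  if __name__ == '__main__':
      unittest.main()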
------------------------------------------------------------------------------
Larger example for priority queue

The courselib includes a class named PriorityQueue. You can read the
documentation for this class. To summarize here, we can put values in a
priority queue when it is constructed or by using the add method. The remove
method removes the highest/largest value, so values come out from highest to
lowest. The supporting methods are clear (which removes all values), peek
(which returns the current highest value but doesn't remove it), is_empty
(which returns a boolean: True if there are no values in the priority queue,
False if there is at least one), and size (which returns the number of values
in the priority queue).

The pq module has a test for each of these methods in the priority queue. The
logic is a bit complex (remember, bigger values come out first), but it gives
a more reasonable idea of how classes are tested (compared to just one
function for sorting); a small sketch in the same spirit appears at the end
of this lecture.

The unittest module has many more interesting and advanced functions: there
are many more sophisticated things we can do when testing classes. This
lecture is just an introduction to the topic, which is documented thoroughly
in Section 26.3 of the Python online library documentation.
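Finally, here is the promised sketch of testing a class like PriorityQueue
(assumptions: the module name priorityqueue and a constructor that accepts an
iterable of initial values; check the courselib documentation for the actual
details).

  import unittest
  from priorityqueue import PriorityQueue   # assumed courselib module name

  class PriorityQueueTests(unittest.TestCase):

      def setUp(self):
          # Assumes the constructor accepts an iterable of initial values.
          self.pq = PriorityQueue([3, 1, 4, 1, 5])

      def test_remove_returns_highest_first(self):
          removed = [self.pq.remove() for _ in range(self.pq.size())]
          self.assertEqual(removed, [5, 4, 3, 1, 1])
          self.assertTrue(self.pq.is_empty())

      def test_peek_does_not_remove(self):
          self.assertEqual(self.pq.peek(), 5)
          self.assertEqual(self.pq.size(), 5)

      def test_clear(self):
          self.pq.clear()
          self.assertTrue(self.pq.is_empty())
          self.assertEqual(self.pq.size(), 0)

      def test_add(self):
          self.pq.add(9)
          self.assertEqual(self.pq.peek(), 9)
          self.assertEqual(self.pq.size(), 6)

  if __name__ == '__main__':
      unittest.main()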