Empirical Analysis of Algorithms

In the two previous lectures we learned about complexity classes and how to
analyze algorithms (mostly Python statements/functions) to find their
complexity classes. These analyses were mostly done by just looking at code
(aka "static analysis") and were independent of any technology: i.e.,
independent of which Python interpreters we used and the speed of the
computers running our code.

We also learned that given the complexity class of a runnable Python function,
we can approximate its running time by T(N) = c*complexity_class(N), where
complexity_class(N) is its complexity class: e.g., N, N Log N, N**2, etc. We
can then run this function on a reasonably-sized problem (with N not too
small, so the discarded lower-order terms are small enough to really ignore)
and measure the amount of time it takes (T). Finally, we can solve for the
constant in the time equation: c = T(N)/complexity_class(N), by plugging in
the measured T(N) and complexity_class(N). Then, for large N, we can use the
formula T(N) = c*complexity_class(N) with the computed c to approximate the
amount of time this function requires to solve a problem of any large size N.
Such analysis (by running code) is called "dynamic analysis" or "empirical
analysis". We also saw that we could approximate the complexity class itself
by running the code on input sizes that double and then plotting the results,
looking for a match against standard doubling signatures.

In the first part of this lecture we will examine how to time functions on the
computer (rather than using an external timer) and we will write some Python
code to help automate this task. Given such tools, and the ability to chart
the time required for various-sized problems, we can also use this data to
infer the complexity class of a function without ever seeing its code. Yet we
can still develop a formula T(N) to approximate the amount of time this
function requires to solve a problem of any large size N, without even looking
at the code. Generally in this lecture, we will explore using the computer
(dynamic/empirical analysis) to better help us understand the behavior of
algorithms that might be too complex for us to understand by only static
analysis (we might not even have the algorithm in the form of source code, so
we cannot examine it, and can only run it).

In the second section, we will switch scale and use a profiling module/tool
named cProfile (and its associated module pstats) to run programs and
determine which functions are consuming the most time. Once we know this, we
will attempt to improve the performance of the program by optimizing only
those functions: those that are taking significant time.

Finally, in the third section we will explore how Python uses HASHING in sets,
frozen sets, and dicts to achieve a constant time complexity class, O(1), for
many operations. We will close the loop by using dynamic analysis to verify
this O(1) complexity class. Hashing is more closely studied in ICS-45C and
especially in ICS-46.
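To make this workflow concrete before we build real tools, here is a minimal
sketch (not the Performance tool developed later in this lecture) that times
one call, solves for c assuming an O(N Log N) complexity class, and then
predicts the time for a larger N; the function timed and the sizes used are
just illustrative choices.

import math, random, time

def estimate_constant(n):
    # Time sorting a random list of n values, then solve
    # T(N) = c * N log2 N for the constant c.
    alist = [random.random() for _ in range(n)]
    start = time.perf_counter()
    alist.sort()
    elapsed = time.perf_counter() - start
    return elapsed / (n * math.log2(n))

c = estimate_constant(1_000_000)
big_n = 8_000_000
print('predicted sort time for', big_n, 'values =',
      c * big_n * math.log2(big_n), 'seconds')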
------------------------------------------------------------------------------
Understanding Sorting and N Log N Algorithms:

We previously discussed that the fastest algorithm for sorting a list is in
the complexity class O(N Log N). This is true not only for the actual
algorithm used in the Python sort method (operating on lists), but for all
possible algorithms. In this section, without using this knowledge, we will
time the sorting function and infer its complexity class. In the process we
will build a sophisticated timing tool for timing sorting and other algorithms
(and use it a few times later in this lecture).

Here are Python statements that generate a list of a million integers, shuffle
them into a random order, and then sort them.

alist = [i for i in range(1_000_000)]  # Python >= 3.6 allows _ to clarify numbers
random.shuffle(alist)
alist.sort()

Let's first discuss how to time this code. There is a time module in Python
that supplies access to various system clocks and conversions among various
time formats. I have written a Stopwatch class (in the stopwatch.py module,
which is part of courselib) that makes clocks easier to use for timing code.
The four main methods for objects constructed from the Stopwatch class are
start, stop, read, and reset; each should be intuitive when operating a
physical/software stopwatch. You can read the complete documentation for this
class by following the "Course Library Reference" link and then clicking the
Stopwatch link. The code itself (if you are interested in reading it) is
accessible in Eclipse by disclosing python.exe in the "PyDev Package
Explorer", then disclosing "System Libs" and then "workspace/courselib", and
finally clicking on stopwatch.py to see the code. You can see all the
courselib modules using this mechanism.

Here is a complete script that uses a Stopwatch to time only the alist.sort()
part of the code above. Notice that it imports the gc (garbage collection)
module to turn off garbage collection during the timing, so this process will
not interfere with our measurements. We need to turn it back on afterwards.

----------
Objects, Memory, and Garbage (collection)

When we construct an object in Python, we allocate part of the computer's
memory to store the object's attributes/state. The following loop will
eventually consume all the memory Python can use (it runs for a few seconds on
my computer). It repeatedly doubles the length of a list: ultimately it raises
a MemoryError exception.

alist = [0]
for i in range(1_000_000):
    alist += alist
    print(i)

--->Update: For Python 3.7 this takes a very long time (~5 minutes) to raise a
--->MemoryError exception. I am trying to understand why, and find code that
--->will raise it faster. It pretty quickly prints the numbers 1-24 and then
--->more slowly prints numbers up to 32. It exhausts all of memory (99%) quickly
--->but seems to continue executing after that, with memory dipping every so
--->often, accompanied by a big increase in disk usage (you can watch the CPU
--->and Memory usage in the Task Manager). I conjecture that when Python runs
--->out of memory, it starts using the hard disk as additional storage, but
--->eventually it gives up on that approach, as memory use skyrockets (doubles).
--->Memory is reclaimed only after the process is terminated in Eclipse with the
--->red square.

"Garbage" is objects constructed by Python which can no longer be referred to.
If we wrote

x = [i for i in range(1_000_000_000)]

then x would refer to a list object that uses a huge amount of memory. If we
then wrote

x = 1

the list object that x used to refer to would become garbage (because x no
longer refers to this object and there are no other names that we can use to
reach this object). It is memory that is unreachable. If we had instead
written

x = [i for i in range(1_000_000_000)]
y = [0, 1, 2, x]
x = 1

now the object x formerly referred to can still be referred to by y[3], so we
can reach it from some name and therefore it is NOT GARBAGE.
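Here is a small sketch that lets us watch an object become garbage and be
reclaimed. It uses the weakref module (not covered in this course) to observe
an object without keeping it alive; the Blob class is just an illustrative
stand-in.

import gc, weakref

class Blob:                    # an arbitrary class whose instances we can track
    pass

x = Blob()
probe = weakref.ref(x)         # observes the Blob without keeping it alive
y = [0, 1, 2, x]               # a second path to the same object
x = 1                          # still reachable through y[3]: NOT garbage
print(probe() is not None)     # prints True
y = None                       # now nothing refers to the Blob: it is garbage
gc.collect()                   # ask for a collection (CPython typically has
                               #   already reclaimed it via reference counting)
print(probe() is None)         # prints True: the memory was reclaimed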
Garbage collection is a way for the computer to find/reclaim memory that is
garbage. Typically Python allocates memory for objects until it finds it has
no more memory to allocate; then it does garbage collection to try to find
more. If it succeeds, this process continues until it runs out of memory
again, and then repeats. If it ever cannot find enough free memory to allocate
an object, even after garbage collection (as in the first example), it raises
an exception. Finally, there are ways to tell Python how much of the
computer's memory it can use to allocate the objects it constructs. You will
learn more about garbage and garbage collection in ICS-45C (using C++, a
language that does not have automatic garbage collection) and ICS-46.

Better terms would be recyclables (not garbage) and recycling (not garbage
collection), because memory is never thrown away, but is continually recycled
when it can be reused.
----------

import random, gc
from stopwatch import Stopwatch

# setup
alist = [i for i in range(1_000_000)]
random.shuffle(alist)
s = Stopwatch()
gc.disable()

# timing
s.start()
alist.sort()
s.stop()
gc.enable()

# results
print('Time =', s.read())

We would like to develop a Performance tool that minimizes what we have to do
to time the code: a tool that also gives us interesting information about
multiple timings. The tool I developed for this lecture is based on the
timeit.py module supplied with Python (see section 27.5 in the standard
library documentation). First, we show an example use of the tool, then its
code. Here is the entire script that uses the tool.

alist = [i for i in range(100_000)]
p = Performance(lambda : alist.sort(), lambda : random.shuffle(alist), 100, 'Sorting')
p.evaluate()
p.analyze()

Its first statement creates a list of 100,000 numbers. Then it constructs a
Performance object with 4 arguments:

(1) A parameterless lambda of the code to execute and time
(2) A parameterless lambda of the setup code to execute (but not time) before
    the lambda in part (1) is called
(3) The number of times to measure the code's execution: how many times to do
    step 2 followed by timing step 1
(4) A short title (printed by the analyze method, which prints the analysis)

The actual __init__ function for Performance looks like:

def __init__(self,code,setup=lambda:None,times_to_measure=5,title='Generic'):

So, in the above call we are timing a call to alist.sort(), which mutates the
list; before timing each call it sets up (not part of the timing) with a call
to random.shuffle(alist), which mutates the list; it will do 100 timings (of
the setup -untimed- followed by the code -timed); the title printed with the
information is 'Sorting'.

Calling p.evaluate() does all the timings and collects the information. It
returns a 2-list: a 3-tuple of the (minimum time, average time, maximum time),
followed by a list of all the (100 in this case) timings. It also saves this
information as part of the state of the Performance object, which is used for
analysis, if we call the analyze function.

Calling p.analyze() prints the following result on my computer. It consists of
the title; the number of timings; the average time, the minimum time, the
maximum time, and the span (a simple approximation to clustering:
(max-min)/avg); and a histogram of the timings (how many fall in the
.0404-.0406 bin, the .0406-.0408 bin, etc.) with an 'A' at the top of the
stars indicating the bin containing the average time.
Notice that although the span says the range of values was 5.9% of the
average, we can see that most of the timings are clustered very close to the
average (which itself is near the minimum time), although there are a few
timings that are "much bigger": .043 vs. the average of .041.

Sorting
Analysis of 100 timings
avg = 0.041  min = 0.040  max = 0.043  span = 5.9%

                    Time Ranges
4.04e-02<>4.06e-02[ 58.4%]|**************************************************
4.06e-02<>4.08e-02[ 20.8%]|*****************A
4.08e-02<>4.11e-02[  6.9%]|*****
4.11e-02<>4.13e-02[  5.0%]|****
4.13e-02<>4.16e-02[  5.0%]|****
4.16e-02<>4.18e-02[  0.0%]|
4.18e-02<>4.20e-02[  1.0%]|
4.20e-02<>4.23e-02[  1.0%]|
4.23e-02<>4.25e-02[  0.0%]|
4.25e-02<>4.28e-02[  1.0%]|
4.28e-02<>4.30e-02[  1.0%]|

Using this tool, I ran a series of sorting experiments, doubling the length of
the list to sort each time. Here was the script:

from goody import irange
import random
from performance import Performance

for i in irange(0,9):
    size = 100_000 * 2**i
    alist = [i for i in range(size)]
    p = Performance(lambda : alist.sort(),
                    lambda : random.shuffle(alist),
                    10,
                    '\n\nSorting '+str(size)+' values')
    p.evaluate()
    p.analyze()

Here is the raw data produced by running this code using Python 3.7 on my new
(in 2019) computer.

------------------------------------------------------------------------------
Sorting 100000 values
Analysis of 10 timings
avg = 0.01542  min = 0.01505  max = 0.01651  span = 9.5%

                    Time Ranges
1.50e-02<>1.52e-02[ 40.0%]|**************************************************
1.52e-02<>1.53e-02[ 10.0%]|************
1.53e-02<>1.55e-02[ 30.0%]|*************************************A
1.55e-02<>1.56e-02[  0.0%]|
1.56e-02<>1.58e-02[ 10.0%]|************
1.58e-02<>1.59e-02[  0.0%]|
1.59e-02<>1.61e-02[  0.0%]|
1.61e-02<>1.62e-02[  0.0%]|
1.62e-02<>1.64e-02[  0.0%]|
1.64e-02<>1.65e-02[  0.0%]|
1.65e-02<>1.67e-02[ 10.0%]|************

Sorting 200000 values
Analysis of 10 timings
avg = 0.03662  min = 0.03462  max = 0.04495  span = 28.2%

                    Time Ranges
3.46e-02<>3.57e-02[ 60.0%]|**************************************************
3.57e-02<>3.67e-02[ 20.0%]|****************A
3.67e-02<>3.77e-02[  0.0%]|
3.77e-02<>3.88e-02[ 10.0%]|********
3.88e-02<>3.98e-02[  0.0%]|
3.98e-02<>4.08e-02[  0.0%]|
4.08e-02<>4.19e-02[  0.0%]|
4.19e-02<>4.29e-02[  0.0%]|
4.29e-02<>4.39e-02[  0.0%]|
4.39e-02<>4.50e-02[  0.0%]|
4.50e-02<>4.60e-02[ 10.0%]|********

Sorting 400000 values
Analysis of 10 timings
avg = 0.09267  min = 0.08984  max = 0.09464  span = 5.2%

                    Time Ranges
8.98e-02<>9.03e-02[ 10.0%]|****************
9.03e-02<>9.08e-02[  0.0%]|
9.08e-02<>9.13e-02[ 10.0%]|****************
9.13e-02<>9.18e-02[  0.0%]|
9.18e-02<>9.22e-02[  0.0%]|
9.22e-02<>9.27e-02[ 30.0%]|**************************************************A
9.27e-02<>9.32e-02[ 10.0%]|****************
9.32e-02<>9.37e-02[ 10.0%]|****************
9.37e-02<>9.42e-02[ 20.0%]|*********************************
9.42e-02<>9.46e-02[  0.0%]|
9.46e-02<>9.51e-02[ 10.0%]|****************

Sorting 800000 values
Analysis of 10 timings
avg = 0.21788  min = 0.21384  max = 0.22209  span = 3.8%

                    Time Ranges
2.14e-01<>2.15e-01[ 40.0%]|**************************************************
2.15e-01<>2.15e-01[  0.0%]|
2.15e-01<>2.16e-01[  0.0%]|
2.16e-01<>2.17e-01[  0.0%]|
2.17e-01<>2.18e-01[ 10.0%]|************A
2.18e-01<>2.19e-01[  0.0%]|
2.19e-01<>2.20e-01[ 10.0%]|************
2.20e-01<>2.20e-01[ 10.0%]|************
2.20e-01<>2.21e-01[ 10.0%]|************
2.21e-01<>2.22e-01[ 10.0%]|************
2.22e-01<>2.23e-01[ 10.0%]|************

Sorting 1600000 values
Analysis of 10 timings
avg = 0.48776  min = 0.48294  max = 0.49364  span = 2.2%

                    Time Ranges
4.83e-01<>4.84e-01[ 10.0%]|*************************
4.84e-01<>4.85e-01[  0.0%]|
4.85e-01<>4.86e-01[ 20.0%]|**************************************************
4.86e-01<>4.87e-01[ 20.0%]|**************************************************
4.87e-01<>4.88e-01[ 10.0%]|*************************A
4.88e-01<>4.89e-01[ 10.0%]|*************************
4.89e-01<>4.90e-01[ 10.0%]|*************************
4.90e-01<>4.91e-01[ 10.0%]|*************************
4.91e-01<>4.93e-01[  0.0%]|
4.93e-01<>4.94e-01[  0.0%]|
4.94e-01<>4.95e-01[ 10.0%]|*************************

Sorting 3200000 values
Analysis of 10 timings
avg = 1.08046  min = 1.07263  max = 1.09409  span = 2.0%

                    Time Ranges
1.07e+00<>1.07e+00[ 10.0%]|****************
1.07e+00<>1.08e+00[ 20.0%]|*********************************
1.08e+00<>1.08e+00[ 30.0%]|**************************************************
1.08e+00<>1.08e+00[ 10.0%]|****************A
1.08e+00<>1.08e+00[ 10.0%]|****************
1.08e+00<>1.09e+00[  0.0%]|
1.09e+00<>1.09e+00[  0.0%]|
1.09e+00<>1.09e+00[  0.0%]|
1.09e+00<>1.09e+00[ 10.0%]|****************
1.09e+00<>1.09e+00[  0.0%]|
1.09e+00<>1.10e+00[ 10.0%]|****************

Sorting 6400000 values
Analysis of 10 timings
avg = 2.38002  min = 2.35988  max = 2.41731  span = 2.4%

                    Time Ranges
2.36e+00<>2.37e+00[ 20.0%]|**************************************************
2.37e+00<>2.37e+00[ 20.0%]|**************************************************
2.37e+00<>2.38e+00[  0.0%]|
2.38e+00<>2.38e+00[ 20.0%]|**************************************************A
2.38e+00<>2.39e+00[ 20.0%]|**************************************************
2.39e+00<>2.39e+00[  0.0%]|
2.39e+00<>2.40e+00[ 10.0%]|*************************
2.40e+00<>2.41e+00[  0.0%]|
2.41e+00<>2.41e+00[  0.0%]|
2.41e+00<>2.42e+00[  0.0%]|
2.42e+00<>2.42e+00[ 10.0%]|*************************

Sorting 12800000 values
Analysis of 10 timings
avg = 5.46846  min = 5.19780  max = 5.88381  span = 12.5%

                    Time Ranges
5.20e+00<>5.27e+00[ 20.0%]|*********************************
5.27e+00<>5.34e+00[  0.0%]|
5.34e+00<>5.40e+00[ 30.0%]|**************************************************
5.40e+00<>5.47e+00[ 10.0%]|****************A
5.47e+00<>5.54e+00[ 10.0%]|****************
5.54e+00<>5.61e+00[ 10.0%]|****************
5.61e+00<>5.68e+00[  0.0%]|
5.68e+00<>5.75e+00[  0.0%]|
5.75e+00<>5.82e+00[  0.0%]|
5.82e+00<>5.88e+00[ 10.0%]|****************
5.88e+00<>5.95e+00[ 10.0%]|****************

Sorting 25600000 values
Analysis of 10 timings
avg = 12.00993  min = 11.52789  max = 12.52565  span = 8.3%

                    Time Ranges
1.15e+01<>1.16e+01[ 10.0%]|****************
1.16e+01<>1.17e+01[  0.0%]|
1.17e+01<>1.18e+01[ 10.0%]|****************
1.18e+01<>1.19e+01[ 30.0%]|**************************************************
1.19e+01<>1.20e+01[ 10.0%]|****************A
1.20e+01<>1.21e+01[ 10.0%]|****************
1.21e+01<>1.22e+01[ 10.0%]|****************
1.22e+01<>1.23e+01[  0.0%]|
1.23e+01<>1.24e+01[  0.0%]|
1.24e+01<>1.25e+01[ 10.0%]|****************
1.25e+01<>1.26e+01[ 10.0%]|****************

Sorting 51200000 values
Analysis of 10 timings
avg = 26.22217  min = 25.98514  max = 26.48482  span = 1.9%

                    Time Ranges
2.60e+01<>2.60e+01[ 40.0%]|**************************************************
2.60e+01<>2.61e+01[ 10.0%]|************
2.61e+01<>2.61e+01[  0.0%]|
2.61e+01<>2.62e+01[  0.0%]|
2.62e+01<>2.62e+01[  0.0%]|A
2.62e+01<>2.63e+01[ 10.0%]|************
2.63e+01<>2.63e+01[  0.0%]|
2.63e+01<>2.64e+01[  0.0%]|
2.64e+01<>2.64e+01[  0.0%]|
2.64e+01<>2.65e+01[ 30.0%]|*************************************
2.65e+01<>2.65e+01[ 10.0%]|************
------------------------------------------------------------------------------

We can summarize this data as follows:

     N      |  Time  | Ratio | Predicted | %Error
------------+--------+-------+-----------+--------
    100,000 |  0.015 |       |   0.025   |  64
    200,000 |  0.037 |  2.5  |   0.052   |  41
    400,000 |  0.093 |  2.5  |   0.110   |  18
    800,000 |  0.218 |  2.3  |   0.232   |   6
  1,600,000 |  0.488 |  2.2  |   0.488   |   0  (predictions based on this run)
  3,200,000 |  1.080 |  2.3  |   1.023   |   5
  6,400,000 |  2.380 |  2.2  |   2.141   |  10
 12,800,000 |  5.468 |  2.3  |   4.472   |  18
 25,600,000 | 12.010 |  2.2  |   9.323   |  22
 51,200,000 | 26.222 |  2.2  |  19.405   |  26

----------
For comparison, here is the same summary information from my previous computer
(new in 2012) running an earlier version of Python (I did not record which).
Note that the times are about twice as long, but the ratios are about the
same. Just what we would expect from a slower technology running an algorithm
in a complexity class that hasn't changed.

     N      |  Time  | Ratio | Predicted | %Error
------------+--------+-------+-----------+--------
    100,000 |  0.037 |       |   0.048   |  29
    200,000 |  0.080 |  2.2  |   0.101   |  26
    400,000 |  0.178 |  2.2  |   0.213   |  20
    800,000 |  0.416 |  2.3  |   0.450   |   8
  1,600,000 |  0.945 |  2.3  |   0.945   |   0  (predictions based on this run)
  3,200,000 |  2.145 |  2.3  |   1.982   |   8
  6,400,000 |  4.853 |  2.3  |   4.150   |  15
 12,800,000 | 10.925 |  2.3  |   8.660   |  21
 25,600,000 | 24.578 |  2.2  |  18.055   |  27
 51,200,000 | 56.953 |  2.3  |  37.576   |  34
----------

I sorted lists from 100 thousand to 51.2 million values, doubling the length
every time; the sizes are listed in the first column. The average times (from
10 experiments each) are listed in the second column. I computed the ratio
T(2N)/T(N) for each N (after the first), and the ratio was always bigger than
2 by a small amount that generally gets smaller. This indicates that the
complexity class is slightly higher than O(N). As we discussed, it is actually
O(N Log N), and this is the signature for O(N Log N): ratios slightly bigger
than 2.

Using O(N Log N) as the complexity class and using N = 1,600,000, I solved for
the constant in the formula T(N) = c * N Log N and got c = 1.48E-08 (for
logarithms base 2: I have to choose a base, but I can use any), so we can
approximate the time taken to sort as T(N) = 1.48*10^-8 * N Log2 N. Given this
approximation, the next columns show the times predicted for each size N, and
the percent error between the predicted and real time (which grows as N gets
farther away -in both directions- from 1,600,000). The errors are not bad:
even a 100% error means that we have still predicted the time within a factor
of 2, and here the worst error was about 64%, when N was smallest.
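Here is a small sketch that mechanizes this analysis, using some of the
(N, average time) pairs from the first summary table above: it computes the
doubling ratios and then re-derives the constant c to predict the other times.

import math

# (N, average seconds) pairs copied from the table above (2019 computer)
data = [(100_000, 0.015), (200_000, 0.037), (400_000, 0.093),
        (800_000, 0.218), (1_600_000, 0.488), (3_200_000, 1.080)]

# Doubling signature: T(2N)/T(N) of ~2 suggests O(N); a bit above 2
# suggests O(N Log N); ~4 would suggest O(N**2).
for (n, t), (n2, t2) in zip(data, data[1:]):
    print(f'N = {n2:>9,}: T(2N)/T(N) = {t2/t:.1f}')

# Solve T(N) = c * N log2 N for c at N = 1,600,000, then predict the rest.
c = 0.488 / (1_600_000 * math.log2(1_600_000))
for n, t in data:
    print(f'N = {n:>9,}: predicted = {c*n*math.log2(n):.3f}  actual = {t:.3f}')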
Here is the actual code for the Performance class. You will see that although
the constructor specifies times_to_measure, we can omit/override this value
when calling evaluate() by passing the number of times to test the code.
Likewise with the title and the analyze method (which also allows
specification of the number of bins to use in the histogram of times created).

import gc
from stopwatch import Stopwatch
from goody import frange

class Performance:
    def __init__(self,code,setup=lambda:None,times_to_measure=5,title='Generic'):
        self._code  = code
        self._setup = setup
        self._times = times_to_measure
        self._evaluate_results = None
        self._title = title

    def evaluate(self,times=None):
        results = []
        s = Stopwatch()
        times = times if times != None else self._times
        for _ in range(times):
            self._setup()
            s.reset()
            gc.disable()
            s.start()
            self._code()
            s.stop()
            gc.enable()
            results.append(s.read())
        self._evaluate_results = [(min(results),sum(results)/times,max(results))] + [results]
        return self._evaluate_results

    def analyze(self,bins=10,title=None):
        if self._evaluate_results == None:
            print('No results from calling evaluate() to analyze')
            return

        def print_histogram(bins_dict):
            count = sum(bins_dict.values())
            max_for_scale = max(bins_dict.values())
            for k,v in sorted(bins_dict.items()):
                pc = int(v/max_for_scale*50)
                extra = 'A' if k[0] <= avg < k[1] else ''
                print('{bl:.2e}<>{bh:.2e}[{count: 5.1f}%]|{stars}'.format(bl=k[0],bh=k[1],count=v/count*100,stars='*'*pc+extra))

        (mini,avg,maxi),times = self._evaluate_results
        incr = (maxi-mini)/bins
        hist = {(f,f+incr):0 for f in frange(mini,maxi,incr)}
        for t in times:
            for (min_t,max_t) in hist:
                if min_t <= t < max_t:
                    hist[(min_t,max_t)] += 1
        print(title if title != None else self._title)
        print('Analysis of',len(times),'timings')
        print('avg = {avg:.3f}  min = {min:.3f}  max = {max:.3f}  span = {span:.1f}%'.
              format(min=mini,avg=avg,max=maxi,span=(maxi-mini)/avg*100))
        print('\n                    Time Ranges')
        print_histogram(hist)

This class is in the performance.py module in the empirical project folder: a
download that accompanies this lecture. So, you can run your own experiments
by importing this module wherever there is code that you want to time.

------------------------------------------------------------------------------
Heights of Random Binary Search Trees: A Dynamic Analysis

Let's empirically examine the heights of binary search trees constructed at
random: values are added into binary search trees in random orders. We know
that the maximum/worst-case height for a binary search tree with N values is
N-1; the minimum/best-case is a height of about (Log2 N)-1. We will write code
below to perform a specified number of experiments, each building a random
binary search tree of a specified size: for each experiment we will collect
the height of the tree produced, and ultimately plot a histogram of all the
heights.

To run these experiments, we need to access the TN class and the height, add,
and add_all functions, which are all written in the tree project folder
(examined when we discussed trees). I copied that code into the randomtrees.py
module, but we could have imported it. Here is the code to prompt the user for
the experiment and compute a histogram of all the different tree heights.
import prompt,random,math
from goody import irange
from collections import defaultdict

experiments = prompt.for_int('Enter # of experiments to perform')
size        = prompt.for_int('Enter size of tree for each experiment')

hist = defaultdict(int)
alist = [i for i in range(size)]
for exp in range(experiments):
    if exp % (experiments//100) == 0:
        print('Progress: {p:d}%'.format(p=int(exp/(experiments//100))))
    random.shuffle(alist)
    hist[ height(add_all(None,alist)) ] += 1

print_histogram('Binary Search Trees of size '+str(size),hist)
print('\nminimum possible height =',math.ceil(math.log2(size)-1),
      ' maximum possible height =',size-1)

For 10,000 experiments run on binary search trees of size 1,000, this code
printed the following results (after computing for a few minutes). For a
1,000 node tree, the minimum possible height is 9 and the maximum possible
height is 999. The heights recorded here are all between about 1.5 times the
minimum and about 3 times the minimum (which is true for much larger random
binary search trees as well; see the next analysis).

Binary Search Trees of size 1000
Analysis of 10,000 experiments

avg = 21.0  min = 16  max = 31

  16[  0.0%]|
  17[  1.0%]|**
  18[  5.4%]|************
  19[ 14.3%]|*********************************
  20[ 21.2%]|*************************************************
  21[ 21.5%]|**************************************************A
  22[ 16.1%]|*************************************
  23[  9.8%]|**********************
  24[  5.7%]|*************
  25[  2.9%]|******
  26[  1.3%]|**
  27[  0.5%]|*
  28[  0.1%]|
  29[  0.1%]|
  30[  0.0%]|
  31[  0.0%]|

minimum possible height = 9  maximum possible height = 999

Note that because the 16, 30, and 31 bins are printed, their counts were not 0
(there were randomly constructed trees with those heights), although there
were so few trees of these heights that their percentages (to one decimal
place) print as 0.0. There were no trees with heights less than 16 or greater
than 31 (or they too would have been printed).

The print_histogram function called in the code above is shown below.

def print_histogram(title,bins_dict):
    print(title)
    count         = sum(bins_dict.values())
    min_bin       = min(bins_dict.keys())
    max_bin       = max(bins_dict.keys())
    max_for_scale = max(bins_dict.values())
    print('Analysis of {count:,} experiments'.format(count=count))
    w_sum = 0
    for i in bins_dict:
        w_sum += i*bins_dict[i]
    avg = w_sum/count
    print('\navg = {avg:.1f}  min = {min}  max = {max}\n'.format(avg=avg,min=min_bin,max=max_bin))
    for i in irange(min_bin,max_bin):
        pc = int(bins_dict[i]/max_for_scale*50)
        extra = 'A' if int(avg+.5) == i else ''
        print('{bin:4}[{count: 5.1f}%]|{stars}'.format(bin=i,count=bins_dict[i]/count*100,stars='*'*pc+extra))
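If you do not have the course's TN class and its add_all/height functions
available, here is a minimal self-contained stand-in (binary search trees
represented as [value, left, right] lists) that runs one such experiment; it
is a sketch, not the course code.

import random

def add(node, value):
    # Insert value into a BST of [value, left, right] lists.
    if node is None:
        return [value, None, None]
    if value < node[0]:
        node[1] = add(node[1], value)
    else:
        node[2] = add(node[2], value)
    return node

def height(node):
    # Height = number of edges on the longest root-to-leaf path.
    if node is None:
        return -1
    return 1 + max(height(node[1]), height(node[2]))

values = list(range(1000))
random.shuffle(values)
root = None
for v in values:
    root = add(root, v)
print('height of one random 1,000-node BST =', height(root))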
The results below are for 10,000 experiments run on binary search trees of
size 100,000. This code took about 5 hours to run. Notice too that almost all
the tree heights are between 2 and 3 times the minimum possible height; none
are near the maximum of 99,999.

Binary Search Trees of size 100000
Analysis of 10,000 experiments

avg = 39.6  min = 34  max = 53

  34[  0.1%]|
  35[  0.7%]|*
  36[  4.2%]|**********
  37[ 11.1%]|****************************
  38[ 16.9%]|******************************************
  39[ 19.7%]|**************************************************
  40[ 17.1%]|*******************************************A
  41[ 12.5%]|*******************************
  42[  8.0%]|********************
  43[  4.9%]|************
  44[  2.4%]|******
  45[  1.4%]|***
  46[  0.7%]|*
  47[  0.3%]|
  48[  0.2%]|
  49[  0.1%]|
  50[  0.1%]|
  51[  0.0%]|
  52[  0.0%]|
  53[  0.0%]|

minimum possible height = 16  maximum possible height = 99999

All this code is in the randomtrees.py module in the empirical project folder:
a download that accompanies this lecture. So, you can run your own
experiments. If you know a lot about math and probability, you can compute the
expected height of a random binary search tree, and it will closely agree with
this empirical result.

------------------------------------------------------------------------------
Profiling Programs: Performance Summary of all Functions at the Program Level

Profilers are modules that run other modules and collect information about
their execution: typically information about how many times their functions
are called, and the time spent inside each function: both the individual time
and the cumulative time (which also includes the amount of time spent inside
the functions they call). For example, in

def f(...):
    ...code 1, no function calls
    g(...)
    ...code 2, no function calls

the INDIVIDUAL time for f includes the amount of time spent in code 1 and
code 2; the CUMULATIVE time for f also includes the amount of time spent in
function g; of course g will have its own individual and cumulative times too.

Such information is useful for focusing our attention on the functions taking
a significant amount of time, so that we can optimize/rewrite them in the hope
of significantly improving the speed of our entire program (and not wasting
our time optimizing/rewriting functions that do not significantly affect the
running time of the entire program).

Although the programs that we wrote this quarter were sophisticated, they ran
on only a small amount of data and executed quickly. The ideal program to
profile is one that uses a lot of data and takes a long time to run. In my
ICS-31 class, students wrote a program that performs "shotgun assembly" on DNA
strands. The program isn't huge (only about 50 lines of code), but in the
largest problem I have students solve, the input is 1,824 DNA fragments that
are each 50-100 bases long, so the input is hundreds of thousands of bases. My
solution program took about 1.5 minutes to run, before printing the resulting
10,000 base DNA strand that it built from all these overlapping DNA fragments.

This program is a module that defines functions followed by a script that
calls these functions. To run it using the profiler module, we need to move
the statements in the script into their own function (which I generically
called perform_task). Then, we add the following import at the top, and add
the following function call at the bottom to profile the function/program.

import cProfile

...all the functions in the module, plus perform_task (the script in a function)

cProfile.run('perform_task()')

When run, the DNA assembly program performs its job and produces the correct
output (and in the process, prints information into the console).
Then the profiler prints the following information in the console, which shows
in the top line that it profiled 234 million function calls over 107 seconds:
so overall Python called 2.19 million functions/second. The data shown here
(and all the data shown below later) is always sorted by one of the column
headings. This data is sorted by the ASCII values in the strings produced by
filename:lineno(function). The columns (no matter how they are sorted)
represent

ncalls : the number of times the function was called
tottime: the total time spent in just that function, NOT INCLUDING the time
         spent in the other functions that it calls (although some built-in
         functions it calls cannot be timed separately)
         This is a bad name (I'd prefer "individual time"), but "total time"
         is what the profiler calls it.
cumtime: the cumulative time spent in that function, INCLUDING the time spent
         in the other functions that it calls

So, as illustrated above, if function f performed some computation and called
function g, the tottime for f would NOT include the time spent in g, but the
cumtime would include this time. So it should always be the case that
tottime <= cumtime.

Examine the information shown below, but we will look at parts of it more
selectively soon. Note that the sum of all the tottime data is the running
time, but many cumtime data have the same value as the total running time (or
close): the <module>, exec, perform_task, and assemble functions all show a
cumulative time equal to the running time, because the time spent in the
functions they call is counted in their cumtime.

         233,525,220 function calls in 106.705 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  106.705  106.705 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 codecs.py:164(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:238(__init__)
        6    0.000    0.000    0.000    0.000 cp1252.py:18(encode)
       14    0.000    0.000    0.000    0.000 cp1252.py:22(decode)
  4671030    3.541    0.000    4.028    0.000 goody.py:17(irange)
     3627    1.311    0.000    1.768    0.000 listlib.py:16(remove)
        2    0.000    0.000    0.000    0.000 locale.py:555(getpreferredencoding)
  4671030   39.505    0.000  103.272    0.000 profilednamaker.py:10(max_overlap)
        1    0.000    0.000    0.001    0.001 profilednamaker.py:19(read_fragments)
        1    0.001    0.001    0.001    0.001 profilednamaker.py:20()
     1814    1.650    0.001  104.921    0.058 profilednamaker.py:23(choose)
        1    0.010    0.010  106.699  106.699 profilednamaker.py:33(assemble)
  1658018    0.152    0.000    0.152    0.000 profilednamaker.py:42()
  1656204    0.155    0.000    0.155    0.000 profilednamaker.py:43()
        1    0.004    0.004  106.705  106.705 profilednamaker.py:49(perform_task)
        7    0.000    0.000    0.000    0.000 profilednamaker.py:55()
        7    0.000    0.000    0.000    0.000 profilednamaker.py:56()
194186759   58.607    0.000   58.607    0.000 profilednamaker.py:6(overlaps)
        2    0.000    0.000    0.000    0.000 {built-in method _getdefaultlocale}
       14    0.000    0.000    0.000    0.000 {built-in method charmap_decode}
        6    0.000    0.000    0.000    0.000 {built-in method charmap_encode}
        1    0.000    0.000  106.705  106.705 {built-in method exec}
 22001998    1.140    0.000    1.140    0.000 {built-in method len}
  4671030    0.630    0.000    0.630    0.000 {built-in method min}
        2    0.000    0.000    0.000    0.000 {built-in method open}
        3    0.000    0.000    0.000    0.000 {built-in method print}
     1813    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1824    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'sort' of 'list' objects}

By calling cProfile.run('perform_task()','profile') we direct the run function
to not print its results on the console, but instead to write them into the
file named 'profile' (or any other file name we want to use).
Then we can use the pstats module, described below, to read this data file and
print it in simplified (easier to read and use) forms. Here is a script that
uses pstats to show just the top 10 lines of the data above, when sorted by
ncalls, cumtime, and tottime.

import pstats

# Get data from stored file named 'profile'
p = pstats.Stats('profile')

# Uncomment the line below to print all the information above
# strip_dirs removes directory information, but leaves file names
# No argument to print_stats() prints all statistics
#p.strip_dirs().sort_stats(-1).print_stats()

# An argument to print_stats() prints that many lines, in decreasing order
p.strip_dirs().sort_stats('calls').print_stats(10)
p.strip_dirs().sort_stats('cumulative').print_stats(10)
p.strip_dirs().sort_stats('time').print_stats(10)

The three results this script prints are

         233,525,220 function calls in 106.705 seconds

   Ordered by: call count
   List reduced from 31 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
194186759   58.607    0.000   58.607    0.000 profilednamaker.py:6(overlaps)
 22001998    1.140    0.000    1.140    0.000 {built-in method len}
  4671030    3.541    0.000    4.028    0.000 goody.py:17(irange)
  4671030    0.630    0.000    0.630    0.000 {built-in method min}
  4671030   39.505    0.000  103.272    0.000 profilednamaker.py:10(max_overlap)
  1658018    0.152    0.000    0.152    0.000 profilednamaker.py:42()
  1656204    0.155    0.000    0.155    0.000 profilednamaker.py:43()
     3627    1.311    0.000    1.768    0.000 listlib.py:16(remove)
     1824    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
     1814    1.650    0.001  104.921    0.058 profilednamaker.py:23(choose)

         233,525,220 function calls in 106.705 seconds

   Ordered by: cumulative time
   List reduced from 31 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  106.705  106.705 {built-in method exec}
        1    0.000    0.000  106.705  106.705 <string>:1(<module>)
        1    0.004    0.004  106.705  106.705 profilednamaker.py:49(perform_task)
        1    0.010    0.010  106.699  106.699 profilednamaker.py:33(assemble)
     1814    1.650    0.001  104.921    0.058 profilednamaker.py:23(choose)
  4671030   39.505    0.000  103.272    0.000 profilednamaker.py:10(max_overlap)
194186759   58.607    0.000   58.607    0.000 profilednamaker.py:6(overlaps)
  4671030    3.541    0.000    4.028    0.000 goody.py:17(irange)
     3627    1.311    0.000    1.768    0.000 listlib.py:16(remove)
 22001998    1.140    0.000    1.140    0.000 {built-in method len}

         233,525,220 function calls in 106.705 seconds

   Ordered by: internal time
   List reduced from 31 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
194186759   58.607    0.000   58.607    0.000 profilednamaker.py:6(overlaps)
  4671030   39.505    0.000  103.272    0.000 profilednamaker.py:10(max_overlap)
  4671030    3.541    0.000    4.028    0.000 goody.py:17(irange)
     1814    1.650    0.001  104.921    0.058 profilednamaker.py:23(choose)
     3627    1.311    0.000    1.768    0.000 listlib.py:16(remove)
 22001998    1.140    0.000    1.140    0.000 {built-in method len}
  4671030    0.630    0.000    0.630    0.000 {built-in method min}
  1656204    0.155    0.000    0.155    0.000 profilednamaker.py:43()
  1658018    0.152    0.000    0.152    0.000 profilednamaker.py:42()
        1    0.010    0.010  106.699  106.699 profilednamaker.py:33(assemble)

Note that it says "Ordered by: internal time" although it is sorted by
"tottime"; the word "internal" is similar to how I described "individual" time
above. As we can see directly from the information above, the most tottime is
spent in the overlaps function.
It is very simple, so when I tried to write it another way, I couldn't get any
time improvement. So then I moved on to the max_overlap function. The
max_overlap function calls overlaps for each possible overlap (based on the
lengths of the strands to match); since the ratio of calls (overlaps calls /
max_overlap calls) is about 42/1, the average possible strand overlap is about
42. So, it is possible to infer information about the code from this empirical
data. By examining ncalls, and the place max_overlap was called (in choose), I
realized that I could simplify it to not compute the maximum overlap, but just
find (and immediately return) any overlap that exceeds a minimum specified in
the choose method. By changing this code, the above profile turned into the
following one.

         196,057,553 function calls in 87.748 seconds

   Ordered by: internal time
   List reduced from 31 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
152048062   46.566    0.000   46.566    0.000 profilednamaker.py:6(overlaps)
  4671030   31.241    0.000   84.226    0.000 profilednamaker.py:10(exceeds_overlap)
  4671030    4.615    0.000    5.287    0.000 goody.py:17(irange)
     1814    1.713    0.001   85.939    0.047 profilednamaker.py:23(choose)
 26673028    1.334    0.000    1.334    0.000 {built-in method len}
     3627    1.316    0.000    1.795    0.000 listlib.py:16(remove)
  4671030    0.636    0.000    0.636    0.000 {built-in method min}
  1656204    0.158    0.000    0.158    0.000 profilednamaker.py:43()
  1658018    0.153    0.000    0.153    0.000 profilednamaker.py:42()
        1    0.009    0.009   87.743   87.743 profilednamaker.py:33(assemble)

I was able to decrease the run time by about 20 seconds (almost 20%). Although
exceeds_overlap (I changed the name from max_overlap) is called the same
number of times as before, it calls overlaps about 25% fewer times, saving 12
seconds; and exceeds_overlap itself saves another 8 seconds over max_overlap,
which together accounts for the full 20 seconds.

In a large system with thousands of functions, it is a big win to use the
profiler to focus our attention on the important functions: the ones that take
a significant amount of time, and therefore the ones likely to cause a major
decrease in the runtime if improved. A rule of thumb is that 20% of the code
accounts for 80% of the execution time (some say 10%/90%, but the idea is the
same). We need to be able to focus on which small amount of code the program
spends most of its time in. In the code above, if by hard work I could make
the bottom 28 functions run instantaneously, there would be at most a 6 second
(7%) speedup: why bother? Better to have that code written as clearly as
possible, since its execution accounts for so little time.

Finally, the complete specifications for cProfile and pstats are available in
the Python library documentation, under section 27 (Debugging and Profiling),
in 27.4: The Python Profilers.
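To experiment with the profiler without a long-running program, here is a
small self-contained sketch (the helper and perform_task functions here are
just illustrative); it writes the statistics to a file and prints the top
functions by tottime, like the script above.

import cProfile, pstats

def helper(i):
    return i * i                    # a tiny function called many times

def perform_task():
    return sum(helper(i) for i in range(2_000_000))

cProfile.run('perform_task()', 'profile')   # write stats to file 'profile'
p = pstats.Stats('profile')
p.strip_dirs().sort_stats('time').print_stats(5)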
------------------------------------------------------------------------------
Hashing: How Sets/Dicts are Faster than Lists for operations like "in"

When we examined the complexity classes of various operations/methods on the
list, set, and dict data-types/classes, we found that sets/dicts had many
methods that were O(1). We will briefly explain how this is accomplished, but
a fuller discussion will have to wait until ICS-46. First we will examine how
hashing works and then analyze the complexity class of operations using
hashing.

Python defines a hash function that takes any object as an argument and
produces a "small" integer (sometimes positive, sometimes negative) whose
magnitude fits in 32 or 64 bits (depending on which Python you use). How this
value is computed won't concern us now. But we are guaranteed that (1) there
is a fast way to compute it, and (2) within the same program, an object with
unchanged state always computes the same result for hashing; so immutable
objects always compute the same result for their hash function. It is
possible, but not likely, that two DIFFERENT objects will hash to the same
value. Each class in Python implements a __hash__ method, which is called by
the hash function in Python's builtins module.

Small integers are their own hash: hash(1) is 1; hash(1000000) is 1000000; but
hash(1000000000000000000000000) is 1486940387. Even small strings have
interesting hash values: hash('a') is 1186423063; hash('b') is 1561756564; and
hash('pattis') is -1650297348 (is your name hashed positive or negative?)
(BUT SEE THE IMPORTANT NOTE BELOW).

-----
Note that when we define our own classes, if we want them to be used in sets
or as the keys in dictionaries (things that are hashed), we must define a
__hash__ method for that class (with only self as its parameter), and this
method must return an int. If we do not provide a __hash__ method and try to
use objects from that class in sets or as the keys of dictionaries, Python
will raise a TypeError with the message: "unhashable type".

Technically, classes that contain mutator methods should NOT be hashable, but
it is OK to define __hash__ for a class of mutable objects: if we do so, we
must NEVER mutate an object while it is in a set or is the key of a
dictionary: otherwise the object might be "lost" because it is in the wrong
bucket (see below). If we want to mutate such an object, we should remove it
from the set or dict, mutate it, and then put it back in: all these operations
are O(1). Again, hashing and hash algorithms are covered in much more detail
in ICS-46.
-----
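Here is a minimal sketch of such a class definition (Card is just an
illustrative name). The key requirement is that objects that are == must
return equal __hash__ values, which is why both methods use the same state
(state we promise never to mutate while the object is in a set/dict).

class Card:
    def __init__(self, rank, suit):
        self._rank, self._suit = rank, suit      # treated as immutable
    def __eq__(self, other):
        return (self._rank, self._suit) == (other._rank, other._suit)
    def __hash__(self):
        return hash((self._rank, self._suit))    # must agree with __eq__

deck = {Card('A', 'spades'), Card('A', 'spades')}
print(len(deck))   # 1: equal cards hash the same, so the set stores just one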
Let's investigate how to use hashing and lists to implement pset (pseudo set)
with a quick way to add/remove values, and to check whether a value is in the
set: all are in complexity class O(1). We define the class pset (and its
__init__) as follows. Notice that _bins is a list of lists (starting with just
1 inner list, which is empty). The __str__ method prints these bins. Objects
in this class use _len to cache the number of values in their sets:
incrementing it when adding values and decrementing it when removing values.
The last parameter specifies the load factor threshold, which we will discuss
when we examine the add method. Notice that the first parameter, iterable, is
anything we can iterate over, adding each value to the set to initialize it.

class pset:
    def __init__(self,iterable=[],load_factor_threshold=1):
        self._bins = [[]]
        self._len  = 0    # cache, so we don't have to recompute from _bins
        self._lft  = load_factor_threshold
        for v in iterable:
            self.add(v)   # See the add method below, using hashing into _bins

    def __str__(self):
        return str(self._bins)

Recall that _bins will store a list of lists (which we call a hash table).
Each inner list is called a bin. Hash tables grow as we add values to them
(just how they grow is the secret of the O(1) performance). Before discussing
the add method, let's observe what _bins looks like as values are added.

The load factor of a hash table is the number of values in the table divided
by the number of bins (inner lists). As we add values to the bins, this number
increases. Whenever this value exceeds the load factor threshold, the number
of bins is doubled, and all the values that were in the original bins are put
back into the new bins (we will see why their positions often change). Such an
operation is called rehashing. Increasing the number of bins lowers the load
factor below the threshold (by increasing its denominator): generally, when we
double the number of bins, the average size of a bin is cut in half. In the
example below, we will assume the load factor threshold is the default
value, 1.

0) start  : [[]]
1) add 'a': [['a']]
2) add 'b': [['b'], ['a']]
3) add 'c': [['b'], [], [], ['a', 'c']]
4) add 'd': [['b'], [], ['d'], ['a', 'c']]
5) add 'e': [[], [], [], ['c'], ['b'], ['e'], ['d'], ['a']]
6) add 'f': [['f'], [], [], ['c'], ['b'], ['e'], ['d'], ['a']]
7) add 'g': [['f'], [], [], ['c'], ['b'], ['e'], ['d'], ['a', 'g']]
8) add 'h': [['f'], [], ['h'], ['c'], ['b'], ['e'], ['d'], ['a', 'g']]

Recall Load Factor = # of values in the table / # of bins in the table

0) At the start, there is 1 bin with no values, so the load factor is 0.
1) We add 'a' to the first bin in the _bins list; the load factor is 1.
2) We add 'b' to the first bin in the _bins list; the load factor is 2, which
   exceeds the threshold, so all the values are rehashed as shown, and the
   load factor is now 2/2.
3) We add 'c' to a bin in the _bins list; the load factor is 3/2, which
   exceeds the threshold, so all the values are rehashed (notice that 'a' and
   'c' are both in the same bin; this is called a collision, and it often
   happens when there are many values in hash tables).
4) We add 'd' to the third bin in the _bins list; the load factor is 4/4.
5) We add 'e' to one bin in the _bins list; the load factor is 5/4, which
   exceeds the threshold, so all the values are rehashed (notice all the
   values are in their own bins now), and the load factor is now 5/8.
6) We add 'f' to the first bin in the _bins list; the load factor is 6/8.
7) We add 'g' to the last bin in the _bins list; the load factor is 7/8.
8) We add 'h' to the third bin in the _bins list; the load factor is 8/8.
   (adding yet another value will double the number of bins)

So, hashing (and rehashing) values puts them in bins. Notice that some bins
are empty and some store more than 1 value, but most store 1 value. This is
because the load factor threshold is about 1.

Now let's look at the _bin_of helper method, which finds the bin for a value:
if the value is in the pset, it must be in that bin (although this calculation
changes if the length of _bins changes, which is why rehashing is necessary);
if the value is to be added to a pset, that is the bin it belongs in.

    def _bin_of(self,v):
        return abs(hash(v))%len(self._bins)

It hashes the value, computes its absolute value, and then computes the
remainder when divided by the number of bins; so the index it produces is
always between 0 and len(self._bins)-1: always a legal index for the _bins
list. This is the bin the value belongs in WHEN THE HASH TABLE IS THAT LENGTH:
if its length changes, the denominator in the calculation above also changes,
so the bin that the value belongs in probably changes too.
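Here is a quick sketch of that dependence (the exact indexes printed will vary
from run to run, because Python randomizes string hashing; see the IMPORTANT
note near the end of this section):

for bins in [1, 2, 4, 8]:
    index = abs(hash('a')) % bins   # same value, different table lengths
    print('with', bins, 'bin(s), value a belongs in bin', index)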
The code for add is as follows:

    def add(self,v):
        index = self._bin_of(v)       # hash v and compute its bin
        if v in self._bins[index]:    # No work to do: already in the pset
            return                    # bins tend to be small, so "in" is fast
        self._len += 1                # Cache len: now contains one more value
        self._bins[index].append(v)   # Store it in the right bin
        # If we exceed the load factor threshold, rehash all values into a
        # bigger table
        if self._len/len(self._bins) > self._lft:
            self._rehash()

It first computes the bin in which the value v must be (if it is in the hash
table) and then checks whether it is there; if so, it returns immediately,
because psets store just one copy of a value. Otherwise, it increments _len
and appends the value v to the bin in which it belongs. But if the newly added
value makes the load factor exceed the threshold, all the values are rehashed
by the following helper method. The _rehash helper method is called only from
add.

    def _rehash(self):
        old = self._bins
        # Double the number of bins (to drop the load factor below the
        # threshold); rehash the old values: len(self._bins) has changed
        self._bins = [[] for _ in range(2*len(old))]
        self._len  = 0
        for bins in old:
            for v in bins:
                self.add(v)

This method remembers the old bins, resets _bins and _len, and then adds each
value v from the old bins into the new bins; its bin number might change
because of the % calculation in _bin_of. Because we double the number of bins,
there will be no further calls to _rehash while all these values are re-added.

Checking whether a value v is in a pset is simple: it just checks whether v is
in the bin/list that hashing says it belongs in. What will that bin store? An
empty list or a small list (maybe storing v, maybe not). So calling "in" on it
will be fast.

    def __contains__(self,v):
        return v in self._bins[self._bin_of(v)]

Likewise, removal goes to the bin the value v would be in IF it were in the
pset; if it is there, it is removed and the cached length is decremented; if
it is not in this bin, no other bins need to be checked: v is not in the pset,
so remove raises a KeyError (just as Python's set.remove does). Note that the
original code omitted the decrement of _len; it is added here so that __len__
(below) stays correct.

    def remove(self,v):
        alist = self._bins[self._bin_of(v)]   # alist shares the bin's list
        if v in alist:
            alist.remove(v)
            self._len -= 1                    # Cache len: one fewer value
        else:
            raise KeyError(str(v))

Actually, removal can sometimes shrink the number of bins and rehash all the
values. We leave that feature out of this discussion, but you are welcome to
extend the remove code to ensure that as psets get smaller, the number of bins
gets smaller too.

Finally, we can show the trivial __len__ function, returning the cached value
(incremented in add and decremented in remove):

    def __len__(self):
        return self._len   # cached

So, why are the add, __contains__, and remove methods O(1)? Because the hash
function does a good job of randomizing which bins values are stored in, and
the load factor is kept around 1 (meaning there are about as many bins as
values: as more and more values are added, the length of the list of bins
grows), the amount of time it takes to add/examine/remove a value from its bin
in the hash table (as used in pset) is constant. That is, if there are N
values in the pset, there are at least N bins, and the average number of
values in a bin is close to 1. It takes a constant amount of work (independent
of the number of values in the hash table) to hash a value to find its bin,
and since each bin has about 1 value in it, it takes a constant amount of time
to check or update that bin. Now, some bins are empty and some can have more
than one value (but very few have a lot of values).
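Putting the pieces together, here is a small usage sketch of the pset class
defined above:

s = pset('abcdefgh')   # initialize from any iterable
print(len(s))          # 8, via the cached _len
print('a' in s)        # True, via __contains__ (hash, then check one bin)
s.remove('a')
print('a' in s)        # False
print(s)               # __str__ shows the bins themselves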
I added an analyze method to pset so that it can show statistics about the
number of bins and their lengths. If we call the following function as
experiment(1_000_000), it generates 1 million random 26-letter strings and
puts them in a hash table, whose number of bins grows from 1 to 2**20 (which
is a bit over a million).

def build_set(n,m=26):
    s = pset()
    word = list('abcdefghijklmnopqrstuvwxyz'[:m])
    for i in range(n):
        random.shuffle(word)
        s.add(''.join(word))
    return s

def experiment(n,m=26):
    s = build_set(n,m)
    s.analyze()

We can call this function to analyze the distribution of values in bins. Here
is one result produced by calling experiment with the argument 1 million
(whose output text is reduced a bit to fit nicely on one page).

bins with  0 values = 403,619  totalling       0 values; cumulative =         0
bins with  1 values = 385,426  totalling 385,426 values; cumulative =   385,426
bins with  2 values = 184,243  totalling 368,486 values; cumulative =   753,912
bins with  3 values =  58,602  totalling 175,806 values; cumulative =   929,718
bins with  4 values =  13,708  totalling  54,832 values; cumulative =   984,550
bins with  5 values =   2,492  totalling  12,460 values; cumulative =   997,010
bins with  6 values =     427  totalling   2,562 values; cumulative =   999,572
bins with  7 values =      47  totalling     329 values; cumulative =   999,901
bins with  8 values =      10  totalling      80 values; cumulative =   999,981
bins with  9 values =       1  totalling       9 values; cumulative =   999,990
bins with 10 values =       1  totalling      10 values; cumulative = 1,000,000

As you can see, more bins store no values than store any other number of
values! But many other bins store one value (or just a few values); in fact
the bins storing 1-5 values account for over 99% of the values in the hash
table. So for over 99% of the values in the hash table, it takes at most 5
comparisons to examine/update these bins, and 5 is a constant. If we put 2
million values into the pset, the bin profile would be similar, again with 99%
of the values findable with at most 5 comparisons. Here is the information for
a 2 million value pset.

bins with  0 values = 807,118  totalling       0 values; cumulative =         0
bins with  1 values = 771,329  totalling 771,329 values; cumulative =   771,329
bins with  2 values = 368,773  totalling 737,546 values; cumulative = 1,508,875
bins with  3 values = 115,937  totalling 347,811 values; cumulative = 1,856,686
bins with  4 values =  27,777  totalling 111,108 values; cumulative = 1,967,794
bins with  5 values =   5,245  totalling  26,225 values; cumulative = 1,994,019
bins with  6 values =     851  totalling   5,106 values; cumulative = 1,999,125
bins with  7 values =     104  totalling     728 values; cumulative = 1,999,853
bins with  8 values =      15  totalling     120 values; cumulative = 1,999,973
bins with  9 values =       3  totalling      27 values; cumulative = 2,000,000

Now we can close the circle started in this lecture by using Performance to
empirically analyze whether all our conjectures about the performance of hash
tables are correct. We will construct psets with different numbers of values,
doubling each time. We test each pset by adding N values and then performing N
lookups. If each of these operations is truly O(1) and we do N of each, the
complexity class of doing both is O(N), so doubling N should double the time.
The data shows this behavior exactly, with much less error than our sorting
analysis.
    N    |  Time  | Time / N | Ratio | Predicted | %Error
---------+--------+----------+-------+-----------+-------
   1,000 |  0.030 | .000030  |       |   0.030   |  0
   2,000 |  0.060 | .000030  |  1.0  |   0.060   |  0
   4,000 |  0.120 | .000030  |  1.0  |   0.120   |  0
   8,000 |  0.240 | .000030  |  1.0  |   0.241   |  0
  16,000 |  0.481 | .000030  |  1.0  |   0.481   |  0  (predictions based on this run)
  32,000 |  0.962 | .000030  |  1.0  |   0.962   |  0
  64,000 |  1.927 | .000030  |  1.0  |   1.924   |  0
 128,000 |  3.873 | .000030  |  1.0  |   3.848   |  1
 256,000 |  7.735 | .000030  |  1.0  |   7.696   |  1

(Here the Ratio column compares successive Time/N values -the time per
operation- so a ratio of 1.0 means each add+lookup costs the same amount of
time regardless of N: exactly the O(1) behavior we conjectured.)

All this code is in the hashing.py module in the empirical project folder: a
download that accompanies this lecture. So, you can run your own experiments.

------------------------------
IMPORTANT: True before Python 3.7; not true now (but interesting)

At present Python always computes exactly the same value when hashing a string
while a program is running; but when the program stops and a new program is
run, it gives a different value for the same string (but always the same one
for that run of the program). That makes things much harder to explain,
because I cannot use hashing examples and write their output here. Python's
hashing function uses a random number in the hash function, but one that is
the same for all hashing while a program runs. This is good for exposing
errors in code that uses hashing, but not so good for being able to show
examples of hashing. This is also why different runs of exactly the same
program with exactly the same data may produce different iteration orders for
sets (although Python 3.7 imposes extra constraints on iteration order, which
may or may not go away in future Pythons, so don't count on them).
------------------------------

Here is the method that implements iteration for psets. The order in which the
values in the pset are produced is: all those values (in order) in the list in
bin 0, then all those values (in order) in the list in bin 1, etc.

    def __iter__(self):
        for b in self._bins:
            for v in b:
                yield v

Recall that the values moved around when rehashed. That is why there is no
simple order in the sets/dicts we iterate over.

Finally, we have discussed that set values and dictionary keys cannot be
mutable. Now we can get some insight into why. If we put a value in its bin,
but then change its state (mutate it), the hash function would compute a
different result, and _bin_of would probably want it in a different bin. And
if it is in the wrong bin, looking for it, or trying to remove it, or trying
to add it (with no duplicates) will not work correctly.
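Here is a small sketch of that failure using Python's built-in set and a
deliberately bad class: hashable but mutable, with the hash depending on the
mutable state (Point is just an illustrative name).

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        return hash((self.x, self.y))   # BAD: depends on mutable state
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

p = Point(1, 2)
s = {p}
print(p in s)     # True
p.x = 99          # mutate p while it is in the set
print(p in s)     # False: the set probes the bucket for the new hash value,
                  #   where p was never stored
print(len(s))     # still 1: p is in the set, but "lost" in the wrong bucket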
------------------------------------------------------------------------------
DATA: Sorting first (on my old computer); Hash Table Next

Sorting Data: actual data

Sorting 100000 values
Analysis of 10 timings
avg = 0.037  min = 0.036  max = 0.038  span = 3.8%

                    Time Ranges
3.63e-02<>3.64e-02[ 10.0%]|*************************
3.64e-02<>3.66e-02[ 20.0%]|**************************************************
3.66e-02<>3.67e-02[  0.0%]|
3.67e-02<>3.68e-02[ 10.0%]|*************************
3.68e-02<>3.70e-02[ 20.0%]|**************************************************A
3.70e-02<>3.71e-02[ 20.0%]|**************************************************
3.71e-02<>3.72e-02[  0.0%]|
3.72e-02<>3.74e-02[  0.0%]|
3.74e-02<>3.75e-02[ 10.0%]|*************************
3.75e-02<>3.77e-02[  0.0%]|
3.77e-02<>3.78e-02[ 10.0%]|*************************

Sorting 200000 values
Analysis of 10 timings
avg = 0.080  min = 0.079  max = 0.081  span = 2.3%

                    Time Ranges
7.94e-02<>7.96e-02[ 10.0%]|*************************
7.96e-02<>7.98e-02[ 20.0%]|**************************************************
7.98e-02<>8.00e-02[  0.0%]|
8.00e-02<>8.02e-02[ 10.0%]|*************************
8.02e-02<>8.03e-02[ 10.0%]|*************************A
8.03e-02<>8.05e-02[ 10.0%]|*************************
8.05e-02<>8.07e-02[  0.0%]|
8.07e-02<>8.09e-02[ 20.0%]|**************************************************
8.09e-02<>8.11e-02[ 10.0%]|*************************
8.11e-02<>8.12e-02[  0.0%]|
8.12e-02<>8.14e-02[ 10.0%]|*************************

Sorting 400000 values
Analysis of 10 timings
avg = 0.178  min = 0.175  max = 0.181  span = 3.0%

                    Time Ranges
1.75e-01<>1.76e-01[ 20.0%]|*********************************
1.76e-01<>1.76e-01[  0.0%]|
1.76e-01<>1.77e-01[  0.0%]|
1.77e-01<>1.77e-01[ 10.0%]|****************
1.77e-01<>1.78e-01[ 30.0%]|**************************************************
1.78e-01<>1.79e-01[  0.0%]|A
1.79e-01<>1.79e-01[ 10.0%]|****************
1.79e-01<>1.80e-01[ 10.0%]|****************
1.80e-01<>1.80e-01[ 10.0%]|****************
1.80e-01<>1.81e-01[  0.0%]|
1.81e-01<>1.81e-01[ 10.0%]|****************

Sorting 800000 values
Analysis of 10 timings
avg = 0.416  min = 0.406  max = 0.457  span = 12.1%

                    Time Ranges
4.06e-01<>4.12e-01[ 60.0%]|**************************************************
4.12e-01<>4.17e-01[ 10.0%]|********A
4.17e-01<>4.22e-01[ 20.0%]|****************
4.22e-01<>4.27e-01[  0.0%]|
4.27e-01<>4.32e-01[  0.0%]|
4.32e-01<>4.37e-01[  0.0%]|
4.37e-01<>4.42e-01[  0.0%]|
4.42e-01<>4.47e-01[  0.0%]|
4.47e-01<>4.52e-01[  0.0%]|
4.52e-01<>4.57e-01[  0.0%]|
4.57e-01<>4.62e-01[ 10.0%]|********

Sorting 1600000 values
Analysis of 10 timings
avg = 0.945  min = 0.940  max = 0.952  span = 1.3%

                    Time Ranges
9.40e-01<>9.41e-01[ 30.0%]|**************************************************
9.41e-01<>9.42e-01[ 10.0%]|****************
9.42e-01<>9.43e-01[  0.0%]|
9.43e-01<>9.44e-01[  0.0%]|
9.44e-01<>9.46e-01[ 10.0%]|****************A
9.46e-01<>9.47e-01[ 10.0%]|****************
9.47e-01<>9.48e-01[  0.0%]|
9.48e-01<>9.49e-01[ 10.0%]|****************
9.49e-01<>9.50e-01[ 20.0%]|*********************************
9.50e-01<>9.52e-01[  0.0%]|
9.52e-01<>9.53e-01[ 10.0%]|****************

Sorting 3200000 values
Analysis of 10 timings
avg = 2.145  min = 2.124  max = 2.177  span = 2.5%

                    Time Ranges
2.12e+00<>2.13e+00[ 20.0%]|*********************************
2.13e+00<>2.13e+00[ 10.0%]|****************
2.13e+00<>2.14e+00[  0.0%]|
2.14e+00<>2.15e+00[ 30.0%]|**************************************************A
2.15e+00<>2.15e+00[ 10.0%]|****************
2.15e+00<>2.16e+00[ 10.0%]|****************
10.0%]|**************** 2.16e+00<>2.17e+00[ 0.0%]| 2.17e+00<>2.17e+00[ 0.0%]| 2.17e+00<>2.18e+00[ 0.0%]| 2.18e+00<>2.18e+00[ 10.0%]|**************** Sorting 6400000 values Analysis of 10 timings avg = 4.853 min = 4.833 max = 4.885 span = 1.1% Time Ranges 4.83e+00<>4.84e+00[ 30.0%]|************************************************** 4.84e+00<>4.84e+00[ 10.0%]|**************** 4.84e+00<>4.85e+00[ 10.0%]|**************** 4.85e+00<>4.85e+00[ 0.0%]|A 4.85e+00<>4.86e+00[ 0.0%]| 4.86e+00<>4.86e+00[ 30.0%]|************************************************** 4.86e+00<>4.87e+00[ 0.0%]| 4.87e+00<>4.87e+00[ 10.0%]|**************** 4.87e+00<>4.88e+00[ 0.0%]| 4.88e+00<>4.88e+00[ 0.0%]| 4.88e+00<>4.89e+00[ 10.0%]|**************** Sorting 12800000 values Analysis of 10 timings avg = 10.925 min = 10.819 max = 11.348 span = 4.8% Time Ranges 1.08e+01<>1.09e+01[ 40.0%]|************************************************** 1.09e+01<>1.09e+01[ 40.0%]|**************************************************A 1.09e+01<>1.10e+01[ 10.0%]|************ 1.10e+01<>1.10e+01[ 0.0%]| 1.10e+01<>1.11e+01[ 0.0%]| 1.11e+01<>1.11e+01[ 0.0%]| 1.11e+01<>1.12e+01[ 0.0%]| 1.12e+01<>1.12e+01[ 0.0%]| 1.12e+01<>1.13e+01[ 0.0%]| 1.13e+01<>1.13e+01[ 0.0%]| 1.13e+01<>1.14e+01[ 10.0%]|************ Sorting 25600000 values Analysis of 10 timings avg = 24.578 min = 24.388 max = 25.426 span = 4.2% Time Ranges 2.44e+01<>2.45e+01[ 40.0%]|**************************************** 2.45e+01<>2.46e+01[ 50.0%]|**************************************************A 2.46e+01<>2.47e+01[ 0.0%]| 2.47e+01<>2.48e+01[ 0.0%]| 2.48e+01<>2.49e+01[ 0.0%]| 2.49e+01<>2.50e+01[ 0.0%]| 2.50e+01<>2.51e+01[ 0.0%]| 2.51e+01<>2.52e+01[ 0.0%]| 2.52e+01<>2.53e+01[ 0.0%]| 2.53e+01<>2.54e+01[ 0.0%]| 2.54e+01<>2.55e+01[ 10.0%]|********** Sorting 51200000 values Analysis of 10 timings avg = 56.953 min = 56.651 max = 57.477 span = 1.5% Time Ranges 5.67e+01<>5.67e+01[ 30.0%]|************************************************** 5.67e+01<>5.68e+01[ 20.0%]|********************************* 5.68e+01<>5.69e+01[ 10.0%]|**************** 5.69e+01<>5.70e+01[ 10.0%]|****************A 5.70e+01<>5.71e+01[ 0.0%]| 5.71e+01<>5.71e+01[ 0.0%]| 5.71e+01<>5.72e+01[ 0.0%]| 5.72e+01<>5.73e+01[ 10.0%]|**************** 5.73e+01<>5.74e+01[ 10.0%]|**************** 5.74e+01<>5.75e+01[ 0.0%]| 5.75e+01<>5.76e+01[ 10.0%]|**************** ------------------------------------------------------------------------------ Note when building a hash table of size = N, the code looks up every value in the table, so for size = 2N it does twice as many look ups. So we should really compute the amount of time/look up for each size, which stays relatively constant, because both the time and N double. 
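Before examining the hash-table data, here is a quick sanity check on the
sorting numbers above: a small sketch (with the sizes and average times
transcribed by hand from the runs) that compares each measured doubling ratio
against the ratio an N Log N algorithm predicts.

  import math

  # (N, average seconds) transcribed from the sorting runs above
  data = [(100_000, 0.037),     (200_000, 0.080),    (400_000, 0.178),
          (800_000, 0.416),     (1_600_000, 0.945),  (3_200_000, 2.145),
          (6_400_000, 4.853),   (12_800_000, 10.925),
          (25_600_000, 24.578), (51_200_000, 56.953)]

  for (n1, t1), (n2, t2) in zip(data, data[1:]):
      # If T(N) = c*N*log2(N), then T(2N)/T(N) = 2*log2(2N)/log2(N),
      # which is slightly above 2 and creeps toward 2 as N grows.
      predicted = 2 * math.log2(n2) / math.log2(n1)
      print(f'{n2:>12,}: measured ratio {t2/t1:.2f}; N log N predicts {predicted:.2f}')

The measured ratios sit a bit above the pure N Log N prediction, especially at
the largest sizes; my conjecture (not something this data proves) is that
memory and cache effects inflate the times for the biggest lists.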
Sets via Hash Table: Size = 1000
Analysis of 20 timings
 avg = 0.028  min = 0.028  max = 0.030  span = 7.5%
          Time Ranges
2.77e-02<>2.79e-02[ 25.0%]|*****************************************
2.79e-02<>2.81e-02[ 15.0%]|*************************
2.81e-02<>2.83e-02[ 30.0%]|**************************************************A
2.83e-02<>2.85e-02[ 15.0%]|*************************
2.85e-02<>2.87e-02[  5.0%]|********
2.87e-02<>2.89e-02[  5.0%]|********
2.89e-02<>2.91e-02[  0.0%]|
2.91e-02<>2.93e-02[  0.0%]|
2.93e-02<>2.95e-02[  0.0%]|
2.95e-02<>2.98e-02[  0.0%]|
2.98e-02<>3.00e-02[  5.0%]|********

Sets via Hash Table: Size = 2000
Analysis of 20 timings
 avg = 0.056  min = 0.055  max = 0.057  span = 2.8%
          Time Ranges
5.54e-02<>5.55e-02[ 10.0%]|*********************************
5.55e-02<>5.57e-02[ 15.0%]|**************************************************
5.57e-02<>5.59e-02[ 15.0%]|**************************************************
5.59e-02<>5.60e-02[ 10.0%]|*********************************
5.60e-02<>5.62e-02[ 15.0%]|**************************************************A
5.62e-02<>5.63e-02[ 10.0%]|*********************************
5.63e-02<>5.65e-02[ 10.0%]|*********************************
5.65e-02<>5.66e-02[ 10.0%]|*********************************
5.66e-02<>5.68e-02[  0.0%]|
5.68e-02<>5.70e-02[  0.0%]|
5.70e-02<>5.71e-02[  5.0%]|****************

Sets via Hash Table: Size = 4000
Analysis of 20 timings
 avg = 0.117  min = 0.111  max = 0.161  span = 42.2%
          Time Ranges
1.11e-01<>1.16e-01[ 80.0%]|**************************************************
1.16e-01<>1.21e-01[ 10.0%]|******A
1.21e-01<>1.26e-01[  0.0%]|
1.26e-01<>1.31e-01[  5.0%]|***
1.31e-01<>1.36e-01[  0.0%]|
1.36e-01<>1.41e-01[  0.0%]|
1.41e-01<>1.46e-01[  0.0%]|
1.46e-01<>1.51e-01[  0.0%]|
1.51e-01<>1.56e-01[  0.0%]|
1.56e-01<>1.61e-01[  0.0%]|
1.61e-01<>1.66e-01[  5.0%]|***

Sets via Hash Table: Size = 8000
Analysis of 20 timings
 avg = 0.229  min = 0.223  max = 0.247  span = 10.7%
          Time Ranges
2.23e-01<>2.25e-01[ 20.0%]|******************
2.25e-01<>2.28e-01[  5.0%]|****
2.28e-01<>2.30e-01[ 55.0%]|**************************************************A
2.30e-01<>2.33e-01[ 10.0%]|*********
2.33e-01<>2.35e-01[  5.0%]|****
2.35e-01<>2.37e-01[  0.0%]|
2.37e-01<>2.40e-01[  0.0%]|
2.40e-01<>2.42e-01[  0.0%]|
2.42e-01<>2.45e-01[  0.0%]|
2.45e-01<>2.47e-01[  0.0%]|
2.47e-01<>2.50e-01[  5.0%]|****

Sets via Hash Table: Size = 16000
Analysis of 20 timings
 avg = 0.452  min = 0.447  max = 0.461  span = 3.1%
          Time Ranges
4.47e-01<>4.49e-01[ 20.0%]|****************************************
4.49e-01<>4.50e-01[ 20.0%]|****************************************
4.50e-01<>4.51e-01[ 25.0%]|**************************************************
4.51e-01<>4.53e-01[ 10.0%]|********************A
4.53e-01<>4.54e-01[  5.0%]|**********
4.54e-01<>4.56e-01[  5.0%]|**********
4.56e-01<>4.57e-01[  0.0%]|
4.57e-01<>4.59e-01[  5.0%]|**********
4.59e-01<>4.60e-01[  0.0%]|
4.60e-01<>4.61e-01[  5.0%]|**********
4.61e-01<>4.63e-01[  5.0%]|**********

Sets via Hash Table: Size = 32000
Analysis of 20 timings
 avg = 0.908  min = 0.897  max = 0.918  span = 2.3%
          Time Ranges
8.97e-01<>8.99e-01[ 20.0%]|**************************************************
8.99e-01<>9.01e-01[  5.0%]|************
9.01e-01<>9.03e-01[  5.0%]|************
9.03e-01<>9.05e-01[  5.0%]|************
9.05e-01<>9.08e-01[ 15.0%]|*************************************
9.08e-01<>9.10e-01[ 10.0%]|*************************A
9.10e-01<>9.12e-01[  0.0%]|
9.12e-01<>9.14e-01[ 15.0%]|*************************************
9.14e-01<>9.16e-01[ 10.0%]|*************************
9.16e-01<>9.18e-01[ 10.0%]|*************************
9.18e-01<>9.20e-01[  5.0%]|************

Sets via Hash Table: Size = 64000
Analysis of 20 timings
 avg = 1.811  min = 1.796  max = 1.837  span = 2.2%
          Time Ranges
1.80e+00<>1.80e+00[ 20.0%]|**************************************************
1.80e+00<>1.80e+00[ 20.0%]|**************************************************
1.80e+00<>1.81e+00[ 15.0%]|*************************************
1.81e+00<>1.81e+00[  5.0%]|************A
1.81e+00<>1.82e+00[ 10.0%]|*************************
1.82e+00<>1.82e+00[ 10.0%]|*************************
1.82e+00<>1.82e+00[  0.0%]|
1.82e+00<>1.83e+00[ 10.0%]|*************************
1.83e+00<>1.83e+00[  5.0%]|************
1.83e+00<>1.84e+00[  0.0%]|
1.84e+00<>1.84e+00[  5.0%]|************

Sets via Hash Table: Size = 128000
Analysis of 20 timings
 avg = 3.612  min = 3.597  max = 3.637  span = 1.1%
          Time Ranges
3.60e+00<>3.60e+00[ 10.0%]|****************
3.60e+00<>3.60e+00[  0.0%]|
3.60e+00<>3.61e+00[ 30.0%]|**************************************************
3.61e+00<>3.61e+00[ 15.0%]|*************************A
3.61e+00<>3.62e+00[ 20.0%]|*********************************
3.62e+00<>3.62e+00[ 10.0%]|****************
3.62e+00<>3.62e+00[ 10.0%]|****************
3.62e+00<>3.63e+00[  0.0%]|
3.63e+00<>3.63e+00[  0.0%]|
3.63e+00<>3.64e+00[  0.0%]|
3.64e+00<>3.64e+00[  5.0%]|********

Sets via Hash Table: Size = 256000
Analysis of 20 timings
 avg = 7.211  min = 7.186  max = 7.238  span = 0.7%
          Time Ranges
7.19e+00<>7.19e+00[ 10.0%]|****************
7.19e+00<>7.20e+00[  0.0%]|
7.20e+00<>7.20e+00[  0.0%]|
7.20e+00<>7.21e+00[ 20.0%]|*********************************
7.21e+00<>7.21e+00[ 30.0%]|**************************************************A
7.21e+00<>7.22e+00[ 20.0%]|*********************************
7.22e+00<>7.22e+00[  5.0%]|********
7.22e+00<>7.23e+00[  5.0%]|********
7.23e+00<>7.23e+00[  0.0%]|
7.23e+00<>7.24e+00[  5.0%]|********
7.24e+00<>7.24e+00[  5.0%]|********
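Following the note above, here is a short sketch (with the averages
transcribed by hand from the hash-table runs) that computes the time per
lookup at each size.

  # (table size, average build seconds) transcribed from the runs above
  data = [(1_000, 0.028),   (2_000, 0.056),   (4_000, 0.117),
          (8_000, 0.229),   (16_000, 0.452),  (32_000, 0.908),
          (64_000, 1.811),  (128_000, 3.612), (256_000, 7.211)]

  for size, seconds in data:
      # building a table of size N performs N lookups, so divide by N
      print(f'{size:>8,}: {seconds/size:.2e} seconds per lookup')

Every size comes out near 2.8e-05 seconds per lookup, constant to within about
5% across a 256x range of sizes: exactly the O(1) behavior for hashing that we
set out to verify empirically.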