Iterators via Classes

In this lecture we will first learn how to fix a problem with the prange class
that we wrote in the last lecture. Next, we will look at one type that stores
and manipulates data, and also allows iteration over its data. Finally, we will
begin to explore classes that operate on iterables and produce other iterables
(sorted and reversed are two examples built into Python, but there are many
other interesting and useful ones). We will finish the week with a lecture
about a special kind of function called a generator, which provides an
excellent mechanism for writing iterators (and iterators that process
iterators).

------------------------------------------------------------------------------

Fixing a sharing problem with a nested class definition

In the last lecture we discussed various classes that implemented the iterator
protocol (by implementing the methods __iter__ and __next__). Typically in
these cases (for both the Countdown and prange classes) the whole purpose of
the class was to create an object to iterate over once. We processed objects
from these classes only (or primarily) by iterating over them. Contrast these
classes with the list/tuple/set/dict classes, whose objects we often iterate
over but on which we perform many other operations too: e.g., examining and
updating the data in these objects.

Often we would construct an object from Countdown and prange only in a for
loop: e.g.,

    for i in prange(...) : ....

not even binding a name to such an object, so the object couldn't be reused.

But at the end of the last lecture we discussed sharing, and we will start our
discussion by looking at another example of sharing, and how to fix a defect in
our first implementation of the prange class. The fix involves defining and
constructing a class nested in the prange class, whose sole purpose is to
provide a __next__ method. We will see that when we write iterators for more
classes, controlling more complicated data, this same technique works nicely.
To illustrate the defect, we first define the following function, which uses a
doubly-iterating comprehension. Here i1 and i2 are two objects that are
iterable. They are simultaneously iterated over in all_pair.

    def all_pair(i1,i2):
        return [(x,y) for x in i1 for y in i2]

This code is equivalent to

    def all_pair(i1,i2):
        answer = []
        for x in i1:
            for y in i2:
                answer.append((x,y))
        return answer

Now let's run this function in various interesting ways.

    a = range (3)
    b = prange(3)
    print(all_pair(a,b))  # use a and b
    print(all_pair(b,a))  # use b and a
    print(all_pair(a,a))  # use only a
    print(all_pair(b,b))  # use only b

These four print statements produce the following results. Notice the first
three results are the same (producing tuples containing all pairs of the values
in the range), but differ from the fourth (producing only the first three of
the tuples).

    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
    [(0, 0), (0, 1), (0, 2)]

The problem is that when __iter__ is called on a prange object it returns the
object on which __iter__ was called. The last use of b (using it twice in a
call to all_pair) above produces the wrong results. Python's standard range
class doesn't suffer from this problem, and we will fix prange to operate
likewise. As a reminder, here is how prange defined __iter__ and __next__.

    def __iter__(self):
        self.n = self.start
        return self    # must return an object on which __next__ can be called

    def __next__(self):
        if self.step > 0 and self.n >= self.stop or \
           self.step < 0 and self.n <= self.stop:
            raise StopIteration
        save = self.n
        self.n += self.step
        return save

__iter__ sets a new instance variable, n (new in the sense that __init__ did
not set it), and returns self; __next__ (written in prange) uses this instance
variable and the stop/step instance variables to get the job done.
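The sharing can be observed directly with explicit calls to iter and next. The
sketch below assumes a simplified prange whose __init__ takes just start, stop,
and step (the full class from the last lecture is not reproduced here, so that
constructor is an assumption); its __iter__ and __next__ are the ones shown
above.

```python
class prange:
    # Simplified constructor: an assumption made for this sketch
    def __init__(self, start, stop, step=1):
        self.start, self.stop, self.step = start, stop, step

    def __iter__(self):
        self.n = self.start
        return self              # the iterator IS the prange object

    def __next__(self):
        if self.step > 0 and self.n >= self.stop or \
           self.step < 0 and self.n <= self.stop:
            raise StopIteration
        save = self.n
        self.n += self.step
        return save

b = prange(0, 3)
outer = iter(b)                  # sets b.n to 0
first = next(outer)              # returns 0; b.n is now 1
inner = iter(b)                  # resets b.n to 0: same object, same state
same_object = outer is inner
inner_values = [next(inner) for _ in range(3)]   # advances b.n to 3
leftover = next(outer, None)     # None: outer already looks exhausted

print(same_object, first, inner_values, leftover)
```

Because outer and inner are the same object, finishing the inner iteration
leaves nothing for the outer one to produce.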
The problem is that if we try to doubly iterate over the same prange object, it
will be in the middle of one iteration when a second iteration starts,
clobbering the state maintained by the first (because both are using the same
prange object). When the second iteration finishes, Python thinks the first is
finished too, which is why all_pair(b,b) stopped after generating (0,0), (0,1),
and (0,2).

We will now fix this problem. We define the class prange_iter inside the
__iter__ method. Every time we call __iter__ it creates a new object with its
own state, so multiple iterations on the same prange object don't interact
badly with each other. The prange_iter class has just two methods: __init__,
which initializes it with the same information as the prange object, and
__next__, which has the same code as the method above (but now applying to the
prange_iter object). The final line of code returns
prange_iter(self.start,self.stop,self.step): an object from a class that
implements __next__ (as is required).

    def __iter__(self):
        class prange_iter:
            def __init__(self,start,stop,step):
                self.n    = start
                self.stop = stop
                self.step = step

            def __next__(self):
                if self.step > 0 and self.n >= self.stop or \
                   self.step < 0 and self.n <= self.stop:
                    raise StopIteration
                save = self.n
                self.n += self.step
                return save

        return prange_iter(self.start,self.stop,self.step)

All of the examples in this lecture will write code like this, with a class
defining __init__ and __next__ inside the __iter__ method. The total amount of
code isn't much bigger, but it is certainly more complicated to define and use
the class inside the __iter__ method (and to think about it). But now every
call to __iter__ returns a different object on which __next__ operates, so two
uses of the same prange object (as in all_pair(b,b) above) will now work
correctly.
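To check the fix, here is a minimal self-contained sketch that combines the new
__iter__ with all_pair. As before, the three-parameter __init__ is a
simplification assumed for illustration, not the constructor from the last
lecture.

```python
class prange:
    # Simplified constructor: an assumption made for this sketch
    def __init__(self, start, stop, step=1):
        self.start, self.stop, self.step = start, stop, step

    def __iter__(self):
        class prange_iter:
            def __init__(self, start, stop, step):
                self.n    = start
                self.stop = stop
                self.step = step

            def __next__(self):
                if self.step > 0 and self.n >= self.stop or \
                   self.step < 0 and self.n <= self.stop:
                    raise StopIteration
                save = self.n
                self.n += self.step
                return save

        # a brand new iterator object, with its own n, on every call
        return prange_iter(self.start, self.stop, self.step)

def all_pair(i1, i2):
    return [(x, y) for x in i1 for y in i2]

b = prange(0, 3)
pairs = all_pair(b, b)
distinct = iter(b) is not iter(b)    # each call builds a separate iterator

print(pairs)
print(distinct)
```

The prange object itself is never changed by iteration now; all the changing
state (n) lives in the prange_iter objects.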
Finally, we could also write the following code, which declares prange_iter not
inside __iter__ but in the prange class itself. The code for class prange_iter
is identical (except outdented), but the body of __iter__ is now just one line
calling the constructor for this class.

    class prange_iter:
        def __init__(self,start,stop,step):
            self.n    = start
            self.stop = stop
            self.step = step

        def __next__(self):
            if self.step > 0 and self.n >= self.stop or \
               self.step < 0 and self.n <= self.stop:
                raise StopIteration
            save = self.n
            self.n += self.step
            return save

    def __iter__(self):
        return prange.prange_iter(self.start,self.stop,self.step)

------------------------------------------------------------------------------

Classes that store interesting data and have iterators over the data

Examine the definition of the following class that stores and processes
histograms. For simplicity we will assume it processes percentages (ints from
0 to 100) and places them in 10 bins: 0-9, 10-19, 20-29, ... 80-89, 90-100;
note that the last bin really represents 11 values, while all the others
represent 10 values. Of course we will focus on how to accomplish iteration for
objects of this class (iterating over the counts in their bins), but there are
other interesting aspects of this class that we will discuss first (and we
could always generalize or add methods to make this class even more powerful).
    class Percent_Histogram:
        # Called when 0<=p<=100: 100//10 is 10 but belongs in index 9
        def _tally(self,p):
            self.histogram[p//10 if p<100 else 9] += 1

        def __init__(self,percents):
            self.histogram = 10*[0]       # [0,0,0,...,0,0] length 10
            for p in percents:
                self.tally(p)

        def clear(self):
            for i in range(10):           # vs self.histogram = 10*[0]
                self.histogram[i] = 0

        # tally allows any number of arguments, collected in a tuple
        def tally(self,*args):
            for p in args:
                if 0 <= p <= 100:
                    self._tally(p)
                else:
                    raise IndexError('Percent_Histogram.tally: '+str(p)+
                                     ' outside [0,100]')

        # allow indexing for bins [0-9]
        # but can mutate these values only through __init__, clear, and tally
        # no __setitem__ defined
        def __getitem__(self,bin_num):
            if 0 <= bin_num <= 9:
                return self.histogram[bin_num]
            else:
                raise IndexError('Percent_Histogram.__getitem__: '+
                                 str(bin_num)+' outside [0,9]')

        # standard __iter__: defines a class with __init__/__next__ and returns
        #   an object from that class
        def __iter__(self):
            class PH_iter:
                def __init__(self,histogram):
                    self.histogram = histogram   # list(histogram) for copy
                    self.next = 0

                def __next__(self):
                    if self.next == 10:
                        raise StopIteration
                    answer = self.histogram[self.next]
                    self.next += 1
                    return answer

            return PH_iter(self.histogram)

        # To reconstruct a call to __init__ that reproduces the correct counts
        #   in the histogram, we supply the right number of values at the start
        #   of the bin: e.g., if bin 5 has 3 items, the repr has three 50s in it
        def __repr__(self):
            param = []
            for i in range(10):
                param += self[i]*[i*10]
            return 'Percent_Histogram('+str(param)+')'

        # a 2-dimensional display; do you know how to use .format?
        def __str__(self):
            return '\n'.join(['[{l: >2}-{h: >3}] | {s}'.format(
                                  l=10*i, h=10*i+9 if i != 9 else 100,
                                  s=self[i]*'*')
                              for i in range(10)])

Notes:

0) The _tally function is supposed to be called only by methods defined in
   this class.
   It puts a number in the range [0,100] into the correct bin, treating 100
   specially (it belongs in bin 9, but p//10 would put it in bin 10, which
   doesn't exist).

1) The __init__ method uses the idiom 10*[0], which you should know. If not,
   experiment with it.

2) The clear method sets each bin in the list to 0; we could have allocated a
   new list as shown in the comment, but generally that takes more time and
   occupies more space.

3) By using *args, the tally method can have any number (0 or more) of
   positional arguments. All arguments are collected into a tuple that is
   iterated over to process the values individually.

4) The __getitem__ method allows us to index all the bins, 0-9 inclusive. Note
   that we can set values into these bins (i.e., mutate the list) only via
   __init__, clear, and tally. So we call this information read-only: we can
   read it by indexing but not write it by indexing.

5) We use the now standard way to implement __iter__, by defining a class that
   defines __next__ and returning an object from that class. We will discuss
   how changing self.histogram = histogram to self.histogram = list(histogram)
   changes the iterator.

6) The __repr__ method doesn't know what numbers went into the bins, but we can
   use the lowest number in each bin, repeated by the count in the bin, to
   specify the list needed to construct an equivalent object (with the
   equivalent number of values in each bin) with the constructor.

7) The __str__ method returns a two-dimensional plot of the histogram. Do you
   all know how to use the format method for strings? If not you should look it
   up (it is described online using something like EBNF) and practice using it.
   You should certainly be able to tell me how the string that .format is
   called on produces the result you'll see below.

When Python executes the following script:

    quiz1 = Percent_Histogram([50, 55, 70, 75, 85, 100])
    quiz1.tally(20,30,95)
    print(quiz1.__repr__())
    print(quiz1)
    for count in quiz1:
        print(count,end=' ')

It prints the following information:

    Percent_Histogram([20, 30, 50, 50, 70, 70, 80, 90, 90])
    [ 0-  9] |
    [10- 19] |
    [20- 29] | *
    [30- 39] | *
    [40- 49] |
    [50- 59] | **
    [60- 69] |
    [70- 79] | **
    [80- 89] | *
    [90-100] | **
    0 0 1 1 0 2 0 2 1 2

Normally we would use this class in a program that reads a file of scores.

Now, what would happen if we executed the following code?

    for count in quiz1:
        print(count,end=' ')
        quiz1.tally(100)

It would print:

    0 0 1 1 0 2 0 2 1 11

Note that mutating the quiz1 object during iteration results in the iterator
producing the new, accumulated value for the last bin: by the time the iterator
reaches that bin, tally(100) has been called 9 times, so its count has grown
from 2 to 11. That is because the PH_iter object refers to the same list that
the tally method increments. So that sharing results in the iterator always
returning the most up-to-date value in the list.

What if we wanted the iterator to produce the values in the histogram as they
were when the iteration started, and not show any updates after that? The
change is trivial: in PH_iter's __init__ (inside __iter__) we change

    self.histogram = histogram

to

    self.histogram = list(histogram)

Now instead of this iterator object sharing the list being used for the
histogram, it has its own copy: a new/different list, but storing all the same
values. So, changes to the original will not change the result of the
iteration. The cost: extra space used for the list (not much, because the list
contains just 10 values) and some extra time to construct the list.

So, we need to decide (and document) the semantics for our iterators. Can you
tell (and if so, with what code) what decision was made for the list iterator,
and discuss why you think the designers made that decision?
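The two semantics can be compared side by side in a small sketch. The Bins
class, its snapshot flag, and the iterate_and_bump function are all
hypothetical names invented for this illustration (a three-bin stand-in for
Percent_Histogram); only the share-versus-copy choice in __iter__ comes from
the discussion above.

```python
class Bins:
    # Hypothetical stand-in for Percent_Histogram with just three counters;
    # snapshot chooses whether iterators copy the list or share it
    def __init__(self, snapshot):
        self.counts = [0, 0, 0]
        self.snapshot = snapshot

    def bump(self, i):
        self.counts[i] += 1

    def __iter__(self):
        class Bins_iter:
            def __init__(self, counts):
                self.counts = counts
                self.next = 0

            def __next__(self):
                if self.next == len(self.counts):
                    raise StopIteration
                answer = self.counts[self.next]
                self.next += 1
                return answer

        # list(...) builds a private copy; otherwise the list is shared
        return Bins_iter(list(self.counts) if self.snapshot else self.counts)

def iterate_and_bump(bins):
    seen = []
    for count in bins:
        seen.append(count)
        bins.bump(2)             # mutate the last bin during the iteration
    return seen

shared = iterate_and_bump(Bins(snapshot=False))
copied = iterate_and_bump(Bins(snapshot=True))

print(shared)    # the last value reflects the two bumps made before reaching it
print(copied)    # all values are those present when the iteration started
```

Either choice is defensible; what matters is that the class documents which one
its iterators implement.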
------------------------------------------------------------------------------

Decorators: Classes that are initialized by/implement iterables

Since iterators are so important, it is useful to have a bag of classes (this
lecture) and functions (next lecture) that operate on iterables to produce more
iterables. When a class takes an object that has methods implementing a certain
protocol and returns an object that implements the same protocol, the class is
called a decorator. We will write a bunch of classes that decorate iterables
below (and even more in the next lecture). These are all pretty simple to think
about, and while the code is complicated, it is complicated in the same way
each time.

Here is a first example of a decorator for iterators and a refinement of it.
The Repeat class takes an iterable as an argument and implements an __iter__
method whose iterator repeats that iterable over and over: whenever it runs out
of values to produce, the entire sequence of values is produced again. We can
test this class with any iterable, and strings are the simplest, so we will use
a string. If we run the script

    for i in Repeat("abcde"):
        print(i,end='')

Python would print: abcdeabcdeabcde ... and keep going forever

Here is that class

    class Repeat:
        def __init__(self,iterable):
            self.iterable = iterable

        def __iter__(self):
            class Repeat_iter:
                def __init__(self,iterable):
                    self.iterable = iterable        # remember for reuse in next
                    self.iterator = iter(iterable)  # remember for direct use in next

                def __next__(self):
                    try:
                        return next(self.iterator)
                    except StopIteration:
                        self.iterator = iter(self.iterable)  # reuse iterable
                        return next(self)

            return Repeat_iter(self.iterable)

This uses the same define-a-class-in-the-__iter__-method technique used in both
classes above.

We can generalize this class to Repeat an iterable either at most some fixed
number of times or forever, using the following class.
If the second argument to __init__ is an integer, it repeats the iterable at
most that many times; if there is no second argument, the iterator repeats
forever (as above).

    class Repeat:
        def __init__(self,iterable,max_times=None):
            self.iterable  = iterable
            self.max_times = max_times

        def __iter__(self):
            class Repeat_iter:
                def __init__(self,iterable,max_times):
                    self.iterable  = iterable
                    self.max_times = max_times
                    self.iterator  = iter(iterable)

                def __next__(self):
                    if self.max_times is not None and self.max_times <= 0:
                        raise StopIteration
                    else:
                        try:
                            return next(self.iterator)
                        except StopIteration:
                            if self.max_times is not None:
                                self.max_times -= 1
                            self.iterator = iter(self.iterable)
                            return next(self)

            return Repeat_iter(self.iterable,self.max_times)

If we run the script

    for i in Repeat("abcde",3):
        print(i,end='')

Python would print: abcdeabcdeabcde

Here is a third decorator for iterators. It returns all the values in an
iterable, but never the same value twice. We call this class Unique. It works
by keeping a set in each Unique_iter object that remembers and bypasses any
value already returned from that iterator object.

    class Unique:
        def __init__(self,iterable):
            self.iterable = iterable

        def __iter__(self):
            class Unique_iter:
                def __init__(self,iterable):
                    self.iterated = set()
                    self.iterator = iter(iterable)

                def __next__(self):
                    answer = next(self.iterator)
                    while answer in self.iterated:
                        answer = next(self.iterator)
                    self.iterated.add(answer)
                    return answer

            return Unique_iter(self.iterable)

If we run the script

    for i in Unique('abcxyabdxzbcxyabdxz'):
        print(i,end='')

Python prints: abcxydz

We can also generalize this class by specifying the maximum number of times a
value can be returned (with a default argument of 1, which brings us back to
Unique, since it allows values to be returned only once).
    from collections import defaultdict

    class Unique:
        def __init__(self,iterable,max_times=1):
            self.iterable  = iterable
            self.max_times = max_times

        def __iter__(self):
            class Unique_iter:
                def __init__(self,iterable,max_times):
                    self.times     = defaultdict(int)
                    self.iterator  = iter(iterable)
                    self.max_times = max_times

                def __next__(self):
                    answer = next(self.iterator)
                    while self.times[answer] >= self.max_times:
                        answer = next(self.iterator)
                    self.times[answer] += 1
                    return answer

            return Unique_iter(self.iterable,self.max_times)

If we run the script:

    for i in Unique('abcxyabdxzbcxyabdxz',2):
        print(i,end='')

Python prints: abcxyabdxzcydz

As another example, we will write the Filter class, which is supplied with a
predicate: a function of one argument that returns a bool, indicating whether a
value should be returned or filtered out. When a value is filtered out, next is
called again, until it finds a value to return that is ok by the predicate.

    class Filter:
        def __init__(self,iterable,predicate):
            self.iterable  = iterable
            self.predicate = predicate

        def __iter__(self):
            class Filter_iter:
                def __init__(self,iterable,predicate):
                    self.iterator  = iter(iterable)
                    self.predicate = predicate

                def __next__(self):
                    answer = next(self.iterator)
                    while not self.predicate(answer):
                        answer = next(self.iterator)
                    return answer

            return Filter_iter(self.iterable,self.predicate)

If we run the script:

    for i in Filter('abcdefghijklmnopqrstuvwxyz',
                    lambda x : x not in 'richardpattis'):
        print(i,end='')

Python prints all the letters not in my name: befgjklmnoquvwxyz

Notice that the Repeat, Unique, and Filter classes all implement their
iterators similarly, with the same pattern of code. In the next lecture we will
rewrite these decorators -and more- much more simply as generator functions,
which capture the pattern above easily.

Here is a last decorator for iterators. It returns all the values in an
iterable but in sorted order.
We collect all the values from the iterable into a list, then sort the list and
return the list's iterator (since the values are now all in the correct order).
We cannot return the smallest value until we have seen all the values.

    class psorted:  # pseudo-sorted: works just like sorted
        def __init__(self,iterable,key=None,reverse=False):
            self.result = list(iterable)
            self.result.sort(key=key,reverse=reverse)

        def __iter__(self):
            return iter(self.result)

Notice how we can combine these classes below. Suppose I want to print out all
the letters in my name in alphabetical order, with no repetition of letters.
I can do it with the following script.

    for i in Unique(psorted('richardpattis')):
        print(i,end='')

It prints: acdhiprst

What would the following script print? It reverses the order of the classes.

    for i in psorted(Unique('richardpattis')):
        print(i,end='')
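As a quick check of the claim that psorted works just like sorted, the sketch
below compares the two on a small list. The words list and the variable names
are made up for this illustration; the class is the one defined above.

```python
class psorted:  # pseudo-sorted, as defined above
    def __init__(self, iterable, key=None, reverse=False):
        self.result = list(iterable)
        self.result.sort(key=key, reverse=reverse)

    def __iter__(self):
        return iter(self.result)

words = ['pear', 'Fig', 'apple', 'kiwi']      # sample data for illustration

# psorted produces the same values, in the same order, as sorted
same_default = list(psorted(words)) == sorted(words)
same_options = (list(psorted(words, key=len, reverse=True)) ==
                sorted(words, key=len, reverse=True))

# __iter__ returns a fresh list iterator each time, so one psorted object
# can be iterated over more than once
p = psorted(words)
same_twice = list(p) == list(p)

print(same_default, same_options, same_twice)
```

One difference worth noticing: sorted returns a list, while psorted returns an
object that can only be iterated over. The values are collected when the
psorted object is constructed, not when iteration begins.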