ICS 33 Spring 2024
Notes and Examples: Generators


Background

Previously, we saw that Python's design includes two related concepts, iterables and iterators, that address the problem of obtaining a sequence of values one at a time.

Data structures like lists, tuples, sets, and dictionaries certainly fall into the category of objects whose contents we might want to iterate. But iteration is not only useful in conjunction with data structures. Iteration just means "I want a sequence of values, one at a time," which comes up surprisingly often in software design. Ranges, as we've seen, feel like they might be data structures, but really aren't. Their contents are entirely formulaic, so, as long as they know the formula — the start, the stop, and the step — they can quickly calculate any values that are needed. Their design makes one of computer science's classic tradeoffs: spending time to recalculate their values, rather than spending memory to remember what they are. (This is a good trade to make when the cost of recalculation is low, which is certainly the case with a range, since calculating an element boils down to one addition and one multiplication.)
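
To make that formula concrete, here's a minimal sketch (not how Python's built-in range is actually implemented, and using a hypothetical range_element function) of how any element can be calculated on demand from the start and the step alone.


def range_element(start, step, index):
    # One multiplication and one addition are enough to find any element.
    return start + step * index

# For example, range_element(3, 2, 4) is 11, which matches range(3, 100, 2)[4].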

When we need to process the lines of text from a file sequentially, one straightforward approach is the one shown below, which breaks the problem into a sequence of simple steps.


# Step 1: Read the file into a string variable
with open(file_path, 'r') as the_file:
    contents = the_file.read()

# Step 2: Split that string into a list of strings, where each line of text in
# the original string is one element of the list.
lines = contents.splitlines()

# Step 3: Iterate through the list of strings and process each one.  It's not
# particularly important what the process function does here, as we're focused
# on a general approach rather than a particular problem, but let's assume that
# we can process each line entirely separately from the others.
for line in lines:
    process(line)

This approach is not entirely without merit. It's simple to understand, because it does the job the way you might describe it to someone: read the file, split it into lines, and process the lines. We do want all three of those things to be done, and we could certainly do them sequentially like this. Once we have all of the text loaded, we have everything we need to be able to split it into separate lines. Once we have all the lines split up, we have everything we need to be able to process them.

But this approach also has a cost that we need to consider here. What happens when the file contains 10,000,000 lines of text, each containing 1,000 characters of text? Here's what our approach will do.

Step 1 will read the entire file into a single string, storing all 10,000,000,000 of those characters at once, which is on the order of 10 GB of memory. Step 2 will then split that string into a list of 10,000,000 strings, which together store a second copy of essentially the same text, roughly another 10 GB. Only then will Step 3 begin processing the lines, one at a time, while both copies are still being stored.

It's important to point out that some of these costs are essential. We can't process text that we never read, so we'll have to read all of it, and we'll have to spend an amount of time proportional to the size of the file to do so.

But some of these costs are not essential at all. If our goal is to process lines separately, there's never a point in time at which we need to be storing all of them. All we need is one at a time. (It might be more convenient to store them all, but it's not a necessity.)

Or, to put this into the context of asymptotic notation, we're spending O(n) memory (where there are n lines of text in the file) when we could instead be spending O(1) memory. As a practical matter, if n is small, we probably aren't that concerned about it. But if n can grow large, then reducing an O(n) memory requirement to O(1) makes it possible to process large files, where we might otherwise be unable to do so, even if we have plenty of time to wait for the result.

So, how might we fix this problem? Let's consider what you've already seen in previous coursework about reading from text files.

File objects offer a readlines() method

Python's file objects include a readlines() method, which returns all of the (unread) text in the file, organized as a list of strings, where each string in the list is one line of text from the file. With that, we could combine Steps 1 and 2 from our original solution into a single step.


# Step 1. Read the lines of text from the file and store them in a list
# of strings.
with open(file_path, 'r') as the_file:
    lines = the_file.readlines()

# Step 2. Iterate through the list of strings and process each one.
for line in lines:
    process(line)

How much memory did we save in making this change? Instead of storing all of the text in one string, and then storing it again in a list of strings, we'll only be storing one copy of the text. This cut our memory requirement roughly in half — we'll need around 10 GB for our large file instead of 20 GB — but, notably, this is still O(n) memory (i.e., we would still need proportionally more memory for a larger file than we would for a smaller one). In practice, these kinds of changes can sometimes be good enough (e.g., if you know that the files will never be larger than a certain size), but they don't change the general scale of what we can solve. If larger files require proportionally more memory, then ever-larger files will eventually be a problem.

So, this was an improvement, but not a panacea. If we can't find a way to process the first line before reading the second, process the second line before reading the third, and so on, we'll always have this problem. To address that, we'll need to fuse our steps together.

Fusing the steps together

Fortunately, Python's file objects give us a way to do that, because they are themselves iterable. When we iterate them, they produce one line of text at a time, with the iteration ending when there are no more lines of text in the file. We've seen that iterable objects can be used to drive for loops, so, if file objects are iterable, we would expect to be able to drive a for loop with a file object.


with open(file_path, 'r') as the_file:
    for line in the_file:
        process(line)

Reading the file and processing its lines are no longer separate steps at all. They're fused together into what you might call a pipeline. Each time we read a line from the file, we pass that line (and only that line) to our process function and wait until that's finished before we read the next line.

What is our memory requirement now? It depends on what happens behind the scenes when we iterate the lines of a text file. If the file object loads the entire file, splits it into lines, stores them in a list of strings, then iterates that list, we'll be no better off than we were. Instead of us doing something that doesn't scale, it would be Python doing it on our behalf.

Fortunately, that's not the truth of the matter here: Iterating a file means reading enough text from it to produce the first line, getting it back, and only continuing to read from the file when we ask for the second line. You've probably seen before that file objects keep track of a position in the file (i.e., after you read some text from them, your next read will give you the text that comes directly after what you got the first time). They're iterators, which means their __iter__ and __next__ methods provide the mechanism to make all of this work, a lot like our MyRange implementation, but using the file to store all of the information, so that it won't have to be stored again by our Python program.
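
We can check this for ourselves in the Python shell. (This assumes a small example.txt exists in the current directory; its contents don't matter for the check.)


>>> the_file = open('example.txt', 'r')
>>> iter(the_file) is the_file
    True            # A file object is its own iterator.
>>> '__next__' in dir(the_file)
    True            # So next(the_file) reads and returns just one more line,
                    # and StopIteration is raised once the lines run out.
>>> the_file.close()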

That means our new solution will read and store only a single line at a time, process that line, and then move on to the next. Our memory requirement is now O(1) rather than O(n). We still have to process the entire file, so our time requirement is O(n) — we'll still need to spend proportionally more time processing larger files compared to smaller ones — but this, like our MyRange iterator, is as good as it gets for iterating and processing n elements: linear time and constant memory.

The downside of fusing the steps together

So, it's clear what we've gained, but what have we lost? In what ways were we better off when we had everything broken into separate steps? This might seem like an unvarnished win, because we have less code and require significantly less memory to run it, but there is a design-related price we've paid: When we fuse things together, we can no longer use them separately, which means we don't have components that can be recombined in new ways as our needs change.

That may not cost us anything today, but, over the long haul, those kinds of costs add up. If I want to process files in ten different ways, I'll need to write this same pattern all ten times: open the file (and close it automatically), iterate its lines, and call a function to process each line. If that process plays out ten different times with ten different patterns, some of which are more complicated than the one we just wrote, we end up with a large and complex program where we might otherwise have written a much simpler one, making it more difficult for new team members to join a project and be productive, or for existing team members to be able to keep track of all of the details in their heads as time goes on.

In any programming language, you'll find recurring patterns that you'll need to follow, but one of the things you discover as you learn more about a programming language is that many of these patterns can be replaced by automated techniques. (The with statement has that characteristic, giving us automated wrap-up that we would otherwise have to write by hand each time.) As I'm learning a programming language I don't already know, I'm always on the lookout for ways to remove a boilerplate pattern and replace it with something simpler, or to implement a pattern once and use it repeatedly to avoid having to write it again. I'm always looking for ways to take fused-together pieces and separate them, so I can recombine them in new ways later. For small programs, this is often of little benefit, but for large programs, it's critical.

Given all of that, what we're looking for here are the seams we can use to separate the components of our solution. How could we keep our steps separate while gaining the same benefits we've achieved by fusing them together? That problem is partly solved by the iterables and iterators that we saw previously — which allow us to obtain a sequence of results one at a time — but is solved somewhat more generally by a different Python technique that we've not yet seen.


Generators in Python

A generator in Python is a function that returns a sequence of results, rather than returning a single result. Interestingly, generators don't return all of their results at once; they instead return one result at a time. Even more interestingly, generators don't calculate all of their results at once, either, which means they don't run to completion when you call them, but instead run in fits and starts, doing just enough work to calculate their next result each time they're asked for one.

A good starting point is to see how we obtain generators, so let's start there.


>>> def int_sequence(start, end):
...     current = start
...     while current < end:
...         yield current
...         current += 1
...
>>> int_sequence(3, 8)
    <generator object int_sequence at 0x000002E0D8B04900>

At a glance, our int_sequence function doesn't look particularly special. Its overall shape is a function that counts from the start value to the end. Our call int_sequence(3, 8) looks like it should result in five loop iterations: one each when current is 3, 4, 5, 6, and 7, exiting the loop when current is 8. There's no return statement, so it doesn't look like any values are being returned from it, though there's also a mysterious yield statement, which we haven't seen previously.

When we called int_sequence, the result was an object called a generator, which is an additional clue that there's something different about this function. Functions that return generators are called generator functions, and what makes a function a generator function is the presence of at least one yield statement. Our int_sequence function contains a yield statement, so it's a generator function.
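
If we'd like Python to confirm that classification for us, the standard library's inspect module includes an isgeneratorfunction function that we can use as a quick check.


>>> import inspect
>>> def plain_function(start, end):
...     return list(range(start, end))
...
>>> inspect.isgeneratorfunction(plain_function)
    False           # No yield statement, so this is an ordinary function.
>>> inspect.isgeneratorfunction(int_sequence)
    True            # The yield statement is what makes int_sequence a generator function.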

So, what's a generator? It's an object whose job is to generate a sequence of values, one at a time. In that sense, it's a lot like an iterator. How do we ask a generator for each of its values? Python already provides a straightforward mechanism for this, the iterator protocol, so we might imagine that generators are also iterators.


>>> i = int_sequence(4, 7)
>>> '__next__' in dir(i)
    True            # It looks like generators have a __next__ method.
>>> next(i)
    4               # And it looks like they can be iterated.
>>> next(i)
    5
>>> next(i)
    6
>>> next(i)
    Traceback (most recent call last):
      ...
    StopIteration
                    # Iteration ends with a StopIteration exception, like an iterator.
>>> list(int_sequence(4, 7))
    [4, 5, 6]       # We can pass generators to the list constructor.
>>> for i in int_sequence(4, 7):
...     print(i, end = ' ')
...
    4 5 6           # We can use generators to drive for loops.

At this point, we're pretty satisfied that our theory is the right one: generators are iterators. Then how are they different? The key difference is the use of that yield statement, which triggers some behind-the-scenes behavior that's a little bit mind-bending, yet very powerful. When a function contains at least one yield statement (i.e., when it's a generator function), then it's executed differently from a typical Python function.

Let's explore the details of that behavior, to be sure we understand it. Some targeted experimentation in the Python shell will get us where we need to go. We'll start by printing some output, so we can visualize the execution a little more clearly.


>>> def chatty_int_sequence(start, end):
...     print('Entered chatty_int_sequence')
...     current = start
...     print(f'current = {start}')
...     while current < end:
...         print(f'Yielding {current}')
...         yield current
...         print(f'Yielded {current}')
...         current += 1
...         print(f'Incremented current to {current}')
...     print('Exited loop')
...
>>> chatty_int_sequence(3, 6)
    <generator object chatty_int_sequence at 0x000002CEA10F4900>
                    # We got back a generator, but nothing was printed.
>>> i = chatty_int_sequence(3, 6)
>>> next(i)
    Entered chatty_int_sequence
    current = 3
    Yielding 3
    3               # We got our first value back, but also saw some output.
                    # Notice how the output stopped at the yield statement.
>>> next(i)
    Yielded 3       # Execution picked up where we left off, as we expected.
    Incremented current to 4
    Yielding 4
    4
>>> next(i)
    Yielded 4       # Execution picked up where we left off, as we expected.
    Incremented current to 5
    Yielding 5
    5
>>> next(i)
    Yielded 5
    Incremented current to 6
    Exited loop     # When we reached the end value, the loop was exited.
    Traceback (most recent call last):
      ...
    StopIteration
                    # Exiting the generator function caused StopIteration to be raised.

What if we write a return statement in a generator function?


>>> def return_in_generator():
...     yield 1
...     return
...     yield 3
...
>>> i = return_in_generator()
>>> next(i)
    1
>>> next(i)
    Traceback (most recent call last):
      ...
    StopIteration
                    # A return statement stops the iteration.

What if the return statement returns a value? Where does it go?


>>> def return_value_in_generator():
...     yield 1
...     return 13
...     yield 2
...
>>> i = return_value_in_generator()
>>> next(i)
    1
>>> next(i)
    Traceback (most recent call last):
      ...
    StopIteration: 13
                 # ^^^ There's our return value.  If we had caught this exception,
                 #     that value would be stored in an attribute named 'value'.
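
We can confirm that by catching the exception ourselves in the same shell session.


>>> i = return_value_in_generator()
>>> next(i)
    1
>>> try:
...     next(i)
... except StopIteration as e:
...     print(e.value)
...
    13              # The generator's return value is stored in the exception's value attribute.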

Finally, what if a generator function raises its own exception?


>>> def raising_generator():
...     yield 1
...     raise ValueError('Doh!')
...     yield 2
...
>>> i = raising_generator()
>>> next(i)
    1
>>> next(i)
    Traceback (most recent call last):
      ...
    ValueError: Doh!
      # ^^^ There's the exception we raised.

Why not use iterators?

If generators are also iterators, then we might wonder why we couldn't have simply implemented our int_sequence generator function as an iterator class, similar to how we implemented an iterator for MyRange. If we already knew a technique to solve this problem, then why did we need a new technique at all? The best way to understand why is to consider what that int_sequence generator function would look like if implemented as an iterator class instead. How do the two techniques compare?

int_sequence.py

class IntSequence:
    def __init__(self, start, end):
        self._start = start
        self._end = end
        self._current = start


    def __iter__(self):
        return self


    def __next__(self):
        if self._current >= self._end:
            raise StopIteration
        else:
            result = self._current
            self._current += 1
            return result

Comparing that to our int_sequence generator function, we see that there is some additional complexity that we had to manage ourselves: we had to store the state of the iteration explicitly in attributes, implement the iterator protocol's __iter__ and __next__ methods by hand, and detect the end of the sequence and raise StopIteration ourselves. The generator function, by contrast, simply expressed the underlying loop and let Python take care of the rest.
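
Both techniques produce the same sequence of values, though. As a quick check, assuming the class above and our earlier int_sequence generator function are both loaded into the same shell session:


>>> list(IntSequence(3, 8))
    [3, 4, 5, 6, 7]
>>> list(int_sequence(3, 8))
    [3, 4, 5, 6, 7]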

This is not to say that we'd never want to write an iterator, nor that generators are always easier to write than iterators. But many kinds of sequences can naturally be expressed as generator functions, so this is a great technique to have in our repertoire.


Applying generators to our original problem

We began this section by considering a few ways to write a function that processes the lines of a text file, and realized that we could make an asymptotic improvement in our memory usage by fusing together the steps of our solution, so that we could interleave them instead of performing them sequentially. If we obtained each line of text from the file and processed it immediately, rather than obtaining all of the lines from the file and only then processing them, we would use a constant amount of memory instead of a linear amount, allowing us to process very large files without running out of memory.

However, we noted that this choice had a design cost, in which fusing the steps together meant that we could no longer implement them separately, meaning that we would no longer have the flexibility to recombine them in new ways. Generator functions offer us a solution to this problem, because they excel at doing exactly what we need here, allowing a function that returns a sequence of values to be written in a natural way, yet in a way that allows it to do its work in a piecemeal fashion instead of all at once.

How could we apply that technique to the original problem? If we wanted to disconnect the reading of the file from the processing of its lines, but we wanted it to use a constant amount of memory, what would that look like? Now that we know about generator functions, we know a technique for writing a function that returns the lines of a file one at a time.


def read_lines_from_file(file_path):
    with open(file_path, 'r') as the_file:
        for line in the_file:
            yield line

One thing worth noting about that function is the presence of a with statement. If generator functions don't run entirely to completion when they're called, by what mechanism will the file be closed? The answer is that the with statement's cleanup runs whenever control leaves its block: if the generator runs out of lines and finishes normally, the file is closed on the way out, and if the generator is abandoned before then, Python closes the generator (raising a GeneratorExit exception at the suspended yield), which also gives the with statement its chance to close the file.

In other words, this generator function has two jobs: iterate the lines of a file and ensure it's closed when that iteration has been completed.
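
We can see that cleanup behavior in miniature with a generator that uses try/finally, which is roughly the arrangement a with statement sets up behind the scenes, without involving a real file.


>>> def cleanup_demo():
...     try:
...         yield 1
...         yield 2
...     finally:
...         print('Cleaning up')
...
>>> list(cleanup_demo())
    Cleaning up
    [1, 2]          # Running the generator to completion triggers the cleanup.
>>> g = cleanup_demo()
>>> next(g)
    1
>>> g.close()
    Cleaning up     # Closing a suspended generator raises GeneratorExit at its yield,
                    # which triggers the cleanup, too.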

Once we have that generator function, we would then be able to write a separate function that called our generator function, then put the pieces together to solve our overall problem.


def process(line):
    print(len(line))


def process_lines(file_path):
    for line in read_lines_from_file(file_path):
        process(line)


if __name__ == '__main__':
    process_lines('C:\\Examples\\33\\example.txt')

But we could take this decoupling a step further. The caller of process_lines needs to know that it wants to read its lines from a particular file, but why does process_lines need to know that? Why couldn't it simply accept an iterator, iterate those lines — wherever they came from — and process them? The caller of process_lines would now have to be the one to call read_lines_from_file, but it already needed to know that. Not much has been lost, but much may have been gained.


def process_lines(lines):
    for line in lines:
        process(line)


if __name__ == '__main__':
    process_lines(read_lines_from_file('C:\\Examples\\33\\example.txt'))

Now we have new levels of flexibility, since the pieces have been pulled entirely apart from one another.

We could now write a read_lines_from_shell function alongside our others, which is a generator function with the job of reading lines of text from the Python shell until a blank line is entered.


def read_lines_from_shell():
    while True:
        next_line = input('Next line: ')

        if next_line == '':
            break

        yield next_line

If we wanted to do the same processing on lines of text, but read them from the Python shell instead of a file, then our top-level function call is the only other thing we'd need to change.


if __name__ == '__main__':
    process_lines(read_lines_from_shell())

And, of course, we could instead pass in a list of strings, a call to a function that downloaded text from the Internet via a URL, or whatever else; process_lines is now able to handle any of those scenarios equally well.
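
For example, since a list of strings is iterable, too, a call like this one (a quick illustration using made-up strings, not part of the program above) works without changing process_lines at all.


# Prints 5, 5, and 12: the length of each string, one per line.
process_lines(['hello', 'world', 'how are you?'])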