ICS 33 Spring 2024
Notes and Examples: Comprehensions


Background

Writing substantially sized programs requires arranging a variety of information into data structures. Fortunately for us, as Python programmers, many of our needs are met by the data structures built into the Python language. Lists, tuples, sets, and dictionaries each solve a differently shaped problem, but, collectively, they solve a wide variety of the problems we encounter. There are certainly problems for which we need less-common data structures — sometimes we even need to build our own fairly esoteric data structure, but that's mainly a topic for other courses — but those that are built in will get us a long way indeed.

Still, especially when we're first learning Python, we find ourselves employing the same long-winded patterns for constructing them. If I want a variable to refer to a list of values with some characteristic, I might write a pattern like this one: initializing an empty list, looping over values, filtering them by some characteristic, and appending them into the list.


>>> values = []
>>> for value in range(100):
...     if value % 10 > value // 10:
...         values.append(value)
...
>>> values
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 18, 19, 23, 24, 25, 26, 27, 28, 29,
     34, 35, 36, 37, 38, 39, 45, 46, 47, 48, 49, 56, 57, 58, 59, 67, 68, 69, 78, 79, 89]

If a pattern like that one is so common, though, it's ripe for simplification. Python offers simplified patterns for building common data structures using a feature called comprehensions.


List comprehensions

A list comprehension is an expression that builds a list based on a brief description of its values, sparing us from writing a loop that populates an empty list by appending elements into it individually. They're perhaps best demonstrated by example, so let's start there.


>>> [x for x in range(10)]
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [x * x for x in range(10)]
    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> [x.upper() for x in 'Boo']
    ['B', 'O', 'O']
>>> [x for x in range(10) if x % 2 == 0]
    [0, 2, 4, 6, 8]

What we see in the above examples is a pattern that we can describe more generally: a list comprehension is written as [expression for variable in iterable], optionally followed by an if condition, and it builds a list containing the result of the expression for each value taken from the iterable (or, if a condition is present, each value for which the condition is truthy).

List comprehensions aren't a panacea for the creation of lists. Many lists can be described by list comprehensions, and, when they can be, that's often (though not always) a lot clearer to a human reader than the equivalent code that you might write instead. Consider these equivalents of the four examples above, which build the same lists without using list comprehensions.


>>> list(range(10))
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
              # This one had a simpler alternative than a list comprehension.

>>> values = []
>>> for x in range(10):
...     values.append(x * x)
...
>>> values
    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
              # When we perform an operation on each value, list comprehensions begin to shine.

>>> values = []
>>> for x in 'Boo':
...     values.append(x.upper())
...
>>> values
    ['B', 'O', 'O']

>>> values = []
>>> for x in range(10):
...     if x % 2 == 0:
...         values.append(x)
...
>>> values
    [0, 2, 4, 6, 8]
              # When we filter the values, list comprehensions also shine.
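
The common recipe these examples share can be captured as an ordinary function. This is just an illustrative sketch with invented names, not anything built into Python:

```python
def like_a_comprehension(iterable, transform, keep):
    # Mimics [transform(x) for x in iterable if keep(x)].
    results = []
    for value in iterable:
        if keep(value):                       # the "if" clause filters...
            results.append(transform(value))  # ...and the expression transforms
    return results

# Mirrors [x * x for x in range(10) if x % 2 == 0]:
squares_of_evens = like_a_comprehension(
    range(10), lambda x: x * x, lambda x: x % 2 == 0)
```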

So, all in all, a list comprehension has the job of taking an iterable (i.e., a sequence of values), filtering it with the if clause (if any), transforming each element using the initial expression (the one before the for), and building a list of the results. In what ways might this fail?


>>> [x for x in 13]
    Traceback (most recent call last):
      ...
    TypeError: 'int' object is not iterable
        # The "in" keyword must be followed by something iterable.
        # Integers aren't iterable.
>>> [x for x in range(10) if x.upper() == 'Boo']
    Traceback (most recent call last):
      ...
    AttributeError: 'int' object has no attribute 'upper'
        # The filtering expression following the "if" needs to use the objects in
        # the sequence in a way that's compatible with their type.
>>> [x.upper() for x in range(10)]
    Traceback (most recent call last):
      ...
    AttributeError: 'int' object has no attribute 'upper'
        # So does the initial expression.

Writing more complex list comprehensions

While many list comprehensions you might write would follow the rules I described above, there's actually an additional (and very useful) twist. A list comprehension requires the initial expression and a for clause (i.e., for x in something_iterable), but that for clause can actually be followed by as many for and if clauses as you'd like, in any order you'd like, which means that some very terse expressions can lead to some very complex results.


>>> [x * y for x in 'ABC' for y in (1, 2, 3)]
    ['A', 'AA', 'AAA', 'B', 'BB', 'BBB', 'C', 'CC', 'CCC']
        # We're iterating over every value in 'ABC'.  For each of those, we're
        # iterating over every value in (1, 2, 3).  Every combination of these
        # values is passed to the initial expression, and we get a list of the results.
>>> [x for x in range(10) if x % 2 == 0 if x != 6]
    [0, 2, 4, 8]
        # Multiple if clauses establish separate filtering conditions,
        # all of which have to be truthy.
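
The first of those examples behaves like a pair of nested loops, with the first for clause as the outer loop; a sketch of the equivalence:

```python
# Equivalent to [x * y for x in 'ABC' for y in (1, 2, 3)]:
values = []
for x in 'ABC':              # the first for clause is the outer loop
    for y in (1, 2, 3):      # the second for clause is the inner loop
        values.append(x * y)
```

Each additional for clause nests one level deeper, and each if clause becomes a nested if statement.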

The initial expression in a list comprehension could itself be a list comprehension, which opens up some other interesting possibilities.


>>> [[x * x for x in range(4)] for y in range(4)]
    [[0, 1, 4, 9], [0, 1, 4, 9], [0, 1, 4, 9], [0, 1, 4, 9]]
>>> values = [[x * x for x in range(4)] for y in range(4)]
>>> values[0] is values[1]
    False           # The sublists have equivalent values, but they're separate lists.
>>> values[0][0] = 999
>>> values
    [[999, 1, 4, 9], [0, 1, 4, 9], [0, 1, 4, 9], [0, 1, 4, 9]]
                    # So, mutating one sublist leaves the others unchanged.
>>> [[0, 1, 4, 9]] * 4
    [[0, 1, 4, 9], [0, 1, 4, 9], [0, 1, 4, 9], [0, 1, 4, 9]]
                    # This looks like the same result.  But let's look more closely.
>>> other_values = [[0, 1, 4, 9]] * 4
>>> other_values[0] is other_values[1]
    True            # All of the sublists are actually the same list (i.e., other_values
                    # contains four references to the same list).
>>> other_values[0][0] = 999
>>> other_values
    [[999, 1, 4, 9], [999, 1, 4, 9], [999, 1, 4, 9], [999, 1, 4, 9]]
                    # So, mutating one sublist mutates them all.


Set comprehensions

As you've digested list comprehensions, you might already have imagined that lists aren't the only data structure that you might like to build with a comprehension. Suppose you want to build a set of integers formulaically, rather than a list. Why shouldn't we be able to use a comprehension for that? Sets, after all, are a lot like lists; they differ mainly in two ways: a set's elements are unique, with duplicates silently discarded, and a set's elements must be hashable, which rules out mutable objects like lists.

Aside from those issues, building a set is a lot like building a list, so it's natural to imagine that we could use a comprehension to do it, with { and } taking the place of [ and ], and with the two differences between sets and lists taken properly into account.


>>> {x for x in range(10)}
    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}           # Yep!  Works a treat!
>>> {x for x in range(10) if x % 3 != 0}
    {1, 2, 4, 5, 7, 8}                       # Filtering with "if" works, too.
>>> {x.upper() for x in 'Hello Boo!'}
    {'H', 'L', '!', 'E', 'B', ' ', 'O'}      # Duplicates eliminated, as expected.
>>> {list(range(x)) for x in range(3)}
    Traceback (most recent call last):
      ...
    TypeError: unhashable type: 'list'       # Mutable elements not permitted, as expected.

Otherwise, set comprehensions are enough like list comprehensions that you'll likely be unsurprised if you experiment with them in more detail, as we did with list comprehensions previously.
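
If you do need set elements that are themselves sequences, an immutable sequence type such as a tuple is hashable, so the failing example from above succeeds with one small change:

```python
# tuple(...) in place of list(...): tuples of hashable elements are hashable.
s = {tuple(range(x)) for x in range(3)}
# s contains the three tuples (), (0,), and (0, 1)
```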

Interlude: What does it mean to be hashable?

When we attempted to build a set whose elements were lists, we were rightly prevented from doing so. Sets are not permitted to contain mutable elements, but lists are mutable. The error message might have been slightly perplexing, though, because it introduced a new term you may not have seen before. Rather than informing us that lists are mutable, it told us that lists are unhashable. This is a term we'll see again, so it's best for us to pause here and be sure we know what it means.

Sets are required to maintain the uniqueness of their elements. Since sets are mutable, they have to ensure that uniqueness not only at the time of creation, but continually as new elements are added later.


>>> s = {1, 2, 3}
>>> s.add(1)
>>> s
    {1, 2, 3}           # Nothing has changed, because 1 was already in the set.
>>> s.add(4)
>>> s
    {1, 2, 3, 4}        # Since 4 wasn't already in the set, it's been added.

Suppose we have a set containing n elements, and then we call add to add a new element to it. How much time would we expect to be spent adding it? There are two things that need to be done: determining whether the element is already in the set and, if it isn't there, storing it.

Behind the scenes, a set needs to store its elements somehow. One possibility would be to store them in the equivalent of a Python list — arranged one after another. If that was true, though, adding a new element would require searching that list to determine whether the element was already in the list. With no insight about where to look, we'd have no choice but to look everywhere, which would take O(n) time (i.e., a constant amount of time spent asking whether each of the n elements is the one we're looking for). We might get lucky and find it right away, but we might not, and if the element isn't in the set, we'd have to look at every element before we could conclude that it's safe to add it. Alternatively, we could keep the list sorted, which would allow us to use fancier search techniques to find elements more quickly, but would then dramatically raise the cost of adding an element while maintaining the sorted order. Sorted or not, we lose either way.

Because of this, sets are implemented differently, using a data structure called a hash table. We'll defer discussing most of the details of hash tables until ICS 46, but they're built on a simple idea: If every element has a definitive location where it belongs, we'll always know where to find it when we're looking for it, and we'll always know where to store it when we're adding it. The notion of "where it belongs" is determined by a technique called hashing, in which we take the element and "boil it down to its essence" in some way, so that we have an integer that describes it more simply, but still takes into account its entire value. Once we have the hash value (or hash) that describes an object, that tells our set where to look for it or where to store it. And, since it knows precisely where to go, it can do these jobs in O(1) time. But, of course, this trick only works if an object's hash never changes, which goes a long way toward explaining why the objects in a set need to be immutable; if they could be mutated, their hashes might change, and then it would no longer be possible to find them.
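
We can observe hashing directly through Python's built-in hash function, which returns an integer for hashable objects and raises a TypeError for unhashable ones. A quick sketch:

```python
# Equal objects are guaranteed to have equal hashes.
assert hash('Boo') == hash('Boo')

# Tuples of hashable elements are themselves hashable.
tuple_hash = hash((1, 2, 3))

# Lists are mutable, hence unhashable; asking for their hash fails.
try:
    hash([1, 2, 3])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
```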

An object in Python is hashable if it supports a protocol that provides two operations: a __hash__ method, which returns the object's hash value, and an __eq__ method, which determines whether the object is equal to another. (Objects that are equal to each other are required to have equal hashes.)

So, technically, that's why a set can't contain elements that are lists. Lists don't provide a __hash__ method, which means they aren't hashable. The underlying reason they shouldn't provide a __hash__ method is that they're mutable; mutating a list would presumably change its hash, as well.

(The entire story of hashing and hash tables is more complicated than this, but this is enough detail for us to proceed with for now.)
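
A class of our own can opt into that protocol by providing both methods. The Point class below is a hypothetical example, not something from these notes; the key idea is that __hash__ derives the hash from the same components that __eq__ compares:

```python
class Point:
    def __init__(self, x, y):
        self._x = x
        self._y = y

    def __eq__(self, other):
        return (isinstance(other, Point)
                and (self._x, self._y) == (other._x, other._y))

    def __hash__(self):
        # Hash the same components that __eq__ compares, so that
        # equal Points always have equal hashes.
        return hash((self._x, self._y))

# Point objects can now be elements of a set, with duplicates discarded.
points = {Point(1, 2), Point(1, 2), Point(3, 4)}
```

Because the two Point(1, 2) objects are equal and hash identically, points ends up containing only two elements.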


Dictionary comprehensions

If we're able to build lists and sets using comprehensions, we might also expect to be able to build dictionaries with them, as well. Dictionaries have similarities to sets, in that their keys, like a set's elements, must be unique and hashable, but they also have an important difference: every key has a value associated with it, and those values are unrestricted (duplicates among the values are fine).

Still, if you can start with something iterable and, for each element in it, determine both a key and a value to associate with it, that would be a good recipe for a dictionary comprehension. All we need now is a syntax.


>>> {'A': 1, 'B': 2, 'C': 3}
    {'A': 1, 'B': 2, 'C': 3}
                  # A dictionary literal has keys and values separated by colons,
                  # surrounded in its entirety by curly braces.
>>> {x: len(x) for x in ['Boo', 'is', 'happy', 'today']}
    {'Boo': 3, 'is': 2, 'happy': 5, 'today': 5}
                  # The same notation appears here with the same meaning: curly braces and colons.
                  # Notice that duplicate values are fine, as they should be; 5 appears twice.
>>> pairs = [('A', 1), ('B', 2), ('C', 3)]
>>> {key: value for key, value in pairs}
    {'A': 1, 'B': 2, 'C': 3}
                  # We're sequence-assigning each element of the tuples within pairs
                  # into the two variables key and value here.
>>> {x.upper(): x.lower() for x in 'Hello Boo!'}
    {'H': 'h', 'E': 'e', 'L': 'l', 'O': 'o', ' ': ' ', 'B': 'b', '!': '!'}
                  # Duplicate keys are eliminated.
>>> {list(range(x)): len(range(x)) for x in range(3)}
    Traceback (most recent call last):
      ...
    TypeError: unhashable type: 'list'
                  # Unhashable keys are disallowed.

As with set comprehensions, there are few surprises beyond this to be found in the mechanics of dictionary comprehensions, but it's worth spending a little time experimenting to look for any sharp edges or gaps in your understanding.

It's worth noting, though, that the proper usage of dictionary comprehensions requires the ability to take a sequence and formulaically use each element to determine both a key and its associated value. With that higher bar to be cleared, you may find that you'll use dictionary comprehensions less often than the others.
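
One case that clears that bar naturally is a pair of parallel sequences, where zip pairs each key with its value. The data here is invented for illustration:

```python
names = ['Boo', 'Alex', 'Willard']
nicknames = ['B', 'Al', 'Will']

# Each element of zip(names, nicknames) is a (name, nickname) pair,
# which gives us both a key and a value per element.
by_name = {name: nickname for name, nickname in zip(names, nicknames)}

# The dict constructor accepts pairs directly, so this is equivalent:
also_by_name = dict(zip(names, nicknames))
```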


What about tuple comprehensions?

We saw previously that sets are enough like lists that Python provides set comprehensions that mostly mirror the mechanics of list comprehensions, while differing in the ways that sets and lists are different from each other. Tuples are a lot like lists, too, with the main difference being that tuples are immutable (though their elements may be mutable), so you might also expect there to be tuple comprehensions, with ( and ) taking the place of [ and ].


>>> (x * x for x in range(7))
    <generator object <genexpr> at 0x000002120B11CA50>

Interestingly, this syntax worked — in the sense that an exception wasn't raised and we got back a value — but didn't do what we expected. The reason is simple: Python does not provide a tuple comprehension, so that syntax actually means something different. We'll circle back to this technique, which is a fantastically useful one called a generator comprehension (the official Python documentation calls it a generator expression), a little later in the course; all things in time. But, in the meantime, you can pass a generator into the tuple constructor, if you ever want to use a comprehension to build a tuple. So, the closest thing Python provides to a tuple comprehension is this:


>>> tuple(x * x for x in range(7))
    (0, 1, 4, 9, 16, 25, 36)


Asymptotic analysis of comprehensions

We now have a growing number of design choices on our palette, which gives us a richer ability to express the design of a Python program. You may have noticed, though, that some of these choices overlap with others. For example, we can build a list using a list comprehension, the list constructor, or a hand-written loop. Which of these three techniques should we use?

Sometimes the answer to that question will be clear, because one technique will work and others simply won't, or one will be straightforward and the others cumbersome. The list constructor will work nicely if you already have an iterable containing the list's elements, but will otherwise be little help. List comprehensions provide a syntax that tersely describes certain patterns of list creation, but the limitations of that syntax mean that there are plenty of lists for which they're an unnatural fit. The hand-written loop is the most flexible choice, but also the noisiest.

But there's another aspect to these choices that we should consider. Some techniques will run faster than others. Some techniques will require more memory than others. All things being equal, spending less time or using less memory is better than the alternative. Truthfully, we'll often need to balance that goal against others — we might be able to make a program run faster, but only at the cost of writing significantly more code or using esoteric techniques that are harder for subsequent readers of our program to understand — so our goal, in practice, is to make sure our programs are fast enough for our needs and don't use more memory than we have available. In other words, we're not aiming for "perfect," as much as we're aiming for "good enough."

Determining these things requires some means of measuring how much time or memory our programs might consume. For that purpose, asymptotic analysis is a great place to start. It can give us a big-picture understanding of these costs and how they'll increase as problem sizes increase. To start building our skills at applying asymptotic analysis to Python programs, let's consider the asymptotic analysis of comprehensions.

Asymptotic analysis of list comprehensions

What can we say about the performance of the following Python function?


>>> def f(n):
...     return [x * n for x in range(n)]
...

A good first step is to understand what the function does. The parameter n is used to build a range, so it's clearly intended to be a non-negative integer. So, let's try a few non-negative integer arguments.


>>> f(5)
    [0, 5, 10, 15, 20]
>>> f(10)
    [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
>>> f(0)
    []

In each case, it looks like we've gotten back a list of n integers, whose values are multiples of n. When n was 5, we got the first 5 multiples of 5; when n was 10, we got the first 10 multiples of 10; when n was 0, we got an empty list.

Looking at the list comprehension we wrote, that seems completely reasonable: range(n) generates the n integers from 0 through n - 1, and each of them is multiplied by n before being added to the resulting list.

Now, what can we say about how much time the function would spend building the list? First of all, is it safe to assume that it will always take the same amount of time, or do we expect it to vary in some way depending on the value of n? Given the shape of the results we're seeing, it's hard to believe that the time spent will always be the same; building longer lists should obviously take longer than building shorter ones, because we have more multiplications to do and more elements to add to the list. So, what we really want to know is how the time spent varies as n does. Let's consider the sum total of the work done by the comprehension, in no particular order: it obtains each of the n values from the range, performs a multiplication for each of those values, and appends each result to the end of the list being built.

So, in total, we've obtained n values from the range at a constant cost each, performed n multiplications at a constant cost each, and added n elements to the end of the list at a constant cost each. If we were to write that as a mathematical function, it might be an + bn + cn, where a, b, and c are the constant costs of obtaining a value from the range, performing a multiplication, and adding an element to the end of the list, respectively. We could reorganize that function slightly using a little bit of algebra:

an + bn + cn = (a + b + c)n

And now we have a function that boils down to "a constant value multiplied by n." All such functions have the same shape in an asymptotic analysis; our "closest fit" O-notation for the running time of f would be O(n). As a practical matter, that means the amount of time it takes for f to run increases linearly as its parameter n increases. If we call f with some n, and then we call f with 5 * n, we expect it to take about five times as long to run. If we call f with 50 * n, we expect it to take about fifty times as long.
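
The linear claim can be checked loosely with a rough measurement. This timing scaffolding is my own addition, not part of the notes; the absolute numbers are machine-dependent and noisy, so the point is only the trend:

```python
import timeit

def f(n):
    return [x * n for x in range(n)]

# Each call performs n constant-cost steps, so doubling n should
# roughly double the time taken (give or take measurement noise).
time_n = timeit.timeit(lambda: f(10_000), number=50)
time_2n = timeit.timeit(lambda: f(20_000), number=50)
```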

What about memory usage? What does f use memory for? There are two things: the range that's being iterated, and the list that's being built to hold the results.

To determine how much memory will be used by f, we'll need to know a little more about how ranges and lists are implemented.

Interlude: How do ranges work?

Ranges are a succinct way of describing sequences of integers. What makes them so succinct is not just the choice of syntax; their succinctness is enabled by the fact that not all sequences of integers can be described by them. A range counts from a start value toward a stop value, moving by a constant step each time, so only arithmetic sequences of integers can be described this way.

So, ranges are limited in their capabilities, though not so limited that they don't have many realistic uses. But their limitations don't just mean that their syntax is simple; it also means that they're a lot more efficient than they might be if they had a more difficult problem to solve.

Ranges are iterable, which means that they support a protocol that allows us to ask them for a single element at a time, continuing until they tell us that there are no more elements. We'll discuss the details of the iterable protocol a little later in the course, but let's think about how ranges might be able to solve this kind of problem.

A range is entirely described by its three components: start, stop, and step. While we're iterating a range, the only additional thing a range would need to know is the last value it generated, since the next value can always be calculated from the last one and the step.
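
To make that concrete, here's a hypothetical iterator with the same constant-sized state — start, stop, step, and the next value to produce. This is only an illustrative sketch, not how Python's range is actually implemented:

```python
class RangeLikeIterator:
    # Stores only the three components plus the next value to produce,
    # no matter how many values the sequence contains.
    def __init__(self, start, stop, step=1):
        self._stop = stop
        self._step = step
        self._next = start      # plays the role of "the last value" in the notes

    def __iter__(self):
        return self

    def __next__(self):
        if (self._step > 0 and self._next >= self._stop) or \
           (self._step < 0 and self._next <= self._stop):
            raise StopIteration
        value = self._next
        self._next += self._step    # the next value is calculated, not stored ahead of time
        return value
```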

So, all in all, a range requires the same amount of memory, no matter how many elements it has. We would state that asymptotically as O(1).
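
In CPython, we can corroborate this with sys.getsizeof, which reports the same size for a tiny range as for an enormous one (a list of the same elements, by contrast, grows with its length):

```python
import sys

# Every range object occupies the same amount of memory...
assert sys.getsizeof(range(10)) == sys.getsizeof(range(10_000_000))

# ...while a list actually storing the elements grows as they multiply.
assert sys.getsizeof(list(range(1_000))) > sys.getsizeof(list(range(10)))
```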

Interlude: How do lists work?

A Python list is a sequence of elements, where each element has an index. The index of the first element is 0, the index of the second element is 1, and the indices increase sequentially from there. (This scheme is often called zero-based indexing.)

Those indices aren't just conceptual; they also provide a powerful but simple way for Python to find its way around the list. Internally, a list contains a reference to the first of a sequence of its elements. The elements are arranged immediately next to one another in memory. Each element is actually a reference to a Python object, which means that each element has the same size in memory — references are the same size, no matter what kind of object they refer to. So, if a list knows where its first element is, it can easily know where the element at any other index is; all it needs to do is some arithmetic.

In other words, if we know the index of what we want in a list, we can obtain it in a constant amount of time, because all the list needs to do is one multiplication and one addition, no matter how big the list is and no matter what the index is.


>>> def element_at(elements, index):
...     return elements[index]
...              # This function runs in O(1) time.

Determining the cost of adding elements to a list is a little bit trickier, because the time needed is situationally dependent. Understanding why requires us to know two more facts about the internals of a Python list: a list sets aside a contiguous area of memory with room for more elements than it currently stores, and, because of that extra room, a new element can be stored at the end without moving any of the existing ones.

(That second fact turns out to be a half-truth in practice — there's more to the story than this — but is a good enough assumption for now.)

If those things are true, then how do elements get added to a list? It depends on where they're being added. A new element added at the end can simply be stored in the next available slot, which takes O(1) time; an element inserted anywhere else requires every element after it to be shifted one position toward the end first, which takes time proportional to the number of elements shifted.

These two functions solve the same problem in the reverse order. Even though their results are the same, one is wildly more expensive than the other.


>>> def relatively_cheap(n):
...     values = []
...     for i in range(n):
...         values.append(i)
...     return values
...
>>> def relatively_expensive(n):
...     values = []
...     for i in reversed(range(n)):
...         values.insert(0, i)
...     return values
...

The relatively_cheap function appends elements to the end of the list. Each of those calls to append takes O(1) time — it's not more expensive to add the 100th element than it is to add the first — so the total time spent is O(n).

The relatively_expensive function inserts elements at the beginning of the list. Each of those calls to insert requires shifting all of the elements already in the list, which means they cost more the longer the list gets. The first call to insert shifts no elements, then inserts; the second call shifts one element, then inserts; the third call shifts two elements, then inserts; and so on. So, the total time spent looks something like this:

1 + 2 + 3 + ... + n

That sum appears frequently in computer science, so it's worth knowing that it simplifies to n(n + 1) / 2, or, when simplified further to an asymptotic notation, O(n²). So, the time spent in relatively_expensive grows in proportion to the square of n, which means as n gets larger, the impact of the difference between these functions becomes significantly greater. Before long, n will be large enough that the time spent will be problematic.
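
Reproducing the two functions from above, a rough measurement makes the asymptotic difference tangible. The timing scaffolding is my own addition, and the actual numbers will vary from machine to machine:

```python
import timeit

def relatively_cheap(n):
    values = []
    for i in range(n):
        values.append(i)        # O(1) per call
    return values

def relatively_expensive(n):
    values = []
    for i in reversed(range(n)):
        values.insert(0, i)     # shifts every existing element first
    return values

# Both functions produce the same list...
assert relatively_cheap(1_000) == relatively_expensive(1_000)

# ...but at even a modest n, the O(n²) version already lags behind,
# and the gap widens quickly as n grows.
cheap_time = timeit.timeit(lambda: relatively_cheap(5_000), number=10)
costly_time = timeit.timeit(lambda: relatively_expensive(5_000), number=10)
```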

This is why it's so important for us to know a little bit about how the tools built into our languages and libraries work. We don't need to know every detail, but if we don't know the significant ones — such as the relative costs of manipulating lists in different ways — we won't realize when we've stepped on a performance landmine.

Memory usage of list comprehensions

Now that we've seen a little bit about how ranges and lists work internally, we have the knowledge we need to answer our open questions about how much memory is required by our list comprehension. We realized that memory is used for two things, but didn't know how to quantify the amount required for each. Now we know.

The range requires O(1) memory, no matter how large n is, while the list requires O(n) memory, since it stores one reference for each of its n elements. O(1) + O(n) = O(n), so the total amount of memory required to run our f function that builds its list with a comprehension is O(n).
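
As a CPython sanity check (this measurement is my own, not from the notes), sys.getsizeof reports the list's own storage, meaning its n references, which is exactly the part that grows with n:

```python
import sys

def f(n):
    return [x * n for x in range(n)]

# The list's own footprint (its references, not the integers they
# refer to) grows with the number of elements it holds.
assert sys.getsizeof(f(10_000)) > sys.getsizeof(f(100))
```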