ICS 46 Spring 2022, Notes and Examples: Skip Lists

Recalling binary search

If you have an array of elements that you know are sorted, and you want to check whether a particular value is in that array, there is a well-known algortihm called binary search that can do the job surprisingly quickly. Consider this array of integers sorted in ascending order:

0	1	2	3	4	5	6	7	8
11	13	20	34	37	44	59	67	71

Suppose we were searching for the integer 34. Binary search would have us solve the searching problem by first looking at the middle element of the array. If it's the one we want, we're done. If not, we instead see how it compares to what we're searching for. The middle element is 37, because there are four elements before it and four elements after it. While it isn't the element we're looking for, we do know two important things that will help us dramatically as we continue our search:

34 is smaller than 37
The array is sorted, which means that 34 could only be in a cell with a smaller index than the one we're looking at; the others can be eliminated from consideration entirely

The algorithm proceeds similarly, looking at the middle of the remaining elements and eliminating (slightly more than) half the elements from subsequent consideration. After roughly log₂n steps in an n-element array, we'll have narrowed our search down to a single element, and no others will need to be looked at. So we would say that a binary search would take O(log n) time to complete.

When first confronted with a logarithmic search algorithm like this, it sounds like the silver bullet you may have been looking for. But, of course, we can't always use binary search, for the simple reason that the cost of sorting the array — if it isn't already sorted — would be quite high (Ω(n) time, since you might have to move every element to its correct position, and quite likely worse than this). And if we're making many modifications to the array, the cost of maintaining that sort as we go along would also be prohibitive, because inserting or removing elements early in an array requires shifting a large number of them one way or the other. However, if there are many more searches than there are modifications (e.g., you're loading the data from disk once and searching it repeatedly in memory), this can be a simple and effective technique; like many algorithms, the important thing is to understand the costs and the benefits and decide when one clearly outweighs the other.

Why binary search is problematic on a linked list

While it is possible to run this same algorithm on a linked list, it no longer confers the same benefits. Consider if you had a singly-linked list with n nodes, each containing an integer, and suppose that the integers are sorted in ascending order as you iterate through the list. On the face of it, that sounds like a case where binary search might be very useful indeed.

But the very first step in a binary search — look at the middle element and compare it to what you're looking for — turns out to be a doozy. Regardless of which linked list variation you have, there will be no pointer directly to the middle node, so you'll be reduced to iterating through n/2 nodes, one by one, to obtain that pointer. In a list with n nodes, this will take Θ(n) time. And that's only the first step!

So, generally, binary search on a linked list is pointless; the key to the logarithmic performance of binary search is that you can divide the search space in half in constant time, by looking only at the middle element, but a linked list doesn't offer you that ability, because it gives you no immediate access to that middle element. (You might argue that you could maintain a pointer to the middle node in the list, but that would only delay the inevitable; the second step in a binary search is to look at the middle of the remaining elements, and you won't have a pointer there. And if you then argued that you could simply maintain an array of pointers to all the nodes, then you've got a data structure that is, in a sense, the worst of both worlds, with both the benefits and costs of arrays and linked lists, particularly with respect to inserting new nodes into the list later or removing existing ones.)

Tolerating imperfection

We've seen previously how we can sometimes achieve some really favorable and interesting results when we're willing to tolerate a certain level of imperfection. Even if it takes an unreasonably long time to calculate a perfect answer, it's often true in algorithm design that we can spend a lot less time to calculate an imperfect — but "good enough" — answer instead. We've seen a couple of examples of that in practice already in this course:

Even though it was too expensive to keep binary search trees in perfect balance, AVL trees are balanced enough (i.e., they still have logarithmic height, even in the worst case), but compromise on perfection in favor of much more efficient algorithms for insertion and removal.
When we try to add an element to a std::vector and it's full, the reallocation that's done deliberately gives up a lot of extra space (temporarily) — by multiplying its capacity, rather than adding to it — but does so specifically so that the cost of the reallocations amortizes evenly across the subsequent cheap adds that follow it. It would be nice if we could use less memory at a time, but we then would lose that amortization benefit.

Another kind of compromise that we can make is to use randomization in the behavior of an algorithm. Randomization generally means that we give up on any guarantee that it will perform within a certain bound — since it's possible, albeit very unlikely, to flip a coin 1,000 times and have it come up heads every time — we can (with care) leave ourselves an extremely high probability that the performance will be acceptable, even with the algorithm remaining quite simple. Skip lists, which we'll look at next, make exactly this trade-off.

Skip lists

A skip list is a collection of linked lists that together provide what you could think of, conceptually, as a linked list that can be searched much more efficiently. They give up some extra memory — by duplicating some of their nodes one or more times — so that they can be searched in a manner more like binary search. A skip list's behavior isn't as "perfect" as binary search, but we'll see that it has a very high probability of being in the same ballpark; in practice, their performance rivals that of balanced binary search trees.

Structural details

A skip list is an ordered collection of linked lists, one on top of another. The linked lists store a set of unique keys, and those keys are required to have order, i.e., it's necessary to be able to determine, for any two keys, which is smaller. We say that each of these linked lists is a level, and we'll refer to these levels by their index, with level 0 being the bottommost, level 1 being the one directly above it, and so on. Over time, the number of levels can vary, but there will always be at least one, and there will almost always be more than that.

Level 0 is a fairly standard singly-linked list with head. If the skip list contains n keys, level 0 will have a total of n + 2 nodes: one containing each key, and two additional ones containing the conceptually special keys -∞ and +∞, which are considered, respectively, to be smaller and larger than any possible key. The keys on level 0 are stored in ascending order, so the first node will contain -∞, the last node will contain +∞, and the other n nodes will contain all of the keys in the skip list in ascending order.

Subsequent levels above level 0 will also contain -∞ and +∞, along with a subset of the keys on the level below. (All keys in level i are present on level i − 1, though not all keys on level i − 1 are present on level i.) When the same key is present on both level i and i − 1, the node on level i points to the node on i − 1 containing the same key; this mechanism is used to link the levels together, so it becomes possible to traverse the structure both across and down, which, as we'll see, is the key to skip list's solid performance characteristics.

Below is an example of a skip list, with each node consisting of a key and two pointers: one to the node after it on the same level, and another to the corresponding node on the level below it.

Skip list structure

Searching a skip list

At first, the arrangement of a skip list might seem like a lot of overhead with little benefit; why are there multiple copies of nodes containing the same key? The reason becomes clearer after we see how to search one. A search algorithm might look something like this:

start at the head node of the top level

loop:
    if the current node's key is the one we're looking for:
        found it!
    else if the next node's key is larger than the key we're looking for:
        move down one level (terminating the search if we're already on the bottom level)
    else:
        move forward to the next node on this level

That algorithm, run against the skip list pictured above, would traverse the following path to search for the key 37.

Skip list searching

On level 4, the first node following -∞ is 44, which is larger than our search key of 37, so we move one level down.
On level 3, the first node following -∞ is 29, which is smaller than our search key of 37, so we move forward. The next node is 44, which is larger than our search key of 37, so move one level down.
On level 2, the first node following 29 is 44, which is larger than our search key of 37, so we move one level down.
On level 1, the first node following 29 is 37, which is equal to our search key of 37, so we move forward, then find what we're looking for; our search is done.

Note that the reason for the -∞ and +∞ nodes on each level is to keep the algorithm simple, so that there are no special cases (e.g., checking for nullptr, handling the first node differently from the others). They can be avoided in practice, but at the cost of making the algorithm slightly more complex — and note that the complexity has not only a cost in terms of understandability, but also potentially a performance cost, as well.

You could implement this algorithm iteratively or recursively. The potential benefit is the same either way: using the higher levels of the list to quickly get close to the key we're looking for, then using the lower levels to refine the search and obtain a final result. This algorithm will be an improvement over a regular linked list search if the higher levels have many fewer nodes on them than the lower levels do, because it will allow us to skip large numbers of nodes on the bottom level without ever looking at them.

But how do we ensure that the higher levels have fewer nodes — and a reasonably good spread of keys — than the lower ones? As with binary search trees, skip lists don't just fall out of the sky already arranged; we need an algorithm for inserting keys into them.

An insertion algorithm

Conceptually, inserting a key into a skip list is quite simple, even though the algorithm's implementation can be a bit hairy. Rather than specifically attempting to arrange the keys on each level in a particular way, we instead use randomness to decide which keys appear on higher levels and which don't.

Begin by doing a search, partly because our keys are required to be unique (so we need to find if the key we want is already there) and partly so we can find the appropriate insertion point in level 0 where our new key would need to be inserted. Either way, we would run the same basic search algorithm.
Every time we insert a key, we always insert it into a new node on level 0. Level 0 is always complete (i.e., it always contains all of the keys, plus -∞ and +∞).
After inserting the key on level 0, we "flip a coin," which is to say that we choose a random bit, equally likely to be 0 or 1. If the random bit is 1, we also insert the key in a new node on level 1; if the random bit is 0, we're done with the insertion.
If we did insert the node on level 1, we do the same thing again: flip a coin and decide whether it should also be inserted on level 2. If so, do it; if not, we're done.
This continues until our random bit comes back 0, at which point we're done with the insertion.

Levels are added to the skip list as necessary. Initially, the skip list has one level; as our coin flips take us to levels we've not yet seen, we create them on the fly.

Assuming we're generating random bits properly (i.e., they really do have a 50% chance of being 0 or 1), this algorithm's result is that a given key has a 100% chance of appearing on level 0, a 50% chance of appearing on level 1, a 25% chance of appearing on level 2, and so on. Generally, the probability that a key will appear on level i is 1/2ⁱ.

Thought differently, if there are n keys on level 0, you'd expect about n/2 of them to appear on level 1 (though it would not be guaranteed), about n/4 of them to appear on level 2, and so on. This is where skip lists derive their power; because so few keys are on the highest levels, you'll quickly eliminate large swaths of the search space by looking at just a few nodes; in very general terms, moving from one level down to the next should be of similar benefit to the early steps in a binary search of an array.

A removal algorithm

Removing a key from a skip list would begin with the same search algorithm we've seen previously, with the goal of finding the node at the highest level that contains the key we want to remove. At that point, we would "point around" that node (and delete it), then move down and do the same with each subsequent level until we reach level 0.

One nice performance optimization would be to remove levels that we no longer need. When we remove the last node from a level (aside from the -∞ and +∞ nodes), we can safely remove that level from the skip list altogether.

As always, the devil is in the details, but the concept is a simple one.

Analysis of skip lists

To analyze the performance characteristics of a skip list, there are a few things we should pay attention to, which we'll take in turn.

Memory usage

The first thing we should consider is how much memory is needed. It's clear that skip lists are attempting a fairly classic tradeoff in computing: time vs. memory, in this case spending extra memory (to store nodes containing the same keys on multiple levels) to save time in searching. So how much memory do skip lists use to make that trade?

Generally, the answer to this question has a lot to do with the randomness of the random bits we're generating to do our "coin flips." If we assume a good random engine — so each random bit independently has a 50% chance of being a 0 or a 1 — then it's reasonable for us to expect the following to be true as n grows large.

Level 0 will contain n nodes (we'll leave the -∞ and +∞ nodes out, because they don't dramatically affect the analysis)
Level 1 will contain n/2 nodes
Level 2 will contain n/4 nodes
And so on...

In total, then, here's how many nodes we'd expect there to be in all the lists combined:

n + n/2 + n/4 + ...

This is something called a geometric series and it has a well-known result of 2n. So we would expect that there would be a total of Θ(n) nodes across all the levels, so a total of Θ(n) memory would be needed. Note that this is the same asymptotic notation as the amount of memory required by just one linked list; all we've done is multiplied our memory requirement by a constant factor.

Of course, if the quality of our random bit sequence is bad, all bets are off; it's vital that we use a good random engine for this, and that we use it properly (seeding it from a legitimate source of entropy, then using the same engine to generate a sequence of results).

Height of the skip list

We say that the height of a skip list is the number of levels it has. Note that we add new levels only when we need them, and we remove levels when they become empty, so it's safe to say that a skip list's height is a function of the number of keys it stores at any given time. But what is that function?

There isn't a specific function that is always correct; it depends very much on the coin flips we do, which determine how many levels each key appears on. So we can't say for sure, but we can say some useful things about it.

First of all, your intuition should be telling you something right off the bat. In the expected case where all of the nodes appear on level 0, half of them appear on level 1, a quarter of them appear on level 2, and so on, what level would you be on by the time there's only one node left? The answer here is log₂n, which is roughly the height we'd expect.

However, there's uncertainty here. We're generating random bits, and even if we use a good random engine, there still exists the possibility that we'll see uncharacteristically long sequence of 1's once in a while. What are the odds that the height is much worse than what we expect? A couple of facts will help us to estimate the probability.

The probability that an individual key is present on level i is the same as the probability of flipping a coin and having it come up heads i times in a row: 1/2ⁱ.
The probability that any of the n keys is present on level i — which is also the probabiliy that the height of the skip list is at least i — is harder to calculate. But one thing's for sure: it's no more than the sum of the probability that each individual key is on that level, i.e., it's no more than n/2ⁱ. (In practice, it's probably a lot less than that, but certainly no more.)

We'll define P_i as the probability that the height of the skip list is at least i. Given the facts above, we can come to the following conclusion:

P_{3 log n} ≤ n / 2^{3 log n}
       ≤ n / n³
       ≤ 1 / n²

To put that result into perspective, if we have 1,000,000 keys in our skip list, the odds that the height is even three times worse than what we expect is no more than 1 in 1,000,000,000,000. And, even then, we'd still be talking about a height of Θ(log n).

So, what are the odds that the height is n? From our analysis, we have:

    P_n ≤ n / 2ⁿ

If we have just 10,000 keys, what are the odds that the height of our skip list will be 10,000? The answer is ridiculously small: 10,000 / 2^10,000 ≈ 10,000 / 10^3,010 ≈ 1 / 10^3,006. (To get an idea of just how small that number is, it is estimated that there are 10⁸⁰ atoms in the universe.)

So, with extremely high probability, the height of the skip list will be near our expectation. As n grows large enough for us to care, it's reasonable to presume that the height will be logarithmic in practice.

Search time

To keep the analysis simple here — some calculus and some probability theory would be necessary to be more precise — there are two facts we can focus on.

We'd expect the topmost level of the skip list to roughly divide the entire list in half.
We'd expect the next level to roughly divide the entire list in quarters.
And so on...

In other words, we would expect the number of keys we'd visit in a search to be approximately log₂n, for the same reason that a binary search will require us to look at about log₂n elements.

Additionally, we'll have to traverse downward from the topmost level to the bottommost. We expect there to be approximately log₂n levels, as well.

So, in total, we'd expect to visit about 2 log₂n nodes during a search. This would take O(log n) time, generally, which is a very good result.

It's certainly true that the searches could take substantially longer if, for example, the height of the skip list was abnormally large, or if the number of nodes on the higher levels was abnormally large; but, as we saw in the previous section, the probability of that is ridiculously small. Skip lists, in practice, perform very well and are a good tool to have in our toolbox.