ICS 46 Spring 2022
Notes and Examples: The Union-Find Algorithm


Thinking back to our maze generator

Consider the maze generator we implemented in Project #1. Our goal was to generate a perfect maze, which we defined as a maze in which there was exactly one path connecting each pair of cells; in other words, if you chose two cells in the maze, there would always be exactly one way to get from one to the other (and the only way to get back would be to follow the same steps in reverse). We asked you to solve the problem in a particular way: a depth-first, "tunnel-digging" algorithm, which had a couple of pedagogical upsides.

However, we provided another maze generator that used a very different approach, which boiled down to this:

start with all possible walls in place

while the maze is not perfect:
    choose a wall at random

    let i and j be the cells the wall sits between

    if there is already a path between i and j:
        do nothing
    else:
        remove the wall

Of course, this description leaves a number of open questions that we'd need to consider more carefully if we wanted to actually be able to implement this algorithm.

One thing we've learned in this course, though, is that we can often take what seems like a complicated problem and recast it as a simpler one. Sometimes, there's a fair amount of conceptual distance between the problem and the technique we choose to solve it; the finesse is in learning to let go of the details and think about the underlying realities of the problem you're trying to solve, which can open up possibilities that are otherwise quite difficult to see. In this set of notes, we'll explore how you might think about this problem very differently than you might have in Project #1, yet nonetheless arrive at a simple and performant algorithm for solving it.

But, first, we'll need to consider some math and some computer science theory that we've not yet seen.


Equivalence classes

We say that an equivalence relation specifies whether any two objects are considered equivalent to each other in some fashion. Equivalence relations must have three characteristics, which will sound generally familiar if you've studied algebra in your past: they are reflexive (every object is equivalent to itself), symmetric (if X is equivalent to Y, then Y is equivalent to X), and transitive (if X is equivalent to Y and Y is equivalent to Z, then X is equivalent to Z).

So is the relation "Student X has studied at UCI for the same number of years as student Y" an equivalence relation? It is: every student has studied for the same number of years as themselves; if X has studied for the same number of years as Y, then Y has studied for the same number of years as X; and if X matches Y and Y matches Z in this respect, then X matches Z as well.

Fair enough, but what good is an equivalence relation? Given an equivalence relation, we can divide a collection of objects (and I mean "objects" in the abstract sense, rather than the C++ sense) into equivalence classes, each of which is a group of objects that are all considered equivalent to one another under the equivalence relation. Because the relation is reflexive, symmetric, and transitive, every object ends up in exactly one equivalence class, so the classes neatly partition the collection, which leads to some interesting results.

Granted, this all sounds pretty abstract, but it's surprisingly useful in practice. For example, thinking back to the problem of generating a perfect maze, it turns out that we could recast the problem into one involving equivalence classes: consider two cells equivalent whenever there is a path between them. Initially, with every wall in place, each cell is in its own equivalence class; removing a wall merges the two classes on either side of it; and, because we only remove walls between cells that aren't already connected, the maze becomes perfect exactly when every cell belongs to a single equivalence class.

From an implementation perspective, then, we need three things: a way to place every cell into its own equivalence class when we start, a way to determine which equivalence class a given cell belongs to, and a way to merge two equivalence classes into one.

The Union-Find algorithm is a well-known solution to these problems.


The Union-Find algorithm

Earlier this quarter, when we talked about general trees, we discussed two different implementation strategies. One of them was called the parent pointer implementation, in which each node could tell you its parent, but no node could tell you about any of its children. That probably sounded strange at the time, because there are some often-needed questions that you couldn't answer efficiently with such an implementation, such as which nodes are the children of a given node, or which nodes lie in the subtree rooted at a given node.

The kinds of tree traversals we learned, for example, would be off-limits if we used the parent pointer implementation, because they all revolved around working our way downward, requiring us to know (efficiently!) which nodes were children of a particular node.

But some tree problems are narrow enough that the parent pointer implementation is just what the doctor ordered. Suppose we applied the parent pointer implementation to the problem of maintaining a set of equivalence classes: each equivalence class would be stored as a separate tree, with one node per object, and the root of each tree would serve as the representative that identifies its class.

This data structure is sometimes called a disjoint-set forest, because it's a collection of trees, each of which represents a set of objects that is disjoint from the other sets; each of those disjoint sets can be seen as a single equivalence class. There are two basic operations we would want to perform on a disjoint-set forest: find, which determines which set a given object belongs to (by finding the root of that object's tree), and union, which merges two sets into a single one.

Implementing a disjoint-set forest

The simplest way to implement a disjoint-set forest would be to use an array-based structure, such as a std::vector. Each element of the std::vector would have an index, of course, and would need to store (at least) the index of its parent — or some special value, such as -1, to indicate that it had no parent.
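
As a rough sketch (the names here are purely illustrative, not a required interface), the underlying representation could be as simple as this:

#include <vector>

int main()
{
    int elementCount = 5;    // e.g., five objects, numbered 0 through 4

    // parents[i] holds the index of i's parent, or -1 if i is a root.
    // Initially, every object is the root of its own one-node tree.
    std::vector<int> parents(elementCount, -1);
}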

Given that representation, our find algorithm would start at the given index and work its way upward in the tree (iteratively or recursively), looking for the index of a node with no parent, i.e., the root of that tree.

find(n):
    if n's parent is -1:
        return n
    else:
        return find(n's parent)
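
Assuming the std::vector<int> of parent indices described above (called parents here just for illustration), that pseudocode translates fairly directly into C++:

// Returns the index of the root of the tree containing node n.
int find(const std::vector<int>& parents, int n)
{
    if (parents[n] == -1)
    {
        return n;
    }
    else
    {
        return find(parents, parents[n]);
    }
}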

Meanwhile, our union algorithm would use find to determine the roots of the two trees, then, if different, would make one of those roots into the parent of the other.

union(n1, n2):
    root1 = find(n1)
    root2 = find(n2)

    if root1 != root2:
        make root1's parent be root2
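
A corresponding C++ sketch follows; note that union is a keyword in C++, so a real implementation would need a different name (unionSets is just one choice):

// Merges the trees containing n1 and n2, if they aren't already the same tree.
void unionSets(std::vector<int>& parents, int n1, int n2)
{
    int root1 = find(parents, n1);
    int root2 = find(parents, n2);

    if (root1 != root2)
    {
        parents[root1] = root2;    // make root1's parent be root2
    }
}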

Assuming we had five elements total, we might start with the following values in our std::vector.

index:    0   1   2   3   4
parent:  -1  -1  -1  -1  -1

If we then ran our union algorithm on the nodes 1 and 3, we'd determine that both were their own roots, then make 3 be 1's parent.

index:    0   1   2   3   4
parent:  -1   3  -1  -1  -1

If we then ran our union algorithm again, this time on the nodes 1 and 4, we would discover that 1's root is 3 and 4's root is 4, then make 4 be 3's parent.

index:    0   1   2   3   4
parent:  -1   3  -1   4  -1
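
Using the sketches above, that whole sequence of steps would correspond to something like this:

std::vector<int> parents(5, -1);    // indices 0 through 4, all roots

unionSets(parents, 1, 3);           // parents is now {-1, 3, -1, -1, -1}
unionSets(parents, 1, 4);           // parents is now {-1, 3, -1, 4, -1}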

So, generally, we have the machinery that will let us implement an algorithm where we start with every element in its own equivalence class, then gradually merge them together until only one remains. And that is exactly what we need to solve the perfect maze problem! Revisiting our sketch of a solution from earlier: each cell begins in its own equivalence class; asking whether there is already a path between cells i and j amounts to asking whether find(i) and find(j) return the same root; and removing the wall between them corresponds to calling union(i, j). Once every cell is in a single equivalence class, the maze is perfect and we can stop.
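
Putting it all together, here is one rough sketch of that generator in C++, built on the find and unionSets functions from earlier. The cell numbering, the wall representation, and the removeWall helper are all hypothetical stand-ins, not part of Project #1's actual interface; one common way to realize "choose a wall at random" is to shuffle the list of walls and consider each wall exactly once.

#include <algorithm>
#include <random>
#include <utility>
#include <vector>

int find(const std::vector<int>& parents, int n);           // sketched earlier
void unionSets(std::vector<int>& parents, int n1, int n2);   // sketched earlier
void removeWall(int i, int j);    // hypothetical helper that knocks down the wall between cells i and j

void generateMaze(int cellCount, std::vector<std::pair<int, int>> walls)
{
    // Every cell starts in its own equivalence class.
    std::vector<int> parents(cellCount, -1);

    // Visiting the walls in a shuffled order stands in for repeatedly
    // choosing a not-yet-considered wall at random.
    std::shuffle(walls.begin(), walls.end(), std::mt19937{std::random_device{}()});

    int wallsRemoved = 0;

    for (const auto& [i, j] : walls)
    {
        // If i and j already share a root, there's already a path between
        // them, so removing this wall would create a second path.
        if (find(parents, i) != find(parents, j))
        {
            unionSets(parents, i, j);
            removeWall(i, j);
            ++wallsRemoved;
        }

        // After cellCount - 1 removals, every cell is in the same
        // equivalence class, so the maze is perfect and we can stop early.
        if (wallsRemoved == cellCount - 1)
        {
            break;
        }
    }
}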


Analysis

Overall, the find algorithm is the one that will have the most bearing on our performance; the union algorithm calls find twice, then does a constant amount of additional work to make one of those roots a child of the other. So how fast is find? We could say it takes Θ(d) time, where d is the depth of the node (i.e., how far down it is from the root). So, in general, if we can minimize that depth, we'll be a lot better off.

However, unless we take some care in our implementation, nothing prevents a situation like this one, which you might think of as a "degenerate" disjoint-set forest:

index:    0   1   2   3   4
parent:  -1   0   1   2   3

All of the nodes are in the same set, but a find on the node at index 4 would require visiting every node.

Path compression

The simplest way to avoid this problem is to use a technique called path compression, which revolves around the idea of using an expensive find operation to improve the shape of the tree, so that subsequent find operations become significantly cheaper. In the degenerate example above, a call to find on node 4 discovers that 4's parent is 3, 3's parent is 2, 2's parent is 1, and 1's parent is 0. Ultimately, though, all of these nodes are in the same tree, and any shape in which they all share the same root represents the same equivalence class; since a flatter shape is better, why not change them all to have that root as their parent? The following version of find would accomplish that goal nicely.

find(n):
    if n's parent is -1:
        return n
    else:
        root = find(n's parent)
        make n's parent be root
        return root
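
In C++, again assuming the parents vector from before, this path-compressing version might replace the earlier find sketch like so (note that it now needs to modify the vector):

// Returns the root of n's tree, re-parenting every node visited along
// the way so that it points directly at that root (path compression).
int find(std::vector<int>& parents, int n)
{
    if (parents[n] == -1)
    {
        return n;
    }
    else
    {
        int root = find(parents, parents[n]);
        parents[n] = root;    // make n's parent be the root
        return root;
    }
}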

If we took our degenerate disjoint-set forest and called find(4) on it, it would be updated to look like this instead, which would be a significant improvement.

index:    0   1   2   3   4
parent:  -1   0   0   0   0

The height of the tree dropped from 4 to 1, but the asymptotic performance of this find operation won't have changed; we had to traverse all the way to the top of the tree, anyway, so we might as well use that traversal to make an improvement for the future. Of course, subsequent find operations will be significantly improved by this, so this modification is all upside.

Weighted union

Another technique that can help is to do what's called a weighted union, which means that we track the number of nodes in every tree; then, when we union two trees together, we make the "lighter" of the two (i.e., the one with fewer nodes) a child of the "heavier" one, and update the counts accordingly. If we only stored weights in the root nodes, maintaining them would be inexpensive: all operations involve finding roots anyway, so there would only be one value to update in a union.
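
As one possible sketch (the names are again illustrative), the weights could live in a second vector alongside parents, initialized to 1 for every node, with only the entries at roots kept up to date:

#include <utility>
#include <vector>

// Merges the trees containing n1 and n2, attaching the lighter tree's root
// beneath the heavier tree's root. sizes[r] holds the number of nodes in
// r's tree, and is only meaningful when r is a root.
void unionSets(std::vector<int>& parents, std::vector<int>& sizes, int n1, int n2)
{
    int root1 = find(parents, n1);
    int root2 = find(parents, n2);

    if (root1 != root2)
    {
        if (sizes[root1] > sizes[root2])
        {
            std::swap(root1, root2);     // ensure root1 is the lighter of the two roots
        }

        parents[root1] = root2;          // the lighter tree becomes a child...
        sizes[root2] += sizes[root1];    // ...and the heavier root's count absorbs it
    }
}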

The iterated logarithm

While the analysis goes beyond the scope of this course, it can be shown that combining these two techniques leads to an amortized running time for find, on a disjoint-set forest containing n nodes, of Θ(log* n). log* n is what's called an iterated logarithm, which measures how many times a logarithm must be applied to n before the result is less than or equal to 1. Assuming we're using base-2 logarithms, for example, log* 65536 = 4, because log₂ 65536 = 16, log₂ 16 = 4, log₂ 4 = 2, and log₂ 2 = 1; it took four iterations of the logarithm to get us from 65536 to 1.
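
Just to make the definition concrete, here's a tiny function (purely illustrative, not something we'd need in practice) that counts those iterations:

#include <cmath>

// Counts how many times log2 must be applied to n before the
// result drops to 1 or below.
int iteratedLog2(double n)
{
    int count = 0;

    while (n > 1.0)
    {
        n = std::log2(n);
        ++count;
    }

    return count;
}

// For example, iteratedLog2(65536) == 4: 65536 -> 16 -> 4 -> 2 -> 1.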

As you'd expect, Θ(log* n) grows very slowly as n grows (log* 999,999,999,999 is still only 5, since log₂ 999,999,999,999 is around 39.8), so the cost of each find operation in a series of find operations is nearly as good as Θ(1), which is a great result for an algorithm that's as simple as this one.