ICS 46 Spring 2022
Notes and Examples: The Union-Find Algorithm


Thinking back to our maze generator

Consider the maze generator we implemented in Project #1. Our goal was to generate a perfect maze, which we defined as a maze in which there was exactly one path connecting each pair of cells; in other words, if you chose two cells in the maze, there would always be exactly one way to get from one to the other (and the only way to get back would be to follow the same steps in reverse). We asked you to solve the problem in a particular way: a depth-first, "tunnel-digging" algorithm, which had a couple of pedagogical upsides.

However, we provided another maze generator that used a very different approach, which boiled down to this:

start with all possible walls in place

while the maze is not perfect:
    choose a wall at random

    let i and j be the cells the wall sits between

    if there is already a path between i and j:
        do nothing
    else:
        remove the wall

Of course, this description leaves a number of open questions that we'd need to consider more carefully if we wanted to actually be able to implement this algorithm.

One thing we've learned in this course, though, is that we can often take what seems like a complicated problem and recast it as a simpler one. Sometimes, there's a fair amount of conceptual distance between the problem and the technique we choose to solve it; the finesse is in learning to let go of the details and think about the underlying realities of the problem you're trying to solve, which can open up possibilities that are otherwise quite difficult to see. In this set of notes, we'll explore how you might think about this problem very differently than you might have in Project #1, yet nonetheless arrive at a simple and performant algorithm for solving it.

But, first, we'll need to consider some math and some computer science theory that we've not yet seen.


Equivalence classes

We say that an equivalence relation specifies whether any two objects are considered equivalent to each other in some fashion. Equivalence relations must have three characteristics, which will sound generally familiar if you've studied algebra in your past: they are reflexive (every object is equivalent to itself), symmetric (if X is equivalent to Y, then Y is equivalent to X), and transitive (if X is equivalent to Y and Y is equivalent to Z, then X is equivalent to Z).

So is the relation "Student X has studied at UCI for the same number of years as student Y" an equivalence relation? It is: every student has studied for the same number of years as themselves; if X has studied for the same number of years as Y, then Y has studied for the same number of years as X; and if X matches Y and Y matches Z in this respect, then X matches Z as well.

Fair enough, but what good is an equivalence relation? Given an equivalence relation, we can divide a collection of objects (and I mean "objects" in the abstract sense, rather than the C++ sense) into equivalence classes, each of which is a group of objects that are all considered equivalent to one another under the equivalence relation. Because the relation is reflexive, symmetric, and transitive, every object ends up in exactly one equivalence class, so the classes neatly partition the collection, which leads to some interesting results.

Granted, this all sounds pretty abstract, but it's surprisingly useful in practice. For example, thinking back to the problem of generating a perfect maze, it turns out that we could recast the problem into one involving equivalence classes: consider two cells equivalent whenever there is a path between them. Initially, with every wall in place, each cell is in its own equivalence class; removing a wall merges the two classes on either side of it; and, because we only remove walls between cells that aren't already connected, the maze becomes perfect exactly when every cell belongs to a single equivalence class.

From an implementation perspective, then, we need three things: a way to place every cell into its own equivalence class when we start, a way to determine which equivalence class a given cell belongs to, and a way to merge two equivalence classes into one.

The Union-Find algorithm is a well-known solution to these problems.


The Union-Find algorithm

Earlier this quarter, when we talked about general trees, we discussed two different implementation strategies. One of them was called the parent pointer implementation, in which each node could tell you its parent, but no node could tell you about any of its children. That probably sounded strange at the time, because there are some often-needed questions that you couldn't answer efficiently with such an implementation, such as which nodes are the children of a given node, or which nodes lie in the subtree rooted at a given node.

The kinds of tree traversals we learned, for example, would be off-limits if we used the parent pointer implementation, because they all revolved around working our way downward, requiring us to know (efficiently!) which nodes were children of a particular node.

But some tree problems are narrow enough that the parent pointer implementation is just what the doctor ordered. Suppose we applied the parent pointer implementation to the problem of maintaining a set of equivalence classes: each equivalence class would be stored as a separate tree, with one node per object, and the root of each tree would serve as the representative that identifies its class.

This data structure is sometimes called a disjoint-set forest, because it's a collection of trees, each of which represents a set of objects that is disjoint from the other sets; each of those disjoint sets can be seen as a single equivalence class. There are two basic operations we would want to perform on a disjoint-set forest: find, which determines which set a given object belongs to (by finding the root of that object's tree), and union, which merges two sets into a single one.

Implementing a disjoint-set forest

The simplest way to implement a disjoint-set forest would be to use an array-based structure, such as a std::vector. Each element of the std::vector would have an index, of course, and would need to store (at least) the index of its parent — or some special value, such as -1, to indicate that it had no parent.
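
As a rough sketch (the names here are purely illustrative, not a required interface), the underlying representation could be as simple as this:

#include <vector>

int main()
{
    int elementCount = 5;    // e.g., five objects, numbered 0 through 4

    // parents[i] holds the index of i's parent, or -1 if i is a root.
    // Initially, every object is the root of its own one-node tree.
    std::vector<int> parents(elementCount, -1);
}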

Given that representation, our find algorithm would start at the given index and work its way upward in the tree (iteratively or recursively), looking for the index of a node with no parent, i.e., the root of that tree.

find(n):
    if n's parent is -1:
        return n
    else:
        return find(n's parent)
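
Assuming the std::vector<int> of parent indices described above (called parents here just for illustration), that pseudocode translates fairly directly into C++:

// Returns the index of the root of the tree containing node n.
int find(const std::vector<int>& parents, int n)
{
    if (parents[n] == -1)
    {
        return n;
    }
    else
    {
        return find(parents, parents[n]);
    }
}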

Meanwhile, our union algorithm would use find to determine the roots of the two trees, then, if different, would make one of those roots into the parent of the other.

union(n1, n2):
    root1 = find(n1)
    root2 = find(n2)

    if root1 != root2:
        make root1's parent be root2
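
A corresponding C++ sketch follows; note that union is a keyword in C++, so a real implementation would need a different name (unionSets is just one choice):

// Merges the trees containing n1 and n2, if they aren't already the same tree.
void unionSets(std::vector<int>& parents, int n1, int n2)
{
    int root1 = find(parents, n1);
    int root2 = find(parents, n2);

    if (root1 != root2)
    {
        parents[root1] = root2;    // make root1's parent be root2
    }
}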

Assuming we had five elements total, we might start with the following values in our std::vector.

index:    0   1   2   3   4
parent:  -1  -1  -1  -1  -1

If we then ran our union algorithm on the nodes 1 and 3, we'd determine that both were their own roots, then make 3 be 1's parent.

index:    0   1   2   3   4
parent:  -1   3  -1  -1  -1

If we then ran our union algorithm again, this time on the nodes 1 and 4, we would discover that 1's root is 3 and 4's root is 4, then make 4 be 3's parent.

index:    0   1   2   3   4
parent:  -1   3  -1   4  -1
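
Using the sketches above, that whole sequence of steps would correspond to something like this:

std::vector<int> parents(5, -1);    // indices 0 through 4, all roots

unionSets(parents, 1, 3);           // parents is now {-1, 3, -1, -1, -1}
unionSets(parents, 1, 4);           // parents is now {-1, 3, -1, 4, -1}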

So, generally, we have the machinery that will let us implement an algorithm where we start with every element in its own equivalence class, then gradually merge them together until only one remains. And that is exactly what we need to solve the perfect maze problem! Revisiting our sketch of a solution from earlier: each cell begins in its own equivalence class; asking whether there is already a path between cells i and j amounts to asking whether find(i) and find(j) return the same root; and removing the wall between them corresponds to calling union(i, j). Once every cell is in a single equivalence class, the maze is perfect and we can stop.
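
Putting it all together, here is one rough sketch of that generator in C++, built on the find and unionSets functions from earlier. The cell numbering, the wall representation, and the removeWall helper are all hypothetical stand-ins, not part of Project #1's actual interface; one common way to realize "choose a wall at random" is to shuffle the list of walls and consider each wall exactly once.

#include <algorithm>
#include <random>
#include <utility>
#include <vector>

int find(const std::vector<int>& parents, int n);           // sketched earlier
void unionSets(std::vector<int>& parents, int n1, int n2);   // sketched earlier
void removeWall(int i, int j);    // hypothetical helper that knocks down the wall between cells i and j

void generateMaze(int cellCount, std::vector<std::pair<int, int>> walls)
{
    // Every cell starts in its own equivalence class.
    std::vector<int> parents(cellCount, -1);

    // Visiting the walls in a shuffled order stands in for repeatedly
    // choosing a not-yet-considered wall at random.
    std::shuffle(walls.begin(), walls.end(), std::mt19937{std::random_device{}()});

    int wallsRemoved = 0;

    for (const auto& [i, j] : walls)
    {
        // If i and j already share a root, there's already a path between
        // them, so removing this wall would create a second path.
        if (find(parents, i) != find(parents, j))
        {
            unionSets(parents, i, j);
            removeWall(i, j);
            ++wallsRemoved;
        }

        // After cellCount - 1 removals, every cell is in the same
        // equivalence class, so the maze is perfect and we can stop early.
        if (wallsRemoved == cellCount - 1)
        {
            break;
        }
    }
}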


Analysis

Overall, the find algorithm is the one that will have the most bearing on our performance; the union algorithm calls find twice, then does a constant amount of additional work to make one of those roots a child of the other. So how fast is find? We could say it takes Θ(d) time, where d is the depth of the node (i.e., how far down it is from the root). So, in general, if we can minimize that depth, we'll be a lot better off.

However, unless we take some care in our implementation, nothing prevents a situation like this one, which you might think of as a "degenerate" disjoint-set forest:

index:    0   1   2   3   4
parent:  -1   0   1   2   3

All of the nodes are in the same set, but a find on the node at index 4 would require visiting every node.

Path compression

The simplest way to avoid this problem is to use a technique called path compression, which revolves around the idea of using an expensive find operation to improve the shape of the tree, so that subsequent find operations become significantly cheaper. In the degenerate example above, a call to find on node 4 discovers that 4's parent is 3, 3's parent is 2, 2's parent is 1, and 1's parent is 0. Ultimately, though, all of these nodes are in the same tree, and any shape in which they all share the same root represents the same equivalence class; since a flatter shape is better, why not change them all to have that root as their parent? The following version of find would accomplish that goal nicely.

find(n):
    if n's parent is -1:
        return n
    else:
        root = find(n's parent)
        make n's parent be root
        return root
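
In C++, again assuming the parents vector from before, this path-compressing version might replace the earlier find sketch like so (note that it now needs to modify the vector):

// Returns the root of n's tree, re-parenting every node visited along
// the way so that it points directly at that root (path compression).
int find(std::vector<int>& parents, int n)
{
    if (parents[n] == -1)
    {
        return n;
    }
    else
    {
        int root = find(parents, parents[n]);
        parents[n] = root;    // make n's parent be the root
        return root;
    }
}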

If we took our degenerate disjoint-set forest and called find(4) on it, it would be updated to look like this instead, which would be a significant improvement.

index:    0   1   2   3   4
parent:  -1   0   0   0   0

The height of the tree dropped from 4 to 1, but the asymptotic performance of this find operation won't have changed; we had to traverse all the way to the top of the tree, anyway, so we might as well use that traversal to make an improvement for the future. Of course, subsequent find operations will be significantly improved by this, so this modification is all upside.

Weighted union

Another technique that can help is to do what's called a weighted union, which means that we track the number of nodes in every tree; then, when we union two trees together, we make the "lighter" of the two (i.e., the one with fewer nodes) a child of the "heavier" one, and update the counts accordingly. If we only stored weights in the root nodes, maintaining them would be inexpensive: all operations involve finding roots anyway, so there would only be one value to update in a union.
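
As one possible sketch (the names are again illustrative), the weights could live in a second vector alongside parents, initialized to 1 for every node, with only the entries at roots kept up to date:

#include <utility>
#include <vector>

// Merges the trees containing n1 and n2, attaching the lighter tree's root
// beneath the heavier tree's root. sizes[r] holds the number of nodes in
// r's tree, and is only meaningful when r is a root.
void unionSets(std::vector<int>& parents, std::vector<int>& sizes, int n1, int n2)
{
    int root1 = find(parents, n1);
    int root2 = find(parents, n2);

    if (root1 != root2)
    {
        if (sizes[root1] > sizes[root2])
        {
            std::swap(root1, root2);     // ensure root1 is the lighter of the two roots
        }

        parents[root1] = root2;          // the lighter tree becomes a child...
        sizes[root2] += sizes[root1];    // ...and the heavier root's count absorbs it
    }
}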

The iterated logarithm

While the analysis goes beyond the scope of this course, it can be shown that combining these two techniques leads to an amortized running time for find, on a disjoint-set forest containing n nodes, of Θ(log* n). log* n is what's called an iterated logarithm, which measures how many times a logarithm must be applied to n before the result is less than or equal to 1. Assuming we're using base-2 logarithms, for example, log* 65536 = 4, because log₂ 65536 = 16, log₂ 16 = 4, log₂ 4 = 2, and log₂ 2 = 1; it took four iterations of the logarithm to get us from 65536 to 1.
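
Just to make the definition concrete, here's a tiny function (purely illustrative, not something we'd need in practice) that counts those iterations:

#include <cmath>

// Counts how many times log2 must be applied to n before the
// result drops to 1 or below.
int iteratedLog2(double n)
{
    int count = 0;

    while (n > 1.0)
    {
        n = std::log2(n);
        ++count;
    }

    return count;
}

// For example, iteratedLog2(65536) == 4: 65536 -> 16 -> 4 -> 2 -> 1.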

As you'd expect, Θ(log* n) grows very slowly as n grows (log* 999,999,999,999 is still only 5, since log₂ 999,999,999,999 is around 39.8), so the cost of each find operation in a series of find operations is nearly as good as Θ(1), which is a great result for an algorithm that's as simple as this one.