ICS 46 Spring 2022
Notes and Examples: General Trees


What are trees?

So far, we've mainly studied data structures that are either one-dimensional — such as arrays and linked lists — or that are multidimensional, but shaped as a rectangle (in two dimensions), rectangular prism (in three dimensions), and so on. Lots of real-world collections of data can be organized this way, and many algorithms store their temporary results in such structures, as well. But it's certainly not the case that this is the best choice for all data, so we should understand what some of the alternatives are, and when we could best use them.

We'll start with a data structure called a tree, which introduces the notion of hierarchy, which is to say that some elements lie "above" collections of others, which, in turn, lie "above" still others. While there are lots of variants of trees, each of which solves different kinds of problems well, we'll start by defining trees generally.

A tree is either:

- empty, containing no nodes at all, or
- a node, called the root, together with zero or more disjoint subtrees, each of which is itself a tree.

Of course, trees are data structures, so each node, in addition to being partly responsible for holding the tree together structurally, also holds a data element. The tree's contents are the set of all data elements held in all of the nodes. Note, too, the presence of the word "disjoint" in the definition, which is an important detail here: The disjointness of the subtrees means that no node will be in two subtrees at the same time. This assumption will simplify many tree algorithms.
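The recursive definition can be transcribed almost directly into C++. The sketch below is illustrative, not a required interface; the names Node, element, and subtrees are assumptions. Using std::unique_ptr for the subtrees encodes the disjointness requirement directly: each node is owned by exactly one parent, so no node can appear in two subtrees at once.

```cpp
#include <memory>
#include <string>
#include <vector>

// A sketch of the recursive definition above.  A tree is either empty
// (represented here by a null pointer) or a node holding a data element
// together with zero or more subtrees.  Ownership via unique_ptr means
// each node belongs to exactly one parent, keeping the subtrees disjoint.
struct Node
{
    std::string element;
    std::vector<std::unique_ptr<Node>> subtrees;
};
```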


Terminology

Before we can start learning about trees in depth, it's a good idea if we first agree on some terminology that we can use to describe them. This terminology will recur throughout our discussion of trees, so best that we understand what all of these terms mean.

Classically, trees are drawn this way:

(Figure: a general tree, whose nodes are labeled with the letters used in the discussion below; its root is X.)

Each circle in this picture is called a node. The nodes are connected by lines, which denote the way that the nodes are related to one another in the tree's hierarchy. In particular, we see that one node lies "above" one or more others that are connected to it.

When a line is drawn between nodes in this way, it represents that each node below is the root of a subtree belonging to the node above. So, for example, Q is the root of one of X's subtrees; M is the root of one of F's subtrees; and so on. Describing the relationship this way is a bit of a mouthful, so we often simplify it by speaking in terms of parent and child. For example, we would say that M is a child of F, and that X is the parent of D. Due to the way trees are defined, a node can have any number of children, but at most one parent. Not surprisingly, the genealogical terminology extends in obvious ways, so we would say that X is the grandparent of S; that S is a grandchild of X; that L is a descendant of F; that F is an ancestor of L; and that Q, F, and D are all siblings (since they share the same parent).

Not all terminology for describing trees is genealogical; some of it is botanical, borrowed from the kinds of trees that grow out of the ground. First of all, we say that X is the root of the entire tree. (Note that roots are special, in that they are the only nodes in a tree that have no parent.) This may seem at odds with the recursive definition of a tree, under which every node appears to be a root; the resolution is that every node is the root of its own subtree, while the tree as a whole has exactly one root, X. We say, also, that C, N, R, S, L, and H are leaves: the leaves are exactly the nodes that have no children. Other terms that we may see later, such as branches and branching factor, approach trees from this same point of view.

As with other data structures and algorithms we've studied so far, we'll need to focus our attention not only on how trees work, but also on how well they work. To that end, we'll need to understand what aspects of a tree we might like to measure, and how we'd like to measure them. Among the terms we'll use in that context are the size of a tree (the number of nodes it contains), the depth of a node (the length of the path from the root down to that node), and the height of a tree (the length of the longest path from its root down to any leaf).


Implementing general trees

Of course, we can draw all the diagrams and introduce all the terminology we want, but the ultimate goal is that we're able to actually implement such a data structure, so we should also consider how we could do something like that. Hearkening back to our conversation about linked data structures, we could certainly imagine that each node is an object, and that the objects are linked by pointers; the question, though, is how we arrange the objects, and which ones we maintain pointers to. While there is a variety of ways you could do this, they lie along a spectrum with the following two approaches at the extremes.

The parent pointer implementation

One way to implement a tree, which may seem counterintuitive, but which has its uses, is called the parent pointer implementation. Conceptually, the notion is simple: every node can tell you which node is its parent. There still has to be a way to find the individual nodes, though, and this is typically done in one of a couple of ways:

- by storing all of the nodes in a linear data structure, such as an array or a vector, that can be iterated through whenever particular nodes need to be found, or
- by handing back a pointer (or an index) to each node as it's created, so that whoever is using the tree can keep track of the nodes of interest.

Using an approach like this, some operations can be very efficient, while others can be quite inefficient. Whether we choose to use this kind of approach depends very much on what operations we actually need. If we need operations that will be expensive, it might be better to use a different approach.

For example, given a node, it is difficult to find that node's children. If each node can tell you where its parent is, but not where its children are, then moving "downward" in a tree from a node to its children leaves us no choice but to search the entire collection of nodes, looking for those whose parent is the node in question. And even after we've found one child, we can never be sure there aren't more, so we'll always have to examine every node. Provided that the nodes are stored in a linear data structure that allows us to iterate through them, this takes Θ(n) time, where n is the number of nodes in the tree.
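A minimal sketch of this child search, assuming one particular arrangement: all nodes live in a single vector, and each node records the index of its parent. The names PPNode, childrenOf, and NO_PARENT are illustrative, not anything the notes prescribe.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sentinel marking a node with no parent (i.e., a root).
constexpr std::size_t NO_PARENT = static_cast<std::size_t>(-1);

// Parent pointer representation: each node knows only its parent's index.
struct PPNode
{
    std::string element;
    std::size_t parent;   // index of the parent node, or NO_PARENT
};

// Finding the children of node i requires scanning every node, because
// nothing ever tells us that we've seen the last child.  With n nodes,
// this takes Θ(n) time.
std::vector<std::size_t> childrenOf(const std::vector<PPNode>& nodes, std::size_t i)
{
    std::vector<std::size_t> result;

    for (std::size_t j = 0; j < nodes.size(); ++j)
    {
        if (nodes[j].parent == i)
            result.push_back(j);
    }

    return result;
}
```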

On the other hand, some operations will be efficient. For example, given two nodes, it's easy to figure out whether they're siblings, by simply comparing their parent pointers. It's also relatively straightforward to find the root of the tree, by simply following parent pointers upward. By extension, it's easy to discover whether two nodes are in the same tree — one thing that the "parent pointer" implementation is often used for is to store a collection of trees called a forest, instead of just a single one — by simply finding the two nodes' roots; if they're in the same tree, they'll have the same root. (This algorithm is actually at the heart of one of the maze generators provided in Project #1.) Such an operation would take O(max(h1, h2)) time, where h1 and h2 are the heights of the two nodes' trees.
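The root-finding and same-tree checks can be sketched briefly. This assumes the parent pointers are stored as indices in a single vector covering a whole forest; the names findRoot, inSameTree, and NO_PARENT are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Sentinel marking a root: a node with no parent.
constexpr std::size_t NO_PARENT = static_cast<std::size_t>(-1);

// Follow parent pointers upward until a root is reached.  This takes
// O(h) steps, where h is the height of the node's tree.
std::size_t findRoot(const std::vector<std::size_t>& parent, std::size_t i)
{
    while (parent[i] != NO_PARENT)
        i = parent[i];
    return i;
}

// Two nodes are in the same tree exactly when they share a root, so
// this check costs O(max(h1, h2)) time.
bool inSameTree(const std::vector<std::size_t>& parent, std::size_t a, std::size_t b)
{
    return findRoot(parent, a) == findRoot(parent, b);
}
```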

The list of children implementation

An alternative implementation, which has broader use because it makes the opposite tradeoff, is called the list of children implementation, so named because every node keeps track of where its children are. In C++ terms, each node is stored in an object (say, of a Node struct or class) containing both a data element and a collection of pointers to its child nodes; the collection of pointers would presumably be stored in some kind of linear data structure, such as an array, a vector, or a linked list.
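The node just described might look like this in C++; the names Node, element, and children are illustrative assumptions, and a vector is used here where an array or a linked list could fill the same role.

```cpp
#include <string>
#include <vector>

// List of children representation: each node stores a data element
// alongside a linear collection of pointers to its child nodes.
struct Node
{
    std::string element;
    std::vector<Node*> children;
};
```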

First things first: How would we choose a linear data structure in which to store the children?

One upside of the list of children implementation is that you need only store a pointer to the root node, and you'll then be able to reach every other node in the tree; every other pointer you need will be stored in one of the nodes.
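The property above can be demonstrated with a small traversal, sketched under the same assumptions as before (the names Node and countNodes are illustrative): starting from only the root pointer, recursion through the stored child pointers reaches every node.

```cpp
#include <string>
#include <vector>

// List of children node: each node stores pointers to its children.
struct Node
{
    std::string element;
    std::vector<Node*> children;
};

// Count every node by visiting the root, then recursing into each subtree.
int countNodes(const Node* root)
{
    if (root == nullptr)
        return 0;

    int total = 1;                      // this node...
    for (const Node* child : root->children)
        total += countNodes(child);     // ...plus everything below it
    return total;
}
```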

Of course, if you have a pointer to some node and you want a pointer to its parent, you're out of luck; this implementation won't give it to you. Note, though, that you could add a parent pointer to every node, as well, if that's something you needed frequently.
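A hybrid along those lines might be sketched as follows; the names Node and attachChild are illustrative. The key cost is bookkeeping: both links have to be kept consistent whenever the tree changes shape.

```cpp
#include <string>
#include <vector>

// A list of children node augmented with a parent pointer, so the tree
// can be walked both downward and upward from any node.
struct Node
{
    std::string element;
    Node* parent = nullptr;          // nullptr marks the root
    std::vector<Node*> children;
};

// Attaching a child must update both directions of the relationship.
void attachChild(Node& parent, Node& child)
{
    child.parent = &parent;
    parent.children.push_back(&child);
}
```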