ICS 46 Spring 2022
Notes and Examples: Graphs


Where trees fall short

We've spent a lot of time in recent weeks discussing various uses of trees. Trees represent their data as a hierarchy. Sometimes, we do this because the data itself is inherently hierarchical, such as the chain of command in a lot of organizations, or the way that files are organized into directories and subdirectories on a filesystem. Other times, we use the tree's hierarchical organization to store data in a way that makes it easier to search or manipulate, as we did when we talked about binary search trees and binary min heaps. But even though we've seen examples where a tree's hierarchy can be used to efficiently represent non-hierarchical data, it's not the case that we'll always be able to do that efficiently. Binary search trees worked because we had a very particular problem: searching for data given a unique key. Binary min heaps, similarly, worked because the problem was special: organizing data by a key, with the minimum key always in a convenient place.

So what's an example of data that isn't well-represented in a tree structure? Consider a social/interpersonal network like Facebook or LinkedIn. While these sites let us do a lot of different things, their most fundamental goal is to keep track of a very large collection of people, with each of those people being "linked" to some of the others. Knowing how each person relates to each other in this way, by itself, gives these sites a lot of power; understanding who people are and how they relate to one another leads to all kinds of useful features, as well as the ability to more effectively advertise to them or sell them services outright.

When we consider the way that people can be linked together in a social network, we realize that the set of relationships simply isn't hierarchical; you just couldn't reasonably store this information in a tree and then make sense out of it. There are two things that can happen in a social network that can't happen in a tree.

- Splits and joins. In a tree, relationships only "split" as you move downward: a node can have many children, but every node other than the root has exactly one parent. In a social network, two different chains of connections can "join" back together at the same person, which a hierarchy can't express.
- Cycles. In a social network, following connections from person to person can eventually lead you back to the person you started from; following parent-to-child links in a tree can never return you to a node you've already visited.

So any attempt to try to store the data describing the relationships between people in a tree structure is going to either be impossible, or it will need to be so contorted that it will lose its meaning. We'd be better off using a data structure that's designed to store this more open-ended kind of data. When we want to store information about things and the direct relationships between those things — a surprisingly common pattern in computing — what we want is something called a graph.


What is a graph?

Because the problem of storing objects and an open-ended set of their direct relationships is so common in computing, it's not a bad idea to come up with a single solution to it. Many real-world data sets can be represented as graphs, and many problems you might want to solve can be understood to be examples of existing, well-known graph algorithms. Understanding graphs and graph algorithms means that we already know how to solve a wide variety of problems, which is a powerful position to be in. But first things first: What's a graph? It turns out that they come in a few slightly different flavors, and it's wise for us to understand what choices are available to us before we go too much further.

Directed graphs

A directed graph (or digraph) consists of two things:

- A set of vertices.
- A set of edges, each of which is an ordered pair of vertices, with the edge (a, b) denoting a link from the vertex a to the vertex b.

From a structural perspective, a directed graph might be thought of in terms of these two sets. For example, the directed graph G1 might be written this way:

V = {a, b, c}
E = {(a, b), (a, c), (b, a), (b, b), (b, c), (c, c)}

This directed graph consists of three vertices, which we're calling a, b, and c, along with a collection of six edges linking those vertices together in various ways. There are a couple of things worth noting here, which are implied by the definition, but easy to miss if you don't stop and think about them a bit:

- Because the edges are ordered pairs, (a, b) and (b, a) are two different edges; either can exist without the other, or both can exist at once, as they do in G1.
- Nothing in the definition prevents an edge from connecting a vertex to itself, so self-loops like (b, b) and (c, c) are legal.
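If you'd like to see that long-hand form made concrete, here's a minimal C++ sketch (purely illustrative, not a practical graph implementation) that renders G1's two sets directly, using characters to stand in for the vertices:

#include <set>
#include <utility>

int main()
{
    // The set V of vertices.
    std::set<char> vertices{'a', 'b', 'c'};

    // The set E of edges. Each edge is an ordered pair, which is why
    // (a, b) and (b, a) are two distinct edges, and why self-loops
    // like (b, b) and (c, c) are representable.
    std::set<std::pair<char, char>> edges{
        {'a', 'b'}, {'a', 'c'},
        {'b', 'a'}, {'b', 'b'}, {'b', 'c'},
        {'c', 'c'}
    };
}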

While we could write graphs out in the long-hand form you see above, they're usually a lot easier for people to understand if presented visually. A visual representation of that same graph G1 would best be drawn like this:

[Figure: the directed graph G1]

We represent each vertex as a circle and each edge as an arrow connecting two of the vertices. The reason we use an arrow instead of a line is to make clear the directionality of the edges.

From a practical perspective, how might you use a directed graph? Let's consider our social network example again. If we wanted to keep track of people and the connections between people (i.e., Person A lists Person B as a friend/connection), a directed graph would be a good way to do that. Each person would be represented by a vertex, and each connection between people would be represented as an edge. A directed graph makes sense here because these kinds of connections in social networks are often directional; just because I connect to you doesn't mean that you connect to me.

Directed graph terminology

Just as we did when we first learned about trees, we should agree on some terminology to use when discussing graphs. The most basic terminology — vertex and edge — has been defined already, but there are other things we'll need to be able to say about graphs, so it's a good idea that we all say them the same way. Consider the directed graph G1 above as we work our way through the various terms.

- An edge (a, b) is an outgoing edge of the vertex a and an incoming edge of the vertex b.
- The out-degree of a vertex is the number of its outgoing edges; its in-degree is the number of its incoming edges. In G1, for example, the vertex b has an out-degree of 3 and an in-degree of 2.
- A path is a sequence of vertices in which an edge leads from each vertex to the next.
- A cycle is a path that begins and ends at the same vertex. In G1, for example, you can follow edges from a to b and back to a.

Directed acyclic graphs (DAGs)

While directed graphs are capable of representing a set of relationships that has cycles in it, it's not always the case that the problem we're trying to solve will allow them. For example, you've no doubt seen before that it's possible, in C++, for one class to inherit from another class. What you may not have seen is that it's actually possible for a class to inherit from multiple other classes. So, as an example, consider this skeletal C++ code.

class A { ... };
class B : public A { ... };
class C { ... };
class D : public C { ... };
class E : public B, public D { ... };
class F : public E { ... };

One of the primary jobs of a C++ compiler is to parse the code that's been written and infer its meaning. One aspect of that would be to understand the inheritance relationships between the classes, and then to use that information later in the compilation process. With respect to inheritance relationships, it would need to understand details like this:

- B inherits directly from A, and D inherits directly from C.
- E inherits directly from two classes, B and D, and indirectly from A and C.
- F inherits directly from E, and indirectly from all of the others.

To do that, it would be necessary for the compiler to build a data structure internally that tracks inheritance relationships. Because C++ allows multiple inheritance — one class that inherits from more than one other class — a tree would not be appropriate, but a directed graph surely would be. (If it's not clear to you why a tree would not be appropriate, draw a circle for each of the classes above, then each time any class inherits from any other class, draw an arrow connecting the base class to the derived one. Notice that there exist what we called "splits and joins" earlier. The relationships are not solely hierarchical.)
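For concreteness, here's a hypothetical sketch of how such a tool might record these relationships internally (this is illustrative, not how any real compiler does it): a map from each class's name to the names of the classes that inherit directly from it, which is exactly a directed graph with an arrow from each base class to each derived class.

#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical: derivedClasses[base] lists the classes that inherit
// directly from base. Each entry is one arrow in the directed graph.
std::unordered_map<std::string, std::vector<std::string>> derivedClasses{
    {"A", {"B"}},
    {"B", {"E"}},
    {"C", {"D"}},
    {"D", {"E"}},
    {"E", {"F"}}
};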

There's one thing we know for sure about a legal set of inheritance relationships: They cannot be cyclic. For example, this code would be illegal:

class X : public Z { ... };
class Y : public X { ... };
class Z : public Y { ... };

It would be illegal for the simple reason that it would not be clear what any of these objects would look like. Inside every X is a Z, in which you'd find a Y, in which you'd find an X, in which you'd find a Z, in which you'd find a Y, and so on — forever! A C++ compiler would disallow this code and give an error message. This means two things:

- A compiler has to be able to detect cycles in the inheritance relationships, so that it can reject code like this.
- In any program that compiles successfully, the directed graph of inheritance relationships is guaranteed to contain no cycles.

When we have a directed graph that we know not to have cycles, we often call it a directed acyclic graph, or refer to it by the acronym DAG. Knowing by the nature of the problem you're solving that you have a directed acyclic graph will enable you to do things — run certain algorithms, make certain assumptions — that you would otherwise be unable to do.

Undirected graphs

An undirected graph consists of two things:

- A set of vertices.
- A set of edges, each of which is a two-element set of vertices, with the edge {a, b} denoting a link between the vertices a and b.

You may notice that this definition is quite similar to the definition of a directed graph, with the only difference being that the edges are "two-element sets" of vertices, rather than being "ordered pairs" of vertices. That one difference tells you everything you need to know, structurally, about the distinction between undirected and directed graphs:

- Because {a, b} and {b, a} are the same set, an undirected edge has no direction; it simply connects its two vertices to each other.
- Because a set can't contain the same element twice, an undirected graph (at least by this definition) can't have self-loops the way a directed graph can.

As we did with directed graphs, we can describe undirected graphs in a mathematical notation. The undirected graph G2 might be written this way:

V = {a, b, c}
E = {{a, b}, {a, c}, {b, c}}

However, many people tend to prefer to see this denoted visually, so a visual representation of that same graph G2 might instead be drawn like this:

[Figure: the undirected graph G2]

As in directed graphs, we draw each vertex of an undirected graph as a circle, while each edge is drawn as a line connecting two vertices. There's no arrowhead on the line, because the edges have no directionality.

Practically speaking, we need an undirected graph when we want to represent a real-world collection of things and connections between those things, where those connections are not directional. One example would be to represent all of the ways that someone could walk from one place to another on the campus of UCI. Each place you might want to go — say, each building — could be represented by a vertex, while each edge would represent a stretch of sidewalk or other kind of walkway. Because walkways are not directional, it would be best to represent these edges without direction, so we might well want to use an undirected graph to represent this kind of information.

Differences in undirected graph terminology

The terminology we discussed previously for directed graphs is all usable when talking about undirected graphs, though there is one refinement we can make. Since the edges in an undirected graph have no direction, there's no distinction between outgoing and incoming edges; all edges are both. That means there's little reason to distinguish between the in-degree and out-degree of a vertex, since they will always be equal. So we'll add one term to our toolbox of terms:

- The degree of a vertex is the number of edges that connect to it.

Analyzing graphs and graph algorithms

Before we consider how graphs are implemented and perform some analysis on our choices, we first need to consider how we might perform analysis on graphs at all. What are the variables at work here? Two variables, which can vary somewhat independently, become clear right away:

- the number of vertices in the graph
- the number of edges in the graph

Notably absent here is the variable n; unlike many data structures, graphs don't offer a single, obvious notion of a variable that describes their size, since the number of vertices might be much larger than the number of edges or vice versa. So we'll be clear by using two variables whose names represent what they are. Our analyses will generally be in terms of v, the number of vertices, and/or e, the number of edges.


Implementing graphs

Now that we've seen what graphs are, and what kinds of problems they might be used to solve, we now turn our attention to how you might actually implement one. Generally, we have two things we need to keep track of: vertices and edges. So where do we store them? How do we arrange them so we can efficiently do the things we need to do with them?

The first thing we need to understand is that graphs are data structures. Part of what they need to convey is the existence of vertices and edges. But they also tend to convey additional information about those vertices and/or those edges. Each vertex, then, might have an object — or member variables in an object, let's say — that represents information about it. Similarly, each edge might also have an object that represents information about it. For example, if we were talking about a graph representing the cost of airplane flights between airports throughout the continental United States, we might want to represent a few things:

- the airports, each of which would be a vertex
- for each airport, information such as its name
- the flights between airports, each of which would be an edge
- for each flight, information such as its cost

So, when we investigate ways to implement graphs, we'll need to consider where to put this kind of information, in addition to whatever we need to know about the graph structurally (i.e., the existence of vertices and edges).
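For example, here's a minimal sketch of what those per-vertex and per-edge objects might look like for the flight graph; the names AirportInfo and FlightInfo are hypothetical, just to make the idea concrete:

#include <string>

// Hypothetical: the information we might attach to each vertex.
struct AirportInfo
{
    std::string name;
};

// Hypothetical: the information we might attach to each edge.
struct FlightInfo
{
    double cost;
};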

Naturally, there are a number of subtly different ways we might implement graphs, but they tend to fall into two basic categories, which we'll focus on here: an adjacency matrix and adjacency lists.

Adjacency matrix

An adjacency matrix stores most of a graph's representation in a two-dimensional array (or similar array-based structure; see the Multidimensional Data notes for some implementation options in C++). There is one row in the two-dimensional array for each vertex; there is also one column for each vertex. This means there's one cell for each possible pair of vertices, which means that each cell has the job of representing one possible edge that the graph might have. What's stored in each cell depends on what you need to know about each edge:

- If all you need to know is whether each edge exists, each cell could simply store a bool.
- If you need additional information about each edge (e.g., the cost of a flight), each cell could store an object containing that information, or a pointer to one, with some sentinel value (e.g., nullptr) indicating that the edge doesn't exist.

Additionally, if information needs to be stored about each vertex, then you'd also need a separate, one-dimensional array in which to store it. Each cell would store whatever information is needed (e.g., the name of the airport) for a single vertex.

Note that you can use an adjacency matrix to implement either a directed or an undirected graph. If the graph is undirected, you have two additional options to consider:

- You could store each edge twice, in both the cell [i][j] and the cell [j][i], which keeps lookups simple at the cost of recording everything redundantly.
- You could store each edge once, say, always in the cell whose row index is the smaller of the two. Since that uses only half the matrix, a triangular array could cut the memory cost roughly in half, but every lookup would first have to put its two indices in the right order.
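To make the technique concrete, here's a minimal sketch of a directed graph stored as an adjacency matrix, assuming vertices are identified by the indices 0 through v - 1 and that we only need to know whether each edge exists. The class and member names here are illustrative, not a required interface.

#include <vector>

class AdjacencyMatrixGraph
{
public:
    // Build a graph with the given number of vertices and no edges yet.
    explicit AdjacencyMatrixGraph(int vertexCount)
        : matrix(vertexCount, std::vector<bool>(vertexCount, false))
    {
    }

    void addEdge(int from, int to)
    {
        matrix[from][to] = true;
    }

    // Two array indexings: Θ(1) time, no matter how large the graph is.
    bool edgeExists(int from, int to) const
    {
        return matrix[from][to];
    }

private:
    // matrix[i][j] is true when the edge (i, j) exists. One cell per
    // possible edge is what costs Θ(v²) memory even in a sparse graph.
    std::vector<std::vector<bool>> matrix;
};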

Analysis of adjacency matrix

There are two things we should consider here: the amount of memory required to store an adjacency matrix and the amount of time required to perform common operations on it.

From a memory perspective, what we have are two things:

- The two-dimensional array contains one cell for every possible edge: v columns in each of v rows, for a total of Θ(v²) memory.
- The separate one-dimensional array of vertex information contains one cell per vertex, for a total of Θ(v) memory.

Adding these two things together — since we'd need to store both — we'd need Θ(v²) + Θ(v) = Θ(v²) memory.

Checking for the existence of an edge (i, j), given the indices of two vertices, would require simply indexing into the two-dimensional array. As we've seen, this can be done in Θ(1) time, so this will be very efficient.

On the other hand, enumerating all of the edges in the graph will take Θ(v²) time, even if the number of edges is quite small. This is because we would nonetheless have to iterate through the entire two-dimensional array, looking to see which edges exist.

Generally, then, we might conclude that this is a somewhat wasteful approach — both in terms of time and memory — when the graph is sparse (i.e., when e << v²). Fundamentally, the reason it's so wasteful in this case is that we'll spend memory storing every possible edge, even if most are missing; algorithms, too, might need to look at every possible edge, which can waste a lot of time in a graph that's sparse.

But if we have a somewhat dense graph (i.e., one where many of the edges are present), or if we're going to very frequently look up the existence of an edge, then this approach might well be worthwhile.

Adjacency lists

Another way we can implement a graph is to store only the actual edges, as opposed to storing all of the possible edges. We call this the adjacency lists approach. When implementing a graph as adjacency lists, here's what we store:

- A one-dimensional array with one element per vertex.
- In each element of that array, a linked list containing one node for each of that vertex's outgoing edges, with each node storing (at least) the index of the vertex the edge leads to.

If we needed to store information about each vertex, it would belong in the outer array. If we needed to store information about each edge, it would belong in the sublists.
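Here's a corresponding sketch of the adjacency lists approach, under the same assumptions as before (integer vertex identifiers, no extra edge information). The standard library's forward_list stands in for the linked lists of outgoing edges; again, the names are illustrative.

#include <forward_list>
#include <vector>

class AdjacencyListsGraph
{
public:
    explicit AdjacencyListsGraph(int vertexCount)
        : adjacencyLists(vertexCount)
    {
    }

    void addEdge(int from, int to)
    {
        adjacencyLists[from].push_front(to);
    }

    // Searches only the source vertex's list, so the cost is driven by
    // that vertex's out-degree rather than by v.
    bool edgeExists(int from, int to) const
    {
        for (int target : adjacencyLists[from])
        {
            if (target == to)
            {
                return true;
            }
        }

        return false;
    }

private:
    // adjacencyLists[i] holds the targets of vertex i's outgoing edges:
    // one array element per vertex, one list node per edge, Θ(v + e) total.
    std::vector<std::forward_list<int>> adjacencyLists;
};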

Analysis of adjacency lists

Using adjacency lists, we're making a tradeoff, relative to the use of an adjacency matrix. Rather than storing every possible edge, we're storing only the edges that are actually present in the graph. This makes some things better at the expense of others; our goal here is to determine which things those are.

First, we should consider the amount of memory we'll need.

- The outer array contains one element per vertex, requiring Θ(v) memory.
- Across all of the linked lists combined, there is one node per edge, requiring Θ(e) memory in total.

The total amount of memory required would be Θ(v) + Θ(e) = Θ(v + e). The reason we write this as the sum of the two variables is that they can vary independently. One doesn't necessarily dominate the other, because it depends on how dense or how sparse the graph is. So all we can say, realistically, is that the amount of memory we're using will be a linear amount with respect to v plus a linear amount with respect to e.

Now let's consider the time it takes to perform the two operations we considered for an adjacency matrix.

Checking for the existence of an edge (i, j), given the indices of two vertices, would require searching one of the linked lists. This is actually a bit tougher to analyze:

- In the worst case, the source vertex could have an outgoing edge to every vertex in the graph, making the search take Θ(v) time.
- In general, though, the cost is determined by the out-degree of the source vertex, since we only ever search that one vertex's list. In a sparse graph, those lists will tend to be short, so the search will tend to be fast.

Enumerating all of the edges in the graph would require traversing all the linked list nodes plus traversing the array to get to the linked lists. This would require a total of Θ(v + e) time — Θ(v) to traverse the array plus Θ(e) to traverse all of the linked lists.

Where adjacency lists excel as an implementation technique is when the graph is relatively sparse (i.e., when e << v²). Some uses of graphs will certainly be this way:

- The campus-walkways graph described earlier: no matter how many buildings a campus has, each one is directly connected by a walkway to only a handful of nearby places.
- A social network: even with many millions of members, each individual person is linked to a comparatively tiny number of others.

Choosing between these implementation techniques requires understanding the nature of the real problem you're solving, and how sparse or how dense your graph is likely to be.