Graph Algorithms I: (Topological Sorting, Connected Components, Minimum Spanning Tree)

Before looking at these three algorithms, let's examine the structure of the graph class shown below as ArrayGraph (you will write the analogous HashGraph class as part of Programming Assignment #5). First, note that it uses a LocalInfo class which keeps sets of incoming and outgoing nodes and edges for each node in the graph. This is redundant information (it takes up more space than necessary, and we could reconstruct it from less information), but it allows all the standard graph operations to run in O(1). Second, note that the ArrayGraph class keeps a map from each node to its LocalInfo, as well as a map from each node-pair to its edge value (of type T) -if the nodes are connected by an edge.

template<class T>
class ArrayGraph {
  private:
    class LocalInfo;            //Forward declaration: used in templated typedefs below

  public:
    //Typedefs
    typedef std::string                       NodeName;
    typedef ics::pair<NodeName,NodeName>      Edge;

    typedef ics::pair<NodeName,LocalInfo>     NodeLocalEntry;
    typedef ics::ArrayMap<NodeName,LocalInfo> NodeMap;
    typedef ics::ArrayMap<Edge,T>             EdgeMap;

    typedef ics::pair<NodeName,LocalInfo>     NodeMapEntry;
    typedef ics::pair<Edge,T>                 EdgeMapEntry;

    typedef ics::ArraySet<NodeName>           NodeSet;
    typedef ics::ArraySet<Edge>               EdgeSet;

    static bool LocalInfo_gt(const NodeLocalEntry& a, const NodeLocalEntry& b)
      {return a.first < b.first;}

    //Destructor/Constructors
    ~ArrayGraph();
    ArrayGraph();
    ArrayGraph(const ArrayGraph<T>& g);

    //Queries
    bool empty      () const;
    int  node_count () const;
    int  edge_count () const;
    bool has_node   (std::string node_name) const;
    bool has_edge   (NodeName origin, NodeName destination) const;
    T    edge_value (NodeName origin, NodeName destination) const;
    int  in_degree  (NodeName node_name) const;
    int  out_degree (NodeName node_name) const;
    int  degree     (NodeName node_name) const;

    const NodeMap& all_nodes() const;
    const EdgeMap& all_edges() const;

    const NodeSet& out_nodes(NodeName node_name) const;
    const NodeSet& in_nodes (NodeName node_name) const;

    const EdgeSet& out_edges(NodeName node_name) const;
    const EdgeSet& in_edges (NodeName node_name) const;

    //Commands
    void add_node   (NodeName node_name);
    void add_edge   (NodeName origin, NodeName destination, T value);
    void remove_node(NodeName node_name);
    void remove_edge(NodeName origin, NodeName destination);
    void clear      ();

    void load (std::ifstream& in_file,  std::string separator = ";");
    void store(std::ofstream& out_file, std::string separator = ";");

    //Operators
    ArrayGraph<T>& operator = (const ArrayGraph<T>& rhs);
    bool operator == (const ArrayGraph<T>& rhs) const;
    bool operator != (const ArrayGraph<T>& rhs) const;

    template<class T2>
    friend std::ostream& operator<<(std::ostream& outs, const ArrayGraph<T2>& g);

  private:
    //All methods and operators relating to LocalInfo are defined below, followed by
    //  a friend function for << onto output streams for LocalInfo
    class LocalInfo {
      public:
        LocalInfo() {}
        LocalInfo(ArrayGraph<T>* g) : from_graph(g) {}
        void connect(ArrayGraph<T>* g) {from_graph = g;}

        bool operator == (const LocalInfo& rhs) const {
          //No need to check in_nodes and out_nodes: redundant information there
          return this->in_edges == rhs.in_edges && this->out_edges == rhs.out_edges;
        }
        bool operator != (const LocalInfo& rhs) const {
          return !(*this == rhs);
        }

        //LocalInfo instance variables
        //The LocalInfo class is private to code #including this file, but public
        //  instance variables allow ArrayGraph methods to access them directly
        //from_graph should point to the ArrayGraph of the LocalInfo it is in, so
        //  LocalInfo methods can access its edge_values map (see <<)
        ArrayGraph<T>* from_graph = nullptr;
        NodeSet out_nodes;
        NodeSet in_nodes;
        EdgeSet out_edges;
        EdgeSet in_edges;
    };

    friend std::ostream& operator<<(std::ostream& outs, const LocalInfo& li) {
      outs << "LocalInfo[" << std::endl << "  out_nodes = " << li.out_nodes << std::endl;
      outs << "  out_edges = set[";
      int printed = 0;
      for (Edge e : li.out_edges)
        outs << (printed++ == 0 ? "" : ",") << "->" << e.second
             << "(" << li.from_graph->edge_values[e] << ")";
      outs << "]" << std::endl;
      outs << "  in_nodes  = " << li.in_nodes << std::endl;
      outs << "  in_edges  = set[";
      printed = 0;
      for (Edge e : li.in_edges)
        outs << (printed++ == 0 ? "" : ",") << e.first
             << "(" << li.from_graph->edge_values[e] << ")" << "->";
      outs << "]]";
      return outs;
    }

    //ArrayGraph instance variables
    NodeMap node_values;
    EdgeMap edge_values;
};

In the following algorithms, we assume a HashGraph for computing complexity classes (and HashSets, HashMaps, etc.). We also assume a graph has N nodes and M edges: recall M can be anything from 0 to N^2.
------------------------------------------------------------------------------
Topological Sorting

First, we will discuss Topological Sorting. Here the domain is a digraph where an edge from A to B means node A "comes before" node B (or, if you will, node A is "less than" node B). Note that the edge can have a value too, but only the existence of an edge is relevant to this algorithm. Unlike normal types, where the law of trichotomy holds (for any two values a and b, exactly one of a < b, a = b, or a > b holds), two values might be "incomparable", meaning their order is not directly specified: there is no edge from A to B or from B to A. Of course, if there is a path from A to B, that means A < B (and similarly, a path from B to A means B < A). The edges in the graph supply what mathematicians call a "partial order" of the nodes.

A topological sort of a graph is a sequence of nodes such that all the constraints (edges saying which nodes come before which others) are satisfied in the resulting sequence. In the node sequence a, b, c, ..., either there is a path a->b (a < b) or no path between a and b; either there is a path b->c (b < c) or no path between b and c; ...

The original "make" program for compiling systems built from many C modules had a list of entries of the form c1.c -> c2.c, meaning the C program file c1.c must be compiled before the C program file c2.c.
The "make" program would create a graph whose nodes were program files, with edges indicating the order of compilation for any program files for which order was important. The "make" program would first topologically sort this graph to determine in what order to compile all the program files, and then compile each file according to that order.

A simple algorithm for doing topological sorting follows. It produces a queue of nodes (e.g., an ArrayQueue) satisfying all the edge constraints. Note that this algorithm mutates the graph while running: if the graph must remain unchanged, we can copy the graph first.

  while there is a node with in-degree 0
    enqueue it to (put it next in) the answer queue
    remove it from the graph (affecting the degrees of other nodes that refer
      to it or that it refers to)

This process continues until (a) all nodes are removed from the graph or (b) some nodes still remain in the graph, but no node has in-degree 0. In the latter case, the graph is cyclic: there are nodes a, b, c, ..., x, y, z such that the edges require that a must come before b, b must come before c, ..., x must come before y, y must come before z, and z must come before a (which is an impossible ordering). If no node has in-degree 0, then every node is the destination of some edge (only possible if there is a cycle).

Note that the sequence of nodes may NOT be uniquely determined: if at any time there is more than 1 node with in-degree 0, we can arbitrarily choose any of those nodes to come next in the queue: there are no constraints on the order in which they appear.

Recall that in graphs, we usually let N denote the number of nodes in the graph and M denote the number of edges. In sparse graphs M is O(N) and in dense graphs M is O(N^2). We can characterize this algorithm as O(N^2 + M): finding an in-degree 0 node is O(N) (iterate through all the nodes), and in a non-cyclic graph this will happen N times before the graph is empty. Removing all the nodes also requires removing all M edges.
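The simple loop above can be sketched in C++. This is a minimal version under my own naming and representation choices: a std::map from each node to its set of successors stands in for the course's graph classes, and the graph is passed by value so the caller's copy is not mutated.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Simple topological sort: repeatedly scan for a node with in-degree 0,
// enqueue it on the answer, and remove it (and its out-edges) from the graph.
// Returns an empty vector if the graph is cyclic.
std::vector<std::string> simple_topo_sort(
    std::map<std::string, std::set<std::string>> graph) {  // by value: mutated
  std::vector<std::string> answer;                         // the answer queue
  while (!graph.empty()) {
    // Find any node with in-degree 0: an O(N + M) scan on every iteration.
    std::string source;
    bool found = false;
    for (const auto& [n, outs] : graph) {
      bool has_incoming = false;
      for (const auto& [m, m_outs] : graph)
        if (m_outs.count(n) > 0) { has_incoming = true; break; }
      if (!has_incoming) { source = n; found = true; break; }
    }
    if (!found)                  // no in-degree-0 node remains: graph is cyclic
      return {};
    answer.push_back(source);
    graph.erase(source);         // removes the node and all its out-edges
  }
  return answer;
}
```

Because the only in-degree-0 node is found by a full scan each time, this matches the O(N^2 + M) characterization above.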
For dense graphs the complexity is O(N^2 + N^2) = O(N^2). For sparse graphs the complexity is O(N^2 + N) = O(N^2).

We will apply this algorithm to generate one (of a few different) topological orders (we have latitude to pick any in-degree 0 node; different picks lead to different final orders, but all satisfy the required constraints) for the graph below (draw a picture of this graph before starting).

  A -> D
  A -> E
  B -> D
  C -> E
  C -> H
  D -> F
  D -> G
  D -> H
  E -> G

One order is: A, B, C, D, E, F, G, H
Another is : B, A, C, D, F, H, E, G

A slightly more complicated but faster algorithm (by Kahn) for a graph G first computes a set of all source nodes and updates this set based on removing edges (at which time new nodes can become source nodes).

  O = a queue that is empty
  S = a set of all source nodes (with in-degree 0)
  while S is not empty
    n = any node from S (use an iterator to get just the "first"); remove n from S
    enqueue n to O
    for each edge e (from n to m)
      remove edge e from the graph
      if the in-degree of m is now reduced to 0 by the removal
        add m into S
  if the graph has edges after terminating the loop, then the graph is cyclic
  else O is a topologically sorted version of all nodes in G

Computing O and S initially is O(1) and O(N) respectively. For an acyclic graph, the outer loop executes once for each node, so it is executed N times; the inner loop executes once for each edge, so it is executed a total of M times (a bit at a time, some for each outer loop iteration). The other operations are all O(1), so the running time is O(N+M). Recall that for dense graphs M is O(N^2), so in the worst case (a dense graph) this algorithm is O(N+N^2) = O(N^2). But if the graph is sparse (say each node has at most 10 edges), then it is O(N + 10N) = O(N).
------------------------------------------------------------------------------
Connected Components

If an undirected graph is connected, it means that we can reach any node by following a path from any other node (edges go both ways).
If a graph is not connected, there are node pairs that are not reachable from each other. Non-connected graphs contain connected components: subgraphs that are each connected. If a graph is connected then there is just one component, which includes every node; if it is not connected there will be some number (>1) of subgraphs, each of which is connected (all the subgraph's nodes can reach each other); in the extreme case -a graph with no edges- every node is in its own connected component.

"Connected to" is an equivalence relation on an undirected graph:
 1) Any node is connected to itself
 2) If a is connected to b, then b is connected to a (in an undirected graph)
 3) If a is connected to b, and b is connected to c, then a is connected to c

So, we can compute the connected components of a graph using an equivalence relation. Each equivalence class represents one connected component: all the nodes in an equivalence class can reach each other. The algorithm is simply:

 1) Put every node in the graph into its own equivalence class (as a singleton).
 2) For every edge in the graph, merge the equivalence classes of its origin
    and destination nodes.

The complexity class is O(N+M) since (1) we put each of the N nodes in an equivalence class, which is N*O(1), and (2) the loop processes every edge exactly once, and each equivalence-relation operation is approximately O(1) for HashEquivalence using the compression heuristic. So, in a dense graph M is O(N^2) and this algorithm is O(N^2); for a sparse graph M is O(N) and this algorithm is O(N).

Note that this process terminates when all the edges are processed; it could also terminate as soon as the number of classes is 1: at that point all nodes are in the same component, so processing more edges is not necessary.
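The two-step algorithm above can be sketched in C++. The Equivalence struct here is a minimal map-based stand-in for the course's HashEquivalence class (the method names mirror that class; the representation and helper names are my own), using root-finding with the compression heuristic.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Minimal equivalence relation (union-find) with path compression.
struct Equivalence {
  std::map<std::string, std::string> parent;

  void add_singleton(const std::string& n) { parent[n] = n; }

  std::string compress_to_root(const std::string& n) {
    std::string root = n;
    while (parent[root] != root)          // find the root of n's class
      root = parent[root];
    std::string walk = n;                 // compression heuristic: repoint
    while (parent[walk] != root) {        //  every node on the path at root
      std::string next = parent[walk];
      parent[walk] = root;
      walk = next;
    }
    return root;
  }

  bool in_same_class(const std::string& a, const std::string& b) {
    return compress_to_root(a) == compress_to_root(b);
  }

  void merge_classes_of(const std::string& a, const std::string& b) {
    parent[compress_to_root(a)] = compress_to_root(b);
  }
};

// Number of connected components of an undirected graph, given its nodes
// and an edge list.
int connected_components(
    const std::vector<std::string>& nodes,
    const std::vector<std::pair<std::string, std::string>>& edges) {
  Equivalence e;
  for (const auto& n : nodes)             // 1) every node is a singleton class
    e.add_singleton(n);
  for (const auto& [a, b] : edges)        // 2) merge classes across each edge
    e.merge_classes_of(a, b);
  std::set<std::string> roots;            // one root per remaining class
  for (const auto& n : nodes)
    roots.insert(e.compress_to_root(n));
  return (int)roots.size();
}
```

For example, nodes {a,b,c,d,e} with edges a-b, b-c, d-e form the two components {a,b,c} and {d,e}.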
One application of computing connected components is the "percolation problem". It asks, given a matrix of cells (some open = blank, some closed = X), whether there is a path from any open cell in the top line to any open cell in the bottom line. A path must move through only open cells, and each open cell on the path must be connected to the next by moving up, down, left, or right (not diagonally). An example matrix is:
+---+---+---+---+---+---+---+---+---+---+
| X |   | X | X |   | X | X |   | X | X |
+---+---+---+---+---+---+---+---+---+---+
| X |   |   |   | X |   |   | X | X | X |
+---+---+---+---+---+---+---+---+---+---+
| X | X | X |   | X |   |   | X | X | X |
+---+---+---+---+---+---+---+---+---+---+
| X | X | X |   |   |   |   |   |   | X |
+---+---+---+---+---+---+---+---+---+---+
| X |   |   | X | X | X | X | X |   |   |
+---+---+---+---+---+---+---+---+---+---+
| X |   |   | X |   |   |   | X |   |   |
+---+---+---+---+---+---+---+---+---+---+
| X |   |   | X |   | X |   |   |   |   |
+---+---+---+---+---+---+---+---+---+---+
| X |   |   | X |   | X | X | X | X | X |
+---+---+---+---+---+---+---+---+---+---+
| X | X | X |   |   | X | X |   |   | X |
+---+---+---+---+---+---+---+---+---+---+
| X | X | X |   | X | X | X |   |   | X |
+---+---+---+---+---+---+---+---+---+---+
Here is a percolation path through this matrix (numbered 1-24). Note that the numbers generally go downward (bigger numbers BELOW smaller numbers), but some numbers on the path go upward (bigger numbers ABOVE smaller numbers).
+---+---+---+---+---+---+---+---+---+---+
| X | 1 | X | X |   | X | X |   | X | X |
+---+---+---+---+---+---+---+---+---+---+
| X | 2 | 3 | 4 | X |   |   | X | X | X |
+---+---+---+---+---+---+---+---+---+---+
| X | X | X | 5 | X |   |   | X | X | X |
+---+---+---+---+---+---+---+---+---+---+
| X | X | X | 6 | 7 | 8 | 9 | 10| 11| X |
+---+---+---+---+---+---+---+---+---+---+
| X |   |   | X | X | X | X | X | 12|   |
+---+---+---+---+---+---+---+---+---+---+
| X |   |   | X | 19| 18| 17| X | 13|   |
+---+---+---+---+---+---+---+---+---+---+
| X |   |   | X | 20| X | 16| 15| 14|   |
+---+---+---+---+---+---+---+---+---+---+
| X |   |   | X | 21| X | X | X | X | X |
+---+---+---+---+---+---+---+---+---+---+
| X | X | X | 23| 22| X | X |   |   | X |
+---+---+---+---+---+---+---+---+---+---+
| X | X | X | 24| X | X | X |   |   | X |
+---+---+---+---+---+---+---+---+---+---+
To solve this problem, assume that every matrix cell is a node in a graph. Put an edge in the graph between every two adjacent open nodes (the ones that are next to each other: either related as top/bottom or left/right). Note that this is a sparse graph, because each node has at most 4 edges. Also, put a node outside the matrix on the top and on the bottom: the top one has edges to each open cell in the first line; the bottom one has edges to each open cell in the bottom line. Now compute the connected components of this graph. There is a percolation path if the extra node on top is in the same component as the extra node on the bottom.

Note that running this algorithm says only whether or not a percolation path exists. It doesn't compute the path itself. To do that you could do a graph search starting at the extra node at the top. To find the shortest path you could use something like breadth-first search (finding all nodes reachable in 1 step, 2 steps, etc., keeping track of the fastest way to reach each node).
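The construction above can be sketched in C++. This is a minimal version under my own representation choices: rows are strings with 'X' for closed cells and '.' for open cells (standing in for blanks), cells are numbered row-major, and an array-based union-find with a compression heuristic stands in for the course's HashEquivalence.

```cpp
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Minimal union-find over integer ids, with path halving (a compression
// heuristic) in find.
struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);  // each id is its own root
  }
  int find(int x) {
    while (parent[x] != x)
      x = parent[x] = parent[parent[x]];         // compress as we walk up
    return x;
  }
  void merge(int a, int b) { parent[find(a)] = find(b); }
};

// True if the grid percolates: some open top-row cell reaches some open
// bottom-row cell via up/down/left/right moves through open cells.
bool percolates(const std::vector<std::string>& grid) {
  int rows = grid.size(), cols = grid[0].size();
  int top = rows * cols, bottom = rows * cols + 1;   // two virtual nodes
  UnionFind uf(rows * cols + 2);
  auto id   = [cols](int r, int c) { return r * cols + c; };
  auto open = [&grid](int r, int c) { return grid[r][c] == '.'; };
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c) {
      if (!open(r, c)) continue;
      if (r == 0)        uf.merge(id(r, c), top);     // edge to virtual top
      if (r == rows - 1) uf.merge(id(r, c), bottom);  // edge to virtual bottom
      if (r + 1 < rows && open(r + 1, c)) uf.merge(id(r, c), id(r + 1, c));
      if (c + 1 < cols && open(r, c + 1)) uf.merge(id(r, c), id(r, c + 1));
    }
  return uf.find(top) == uf.find(bottom);             // same component?
}
```

Merging each cell only with its right and down neighbors covers every adjacency exactly once, since left/up pairs are the same edges seen from the other cell.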
------------------------------------------------------------------------------
Minimum Spanning Tree (MST)

A spanning tree of an undirected graph is a subgraph such that every node is reachable from every other node, WITH NO REDUNDANT EDGES. That means a graph with N nodes will have exactly N-1 edges in its spanning tree. The graph must be connected to have a spanning tree.

We can prove we need only N-1 edges by induction. Any 1-node graph needs only 0 edges to be connected (satisfying the N-1 requirement); if we take any connected N-node graph and add another node, all we must do is add one more edge (from any of the N nodes to the new one) to have the graph be connected again.

Note that spanning trees are acyclic graphs, because any cyclic structure has a redundant edge: we can remove any edge in the cycle and the remaining graph is still a spanning tree. We call this graph a spanning "tree" because we can view the graph as a tree: choose any node as the root; all the nodes immediately reachable from it become the root's children; the nodes immediately reachable from these (those not already reached) become the grandchildren, etc. Choosing a different root will yield a different tree. The height of each tree is the maximum number of edges that must be followed from the root node to reach the "farthest away" node.

A minimum spanning tree (MST) is a spanning tree whose cost (the sum of its edge values, assuming they are numbers) is the minimum possible. If we wanted to connect various cities with a phone network, and we knew the cost of laying the fiber-optic cable between any two cities (it might NOT just be proportional to the linear distance between the cities: this cost is the value on the edge between the two cities), finding the MST would solve the problem.

Kruskal's algorithm is one efficient way to find an MST.
To run this algorithm, we must be able to tell whether an edge connects two nodes that do not yet (in the MST we are building) have a path between them. We can easily "see" this property on small graphs; we saw above that connectedness is an equivalence relation, which we can implement efficiently with a HashEquivalence. We will use that data type in our algorithm, along with a priority queue.

Kruskal's Algorithm for MST
 1) Put each node in its own equivalence class (add_singleton in Equivalence).
 2) Put all the edges in a priority queue, such that edges with smaller values
    have higher priorities.
 3) Start with an MST graph T containing all the nodes but no edges. It will
    eventually store these same nodes and only the edges that are in the
    minimum spanning tree.
 4) While T's # edges < N-1 (see above: N node graphs have N-1 edge MSTs)
    a) Dequeue the next edge from the priority queue (the smallest remaining
       edge value). If there are no more edges, the graph isn't connected and
       there is no [minimum] spanning tree.
    b) If its origin and destination nodes are in the SAME equivalence class,
       do nothing. If its origin and destination nodes are in two DIFFERENT
       equivalence classes, add the edge to T and merge the equivalence
       classes of the origin and destination nodes.

The complexity of this algorithm is O(N + M Log2 M). The N term comes from step 1; the M Log2 M term comes from step 4 - possibly dequeuing all the edges in the priority queue (e.g., in a graph whose largest edge goes to a node that is connected to no other node). In a sparse graph, with O(N) edges, this algorithm is just O(N Log2 N), so it is "fast" (like fast sorting). In a dense graph, with O(N^2) edges, the algorithm is O(N^2 Log2 N^2), which is actually O(N^2 Log2 N) because Log2 N^2 = 2 * Log2 N (and we remove the constant factor).

Kruskal's algorithm is known as a "greedy algorithm".
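The steps above can be sketched in C++. In this minimal version (names and representation are my own), edges are (cost, origin, destination) tuples so that std::priority_queue with std::greater dequeues the smallest cost first, and a simple map-based root-finder (without the compression heuristic, for brevity) stands in for HashEquivalence.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

using WeightedEdge = std::tuple<int, std::string, std::string>;

// Kruskal's algorithm: returns the MST edges; a result with fewer than N-1
// edges means the graph was not connected.
std::vector<WeightedEdge> kruskal_mst(const std::vector<std::string>& nodes,
                                      const std::vector<WeightedEdge>& edges) {
  std::map<std::string, std::string> parent;
  for (const auto& n : nodes)                       // 1) singleton classes
    parent[n] = n;
  auto root = [&parent](std::string n) {            // class representative
    while (parent[n] != n)
      n = parent[n];
    return n;
  };
  // 2) min-priority queue: smallest edge cost has the highest priority
  std::priority_queue<WeightedEdge, std::vector<WeightedEdge>,
                      std::greater<WeightedEdge>> pq(edges.begin(), edges.end());
  std::vector<WeightedEdge> mst;                    // 3) no edges yet
  while (mst.size() < nodes.size() - 1 && !pq.empty()) {
    auto [cost, a, b] = pq.top();                   // 4a) smallest remaining
    pq.pop();
    if (root(a) != root(b)) {                       // 4b) different classes:
      mst.push_back({cost, a, b});                  //     keep the edge and
      parent[root(a)] = root(b);                    //     merge the classes
    }                                               // same class: do nothing
  }
  return mst;
}
```

Run on the 10-node example graph below, this returns 9 edges with total cost 38, matching the trace that follows.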
In a greedy algorithm a locally correct choice (here, the next smallest edge to process) is guaranteed to be in the solution. So, once we choose this edge to be part of an MST, we will never have to "unchoose" it. In standard backtracking search algorithms we sometimes make choices that we have to "unmake", choosing something else instead (so such an algorithm is not greedy). If a problem is solvable by a greedy algorithm, the choices the algorithm makes never need to be unmade (and typically its complexity class will be lower than for a backtracking search, whose complexity class is exponential).

An example of a problem where a greedy algorithm fails is making change using US currency when there are a limited number of coins. Normally you'd first find the number of quarters needed, then the number of dimes, then nickels, then pennies. But suppose you want to make 30 cents change and the cash register has only 1 quarter, 3 dimes, and no nickels or pennies. A greedy algorithm would choose the quarter first, then be unable to choose any dimes, and fail to solve the problem because there are no nickels. A backtracking search algorithm would undo the choice of 1 quarter, and then choose 3 dimes to solve the problem.

Applying Kruskal's algorithm to a graph with unique edge costs will produce a unique MST. If some edge costs are duplicates, then there might be multiple MSTs, each with the same total cost.

We now apply this algorithm to generate an MST for the graph below. Here A<-(4)->B means the cost of the undirected edge between A and B is 4.

  A<-(4)->B   A<-(1)->C   A<-(4)->D
  B<-(5)->C   B<-(7)->G   B<-(9)->E   B<-(9)->F
  C<-(3)->D
  D<-(10)->G  D<-(18)->J
  E<-(2)->F   E<-(4)->H   E<-(6)->I
  F<-(8)->G   F<-(2)->H
  G<-(9)->H   G<-(8)->J
  H<-(3)->I   H<-(9)->J
  I<-(9)->J

If we put these edges in the priority queue, one dequeuing order is shown below (there are multiple edges with the same value, so other orders are possible).
Also, initially each node is in its own separate cluster/equivalence class.

  A<-(1)->C
  E<-(2)->F
  F<-(2)->H
  C<-(3)->D
  H<-(3)->I
  A<-(4)->B
  A<-(4)->D
  E<-(4)->H
  B<-(5)->C
  E<-(6)->I
  B<-(7)->G
  F<-(8)->G
  G<-(8)->J
  B<-(9)->E
  B<-(9)->F
  G<-(9)->H
  H<-(9)->J
  I<-(9)->J
  D<-(10)->G
  D<-(18)->J

Note that this graph has 10 nodes, so the MST will have 9 edges, and we can terminate the algorithm when we reach 9 edges.

 1) We remove A<-(1)->C; A and C are in different clusters, so we put this
    edge in the MST (it now has 1 edge) and update the clusters so that now
    {A,C} are in the same cluster.
 2) We remove E<-(2)->F; E and F are in different clusters, so we put this
    edge in the MST (it now has 2 edges) and update the clusters so that now
    {E,F} are in the same cluster.
 3) We remove F<-(2)->H; F and H are in different clusters, so we put this
    edge in the MST (it now has 3 edges) and update the clusters so that now
    {E,F,H} are in the same cluster.
 4) We remove C<-(3)->D; C and D are in different clusters, so we put this
    edge in the MST (it now has 4 edges) and update the clusters so that now
    {A,C,D} are in the same cluster.
 5) We remove H<-(3)->I; H and I are in different clusters, so we put this
    edge in the MST (it now has 5 edges) and update the clusters so that now
    {E,F,H,I} are in the same cluster.
 6) We remove A<-(4)->B; A and B are in different clusters, so we put this
    edge in the MST (it now has 6 edges) and update the clusters so that now
    {A,B,C,D} are in the same cluster.
 7) We remove A<-(4)->D; A and D are already in the same cluster (see 6), so
    we don't include this edge in the MST. This is the first time that we are
    discarding an edge (not using it in the result graph).
 8) We remove E<-(4)->H; E and H are already in the same cluster (see 5), so
    we don't include this edge in the MST.
 9) We remove B<-(5)->C; B and C are already in the same cluster (see 6), so
    we don't include this edge in the MST.
10) We remove E<-(6)->I; E and I are already in the same cluster (see 5), so
    we don't include this edge in the MST.
11) We remove B<-(7)->G; B and G are in different clusters, so we put this
    edge in the MST (it now has 7 edges) and update the clusters so that now
    {A,B,C,D,G} are in the same cluster.
12) We remove F<-(8)->G; F and G are in different clusters, so we put this
    edge in the MST (it now has 8 edges) and update the clusters so that now
    {A,B,C,D,E,F,G,H,I} are in the same cluster. This edge connects two
    formerly unconnected components.
13) We remove G<-(8)->J; G and J are in different clusters, so we put this
    edge in the MST (it now has 9 edges) and update the clusters so that now
    {A,B,C,D,E,F,G,H,I,J} are in the same cluster.

Note, we now have the needed 9 edges, and we see that all the nodes are in the same cluster (all reachable from each other). So, the following edges are in the MST

  A<-(1)->C
  E<-(2)->F
  F<-(2)->H
  C<-(3)->D
  H<-(3)->I
  A<-(4)->B
  B<-(7)->G
  F<-(8)->G
  G<-(8)->J

with a total cost (sum of the edge values) of 1+2+2+3+3+4+7+8+8 = 38.
------------------------------------------------------------------------------