Graphs

The mathematical theory of graphs was first developed by the famous Swiss mathematician Leonhard Euler (pronounced like "Oiler") in 1735. It was motivated by a desire to solve the "Bridges of Konigsberg" problem. A brief introduction to this problem and a graphic for it appear first in the graph.pdf accompanying this lecture. We will start our discussion of Graphs by introducing this problem, learning to represent it as a graph, and then partially solving the problem.

In Konigsberg, Prussia (now Kaliningrad, Russia), a river ran through the city such that in its center was an island, and after passing the island, the river broke into two parts. Seven bridges were built so that the people of the city could get from one part to another (see graphic). The people wondered whether or not one could walk around the city in a way that would involve crossing each bridge exactly once. It doesn't matter where the walk starts and stops. Euler proved that no such tour (now called an Euler tour or Euler path) was possible in Konigsberg.

A similar problem is known as "The Traveling Salesman" problem, in which the traveler must visit every node exactly once and end up at the same place he/she started (such a route is called a Hamiltonian cycle; without the return to the start it is a Hamiltonian path). Often it also involves another criterion: minimizing the distance traveled. It is a much harder problem to solve.

Using some of the terminology that we will learn below, the relevant theorems are:

  Theorem 1: If an undirected graph has more than two nodes with an odd degree, then it does not have an Euler path.

  Theorem 2: If a connected undirected graph has two nodes or fewer with an odd degree, then it has at least one Euler path.

It is interesting that local properties (the degrees of the nodes) determine whether or not some global property (Euler path) is possible.
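The degree test in the two theorems is easy to check by machine. Here is a minimal sketch, assuming the graph is given as a list of undirected edges (repeated pairs model parallel bridges). The node labels I, N, S, and E for the island, the two banks, and the land between the two branches are our own hypothetical names, not ones taken from graph.pdf, and the helper functions are illustrative only.

```python
from collections import defaultdict

def odd_degree_nodes(edges):
    """Return the sorted nodes of odd degree in an undirected multigraph.

    edges is a list of (node, node) pairs; a repeated pair is a parallel edge.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(n for n, d in degree.items() if d % 2 == 1)

def passes_degree_test(edges):
    """True if at most two nodes have odd degree (the graph is assumed connected)."""
    return len(odd_degree_nodes(edges)) <= 2

# Hypothetical labelling of the seven bridges:
# I = island, N = north bank, S = south bank, E = land between the two branches.
konigsberg = [("I", "N"), ("I", "N"), ("I", "S"), ("I", "S"),
              ("I", "E"), ("N", "E"), ("S", "E")]

print(odd_degree_nodes(konigsberg))    # ['E', 'I', 'N', 'S']
print(passes_degree_test(konigsberg))  # False
```

All four land masses have odd degree (5, 3, 3, and 3), so by Theorem 1 no Euler path exists. Note that this only checks the degrees; it does not construct a path.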
Although this information tells us whether or not an Euler tour exists, it does not tell us how to actually construct a path/tour: an algorithm to find a tour has complexity O(E) where E is the number of edges (one way to measure the size of a graph).

Mathematically, Graph theory is a sub-area of Topology (which generalizes Geometry): topology is concerned less with distance and angles and more with connectedness; for example, the triangle inequality for distance (d is the distance function), d(a,b) + d(b,c) >= d(a,c), is not necessarily true in graphs.

        10
    a------->c
     \      ^       d(a,b) + d(b,c) = 5
    2 \    / 3      d(a,c)          = 10
       v  /
        b

Fundamentally Graphs consist of nodes (aka vertices) and edges (aka arcs). Nodes are typically labelled by some string that identifies them; edges are often labelled by the value of the edge. For example, we can construct a graph of airline fares, where each node is an airport and each edge is the cost of flying from the origin airport to the destination airport. Such a structure would be useful to determine the (minimum) cost of a trip between any 2 airports, whether or not they are directly connected (get from A to B by taking flights whose total cost is minimized). Note that the cost to travel from airport A to B, and then B to C, might be less than the cost to travel directly from A to C, which shows that the triangle inequality (see above) doesn't necessarily hold for graphs ...or airfares.

We can omit the edge values if we are concerned only with whether or not we can fly from the origin airport to the destination airport (represented by whether or not there is an edge between them). Likewise we can use an edge value that is a pair giving a departure time and an arrival time (so that we can actually schedule trips through intermediate airports by choosing particular flights whose times allow connections to be made). So, although edge values that are single numbers are common and important, edge values can be much more general.
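To make the minimum-cost idea concrete, here is a minimal sketch that finds the cheapest total edge value between two nodes of the small a, b, c graph pictured above. It uses Dijkstra's algorithm with Python's heapq module; that choice of algorithm and the dict-of-dicts representation are our own assumptions for illustration, and the sketch assumes all edge values are non-negative.

```python
import heapq

def cheapest_cost(graph, start, goal):
    """Return the minimum total edge value from start to goal, or None if unreachable.

    graph maps each origin node to a dict {destination: non-negative edge value}.
    """
    best = {start: 0}            # cheapest known cost to reach each node
    frontier = [(0, start)]      # priority queue of (cost so far, node)
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue             # stale entry: a cheaper route was already found
        for dest, value in graph.get(node, {}).items():
            new_cost = cost + value
            if new_cost < best.get(dest, float("inf")):
                best[dest] = new_cost
                heapq.heappush(frontier, (new_cost, dest))
    return None

# The three-node graph from the picture: a->c costs 10, a->b costs 2, b->c costs 3.
fares = {"a": {"c": 10, "b": 2}, "b": {"c": 3}, "c": {}}

print(cheapest_cost(fares, "a", "c"))   # 5, cheaper than the direct edge of 10
```

The direct edge from a to c is never used, which is exactly the failure of the triangle inequality described above.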
When we show implementations of the Graph data type, its class will be templated by the type of value stored for each edge.

Many problems can be encoded into graphs that represent them, and then solved by using well-known/efficient graph algorithms. There are large books written entirely on the subject of graph theory and graph algorithms, and UCI has an advanced ICS course (CS 163) focusing on graphs and graph algorithms.

------------------------------------------------------------------------------

Technical Terms

Let's pause here to define some more important graph terms, beyond the nodes and edges discussed above.

In a "directed graph" (aka digraph, the kind we will mostly study), edges have a distinguishable "origin" and "destination" node; an edge is written as an arrow from its origin to its destination. A digraph might contain just one edge between two nodes, or it might contain two: one from the first to the second, and one back from the second to the first (with each edge associated with its own value). For example, in a graph whose nodes are street corners and whose edge values are the times to travel between them (a digraph: something a GPS must represent and use to solve minimum-time route problems) a one-way street will have one edge between the nodes; a two-way street will have two edges (one going each way: the travel times in the two directions might differ, at least at certain times of the day, because of commuter traffic). Instead of allowing multiple edges from one node to another, we can specify a data structure for the edge that allows multiple values.

The in-degree of a node is a count of the number of edges having this node as their destination; likewise, the out-degree of a node is a count of the number of edges having this node as their origin. The degree of a node is the sum of its in-degree and out-degree.
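These three counts can be sketched directly from their definitions. The snippet below assumes a digraph stored as a dict mapping each origin node to a dict of {destination: edge value}; the street corners c1, c2, c3 and their travel times are made up for illustration.

```python
def out_degree(graph, node):
    """Count the edges that have node as their origin."""
    return len(graph.get(node, {}))

def in_degree(graph, node):
    """Count the edges that have node as their destination (scans every origin)."""
    return sum(1 for dests in graph.values() if node in dests)

def degree(graph, node):
    """The degree is the sum of the in-degree and the out-degree."""
    return in_degree(graph, node) + out_degree(graph, node)

# Hypothetical street-corner digraph; edge values are travel times in minutes.
streets = {
    "c1": {"c2": 4},   # one-way street from c1 to c2
    "c2": {"c3": 2},   # two-way street between c2 and c3 ...
    "c3": {"c2": 3},   # ... with a different travel time in each direction
}

print(in_degree(streets, "c2"), out_degree(streets, "c2"), degree(streets, "c2"))  # 2 1 3
```

With this representation out_degree is a single lookup, while in_degree must scan every origin; the graph classes discussed later store extra information to avoid that scan.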
A node is considered a "source" in a graph if it has in-degree of 0 (no edges have a source as their destination); likewise, a node is considered a "sink" in a graph if it has out-degree of 0 (no edges have a sink as their origin).

In an "undirected graph", there can be only one edge/value between any pair of nodes: each node serves as both the origin and destination of that edge. For an undirected graph, the in-degree, out-degree, and degree are all the same. We can use a digraph to represent an undirected graph by using two edges (each with the same value) to connect any two nodes.

A directed graph is "weakly-symmetric" if, whenever there is an edge from node1 to node2, there also is an edge from node2 to node1; likewise, a directed graph is "strongly-symmetric" if, whenever there is an edge from node1 to node2, there also is an edge from node2 to node1 and the values associated with these two edges are equal.

A "subgraph" of a graph contains a subset of its nodes and edges. The "natural subgraph" of a graph (containing a certain subset of nodes) includes all the edges in the graph that have nodes in this subset as both their origin and destination node. The "natural subgraph" of a graph (containing a certain subset of edges) includes all the nodes in the graph that are endpoints of any edge in the subset.

We have used graphs, informally, in programming assignment #1. There, we represented a digraph by a map whose key is the name of an origin node and whose value is the set of names of all the destination nodes reached by its edges. In this representation, we omitted the value for the edges and it was not easy/efficient to find the nodes leading into a node: finding the origin nodes of a destination node. Both of these deficiencies are removed in the actual graph classes we will implement.

A "path" in a graph is a sequence of nodes n1, n2, ..., nx, such that there is an edge from n1 to n2, from n2 to n3, etc. to nx.
Equivalently we can represent a path as a sequence of edges e1, e2, ..., en such that the destination node of e(sub i) is the origin node of e(sub i+1).

The "transitive closure" of a graph is a graph with the same nodes, such that if there is ANY path from node1 to node2 in the original graph, there is an edge directly connecting node1 to node2 in the transitive closure graph; the value on this edge is often related to the values on the path: one useful way to do this is to assign the value of this edge to be the minimum sum of the edge values, representing the minimum-sum path between the nodes.

A graph is called "cyclic" if it has at least one path in the graph that contains the same node twice. Such a path is called a "cycle". Likewise, if a graph contains no cycles, the graph is called "acyclic" (aka "noncyclic").

A graph is "connected" if there is a path between every two nodes. Typically we discuss connectedness in terms of an undirected graph. If a graph is not connected, it can be decomposed into its "connected components": each component is a largest subgraph that is connected. Connectedness is an Equivalence relation: (1) every node is connected to itself; (2) if node a is connected to node b then node b is connected to node a; (3) if node a is connected to node b and node b is connected to node c, then node a is connected to node c. This means that if there is an edge between a node in one component and a node in another, then the two components can be merged into a larger component. When we discuss how to compute connected components, the Equivalence data type that we have discussed (and you will implement in a quiz) will play a major part in the algorithm.

A "spanning tree" of a connected undirected graph is an acyclic/connected subgraph that includes all of its nodes; it represents an N-ary tree, for which we can choose any node as the root. Typically, there are many spanning trees for a graph. A "minimum spanning tree" is a spanning tree that minimizes the sum of the values associated with all the edges contained in the spanning tree.
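Here is a minimal sketch of how an Equivalence-style structure can be used to build a minimum spanning tree. It follows Kruskal's algorithm, which is one well-known approach and not necessarily the one these notes go on to use. The find helper is a bare-bones stand-in for the Equivalence data type, sorted() stands in for a priority queue, and the nodes A through D and their edge values are hypothetical.

```python
def find(parent, x):
    """Return the representative of x's equivalence class, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm on an undirected graph.

    edges is a list of (value, node1, node2) triples.
    Returns the list of edges chosen for the spanning tree.
    """
    parent = {n: n for n in nodes}        # every node starts in its own class
    tree = []
    for value, u, v in sorted(edges):     # consider the cheapest edges first
        root_u, root_v = find(parent, u), find(parent, v)
        if root_u != root_v:              # different components: no cycle is created
            parent[root_u] = root_v       # merge the two components
            tree.append((value, u, v))
    return tree

# Hypothetical example data.
nodes = ["A", "B", "C", "D"]
edges = [(1, "A", "B"), (4, "A", "C"), (3, "B", "C"), (2, "C", "D"), (5, "B", "D")]

tree = minimum_spanning_tree(nodes, edges)
print(tree)                               # [(1, 'A', 'B'), (2, 'C', 'D'), (3, 'B', 'C')]
print(sum(value for value, _, _ in tree)) # 6
```

An edge is kept only when its endpoints are in different classes, so the result stays acyclic; merging the classes is the same component-merging step described above for connected components.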
We can represent a project to lay fiber optic cables between N cities by a graph: each city is a node and the value of the edge between two cities is the cost of laying the fiber optic cable between them. The minimum spanning tree gives the minimum cost to lay enough fiber optic cable so that there is a path between any two cities. We will discuss an algorithm to solve this problem easily, relying on efficient implementations of the PriorityQueue and Equivalence relation data types.

------------------------------------------------------------------------------

An Example Graph

The second example in the graph.pdf accompanying this lecture shows a graph whose nodes represent some airports and whose edges represent flights from one airport to another. The edge values represent the mileage for each flight (or, they could represent the cost of an airplane ticket for that flight, the amount of time each flight takes, etc). This graph is strongly symmetric; rather than showing two edges connecting each pair of nodes, we show only one (double-arrowed) edge. While mileage has this property, other edge values (cost, travel time, etc.) might not. This graph is taken from the excellent book: Goodrich (a faculty member at UCI) and Tamassia, Data Structures and Algorithms in Java, John Wiley & Sons, 2010 (it has since been updated).

Let's state some facts about this graph using some of the terminology defined above.

  There is a node named SFO representing San Francisco.

  There is an edge from the node named SFO (origin) to the node named BOS (destination), and vice versa, that has the value 2704.

  The graph is strongly symmetric (so, really it is an undirected graph).

  The graph is cyclic (it has many cycles) and it is connected: there is a path from every node to every other node.

  It has a natural subgraph (for ORD, PVD, JFK) that is also connected; it has a natural subgraph (SFO, MIA, PVD) that is not connected: in fact, such a natural subgraph contains no edges.
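The natural subgraph for a subset of nodes, and the test for strong symmetry, can both be sketched in a few lines. In the snippet below only the SFO/BOS mileage of 2704 comes from the text above; the other edges and mileages are illustrative placeholders that have not been checked against graph.pdf, and the dict keyed by (origin, destination) pairs is our own choice of representation.

```python
def natural_subgraph(edges, nodes):
    """Keep exactly the edges whose origin and destination are both in nodes."""
    nodes = set(nodes)
    return {(o, d): v for (o, d), v in edges.items() if o in nodes and d in nodes}

def is_strongly_symmetric(edges):
    """True if every edge has a reverse edge carrying the same value."""
    return all(edges.get((d, o)) == v for (o, d), v in edges.items())

# Only SFO-BOS = 2704 is from the text; the rest are placeholder values.
one_way = {("SFO", "BOS"): 2704, ("ORD", "PVD"): 849, ("PVD", "JFK"): 144,
           ("JFK", "ORD"): 740, ("BOS", "MIA"): 1258}

# Store each double-arrowed edge as two directed edges with the same value.
flights = {}
for (a, b), miles in one_way.items():
    flights[(a, b)] = miles
    flights[(b, a)] = miles

print(is_strongly_symmetric(flights))                              # True
print(len(natural_subgraph(flights, ["ORD", "PVD", "JFK"])))       # 6
print(natural_subgraph(flights, ["SFO", "MIA", "PVD"]))            # {}
```

The second subset yields no edges at all because none of those three airports is directly connected to another one in this data.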
A similar but much more extensive graph is used as the underlying data structure in MapQuest or Google Maps, web sites that plan minimal travel routes, including computing the expected amount of travel time. Note that real graphs might model one-way streets (so there may be an edge, a street that one can travel, from corner1 to corner2 but not vice versa). Also, some roads may be partitioned into more lanes going one way than the other, so although there are edges going each way, their values might be different. These programs can take into account what time you are traveling (in some places, traffic patterns vary tremendously from the norm during rush hours); in fact, if billions of sensors are placed on roads throughout the US, they could report traffic slowdowns to these programs, which could contact you in your car, and automatically reroute you to avoid such delays. Or, if cars (or the cell phones of occupants) report their position and speed to a website, we would not need road sensors.

Graphs can also easily model the servers (nodes) and transmission lines (edges, with their transmission speeds/capacities, their bandwidth, indicated by their values) of the internet. We can ask questions like what is the minimum time it would take to transmit a large number of web pages from one server to another using all the paths available, not exceeding the bandwidth of any transmission line. This problem, a bit beyond the scope of this course, was originally solved by the Ford-Fulkerson algorithm, and improved by the Edmonds-Karp algorithm, whose complexity class is O(n e^2), where n is the number of nodes and e is the number of edges in the graph.
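For the curious, here is a minimal sketch of the Edmonds-Karp idea: it computes the maximum flow between two nodes by repeatedly finding a path with the fewest edges (using breadth-first search) that still has spare capacity, and pushing as much as the tightest edge on that path allows. The dict keyed by (origin, destination) pairs and the tiny network s, a, b, t with its capacities are hypothetical, and the sketch assumes the source and sink are different nodes.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow; capacity maps (origin, destination) to a capacity >= 0.

    Assumes source != sink.
    """
    residual = dict(capacity)               # remaining capacity on each edge
    neighbors = {}
    for u, v in capacity:
        residual.setdefault((v, u), 0)      # reverse edge, used to undo earlier flow
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    total = 0
    while True:
        # Breadth-first search for a shortest path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbors.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total                    # no augmenting path remains
        # Find the tightest edge on the path, walking back from the sink.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        # Push that much flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        total += bottleneck

# Hypothetical network: values are line capacities.
links = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(links, "s", "t"))            # 5
```

The answer 5 matches the total capacity leaving s, which no flow can exceed.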
------------------------------------------------------------------------------

Graph Metrics

We can discuss the minimum and maximum number of edges in a directed graph with N nodes (the minimum is 0 edges - no nodes connected; the maximum is N^2 edges - every node has an edge to every node, including itself), and use the terms "sparse" and "dense" to describe graphs that have O(N) and O(N^2) edges respectively.

Also, we can ask how many structurally different graphs there are with N nodes (we asked this same question for linked lists and trees): for directed graphs that allow an edge from a node to itself, there are (2^N)^N different graphs, or 2^(N^2). Each node in an N-node directed graph has 2^N different possible patterns of out-edges (yes/no to each of the other N-1 nodes and itself) and there are N nodes each having its own pattern. Think of the pattern for one node as representing one subset of the N nodes (a node is in the subset if there is an out-edge to it): there are 2^N different subsets of N values.

For example, in a 4 node graph (say nodes A, B, C, and D) a given node A can have 1 way of no out-edges, 4 ways of 1 out-edge (to A, B, C, or D), 6 ways of 2 out-edges (to A and B, to A and C, to A and D, to B and C, to B and D, or to C and D), 4 ways of 3 out-edges (to A, B, and C, to A, B, and D, to A, C, and D, and to B, C, and D), and 1 way of 4 out-edges (to A, B, C, and D) for a total of 16 (= 2^4). Each of the 4 nodes can have the same 16 possible patterns of out-edges, so there are 16^4 (65,536) different graphs, which is also 2^16 (= (2^4)^4).

So for 10 nodes there are 2^100 different graphs or (2^10)^10 or about 10^30 different graphs. For 1000 nodes there are 2^1,000,000 different graphs or (2^10)^100,000, or about 10^300,000 different graphs (common estimates put the number of atoms of matter in the known universe at about 10^78 to 10^82; so, 10^300,000 is unimaginably larger).
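The arithmetic above is easy to check mechanically. The sketch below enumerates the out-edge patterns for one node of the 4-node example and evaluates 2^(N^2) for the sizes mentioned; the function name count_digraphs is our own.

```python
from itertools import combinations
from math import log10

def count_digraphs(n):
    """Number of directed graphs on n labelled nodes, self-edges allowed: 2^(n^2)."""
    return 2 ** (n * n)

nodes = ["A", "B", "C", "D"]

# Number of ways one node can have exactly k out-edges, for k = 0..4.
patterns_by_size = [len(list(combinations(nodes, k))) for k in range(len(nodes) + 1)]
print(patterns_by_size)            # [1, 4, 6, 4, 1]
print(sum(patterns_by_size))       # 16

print(count_digraphs(4))           # 65536
print(len(str(count_digraphs(10))))  # 31 digits, so about 10^30

# For 1000 nodes, work with the base-10 exponent instead of printing the number.
print(int(1000 * 1000 * log10(2)))   # 301029
```

The last line shows that 2^1,000,000 is about 10^301,029, consistent with the rough figure of 10^300,000.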
So, the number of graphs grows much faster than the number of lists (all N-value lists have the same structure) and the number of trees (4^N/sqrt(pi*N^3)), whose exponent is just N, not N^2.

------------------------------------------------------------------------------

Storing/Manipulating Graphs

The most fundamental question we can ask about a graph is (a) whether there is an edge from node A to node B (and if there is, what is its value). Another important question is, (b) given node A, what are all the edges whose origin is A (or edges whose destination is A). There are a few standard ways to store information about a graph so that we can answer these questions efficiently.

1) A MATRIX, with N rows and N columns (a row and a column for each node in the graph), whose entry in row A and column B stores either nothing (there is no edge) or the value on the edge from node A to node B in this graph. In a directed graph we would store all N^2 values, and the value in row A and column B might be different than the value in row B and column A. In an undirected graph, we could just store the "upper triangular part" since the value at row A and column B is the same as the value in row B and column A: so look up the value in row min(A,B) and column max(A,B). To answer question (a) is O(1) and to answer question (b) is O(N) - scan one entire row or column in the matrix. Note that a matrix requires O(N^2) storage, even if the graph is sparse and contains only O(N) edges.

2) An array with N entries (one for each node in the graph), with each index i storing a linked list of the edge values/destination nodes for the edges whose origin node is numbered i. This is called an ADJACENCY LIST: each node stores a reference to a list of nodes reachable from it. To answer question (a) we go to the index for node A and traverse all the values in the linked list looking for B.
So to answer question (a) is O( out-degree(A) ) and to answer question (b) is O(1), since the reference at index A leads to a list of exactly those edges whose origin is node A (finding the edges whose destination is A would require scanning every list). In a sparse graph out-degree(i) is O(1) and in a dense graph out-degree(i) is O(N).

3) A HashMap with M keys (M is the number of edges in the graph; each key is a pair of nodes) in which each key is associated with the value of the edge between those nodes; and, a second HashMap with N keys (one for each node in the graph) in which each key is associated with a set of edges having that node as their origin. To answer question (a) we just do a map lookup of the edge, which is O(1). To answer question (b) we do a map lookup of the node, which again is O(1).

We will add a Graph class to our standard Data Types that has many more interesting commands and queries. To implement all operations efficiently, we will store a variety of sets in the map from nodes to "information" connected to a node (in/out edges and in/out nodes) to allow quick execution of these useful methods. This will be done in a HashGraph, which is Programming Assignment #5. Here is a quick overview of what we will be able to do with Graphs:

  add a node, add an edge, remove a node, remove an edge

  get a count of the nodes and a count of the edges

  check whether a graph has a node, has an edge, and get the value of an edge

  find the in-degree, out-degree, and the degree of a node

  iterate over all nodes and all edges in the graph

  iterate over all out-nodes, in-nodes, out-edges, and in-edges of any given node

  print a graph into a file and load a graph from a file
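To give a feel for the map-based representation (option 3 above) and a few of the operations just listed, here is a minimal sketch. It is not the HashGraph of the programming assignment: the class name HashGraphSketch, its method names, and its fields are all our own assumptions, chosen only to show how keeping in-node and out-node sets alongside the edge map makes these queries quick.

```python
class HashGraphSketch:
    """Illustrative digraph stored in hash maps; not the course's HashGraph class."""

    def __init__(self):
        self.edge_value = {}   # (origin, destination) -> value of that edge
        self.out_nodes = {}    # node -> set of destinations of its out-edges
        self.in_nodes = {}     # node -> set of origins of its in-edges

    def add_node(self, node):
        self.out_nodes.setdefault(node, set())
        self.in_nodes.setdefault(node, set())

    def add_edge(self, origin, destination, value):
        self.add_node(origin)
        self.add_node(destination)
        self.edge_value[(origin, destination)] = value
        self.out_nodes[origin].add(destination)
        self.in_nodes[destination].add(origin)

    def remove_edge(self, origin, destination):
        if (origin, destination) in self.edge_value:
            del self.edge_value[(origin, destination)]
            self.out_nodes[origin].discard(destination)
            self.in_nodes[destination].discard(origin)

    def remove_node(self, node):
        # Remove every edge touching the node, then the node itself.
        for destination in list(self.out_nodes.get(node, ())):
            self.remove_edge(node, destination)
        for origin in list(self.in_nodes.get(node, ())):
            self.remove_edge(origin, node)
        self.out_nodes.pop(node, None)
        self.in_nodes.pop(node, None)

    def has_node(self, node):
        return node in self.out_nodes

    def has_edge(self, origin, destination):       # question (a)
        return (origin, destination) in self.edge_value

    def out_degree(self, node):
        return len(self.out_nodes[node])

    def in_degree(self, node):
        return len(self.in_nodes[node])


g = HashGraphSketch()
g.add_edge("SFO", "BOS", 2704)
g.add_edge("BOS", "SFO", 2704)
g.add_edge("SFO", "LAX", 337)      # placeholder mileage, for illustration only

print(g.has_edge("SFO", "BOS"), g.edge_value[("SFO", "BOS")])   # True 2704
print(g.out_degree("SFO"), g.in_degree("SFO"))                  # 2 1
g.remove_node("BOS")
print(g.has_edge("SFO", "BOS"), len(g.edge_value))              # False 1
```

Storing both the in-node and out-node sets costs extra space and extra work on every update, but it means that finding the origins of a destination node no longer requires scanning the whole graph.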