 # ICS 280, Spring 1999: Computational Statistics

## Hierarchical Clustering

Hierarchical clustering refers to the formation of a recursive clustering of the data points: a partition into two clusters, each of which is itself hierarchically clustered (with the recursion bottoming out at single points).

One way to draw this is as a system of nested subsets, maximal in the sense that one can't add any further subset without violating the nesting:

```
 __________________________
|   _________              |
|  |  ______  |    _____   |
|  | |      | |   |     |  |
|  | | x  y | |   |  u  |  |
|  | |______| |   |     |  |
|  |          |   |  v  |  |
|  |    z     |   |_____|  |
|  |__________|            |
|__________________________|
```

Alternatively, one can draw a "dendrogram": a binary tree with a distinguished root, having all the data items at its leaves:

```
        /\
       /  \
      /    \                 ________
     /      \               |        |
    /\       \     or     __|__      |
   /  \       \          |     |    _|_
  /\   \      /\        _|_    |   |   |
 /  \   \    /  \      |   |   |   |   |
x    y   z  u    v     x   y   z   u   v
```

Conventionally, all the leaves are shown at the same level of the drawing. The ordering of the leaves is arbitrary, as is their horizontal position. The heights of the internal nodes may be arbitrary, or may be related to the metric information used to form the clustering.

### Data Models

The data for a clustering problem may consist of points in a Euclidean vector space, or of more structured objects such as DNA sequences (in which case the hierarchical clustering problem is essentially equivalent to reconstructing evolutionary trees, and is also known as "phylogeny"). However, many clustering algorithms assume simply that the input is given as a distance matrix. The distances may or may not define a metric; one popular data model is that the data form an "ultrametric" (also known as a "non-Archimedean metric"), a special type of metric in which the distances satisfy the "ultrametric triangle inequality"
dist(a,c) <= max(dist(a,b), dist(b,c))
This inequality is satisfied, for instance, if the data points are the leaves of a dendrogram drawing, with the distance between two points defined to be the height of their least common ancestor; in fact this is just an equivalent way of defining an ultrametric. The inequality is also satisfied by the vertices of any edge-weighted graph, with the distance between two vertices defined to be the minimum, over all paths connecting them, of the maximum edge weight on the path.
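As a small illustration (the merge heights below are hypothetical, not from the notes), the following sketch derives distances from a dendrogram's merge heights and checks the ultrametric inequality on every triple:

```python
# Sketch (hypothetical heights): distances defined as the height of the
# least common ancestor in a dendrogram satisfy the ultrametric
# triangle inequality.  Clusters follow the left dendrogram above:
# {x,y} merges at height 1, {x,y,z} at height 2, {u,v} at height 1,
# and the root {x,y,z,u,v} at height 3.

merge_height = {
    frozenset("xy"): 1,
    frozenset("xyz"): 2,
    frozenset("uv"): 1,
    frozenset("xyzuv"): 3,
}

def lca_height(a, b):
    """Height of the smallest merged cluster containing both points."""
    return min(h for cluster, h in merge_height.items()
               if a in cluster and b in cluster)

# check dist(a,c) <= max(dist(a,b), dist(b,c)) on every distinct triple
points = "xyzuv"
for a in points:
    for b in points:
        for c in points:
            if len({a, b, c}) == 3:
                assert lca_height(a, c) <= max(lca_height(a, b),
                                               lca_height(b, c))
```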

The ultrametric requirement may be too strong -- e.g. in evolutionary trees, if one measures distance by mutation rate, it would imply the biologically unrealistic condition that all species evolve at the same rate. A weaker condition is that the distances are formed by path lengths in a tree, without the ultrametric requirement that each root-leaf path have equal length. A distance function meeting this weaker requirement is also known as an additive metric.

If the data model is that the data points form an ultrametric, and that the input to the clustering algorithm is a distance matrix, a typical noise model would be that the values in this matrix are independently perturbed by some random distribution. Additional data observations may consist of new points in this ultrametric (e.g. in the large-population genetic studies used to test the hypothesis that the human species evolved in Africa before migrating to other continents [cite?]), or may be used to reduce the perturbation in existing distance measurements (e.g. by comparing larger amounts of DNA sequence information).

Another common data and noise model for DNA sequence input is the Cavender-Farris model [C78], in which the data parameters consist of an ultrametric (dendrogram with edge lengths) together with a prototype sequence stored at the root of the dendrogram. The observed data in this model represent the contents of a single position of the DNA sequence for each species, and are formed by starting with a symbol at the root of the dendrogram and then propagating that value downwards, with mutation rates proportional to the dendrogram edge lengths.
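A minimal simulation of this kind of downward propagation might look as follows; for simplicity it uses a two-state symbol instead of the four DNA bases, and the tree, edge lengths, and mutation-rate parameterization are all made-up illustrative choices, not the exact model of [C78]:

```python
import random

# Sketch of Cavender-Farris-style data generation: a binary symbol
# starts at the root and propagates down the dendrogram, flipping
# along each edge with probability proportional to the edge length
# (capped at 1/2).  All numbers here are hypothetical.

RATE = 0.1  # assumed mutation rate per unit of edge length

def propagate(node, symbol, rng):
    """Return {leaf: symbol} after propagating `symbol` down the tree.
    Leaves are strings; internal nodes are (left, right, llen, rlen)."""
    if isinstance(node, str):
        return {node: symbol}
    left, right, llen, rlen = node
    out = {}
    for child, elen in ((left, llen), (right, rlen)):
        s = symbol
        if rng.random() < min(0.5, RATE * elen):
            s = 1 - s                  # mutation: flip the binary symbol
        out.update(propagate(child, s, rng))
    return out

# Hypothetical tree matching the dendrogram figure: ((x,y),z) and (u,v).
tree = ((("x", "y", 1.0, 1.0), "z", 1.0, 2.0),
        ("u", "v", 1.0, 1.0), 1.0, 1.0)
leaves = propagate(tree, 0, random.Random(0))
```

Repeating this once per sequence position yields one observed "site" per species per position, the observed data of the model.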

### Dynamic Programming Methods

Dynamic programming may be used to find the "most parsimonious" tree for a given set of sequences; that is, a dendrogram, with each internal vertex labeled by a sequence constructed by the algorithm (the leaves are labeled by the input sequences), such that the total amount of mutation between sequences adjacent to each other on the tree is minimized. However, the computational complexity of this is prohibitive: typically something like O(n^k), where n is the sequence length and k is the number of sequences. This method may not always converge to the correct tree (e.g. in the Cavender-Farris model), but as Rice and Warnow show [RW97], it performs very well in practice.
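The expensive part is the search over tree topologies; for a single fixed dendrogram, the inner dynamic program is fast. A standard method for that subproblem is Fitch's algorithm, sketched below for one sequence position (the tree and observed symbols are made up):

```python
# Fitch's algorithm: for one sequence position and a FIXED dendrogram,
# compute the minimum number of mutations (small-parsimony score).

def fitch(node):
    """Return (candidate symbols, minimum mutation count) for a subtree.
    Leaves are 1-character strings; internal nodes are (left, right)."""
    if isinstance(node, str):
        return {node}, 0
    lset, lcost = fitch(node[0])
    rset, rcost = fitch(node[1])
    if lset & rset:
        return lset & rset, lcost + rcost   # parent can agree with both sides
    return lset | rset, lcost + rcost + 1   # a mutation is forced here

# Five species on the tree (((x,y),z),(u,v)), one position per species:
# hypothetical observed symbols x=A, y=A, z=C, u=G, v=G.
tree = ((("A", "A"), "C"), ("G", "G"))
symbols, score = fitch(tree)
# score == 2: any optimal labeling needs one mutation to C and one to G
```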

Another dynamic program arises from the problem of fitting an optimal dendrogram to a distance matrix. If one has a particular dendrogram in mind (as an abstract tree), the problem of assigning heights to its vertices to minimize the maximum difference D between the dendrogram distances and the given distance matrix is simply a linear program, where each height is linearly constrained to be above the heights of its children and within D of each matrix distance it represents. Since this linear program has only two variables per inequality, it can be solved efficiently [cite?]. However, the difficult part of this procedure is choosing which dendrogram to use. The number of distinct dendrograms with k leaves is (2k-3)!! = 1*3*5*7*...*(2k-3), since there are 2k-3 ways of adding a kth leaf to a (k-1)-leaf dendrogram, so unless k is very small one can't hope to try them all. One could possibly find the optimal solution for somewhat larger (but still not very large) k by a dynamic programming method in which one finds the optimal dendrogram for each subset of the data points, by trying all ways of partitioning that subset into two smaller subsets.
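The count of dendrogram topologies follows from a leaf-insertion recurrence: a rooted binary tree with m leaves has 2m-2 edges, plus one more attachment point above the root, so the next leaf can be added in 2m-1 places. A short sketch:

```python
# Number of dendrogram (rooted binary tree) topologies on k labeled
# leaves, via the insertion recurrence: going from m to m+1 leaves
# multiplies the count by 2m-1, giving 1*3*5*...*(2k-3).

def num_dendrograms(k):
    count = 1
    for m in range(2, k):      # insert leaves 3, 4, ..., k
        count *= 2 * m - 1
    return count
```

For example, there are 3 dendrograms on 3 leaves and 105 on 5 leaves; the count grows superexponentially, which is why exhaustive search is hopeless beyond very small k.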

### Local Improvement Methods

If one can't compute a global optimum, one can at least achieve a local optimum: start with any tree, and then repeatedly use dynamic programming to re-optimize small subtrees. This idea can be used to get a (1+epsilon)-factor approximation to parsimony [cite?].

### Bottom Up Methods

There is a large family of clustering algorithms that all work as follows: Choose some measure of "affinity" for clusters. Start with n clusters, each consisting of a single point. Then repeatedly merge the two clusters with the highest affinity into a single supercluster. The differences between these algorithms involve the definition of affinity, and the implementation details. Note that even for points in a metric space, the "affinity" need not satisfy the triangle inequality or other nice properties.
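The merge loop itself is the same no matter which affinity is plugged in. A naive sketch (quadratic scans, purely illustrative; the 1-D data set and the single-linkage-style affinity are made up for the demo):

```python
# Generic bottom-up (agglomerative) skeleton: the affinity function is
# a parameter; everything else is the same repeated-merge loop.

def agglomerate(points, affinity):
    """Return a nested-tuple dendrogram over `points`.
    `affinity(c1, c2)` scores two clusters (higher = merge sooner);
    clusters are represented as nested tuples of points."""
    clusters = list(points)
    while len(clusters) > 1:
        a, b = max(((c1, c2) for i, c1 in enumerate(clusters)
                    for c2 in clusters[i + 1:]),
                   key=lambda pair: affinity(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append((a, b))
    return clusters[0]

def leaves(c):
    return [c] if isinstance(c, str) else leaves(c[0]) + leaves(c[1])

# Toy 1-D data with a single-linkage-style affinity (negated closest
# distance, so that max() picks the closest pair of clusters):
pos = {"a": 0.0, "b": 1.0, "c": 10.0, "d": 11.0}
aff = lambda c1, c2: -min(abs(pos[p] - pos[q])
                          for p in leaves(c1) for q in leaves(c2))
tree = agglomerate(["a", "b", "c", "d"], aff)
# tree == (("a", "b"), ("c", "d"))
```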

If the affinity of two clusters is defined to be the distance between the closest pair of points, the problem is known as "single-linkage" clustering. Essentially, it is solved by the minimum spanning tree (the top level clustering is formed by removing the heaviest edge from the MST, and the remaining levels are formed in the same manner recursively). Since minimum spanning trees can often be found efficiently, so can this clustering, and it does produce the correct results when the input is an ultrametric distance matrix (without errors). However, for points in Euclidean spaces, it tends to produce unsatisfactory long stringy clusters.
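The MST connection can be sketched directly: build the minimum spanning tree with Kruskal's algorithm, then split recursively at the heaviest remaining edge. The upper-triangular distance matrix below is a made-up example with ultrametric-like structure:

```python
# Single-linkage clustering via the minimum spanning tree.

def mst_edges(points, dist):
    """Kruskal's algorithm; returns MST edges in increasing weight order."""
    parent = {p: p for p in points}
    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p
    candidates = sorted((dist[a][b], a, b) for i, a in enumerate(points)
                        for b in points[i + 1:])
    tree = []
    for w, a, b in candidates:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((w, a, b))
    return tree

def single_linkage(points, dist):
    """Dendrogram formed by recursively removing the heaviest MST edge."""
    if len(points) == 1:
        return points[0]
    tree = mst_edges(points, dist)
    comp = {p: p for p in points}        # union-find over all but the
    def find(p):                         # heaviest edge, tree[-1]
        while comp[p] != p:
            p = comp[p]
        return p
    for w, a, b in tree[:-1]:
        comp[find(a)] = find(b)
    root = find(points[0])
    side1 = [p for p in points if find(p) == root]
    side2 = [p for p in points if find(p) != root]
    return (single_linkage(side1, dist), single_linkage(side2, dist))

# Hypothetical distances: {x,y,z} and {u,v} are far apart.
pts = ["x", "y", "z", "u", "v"]
d = {"x": {"y": 1, "z": 2, "u": 9, "v": 9},
     "y": {"z": 2, "u": 9, "v": 9},
     "z": {"u": 9, "v": 9},
     "u": {"v": 1}}
# single_linkage(pts, d) == ((("x", "y"), "z"), ("u", "v"))
```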

Another common affinity function is the average distance between points in the two clusters; merging by this rule is known as average-linkage clustering or UPGMA (in complete-linkage clustering, one instead uses the distance between the farthest pair of points). For distance matrix input, the average-distance affinity can be maintained by replacing the two merged rows of the matrix with an appropriate weighted average. Alternatively, if one has a good notion of single point estimation for a cluster (e.g. the centroid), one can define affinity in terms of the distance between cluster centers.
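The row-replacement step can be sketched as follows (dict-of-dicts matrix, size-weighted averaging so that the merged row stores true average point-to-point distances; the numbers are hypothetical):

```python
# Average-linkage row merging: after merging clusters i and j, the
# distance from the merged cluster to any other cluster k is the
# size-weighted average of the two old distances.

def merge_rows(D, sizes, i, j):
    """Replace clusters i and j by their union, kept under key i."""
    for k in list(D):
        if k in (i, j):
            continue
        D[i][k] = D[k][i] = ((sizes[i] * D[i][k] + sizes[j] * D[j][k])
                             / (sizes[i] + sizes[j]))
        del D[k][j]
    del D[i][j]
    del D[j]
    sizes[i] += sizes[j]
    del sizes[j]

# Toy symmetric matrix on three singleton clusters:
D = {"a": {"b": 2.0, "c": 6.0},
     "b": {"a": 2.0, "c": 4.0},
     "c": {"a": 6.0, "b": 4.0}}
sizes = {"a": 1, "b": 1, "c": 1}
merge_rows(D, sizes, "a", "b")
# D["a"]["c"] is now (6.0 + 4.0) / 2 = 5.0
```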

"Neighbor-joining" [SN87] is a more complicated affinity: it assumes that one has a tree in which each point is connected to a common center, and measures the amount by which the total edge length of the tree would be reduced by placing a "Steiner point" between two points and connecting that Steiner point to the center. Formally, the affinity between x and y is

distance(x,center) + distance(y,center) - distance(x,SP) - distance(y,SP) - distance(SP,center)
where SP is the Steiner point. For distance matrix input, the center is found by averaging the rows of the whole matrix, so the distance from x to the center is just the average distance from x to the other points. Similarly, the Steiner point can be found by averaging the rows for x and y (in which case the two terms distance(x,SP) + distance(y,SP) sum to distance(x,y)). Atteson [A96] showed that this method converges to the correct tree for distance matrices that are sufficiently close (in terms of L∞ distance) to a tree metric.
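A direct rendering of this affinity for distance-matrix input might look like the sketch below. This follows the description above rather than the exact normalization of [SN87] (the standard neighbor-joining "Q-criterion" differs by constant factors), and the quartet distances are made up:

```python
# Neighbor-joining affinity, per the formula above: the total edge
# length saved by joining x and y through a Steiner point before
# connecting to the center.  Rows of the matrix are dicts; d[x][x] = 0.

def nj_affinity(d, x, y):
    n = len(d)
    row_avg = {p: sum(d[p].values()) / n for p in d}  # distance to "center"
    sp_to_center = (row_avg[x] + row_avg[y]) / 2      # average of averaged row
    # d(x,SP) + d(y,SP) together equal d(x,y):
    return row_avg[x] + row_avg[y] - d[x][y] - sp_to_center

# Hypothetical quartet tree ((x,y),(z,w)) with unit-length edges:
d = {"x": {"x": 0, "y": 2, "z": 3, "w": 3},
     "y": {"x": 2, "y": 0, "z": 3, "w": 3},
     "z": {"x": 3, "y": 3, "z": 0, "w": 2},
     "w": {"x": 3, "y": 3, "z": 2, "w": 0}}
```

On this matrix the true neighbor pairs (x,y) and (z,w) score strictly higher than any mixed pair, so the first merge is correct.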

Many of these methods have efficient algorithms, both for distance matrix input [E98] and for points in low-dimensional Euclidean spaces [KL95]. Neighbor-joining seems more difficult, with the best known time bound being O(n³) (and some commonly available implementations taking even more than that). However, even the faster of these methods are too slow for data consisting of millions of points in moderate- to high-dimensional Euclidean spaces.

### Top Down Methods

One can also form hierarchical clusterings top down, following the definition above: use your favorite nonhierarchical clustering algorithm to find a partition of the input into two clusters, and then continue recursively on those two clusters.

The advantage of these methods is speed: they can scale to problem sizes that are beyond the bottom up methods. However, the quality of the results may be poorer, because the most important decisions are being made early in the algorithm before it has accumulated enough information to make them well. Also, because of the emphasis on speed, the nonhierarchical clustering methods used tend to be primitive (e.g. split the points by an axis-parallel hyperplane).
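A minimal sketch of this approach, using exactly the kind of primitive splitter just mentioned (an axis-parallel cut, here at the median of the widest-spread coordinate, as a crude stand-in for a real 2-clustering; the 2-D data are made up):

```python
# Top-down hierarchical clustering by recursive axis-parallel splits.

def top_down(points):
    """Nested-tuple dendrogram over a list of 2-D (x, y) tuples."""
    if len(points) <= 1:
        return points[0]
    # pick the coordinate axis along which the points are most spread out
    axis = max(range(2), key=lambda a: max(p[a] for p in points)
                                       - min(p[a] for p in points))
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                  # median cut along that axis
    return (top_down(pts[:mid]), top_down(pts[mid:]))

# Two well-separated toy groups along the x axis:
data = [(0.0, 0.0), (1.0, 0.2), (10.0, 0.1), (11.0, 0.3)]
tree = top_down(data)
```

The forced even split at the median is one of the crude early decisions the paragraph above warns about: if the true clusters have unequal sizes, this splitter cannot recover them.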

### Incremental Methods

Incremental hierarchical clustering methods can be even faster than the top down approach. These methods build the hierarchy one point at a time, without changing the existing hierarchy. To add a new point, simply trace down from the root, at each step choosing the child cluster that best contains the given point. Once you reach a cluster containing a single point, split it into two smaller clusters, one for that point and one for the new point.
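A sketch of this insertion procedure, on a nested-tuple tree; the "best child" rule used here (descend toward the child whose nearest point is closer) is one of several plausible choices, and the 1-D data are made up:

```python
# Incremental hierarchical clustering: insert points one at a time,
# never restructuring the existing hierarchy.

def leaves(t):
    return [t] if isinstance(t, str) else leaves(t[0]) + leaves(t[1])

def insert(t, p, dist):
    if isinstance(t, str):
        return (t, p)                  # split the leaf into a 2-leaf cluster
    d0 = min(dist(p, q) for q in leaves(t[0]))
    d1 = min(dist(p, q) for q in leaves(t[1]))
    if d0 <= d1:
        return (insert(t[0], p, dist), t[1])
    return (t[0], insert(t[1], p, dist))

# Toy 1-D points, inserted in a deliberately awkward order:
pos = {"a": 0.0, "b": 1.0, "c": 10.0, "d": 0.4}
dist = lambda p, q: abs(pos[p] - pos[q])
tree = "a"
for p in ["b", "c", "d"]:
    tree = insert(tree, p, dist)
# tree == (("a", "d"), ("b", "c"))
```

Note that the far-away point "c" ends up permanently paired with "b", because the early split between "a" and "b" can never be revisited; this is exactly the accuracy limitation of the approach.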

This is extremely fast, and good for applications such as nearest neighbor classification, but not very good for accurately reconstructing the clusters present in the original data model.

### Numerical Taxonomy

Cavalli-Sforza and Edwards [CE67] introduced the problem of finding a taxonomy by finding the "nearest" tree metric or ultrametric to a given distance matrix. Of course, one has to define "nearest"; the natural choices seem to be the L1, L2, or L∞ metrics on the distance matrix coordinates.

The L1 and L2 problems are NP-hard for both the tree metric and ultrametric variants [D87]. However, the L∞-nearest ultrametric can be found in time linear in the size of the distance matrix [FKW95].

Agarwala et al. [ABFNPT98] show that it is NP-hard to get within a factor of 9/8 of the L∞-nearest tree metric; however, they describe a fast approximation algorithm that achieves a factor of three.

There has also been some work within the theoretical computer science community (motivated by approximation algorithms for various network design problems) on approximating a distance matrix by a tree metric with minimum "dilation" or "stretch" (the maximum ratio of tree distance to original distance). The main result in this area is that one can define a random distribution on trees such that the expected dilation for any given pair of vertices is O(log n log log n) [CCGGP98]. This doesn't seem to say much about finding an individual tree with optimal or nearly-optimal dilation, but it does imply that one can find a tree with average dilation O(log n log log n).

Guénoche [G96] has proposed a way of measuring the nearness of a given distance matrix to a tree metric that depends only on the ordering among the distances, not on their exact values; in that sense it is "distance-free", much as the centerpoint and regression depth methods are for single point estimation and linear regression. He defines the "order distance" Delta(x,y) to be the number of pairs (u,v) for which x and y disagree on the ordering of their distances: dist(x,u) < dist(x,v) but dist(y,u) > dist(y,v) (with a pair counted as half when one of the two comparisons is an equality). He then suggests applying an incremental method using Delta in place of the original distances.
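A direct transcription of Delta might look as follows. This is my reading of the definition: the pairs (u,v) range over the points other than x and y, and a pair counts as half when exactly one of the two comparisons is a tie. The small matrix is made up:

```python
# Guénoche-style order distance: count the pairs (u, v) on whose
# distance ordering x and y disagree, with ties counted as one half.

def order_distance(d, x, y):
    pts = [p for p in d if p not in (x, y)]
    delta = 0.0
    for i, u in enumerate(pts):
        for v in pts[i + 1:]:
            # sign of the comparison dist(.,u) vs dist(.,v): -1, 0, or +1
            sx = (d[x][u] > d[x][v]) - (d[x][u] < d[x][v])
            sy = (d[y][u] > d[y][v]) - (d[y][u] < d[y][v])
            if sx * sy == -1:
                delta += 1        # strict disagreement
            elif sx != sy:
                delta += 0.5      # one of the two comparisons is a tie
    return delta

# Toy symmetric matrix on four points:
d = {"a": {"a": 0, "b": 1, "c": 2, "e": 3},
     "b": {"a": 1, "b": 0, "c": 3, "e": 2},
     "c": {"a": 2, "b": 3, "c": 0, "e": 4},
     "e": {"a": 3, "b": 2, "c": 4, "e": 0}}
# a and b disagree on the pair (c, e), so order_distance(d,"a","b") == 1.0
```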