Computational Statistics

One way to draw this is as a system of nested subsets, maximal in the sense that one cannot identify any additional subsets without violating the nesting:

 __________________________
| _________                |
| | _____ |     _____      |
| ||     ||    |     |     |
| || x y ||    |  u  |     |
| ||_____||    |     |     |
| |       |    |  v  |     |
| |   z   |    |_____|     |
| |_______|                |
|__________________________|

Alternatively, one can draw a "dendrogram", that is, a binary tree with a distinguished root, that has all the data items at its leaves:

         /\
        /  \
       /    \               ________
      /      \             |        |
     /\       \     or   __|__      |
    /  \       \        |     |    _|_
   /\   \      /\      _|_    |   |   |
  /  \   \    /  \    |   |   |   |   |
  x  y    z   u  v    x   y   z   u   v

Conventionally, all the leaves are shown at the same level of the drawing. The ordering of the leaves is arbitrary, as is their horizontal position. The heights of the internal nodes may be arbitrary, or may be related to the metric information used to form the clustering.

dist(a,c) <= max(dist(a,b), dist(b,c))

This inequality is satisfied, for instance, if the data points are
leaves of a dendrogram drawing, with the distance defined to be the height of
their least common ancestor; in fact this is just an equivalent way of
defining an ultrametric. The inequality is also satisfied for
vertices in any edge-weighted graph,
with the length of a path defined to be its maximum-weight edge
and the distance between two vertices defined to be the minimum
length of any path connecting them.
The ultrametric requirement may be too strong -- e.g. in evolutionary trees, if one measures distance by mutation rate, it would imply the biologically unrealistic condition that all species evolve at the same rate. A weaker condition is that the distances are formed by path lengths in a tree, without the ultrametric requirement that each root-leaf path have equal length. A distance function meeting this weaker requirement is also known as an additive metric.
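The minimax-path claim is easy to check directly. The sketch below (in Python, on a small hypothetical weight matrix) computes all-pairs minimax distances with a Floyd-Warshall-style recurrence and verifies the ultrametric inequality for every triple:

```python
def minimax_distances(W):
    """All-pairs minimax path distances: the length of a path is its
    maximum-weight edge, and the distance between two vertices is the
    minimum length over all connecting paths (Floyd-Warshall style).
    W[i][j] is the weight of edge (i, j), or float('inf') if absent."""
    n = len(W)
    D = [row[:] for row in W]
    for i in range(n):
        D[i][i] = 0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i][j] = min(D[i][j], max(D[i][k], D[k][j]))
    return D

# Example: a path graph 0 -1- 1 -5- 2 -2- 3 (numbers are edge weights).
INF = float("inf")
W = [[INF, 1, INF, INF],
     [1, INF, 5, INF],
     [INF, 5, INF, 2],
     [INF, INF, 2, INF]]
D = minimax_distances(W)
# Every triple satisfies dist(a,c) <= max(dist(a,b), dist(b,c)).
assert all(D[a][c] <= max(D[a][b], D[b][c])
           for a in range(4) for b in range(4) for c in range(4))
```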

If the data model is that the data points form an ultrametric, and that the input to the clustering algorithm is a distance matrix, a typical noise model would be that the values in this matrix are independently perturbed by some random distribution. Additional data observations may consist of new points in this ultrametric (e.g. in the large-population genetic studies used to test the hypothesis that the human species evolved in Africa before migrating to other continents [cite?]), or may be used to reduce the perturbation in existing distance measurements (e.g. by comparing larger amounts of DNA sequence information).
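A minimal sketch of the perturbation step, assuming (as one plausible choice of the "random distribution" above) independent Gaussian noise on each off-diagonal entry, applied symmetrically:

```python
import random

def perturb(D, sigma=0.1):
    """Independently perturb each off-diagonal distance, keeping the
    matrix symmetric with a zero diagonal. Gaussian noise is only one
    possible choice of the noise distribution."""
    n = len(D)
    E = [row[:] for row in D]
    for i in range(n):
        for j in range(i + 1, n):
            noise = random.gauss(0.0, sigma)
            E[i][j] += noise
            E[j][i] += noise
    return E

random.seed(0)
E = perturb([[0, 2, 5], [2, 0, 5], [5, 5, 0]])
```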

Another common data and noise model for DNA sequence input is the Cavender-Farris model [C78], in which the data parameters consist of an ultrametric (dendrogram with edge lengths) together with a prototype sequence stored at the root of the dendrogram. The observed data in this model represent the contents of a single position of the DNA sequence for each species, and are formed by starting with a symbol at the root of the dendrogram and then propagating that value downwards, with mutation rates proportional to the dendrogram edge lengths.
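The downward propagation can be sketched as follows. This is a simplified two-state version in which each edge flips the symbol with a fixed probability p; a fuller implementation would make the flip probability proportional to each edge's length, as the model requires. The tree encoding (a leaf name, or a pair of subtrees) is our own:

```python
import random

def propagate(tree, symbol, p=0.1):
    """Two-state sketch of Cavender-Farris propagation: start with
    `symbol` at the root and flip it with probability p along each edge.
    `tree` is a species name (str) or a pair (left, right) of subtrees."""
    if isinstance(tree, str):
        return {tree: symbol}
    observed = {}
    for child in tree:
        s = symbol if random.random() >= p else 1 - symbol
        observed.update(propagate(child, s, p))
    return observed

random.seed(0)
# One simulated position for the dendrogram ((x,y),(u,v)).
leaves = propagate((("x", "y"), ("u", "v")), symbol=0, p=0.2)
```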

Another dynamic program arises from the problem of fitting an optimal dendrogram to a distance matrix. If one has a particular dendrogram in mind (as an abstract tree), the problem of assigning heights to its vertices to minimize the maximum difference D between the dendrogram distances and the given distance matrix is simply a linear program, in which each height is constrained to be above the heights of its children and within D of each matrix distance it represents. Since this linear program has only two variables per inequality, it can be solved efficiently [cite?]. However, the difficult part of this procedure is choosing which dendrogram to use. The number of possible different dendrograms with k leaves is (2k-3)!! = 1*3*5*...*(2k-3), since there are 2k-3 ways of adding a kth leaf to a (k-1)-leaf dendrogram, so unless k is very small one can't hope to try them all. One could possibly find the optimal solution for somewhat larger (but still not very large) k by a dynamic programming method in which one finds the optimal dendrogram for each subset of the data points, by trying all ways of partitioning that subset into two smaller subsets.
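The double-factorial count can be confirmed by brute force for small k, growing dendrograms one leaf at a time. This sketch encodes a dendrogram as a leaf label or a two-element frozenset of subtrees, so that set comparison ignores left-right order:

```python
def insert_leaf(tree, leaf):
    """Yield every dendrogram obtained by attaching `leaf` to `tree`:
    either join at the root, or recurse into one of the two subtrees."""
    yield frozenset([tree, leaf])        # new root above the old one
    if isinstance(tree, frozenset):
        a, b = tree
        for sub in insert_leaf(a, leaf):
            yield frozenset([sub, b])
        for sub in insert_leaf(b, leaf):
            yield frozenset([a, sub])

def dendrograms(k):
    """All dendrogram topologies on leaves 0..k-1."""
    trees = {0}
    for leaf in range(1, k):
        trees = {t for old in trees for t in insert_leaf(old, leaf)}
    return trees

counts = [len(dendrograms(k)) for k in range(2, 6)]
# counts == [1, 3, 15, 105], i.e. 1*3*5*...*(2k-3) for k = 2..5
```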

If the affinity of two clusters is defined to be the distance between the closest pair of points, the problem is known as "single-linkage" clustering. Essentially, it is solved by the minimum spanning tree (the top level clustering is formed by removing the heaviest edge from the MST, and the remaining levels are formed in the same manner recursively). Since minimum spanning trees can often be found efficiently, so can this clustering, and it does produce the correct results when the input is an ultrametric distance matrix (without errors). However, for points in Euclidean spaces, it tends to produce unsatisfactory long stringy clusters.
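A minimal sketch of the top-level single-linkage split, using Prim's algorithm to build the minimum spanning tree and then cutting its heaviest edge. The one-dimensional points and absolute-difference metric in the example are assumptions for illustration:

```python
def single_linkage_split(points, dist):
    """Top-level single-linkage clustering: build an MST (Prim's
    algorithm), remove its heaviest edge, and return the two components."""
    n = len(points)
    in_tree = {0}
    edges = []
    best = {i: (dist(points[0], points[i]), 0) for i in range(1, n)}
    while len(in_tree) < n:
        j = min(best, key=lambda i: best[i][0])
        w, parent = best.pop(j)
        edges.append((w, parent, j))
        in_tree.add(j)
        for i in best:
            d = dist(points[j], points[i])
            if d < best[i][0]:
                best[i] = (d, j)
    # Removing the heaviest MST edge yields the two top-level clusters.
    heaviest = max(edges)
    edges.remove(heaviest)
    # Connected components of the remaining edges (union-find).
    comp = list(range(n))
    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x
    for _, u, v in edges:
        comp[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

clusters = single_linkage_split([0.0, 0.1, 0.2, 10.0, 10.1],
                                lambda a, b: abs(a - b))
```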

Another common affinity function is the average distance between points in the two clusters (average-linkage clustering; taking the maximum distance instead gives complete-linkage clustering). For distance matrix input, this can be computed by replacing the two merged rows of the matrix by an appropriate weighted average. Alternatively, if one has a good notion of single point estimation for a cluster (e.g. the centroid), one can define the affinity in terms of the distance between cluster centers.
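The row-replacement step can be sketched as follows, assuming a size-weighted average (the UPGMA update rule); the merged cluster becomes the last row and column of the new matrix:

```python
def merge_clusters(D, sizes, a, b):
    """Merge clusters a and b in a distance matrix by replacing their
    rows/columns with the size-weighted average of the two rows."""
    merged = [(sizes[a] * D[a][k] + sizes[b] * D[b][k]) / (sizes[a] + sizes[b])
              for k in range(len(D))]
    keep = [k for k in range(len(D)) if k not in (a, b)]
    newD = [[D[i][j] for j in keep] + [merged[i]] for i in keep]
    newD.append([merged[j] for j in keep] + [0.0])
    new_sizes = [sizes[k] for k in keep] + [sizes[a] + sizes[b]]
    return newD, new_sizes

# Merging points 0 and 1 of a 3-point matrix leaves a 2x2 matrix whose
# off-diagonal entry is the average of the old distances 6 and 4.
newD, new_sizes = merge_clusters([[0, 2, 6], [2, 0, 4], [6, 4, 0]],
                                 [1, 1, 1], 0, 1)
```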

"Neighbor-joining" [SN87], is a more complicated affinity: It assumes that one has a tree in which each point is connected to a common center, and measures the amount by which the total edge length of the tree would be reduced by placing a "Steiner point" between two points, and connecting that Steiner point to the center. Formally, the affinity between x and y is

distance(x,center) + distance(y,center)
- distance(x,SP) - distance(y,SP) - distance(SP,center)

where SP is the Steiner point.
For distance matrix input, the center is found by averaging the rows of
the whole matrix, so the distance from x to the center is just the average
distance of x from any point. Similarly, the Steiner point can be found
by averaging the rows for x and y (in which case the two terms
distance(x,SP)+distance(y,SP) add to distance(x,y)).
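In code, the row-averaging computation of this affinity looks like the following sketch (note that the row mean includes the zero self-distance, matching "average distance of x from any point"):

```python
def nj_affinity(D):
    """Neighbor-joining affinity from a distance matrix: d(x,center) is
    the mean of row x, d(x,SP) + d(y,SP) collapses to D[x][y], and
    d(SP,center) is the mean of the averaged rows for x and y."""
    n = len(D)
    r = [sum(row) / n for row in D]          # distance to the "center"
    def affinity(x, y):
        sp_to_center = sum((D[x][k] + D[y][k]) / 2 for k in range(n)) / n
        return r[x] + r[y] - D[x][y] - sp_to_center
    return affinity

# Two "cherries" (0,1) and (2,3) in a tree metric: joining a cherry
# saves more edge length than joining across the tree.
D = [[0, 2, 5, 5],
     [2, 0, 5, 5],
     [5, 5, 0, 2],
     [5, 5, 2, 0]]
aff = nj_affinity(D)
assert aff(0, 1) > aff(0, 2)
```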
Atteson
[A96]
showed that this method converges to the correct tree
for distance matrices that are sufficiently close
(in terms of L_{infinity} distance) to the matrix of a tree metric.

Many of these methods have efficient algorithms, both
for distance matrix input
[E98],
and for points in low dimensional Euclidean spaces
[KL95].
However, Neighbor-Joining seems more difficult,
with the best known time bound being O(n^{3})
(and some commonly available implementations
taking even more than that).
Even the faster of these methods, though, are too slow for data
consisting of millions of points in moderate to high dimensional
Euclidean spaces.

The advantage of these methods is speed: they can scale to problem sizes beyond the reach of the bottom-up methods. However, the quality of the results may be poorer, because the most important decisions are made early in the algorithm, before it has accumulated enough information to make them well. Also, because of the emphasis on speed, the nonhierarchical clustering methods used tend to be primitive (e.g. splitting the points by an axis-parallel hyperplane).

This is extremely fast, and good for applications such as nearest neighbor classification, but not very good for accurately reconstructing the clusters present in the original data model.

The L_{1} and L_{2} problems are NP-complete
for both the tree metric and ultrametric
variants [D87].
However, the L_{infinity}-nearest ultrametric can
be found in time linear in the distance matrix
size [FKW95].

Agarwala et al.
[ABFNPT98]
show that it is NP-hard to approximate the L_{infinity}-nearest
tree metric within a factor of 9/8; however, they describe a fast
approximation algorithm that achieves a factor of three.

There has also been some work within the theoretical computer science community (motivated by approximation algorithms for various network design problems) on approximating a distance matrix by a tree metric with minimum "dilation" or "stretch" (maximum ratio of tree distance to original distance). The main result in this area is that one can define a random distribution on trees such that the expected dilation for any given pair of vertices is O(log n log log n) [CCGGP98]. This doesn't seem to say much about finding an individual tree with optimal or nearly-optimal dilation, but it does imply that one can find a tree with average dilation O(log n log log n).

Guénoche [G96] has proposed a way of measuring the nearness of a given distance matrix to a tree metric that depends only on the ordering among the distances, not on their exact values; in that sense it is "distance-free", much as the centerpoint and regression depth methods are for single point estimation and linear regression. He defines the "order distance" Delta(x,y) to be the number of pairs (u,v) for which x and y disagree on the ordering of their distances: dist(x,u) < dist(x,v) but dist(y,u) > dist(y,v) (with a pair counted only half in case of a tie). He then suggests applying an incremental method using Delta in place of the original distance.
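The order distance can be computed directly from the definition. In this sketch a tie on one side and a strict inequality on the other contributes one half, which is one reading of "counted only half in case of an equality":

```python
def order_distance(D, x, y):
    """Guenoche's order distance Delta(x, y): count the pairs (u, v) on
    which rows x and y of the distance matrix rank distances oppositely;
    a tie on exactly one side counts as half a disagreement."""
    n = len(D)
    delta = 0.0
    for u in range(n):
        for v in range(u + 1, n):
            sx = (D[x][u] > D[x][v]) - (D[x][u] < D[x][v])
            sy = (D[y][u] > D[y][v]) - (D[y][u] < D[y][v])
            if sx * sy == -1:      # strict disagreement
                delta += 1.0
            elif sx != sy:         # one strict, one tied
                delta += 0.5
    return delta

# Example matrix: two cherries (0,1) and (2,3).
D = [[0, 2, 5, 5], [2, 0, 5, 5], [5, 5, 0, 2], [5, 5, 2, 0]]
```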

David Eppstein,
Theory Group,
Dept. Information & Computer Science,
UC Irvine.
