Computational Statistics

For a nontrivial elaboration of this model, consider the problem of subpixel-resolution imaging: if you take one digital image of a scene, you're pretty much stuck with the pixels of the image itself (well, you can use Bayesian methods to get some subpixel resolution, but only if you already have a good idea what you're looking at). But if you take several images, you can imagine their pixelations as being different applications of some kind of noise to the real image, and average them to get a higher-resolution image.

A related problem is consensus sequence estimation in molecular biology. In this problem, the data consists of a long sequence of characters drawn from a small alphabet; the noise model forms observed DNA sequences from this prototype by sequences of mutations. Another DNA single-point estimation problem is reconstruction of a whole sequence from fragments, as used in gene mapping systems.

Anyway, those examples have a very high dimension. I'm more interested in very low dimension cases where the parameter values really are geometric coordinates.

In one dimension, there are basically two choices for an estimator: the mean and the median. The median is obviously much less sensitive to outliers. As far as I can tell, even for the most well-behaved error model (Gaussian noise), both the median and the mean have the same accuracy (distance of estimation from original value): O(s/sqrt(n)) where s is the variance of the Gaussian. Further the median has better invariance properties (is distance-free). But it is not hard to come up with noise models where the mean is the better choice of the two. For instance, if you know nothing about the noise other than an a priori assumption that lower variances are more likely, then the mean is the max likelihood estimator (it minimizes the variance of the differences between the observation and the estimate, as can easily seen by noting that the variance is a unimodular quadratic function of the estimate and by computing its derivative).

There are various ways of generalizing these ideas to higher dimensions:

Therefore the centroid is the max likelihood estimator for settings with unknown noise and an a priori assumption that lower variance is more likely. It also inherits the accuracy of the one-dimensional mean for a Gaussian noise model (because Gaussian noise has the nice property that it can be generated by treating each coordinate independently).

There is nothing interesting to say about algorithms for computing centroids.

The uniqueness of
any L_{p} estimate for p>1
follows since the pth power of distance from each observation is a
strictly convex function,
the sum of strictly convex functions is convex, and a convex function has a
unique minimum. Unfortunately this argument doesn't quite work for p=1:
if all the points are colinear, and there is an even number of points,
then any point on a line segment between the two middle points
has an equal sum of distances. But this is the only exception.

I'm not sure of a noise model for which this is optimal, but since it uses a lower power of distance it's less sensitive than the centroid to outliers. It's also commonly used in management science for facility location e.g. of a central store location minimizing the travel distance to each of a population of customers.

There is no good combinatorial algorithm for the Fermat-Weber point, since its exact location is the solution to a high-degree polynomial in the observation's coordinates [B88,CM69]. However, gradient descent methods converge very rapidly and give good numerical approximations to its location [W37].

This is extremely sensitive to outliers and usually does not provide a good estimator. But one can imagine conditions under which it would be the appropriate choice; e.g. if we know nothing about the noise other than that each noisy observation is within some (unknown) bounded radius of the true estimate, then the circumcenter is the most accurate estimator (its maximum distance from the true data value is at most half the noise radius).

There exist several linear time algorithms for computing the circumcenter
of a collection of data points, based on linear-programming type techniques
(the problem is not itself a linear program, but can be solved with
much the same algorithms [G95]).
The current best time bounds have the form
O(d^{2} n + f(d) log n) where f is some
subexponential function of the dimension and does not affect the main
term of the running time.
Bernd Gärtner
has a good and well-documented C++ implementation available
[G99]
which can solve problems with hundreds of thousands of points,
up to dimension 20 or 30.

This provides a robust distance-free method of estimation: if at most n/(d+1)-1 of the observations are outliers chosen by an adversary to fool our algorithm, then its estimate must be within the convex hull of the remaining points and is therefore not strongly affected by those outliers.

Center points also have applications within computational geometry, e.g. in finding graph separators, which have further applications in mesh partitioning, intersection graph construction, etc [EMT95].

For points in the plane, a centerpoint can be computed in linear
time, and for points in three dimensions the time is O(n polylog(n))
[Citations to be filled in later].
For higher dimensions, the best known time bound is much slower:
O(n^{d+1}). However, there exists a fast approximation
algorithm [Citation to be filled in later].

(least median of squares)

A good question for discussion: what functional is most appropriate to optimize (for what noise models)?

The version in which one minimizes the circumradius
is also called the *least median of squares*
because it minimizes the median (squared) distance of any observation
from the estimate.

Various algorithms are known for these problems. Most such work has gone into variants where the cardinality of the selected subset is very small (in which case this is better viewed as a form of clustering than of robust estimation) [EE94] or very large (only tolerant of a small number of outliers).

It is often possible to show that, for problems like this,
the optimal solution consists of the k nearest neighbors of some
estimated center point. For instance, in the least median of squares
problem, the optimal subset consists of the nearest neighbors
to the optimal subset's circumcenter. Similarly, the k-element subset
minimizing the variance consists of the k points nearest
the optimal subset's median. Obviously, we don't know this center
point in advance, or we wouldn't need to use an estimation algorithm.
But, one can use this idea to limit the number of subsets to examine.
The "order-k Voronoi diagram" essentially encapsulates the family of
possible subsets: it is defined as a partition of space into cells,
such that any two points in the same cell have the same set of k nearest
neighbors. In the plane, this diagram is known to have O(kn) cells,
and can be constructed in time
O(n log n + kn 2^{O(log* n)})
[R99].
The variance of any one cell can be updated from its neighbor in
constant time, so the minimum variance k-element subset can
be found in the same time bound -- nearly quadratic for the
most statistically relevant case, when k is roughly n/2.
By only computing these diagrams for certain subsets of points,
one can remove the n 2^{O(log* n)} factor
when k is a little smaller than n [EE94].

The same idea works for least median of squares, but is slower
because it's more difficult to update the circumradius as one moves
from cell to cell
[E92].
Instead, the fastest known algorithms
use parametric search (a complicated variant of binary search)
based on a decision algorithm which tests a given circle radius
by finding the largest subset of points that fit within that radius.
The decision problem can be solved by drawing a circle of the given
radius around each point, building the arrangement of these circles,
and computing how many circles surround each cell
(in constant time per cell by adding or subtracting one to the value
from a neighboring cell).
The best time bounds (again, for k roughly n/2, with some improvement
possible for smaller k)
are O(n^{2} log n) for d=2
and O(n^{d} log^{2}n) for d>2
[EE94].

If we assume a priori some centrally symmetric noise distribution, or if we assume that the noise in each coordinate is an independent Gaussian (in which case the overall distribution ends up being centrally symmetric), then Euclidean distance makes sense. Otherwise, it seems that we should choose the distance function from p to q to be determined by the probability that a data point actually at p could be perturbed so that we measure it at q. There is no reason to expect such a distance function to obey the axioms of a metric or in any other way be well behaved.

Perhaps the fact that they avoid having to choose a distance function is one of the things that makes distance-free methods such as the centerpoint attractive.

David Eppstein,
Theory Group,
Dept. Information & Computer Science,
UC Irvine.

Last update: