Computational Statistics

                    ______________
                   |              |
 parameters -----> |  DATA MODEL  |
                   |______________|
                          |
                          | data
                          V
                    _______________
                   |               |
                   |  NOISE MODEL  |
                   |_______________|
                          |
                          | observations
                          V
                    _____________
                   |             |
                   |  ALGORITHM  | ----> estimate
                   |_____________|

To some extent the division between the data model and the noise model is arbitrary (both describe physical processes that are parts of the real world); it comes from the choice of which parameters we want to estimate and which ones we want to treat as noise and ignore.

The data model sets the general flavor of problem being considered: single point estimation (data model: pass parameters unchanged as data coordinates), regression (data model: parameters are the coordinates of a linear function relating independent and dependent variables), clustering (data model: choose randomly among several different points or point distributions), or hierarchical clustering (e.g. evolutionary tree reconstruction, in which the parameters define an evolutionary tree and the data model uses this tree to define mutated gene sequences for each leaf of the tree).
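
As a concrete instance of the pipeline in the diagram, here is a small sketch (plain Python, with illustrative parameter values chosen by me) of the regression flavor: the parameters define a line (data model), Gaussian noise corrupts the data (noise model), and least squares produces the estimate (algorithm).

```python
import random

def data_model(a, b, xs):
    """Data model: parameters (a, b) define the line y = a*x + b."""
    return [a * x + b for x in xs]

def noise_model(ys, sigma=0.5, rng=random.Random(0)):
    """Noise model: add independent Gaussian noise to each data value."""
    return [y + rng.gauss(0, sigma) for y in ys]

def algorithm(xs, ys):
    """Algorithm: ordinary least squares estimate of (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

xs = [float(i) for i in range(20)]
observations = noise_model(data_model(2.0, 1.0, xs))
a_hat, b_hat = algorithm(xs, observations)
# a_hat, b_hat should land near the true parameters (2.0, 1.0)
```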

More complicated data models of interest to the AI students in the course are hidden Markov models for time series data, and belief propagation methods for decoding error correcting codes ("turbocodes"). The problem of devising a good error correcting code could be seen as choosing a data model in such a way that whatever noise you have doesn't destroy too much of your data.

The noise model determines the choice among several different estimation algorithms that return the same sort of output. E.g., if one is estimating a single point value, one might choose among least squares (for Gaussian noise), the centroid (for noise distributions with known zero expectation and unknown radial components), the circumcenter (for arbitrary bounded-radius noise distributions), or the minimum circumcenter of n/2 points (for robust outlier elimination).
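
In one dimension these choices reduce to familiar statistics (this mapping is my gloss, not part of the text): least squares gives the mean, the circumcenter becomes the midrange, and outlier-robust estimation suggests the median. A sketch with made-up numbers:

```python
def mean(xs):       # least squares estimate; optimal for Gaussian noise
    return sum(xs) / len(xs)

def midrange(xs):   # 1-D circumcenter: center of the smallest enclosing interval
    return (min(xs) + max(xs)) / 2

def median(xs):     # robust against a minority of outliers
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

# Bounded-radius noise: every observation within +/-1 of the true point 5.0
bounded = [4.1, 5.9, 4.4, 5.6, 4.0]
# The same data plus a single adversarially corrupted observation
corrupted = bounded + [1000.0]

print(midrange(bounded))                    # 4.95: tight under bounded noise
print(mean(corrupted), median(corrupted))   # mean is dragged away; median stays near 5
```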

**Consistency.** The estimates should converge (with enough data) to the original parameters: that is, there should exist a limit (as n, the number of observations, goes to infinity), and that limit should equal the parameters. If this is impossible due to the noise model, we would at least like the estimate to be near the parameters, e.g. by minimizing the worst case error or maximizing the likelihood.

**Invariance.** The estimates should not be changed by irrelevant rescaling or other such transformations of the data. In extreme cases (distance-free methods) the estimates may depend only on the combinatorics of the data points (e.g. are they above/below the estimated regression line) and not on metric or distance information.

**Efficiency.** The estimates should use only as much data as is required to solve the problem accurately. Another way of stating this is that the rate of convergence of the estimates should be high.

**Robustness.** The estimates should tolerate as broad a class of noise models as possible. A particularly important type of noise from the point of view of robustness is *outliers*: a noise model in which some fraction of the data may be arbitrarily corrupted (and should be treated as being set by an adversary which will try to make the estimates as inaccurate as possible). Robustness against outliers may be measured in terms of the number of outliers that can be tolerated before the estimate becomes arbitrarily inaccurate.

**Speed.** How quickly will an implementation of the algorithm run?
Although this is the most important characteristic from the theoretical
CS point of view, it is probably the least important from the
statistical point of view, and possibly the "greed for speed"
has led to overuse of fast estimation methods such as least squares.
One of the contributions algorithmic research can make in this area
is to convince users that methods other than least squares can
be efficient enough to be used when appropriate.
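
The outlier notion above can be made concrete with a toy sketch (the numbers are mine, purely illustrative): a single corrupted point makes the mean arbitrarily inaccurate, while the median withstands any minority of outliers before breaking down.

```python
def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

clean = [5.0] * 9   # nine honest observations of the true point 5.0

# An adversary replaces k observations with an arbitrarily large value.
# The median estimate survives exactly as long as outliers are a minority
# (k <= 4 out of 9), then becomes arbitrarily inaccurate.
estimates = [median(clean[:9 - k] + [1e9] * k) for k in range(10)]
```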
The simplest way of optimizing for speed without compromising statistical quality is to work on improved algorithms for already known statistical methods: that is, to find fast algorithms that produce exactly the same output as other slower algorithms already in use. One can also look for *approximation algorithms* that quickly find a result "almost as good as" the original slower algorithm, but one must be careful to preserve the other important statistical qualities of the original algorithm -- generally it's safest to approximate the *estimated values* themselves, rather than to approximate whatever objective function the estimation algorithm is optimizing.
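
This warning can be illustrated with a toy example (the data are mine): the median minimizes the sum of absolute deviations, and on a bimodal sample two far-apart estimates can attain the identical optimal objective value. An algorithm that merely approximates the objective could return either one, while approximating the estimated value itself would not permit this.

```python
def abs_dev(x, data):
    """Objective minimized by the median: sum of absolute deviations."""
    return sum(abs(x - d) for d in data)

data = [0.0] * 3 + [10.0] * 3   # bimodal sample of even size

# Both 0 and 10 (and every point between them) attain the same optimal
# objective value, yet as estimates they differ by 10.
print(abs_dev(0.0, data), abs_dev(10.0, data))   # 30.0 30.0
```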

David Eppstein,
Theory Group,
Dept. Information & Computer Science,
UC Irvine.
