To begin with, I'm not a statistician, so take anything I say about
statistics (as opposed to statistical algorithms) with a big grain of salt.
But let's begin by talking about statistics in general.
The Big Picture
I view statistics as a sequence of transformations of numbers,
starting from parameters that define a model for the generation of data,
through a model for noise (describing how the data may be corrupted by
random events, transmission errors, or measurement errors),
through an algorithm that makes estimates from the observed data.
The hope is that the estimates in some sense match the original parameters.
parameters ----> | DATA MODEL |
| NOISE MODEL |
| ALGORITHM | ----> estimate
To some extent the division between the data model and noise model is
arbitrary (both are describing physical processes that are parts of the
real world) and comes from the division between which parameters we want
to estimate and which other ones we want to treat as noise and ignore.
Function Follows Form
The nature of the statistical algorithm to be used is determined
by the data and noise models.
The data model sets the general
flavor of problem being considered:
single point estimation
(data model: pass parameters unchanged as data coordinates),
regression (data model: parameters are the coordinates of a linear
function relating independent and dependent variables), clustering
(data model: choose randomly among several different points or point
distributions), or hierarchical clustering
(e.g. evolutionary tree reconstruction, in which the
parameters define an evolutionary tree and the data model
uses this tree to define mutated gene sequences for each leaf of the tree).
More complicated data models of interest to the AI students in the course
are hidden Markov models for time series data, and belief propagation
methods for decoding error correcting codes ("turbocodes").
The problem of devising a good error correcting code could be seen
as choosing a data model in such a way that whatever noise you
have doesn't destroy too much of your data.
The noise model determines which to choose among several different
estimation algorithms returning the same sort of output.
E.g., if one is estimating a single point value,
one might choose among least squares (Gaussian noise),
the centroid (for noise distributions with known zero expectation and unknown
radial components), the circumcenter
(for arbitrary bounded-radius noise distributions),
or the minimum circumcenter of n/2 points (for robust outlier elimination).
An estimation algorithm should have the following properties:
- The estimates should converge (with enough data) to the original
parameters: that is, there should exist a limit
(as n, the number of observations, goes to infinity),
and that limit should equal the parameters.
If this is impossible due to the noise model,
we would at least like the estimate to be near the parameters,
e.g. by minimizing the worst case error or maximizing the likelihood.
- The estimates should not be changed by irrelevant rescaling or other
such transformations of the data. In extreme cases (distance-free methods)
the estimates may depend only on the combinatorics of the data points
(e.g. are they above/below the estimated regression line)
and not on metric or distance information.
- The estimates should use only as much data as is required to solve
the problem accurately. Another way of stating this is that the rate of
convergence of the estimates should be high.
- The estimates should tolerate as broad a class of noise models as possible.
A particularly important type of noise from the point of view of
robustness is outliers: a noise model in which some fraction
of the data may be arbitrarily corrupted (and should be treated as
being set by an adversary which will try to make the estimates as
inaccurate as possible). Robustness against outliers may be measured
in terms of the number of outliers that can be tolerated before the
estimate becomes arbitrarily inaccurate.
- How quickly will an implementation of the algorithm run?
Although this is the most important characteristic from the theoretical
CS point of view, it is probably the least important from the
statistical point of view, and possibly the "greed for speed"
has led to overuse of fast estimation methods such as least squares.
One of the contributions algorithmic research can make in this area
is to convince users that methods other than least squares can
be efficient enough to be used when appropriate.
The simplest way of optimizing for speed
without compromising statistical quality is to work on improved
algorithms for already known statistical methods: that is, to find fast
algorithms that produce exactly the same output as other slower
algorithms already in use. One can also look for approximation
algorithms that quickly find a result "almost as good as"
the original slower algorithm, but one must be careful
to preserve the other important statistical qualities of the original
algorithm -- generally it's safest to approximate the estimated values
themselves, rather than to approximate whatever objective function the
estimation algorithm is optimizing.
Dept. Information & Computer Science,