This section provides a brief introduction to information theory,
along with the definitions and assumptions needed for a comprehensive
discussion and evaluation of data compression methods. The following string of
characters is used to illustrate the concepts defined:
`EXAMPLE` = `aa bbb cccc ddddd eeeeee fffffffgggggggg`.

The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords. These are not discussed, as they do not provide compression. The codes featured in this survey are of the block-variable, variable-variable, and variable-block types.

| source message | codeword |
| --- | --- |
| `a` | 000 |
| `b` | 001 |
| `c` | 010 |
| `d` | 011 |
| `e` | 100 |
| `f` | 101 |
| `g` | 110 |
| space | 111 |

Figure 1.1: A block-block code.

| source message | codeword |
| --- | --- |
| `aa` | 0 |
| `bbb` | 1 |
| `cccc` | 10 |
| `ddddd` | 11 |
| `eeeeee` | 100 |
| `fffffff` | 101 |
| `gggggggg` | 110 |
| space | 111 |

Figure 1.2: A variable-variable code.
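As a minimal illustration (a sketch added here, with the source messages of Figure 1.2 taken to be the maximal runs of identical characters in the ensemble), the two mappings can be applied to `EXAMPLE` directly:

```python
# A minimal sketch applying the codes of Figures 1.1 and 1.2 to EXAMPLE.
# Assumption: the source messages of the variable-variable code are the
# maximal runs of identical characters in the ensemble.
from itertools import groupby

EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg"

# Figure 1.1: block-block -- single characters onto 3-bit codewords.
block_block = {"a": "000", "b": "001", "c": "010", "d": "011",
               "e": "100", "f": "101", "g": "110", " ": "111"}

# Figure 1.2: variable-variable -- runs of characters onto variable-length codewords.
variable_variable = {"aa": "0", "bbb": "1", "cccc": "10", "ddddd": "11",
                     "eeeeee": "100", "fffffff": "101", "gggggggg": "110",
                     " ": "111"}

encoded_bb = "".join(block_block[ch] for ch in EXAMPLE)

# Parse the ensemble into maximal runs of identical characters; these are
# exactly the source messages of Figure 1.2.
runs = ["".join(group) for _, group in groupby(EXAMPLE)]
encoded_vv = "".join(variable_variable[run] for run in runs)

print(len(encoded_bb))  # 120 bits: 40 characters at 3 bits each
print(len(encoded_vv))  # 30 bits, although this code is not uniquely decodable
```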

When source
messages of variable length are allowed, the question
of how a message *ensemble* (sequence of messages) is parsed into
individual messages arises. Many of the algorithms
described here are *defined-word schemes*. That is, the set of
source messages is determined prior to the invocation of the
coding scheme.
For example, in text file processing each character may constitute
a message, or messages may be defined to consist of alphanumeric
and non-alphanumeric strings. In Pascal source code, each token
may represent a message. All codes involving fixed-length source
messages are, by default, defined-word codes.
In *free-parse* methods, the coding algorithm itself parses the ensemble
into variable-length sequences of symbols. Most of the known data
compression methods are defined-word schemes; the free-parse model
differs in a fundamental way from the classical coding paradigm.

A code is *distinct* if each codeword is distinguishable
from every other (i.e., the mapping from source messages to codewords is one-to-one).
A distinct code is *uniquely decodable* if every codeword is identifiable
when immersed in a sequence of codewords. Clearly, each of these features is
desirable. The codes of Figure 1.1 and Figure 1.2 are both distinct, but
the code of Figure 1.2 is not uniquely decodable. For example, the coded
message 11 could be decoded as either `ddddd` or `bbbbbb`.
A uniquely decodable code
is a *prefix code* (or *prefix-free code*) if it has the prefix property,
which requires that no codeword is a proper prefix
of any other codeword. All uniquely decodable block-block and
variable-block codes are
prefix codes. The code with codewords { 1, 100000, 00 } is an example
of a code which is uniquely decodable but which does not have the prefix
property. Prefix codes are *instantaneously
decodable*; that is, they have the desirable property that
the coded message can be parsed into codewords
without the need for lookahead. In order to decode a message encoded using
the codeword set { 1, 100000, 00 }, lookahead is required.
For example, the first codeword of the message 1000000001 is 1,
but this cannot be determined until the last (tenth) symbol of the message
is read (if the string of zeros had been of odd length, then the first
codeword would have been 100000).
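A minimal sketch of both properties follows, assuming nothing beyond the definitions above; the four-codeword prefix code used in the demonstration is hypothetical.

```python
# A minimal sketch of the prefix property and of instantaneous decoding:
# because no codeword is a proper prefix of another, the decoder can emit a
# source message the moment a codeword is matched, with no lookahead.

def is_prefix_code(codewords):
    """True if no codeword is a proper prefix of another codeword."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

def decode(bits, code):
    """Greedy left-to-right decoding; assumes `code` is a prefix code."""
    inverse = {w: m for m, w in code.items()}
    messages, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:            # a complete codeword has been read
            messages.append(inverse[buffer])
            buffer = ""                  # emit immediately -- no lookahead needed
    return messages

# Hypothetical prefix code used only for illustration.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
print(is_prefix_code(code.values()))          # True
print(decode("010110111", code))              # ['a', 'b', 'c', 'd']

# The uniquely decodable but non-prefix code discussed above fails the check.
print(is_prefix_code({"1", "100000", "00"}))  # False: 1 is a proper prefix of 100000
```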

A *minimal* prefix
code is a prefix code such that if `x` is a proper prefix of some codeword,
then `x sigma` is either a codeword or a proper prefix of a codeword,
for each letter `sigma` in the code alphabet `beta`. The set of codewords { 00, 01, 10 }
is an example of a prefix code which is not minimal. The fact that `1` is
a proper prefix of the codeword `10` requires that `11` be either a codeword
or a proper prefix of a codeword, and it is neither.
Intuitively, the minimality constraint prevents
the use of codewords which are longer than necessary. In the above example
the codeword `10` could be replaced by the codeword `1`, yielding a
minimal prefix code with shorter codewords. The codes discussed
in this paper are all minimal prefix codes.
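The minimality condition can be checked directly; the following sketch (an illustration only, over the binary code alphabet assumed throughout the paper) tests the two codes discussed above:

```python
# A minimal sketch of the minimality test: for every proper prefix x of some
# codeword and each code letter sigma in {0, 1}, the string x + sigma must be
# a codeword or a proper prefix of a codeword.

def is_minimal_prefix_code(codewords, alphabet="01"):
    codewords = set(codewords)
    proper_prefixes = {w[:i] for w in codewords for i in range(1, len(w))}
    allowed = codewords | proper_prefixes
    return all(x + sigma in allowed
               for x in proper_prefixes
               for sigma in alphabet)

print(is_minimal_prefix_code({"00", "01", "10"}))  # False: 11 is neither
print(is_minimal_prefix_code({"00", "01", "1"}))   # True: 10 replaced by 1
```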

In this section, a *code* has been defined to be a mapping from a
source alphabet to a code alphabet; we now define related terms.
The process of transforming a source ensemble into a coded message
is *coding* or *encoding*. The encoded message may be
referred to as an *encoding* of the source ensemble. The
algorithm which constructs the mapping and uses it to transform the
source ensemble is called the *encoder*. The *decoder*
performs the inverse operation, restoring the coded message to its
original form.

A code is *static* if the mapping from the set of source messages to the set of codewords is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message ensemble. A code is *dynamic* if the mapping changes as the ensemble is transmitted. Figure 1.3 gives a static Huffman code for the message `EXAMPLE`.

| source message | probability | codeword |
| --- | --- | --- |
| `a` | 2/40 | 1001 |
| `b` | 3/40 | 1000 |
| `c` | 4/40 | 011 |
| `d` | 5/40 | 010 |
| `e` | 6/40 | 111 |
| `f` | 7/40 | 110 |
| `g` | 8/40 | 00 |
| space | 5/40 | 101 |

Figure 1.3: A Huffman code for the message `EXAMPLE` (code length = 117).

Dynamic codes are also referred to in the literature as *adaptive*, since the mapping adapts to the ensemble as it is processed.

| source message | probability | codeword |
| --- | --- | --- |
| `a` | 2/6 | 10 |
| `b` | 3/6 | 0 |
| space | 1/6 | 11 |

Figure 1.4: A dynamic Huffman code table for the prefix `aa bbb` of message `EXAMPLE`.

All of the adaptive methods are *one-pass* methods; only
one scan of the ensemble is required.
Static Huffman coding requires two passes:
one pass to compute probabilities and determine the mapping, and a
second pass for transmission. Thus, as long as the encoding and decoding
times of an adaptive method are not substantially greater than those of
a static method, the fact that an initial scan is not needed implies
a speed improvement in the adaptive case. In addition, the mapping
determined in the first pass of a static coding scheme
must be transmitted by the encoder to the decoder. The mapping may
preface each transmission (that is, each file sent), or a single mapping
may be agreed upon and used for multiple transmissions.
In one-pass methods the encoder defines and redefines the mapping dynamically,
during transmission. The decoder must define and redefine the mapping in
sympathy, in essence "learning" the mapping as codewords are received.
Adaptive methods are discussed in
Sections 4 and 5.
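The difference in structure can be sketched as follows. The fragment below (an illustration, not one of the algorithms surveyed later) gathers statistics in a first pass and builds a Huffman code for the static case; for the adaptive case it naively rebuilds the code from running counts after every message, whereas the methods of Sections 4 and 5 update the mapping incrementally. The helper `build_huffman_code` is included only to make the sketch self-contained.

```python
# A schematic sketch contrasting two-pass static coding with a naive one-pass
# adaptive scheme.  Both sides of an adaptive transmission start from the same
# agreed-upon counts and update them identically, so the decoder "learns" the
# mapping as codewords are received.
import heapq
from collections import Counter

def build_huffman_code(frequencies):
    """Standard Huffman construction; returns a message -> codeword mapping."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def static_encode(ensemble):
    frequencies = Counter(ensemble)                  # pass 1: compute statistics
    code = build_huffman_code(frequencies)           # this mapping must also be sent
    return code, "".join(code[m] for m in ensemble)  # pass 2: transmit codewords

def adaptive_encode(ensemble, alphabet):
    frequencies = Counter({s: 1 for s in alphabet})  # agreed-upon initial counts
    output = []
    for m in ensemble:                               # single pass over the ensemble
        code = build_huffman_code(frequencies)
        output.append(code[m])
        frequencies[m] += 1                          # decoder makes the same update
    return "".join(output)

EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg"
mapping, body = static_encode(EXAMPLE)
print(len(body))                                     # 117 bits, as in Figure 1.3
print(len(adaptive_encode(EXAMPLE, set(EXAMPLE))))   # one pass, no prior statistics
```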

An algorithm may also be a *hybrid*, neither completely
static nor completely dynamic. In a simple hybrid scheme,
sender and receiver maintain identical *codebooks*
containing `k` static codes. For each transmission,
the sender must choose one of the `k` previously-agreed-upon codes
and inform the receiver of his choice (by transmitting first the
"name" or number of the chosen code).
Hybrid methods are discussed further in
Section 2 and Section 3.2.
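A minimal sketch of such a scheme follows; the selection rule (choose whichever shared code yields the shortest encoding) and the 8-bit code index are assumptions made only for this illustration.

```python
# An illustrative hybrid encoder: sender and receiver share a codebook of k
# static codes, and each transmission is prefixed by the number of the code
# chosen for it.  Each shared code is assumed to map every source message to
# a codeword.

def hybrid_encode(ensemble, codebook):
    bodies = ["".join(code[m] for m in ensemble) for code in codebook]
    chosen = min(range(len(codebook)), key=lambda i: len(bodies[i]))
    return format(chosen, "08b") + bodies[chosen]   # code "name" first, then data
```

The receiver reads the leading index, looks up the corresponding code in its own copy of the codebook, and then decodes the remainder exactly as for a single static code.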

Several common measures of compression have been
suggested: redundancy [Shannon and Weaver 1949], average message
length [Huffman 1952], and compression ratio [Rubin 1976;
Ruth and Kreutzer 1972]. These measures are defined below. Related to each of these measures
are assumptions about the characteristics of the source.
It is generally assumed in information theory that all statistical
parameters of a message source are known with perfect accuracy
[Gilbert 1971]. The most common model is that of a discrete
memoryless source: a source whose output is a sequence of letters
(or messages),
each letter being a selection from some fixed alphabet `a`(1), ..., `a`(`n`).
The letters are taken to be random, statistically independent
selections from the alphabet, the selection being made according
to some fixed probability assignment `p`(`a`(1)), ..., `p`(`a`(`n`)) [Gallager 1968].
Without loss of generality, the code alphabet is assumed
to be {0,1} throughout this paper. The modifications
necessary for larger code alphabets are straightforward.

It is assumed that any cost associated with the code letters is uniform. This is a reasonable assumption, although it omits applications like telegraphy where the code symbols are of different durations. The assumption is also important, since the problem of constructing optimal codes over unequal code letter costs is a significantly different and more difficult problem. Perl et al. and Varn have developed algorithms for minimum-redundancy prefix coding in the case of arbitrary symbol cost and equal codeword probability [Perl et al. 1975; Varn 1971]. The assumption of equal probabilities mitigates the difficulty presented by the variable symbol cost. For the more general unequal letter costs and unequal probabilities model, Karp has proposed an integer linear programming approach [Karp 1961]. There have been several approximation algorithms proposed for this more difficult problem [Krause 1962; Cot 1977; Mehlhorn 1980].

When data is compressed, the goal is to reduce redundancy,
leaving only the informational content. The measure of information
of a source message `x` (in bits) is -lg `p`(`x`)
[lg denotes the base 2 logarithm]. This definition
has intuitive appeal; in the case that `p`(`x`) = 1,
it is clear that `x` is not at all informative since it had to occur.
Similarly, the smaller the value of `p`(`x`),
the more unlikely `x`
is to appear, hence the larger its information content. The reader
is referred to Abramson for a longer, more elegant
discussion of the legitimacy of this technical definition of the
concept of information [Abramson 1963, pp. 6-13].
The average information content over
the source alphabet can be computed by weighting the information content
of each source letter by its probability of occurrence, yielding the
expression SUM{i=1 to n} [-`p`(`a`(`i`)) lg `p`(`a`(`i`))]. This quantity is
referred to as the *entropy* of a source letter, or the entropy of the
source, and is denoted by `H`.
Since the length of a codeword for message `a`(`i`)
must be sufficient to carry the information content of `a`(`i`),
entropy imposes a lower bound on the number of bits required for the
coded message. The total number of bits must be at least as large as
the product of `H` and the length of the source ensemble.
Since the value of `H` is generally not an integer, variable-length
codewords must be used if the lower bound is to be achieved.
Given that message `EXAMPLE` is to be encoded one letter at a time,
the entropy of its source can be calculated using the probabilities
given in Figure 1.3:
`H` = 2.894, so that the minimum number of bits contained
in an encoding of `EXAMPLE` is 116.
The Huffman code given in Section 1.2 does not quite
achieve the theoretical minimum in this case.
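These figures are easy to verify; the following sketch recomputes `H` and the lower bound from the probabilities of Figure 1.3:

```python
# A small verification sketch: the entropy of the EXAMPLE source and the
# resulting lower bound on the length of its encoding.
from math import log2

probabilities = {"a": 2/40, "b": 3/40, "c": 4/40, "d": 5/40,
                 "e": 6/40, "f": 7/40, "g": 8/40, " ": 5/40}

H = sum(-p * log2(p) for p in probabilities.values())
print(round(H, 3))   # 2.894 bits per source letter
print(H * 40)        # about 115.7, so an encoding of EXAMPLE needs at least 116 bits
```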

Both of these definitions of information content are due to Shannon. A derivation of the concept of entropy as it relates to information theory is presented by Shannon [Shannon and Weaver 1949]. A simpler, more intuitive explanation of entropy is offered by Ash [Ash 1965].

The most common notion of a "good" code is one which
is *optimal* in the sense of having minimum redundancy. *Redundancy*
can be defined as: SUM `p`(`a`(`i`)) `l`(`i`)
- SUM [-`p`(`a`(`i`)) lg `p`(`a`(`i`))] where `l`(`i`) is
the length of the codeword representing message `a`(`i`). The expression
SUM `p`(`a`(`i`)) `l`(`i`) represents the lengths of the codewords weighted
by their probabilities of occurrence, that is, the average codeword length.
The expression SUM [-`p`(`a`(`i`)) lg `p`(`a`(`i`))] is entropy, `H`. Thus,
redundancy is a measure of the difference between average
codeword length and average information content. If a code has
minimum average codeword length for a given discrete probability distribution,
it is said to be a minimum redundancy code.
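For the Huffman code of Figure 1.3, the redundancy can be computed directly (a verification sketch only):

```python
# Average codeword length and redundancy of the Huffman code of Figure 1.3.
from math import log2

code = {"a": ("1001", 2/40), "b": ("1000", 3/40), "c": ("011", 4/40),
        "d": ("010", 5/40), "e": ("111", 6/40), "f": ("110", 7/40),
        "g": ("00", 8/40), " ": ("101", 5/40)}

average_length = sum(p * len(w) for w, p in code.values())
entropy = sum(-p * log2(p) for _, p in code.values())
print(average_length)                      # 2.925 bits per message (117 / 40)
print(round(average_length - entropy, 3))  # redundancy of about 0.031 bits per message
```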

We define the term *local redundancy* to capture the notion
of redundancy caused by local properties of a message ensemble,
rather than its global characteristics. While the model used for
analyzing general-purpose coding techniques assumes a random distribution
of the source messages, this may not actually be the case. In particular
applications the tendency for messages to cluster in predictable patterns
may be known. The existence of predictable patterns may be exploited
to minimize local redundancy. Examples of applications in which local
redundancy is common, and methods for dealing with local redundancy,
are discussed in
Section 2 and Section 6.2.

Huffman uses *average message length*, SUM `p`(`a`(`i`)) `l`(`i`), as
a measure of the efficiency of a code. Clearly the meaning of
this term is the average length of a *coded* message.
We will use the term *average codeword length* to represent
this quantity. Since redundancy is defined to be average codeword
length minus entropy and entropy is constant
for a given probability distribution, minimizing average codeword
length minimizes redundancy.

A code is *asymptotically optimal* if it has the property
that for a given probability distribution, the ratio of average
codeword length to entropy approaches 1 as entropy tends to infinity.
That is, asymptotic optimality guarantees that average codeword length
approaches the theoretical minimum (entropy represents information content,
which imposes a lower bound on codeword length).

The amount of compression yielded by a coding scheme can be
measured by a *compression ratio*. The term compression ratio
has been defined in several ways. The definition
`C` = (average message length)/(average codeword length)
captures the common meaning, which is a comparison of the length of the coded
message to the length of the original ensemble [Cappellini 1985].
If we think of the characters of the ensemble `EXAMPLE` as 6-bit ASCII
characters, then the average message length is 6 bits. The Huffman
code of
Section 1.2 represents `EXAMPLE` in 117 bits,
or 2.9 bits per character. This yields a compression ratio of 6/2.9,
representing compression by a factor of more than 2. Alternatively,
we may say that Huffman encoding produces a file whose size is
49% of the original ASCII file, or that 49% compression has been achieved.
A somewhat different definition of compression ratio, by Rubin,
`C` = (`S` - `O` - `OR`)/`S`, includes the representation
of the code itself in the transmission cost [Rubin 1976]. In this
definition `S` represents the length of the source ensemble, `O` the
length of the output (coded message), and `OR` the size of the "output
representation" (eg., the number of bits required for the encoder to
transmit the code mapping to the decoder). The quantity `OR`
constitutes a "charge" to an algorithm for transmission of information
about the coding scheme. The intention is to measure the total size
of the transmission (or file to be stored).
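The two definitions can be applied to `EXAMPLE` as follows; the value used for `OR` is hypothetical, since the cost of transmitting the code mapping is not given for this example.

```python
# Compression ratios for EXAMPLE under the Huffman code of Figure 1.3.
average_message_length = 6           # bits per character as 6-bit ASCII
average_codeword_length = 117 / 40   # bits per character under the Huffman code

C = average_message_length / average_codeword_length
print(round(C, 2))                                                     # about 2.05
print(round(100 * average_codeword_length / average_message_length))  # 49 (percent of original size)

S = 40 * 6    # length of the source ensemble, in bits
O = 117       # length of the coded message, in bits
OR = 64       # hypothetical cost of transmitting the code mapping
print(round((S - O - OR) / S, 2))                                      # Rubin's ratio, about 0.25 here
```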

In the area of data transmission, Huffman coding has been
passed over for years in favor of block-block codes, notably ASCII.
The advantage of Huffman coding is in the average number of bits per character
transmitted, which may be much smaller than the lg `n` bits per
character (where `n` is the source alphabet size) of a block-block
system. The primary difficulty associated with variable-length
codewords is that the rate at which bits are presented to the
transmission channel will fluctuate, depending on the relative
frequencies of the source messages. This requires buffering between
the source and the channel. Advances in technology have both overcome
this difficulty and contributed to the appeal of variable-length
codes. Current data networks allocate communication resources to
sources on the basis of need and provide buffering as part of the
system. These systems require significant amounts of protocol, and
fixed-length codes are quite inefficient for applications such as
packet headers. In addition, communication costs are beginning to
dominate storage and processing costs, so that variable-length coding
schemes which reduce communication costs are attractive even if they
are more complex. For these reasons, one could expect to see even
greater use of variable-length coding in the future.

It is interesting to note that the Huffman coding algorithm, originally developed for the efficient transmission of data, also has a wide variety of applications outside the sphere of data compression. These include construction of optimal search trees [Zimmerman 1959; Hu and Tucker 1971; Itai 1976], list merging [Brent and Kung 1978], and generating optimal evaluation trees in the compilation of expressions [Parker 1980]. Additional applications involve search for jumps in a monotone function of a single variable, sources of pollution along a river, and leaks in a pipeline [Glassey and Karp 1976]. The fact that this elegant combinatorial algorithm has influenced so many diverse areas underscores its importance.