The classic defined-word scheme was developed over 30 years ago in Huffman's well-known paper on minimum-redundancy coding [Huffman 1952]. Huffman's algorithm provided the first solution to the problem of constructing minimum-redundancy codes. Many people believe that Huffman coding cannot be improved upon, that is, that it is guaranteed to achieve the best possible compression ratio. This is only true, however, under the constraints that each source message is mapped to a unique codeword and that the compressed text is the concatenation of the codewords for the source messages. An earlier algorithm, due independently to Shannon and Fano [Shannon and Weaver 1949; Fano 1949], is not guaranteed to provide optimal codes, but approaches optimal behavior as the number of messages approaches infinity. The Huffman algorithm is also of importance because it has provided a foundation upon which other data compression techniques have built and a benchmark to which they may be compared. We classify the codes generated by the Huffman and Shannon-Fano algorithms as variable-variable and note that they include block-variable codes as a special case, depending upon how the source messages are defined.

In Section 3.3 codes which map the integers onto binary codewords are discussed. Since any finite alphabet may be enumerated, this type of code has general-purpose utility. However, a more common use of these codes (called universal codes) is in conjunction with an adaptive scheme. This connection is discussed in Section 5.2.

Arithmetic coding, presented in Section 3.4, takes a significantly different approach to data compression from that of the other static methods. It does not construct a code, in the sense of a mapping from source messages to codewords. Instead, arithmetic coding replaces the source ensemble by a code string which, unlike all of the other codes discussed here, is not the concatenation of codewords corresponding to individual source messages. Arithmetic coding is capable of achieving compression results which are arbitrarily close to the entropy of the source.

Figure 3.1 shows the application of the method to a particularly simple probability distribution. The length of each codeword is equal to -lg `p`, where `p` is the probability of the corresponding source message; this holds whenever each split divides the list into subgroups of exactly equal probability.

     a    1/2     0
     b    1/4     10
     c    1/8     110
     d    1/16    1110
     e    1/32    11110
     f    1/32    11111

     Figure 3.1 -- A Shannon-Fano Code.

     g        8/40    00
     f        7/40    010
     e        6/40    011
     d        5/40    100
     space    5/40    101
     c        4/40    110
     b        3/40    1110
     a        2/40    1111

     Figure 3.2 -- A Shannon-Fano Code for EXAMPLE (code length = 117).

Figure 3.3 -- The Huffman process: (a) The list; (b) the tree.

The Huffman algorithm determines the lengths of the codewords
to be mapped to each of the source letters `a`(`i`). There are many
alternatives for specifying the actual digits; it is necessary only
that the code have the prefix property. The usual assignment
entails labeling the edge from each parent to its left child with
the digit 0 and the edge to the right child with 1. The codeword for
each source letter is the sequence of labels along the path from
the root to the leaf node representing that letter.
The codewords for the source of Figure 3.3,
in order of decreasing probability, are
{01,11,001,100,101,0000,0001}.
Clearly, this
process yields a minimal prefix code. Further, the algorithm is
guaranteed to produce an *optimal* (minimum redundancy) code [Huffman 1952]. Gallager
has proved an upper bound on the redundancy of a Huffman code of
`p`(`n`) + lg [(2 lg `e`)/`e`] which is approximately `p`(`n`) + 0.086,
where `p`(`n`) is the probability of the least likely source message
[Gallager 1978]. In a recent paper, Capocelli et al. provide new
bounds which are tighter than those of Gallager for some probability
distributions [Capocelli et al. 1986].
Figure 3.4 shows a distribution for which the
Huffman code is optimal while the Shannon-Fano code is not.
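
As a concrete illustration of the process just described, the following sketch (not part of the survey; Python is used here purely for exposition) repeatedly merges the two smallest weights and then labels left edges 0 and right edges 1. Because ties among equal weights are broken arbitrarily, the individual codewords may differ from those listed above, but the total weighted code length will not.

```python
import heapq

def huffman_code(weights):
    """Build a Huffman code for a dict {letter: weight}; left edges are
    labeled 0 and right edges 1, as described in the text."""
    heap = [(w, i, letter) for i, (letter, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)          # the two smallest weights ...
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (left, right)))  # ... become siblings
        next_id += 1
    _, _, tree = heap[0]

    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):                # internal node: (left child, right child)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                      # leaf: a source letter
            code[node] = prefix or "0"
    walk(tree, "")
    return code

# The ensemble EXAMPLE of Figure 3.2 (frequencies out of 40):
freqs = {"g": 8, "f": 7, "e": 6, "d": 5, "space": 5, "c": 4, "b": 3, "a": 2}
code = huffman_code(freqs)
print(sum(freqs[x] * len(code[x]) for x in freqs))   # 117 bits, as in Figure 1.3
```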

In addition to the fact that there are
many ways of forming codewords of appropriate lengths, there are
cases in which the Huffman algorithm does not uniquely determine
these lengths due to the arbitrary choice among equal
minimum weights.
As an example, codes with codeword lengths of {1,2,3,4,4}
and of {2,2,2,3,3} both yield the same average codeword length
for a source with probabilities {.4,.2,.2,.1,.1}.
Schwartz defines a variation of the Huffman algorithm which
performs "bottom merging"; that is, orders a new parent node
above existing nodes of the same weight and always merges the last
two weights in the list. The code constructed is the Huffman code
with minimum values of maximum codeword length (MAX{ `l`(`i`) }) and
total codeword length (SUM{ `l`(`i`) }) [Schwartz 1964]. Schwartz
and Kallick describe an implementation of Huffman's algorithm with
bottom merging [Schwartz and Kallick 1964]. The Schwartz-Kallick
algorithm and a later algorithm by Connell [Connell 1973]
use Huffman's procedure to determine the lengths of the codewords,
and actual digits are assigned so that the code has the *numerical
sequence property*. That is, codewords of equal length form a
consecutive sequence of binary numbers.
Shannon-Fano codes also have the numerical
sequence property. This property can be exploited to achieve a compact representation
of the code and rapid encoding and decoding.
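
The way the numerical sequence property can be exploited is easy to sketch: given only the codeword lengths produced by Huffman's procedure, the codewords of each length are assigned as consecutive binary numbers. The routine below illustrates that idea; it is not a reconstruction of the Schwartz-Kallick or Connell algorithms, and the lengths passed in are simply one valid Huffman assignment for the ensemble EXAMPLE.

```python
def sequence_code(lengths):
    """Assign codewords having the numerical sequence property, given the
    codeword length for each source letter (e.g., from Huffman's procedure)."""
    # Group letters so that codewords of equal length come out consecutively.
    items = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    code, value, prev_len = {}, 0, 0
    for letter, length in items:
        value <<= (length - prev_len)   # a longer codeword starts by appending zeros
        code[letter] = format(value, "0{}b".format(length))
        value += 1                      # the next codeword of this length is the next binary number
        prev_len = length
    return code

# One valid set of Huffman codeword lengths for EXAMPLE (total 117 bits):
print(sequence_code({"g": 2, "f": 3, "e": 3, "d": 3, "space": 3, "c": 3, "a": 4, "b": 4}))
```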

                                  S-F    Huffman
     a(1)     0.35                 00          1
     a(2)     0.17                 01        011
     a(3)     0.17                 10        010
     a(4)     0.16                110        001
     a(5)     0.15                111        000
     Average codeword length     2.31       2.30

     Figure 3.4 -- Comparison of Shannon-Fano and Huffman Codes.

Both the Huffman and the Shannon-Fano mappings can be generated in O(`n`) time, where `n` is the number of messages in the source ensemble (assuming that the weights have been presorted).

As noted earlier, the
redundancy bound for Shannon-Fano codes is 1 and the bound for
the Huffman method is `p`(`n`) + 0.086, where `p`(`n`) is the probability of the
least likely source message (so `p`(`n`) is less than or equal to .5,
and generally much less). It is important to note that in defining
redundancy to be average codeword length minus
entropy, the cost of transmitting the code mapping computed by
these algorithms is ignored. The overhead cost for any method
where the source alphabet has not been established prior to
transmission includes `n` lg `n` bits for sending the `n` source
letters. For a Shannon-Fano code, a list of codewords
ordered so as to correspond to the source letters could be
transmitted. The additional cost is then SUM{ `l`(`i`) } bits,
where the `l`(`i`) are the lengths of
the codewords. For Huffman coding, an encoding of the shape of
the code tree might be transmitted. Since any full binary tree may
be a legal Huffman
code tree, encoding tree shape may require as many as lg 4^`n` = 2`n` bits.
In most cases the message ensemble is very large, so that the
number of bits of overhead is minute by comparison to the total
length of the encoded transmission. However, it is imprudent to ignore
this cost.
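
One way the tree shape might be encoded in about 2`n` bits (a sketch of a common convention, not a method prescribed by the survey) is a preorder traversal that writes 1 for each internal node and 0 for each leaf; a full binary tree with `n` leaves produces 2`n` - 1 bits, after which the `n` source letters follow in leaf order.

```python
def encode_shape(node):
    """Preorder shape encoding of a full binary tree:
    '1' marks an internal node (a pair), '0' marks a leaf."""
    if isinstance(node, tuple):
        return "1" + encode_shape(node[0]) + encode_shape(node[1])
    return "0"

def decode_shape(bits, letters):
    """Rebuild the tree from its shape bits and its leaves in leaf order."""
    bits, letters = iter(bits), iter(letters)
    def build():
        if next(bits) == "1":
            left = build()
            return (left, build())
        return next(letters)
    return build()

tree = ((("a", "b"), "c"), "d")       # a hypothetical 4-leaf code tree
shape = encode_shape(tree)            # "1110000": 2*4 - 1 = 7 bits
assert decode_shape(shape, "abcd") == tree
```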

If a less-than-optimal code is acceptable, the overhead costs
can be avoided through a prior agreement by sender and receiver as to
the code mapping. Rather than using a Huffman code based upon the
characteristics of the current message ensemble, the code
used could be based on statistics for a class of
transmissions to which the current ensemble is assumed
to belong.
That is, both sender and receiver could have access to
a *codebook* with `k` mappings in it; one for Pascal source, one for
English text, etc. The sender would then simply alert the receiver
as to which of the common codes he is using. This requires only
lg `k` bits of overhead. Assuming that classes of transmission with
relatively stable characteristics could be identified, this hybrid approach
would greatly reduce the redundancy due to overhead without
significantly increasing expected codeword length.
In addition, the cost of computing the mapping would be amortized
over all files of a given class. That is, the mapping would be
computed once on a statistically significant sample and then
used on a great number of files for which the sample is
representative. There is clearly a substantial risk associated
with assumptions about file characteristics and great care would
be necessary in choosing both the sample from which the mapping
is to be derived and the categories into which to partition
transmissions. An extreme example of the risk associated
with the codebook approach is provided by author Ernest V.
Wright who wrote a novel *Gadsby* (1939) containing no
occurrences of the letter E. Since E is the most commonly used
letter in the English language, an encoding based upon a sample
from *Gadsby* would be disastrous if used with "normal"
examples of English text. Similarly, the "normal" encoding
would provide poor compression of *Gadsby*.

McIntyre and Pechura describe an experiment in which the codebook approach is compared to static Huffman coding [McIntyre and Pechura 1985]. The sample used for comparison is a collection of 530 source programs in four languages. The codebook contains a Pascal code tree, a FORTRAN code tree, a COBOL code tree, a PL/1 code tree, and an ALL code tree. The Pascal code tree is the result of applying the static Huffman algorithm to the combined character frequencies of all of the Pascal programs in the sample. The ALL code tree is based upon the combined character frequencies for all of the programs. The experiment involves encoding each of the programs using the five codes in the codebook and the static Huffman algorithm. The data reported for each of the 530 programs consists of the size of the coded program for each of the five predetermined codes, and the size of the coded program plus the size of the mapping (in table form) for the static Huffman method. In every case, the code tree for the language class to which the program belongs generates the most compact encoding. Although using the Huffman algorithm on the program itself yields an optimal mapping, the overhead cost is greater than the added redundancy incurred by the less-than-optimal code. In many cases, the ALL code tree also generates a more compact encoding than the static Huffman algorithm. In the worst case, an encoding constructed from the codebook is only 6.6% larger than that constructed by the Huffman algorithm. These results suggest that, for files of source code, the codebook approach may be appropriate.

Gilbert discusses the construction of Huffman codes based on inaccurate source probabilities [Gilbert 1971]. A simple safeguard against incomplete knowledge of the source is to avoid long codewords, thereby limiting the penalty paid when the probability of a message has been badly underestimated. The problem then becomes one of constructing the optimal binary tree subject to a height restriction (see [Knuth 1971; Hu and Tan 1972; Garey 1974]). Another approach involves collecting statistics for several sources and then constructing a code based upon some combined criterion. This approach could be applied to the problem of designing a single code for use with English, French, German, etc., sources. To accomplish this, Huffman's algorithm could be used to minimize either the average codeword length for the combined source probabilities, or the average codeword length for English subject to constraints on the average codeword lengths for the other sources.

An advantage of universal codes over Huffman codes is that it is not necessary to know the exact probabilities with which the source messages appear. While Huffman coding is not applicable unless the probabilities are known, it is sufficient in the case of universal coding to know the probability distribution only to the extent that the source messages can be ranked in probability order. By mapping messages in order of decreasing probability to codewords in order of increasing length, universality can be achieved. Another advantage to universal codes is that the codeword sets are fixed. It is not necessary to compute a codeword set based upon the statistics of an ensemble; any universal codeword set will suffice as long as the source messages are ranked. The encoding and decoding processes are thus simplified. While universal codes can be used instead of Huffman codes as general-purpose static schemes, the more common application is as an adjunct to a dynamic scheme. This type of application will be demonstrated in Section 5.

Since the ranking of source messages is the essential parameter in universal coding, we may think of a universal code as representing an enumeration of the source messages, or as representing the integers, which provide an enumeration. Elias defines a sequence of universal coding schemes which map the set of positive integers onto the set of binary codewords [Elias 1975].

The first Elias code is one which is simple but not optimal. This code, gamma, maps an integer `x` onto the binary representation of `x` preceded by floor(lg `x`) zeros; since the binary value begins with a 1, that 1 delimits the zero prefix and the code is instantaneously decodable. A second code, delta, maps `x` onto the gamma codeword for floor(lg `x`) + 1 followed by the binary representation of `x` with its leading 1 removed; delta is asymptotically optimal, since the length of delta(`x`) grows as lg `x` + O(lg lg `x`).

      N    gamma(N)       delta(N)
      1    1              1
      2    010            0100
      3    011            0101
      4    00100          01100
      5    00101          01101
      6    00110          01110
      7    00111          01111
      8    0001000        00100000
     16    000010000      001010000
     17    000010001      001010001
     32    00000100000    0011000000

     Figure 3.5 -- Elias Codes.
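
The entries of Figure 3.5 follow directly from these definitions; the short sketch below (not part of the survey) reproduces them.

```python
def gamma(n):
    """Elias gamma: the binary value of n preceded by floor(lg n) zeros."""
    b = format(n, "b")
    return "0" * (len(b) - 1) + b

def delta(n):
    """Elias delta: gamma(floor(lg n) + 1) followed by the binary value of n
    with its leading 1 removed."""
    b = format(n, "b")
    return gamma(len(b)) + b[1:]

for n in (1, 2, 3, 4, 8, 16, 17, 32):
    print(n, gamma(n), delta(n))      # matches Figure 3.5, e.g. 17 -> 000010001, 001010001
```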

     Source     Frequency   Rank   Codeword
     message
     g          8           1      delta(1) = 1
     f          7           2      delta(2) = 0100
     e          6           3      delta(3) = 0101
     d          5           4      delta(4) = 01100
     space      5           5      delta(5) = 01101
     c          4           6      delta(6) = 01110
     b          3           7      delta(7) = 01111
     a          2           8      delta(8) = 00100000

     Figure 3.6 -- An Elias Code for EXAMPLE (code length = 161).

A second sequence of universal coding schemes, based on the Fibonacci numbers, is defined by Apostolico and Fraenkel [Apostolico and Fraenkel 1985]. While the Fibonacci codes are not asymptotically optimal, they compare well to the Elias codes as long as the number of source messages is not too large. Fibonacci codes have the additional attribute of robustness, which manifests itself by the local containment of errors. This aspect of Fibonacci codes will be discussed further in Section 7.

The sequence of Fibonacci codes described by Apostolico and Fraenkel
is based on the Fibonacci numbers of order `m` >= 2, where the
Fibonacci numbers of order 2 are the standard Fibonacci numbers:
1, 1, 2, 3, 5, 8, 13, .... In general, the Fibonacci numbers
of order `m` are defined by the recurrence: Fibonacci numbers `F`(-`m`+1)
through `F`(0) are equal to 1; the `k`th number for `k` >= 1 is
the sum of the preceding `m` numbers. We describe only the order
2 Fibonacci code; the extension to higher orders is straightforward.

Every positive integer `N` has exactly one representation `R`(`N`) as a sum of distinct, nonconsecutive Fibonacci numbers (Zeckendorf's theorem). The Fibonacci codeword `F`(`N`) is the reverse of `R`(`N`) followed by an appended 1, so that every codeword ends in the pair 11, which delimits it. Figure 3.7 gives the representations and codewords for several values of `N`.

      N         R(N)        F(N)
      1            1        11
      2           10        011
      3          100        0011
      4          101        1011
      5         1000        00011
      6         1001        10011
      7         1010        01011
      8        10000        000011
     16       100100        0010011
     32      1010100        00101011
           (digit values: 21 13 8 5 3 2 1)

     Figure 3.7 -- Fibonacci Representations and Fibonacci Codes.
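
A sketch (not from the survey) of the order-2 code of Figure 3.7: find the Zeckendorf representation greedily, write it low-order digit first, and append the terminating 1.

```python
def fib_code(n):
    """Order-2 Fibonacci code: the Zeckendorf representation of n written
    least-significant digit first, followed by a terminating 1 (so every
    codeword ends in '11')."""
    fibs = [1, 2]                       # F(1)=1, F(2)=2, F(3)=3, F(4)=5, ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    digits = []
    for f in reversed(fibs[:-1]):       # greedy: take the largest Fibonacci number that fits
        if f <= n:
            digits.append("1")
            n -= f
        else:
            digits.append("0")
    # digits is R(N), most significant digit first; drop leading zeros, reverse, append 1
    r = "".join(digits).lstrip("0")
    return r[::-1] + "1"

for n in (1, 2, 3, 8, 16, 32):
    print(n, fib_code(n))               # e.g. 8 -> 000011, 32 -> 00101011 (Figure 3.7)
```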

Fraenkel and Klein prove that the Fibonacci code of order 2 is universal,
with `c1`=2 and `c2`=3 [Fraenkel and Klein 1985]. It is not asymptotically
optimal since `c1`>1. Fraenkel and Klein also show that Fibonacci
codes of higher order compress better than the order 2 code if the
source language is large enough (i.e., the number of distinct source messages
is large) and the probability distribution is nearly uniform.
However, no Fibonacci code is asymptotically optimal. The Elias
codeword `delta`(`N`) is asymptotically shorter than any Fibonacci
codeword for `N`, but the integers in a very large initial range have
shorter Fibonacci codewords. For `m`=2, for example, the
transition point is `N`=514,228 [Apostolico and Fraenkel 1985].
Thus, a Fibonacci code provides better compression
than the Elias code until the size of the source language becomes very
large. Figure 3.8 shows a Fibonacci code for string `EXAMPLE`.
The number of bits transmitted using this mapping would be 153, which
is an improvement over the Elias code of Figure 3.6 but still
compares poorly with the Huffman code of Figure 1.3.

     Source     Frequency   Rank   Codeword
     message
     g          8           1      F(1) = 11
     f          7           2      F(2) = 011
     e          6           3      F(3) = 0011
     d          5           4      F(4) = 1011
     space      5           5      F(5) = 00011
     c          4           6      F(6) = 10011
     b          3           7      F(7) = 01011
     a          2           8      F(8) = 000011

     Figure 3.8 -- A Fibonacci Code for EXAMPLE (code length = 153).

In arithmetic coding a source ensemble is represented by an interval between 0 and 1 on the real number line. Each symbol of the ensemble narrows this interval. As the interval becomes smaller, the number of bits needed to specify it grows. Arithmetic coding assumes an explicit probabilistic model of the source. It is a defined-word scheme which uses the probabilities of the source messages to successively narrow the interval used to represent the ensemble. A high probability message narrows the interval less than a low probability message, so that high probability messages contribute fewer bits to the coded ensemble. The method begins with an unordered list of source messages and their probabilities. The number line is partitioned into subintervals based on cumulative probabilities.

A small example will be used to illustrate the idea of arithmetic coding.
Given source messages {`A,B,C,D,#`} with probabilities
{.2, .4, .1, .2, .1}, Figure 3.9 demonstrates the initial partitioning
of the number line. The symbol `A` corresponds to the first 1/5 of
the interval [0,1); `B` the next 2/5; `D` the subinterval of size
1/5 which begins 70% of the way from the left endpoint to the right.
When encoding begins, the source ensemble is represented by the entire
interval [0,1). For the ensemble `AADB#`, the first `A` reduces the
interval to [0,.2) and the second `A` to [0,.04) (the first 1/5
of the previous interval). The `D` further narrows the interval to
[.028,.036) (1/5 of the previous size, beginning 70% of the distance
from left to right). The `B` narrows the interval to [.0296,.0328),
and the `#` yields a final interval of [.03248,.0328). The interval,
or alternatively any number `i` within the interval, may now be used to
represent the source ensemble.

     Source     Probability   Cumulative    Range
     message                  probability
     A          .2            .2            [0,.2)
     B          .4            .6            [.2,.6)
     C          .1            .7            [.6,.7)
     D          .2            .9            [.7,.9)
     #          .1            1.0           [.9,1.0)

     Figure 3.9 -- The Arithmetic coding model.

Two equations may be used to define the narrowing process described above:

     newleft = prevleft + msgleft * prevsize        (3.1)
     newsize = prevsize * msgsize                   (3.2)

The first equation states that the left endpoint of the new interval is calculated from the previous interval and the current source message: the left endpoint of the range associated with the current message specifies what fraction of the previous interval to remove from the left in order to form the new interval. The second equation states that the size of the new interval is the previous size scaled by the size of the current message's range. For the `D` in the example above, Equation 3.1 gives a new left endpoint of .028 (0 + .7 * .04) and Equation 3.2 gives a new size of .008 (.04 * .2), yielding the interval [.028,.036).
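
Equations 3.1 and 3.2 translate directly into a few lines of code. The sketch below (not from the survey) uses the model of Figure 3.9 and ordinary floating point, ignoring the precision and termination issues discussed later in this section.

```python
# Ranges from Figure 3.9: message -> (left endpoint of its range, size of its range)
model = {"A": (0.0, 0.2), "B": (0.2, 0.4), "C": (0.6, 0.1), "D": (0.7, 0.2), "#": (0.9, 0.1)}

def encode(ensemble, model):
    """Narrow [0,1) one message at a time, per Equations 3.1 and 3.2."""
    left, size = 0.0, 1.0
    for msg in ensemble:
        msg_left, msg_size = model[msg]
        left = left + msg_left * size       # Equation 3.1
        size = size * msg_size              # Equation 3.2
    return left, left + size                # the final interval [left, left + size)

print(encode("AADB#", model))               # approximately [.03248, .0328), as in the text
```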

The size of the final subinterval determines the number of bits needed to
specify a number in that range. The number of bits needed to specify a
subinterval of [0,1) of size `s` is -lg `s`. Since the size of the final
subinterval is the product of the probabilities of the source messages
in the ensemble (that is, `s`=PROD{`i`=1 to `N`} `p`(source message `i`) where
`N` is the length of the ensemble), we have
-lg `s` = -SUM{`i`=1 to `N`} lg `p`(source message `i`) =
-SUM{`i`=1 to `n`} `N` `p`(`a`(`i`)) lg `p`(`a`(`i`)), where `n` is the number of unique
source messages `a`(1), `a`(2), ... `a`(`n`) and `p`(`a`(`i`)) is the relative frequency of `a`(`i`) in the ensemble.
Thus, the number of bits generated
by the arithmetic coding technique is exactly equal to entropy, `H`.
This demonstrates the fact that arithmetic coding achieves compression
which is almost exactly that predicted by the entropy of the source.

In order to recover the original ensemble, the decoder must
know the model of the source used by the encoder (e.g., the source messages
and associated ranges) and a single number within the interval determined
by the encoder. Decoding consists of a series of comparisons of the number
`i` to the ranges representing the source messages. For this example,
`i` might be .0325 (.03248, .0326, or .0327 would all do just as well).
The decoder uses `i` to simulate the actions of the encoder. Since `i` lies
between 0 and .2, he deduces that the first letter was `A` (since the range
[0,.2) corresponds to source message `A`). This narrows the interval
to [0,.2). The decoder can now deduce that the next message will further
narrow the interval in one of the following ways: to [0,.04) for `A`,
to [.04,.12) for `B`, to [.12,.14) for `C`, to [.14,.18) for `D`,
and to [.18,.2) for `#`. Since `i` falls into the interval [0,.04),
he knows that the second message is again `A`. This process continues until
the entire ensemble has been recovered.
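
The comparisons described above can be sketched as follows (not part of the survey). Instead of narrowing an explicit interval, the sketch rescales `i` into each chosen subinterval, which amounts to the same sequence of decisions; floating point and the terminator `#` keep the illustration simple.

```python
# Ranges from Figure 3.9: message -> (left endpoint of its range, size of its range)
model = {"A": (0.0, 0.2), "B": (0.2, 0.4), "C": (0.6, 0.1), "D": (0.7, 0.2), "#": (0.9, 0.1)}

def decode(i, model):
    """Recover the ensemble from a number i in the final interval by
    simulating the encoder's narrowing, stopping at the terminator '#'."""
    out = ""
    while True:
        for msg, (msg_left, msg_size) in model.items():
            if msg_left <= i < msg_left + msg_size:   # which range does i fall in?
                out += msg
                i = (i - msg_left) / msg_size         # rescale i into the chosen subinterval
                break
        if out.endswith("#"):
            return out

print(decode(0.0325, model))                # AADB#
```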

Several difficulties become evident when implementation of arithmetic coding
is attempted. The first is that the decoder needs some way of knowing
when to stop. As evidence of this, the number 0 could represent any of
the source ensembles `A, AA, AAA,` etc. Two solutions to this problem
have been suggested. One is that the encoder transmit the size of the
ensemble as part of the description of the model. Another is that
a special symbol be included in the model for the purpose of signaling
end of message. The `#` in the above example serves this purpose.
The second alternative is preferable for several reasons. First, sending
the size of the ensemble requires a two-pass process and precludes the use
of arithmetic coding as part of a hybrid codebook scheme (see
Sections 1.2 and 3.2).
Secondly, adaptive methods of arithmetic coding are easily
developed and a first pass to determine ensemble size is inappropriate in
an on-line adaptive scheme.

A second issue left unresolved by the fundamental concept of arithmetic
coding is that of incremental transmission and reception. It appears
from the above discussion that the encoding algorithm transmits nothing
until the final interval is determined. However, this delay is not
necessary. As the interval narrows, the leading bits of the left
and right endpoints become the same. Any leading bits that are the same may be
transmitted immediately, as they will not be affected by further narrowing.
A third issue is that of precision. From the description of arithmetic
coding it appears that the precision required grows without bound as the
length of the ensemble grows. Witten et al. and Rubin address this issue
[Witten et al. 1987; Rubin 1979]. Fixed precision registers may be used
as long as underflow and overflow are detected and managed. The degree
of compression achieved by an implementation of arithmetic coding is
not exactly `H`, as implied by the concept of arithmetic coding. Both the use
of a message terminator and the use of fixed-length arithmetic reduce
coding effectiveness. However, it is clear that an end-of-message symbol
will not have a significant effect on a large source ensemble.
Witten et al. approximate the overhead due to the use of fixed
precision at 10^-4 bits per source message, which is also negligible.
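
A minimal sketch of the incremental idea (fixed-precision integer arithmetic and the underflow case in which the endpoints straddle 1/2 are omitted; see Witten et al. 1987 for a full treatment): whenever both endpoints of the current interval lie in the same half of [0,1), their common leading bit can be transmitted immediately and the interval doubled.

```python
def emit_settled_bits(low, high, output):
    """Transmit the leading bits shared by [low, high) and rescale.
    A sketch only: real implementations use fixed-precision integer
    arithmetic and also handle intervals straddling 1/2 (underflow)."""
    while True:
        if high <= 0.5:                      # both endpoints in [0, .5): next bit is 0
            output.append(0)
            low, high = 2 * low, 2 * high
        elif low >= 0.5:                     # both endpoints in [.5, 1): next bit is 1
            output.append(1)
            low, high = 2 * low - 1, 2 * high - 1
        else:
            return low, high                 # leading bit not yet determined

bits = []
low, high = emit_settled_bits(0.03248, 0.0328, bits)
print(bits)   # [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]: these bits cannot change as coding continues
```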

The arithmetic coding model for ensemble `EXAMPLE` is given in
Figure 3.10. The final interval size is
`p`(a)^2*`p`(b)^3*`p`(c)^4*`p`(d)^5*`p`(e)^6*`p`(f)^7*`p`(g)^8*`p`(`space`)^5.
The number of bits needed to specify a value in the interval
is -lg(1.44*10^-35)=115.7. So excluding overhead, arithmetic coding
transmits `EXAMPLE` in 116 bits, one less bit than static Huffman coding.
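
The interval size and bit count quoted above are easy to verify (a sketch, using the frequency counts that underlie the Figure 3.10 probabilities):

```python
from math import log2, prod

# Frequency counts for EXAMPLE; the probabilities of Figure 3.10 are these counts / 40.
freq = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7, "g": 8, "space": 5}
N = sum(freq.values())                           # 40 source messages in the ensemble

s = prod((f / N) ** f for f in freq.values())    # size of the final interval
print(s)                                         # about 1.44e-35
print(-log2(s))                                  # about 115.7 bits
```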

     Source     Probability   Cumulative    Range
     message                  probability
     a          .05           .05           [0,.05)
     b          .075          .125          [.05,.125)
     c          .1            .225          [.125,.225)
     d          .125          .35           [.225,.35)
     e          .15           .5            [.35,.5)
     f          .175          .675          [.5,.675)
     g          .2            .875          [.675,.875)
     space      .125          1.0           [.875,1.0)

     Figure 3.10 -- The Arithmetic coding model of EXAMPLE.

Witten et al. provide an implementation of arithmetic coding, written in C, which separates the model of the source from the coding process (where the coding process is defined by Equations 3.1 and 3.2) [Witten et al. 1987]. The model is in a separate program module and is consulted by the encoder and by the decoder at every step in the processing. The fact that the model can be separated so easily renders the classification static/adaptive irrelevant for this technique. Indeed, the fact that the coding method provides compression efficiency nearly equal to the entropy of the source under any model allows arithmetic coding to be coupled with any static or adaptive method for computing the probabilities (or frequencies) of the source messages. Witten et al. implement an adaptive model similar to the techniques described in Section 4. The performance of this implementation is discussed in Section 6.