INTRODUCTION

1. FUNDAMENTAL CONCEPTS

1.1 Definitions

1.2 Classification of Methods

1.3 A Data Compression Model

1.4 Motivation

2. SEMANTIC DEPENDENT METHODS

3. STATIC DEFINED-WORD SCHEMES

3.1 Shannon-Fano Coding

3.2 Static Huffman Coding

3.3 Universal Codes and Representations of the Integers

3.4 Arithmetic Coding

4. ADAPTIVE HUFFMAN CODING

4.1 Algorithm FGK

4.2 Algorithm V

5. OTHER ADAPTIVE METHODS

5.1 Lempel-Ziv Codes

5.2 Algorithm BSTW

6. EMPIRICAL RESULTS

7. SUSCEPTIBILITY TO ERROR

7.1 Static Codes

7.2 Adaptive Codes

8. NEW DIRECTIONS

9. SUMMARY

REFERENCES

Concepts from information theory, as they relate to the goals and evaluation of data compression methods, are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported and possibilities for future research are suggested.

A simple characterization of data compression is that it involves transforming a string of characters in some representation (such as ASCII) into a new string (of bits, for example) which contains the same information but whose length is as small as possible. Data compression has important application in the areas of data transmission and data storage. Many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. At the same time, the proliferation of computer communication networks is resulting in massive transfer of data over communication links. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium. It may then become feasible to store the data at a higher, thus faster, level of the storage hierarchy and reduce the load on the input/output channels of the computer system.

Many of the methods to be discussed in this paper are implemented
in production systems. The UNIX utilities *compact* and
*compress* are based on methods to be discussed in
Sections 4 and
5 respectively [UNIX 1984]. Popular file archival
systems such as *ARC* and *PKARC* employ techniques presented
in Sections 3
and 5 [ARC 1986; PKARC 1987]. The savings achieved
by data compression can be dramatic; reduction as high as 80%
is not uncommon [Reghbati 1981]. Typical values of compression provided
by *compact* are: text (38%), Pascal source (43%),
C source (36%) and binary (19%). *Compress* generally
achieves better compression (50-60% for text such as
source code and English), and takes less time to compute [UNIX 1984].
Arithmetic coding (Section 3.4)
has been reported to reduce
a file to anywhere from 12.1 to 73.5% of its original size
[Witten et al. 1987]. Cormack reports that data compression
programs based on Huffman coding
(Section 3.2) reduced the size
of a large student-record database by 42.1% when only some of
the information was compressed. As a consequence
of this size reduction, the number of disk operations required
to load the database was reduced by 32.7% [Cormack 1985].
Data compression
routines developed with specific applications in mind have
achieved compression factors as high as 98% [Severance 1983].

While coding for purposes of data security (cryptography) and codes which guarantee a certain level of data integrity (error detection/correction) are topics worthy of attention, these do not fall under the umbrella of data compression. With the exception of a brief discussion of the susceptibility to error of the methods surveyed (Section 7), a discrete noiseless channel is assumed. That is, we assume a system in which a sequence of symbols chosen from a finite alphabet can be transmitted from one point to another without the possibility of error. Of course, the coding schemes described here may be combined with data security or error correcting codes.

Much of the available literature on data compression approaches the topic from the point of view of data transmission. As noted earlier, data compression is of value in data storage as well. Although this discussion will be framed in the terminology of data transmission, compression and decompression of data files for storage is essentially the same task as sending and receiving compressed data over a communication channel. The focus of this paper is on algorithms for data compression; it does not deal with hardware aspects of data transmission. The reader is referred to Cappellini for a discussion of techniques with natural hardware implementation [Cappellini 1985].

Background concepts in the form of terminology and a model for the study of data compression are provided in Section 1. Applications of data compression are also discussed in Section 1, to provide motivation for the material which follows.

While the primary focus of this
survey is data compression methods of general utility,
Section 2
includes examples from the literature in which ingenuity applied
to domain-specific problems has yielded interesting coding
techniques.
These techniques are referred to as *semantic dependent* since they
are designed to exploit the context and semantics of the data to
achieve redundancy reduction. Semantic dependent techniques include
the use of quadtrees, run length encoding, or difference mapping
for storage and transmission of image data [Gonzalez and Wintz 1977;
Samet 1984].

General-purpose techniques, which assume no knowledge of the information content of the data, are described in Sections 3-5. These descriptions are sufficiently detailed to provide an understanding of the techniques. The reader will need to consult the references for implementation details. In most cases, only worst-case analyses of the methods are feasible. To provide a more realistic picture of their effectiveness, empirical data is presented in Section 6. The susceptibility to error of the algorithms surveyed is discussed in Section 7 and possible directions for future research are considered in Section 8.