Technology Dock - Information Technology Articles

Compression and PKZIP - June 1991
(Archive Article)
by Dean S. Tripodes

[Man at PC]

How many of you are suffering from hard drives filling to the rim? DOS 5.0 relieves your RAM memory quite well, but it takes over two megabytes of disk space. Soft fonts for the laser printer can take up to a megabyte each for a 72 point font, and five megabytes is not uncommon for a fair selection of 10, 12, 14, 16, 28, and 36 point fonts with normal, bold, and italics attributes. WordPerfect 5.1 can easily take four megabytes and more if all the printer (.ALL) files and graphics (.WPG) files are loaded. Spreadsheets and databases can take up a tremendous amount of space, and that is just the beginning. You can fill up a forty megabyte hard drive just by installing all of your favorite programs -- not including data. Storing programs on floppy diskettes is one solution, but that keeps you from having quick access to your programs or data. If it is not at your fingertips, then chances are you won't bother to use it unless there is an emergency requirement for that specific information or program.

But computers are supposed to help us get more work done, not limit us. Even though there are practical limitations to what we can save and store permanently on our computers -- we can't all get 300 Megabyte hard drives, there is actually some logic to not saving everything. Even if we did have monstrous hard drives, we would find a way to fill them eventually. We actually get spoiled with too much space. Would you have deleted WordPerfect 5.0 when you installed version 5.1 if there were limitless space? Probably not. Chances are you kept it on for awhile, but when you started using 5.1's advanced features such as tables, the equation editor, or improved graphics handling, then 5.0 got deleted. It was either that or store your documents on floppy disk. Try that for a few projects longer than one or two pages and that quickly proves to be a bad solution.

You might have heard about data compression in various advertised software and hardware products. SQZ will compress Lotus or Quattro spreadsheets. Fastback will compress data as it backs it up to floppy diskettes. PKZIP and the older PKARC will compress data from one or multiple files into one tightly compressed .ZIP or .ARC file.

But after a moments consideration, the concept of data compression seems too good to be true. The idea of shrinking information without losing any of it looks to be a something-for- nothing proposition that violates what should have been one of Newton's lesser known laws: the law of conservation of data.

Despite the aura of mystique that surrounds it, data compression is based on a simple idea. Data that is represented with one group of symbols is mapped to another, more concise series of symbols. Data compression programs and dedicated compression hardware use several different algorithms to achieve this end.

Two different schemes, Huffman coding and LZW coding (named for Lempel and Ziv, its creators, and Welch, who made substantial modifications), form the basis for much of the compression that we use from day to day. These techniques also represent two distinct schools of compression algorithms. An understanding of how each algorithm works provides and excellent lesson about compression in general. Both Huffman and LZW coding are lossless compression techniques, that is, they are appropriate to use for compressing any kind of data because the expanded representation is identical to the original input given to the compressor. There are other algorithms that can achieve fantastic compression ratios at the expense of exact data reproduction. These techniques work well for images and sound data, but they are not appropriate for general data.

Huffman coding, originally proposed sometime in the early 1950's, reduces the number of bits used to represent frequent characters and increases the number of bits used for infrequent characters. The LZW method, on the other hand, encodes strings of characters, using the input stream to build an expanded alphabet based on the strings that it sees. These two very different approaches both work by reducing redundant information in the input data.

Huffman coding is probably the best known method of data compression. A descendant of Huffman coding is used as one stage in PKZIP's powerful "imploding" algorithm. The technique works on the premise that some symbols are used more often than others in data representation. The most common representation, the ASCII alphabet, uses 8 bits for each character. In English, the letter "e" is much more likely to appear than the letters "j", "q", or "z". Even though we know this, the same number of bits is used to represent each. If we used only 4 bits for an "e" and 12 bits for a "j", we would save some bits whenever storing English text. Essentially, Huffman coding formalizes the idea of relating symbol length to the probability of a symbol's occurrence. The greatest difficulty with Huffman codes, is that they require a table of probabilities for each type of data to compress. This is not a problem if you know you will always compress English text. In that case, you simply provide a suitable English text tree to the compressor and decompressor. Fortunately, a dynamic version of Huffman compression can construct the Huffman tree on the fly while reading and actively compressing. The tree is constantly updated to reflect the changing probabilities of the input data, allowing the compressor to give smaller representation (fewer bits) to more common strings of data and longer representation (more bits) to less frequent strings.

The LZW algorithm, which was first presented by Welch in 1984, has become a widely used technique during the last few years. CompuServe's GIF file format used LZW compression, as do ARC, Unix's compress, Stuffit, and PKZIP. The algorithm itself is patented by Sperry.

LZW works by using additional characters to represent strings of regular characters, in effect, lengthening the alphabet. To use LZW compression on 8 bit ASCII codes, you extend the alphabet by using 9 bit or larger codes. The additional 256 characters that the 9 bit codes give you are used to store strings of 8 bit codes, which are determined from strings in the input. The compressor maintains a string table with strings and their corresponding codes. The string table corresponds to the extended alphabet. Initially, the compressor starts with a string table with only the 256 literal codes defined (such as "A", "B", "C", "3", "$", etc). If you are using 9 bit codes, the string table has an additional 256 empty entries for strings (such as "OR", "THE", "AND"). If you are using 10 bit codes, it has 768 empty entries, and so on.

The compression algorithm works like this: Start with a null string. Read in a character, and append it to the string. If the string is in the string table, continue reading and appending characters until you find a string that is not. Add this string to the string table. Write the code for the last known string that matched the output. Use the last character as the basis for the next string, and continue reading until you run out of input. That is really all there is to it.

Unfortunately, this simple compression algorithm will sooner or later run out of string locations. Even using a 13 bit table, with 16 bit pointers, and 16 kilobytes of RAM for quick table access will not hold everything. Luckily, there is a simple way out. Each new string has a possibility of being comprised of a new character plus an old string, or vice versa. For example, if the string "ANGER" is represented as code 260, and we come upon the string "RANGER", we can represent it as "R"+260 without using one of our string locations.

The new areas to be developed in data compression come when we try to deal with the exceptions. If our string table is full, how do we store the string "glockenspiel" (a synonym for xylophone)? Do we simply copy it character for character and gain no data compression? One idea on this problem is to save a history of when each string in the data table was used last. The oldest string could be cleared as newer strings appeared. This solution poses the problem, however, with our modified algorithm. When we store our string data as base code plus character combinations, such as "RANGER" = "R"+260 in the previous example, we could erroneously eliminate the previous definition of code 260. Because of this potential problem, we would need to keep track of only those strings that are not used by other codes to form their representation. In this case, we could not eliminate "ANGER" as code 260. No matter if code 260 was never used again to represent the word "ANGER", it would need to remain because it was used in the representation of "RANGER".

Although data compression may appear complex and fraught with danger, it is actually quite a valuable and reliable tool. The compressor, PKZIP, and the decompressor, PKUNZIP, are shareware programs, and are available for free.


Terms of Use For Baywalk - Use of Baywalk signifies your agreement to the terms of use.



Top of Page