LZW compression is a common form of compression used in some early games to compress data and by most early games to compress their executables. It is notable in being one of the first compression methods to not compress on the byte level (Along with Huffman Compression) and for its efficiency.
The basic concept for LZW is universal, though the implementations differ. In essence it involves replacing data strings that have been encountered before with references to already decompressed data. (Known as a 'dictionary') This can be done in a number of ways, the two main approaches differing on whether the dictionary is separate or integrated
Separate Dictionary Approach
In this approach the dictionary is separate from the data being decompressed, that is it is stored in a separate location in memory. In this it behaves more like one would expect a dictionary to work; when a codeword is found in the data, it is looked up in the dictionary and the corresponding string copied to output. (As an example dictionary entry 42 could represent the string 'life', thus whenever the code '42' is encountered the string 'life' is added to the decompressed data.)
The advantage of this method is that the efficiency of compression increases as the amount of data to compress increases. The following points differ between implementations:
- The initial dictionary. Just how large the initial dictionary is varies. Some implementations start with no dictioanry at all, others set a number of entries, usually 255, covering all possible 1-byte values.
- The maximum size of the dictionary. Many older implementations with less resources were forced to cap the dictionary at a certain size, usually a power of two entries long. (512, 1024...) Unlimited implementations are rare as modern methods (e.g. the DEFLATE algorithm.) usually rely on several compression methods at once. Sometimes the dictionary is 'reset' when it reaches too large a size.
- Whether the codestream is made partly or entirely of codewords. Often the compressed data is made entirely of codewords, even non-repeating strings, which means that initially compression can sometimes be rather poor. Other implementations use codeowords only for repeating strings. Differences in how codewords vs literal are indicated and how dictionaries are built up may occur.
This is a general decoding algorithm for separate dictionary LZW. It will need to be altered slightly when dealing with different implementations. Notably it assumes that the codestream is composed entirely of codewords and that the dictionary can keep growing indefinitely.
1 At the start the dictionary contains all possible roots; 2 cW := the first code word in the codestream (it denotes a root); 3 output the string.cW to the charstream; 4 pW := cW; 5 cW := next code word in the codestream; 6 Is the string.cW present in the dictionary? a if it is, i output the string.cW to the charstream; ii P := string.pW; iii C := the first character of the string.cW; iv add the string P+C to the dictionary; b if not, i P := string.pW; ii C := the first character of the string.pW; iii output the string P+C to the charstream iv add the string P+C to the dictionary (now it corresponds to the cW); 7 Are there more code words in the codestream? a if yes, go back to step 4; b if not, END.
In Commander Keen 1-3 LZW is used to compress the EGALATCH and EGASPRIT files in episode 1 (It can also be used in episodes 2 and 3, but isn't.) The game uses two error-checking methods in this implementation, firstly it reserves two dictionary values, $100 to indicate an error (This is written by the compression program and will make the executable abort.) and $101 to indicate the end of data. (If the program reaches the end of the data without encountering this it will also abort.) The compressed data is also prefixed with a dword giving the decompressed data size, so this can be compared with the output.
This method is a typical separate dictionary approach. It starts with a dictionary of 256 9-bit codewords representing the 8-bit strings $00-$FF (Plus some special cases.) The dictionary is allowed to grow to 4096 entries. The following is the initial dictionary:
0000 - 00 (character) 0001 - 01 (character) ... 00FE - FE (character) 00FF - FF (character) 0100 - Reserved for errors... 0101 - Reserved for end of compressed data... 0102 - (not set) 0103 - (not set)
It will be immediately noticed that 4096 entries cannot be represented by 9-bit codes but at the least by 12-bit codes. To further conserve space the length of the codewords is increased every time the dictionary grows too large. Thus when it reaches $01FF entries codewords become 10 bits long, at $03FF they are 11 bits and finally at $07FF 12 bits. At $0FFF entries the dictionary stops growing.
The following data is taken from the EGALATCH file from Keen 1. Notice that the first six bytes are ignored. (The first four give the decompressed data size, the next two are an executable variable of unknown use.) The first few steps of decompression follows.
0000 80 D3 01 00 0C 00 00 40 - A0 70 31 E9 F8 F8 78 38 0010 08 08 00 07 FC 39 FF 04 - 5E 41 E1 30 B3 C4 5A 2F
The first code word encountered is 000000000 (First 9 bits) and thus outputs the string $00 (First dictionary entry.) This has set us up to step 4 and now things work slightly differently.
The second code word is 100000010 (Next 9 bits) which is entry $102. Since this entry is NOT found in the dictionary yet, we will create this entry then output it. Entry $102 is created by taking the previous codeword's string adding to it the first byte of that string. In this case the previous codewrode's (0) string is $00. $00 + $00 is $00 $00 Entry $102 thus represents the string $00 $00
The next code word is 100000011 which is entry $103, which again doesn't exist. Entry $103 is created just like with $102, except now since the previous codeword is $00 $00, entry $103 is $00 $00 $00
The next code word is again $103, this IS found in the dictionary and is outputted ($00 $00 $00) however we now create dictionary entry $104 just like $103. (It is $00 $00 $00 $00)
The next code word is $3B. It is outputted and entry $105 created ($00 $00 $00 $00 $3B) Note that now the previous codeword is $3B. This pattern continues.
Integrated Dictionary Approach
In this approach the dictionary is the decompressed data itself. The codewords do not represent an entry in the dictionary structure, but rather directions in the decompressed code as to where a repeated string is located. That is a repeat string may well be represented by a codeword that states 'copy seven bytes from byte 132 in the decompressed data'
The advantage of this approach is that it doesn't require a separate construct for the dictionary and can just use the already decompressed data. The downside is that it will always need some method to distinguish literals and codewords, and, since codewords are nearly always of a fixed length there is an inherent limit both to how long the copied string can be and where it can be read from. this means that eventually the compression efficiency will level off and stop improving.
There This is often called the 'sliding window' and it represents the data that can be 'reached' by the codewords. It is named as such because it is of a fixed length and 'slides' along the output stream as it gets longer. The following features differ between implementations:
- Differentiating codewords from literals. There must be a way to tell codewords apart from data that is just to be read and outputted. Sometimes this is integrated into the codewords themselves, but more often the 'flag' precedes a codeword. The flag may indicate only codewords or both codewords and literals. ('Following data is made of two codewords and six literals', etc.)
- Codeword format. Most implementations use codewords of a fixed format that must encode both the length of data to copy and the location to copy it from. Codewords are usually two or four bytes long.
- Zero point location. Different implementations use different places in the output as zero. If the start of the code is used as zero only the first x bytes of the output can be used as a reference. Most implementations use a more complex, but seldom less effective 'sliding window'; the zero point is the start of the data until the data becomes too long at which point it moves forward so that the most recent x bytes of output can be used. It is also possible for zero to be the most recent byte with all locations being 'x bytes from the end', which also produces a sliding window. Finally it may be possible to have both negative and positive locations.
- Sliding window. The nature of this is dictated by the format of the codewords. Common sizes are 1KB, 2KB or 4KB. The window will always be present, but it may not 'slide' if the implementation uses a fixed location as the zero point.
Many vintage executables are compressed with the program LZEXE, interesting in that the compressed file contains its own decompressor, meaning that it is in essence a self-extracting archive. However the unique feature of LZEXE executables is that they extract the compressed data to memory and run it. To the user this is indistinguishable from the decompressed executable, though it takes slightly longer to start up and takes up much less space.
UNLZEXE can be used to extract the decompressed executables from this which will run perfectly with other game files. It can be obtained here: http://www.dosclassics.com/download/198 It is currently unknown specifically how the LZW compression is implemented in this case, but with the source code for decompression is available.
The LZEXE compression is similar to the SoftDisk Library Approach described below, but it uses UINT16LE values instead of byte values to store the flag bits and the sliding window has a size of 8192 (0x2000) bytes. Also, the flag bits have different meanings (you could argue that they are in fact Huffman codes):
1 -> copy 1 literal byte 10 -> next two bytes contain length and distance 0000 -> length is 2, next byte contains distance 1000 -> length is 3, next byte contains distance 0100 -> length is 4, next byte contains distance 1100 -> length is 5, next byte contains distance
The real distance value is always a signed 16 bit integer (you can use unsigned but then you have to bitwise-and the resulting index value with 0xFFFF). If the length is given by the flag bits/Huffman code, the real distance is b | 0xFF00 or b - 256 where b is the byte value read from the file. If the length value is not given, read the byte values b0 and b1 and calculate length and distance like this:
length = (b1 mod 8)+2 distance = b0 + (b1/8)*256 - 8192
length = (b1 & 0x07)+2 distance = b0 | ((b1 & 0xF8)<<5) | 0xE000
If the length value calculated from b1 is 2, this indicates that another byte value b2 must be read. Depending in the value of b2, one of three things can happen:
b2 = 0: end of compressed data - stop decompressing b2 = 1: end of segment (decompressor may write contents of buffer to output) set length to 0 or jump to the part where the decompressor reads the next flag bits/Huffman code otherwise: set length to b2+1
Now the decompressor must only add distance to the current buffer index (since distance is negative, the index goes backwards) and copy length bytes from there to the current index:
WHILE length > 0 buffer[index] = buffer[index+distance] index = index + 1 length = length - 1 END WHILE
Please refer to the ULZEXE source code for further information.
SoftDisk Library Approach
This is used as the first form of compression in the Softdisk Library Format. Flags are 1-byte long and divide the datastream into segments of eight 'values' which can be either literals or codewords. Codewords are 2 bytes long, literals 1 byte. (Therefore there will be a flag byte every 8 to 16 bytes of data.) The value of each bit (In little endian) indicates whether a value will be a literal (1) or codeword (0) Thus a value of 199 (11000111 in binary) indicates three codewords, three literals and two codewords in that order. (Total of 13 bytes.)
Literals are sequences that have never been seen in the datastream before, they cannot be compressed and are thus the same in the compressed and decompressed datastreams. (If the data is text they become quite obvious.) Any string less than 3 bytes long that has not been read before or cannot be pointed to (See below) will be stored as literals.
Codewords are reference to data that has already been read. They are two bytes long, with the first 12 bits giving the location to read data from and the last 4 bits giving the length of data to read.
The lower nybble (4 bits) of the second codeword byte holds the length of repeat data to read minus three. (This makes sense, the shortest sequence it makes sense to code is three bytes which can be given the value 0.) It will be immediately apparent that the maximum length of repeated data that can be stored as a codeword is 18 bytes.
The high nybble of the second byte is multiplied by 16 then added to the first byte to give the location of the data to read in the 'sliding window' minus 19. (This is due to the way the decompression is set up in memory.)
It will be immediately obvious that the codewords can encode values between +-2048, or about 2KB. If the decompressed data is less than 2KB in size then zero is the start of the data, if it is larger than it is 2048 bytes from the data end.
It will be noted that it is probable that the compressed datastream will not be perfectly divisible by flag bytes. In this case the unused bits are set to 0. The decompressor stops when the decompressed data size is equal to the value given in the chunk header. (If it runs out of data it will abort.)
As a simple example the sentence 'I am Sam. Sam I am!' will be compressed to:
FF Flag byte, 8 literals follow 49 20 61 6D 20 53 61 6D 'I am Sam' as literals 2B Flag byte, 2L, P, L, P, L 2Blanks ($2B = 43 = 00101011) 2E 20 ' .' as literals F2 F0 codeword, read 0 + 3 = 3 bytes from $FF2, or -14 + 19 = 5 in the data. This is 'Sam' 20 ' ' as literal ED F1 codeword, read 1 + 3 = 4 bytes from $FED or -19 + 19 = 1 in the data. This is 'I am' 21 '!' as literal
Some example code is available in various languages showing how to decompress (and in some cases compress) files using the Keen's LZW algorithm in its various implementations.
Keen 1-3 Implementation
These segments f code work with the Keen 1-3 implementation only and will not for example decompress LZEXE compressed executables.
This code does not suffer from the 64K memory limit imposed by QuickBasic and so is less efficient, but runs faster. It can be compiled with FreeBasic compiler using the -lang=qb switch. Aside from memory concerns, all code here is compatible with QuickBasic.
The code before the subroutine is used to make a string containing the bit expansion of all values from 0 to 255. The subroutine takes a filename, reads the entire file into memory then expands each bit of data to a byte using the aforesaid string as Basic cannot deal with bits directly. cw$ is codeword, pw$ is the previous codeword, lun is the lowest dictionary entry that is empty, p is the location in the compressed data stream and bl is the length of codes in bits (Starting at nine bits increasing to 12)
The dictionary is set before decompression. The first 258 are the starting dictionary, the remainder are cleared. (It is vital to reset the dictionary for each file) An error occurs if entry 256 is found in the data, 'distrupt' is printed when the newest dictionary entry is not the lowest possible entry (This shouldn't happen but is possible.) Decompression ends at encountering entry 257, or when there is no more data to read.