LZSS compression

From ModdingWiki
Jump to navigation Jump to search
LZSS compression
Format typeCompression algorithm
TypeStream
I/O unit size1-bit (bit-level variant), 1-byte (byte-level variant)
Games

LZSS (Lempel-Ziv-Storer-Szymanski) is a dictionary compression algorithm first proposed by James A. Storer and Thomas G. Szymanski in 1982. The method is lossless, and builds upon the existing LZ77 technique.

LZSS makes use of a sliding window just like its predecessor, however differs primarily in its more selective approach to substitution. Where LZ77 simply replaces every string it processes, LZSS will only substitute a string with a "length-distance" pair if the size of the string, in bits, exceeds a preset minimum, leaving the string's symbols as literals if not. This prevents a pair from ever replacing a string whose size is less than its own, and thus avoids such a case from unnecessarily increasing the size of the final output (a problem which can occur in LZ77). As alluded to above, LZSS also does away with the codeword component of LZ77's "length-distance-codeword" triple, leaving only length and distance. The algorithm also introduces a 1-bit flag which prefixes each element in the code, in order to signify whether the succeeding element is a literal or a "length-distance" pair.

Like LZ77, the length and distance in LZSS's "length-distance" pair describe the location, within the search buffer, of the string that a particular pair substitutes (with the buffer containing symbols that have already been processed, or decompressed (if decoding)). During the decompression process, the decoder, upon encountering one of these pairs, can, in an informal sense, determine the string to replace said pair with by looking back into the search buffer by the amount described by "distance", and reading forward by the amount specified by "length". The string that is read is the string in which to use. It should also be noted that it is sometimes possible, depending on the implementation, for the string being read to pass into the look ahead buffer, that is to say, it is possible for "length" to be longer than "distance". The process for handling such cases is fairly straightforward however, once the string passes into the look ahead buffer, the decoder would just need to start reading from the symbols that have already been substituted as a part of processing the current pair (kind of like driving the train while placing the tracks).

Algorithm

There are two variants, which differ in the way the flags are specified. The original algorithm used a stream of bits, however this is much slower than operating at the byte level, so the algorithm was optimised by some to use bytes instead.

Often the bit-level variant will use length and offset sizes that do not sum to a multiple of 8 (e.g. length 2, offset 8 = total 10, not a multiple of 8) whereas byte-level variants will always sum to a multiple of 8 (e.g. length 4, offset 12 = total 16, exactly two bytes).

Variables

  • size_length: Size of length field, in bits
  • size_distance: Size of distance field, in bits
  • min_len: Minimum length of a backreference (at least 2 to break even; 3 is better as it represents a savings of 1 byte)
  • Whether the flag bit is 0 for normal byte, 1 for length-distance pair, or the opposite
  • Whether the length field comes before distance, or distance before length

Decoding Procedure

Bit-level variant

  1. Read one bit as the flag
    • If it's 1: (or 0 if flags are opposite)
      1. Read size_length bits, add min_len to the value, and use it as the LZSS length value
      2. Read size_distance bits, add 1 to the value, and use it as the LZSS distance value
      3. Look back distance bytes in the output data, and copy length bytes to the end of the output data
    • Otherwise: (the flag is a 0, or 1 if flags are opposite)
      1. Read 8 bits, and write them as a byte on the end of the output data
  2. If there is more input data, go back to step 1

Byte-level variant

  1. Read one byte as the next eight flags
  2. For each bit:
    • If it's 1: (or 0 if flags are opposite)
      1. Read a 16-bit value
      2. Split the value into length and distance fields, based on size_length and size_distance
      3. Add min_len to the length value, and use it as the LZSS length value
      4. Look back distance bytes in the output data, and copy length bytes to the end of the output data
    • Otherwise: (the flag is a 0, or 1 if flags are opposite)
      1. Read another byte, and write it as a byte on the end of the output data
  3. If there is more input data, go back to step 1

Example code

Bit-level variant

Byte-level variant

See also

SkyRoads uses a variant of LZSS where the variables can be different for each file, and are stored at the start of the compressed data, as well as allowing two types of length-distance codes in the same file.

Credits

Malvineous discovered this compression algorithm was used by Prehistorik. If you find this information helpful in a project you're working on, please give credit where credit is due. (A link back to this wiki would be nice too!)