# Huffman Compression

**Huffman Compression**

Format type | Compression algorithm |
---|---|

Type | Stream |

I/O unit size | 1-byte in, 1-bit out |

Games |

**Huffman Compression** is a compression algorithm used in many classic games. It is about as efficient as LZW or RLE compression, with many games using the three formats simultaneously.

Huffman compression involves changing how various characters are stored. Normally all characters in a given segment of data are equal and take an equal amount of space to store. However by making more common characters take up less space while allowing less commonly used characters to take up more the overall size of a segment of data can be reduced. Any Huffman compressed data will thus be associated with a dictionary file (internal or external) used to decompress it. This page will go through each aspect of Huffman compression in a stepwise manner before presenting some examples of usage and source code.

## Contents

## Frequency tables

To illustrate the various aspects of Huffman compression the example text "She sells seashells by the sea shore." will be Huffman compressed and the processes and outputs shown in various sections on this page.

In order to compress data using Huffman compression the compressor must first build a 'frequency table' of all the elements it wishes to compress. In most cases this will be the 256 values a single byte can take, though it need not be limited to this. (Due to the binary nature of the compression method, sets of elements that match powers of 2 are both easiest and most efficient to compress.) The data is scanned and each occurrence of a given value. Taking the letters in our example sentence we can construct the following frequency table:

Char Freq 's' 7 'e' 7 ' ' 6 'h' 4 'l' 4 'a' 2 'S' 1 'b' 1 'o' 1 'r' 1 't' 1 'y' 1 '.' 1

## The Binary Tree

The binary tree is core to how Huffman compression compresses data. This is done by constructing a 'binary tree', so named because of its branching structure. Each branching point or 'node' has two options, 'left' and 'right' which lead either to another node or a character. The tree is constructed so that every character can be placed upon it and every path ends in a character.

This is done by first building a point of origin, 'root node' and the adding further nodes from there. Using our example we can build the following binary tree simply by listing the nodes alphabetically (and using '_' to stand in for the space for clarity):

[Root] ____|_______ / \ __|__ | / \ / \ | | | . / \ / \ / \ | | | | | | / \ / \ / \ / \ / \ / \ _ S a b e h l o r s t y

On this tree we can use 'L' to specify a left path and 'R' to specify a right. As such the letter 'a' is represented by 'LLRL' and 'y' by 'RLRR'. Note that since the number of different characters is not a power of two (E.g. 16) not all paths from the root to a character are the same length, one is considerably shorter. The '.' character is represented by simply 'RR'

Assuming that 'R' and 'L' are '0' and '1' each 8-bit character in the original text can be represented by just 4 bits on average, halving the message length. This is far from optimal however and there is a specific procedure used to construct an optimal tree for the shortest message length. To do this advantage is taken of the fact that paths can be different lengths. If one set of paths is made a bit shorter at the expense of making a smaller set of paths a bit longer then the overall message will be made shorter.

### Constructing an optimal tree

The first step in creating an optimal tree again is to create a 'root node'. To populate its two options the compressor repeatedly divides the frequency table. This is done by starting with the most common character and seeing if it contains more occurrences than the rest of the table combined. If not then the second most common character is added and so on.

In our example there are a total of 37 characters. Three characters make up more than half of this, 's', 'e' and ' '. Because of the binary structure of the tree however *four* options are needed here so 'h' is also included. All of these characters are placed on the left option of the root node while the remaining characters are placed on the right.

The next step creates a new node with left and right options and repeats the process. The two most common characters of the four, 's' and 'e' go to the left branch of this new node. A third node is created and, now that there are only two options, each is given a branch. The tree now looks like this:

[Root] / \ | l, a, etc / \ | _, h / \ s e

Now the character 's' can be represented as LLL and 'e' as LLR.

The compressor now moves 'up' to the previous node which still has a branch with more than one option in it. Another node is created and, since there are only two options, one is assigned to each branch. Te compressor then returns to the root node to deal with the right half of the tree. When it has assigned very character to a node the following optimal tree is produced:

___[Root]____ / \ _|_ ___|_______ / \ / \ | | | | / \ / \ / \ / \ s e _ h l | | | / \ / \ / \ a | o r t | / \ / \ S b y .

Notice that 4 characters now have paths that are just 3 choices long while 4 have paths 4 choices long and 4 have paths 5 choices long. Overall the average path length is the same as our previous tree. However 24 occurrences are one choice shorter, 5 are identical, 3 are one choice longer and '.' is three choices longer. This gives us a compressed message that is a total of 18 choices (or bits) shorter.

### Labeling the tree

Now that a tree has been constructed it needs to be labelled in some form so that it can be stored as data either in the decompressor or to be sent with the compressed code. There are two major issues, what system to use when assigning numbers to the various nodes, and what bit value, 0 or 1, to assign to left or right options.

The most common (but far from unique) system is to label nodes from the 'bottom' of the tree to the top, going left-to-right across each 'layer' of the tree. (In our example the node that leads to 'S' and 'b' would thus be the first numbered node.) Nodes are usually numbered starting with 0 and ending with the root node having the highest number. Our tree would be numbered as follows using this system:

____[11]_____ / \ [9] __[10]_____ / \ / \ [5] [6] [7] [8] / \ / \ / \ / \ s e _ h l [2] [3] [4] / \ / \ / \ a [0] o r t[1] / \ / \ S b y .

Usually the left branch is given a bit value of 0 and the right branch a bit value of 1. Using this scheme 's' in our example resides on node 1 and is represented by the bit string '000' The complete compressed message in bit form is now `10110 011 001 010 000 001 100 100 000 010 000 001 1010 000 011 001 100 100 000 010 10111 11110 010 1110 011 001 010 000 001 1010 010 000 011 1100 1101 001 11111` or in bytes `$B3 $28 $19 $02 $06 $83 $32 $05 $7F $2E $65 $03 $48 $3C $D3 $F0` (The final byte is padded with nul bits.) This is a total of 16 bytes used for a 37 byte message.

### Standard tree size

The most common implementation of Huffman compression is '1 byte in, 1 bit out'. This involves compressing one byte of data at a time and converting it into a bitstream. In this case the compressor expects to find 256 different possible values of data. This requires a tree with 255 nodes, including the root, and an average path length of 8 choices.

This is not the only possibility however and Huffman can compress data in words or dwords or in more exotic manners. for example combines Huffman with LZW compression, building a binary tree of a fixed size and filling it with the most recently encountered strings of bits.

## The dictionary

Often the binary tree must be stored in some form, usually called a *dictionary*. This will either be fixed and known by both the compressor and decompressor. (Which minimizes individual message sizes but can limit compression efficiency overall.) or tailored for each message and sent with said message.

Dictionary format is quite standard. It is merely a numerical list of nodes (however they have been labelled.) starting with node 0 and ending with the root. Each node usually taking up four bytes, two bytes per branch. Each branch will have one byte dedicated to identifying if it leads to another node or a character and the other byte identifying the node or character. Usually '0' identifies the option asleading to a character and '1' to another node, but again this can be reversed.

Using this system, our example tree could be stored as either of the two following:

0S0b 0y0. 0a10 0o0r 0t11 0s0e 0_0h 0l12 1314 1516 1718 19110

$00 $53 $00 $62 $00 $79 $00 $2E $00 $6F $00 $72 $00 $74 $01 $01 $00 $73 $00 $65 $00 $20 $00 $68... - S - b - y - . - o - r - t | n1 - s - e - _ - h ... (node 0) (node 1) (node 3) (node 4) (node 5) (node 6) ...

The dictionary is thus usually 1020 bytes long, containing 255 nodes of 4 bytes each, though again this can vary depending on how readable the dictionary is to humans and whether or not the dictionary itself is compressed.

## Huffman implementation in ID Software games

Games created by Id Software have a shared but distinct implementation of Huffman compression.

In older games such as the dictionary and data are included together in one file. First is a four-byte header,, 'HUFF' followed by four bytes that give the decompressed size of the data (for memory purposes.). Following that is the dictionary, 1020 bytes in length then the compressed data.In newer games (such as Commander Keen 4-6 or Wolfenstein 3-D) the data is stored in an external file which consists of a number of concatenated 'chunks', each chunk beginning with a dword giving its decompressed size (again for memory purposes.). These chunks are addressed by an internal or external 'header' file consisting of a series of three (or sometimes four) byte offsets to the start of various chunks in the data file. (The end of the header is easily found as the last entry is the length of the data file.) The dictionary is 1024 bytes long and stored in the executable.

The dictionary consists of 255 nodes (in newr games plus four null bytes of padding) and there are two or three dictionaries stored in the executable depending on the game. The end of these dictionaries can be found by looking for node 255 (number 254!) which universally consists of the string `$00 $00 $FD $01` due to the fact that the byte `$00` is the single most common character in game data and thus always occupies the left branch of the root node while the other branch must go to the next lowest node, number 253.

An important but subtle difference is how data is read. While bytes are read sequentially (1, 2, 3...) bits within each byte are read in reverse order (8, 7, 6...). (Technically each byte is little endian.)

### Trivial ID dictionary

The following is a complete 'trivial' dictionary. This dictionary is constructed such that data compressed with it is not altered in any way, the output and input are the same. When used to construct a binary tree all paths are of equal length, 8 choices or bits. And the nodes are arranged in a logical order with an easily seen pattern; the first half of the tree consists of terminal nodes, the second half of branch nodes. (Of the second half the first half of *that* consists of branch nodes that lead to terminal nodes while the second half consists of branch nodes to two branch nodes and so on.)

$00 $00 $80 $00 $40 $00 $C0 $00 $20 $00 $A0 $00 $60 $00 $E0 $00 $10 $00 $90 $00 $50 $00 $D0 $00 $30 $00 $B0 $00 $70 $00 $F0 $00 $08 $00 $88 $00 $48 $00 $C8 $00 $28 $00 $A8 $00 $68 $00 $E8 $00 $18 $00 $98 $00 $58 $00 $D8 $00 $38 $00 $B8 $00 $78 $00 $F8 $00 $04 $00 $84 $00 $44 $00 $C4 $00 $24 $00 $A4 $00 $64 $00 $E4 $00 $14 $00 $94 $00 $54 $00 $D4 $00 $34 $00 $B4 $00 $74 $00 $F4 $00 $0C $00 $8C $00 $4C $00 $CC $00 $2C $00 $AC $00 $6C $00 $EC $00 $1C $00 $9C $00 $5C $00 $DC $00 $3C $00 $BC $00 $7C $00 $FC $00 $02 $00 $82 $00 $42 $00 $C2 $00 $22 $00 $A2 $00 $62 $00 $E2 $00 $12 $00 $92 $00 $52 $00 $D2 $00 $32 $00 $B2 $00 $72 $00 $F2 $00 $0A $00 $8A $00 $4A $00 $CA $00 $2A $00 $AA $00 $6A $00 $EA $00 $1A $00 $9A $00 $5A $00 $DA $00 $3A $00 $BA $00 $7A $00 $FA $00 $06 $00 $86 $00 $46 $00 $C6 $00 $26 $00 $A6 $00 $66 $00 $E6 $00 $16 $00 $96 $00 $56 $00 $D6 $00 $36 $00 $B6 $00 $76 $00 $F6 $00 $0E $00 $8E $00 $4E $00 $CE $00 $2E $00 $AE $00 $6E $00 $EE $00 $1E $00 $9E $00 $5E $00 $DE $00 $3E $00 $BE $00 $7E $00 $FE $00 $01 $00 $81 $00 $41 $00 $C1 $00 $21 $00 $A1 $00 $61 $00 $E1 $00 $11 $00 $91 $00 $51 $00 $D1 $00 $31 $00 $B1 $00 $71 $00 $F1 $00 $09 $00 $89 $00 $49 $00 $C9 $00 $29 $00 $A9 $00 $69 $00 $E9 $00 $19 $00 $99 $00 $59 $00 $D9 $00 $39 $00 $B9 $00 $79 $00 $F9 $00 $05 $00 $85 $00 $45 $00 $C5 $00 $25 $00 $A5 $00 $65 $00 $E5 $00 $15 $00 $95 $00 $55 $00 $D5 $00 $35 $00 $B5 $00 $75 $00 $F5 $00 $0D $00 $8D $00 $4D $00 $CD $00 $2D $00 $AD $00 $6D $00 $ED $00 $1D $00 $9D $00 $5D $00 $DD $00 $3D $00 $BD $00 $7D $00 $FD $00 $03 $00 $83 $00 $43 $00 $C3 $00 $23 $00 $A3 $00 $63 $00 $E3 $00 $13 $00 $93 $00 $53 $00 $D3 $00 $33 $00 $B3 $00 $73 $00 $F3 $00 $0B $00 $8B $00 $4B $00 $CB $00 $2B $00 $AB $00 $6B $00 $EB $00 $1B $00 $9B $00 $5B $00 $DB $00 $3B $00 $BB $00 $7B $00 $FB $00 $07 $00 $87 $00 $47 $00 $C7 $00 $27 $00 $A7 $00 $67 $00 $E7 $00 $17 $00 $97 $00 $57 $00 $D7 $00 $37 $00 $B7 $00 $77 $00 $F7 $00 $0F $00 $8F $00 $4F $00 $CF $00 $2F $00 $AF $00 $6F $00 $EF $00 $1F $00 $9F $00 $5F $00 $DF $00 $3F $00 $BF $00 $7F $00 $FF $00 $00 $01 $01 $01 $02 $01 $03 $01 $04 $01 $05 $01 $06 $01 $07 $01 $08 $01 $09 $01 $0A $01 $0B $01 $0C $01 $0D $01 $0E $01 $0F $01 $10 $01 $11 $01 $12 $01 $13 $01 $14 $01 $15 $01 $16 $01 $17 $01 $18 $01 $19 $01 $1A $01 $1B $01 $1C $01 $1D $01 $1E $01 $1F $01 $20 $01 $21 $01 $22 $01 $23 $01 $24 $01 $25 $01 $26 $01 $27 $01 $28 $01 $29 $01 $2A $01 $2B $01 $2C $01 $2D $01 $2E $01 $2F $01 $30 $01 $31 $01 $32 $01 $33 $01 $34 $01 $35 $01 $36 $01 $37 $01 $38 $01 $39 $01 $3A $01 $3B $01 $3C $01 $3D $01 $3E $01 $3F $01 $40 $01 $41 $01 $42 $01 $43 $01 $44 $01 $45 $01 $46 $01 $47 $01 $48 $01 $49 $01 $4A $01 $4B $01 $4C $01 $4D $01 $4E $01 $4F $01 $50 $01 $51 $01 $52 $01 $53 $01 $54 $01 $55 $01 $56 $01 $57 $01 $58 $01 $59 $01 $5A $01 $5B $01 $5C $01 $5D $01 $5E $01 $5F $01 $60 $01 $61 $01 $62 $01 $63 $01 $64 $01 $65 $01 $66 $01 $67 $01 $68 $01 $69 $01 $6A $01 $6B $01 $6C $01 $6D $01 $6E $01 $6F $01 $70 $01 $71 $01 $72 $01 $73 $01 $74 $01 $75 $01 $76 $01 $77 $01 $78 $01 $79 $01 $7A $01 $7B $01 $7C $01 $7D $01 $7E $01 $7F $01 $80 $01 $81 $01 $82 $01 $83 $01 $84 $01 $85 $01 $86 $01 $87 $01 $88 $01 $89 $01 $8A $01 $8B $01 $8C $01 $8D $01 $8E $01 $8F $01 $90 $01 $91 $01 $92 $01 $93 $01 $94 $01 $95 $01 $96 $01 $97 $01 $98 $01 $99 $01 $9A $01 $9B $01 $9C $01 $9D $01 $9E $01 $9F $01 $A0 $01 $A1 $01 $A2 $01 $A3 $01 $A4 $01 $A5 $01 $A6 $01 $A7 $01 $A8 $01 $A9 $01 $AA $01 $AB $01 $AC $01 $AD $01 $AE $01 $AF $01 $B0 $01 $B1 $01 $B2 $01 $B3 $01 $B4 $01 $B5 $01 $B6 $01 $B7 $01 $B8 $01 $B9 $01 $BA $01 $BB $01 $BC $01 $BD $01 $BE $01 $BF $01 $C0 $01 $C1 $01 $C2 $01 $C3 $01 $C4 $01 $C5 $01 $C6 $01 $C7 $01 $C8 $01 $C9 $01 $CA $01 $CB $01 $CC $01 $CD $01 $CE $01 $CF $01 $D0 $01 $D1 $01 $D2 $01 $D3 $01 $D4 $01 $D5 $01 $D6 $01 $D7 $01 $D8 $01 $D9 $01 $DA $01 $DB $01 $DC $01 $DD $01 $DE $01 $DF $01 $E0 $01 $E1 $01 $E2 $01 $E3 $01 $E4 $01 $E5 $01 $E6 $01 $E7 $01 $E8 $01 $E9 $01 $EA $01 $EB $01 $EC $01 $ED $01 $EE $01 $EF $01 $F0 $01 $F1 $01 $F2 $01 $F3 $01 $F4 $01 $F5 $01 $F6 $01 $F7 $01 $F8 $01 $F9 $01 $FA $01 $FB $01 $FC $01 $FD $01

Using this example we can see the unusual manner in which ID games read bits within bytes. As an example consider the character $80; in binary this is '10000000' However, following this path using the dictionary given above will output the character $01 (00000001). Instead the bits must be read in reverse order. Doing so will lead to the following nodes starting at the root node: 254(root) -> 252 -> 248 -> 240 -> 224 -> 192 -> 128 -> 0(terminal node for characters $00 and $80)

## Source code

Some example code is available in various languages showing how to decompress (and in some cases compress) files using the Huffman algorithm.

### QuickBasic

' ' DANGEROUS DAVE 2 - IN THE HAUNTED MANSION - Huffman Decompressor ' - by Napalm with thanks to Adurdin's work on ModKeen ' ' This source is Public Domain, please credit me if you use it. ' ' DECLARE SUB HUFFDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING) TYPE NODE BIT0 AS INTEGER BIT1 AS INTEGER END TYPE ' Input filename, output filename HUFFDECOMPRESS "TITLE1.DD2", "TITLE1.PIC" SUB HUFFDECOMPRESS (INNAME AS STRING, OUTNAME AS STRING) ' by Napalm DIM INFILE AS INTEGER, OUTFILE AS INTEGER, I AS INTEGER DIM SIG AS LONG, OUTLEN AS LONG, BITMASK AS INTEGER DIM CURNODE AS INTEGER, NEXTNODE AS INTEGER DIM CHRIN AS STRING * 1, CHROUT AS STRING * 1 DIM NODES(0 TO 254) AS NODE ' Open input file INFILE = FREEFILE OPEN INNAME FOR BINARY ACCESS READ AS INFILE ' Check file signature GET INFILE, , SIG IF SIG <> &H46465548 THEN ' Hex for: HUFF in little endian PRINT "INVALID FILE!" EXIT SUB END IF ' Get output length OUTLEN = 0 GET INFILE, , OUTLEN ' Read in the Huffman binary tree FOR I = 0 TO 254 GET INFILE, , NODES(I).BIT0 GET INFILE, , NODES(I).BIT1 NEXT I ' Open output file OUTFILE = FREEFILE OPEN OUTNAME FOR BINARY ACCESS WRITE AS OUTFILE ' Decompress input data using binary tree CURNODE = 254 DO BITMASK = 0 GET INFILE, , CHRIN DO ' Decide which node to travel down depending on ' input bits from CHRIN. IF ASC(CHRIN) AND 2 ^ BITMASK THEN NEXTNODE = NODES(CURNODE).BIT1 ELSE NEXTNODE = NODES(CURNODE).BIT0 END IF ' Is this next node another part of the tree or ' is it a end node? Less than 256 mean end node. IF NEXTNODE < 256 THEN ' Get output char from end node and save. CHROUT = CHR$(NEXTNODE AND &HFF) PUT OUTFILE, , CHROUT ' Amend output length and start from top of ' binary tree. OUTLEN = OUTLEN - 1 CURNODE = 254 ELSE ' Travel to next node CURNODE = (NEXTNODE AND &HFF) END IF ' Move to next input bit BITMASK = BITMASK + 1 LOOP WHILE BITMASK < 8 AND OUTLEN > 0 ' Loop while we still need to output data LOOP WHILE OUTLEN > 0 ' Clean up CLOSE OUTFILE CLOSE INFILE END SUB

SUB MAKHUF 'Mak a degenerate huffman tree, store as string huffq OPEN "HUFF.DD2" FOR BINARY AS #8 aq = "HUFF" PUT #8, 1, aq x = 9 FOR t = 0 TO 255 b = t va = 0 vb = 0 vc = 0 vd = 0 ve = 0 vf = 0 vg = 0 vh = 0 IF b > 127 THEN LET va = va + 1 b = b MOD 128 IF b > 63 THEN LET vb = vb + 1 b = b MOD 64 IF b > 31 THEN LET vc = vc + 1 b = b MOD 32 IF b > 15 THEN LET vd = vd + 1 b = b MOD 16 IF b > 7 THEN LET ve = ve + 1 b = b MOD 8 IF b > 3 THEN LET vf = vf + 1 b = b MOD 4 IF b > 1 THEN LET vg = vg + 1 b = b MOD 2 IF b = 1 THEN LET vh = vh + 1 b = (vh * 128) + (vg * 64) + (vf * 32) + (16 * ve) + (8 * vd) + (4 * vc) + (2 * vb) + va aq = MKI$(b) PUT #8, x, aq x = x + 2 NEXT t FOR t = 0 TO 253 aq = MKI$(t + 256) PUT #8, x, aq x = x + 2 NEXT t GET #8, 1, huffq CLOSE #8 KILL "HUFF.DD2" END SUB

### Visual Basic .NET

#### Huffman tree representation

This class, BinaryTreeNode, represents a binary tree whose branch nodes carry no value, like a Huffman dictionary tree.

Public Class BinaryTreeNode(Of T) ' By Fleexy, public domain, credit where credit is due Private Branch As Boolean Private Children As BinaryTreeNode(Of T)() Private HoldValue As T Public Sub New(LeafValue As T) Branch = False HoldValue = LeafValue End Sub Public Sub New(LeftChild As BinaryTreeNode(Of T), RightChild As BinaryTreeNode(Of T)) Branch = True Children = {LeftChild, RightChild} End Sub Public ReadOnly Property HasValue As Boolean Get Return Not Branch End Get End Property Public Property Value As T Get If Branch Then Throw New InvalidOperationException Return HoldValue End Get Set(value As T) If Branch Then Throw New InvalidOperationException HoldValue = value End Set End Property Public Property Child(Side As ChildSide) As BinaryTreeNode(Of T) Get If Not Branch Then Throw New InvalidOperationException Return Children(Side) End Get Set(value As BinaryTreeNode(Of T)) If Not Branch Then Throw New InvalidOperationException Children(Side) = value End Set End Property Public Enum ChildSide As Byte Left = 0 Right = 1 End Enum Public ReadOnly Property Count As Integer Get If Not Branch Then Return 1 Return Children(0).Count + Children(1).Count End Get End Property Public ReadOnly Property Depth As Integer Get If Not Branch Then Return 1 Return Math.Max(Children(0).Depth, Children(1).Depth) + 1 End Get End Property Public Overrides Function ToString() As String Return "{Count = " & Count & ", Depth = " & Depth & "}" End Function End Class

#### Huffman tree reading

This piece of code will read in a stored Huffman dictionary in the format described at the top of this article and store it in a BinaryTreeNode(Of Byte) as shown above.

' By Fleexy, public domain, credit where credit is due Dim fDict As New IO.FileStream(DictionaryFile, IO.FileMode.Open) Dim raw(254) As Tuple(Of UShort, UShort) For x = 0 To 254 raw(x) = Tuple.Create(ReadUShort(fDict), ReadUShort(fDict)) Next fDict.Close() Dim GenerateTree As Func(Of UShort, BinaryTreeNode(Of Byte)) GenerateTree = Function(NextNode As UShort) As BinaryTreeNode(Of Byte) Dim n As Tuple(Of UShort, UShort) = raw(NextNode) Dim a, b As BinaryTreeNode(Of Byte) If n.Item1 < 256 Then a = New BinaryTreeNode(Of Byte)(n.Item1) Else a = GenerateTree(n.Item1 - 256) End If If n.Item2 < 256 Then b = New BinaryTreeNode(Of Byte)(n.Item2) Else b = GenerateTree(n.Item2 - 256) End If Return New BinaryTreeNode(Of Byte)(a, b) End Function Dim dict As BinaryTreeNode(Of Byte) = GenerateTree(254) fDict.Close()