Extracting data files from executables

From ModdingWiki

This page describes how to find and extract data files embedded in executables. If an executable is compressed with an EXE packer (LZEXE, PKLite, ...), it must be decompressed before it can be searched for internal data files.

Extracting Huffman dictionaries

The head node of every Huffman tree/dictionary is always node 254 ($FE), and it points to a character ($00xx) and to node 253 ($01FD). So scan through the file and look for the byte string $FD $01 ($01FD in little endian). Once you have read that string, you should be at position 1020 in the dictionary file, so go back 1020 bytes to reach the start of the dictionary file. Now read 510 WORDs (the whole dictionary) and check that each of them is smaller than $0200 (and not negative). If the check fails for even one WORD, the data can't possibly be a Huffman dictionary. If they all pass, the data is VERY likely to be a Huffman dictionary. To extract it, seek to the starting position and read 1020 bytes (or 1024 bytes if you want).
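The scan described above can be sketched in Python roughly as follows. This is only an illustration, not a reference implementation: the function name find_huffman_dictionaries and the start parameter are made up for this example. The WORDs are read as unsigned values, so the "not negative" check is implicit.

```python
import struct

DICT_SIZE = 1020  # 255 nodes * 2 WORDs per node * 2 bytes per WORD


def find_huffman_dictionaries(data: bytes, start: int = 0) -> list:
    """Return the start offsets of all candidate Huffman dictionaries."""
    found = []
    pos = data.find(b"\xFD\x01", start)
    while pos != -1:
        # The marker is the last WORD of the dictionary, so the dictionary
        # starts 1020 bytes before the end of the marker.
        dict_start = pos + 2 - DICT_SIZE
        if dict_start >= 0:
            words = struct.unpack_from("<510H", data, dict_start)
            # Every WORD must be smaller than $0200.
            if all(word < 0x0200 for word in words):
                found.append(dict_start)
        pos = data.find(b"\xFD\x01", pos + 1)
    return found
```

To extract a dictionary, slice the data at a returned offset, e.g. data[offset:offset + DICT_SIZE].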

A faster method is to look for the string $FD $01 $00 $00 $00 $00 (this assumes that the dictionary is padded with nulls at the end to reach a file size of 1024 bytes) and then go back 1024 bytes from the end of that string to get to the start of the dictionary file. But since not all dictionaries are padded like this, some dictionaries will not be recognized by this method.
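A minimal Python sketch of this faster variant might look like the following. The function name find_padded_huffman_dictionaries is hypothetical, and the sketch does no further validation of the 510 WORDs, so its results should be treated as candidates only.

```python
PADDED_DICT_SIZE = 1024  # 1020 bytes of nodes + 4 null bytes of padding


def find_padded_huffman_dictionaries(data: bytes, start: int = 0) -> list:
    """Return start offsets of candidate null-padded Huffman dictionaries."""
    marker = b"\xFD\x01\x00\x00\x00\x00"
    found = []
    pos = data.find(marker, start)
    while pos != -1:
        # The marker ends exactly at the end of the padded dictionary.
        dict_start = pos + len(marker) - PADDED_DICT_SIZE
        if dict_start >= 0:
            found.append(dict_start)
        pos = data.find(marker, pos + 1)
    return found
```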

Extracting xGAHEAD/AUDIOHED files

To find the header file for graphics or audio, you need to know the size of the xGAGRAPH or AUDIO/AUDIOT file first. To find the header, use the knowledge that the last offset in the header file points to the end of the data file which, in most cases, is the size of the data file. However, if the last four bytes in the data file form the string "!ID!", you need to use the file size minus four for the following steps.

Note that there are 4-byte and 3-byte versions of the header files. This means that the offsets in the header are each four bytes or only three bytes long (always little endian!). Audio headers are always in 4-byte format. Graphics headers for early games that use this format (Keen Dreams engine and the Keen 4 Demo) are 4-byte while all later games (up to Wolfenstein 3D and Rise of the Triad) use the more compact 3-byte version.

Scan through the file byte by byte and read offsets (seek to position 0, read 3 or 4 bytes; seek to position 1, read 3 or 4 bytes ...) until you find an offset that contains the size of the data file. Remember the position of this offset.

Once you have found the file size, keep stepping back to the previous offset and reading it (after each read, go back 2*4 bytes or 2*3 bytes in the stream) until you find an offset that is 0. This would be the first offset in the header.

To extract the header, seek to the start of the first offset (the one that's 0) and read bytes until after the offset containing the file size.

If you want to verify that the data is correct header data, make sure that every offset is either $FFFFFFFF (or $FFFFFF for the 3-byte version) or something less than or equal to the size of the data file (and not negative). This can be done while searching for the start of the header file (0-offset).
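The search and the verification described in this section can be combined into one pass. The following Python sketch is an illustration under a few assumptions: the names effective_data_size and find_header are invented for this example, the width parameter (3 or 4) selects the header format, and the data size must fit into that many bytes.

```python
def effective_data_size(data_file: bytes) -> int:
    """Size to search for: the file size, minus 4 if the file ends in "!ID!"."""
    size = len(data_file)
    return size - 4 if data_file.endswith(b"!ID!") else size


def find_header(exe: bytes, data_size: int, width: int, start: int = 0) -> list:
    """Return (start, end) byte ranges of candidate headers inside exe."""
    unused = (1 << (8 * width)) - 1          # $FFFFFF or $FFFFFFFF
    target = data_size.to_bytes(width, "little")
    results = []
    pos = exe.find(target, start)            # byte-by-byte search for the size
    while pos != -1:
        end = pos + width                    # header ends after the size offset
        cur = pos
        while cur >= width:
            cur -= width                     # step back to the previous offset
            value = int.from_bytes(exe[cur:cur + width], "little")
            if value == 0:                   # first offset of the header
                results.append((cur, end))
                break
            if value != unused and value > data_size:
                break                        # not valid header data
        pos = exe.find(target, pos + 1)
    return results
```

A candidate header can then be extracted with exe[start:end] for each returned pair.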

Extracting MAPHEAD files

Locating the start of a MAPHEAD file is pretty easy, because they all start with a WORD giving the RLEW flag and this WORD usually is $ABCD, so just look for the string $CD $AB in the file. The starting position of this string is likely to be the start of the MAPHEAD file.

Automated calculation of the real size of a MAPHEAD file is almost impossible without knowing the size of the TileInfo data (which requires the number of Tile16's and Tile16M's and knowledge of the structure of the TileInfo planes). So it's better to read only the first 402 bytes of the MAPHEAD file (RLEW flag and 100 map offset DWORDS) if you don't know the exact size of the MAPHEAD file already.

If you want to verify that the data is correct MAPHEAD data, make sure that every offset is either $FFFFFFFF or something less than or equal to the size of the GAMEMAPS/MAPTEMP file (and not negative).
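As a rough Python sketch of the MAPHEAD search and check: the function name find_maphead and its parameters are made up for this example, the RLEW flag defaults to the usual $ABCD, and only the first 402 bytes (RLEW flag plus 100 map offsets) are examined.

```python
import struct

MAPHEAD_MIN_SIZE = 402  # RLEW flag WORD + 100 map offset DWORDs


def find_maphead(exe: bytes, maps_size: int,
                 rlew_flag: int = 0xABCD, start: int = 0) -> list:
    """Return start offsets of candidate MAPHEAD data inside exe."""
    marker = struct.pack("<H", rlew_flag)    # $CD $AB for the usual flag
    found = []
    pos = exe.find(marker, start)
    while pos != -1:
        if pos + MAPHEAD_MIN_SIZE <= len(exe):
            offsets = struct.unpack_from("<100I", exe, pos + 2)
            # Each offset must be $FFFFFFFF or lie within GAMEMAPS/MAPTEMP.
            if all(o == 0xFFFFFFFF or o <= maps_size for o in offsets):
                found.append(pos)
        pos = exe.find(marker, pos + 1)
    return found
```

The first 402 bytes of a candidate are then exe[offset:offset + MAPHEAD_MIN_SIZE].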

Extracting SOUNDS files

Locating the start of an internal SOUNDS file is simple. The file will start with the string 'SND' + $00 or 'SPK' + $00. The file length then follows. This can be used to extract the entire file; however, the sound file may be followed by 'unused' space in the executable that was set aside for it.

There are a number of ways to verify that the sound file has been found. Well-behaved sound files end with the string $FF $FF. The entries for sounds at the start of the file contain the value $08 at addresses (x * 16) + 3 for at least a dozen sounds. The pointer to the start of sound data at address 16 is equal to the number of sounds (at address 6) multiplied by 16. Each segment of sound data pointed to in the header also ends with $FF $FF.
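Two of these checks can be sketched in Python as shown below. This takes the description above literally and relies on an assumption that is not stated on this page: the values at addresses 6 and 16 are read as 16-bit little-endian WORDs. The function names are invented for this example, and the $FF $FF terminator checks are left out because they depend on the file length field, whose width is not given here.

```python
import struct

SIGNATURES = (b"SND\x00", b"SPK\x00")


def looks_like_sounds_file(exe: bytes, pos: int, min_sounds: int = 12) -> bool:
    """Apply the $08 check and the pointer check to a candidate at pos."""
    if exe[pos:pos + 4] not in SIGNATURES:
        return False
    if pos + (min_sounds + 1) * 16 > len(exe):
        return False
    # ASSUMPTION: both values are 16-bit little-endian WORDs.
    (num_sounds,) = struct.unpack_from("<H", exe, pos + 6)
    (first_data,) = struct.unpack_from("<H", exe, pos + 16)
    if first_data != num_sounds * 16:
        return False
    # Entries must hold $08 at (x * 16) + 3 for at least a dozen sounds.
    return all(exe[pos + x * 16 + 3] == 0x08
               for x in range(1, min_sounds + 1))


def find_sounds_files(exe: bytes, start: int = 0) -> list:
    """Return start offsets of candidate SOUNDS files inside exe."""
    found = []
    for signature in SIGNATURES:
        pos = exe.find(signature, start)
        while pos != -1:
            if looks_like_sounds_file(exe, pos):
                found.append(pos)
            pos = exe.find(signature, pos + 1)
    return sorted(found)
```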


When looking for data files in an executable, you can usually start searching "late" in the executable (e.g. from position 100000) to avoid false positives.