Recently I wrote a small tool (mediaextract) that can extract several file formats from arbitrary binary files. While writing this I had to look at many file format specifications. It quickly came apparent to me what are the dos and what the don'ts when it comes to binary file formats and their documentations, so I thought I summarize them here.
Binary files often have some kind of identifying signature at the beginning. Very often this is a four character code (FourCC). Sometimes it's a binary string of a different length and sometimes it is not right at the beginning. Sometimes there isn't any such thing. This signature is often called the file magic or magic number. Programs like
xdg-mime maintain a list of such signatures to identify file types.
A good file format should have an unambiguous magic number, preferable longer than 4 bytes. While FourCCs consisting of 7-bit ASCII letters makes it easy for humans to determine the file type when opening the file in a text editor, choosing values that are not ASCII letters makes the file (mostly) unambiguously a binary file format. PNG uses a 8 bytes long mixture that combines both points:
(decimal) 137 80 78 71 13 10 26 10 (hexadecimal) 89 50 4e 47 0d 0a 1a 0a (ASCII C notation) \211 P N G \r \n \032 \n
Microsoft's ASF on the other hand uses 16 byte long GUIDs, which is another nice choice.
Note that there are file formats that don't have the magic at the beginning of the file and there even formats that don't have any magic numbers at all. And there are formats that allow an fixed amount of arbitrary (comment) data at the beginning. Have fun detecting the type of such files.
There are two kinds of overall file structures that make sense to me as shown in the following figures:
The first has a limited file size that is known upfront (it is written in the file header). However, this limits the maximum file size to the maximum value of the type you have chosen for the size field. Another maybe better option is to segment the file into blocks. Each type of block should have it's own signature so you can easily distinguish between header blocks and data blocks. There should be also a way to determine the last block (the end of the file). E.g. a flag in the last block or a dedicated trailer block. The second variant is not only unlimited in it's size, it can also be used for streaming where the file size is simply not known upfront.
Note that it always makes sense to include a size field for each block or even for the header of the first variant. This way one can easily extend the file format (new block types, more fields in header) and old programs can still read the new files. A size field for the body is also necessary in order to detect whether the file was truncated.
Formats that use the first scheme are e.g. RIFF (WAV, AVI, RMI etc.), AIFF (=RIFF WAVE but big-endian) and BMP. Formats that use the second scheme but without a trailer are e.g. ASF, MP4 and Ogg (although it uses the same FourCC for each frame). Formats that use the second scheme including a trailer are e.g. PNG, GIF and with some limitations JPEG (the "scan" section has no size, but instead never includes a marker byte sequence so you can scan for the end by searching for such a sequence).
Note that there are other formats that don't have any size fields. In order to determine if files of these formats are truncated or contain junk at the end you have to fully parse and probably even decode them. And then there are formats that have data segments with headers but don't have a valid size in the header. E.g. in ZIP in some cases the size comes after the data of said size and the central directory comes at the end of the file. So cut a few bytes from the end or even add some and these ZIP files become unreadable. If you want to find/extract such ZIP files in/from a heap of data-junk you have to parse it backwards.
Redundancy is good
Strictly speaking things like checksums are redundant, but it always makes sense to include them. E.g. a CRC 32 checksum for each block in your file. If you have a nested block structure then a size field in a parent block is redundant, but it makes parsing the file more easy and allows some error detection. So does using field sizes that can hold more then defined range of values.
Don't over-complicate things
Define a byte order (big-endian or little-endian) for your format and stick to it. Also place your fields so that they fit the alignment rules of C to make parsing more easy (especially when
mmap is used for file reading). As a rule of thumb use field sizes of powers of 2 bytes and sort them from largest to smallest. Everything within such a structure should be aligned.
Note that I'm only talking about the file and block headers here. In the sections I simply labeled "Data" in the figures above you may as well use different field sizes and pack things together using bit operations etc. in order to save space. But the overhead introduced in the headers using the mentioned strategies should be minimal and helps a lot for file type detection and parsing.
In case your file format includes compression or encryption save at least the signatures and block sizes uncompressed/unencrypted.
The documentation of your file format should include the following things:
- Specify the byte order upfront.
- Make nice overview diagrams like the ones above.
- Write a detailed table for each block type/section of your format (see below).
- Explicitly state what block sizes define in particular. (E.g. all fields including the block size and signature or only all other fields.)
- Examples are always a good idea.
Example of a detailed description of a section using GIF:
|6||2||uint16_t||Logical screen width|
|8||2||uint16_t||Logical screen height|
|11||1||uint8_t||Background color index|
|12||1||uint8_t||Pixel aspect ratio|
|13||s*3||uint8_t[s]||Global color table if GCTF is 1|
There are many variations on such tables possible (e.g. you could merge the Type/Contents column with the Description column), but you should always write such tables. An informal description or even a more simple list is never so clear and quickly understandable.
Note that often people use
(U)QWORD as types. I rather use the C99 types
(u)int64_t because they are more explicit.