Parsing binary records with struct

The struct module provides functions to parse packed bytes into a tuple of Python objects, and to perform the opposite conversion, from a tuple into packed bytes. struct works with bytes, bytearray, and memoryview objects.
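As a minimal sketch of both directions, the snippet below packs two values and unpacks them again. The format '<H4s' and the values are made up for illustration; they are not part of the metro areas example that follows.

```python
import struct

# Hypothetical record: a 16-bit unsigned integer followed by a 4-byte field.
# '<' selects little-endian byte order with no padding.
FORMAT = '<H4s'

packed = struct.pack(FORMAT, 513, b'ABCD')
print(packed)   # b'\x01\x02ABCD'

fields = struct.unpack(FORMAT, packed)
print(fields)   # (513, b'ABCD')

# unpack accepts bytes, bytearray, and memoryview alike.
assert struct.unpack(FORMAT, bytearray(packed)) == fields
assert struct.unpack(FORMAT, memoryview(packed)) == fields
```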

The struct module is powerful and convenient, but before using it you should seriously consider alternatives, so that’s the first short section in this post.

Contents:

Should we use struct?
Struct 101
Structs and Memory Views

Should we use `struct`?

Proprietary binary records in the real world are brittle and can be corrupted easily. The super simple example in Struct 101 will expose one of many caveats: a string field may be limited only by its size in bytes, it may be padded by spaces, or it may contain a null-terminated string followed by random garbage up to a certain size. There is also the problem of endianness: the order of the bytes used to represent integers and floats, which depends on the CPU architecture.
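To make the endianness problem concrete, here is a small sketch that packs the same integer with explicit little-endian ('<') and big-endian ('>') prefixes. The value 2018 is only an illustration.

```python
import struct

# The same 32-bit integer packed with explicit byte orders.
little = struct.pack('<i', 2018)
big = struct.pack('>i', 2018)
print(little)   # b'\xe2\x07\x00\x00'
print(big)      # b'\x00\x00\x07\xe2'

# Reading with the wrong byte order raises no error:
# it silently produces a different number.
wrong = struct.unpack('>i', little)[0]
print(wrong == 2018)   # False
```

Because a wrong guess fails silently, it helps to state the byte order explicitly in the format string whenever the file format defines one.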

If you need to read or write from an existing binary format, I recommend trying to find a library that is ready to use instead of rolling your own solution.

If you need to exchange binary data among in-company Python systems, the pickle module is the easiest way—but beware that different versions of Python use different binary formats by default, and reading a pickle may run arbitrary code, so it’s not safe for external use.
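One way to cope with the changing default is to pin the protocol explicitly, as in this sketch. The record and the choice of protocol 4 are assumptions for illustration; pick a protocol that every Python version in your exchange supports.

```python
import pickle

# Hypothetical record to exchange between in-company systems.
record = {'year': 2018, 'name': 'Tokyo', 'country': 'JP'}

# Pin the protocol instead of relying on the default,
# which varies across Python versions.
payload = pickle.dumps(record, protocol=4)

# Only unpickle data that comes from a trusted source.
restored = pickle.loads(payload)
print(restored == record)   # True

# The default and highest protocols of the running interpreter.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```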

If the exchange involves programs in other languages, use JSON or a multi-platform binary serialization format like MessagePack or Protocol Buffers.

Struct 101

Suppose you need to read a binary file containing data about metropolitan areas, produced by a program in C with a record defined as in Example 1.

Example 1. MetroArea: a struct in the C language.

struct MetroArea {
    int year;
    char name[12];
    char country[2];
    float population;
};

Here is how to read one record in that format, using struct.unpack:

Example 2. Reading a C struct in the Python console.

>>> from struct import unpack, calcsize
>>> FORMAT = 'i12s2sf'
>>> size = calcsize(FORMAT)
>>> data = open('metro_areas.bin', 'rb').read(size)
>>> data
b"\xe2\x07\x00\x00Tokyo\x00\xc5\x05\x01\x00\x00\x00JP\x00\x00\x11X'L"
>>> unpack(FORMAT, data)
(2018, b'Tokyo\x00\xc5\x05\x01\x00\x00\x00', b'JP', 43868228.0)

Note how unpack returns a tuple with four fields, as specified by the FORMAT string. The letters and numbers in FORMAT are Format Characters described in the struct module documentation.

Table 1 explains the elements of the format string from Example 2.

Table 1. Parts of the format string `'i12s2sf'`.
part	size	C type	Python type	limits to actual content
`i`	4 bytes	`int`	`int`	32 bits; range -2,147,483,648 to 2,147,483,647
`12s`	12 bytes	`char[12]`	`bytes`	length = 12
`2s`	2 bytes	`char[2]`	`bytes`	length = 2
`f`	4 bytes	`float`	`float`	32 bits; approximate range ± 3.4×10³⁸
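The sizes in Table 1 add up to 22 bytes, but the record shown in Example 2 has 24: there are two b'\x00' bytes between b'JP' and the float. Without a prefix, struct uses native mode and inserts padding the way the C compiler would, so that the float starts at an aligned offset. The sketch below compares the two layouts; the native size depends on the platform, so I print it without asserting a value. The format '<i12s2s2xf' is my own spelling of the same layout with explicit pad bytes (x), assuming a little-endian file.

```python
from struct import calcsize, unpack

# Standard sizes with no padding: 4 + 12 + 2 + 4.
print(calcsize('<i12s2sf'))    # 22

# Native mode: platform dependent, commonly 24 because of alignment padding.
native = calcsize('i12s2sf')
print(native)

# Same layout spelled out portably: 2x marks two pad bytes.
EXPLICIT = '<i12s2s2xf'
print(calcsize(EXPLICIT))      # 24

# The record from Example 2, parsed with the explicit format.
data = b"\xe2\x07\x00\x00Tokyo\x00\xc5\x05\x01\x00\x00\x00JP\x00\x00\x11X'L"
fields = unpack(EXPLICIT, data)
print(fields[0], fields[2], fields[3])   # 2018 b'JP' 43868228.0
```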

One detail about the layout of metro_areas.bin is not clear from the code in Example 1: size is not the only difference between the name and country fields. The country field always holds a 2-letter country code, but name is a null-terminated sequence with up to 12 bytes including the terminating b'\0'—which you can see in Example 2 right after the word Tokyo.^[1]

Now let’s review a script to extract all records from metro_areas.bin and produce a simple report like this:

$ python3 metro_read.py
2018    Tokyo, JP       43,868,228
2015    Shanghai, CN    38,700,000
2015    Jakarta, ID     31,689,592

Example 3 showcases the handy struct.iter_unpack function.

Example 3. metro_read.py: list all records from metro_areas.bin

from struct import iter_unpack

FORMAT = 'i12s2sf'                             # (1)

def text(field: bytes) -> str:                 # (2)
    octets = field.split(b'\0', 1)[0]          # (3)
    return octets.decode('cp437')              # (4)

with open('metro_areas.bin', 'rb') as fp:      # (5)
    data = fp.read()

for fields in iter_unpack(FORMAT, data):       # (6)
    year, name, country, pop = fields
    place = text(name) + ', ' + text(country)  # (7)
    print(f'{year}\t{place}\t{pop:,.0f}')

(1) The struct format.
(2) Utility function to decode and clean up the bytes fields; returns a str.^[2]
(3) Handle null-terminated C string: split once on b'\0', then take the first part.
(4) Decode bytes into str.
(5) Open and read the whole file in binary mode; data is a bytes object.
(6) iter_unpack(…) returns a generator that produces one tuple of fields for each sequence of bytes matching the format string.
(7) The name and country fields need further processing by the text function.

The struct module provides no way to specify null-terminated string fields. When processing a field like name in the example above, after unpacking we need to inspect the returned bytes to discard the first b'\0' and all bytes after it in that field. It is quite possible that bytes after the first b'\0' and up to the end of the field are garbage. You can actually see that in Example 2.
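If you don't have metro_areas.bin at hand, the following sketch builds two records in memory with struct.pack and runs the same cleanup on them. The explicit format '<i12s2s2xf' is an assumption that mirrors the 24-byte layout seen in Example 2. Note that pack pads short bytes values with nulls, and that iter_unpack requires a buffer whose length is a multiple of the record size.

```python
from struct import pack, iter_unpack

FORMAT = '<i12s2s2xf'   # assumed explicit layout: little-endian, 2 pad bytes

def text(field: bytes) -> str:
    """Keep the bytes before the first null, then decode them."""
    return field.split(b'\0', 1)[0].decode('cp437')

# Garbage after the first null is discarded.
print(text(b'Tokyo\x00\xc5\x05\x01\x00\x00\x00'))   # Tokyo

# Two sample records built in memory instead of read from a file.
data = pack(FORMAT, 2018, b'Tokyo', b'JP', 43868228.0)
data += pack(FORMAT, 2015, b'Shanghai', b'CN', 38700000.0)

lines = []
for year, name, country, pop in iter_unpack(FORMAT, data):
    lines.append(f'{year}\t{text(name)}, {text(country)}\t{pop:,.0f}')

print('\n'.join(lines))
```

A space-padded field would need a different cleanup, such as rstrip, which is one more reason to check how each string field in your format is actually filled.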

Memory views can make it easier to experiment and debug programs using struct, as the next section explains.

Structs and Memory Views

Python’s memoryview type does not let you create or store byte sequences. Instead, it provides shared memory access to slices of data from other binary sequences, packed arrays, and buffers such as Python Imaging Library (PIL) images,^[3] without copying the bytes.
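A quick way to see the sharing is to write through a slice of a memoryview and then look at the underlying object. This sketch uses a small bytearray with made-up content, since a memoryview over bytes is read-only.

```python
# A mutable buffer with sample content (first 10 bytes of a GIF header).
buffer = bytearray(b'GIF89a+\x02\xe6\x00')

view = memoryview(buffer)
part = view[6:8]      # a new memoryview over the same memory

part[0] = 0x2c        # write through the slice...
print(buffer[6])      # 44: ...and the original bytearray changed

# Release the views when done, so the bytearray can be resized again.
part.release()
view.release()
```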

Example 4 shows the use of memoryview and struct together to extract the width and height of a GIF image.

Example 4. Using memoryview and struct to inspect a GIF image header

>>> import struct
>>> fmt = '<3s3sHH'  # (1)
>>> with open('filter.gif', 'rb') as fp:
...     img = memoryview(fp.read())  # (2)
...
>>> header = img[:10]  # (3)
>>> bytes(header)  # (4)
b'GIF89a+\x02\xe6\x00'
>>> struct.unpack(fmt, header)  # (5)
(b'GIF', b'89a', 555, 230)
>>> del header  # (6)
>>> del img

(1) struct format: < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit integers.
(2) Create memoryview from file contents in memory…
(3) …then another memoryview by slicing the first one; no bytes are copied here.
(4) Convert to bytes for display only; 10 bytes are copied here.
(5) Unpack memoryview into tuple of: type, version, width, and height.
(6) Delete references to release the memory associated with the memoryview instances.

Note that slicing a memoryview returns a new memoryview, without copying bytes.^[4]
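When you only need a few fields at a known position, struct.unpack_from can read them at an offset, so no slice is needed at all. The sketch below uses synthetic bytes in place of a real GIF file; only the first 10 bytes match the header shown in Example 4.

```python
import struct

# Synthetic stand-in for file contents: a GIF header plus filler bytes.
raw = b'GIF89a+\x02\xe6\x00' + bytes(20)
img = memoryview(raw)

# Width and height are two little-endian 16-bit integers at offset 6.
width, height = struct.unpack_from('<HH', img, 6)
print(width, height)   # 555 230
```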

I will not go deeper into memoryview or the struct module, but if you work with binary data, you’ll find it worthwhile to study their docs: Built-in Types » Memory Views and struct — Interpret bytes as packed binary data.

1. \0 and \x00 are two valid escape sequences for the null character, chr(0), in a Python str or bytes literal.

2. This is the first example using type hints in a function signature in this book. Simple type hints like these are quite readable and almost intuitive.

3. Pillow is PIL’s most active fork.

4. Leonardo Rochael—one of the technical reviewers—pointed out that even less byte copying would happen if I used the mmap module to open the image as a memory-mapped file. That module is outside the scope of this book, but if you read and change binary files frequently, learning more about mmap — Memory-mapped file support will be very fruitful.
