struct MetroArea {
int year;
char name[12];
char country[2];
float population;
};
Parsing binary records with struct
The struct module provides functions to parse fields of bytes into a tuple of Python objects,
and to perform the opposite conversion, from a tuple into packed bytes.
struct can be used with bytes, bytearray, and memoryview objects.
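As a minimal sketch of that round trip: the format string '<hf' and the values below are arbitrary choices for illustration, not part of the examples in this post.

```python
import struct

# Assumed for illustration: little-endian 16-bit int followed by a 32-bit float
FORMAT = '<hf'

packed = struct.pack(FORMAT, 2018, 1.5)  # values -> packed bytes
print(len(packed))                       # 2 bytes for 'h' + 4 bytes for 'f'

# unpack accepts bytes, bytearray, and memoryview alike
for buffer in (packed, bytearray(packed), memoryview(packed)):
    print(type(buffer).__name__, struct.unpack(FORMAT, buffer))
```

All three calls to unpack should produce the same tuple, because each argument exposes the same six bytes.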
The struct module is powerful and convenient, but before using it
you should seriously consider alternatives, so that’s the first short section in this post.
Should we use struct?
Proprietary binary records in the real world are brittle and can be corrupted easily. The super simple example in Struct 101 will expose one of many caveats: a string field may be limited only by its size in bytes, it may be padded by spaces, or it may contain a null-terminated string followed by random garbage up to a certain size. There is also the problem of endianness: the order of the bytes used to represent integers and floats, which depends on the CPU architecture.
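The endianness problem can be seen with a short sketch. The value 2018 is just a sample; the '<' and '>' prefixes are the byte order markers documented in the struct module.

```python
import struct
import sys

year = 2018  # 0x07E2

little = struct.pack('<i', year)  # least significant byte first
big = struct.pack('>i', year)     # most significant byte first
print(little)  # b'\xe2\x07\x00\x00'
print(big)     # b'\x00\x00\x07\xe2'

# Byte order of the machine running this code: 'little' or 'big'
print(sys.byteorder)

# Reading with the wrong byte order raises no error; it just gives a wrong value
wrong, = struct.unpack('>i', little)
print(wrong == year)  # False
```

Because a byte order mismatch fails silently, it is safer to state the byte order explicitly in the format string whenever the file format defines one.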
If you need to read or write from an existing binary format, I recommend trying to find a library that is ready to use instead of rolling your own solution.
If you need to exchange binary data among in-company Python systems, the pickle module is the easiest way—but beware that different versions of Python use different binary formats by default, and reading a pickle may run arbitrary code, so it’s not safe for external use.
If the exchange involves programs in other languages, use JSON or a multi-platform binary serialization format like MessagePack or Protocol Buffers.
Struct 101
Suppose you need to read a binary file containing data about metropolitan areas, produced by a program in C with a record defined as in Example 1 (the MetroArea struct shown at the top of this post).
Here is how to read one record in that format, using struct.unpack:
>>> from struct import unpack, calcsize
>>> FORMAT = 'i12s2sf'
>>> size = calcsize(FORMAT)
>>> data = open('metro_areas.bin', 'rb').read(size)
>>> data
b"\xe2\x07\x00\x00Tokyo\x00\xc5\x05\x01\x00\x00\x00JP\x00\x00\x11X'L"
>>> unpack(FORMAT, data)
(2018, b'Tokyo\x00\xc5\x05\x01\x00\x00\x00', b'JP', 43868228.0)
Note how unpack returns a tuple with four fields, as specified by the FORMAT string.
The letters and numbers in FORMAT are Format Characters described in the struct module documentation.
| part | size | C type | Python type | limits to actual content |
|---|---|---|---|---|
| i | 4 bytes | int | int | 32 bits; range -2,147,483,648 to 2,147,483,647 |
| 12s | 12 bytes | char[12] | bytes | length = 12 |
| 2s | 2 bytes | char[2] | bytes | length = 2 |
| f | 4 bytes | float | float | 32 bits; approximate range ± 3.4×10³⁸ |
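The field sizes listed above add up to 22 bytes, yet the record shown in Example 2 is 24 bytes long: two b'\x00' bytes sit between b'JP' and the float. That is consistent with alignment padding, which struct applies in native mode (no prefix in the format string), much as a C compiler would. The sketch below compares the sizes; the native result depends on the platform, so no value is claimed for it.

```python
from struct import calcsize

native = calcsize('i12s2sf')     # native mode: may include alignment padding
standard = calcsize('<i12s2sf')  # standard sizes, no padding: 4 + 12 + 2 + 4
same = calcsize('=i12s2sf')      # native byte order, standard sizes, no padding

print(native, standard, same)
```

On platforms where a float must start at an offset that is a multiple of 4, the native size would be 24, which matches the data in Example 2.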
One detail about the layout of metro_areas.bin is not clear from the code in Example 1:
size is not the only difference between the name and country fields.
The country field always holds a 2-letter country code,
but name is a null-terminated sequence with up to 12 bytes including the terminating
b'\0'—which you can see in Example 2 right after the word
Tokyo.[1]
Now let’s review a script to extract all records from metro_areas.bin and produce a simple report like this:
$ python3 metro_read.py
2018 Tokyo, JP 43,868,228
2015 Shanghai, CN 38,700,000
2015 Jakarta, ID 31,689,592
Example 3 showcases the handy struct.iter_unpack function.
from struct import iter_unpack

FORMAT = 'i12s2sf'                             # (1)

def text(field: bytes) -> str:                 # (2)
    octets = field.split(b'\0', 1)[0]          # (3)
    return octets.decode('cp437')              # (4)

with open('metro_areas.bin', 'rb') as fp:      # (5)
    data = fp.read()

for fields in iter_unpack(FORMAT, data):       # (6)
    year, name, country, pop = fields
    place = text(name) + ', ' + text(country)  # (7)
    print(f'{year}\t{place}\t{pop:,.0f}')
(1) The struct format.
(2) Utility function to decode and clean up the bytes fields; returns a str.[2]
(3) Handle null-terminated C string: split once on b'\0', then take the first part.
(4) Decode bytes into str.
(5) Open and read the whole file in binary mode; data is a bytes object.
(6) iter_unpack(…) returns a generator that produces one tuple of fields for each sequence of bytes matching the format string.
(7) The name and country fields need further processing by the text function.
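To experiment without the original metro_areas.bin, one option is to build equivalent data in memory. The sketch below is an assumption about how such test data could be made, not how the file was produced: it packs the three rows from the report with the same format, then reads them back with iter_unpack. Note that pack fills short 's' fields with null bytes, so this data has no garbage after the terminator.

```python
from struct import Struct

record = Struct('i12s2sf')  # same format string as the script, compiled once

# Sample rows copied from the report shown earlier
rows = [
    (2018, b'Tokyo', b'JP', 43868228.0),
    (2015, b'Shanghai', b'CN', 38700000.0),
    (2015, b'Jakarta', b'ID', 31689592.0),
]

# Each pack call returns record.size bytes; 's' fields are null-padded
data = b''.join(record.pack(*row) for row in rows)

for year, name, country, pop in record.iter_unpack(data):
    city = name.split(b'\0', 1)[0].decode('cp437')
    code = country.decode('cp437')
    print(f'{year}\t{city}, {code}\t{pop:,.0f}')
```

Using a Struct instance avoids repeating the format string in every call; its pack and iter_unpack methods mirror the module-level functions.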
The struct module provides no way to specify null-terminated string fields.
When processing a field like name in the example above,
after unpacking we need to inspect the returned bytes to discard the first b'\0' and all bytes after it in that field.
It is quite possible that bytes after the first b'\0' and up to the end of the field are garbage. You can actually see that in Example 2.
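A short sketch using the name field from Example 2 shows why stripping trailing nulls is not enough. The comparison with rstrip is added here for illustration.

```python
# The 12-byte name field exactly as unpacked in Example 2
field = b'Tokyo\x00\xc5\x05\x01\x00\x00\x00'

print(field.split(b'\0', 1)[0])   # b'Tokyo'
print(field.partition(b'\0')[0])  # b'Tokyo' (an equivalent alternative)

# rstrip only removes nulls at the end, so the garbage bytes survive
print(field.rstrip(b'\0'))        # b'Tokyo\x00\xc5\x05\x01'
```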
Memory views can make it easier to experiment and debug programs using struct, as the next section explains.
Structs and Memory Views
Python’s memoryview type does not let you create or store byte sequences.
Instead, it provides shared memory access to slices
of data from other binary sequences, packed arrays,
and buffers such as Python Imaging Library (PIL) images,[3] without copying the bytes.
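The sharing can be observed with a writable buffer. In this sketch the bytearray contents are arbitrary sample data; a change made through a slice of the memoryview shows up in the underlying object, which would not happen if the slice were a copy.

```python
buffer = bytearray(b'abcdef')  # arbitrary sample data
view = memoryview(buffer)
part = view[2:4]               # new memoryview over the same memory; no copy

part[0] = ord('X')             # write through the slice...
print(buffer)                  # ...and the bytearray changes: bytearray(b'abXdef')

# Release the views explicitly when done with them
part.release()
view.release()
```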
Example 4 shows the use of memoryview and struct together to extract the width and height of a GIF image.
>>> import struct
>>> fmt = '<3s3sHH' # (1)
>>> with open('filter.gif', 'rb') as fp:
... img = memoryview(fp.read()) # (2)
...
>>> header = img[:10] # (3)
>>> bytes(header) # (4)
b'GIF89a+\x02\xe6\x00'
>>> struct.unpack(fmt, header) # (5)
(b'GIF', b'89a', 555, 230)
>>> del header # (6)
>>> del img
(1) struct format: < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit integers.
(2) Create memoryview from file contents in memory…
(3) …then another memoryview by slicing the first one; no bytes are copied here.
(4) Convert to bytes for display only; 10 bytes are copied here.
(5) Unpack memoryview into tuple of: type, version, width, and height.
(6) Delete references to release the memory associated with the memoryview instances.
Note that slicing a memoryview returns a new memoryview, without copying bytes.[4]
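Another way to read fields from the middle of a buffer without copying is struct.unpack_from, which takes an offset. This is an alternative to the slicing approach above, sketched here with the 10 header bytes from Example 4 followed by made-up filler bytes standing in for the rest of the file.

```python
import struct

HEADER_FORMAT = '<3s3sHH'

# Header bytes from Example 4, plus 8 filler bytes (an assumption for this sketch)
data = b'GIF89a+\x02\xe6\x00' + bytes(8)

# Reads calcsize(HEADER_FORMAT) bytes starting at the given offset
fields = struct.unpack_from(HEADER_FORMAT, data, offset=0)
print(fields)  # (b'GIF', b'89a', 555, 230)

# Skip the 6-byte signature and read only width and height
width, height = struct.unpack_from('<HH', data, offset=6)
print(width, height)  # 555 230
```

Unlike unpack, unpack_from does not require the buffer length to match the format size exactly, so trailing data is not a problem.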
I will not go deeper into memoryview or the struct module,
but if you work with binary data, you’ll find it worthwhile to study their docs:
Built-in Types » Memory Views and struct — Interpret bytes as packed binary data.
\0 and \x00 are two valid escape sequences for the null character, chr(0), in a Python str or bytes literal.
mmap module to open the image as a memory-mapped file. That module is outside the scope of this book, but if you read and change binary files frequently, learning more about mmap — Memory-mapped file support will be very fruitful.