There is a close but complicated relationship between bytes and strings. Find out how it all works in this extract from my new book Programmer's Python: Everything is Data.
It is now time to look at grouping bits together to form bytes. Python is unusual in that it can represent a bit pattern of arbitrary length using its bignum data type. Most other languages have to build up bigger bit patterns by using multiple smaller units, in most cases the byte, i.e. 8 bits. Python has a range of data objects which can be used to work with bytes, and the good news is that the majority are sequences, which makes them more sophisticated than you might expect.
There is also a strong relationship between objects that work in terms of bytes and strings. In particular, this chapter is where we come to grips with encodings and learn how to deal with legacy protocols that work in terms of ASCII.
A second aspect of working with bytes is dealing with the raw buffers that other programs and devices might expose to a Python program. This is where the memoryview comes into its own.
Bytes and Bytearray
The bytearray is a mutable sequence of elements that are bytes. The bytes object is its immutable counterpart. They both work in much the same way with the exception of any operations which would attempt to modify the immutable bytes object.
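As a minimal sketch of the difference, the following uses made-up values and hypothetical variable names to show an in-place change succeeding on a bytearray and failing on a bytes object:

```python
# Illustrative values only; the variable names are hypothetical.
frozen = bytes([1, 2, 3])        # immutable sequence of bytes
buffer = bytearray([1, 2, 3])    # mutable sequence of bytes

buffer[0] = 255                  # in-place modification is allowed
buffer.append(4)                 # and the sequence can grow

try:
    frozen[0] = 255              # bytes does not support item assignment
    modified = True
except TypeError:
    modified = False

print(buffer)                    # bytearray(b'\xff\x02\x03\x04')
print(modified)                  # False
```

Reading operations such as indexing, slicing and len behave the same way on both types.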
The way that bytes and bytearray work is mostly determined by how things used to work in the days of ASCII text. It was usual for programming languages to make use of strings to provide arrays of byte values. Each extended ASCII character in the string was used to code for a value between 0 and 255, i.e. a single byte.
So for example the string "Hello World" has the ASCII representation:
['0x48', '0x65', '0x6c', '0x6c', '0x6f', '0x20', '0x57', '0x6f',
'0x72', '0x6c', '0x64']
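One way a list like this could be produced, assuming the text is first encoded to ASCII so that each character becomes one byte:

```python
text = "Hello World"

# Encoding to ASCII yields one byte per character; iterating over a
# bytes object yields integers, which hex() turns into strings.
codes = [hex(byte) for byte in text.encode("ascii")]
print(codes)
```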
In other languages, this string would be used as if it was a byte array. In Python, strings are sequences of Unicode characters rather than bytes, so the simplicity of one character to a byte is lost, which makes using strings as byte arrays difficult. The Python objects bytes and bytearray essentially implement ASCII strings so that you can continue to use them as if they were arrays of bytes.
This explains the form of the bytes literal which essentially adds a b to the start of the string literal and restricts the characters you can use to the ASCII encoding. For example, the bytes literal for Hello World is:
myBytes = b"Hello World"
This creates a bytes object initialized to the bytes determined by the ASCII codes of each character in the string Hello World. The restriction to strict ASCII characters means that the maximum value that can be specified directly is 127. If you want to specify a value greater than this you need to use an escape code. Bytes literals support the same escape codes as string literals, apart from the Unicode-specific ones. Particularly relevant are:
\ooo Character with octal value ooo
\xhh Character with hex value hh
These allow you to go beyond 127. For example,
myBytes = b"\xFF\xFF"
creates a myBytes object with 255 stored in each element.
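A short sketch to check this, which also shows the octal form and what happens if a non-ASCII character is typed directly into a bytes literal. The compile call is used here only so that the syntax error can be caught at run time:

```python
myBytes = b"\xFF\xFF"
print(list(myBytes))             # [255, 255]

# The octal escape \377 is the same value as the hex escape \xFF.
octal = b"\377"
print(octal == b"\xFF")          # True

# A non-ASCII character written directly in a bytes literal is a
# syntax error, so the literal is compiled from a string to catch it.
try:
    compile('b"é"', "<example>", "eval")
    accepted = True
except SyntaxError:
    accepted = False
print(accepted)                  # False
```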
There are other ways than a bytes literal to create an initialized bytes or bytearray object. In both cases you can use the constructor to convert a suitable iterable into a bytes or bytearray object. The format is:
b = bytes(source)
b = bytearray(source)
where source is an iterable. Of course, as each element has to be 0 to 255 this also has to be true of the source. For example, bytes([0xFF,0xAA,0x55]) creates a three-element bytes object with the elements as indicated in the list. As an alternative you can use the fromhex class method which allows you to specify hex values using a string:
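A minimal sketch of the constructors in use, with hypothetical variable names, including what happens when an element falls outside the range 0 to 255:

```python
values = [0xFF, 0xAA, 0x55]

fixed = bytes(values)            # immutable copy of the values
editable = bytearray(values)     # mutable copy of the values
counted = bytes(range(4))        # any iterable of small ints will do

# An element outside 0 to 255 is rejected with a ValueError.
try:
    bytes([256])
    in_range = True
except ValueError:
    in_range = False

print(list(fixed))               # [255, 170, 85]
print(list(counted))             # [0, 1, 2, 3]
print(in_range)                  # False
```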
bytearray.fromhex("FF AA 55")
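As a quick check, the sketch below uses fromhex on both types. The hex digits are not case-sensitive and the spaces between bytes are optional:

```python
fromHex = bytearray.fromhex("FF AA 55")
compact = bytes.fromhex("ffaa55")    # same bytes, no spaces, lower case

print(list(fromHex))                 # [255, 170, 85]
print(fromHex == compact)            # True
```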
If you aren’t familiar with ASCII, or if the byte sequence you want is specified as numeric values, these are a more direct way of creating bytes and bytearrays. However, you can’t completely escape the relationship between ASCII strings and byte arrays. If you print a bytes or bytearray object then it is printed as an ASCII string, apart from any elements that are greater than 127, which are printed as \xhh where hh are hex digits. Similarly any non-printable ASCII characters are displayed as escape sequences. This can be confusing. For example:
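A minimal sketch of the effect, using byte values chosen here purely for illustration:

```python
# Five numeric values, two of which happen to be printable ASCII codes.
data = bytes([0x00, 0x48, 0x69, 0xFF, 0x0A])

print(data)                      # b'\x00Hi\xff\n'
print(bytearray(data))           # bytearray(b'\x00Hi\xff\n')
```

The values 0x48 and 0x69 are displayed as the characters H and i, while the rest appear as escape sequences.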
It is very easy to miss ASCII characters embedded in long sequences of escape codes.
If you want to see the bytes in hex, which is easier to read if you are happy with hex, you can use the hex method, which returns a string of hex digits:
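For instance, with some illustrative byte values:

```python
data = bytes([0x00, 0x48, 0x69, 0xFF, 0x0A])

hexText = data.hex()             # two hex digits per byte, no prefix
print(hexText)                   # 004869ff0a
```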
This doesn’t have any escape codes, but it doesn’t separate the bytes. You can specify a separator character and how many bytes should be grouped between separators using the two parameters of the hex method. For example:
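A short sketch of both parameters, assuming Python 3.8 or later, where they were introduced. The byte values are illustrative:

```python
data = bytes([0x00, 0x48, 0x69, 0xFF, 0x0A])

# A separator on its own is placed between every byte.
spaced = data.hex(" ")
print(spaced)                    # 00 48 69 ff 0a

# A positive group size counts bytes from the right-hand end.
fromRight = data.hex("-", 2)
print(fromRight)                 # 00-4869-ff0a

# A negative group size counts bytes from the left-hand end.
fromLeft = data.hex("-", -2)
print(fromLeft)                  # 0048-69ff-0a
```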
Alternatively, you can convert to a list and format the result as needed.
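One possible way to do this, as a sketch. The 0x prefix and two-digit upper-case format are arbitrary choices made for this illustration:

```python
data = bytes([0x00, 0x48, 0x69, 0xFF, 0x0A])

asList = list(data)              # plain integers, one per byte
print(asList)                    # [0, 72, 105, 255, 10]

# Format each value as a two-digit upper-case hex number.
formatted = ", ".join(f"0x{value:02X}" for value in asList)
print(formatted)                 # 0x00, 0x48, 0x69, 0xFF, 0x0A
```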
Notice that you can create a bytearray of a given size, with every element set to zero, using:
b = bytearray(n)
where n specifies the number of elements.
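A small sketch, with an arbitrary size, showing that each element starts out as zero and can then be filled in:

```python
n = 4
b = bytearray(n)                 # n elements, each initially 0
print(list(b))                   # [0, 0, 0, 0]

b[1] = 0x41                      # fill in an element afterwards
print(b)                         # bytearray(b'\x00A\x00\x00')

zeros = bytes(n)                 # the immutable equivalent
print(zeros)                     # b'\x00\x00\x00\x00'
```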