Programmer's Python Data

Programmer's Python Data - Native Code

Written by Mike James

Monday, 20 March 2023

Complex Data Types

Passing simple data types is easy. Problems generally arise when a C function accepts a data structure such as a struct, array or buffer. The important thing to realize is that, however complex the data structure, it is just a byte sequence, and what it represents is purely a matter of interpretation and use.
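
To make the "just a byte sequence" idea concrete, here is a minimal sketch that views the same eight bytes in two different ways using only ctypes. The variable names are illustrative and no C library is involved:

```python
import ctypes
import sys

# Eight raw bytes: the integers 1 and 2 as 32-bit values in native byte order.
raw = (1).to_bytes(4, sys.byteorder) + (2).to_bytes(4, sys.byteorder)

# Interpretation 1: two 32-bit integers.
as_ints = (ctypes.c_int32 * 2).from_buffer_copy(raw)

# Interpretation 2: eight unsigned bytes.
as_bytes = (ctypes.c_ubyte * 8).from_buffer_copy(raw)

print(list(as_ints))
print(list(as_bytes) == list(raw))
```

The bytes are identical in both cases. Only the ctype used to read them differs.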

The ctypes module directly supports the two most common complex data types, arrays and structures/unions. When it comes to strings, however, you are mostly on your own. The reason is that in C a string is just a special form of array, an array of char. As a char is by definition one byte in C, a string is just an array of bytes. As this is the case, let’s start with a look at the array.

The Array

An array is simply a byte sequence composed of a single data type. So in C you might declare an array as:

int myArray[10];

which would give you an array of ten integers. This might look like a Python list, but remember that a Python list can have any data type as its elements whereas an array can only have a single type.

Python lets you create a ctype class that can produce instances suitable for marshaling. All you have to do is use the overloaded multiplication operator on one of the basic ctypes. For example:

tenInts=ctypes.c_int*10

creates a ctype that stores ten integers, c_ints to be precise. The new type can be used to create an instance:

myArray=tenInts(0,1,2,3,4,5,6,7,8,9)

The initializer really should be a sequence of c_ints, but Python converts the integers automatically. In general, Python implements reasonable conversion rules for array types. The new ctype behaves like a standard Python sequence and you can use it as if it were one. For example:

myArray=tenInts()
for i in range(0,10):
    myArray[i]=ctypes.c_int(i)

and

for x in myArray:
    print(x)

The first for loop is a very general way of initializing arrays with complex elements.
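
As a rough illustration of how closely a ctypes array follows the sequence protocol, the sketch below uses only standard ctypes features. The names TenInts and my_array are chosen here for the example:

```python
import ctypes

TenInts = ctypes.c_int * 10
my_array = TenInts(*range(10))   # unpack any iterable as the initializer

print(len(my_array))             # number of elements
print(my_array[3])               # indexing returns a plain Python int
print(my_array[2:5])             # slicing returns a Python list
print(sum(my_array))             # iteration works as for any sequence

# The size in bytes is the element size times the element count.
print(ctypes.sizeof(my_array) == 10 * ctypes.sizeof(ctypes.c_int))

# Unlike a list, the array cannot grow: indexing past the end is an error.
try:
    my_array[10] = 1
except IndexError as err:
    print("out of range:", err)
```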

Once you have an array you can pass it to a C function that makes use of it. For example, to sum the elements of the array:

__declspec(dllexport) int sumArray(int[], int);
int sumArray(int myArray[], int n)
{
    int temp=0;
    for(int i=0;i<n;i++){
        temp = temp + myArray[i];
    }
    return temp;
}

The Linux version is the same minus the __declspec line.

After rebuilding the library we can make use of this function, but there is one small complication. In C, arrays are not passed by value. Instead a pointer to the array is passed to the function and we have to do the same thing in Python. There are two ways of doing this, by using a Pointer ctype or by using the byref function. The byref function is more efficient in this case as we don’t need the additional features of the Pointer ctype:

tenInts=ctypes.c_int*10
myArray=tenInts()
for i in range(0,10):
    myArray[i]=ctypes.c_int(i)
s=lib.sumArray(ctypes.byref(myArray),10)
print(s)

Notice the use of the byref function to pass a pointer to myArray, which is what the C function expects. Assuming that myArray is set to 0 to 9, the result displayed is 45.

Notice that in this case the marshaling provided by the tenInts ctype also ensures that the array is available as a block of memory that can be safely used by the C program. If you are wondering how to create an array with a variable number of elements, this is something that C doesn’t support. Arrays in C are usually forced to be fixed in size (there are exceptions) and you should adopt the same rule in Python.
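
If the number of elements is only known at run time, one possible approach is to build the array ctype on the fly. Each array is still fixed in size once created. The helper make_int_array below is a hypothetical name introduced for this sketch, and the commented-out lines assume the article's library has been loaded as lib:

```python
import ctypes

def make_int_array(values):
    """Build a c_int array whose length matches the given Python sequence."""
    array_type = ctypes.c_int * len(values)
    return array_type(*values)

numbers = make_int_array([3, 1, 4, 1, 5])

# View the array through a pointer to its first element, as a C function would.
first = ctypes.cast(numbers, ctypes.POINTER(ctypes.c_int))
total = sum(first[i] for i in range(len(numbers)))
print(len(numbers), total)

# With the library loaded as lib, the call could be declared and made as:
#   lib.sumArray.argtypes = [ctypes.POINTER(ctypes.c_int), ctypes.c_int]
#   lib.sumArray.restype = ctypes.c_int
#   lib.sumArray(numbers, len(numbers))
```

Setting argtypes is optional, but it lets ctypes check the arguments and convert the array to a pointer for you.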

Strings and Buffers

A C string is just an array of char which is equivalent to an array of bytes. There is no encoding associated with the string and it is usually treated as an ASCII/ANSI representation. What this means is that to pass a string to a C program you generally have to first convert the Python Unicode string into a UTF-8 encoded byte sequence, which in most cases will be treated as ASCII by the C function. C doesn’t provide any standard support for Unicode, but there is nothing stopping you from implementing functions that work with UTF-8 encoded strings.

The other thing you need to know is that a C string is null-terminated, that is, the final byte is a zero and the byte sequence has one more byte than there are characters. Many C functions use the final null byte to mark the end of the string, but this is not considered safe, so many also insist that you pass the length of the string.
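
The difference between characters and bytes is easy to check from Python. The following sketch, with a sample string chosen purely for illustration, shows that a non-ASCII character occupies more than one byte in UTF-8, so a C function that counts bytes will not report the number of characters:

```python
import ctypes

text = "na\u00efve"              # five characters, one of them non-ASCII
data = text.encode("utf-8")

print(len(text))                 # number of characters
print(len(data))                 # number of bytes the C function would see

# c_char_p wraps the bytes as a pointer to a null-terminated char array.
p = ctypes.c_char_p(data)
print(p.value == data)

# For pure ASCII text the UTF-8 and ASCII encodings are byte-for-byte equal.
ascii_data = "Hello World".encode("utf-8")
print(ascii_data == "Hello World".encode("ascii"))
```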

As C strings are arrays, they are passed by reference and this is allowed for in Python by using the c_char_p type which is automatically a pointer to the bytes object that represents the string. For example, the following C function counts the number of characters in the ASCII string it is passed:

__declspec(dllexport) int countString(char[]);
int countString(char myString[])
{
    int temp = 0;
    for (int i = 0; i < 100; i++)
    {
        if (myString[i] == 0)
            break;
        temp = temp + 1;
    }
    return temp;
}

This simply iterates through the char array until a null byte is found, at which point the loop ends. There are more compact ways of writing this, but this form is more like Python than C. Notice that the for loop sets a limit of 100 characters, so any longer string is reported as having a length of 100.

The Python program to make use of this is fairly simple as long as you are following the idea that you have to convert the string to an ASCII byte sequence:

data="Hello World".encode(encoding="utf8")
count=lib.countString(ctypes.c_char_p(data))
print(count)

If you run this you will see 11 printed. Notice also that you have to pass the array as a pointer as before, but in this case the ctype c_char_p is already a pointer to the sequence of bytes. The marshaling creates a pointer to the byte sequence and adds a zero byte at the end. The C program can use the string, but it cannot safely modify it so you should not pass a c_char_p type to a C function that modifies or mutates the string.

So how can you pass a string that a C function can modify? This is a matter of using the utility function:

ctypes.create_string_buffer(init_or_size, size=None)

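If the first argument is an integer, this creates a zero-filled buffer of that many bytes. If it is a byte sequence, it creates a buffer one byte bigger than the sequence, which leaves room for the null terminator. You can override the size of the buffer created from a byte sequence by specifying the optional size parameter.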
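
The three ways of calling create_string_buffer can be compared directly. This is a small sketch with illustrative variable names:

```python
import ctypes

# An integer argument gives a zero-filled buffer of that many bytes.
empty = ctypes.create_string_buffer(10)
print(ctypes.sizeof(empty), empty.raw)

# A bytes argument gives a buffer one byte longer, for the null terminator.
from_bytes = ctypes.create_string_buffer(b"Hello")
print(ctypes.sizeof(from_bytes), from_bytes.raw)

# The optional size overrides the length and leaves spare room after the data.
padded = ctypes.create_string_buffer(b"Hello", 10)
print(ctypes.sizeof(padded), padded.value)
```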

For example, the following C function writes the characters of “Hello World” to the input parameter:

__declspec(dllexport) int helloString(char[]);
int helloString(char myString[])
{
    char msg[]="Hello World";
    int i=0;
    while(msg[i]!=0){
        myString[i]=msg[i];
        i=i+1;
    }
    myString[i]=0;
    return 0;
}

This can be used in Python something like:

myBuffer=ctypes.create_string_buffer(50)
result=lib.helloString(myBuffer)
print(myBuffer.value)

which displays:

b'Hello World'

Notice that the returned string is zero-terminated, but the value attribute returns a byte sequence without the terminating zero. The size of the buffer you use is important because the C function has no way of knowing it. If the function writes more bytes than you allocated, it overwrites memory it doesn't own, which can corrupt data or crash the program.
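
The difference between value and raw can be seen without a C library. In the sketch below, ctypes.memmove stands in for the C function writing into the buffer, and the size check is an illustrative precaution rather than something the C function does itself:

```python
import ctypes

buf = ctypes.create_string_buffer(50)
message = b"Hello World\x00"          # the bytes a C function would write

# Guard against writing past the end of the buffer.
if len(message) > ctypes.sizeof(buf):
    raise ValueError("buffer too small")
ctypes.memmove(buf, message, len(message))

print(buf.value)                      # bytes up to, not including, the first zero
print(len(buf.raw))                   # the whole buffer, zeros included
print(buf.value.decode("utf-8"))      # back to a Python string
```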

The buffer that is created in this example is a general purpose byte buffer and not just useful for strings. You can use a string buffer wherever a function needs a mutable byte buffer, but converting it to whatever it represents is up to you.
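
As an example of using the buffer for something other than a string, the following sketch treats eight bytes as two 32-bit integers. Here memmove again stands in for a C function filling the buffer, and the struct module is just one possible way to do the conversion:

```python
import ctypes
import struct

buf = ctypes.create_string_buffer(8)

# Simulate a C function storing two 32-bit ints in native byte order.
packed = struct.pack("=ii", 7, 42)
ctypes.memmove(buf, packed, len(packed))

# Option 1: overlay an int array on the same memory (no copy is made).
view = (ctypes.c_int32 * 2).from_buffer(buf)
print(list(view))

# Option 2: copy the bytes out and decode them with struct.
a, b = struct.unpack("=ii", buf.raw)
print(a, b)

# Because view shares memory with buf, a change to one shows in the other.
view[0] = 100
print(struct.unpack("=ii", buf.raw))
```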

Last Updated ( Wednesday, 22 March 2023 )