Programmer's Python Data - The Dictionary
Written by Mike James   
Monday, 18 September 2023
Article Index
Programmer's Python Data - The Dictionary
Hashing
Dictionary Versus List

Hashing

In principle, that's all there is to a dictionary, but in practice there are some minor complications. There is also a lot of similarity between a dictionary and a list, enough to occasionally confuse the beginner. Let’s examine how we do things without the help of a dictionary.

For example, if you wanted to store Lucy's age in a list you would write:

age_list[0]=19

and to retrieve the age:

print(age_list[0])

Notice that in this case the 0 plays the role of the name "lucy". In both cases we are using an index to access the data in a data structure. In the case of a list you can only use integer indexes, but for a dictionary you can use a much wider range of types – although in practice you mostly use strings or integers.

If you want to look up Lucy’s age then you need some way of locating the Lucy entry in the list, i.e. you need to know where it is stored in the array. One way of doing this is to store the name with the age. For example:

age_list.append(("lucy",19))

stores a (name,age) tuple in the list. In this case name is playing the part of the key and age the value. To find Lucy’s age, assuming that there are lots of other entries in the list, you now need to search the list to find the correct entry:

for person in age_list:
	if person[0]==”lucy”:
		print(person[1])

You can see that this search takes a time proportional to the number of items in the list. Looking things up in a list can be slow, too slow if there are a lot of items to search. There are many solutions, including sorting the list to make keys quicker to find, but the dictionary solution has lots of advantages and it is based on hashing.

To act as a dictionary key an object has to be hashable – i.e. has to support the __hash__ method. The hash method returns a number that doesn’t change for the object’s lifetime. In simple terms, the hash is used to derive a value from the key that is used to determine where the value will be stored. When you need the value again the key is hashed and the value located – this is why dictionaries are fast.

You can think of the dictionary storage as:

age_list[__hash__(key)] = value

and to find the value you simply use:

age_list[__hash__(key)]

no need to search for the key – the hash function gives you the location of the data at once.

Looking something up in a dictionary always takes the same time and doesn’t depend on the number of items stored – in the jargon it is, order one, O(1). This roughly means that the time needed to complete a task is constant and doesn’t depend on the number of items involved. This is the best sort of algorithm you can have.

Of course, the cost of this efficiency is that you have to allocate enough storage to accommodate the possible hash values and this means you need more storage than is strictly necessary. You need storage locations for keys that haven’t yet occurred. This is another example of the general principle that you can trade time for storage. A dictionary uses more memory than a list, but it is much faster to find a keyed value. In practice, the smallest dictionary has space for eight elements, irrespective of how many you actually use. When this runs out of space the dictionary is resized to be four times the size and when you reach 50,000 elements the resizing is to two times. Note that deleting elements can cause a dictionary to resize to a smaller size.

In general, nearly all Python’s built-in immutable objects are hashable. Mutable objects are not hashable and for an immutable container to be hashable all of its elements have to be hashable. It sounds complicated, but in practice nearly everything you might want to use as a key is hashable and so can be used. However, for all of this to work we need a good hash function.

A good hash function has to be easy to compute and has to spread the storage of keys over a given range and produce few “collisions”. An ideal hash function would send every key to a different storage location. If you have 1 million keys to store then the ideal hash function should produce 1 million different hash values and this in turn would require1 million different locations to store the values.

A practical hash function produces hash values in the range of the storage you have allocated and this means that for some keys it will produce the same hash value – this is called a collision and it is a fact of life for hash storage. A good hash function produces an acceptably small number of collisions and spreads the key locations across the available storage – bunching would increase the chance of a collision. Good hash functions are difficult to invent.

Python’s Hashing

You don’t need to know how a dictionary is implemented to be able to use it, but it does explain a lot of things about how they behave. The way that the dictionary works was changed in Python 3.7 to be more sophisticated and more efficient and this is what is described here. In practice, the data structures used to implement a dictionary aren’t Python data structures, in the default compiler CPython they are C data structures, but they work in the same way as the Python data structures you know.

The first change from simple hashing is that the actual data is stored in a list in the order in which it is added to the dictionary. The data is stored as a tuple:

(hash, key, value)

For example a dictionary update is roughly:

data_table.append((hash, key, value))

Using lucy as an example this would be something like:

data_table.append((231,"lucy",19))

where 231 is the assumed hash value of the string "lucy", that is: __hash__(“lucy”) == 231.

Notice that the use of a list to store the values means that you can access the dictionary in the order in which the data was entered and it provides a natural order for iterating through the dictionary.

So far all we have is a list – how does hashing enter the picture? The answer is that when a dictionary is allocated, another table is created that is indexed by the hash value and stores the index of the data in the data_table. That is:

hash_table[hashvalue] = index

where index is the location in data_table where the data is actually stored. Assuming lucy was the 42nd entry in the table this would be:

hash_table[231] = 42

and this implies that:

data_table[42]

contains

(231,"lucy",19)

Now you can see how the dictionary works. When you write:

age["lucy"]

the system works out the hash value and then looks this up in hash_table to get the location in data_table where the data is stored:

data = data_table(hash_table(__hash__("lucy")))

This seems like a complicated way to do the job, but it has big advantages. The first is that the entries in the are fixed in size and small compared to the entries in the data table. This is important because it is the hash table that has to be allocated to have enough entries to accommodate the range of hash values and so has to be allocated to be larger than the number of items you are actually storing.

In our example, you may only be storing one entry in the table corresponding to Lucy’s data, but you still have to allocate at least 231 entries in the hashtable. The datatable on the other hand can have just one entry. This saves a lot of storage space – typically 20% over the previous implementation used before Python 3.7.

The second advantage is that in all hash storage implementations it is possible to run out of space in the hashtable. When this happens the range of the hash used has to be extended and the table has to be reorganized to reflect the new hash range. In this scheme only the hashtable has to reorganized – the datatable can remain as it was and this much faster.

There is one complication worth mentioning. Recall that it is possible for two different objects to have the same hash value, a hash collision, and while they don’t happen very often with a well-designed hash function, they do occur because there are usually more objects than possible values for the hash.

How does Python’s hashtable cope with a collision? The simple answer is that if the location indicated by the hash value is already occupied the next free space is used. This means that searching for an object that is subject to a collision involves a little hunting until the item is found or an empty location is encountered, but this is still much better than searching the entire list. Collisions are a nuisance, but they don’t spoil the advantage of the approach.



Last Updated ( Monday, 18 September 2023 )