Page 1 of 2
Hashing is arguably one of the great ideas of computing and it has hidden depths. Extensible hashing and perfect hashing are ideas that are worth exploring.
When we covered the basics of hashing in Hashing - The Greatest Idea In Programming some exciting ideas were left out - extensible hashing and perfect hashing.
Don't be put off by the jargon because they are both simple and the sort of idea that you wish you'd thought of first. If you want to be reminded of what hashing is all about read the next section - Simple Hashing before continuing - otherwise skip to the next section.
One of the most difficult things in computing is storing something where you can find it again.
The most obvious method is to store the data in sorted order and search using a binary search for example.
Hashing is a much more ingenious way of doing the same job.
All you need is a suitable hashing function
say which will take the data value k, the key, and convert it into a storage location.
The data is then stored at the location and if you ever want it again you simply use HASH(k) to find it.
For example, to keep a list of names and addresses keyed on surname you might use a hash function which added together the ASCII values of each letter in the surname. The resulting sum could then be used as the index to an array of records used to store the name and address.
When you want the record again you simply hash the surname and go straight to the location where it is stored.
The only complication with hashing is that hash functions are imperfect and often send different keys to the same location. For example, two addresses with the different surnames might be mapped to the same array element.
This is called a collision and hashing methods vary in the way that they cope with the problem. Open hashing simply chains the collided items off the one array location so effectively storing them all at the same array index value as a linear list. Closed hashing uses a hashing function for a second time to give another location in the array where the item that caused the collision can be stored.
For an example of hashing and more on collisions see - Hashing - The Greatest Idea In Programming
One of the practical difficulties with hashing is the way that the hash table has to be set up at a fixed size before any data is added to it.
The reason is that the hash function has to generate locations in the range 1 to M, where M is the size of the array used to form the hash table. (In practice this is usually achieved by taking the output of the hash function MOD M - see The Mod function.)
If you want to increase the size of the hash table then the amount of work involved is far too much. You would essentially have to re-hash every single item in the table!
There is a solution.
Extensible hashing is a technique that is particularly suited to storing data on disk. Instead of using the hash function to determine which exact location data is stored in, it is used to specify a fixed sized block of storage.
Each block will be used to store as many items as are hashed to it. At the moment this sounds just like a limited form of open hashing but there is more.
The trick is that while a full N-bit hashing function is computed, only the first few bits are actually used as an index to a table of blocks.
For example, if there are only two blocks in use then the index table will have only two entries which are selected using just the first bit of the hash result.
You can see that using this scheme on average half of the key values are stored in each block.
Now consider what happens when one of the blocks becomes full. The obvious thing to do is increase the number of blocks involved in the hashing by increasing the size of the index table. If we use another bit of the hash value the size of the table doubles and it can accommodate twice the total number of blocks.
This increasing use of the bits provided by the hash function is what makes the technique extensible.
The only problem that remains is that we have to reorganise the existing blocks so that the full blocks are spilt into two new blocks containing the different values of the second bit of the hash value. Each of the full blocks can be split into two blocks by simply using the values of the extra hash bit. That is if the full block held all the data that hashed to 1 say then this block can be split into all the data that hashes to 10 and 11.
The nice thing is that we don't have to do anything to the blocks that are not full. The block table can simply map both values of the newly used hash bit to the same block. For example, if the block that holds data hashed to 0 isn't full the table simply maps 00 and 01 to the same block.
When a block that has repeated entries in the table finally fills up it can be split in the usual way.
The use of a block table to map the bits of the hash in use to the storage blocks is the key idea. You can reuse blocks until they are full. Blocks only have to be split when they fill up.
Of course you still have to search the block for the exact value in which you are interested, but this isn't a serious overhead. If the blocks are arranged to be units of disk storage then you are still guaranteed to access the correct block, i.e. the one that contains the data you are looking for, in a single disk read.
Once the block is in memory it can be searched quickly even using a simple linear search. The only real overhead is the need to keep an index table that increases by a power of two each time another bit of the hash value is used to increase the number of blocks, but again a little arithmetic shows that this too isn't a problem as long as the block size is reasonably large.
The way that the data is reorganised by splitting blocks is very similar to the way tree structured storage methods such as B trees work. Indeed this approach makes hashing as flexible, more efficient and easier to implement than a B tree index.