When we covered the basics of hashing, some exciting ideas were left out - extensible hashing and perfect hashing. Don't be put off by the jargon because they are both simple and the sort of idea that you wish you'd thought of first. If you want to be reminded of what hashing is all about read Simple Hashing before continuing - otherwise skip to the next section.
One of the most difficult things in computing is storing something where you can find it again. The most obvious method is to store the data in sorted order and search using a binary search. Hashing is a much more ingenious way of doing the same job. All you need is a suitable hashing function, HASH(k) say, which takes the data value k - the key - and converts it into a storage location. The data is then stored at that location and if you ever want it again you simply use HASH(k) to find it. For example, to keep a list of names and addresses keyed on surname you might use a hash function which added together the ASCII values of each letter in the surname. The resulting sum could then be used as the index to an array of records used to store the name and address. When you want the record again you simply hash the surname and go straight to the location where it is stored.
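In Python the surname scheme might be sketched as follows - the table size and record format are illustrative assumptions, and collisions are ignored here because they are dealt with next:

```python
TABLE_SIZE = 101   # assumed table size; a prime is a common choice

def hash_surname(surname):
    # Add the ASCII values of the letters, reduced to a table index.
    return sum(ord(c) for c in surname) % TABLE_SIZE

table = [None] * TABLE_SIZE   # one slot per possible hash value

def store(surname, address):
    table[hash_surname(surname)] = (surname, address)

def lookup(surname):
    entry = table[hash_surname(surname)]
    if entry is not None and entry[0] == surname:
        return entry[1]
    return None   # nothing stored under this surname

store("Smith", "1 High Street")
```

Storing and retrieving both take a single hash computation - no searching involved.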
The only complication with hashing is that hash functions are imperfect and often send different keys to the same location. For example, two records with different surnames might be mapped to the same array element. This is called a collision and hashing methods vary in the way that they cope with the problem. Open hashing simply chains the collided items off the one array location, effectively storing them all at the same array index as a linear list. Closed hashing uses a hashing function a second time to give another location in the array where the item that caused the collision can be stored.
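Open hashing can be sketched like this - the table size is an illustrative assumption, and the colliding keys are anagrams, so the letter-sum hash is guaranteed to send them to the same slot:

```python
TABLE_SIZE = 8   # illustrative size

def h(key):
    # Letter-sum hash reduced to a table index.
    return sum(ord(c) for c in key) % TABLE_SIZE

# Open hashing: each slot holds a chain (here a Python list) of
# every item that hashed to it.
table = [[] for _ in range(TABLE_SIZE)]

def store(key, value):
    chain = table[h(key)]
    for i, (k, _) in enumerate(chain):
        if k == key:
            chain[i] = (key, value)   # key already present: replace it
            return
    chain.append((key, value))        # new key: append to the chain

def lookup(key):
    for k, v in table[h(key)]:        # linear search along the chain
        if k == key:
            return v
    return None

# "Smith" and "Smiht" contain the same letters, so they collide.
store("Smith", "1 High Street")
store("Smiht", "2 Low Road")
```

Both records survive the collision because they sit one after the other in the same chain.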
For an example of hashing and more on collisions see the preliminary article on this topic.
One of the practical difficulties with hashing is the way that the hash table has to be set up at a fixed size before any data is added to it. The reason is that the hash function has to generate locations in the range 0 to M-1, where M is the size of the array used to form the hash table. (In practice this is usually achieved by taking the output of the hash function MOD M.) If you later want to increase the size of the hash table the amount of work involved is far too much - you would essentially have to re-hash every single item in the table!
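A quick experiment shows why - the table location depends on M, so changing M moves most of the items. The toy hash and the two table sizes are illustrative:

```python
def hash_value(key):
    # Toy hash (an assumption): sum of the character codes.
    return sum(ord(c) for c in key)

def index(key, m):
    # The item's location in a table of size m.
    return hash_value(key) % m

keys = ["Smith", "Jones", "Brown", "Taylor", "Wilson"]

# Grow the table from 11 slots to 22 and count how many keys
# would end up in a different slot.
moved = sum(1 for k in keys if index(k, 11) != index(k, 22))
```

Most keys land in a different slot once M changes, which is why every single item has to be re-hashed when the table grows.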
There is a solution.
Extensible hashing is a technique that is particularly suited to storing data on disk. Instead of using the hash function to determine the exact location where an item is stored, it is used to specify a fixed-size block of storage.
Each block will be used to store as many items as are hashed to it. At the moment this sounds just like a limited form of open hashing but there is more.
The trick is that while an N-bit hashing function is computed, only the first few bits are actually used as an index to a table of blocks. For example, if there are only two blocks in use then the index table will have only two entries which are selected using the first bit of the hash result.
You can see that using this scheme on average half of the key values are stored in each block.
Now consider what happens when one of the blocks becomes full. The obvious thing to do is increase the number of blocks involved in the hashing by increasing the size of the index table. If we use another bit of the hash value the size of the table doubles and it can accommodate twice the total number of blocks.
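Selecting the first few bits of the hash result can be sketched as follows - the 32-bit width and the hash value are illustrative:

```python
def top_bits(hash_value, n_bits, total_bits=32):
    # Keep only the leading n_bits of a total_bits-wide hash value.
    return hash_value >> (total_bits - n_bits)

h = 0b10110110_00000000_00000000_00000000  # an example 32-bit hash

top_bits(h, 1)  # → 1 : with two blocks in use, this key goes to block 1
top_bits(h, 2)  # → 2 : after doubling the table, it goes to block 2 of 4
```

Bringing one more bit into use doubles the range of index values, and so doubles the size of the index table - no key ever needs to be re-hashed, only re-indexed.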
The only problem that remains is that we have to reorganise the existing blocks so that each full block is split into two new blocks according to the second bit of the hash value. The simplest way to do this is to split the full block in two and distribute its items according to the first two bits of their hash values. The blocks that aren't full can be left alone and used to hold items with either value of the second bit.
Notice that because the table maps each hash value to a block, you don't have to split all of the blocks. There is nothing to stop the same block being used to store data from two different hash values. Blocks only have to be split when they fill up.
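The whole scheme can be sketched in Python - a toy, in-memory version with an assumed capacity of two items per block, a toy letter-sum hash, and the low-order bits of the hash used as the directory index (real implementations use one disk page per block):

```python
BLOCK_CAPACITY = 2   # illustrative; a disk block holds many more items

def toy_hash(key):
    # Toy hash (an assumption): sum of the character codes.
    return sum(ord(c) for c in str(key))

class Block:
    def __init__(self, depth):
        self.depth = depth   # how many bits of the hash this block uses
        self.items = {}      # the key/value pairs stored in the block

class ExtensibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Block(1), Block(1)]  # indexed by low bits of hash

    def _index(self, key):
        return toy_hash(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key):
        # One directory access gives the right block; then search it.
        return self.directory[self._index(key)].items.get(key)

    def insert(self, key, value):
        block = self.directory[self._index(key)]
        if key in block.items or len(block.items) < BLOCK_CAPACITY:
            block.items[key] = value
            return
        # The block is full: split it. If it already uses every bit of
        # the directory index, double the directory first (one more bit).
        if block.depth == self.global_depth:
            self.directory = self.directory * 2
            self.global_depth += 1
        block.depth += 1
        new_block = Block(block.depth)
        # Directory slots whose newly significant bit is 1 now point at
        # the new block; all the other blocks are left untouched.
        for i, b in enumerate(self.directory):
            if b is block and (i >> (block.depth - 1)) & 1:
                self.directory[i] = new_block
        # Only the items of the split block are re-arranged.
        old_items, block.items = block.items, {}
        for k, v in old_items.items():
            self.directory[self._index(k)].items[k] = v
        self.insert(key, value)   # retry; may split again if still full

t = ExtensibleHash()
for k in "abcde":
    t.insert(k, ord(k))
```

After these five insertions one block has split, the directory has doubled to four entries, and two directory entries still share an unsplit block - exactly the situation described above.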
Of course you still have to search the block for the exact value in which you are interested, but this isn't a serious overhead. If the blocks are arranged to be units of disk storage then you are still guaranteed to access the correct block, i.e. the one that contains the data you are looking for, in a single disk read.
Once the block is in memory it can be searched quickly even using a simple linear search. The only real overhead is the index table, which doubles in size each time another bit of the hash value is used to increase the number of blocks - but again a little arithmetic shows that this isn't a problem as long as the block size is reasonably large.
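That little arithmetic might run as follows - the block, record and pointer sizes are all assumed, illustrative figures:

```python
block_bytes = 4096                     # assumed disk block size
record_bytes = 64                      # assumed record size
records_per_block = block_bytes // record_bytes   # 64 records per block

directory_entries = 1 << 20            # 20 bits of the hash in use
pointer_bytes = 8                      # assumed size of a block pointer
directory_bytes = directory_entries * pointer_bytes   # 8 MB of index

# The data the directory can reach if every block were full:
max_records = directory_entries * records_per_block   # ~67 million
```

An 8 MB index table covering tens of millions of records is a trivial cost, which is why a reasonably large block size keeps the overhead negligible.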
The way that the data is reorganised by splitting blocks is very similar to the way tree structured storage methods such as B trees work. Indeed this approach makes hashing as flexible as a B tree index, while being more efficient and easier to implement.