Universal Hashing

Written by Mike James

Friday, 27 January 2023

Article Index
Universal Hashing
Parity

Page 2 of 2

Parity

This looks like a very easy and very efficient way of creating an m bit hash needing only m ands and m shifts.

The problem is that there is still the parity function to write.

As this needs to examine each bit to effectively count the number of ones in the data this seems to need m operations. In fact the job can be done in a number of operations that equals the number of bits set.

This algorithm is attributed Brian Kernigan but its based on a technique well worth knowing.

How do you zero the least significant set bit?

At first it seems impossibly difficult to do without using shifts and tests to find the first bit set. However if you start out with a value and subtract 1 then this always zeros the least significant bit that was set - think about it...

For example:

101110 - 1 = 101101

101100 - 1 = 101011

and so on. At this point you might think that no progress has been made because while you have zeroed the least significant set bit you have set other lower order bits. Of course the trick is that the original data already had these bits zeroed. So if you And the new value with the original then the result will have all of those bits zeroed in addition to the least significant set bit.

Thus to zero the least significant set bit in x all you have to do is:

x= x & (x-1);

If you keep doing this then eventually you will zero all of the bits and x will be zero.

Now you can put this to use in a parity function that only iterates the number of set bits times:

Int32 parity(Int32 p) { int flag=0; while(p!=0) { flag=flag ^ 1; p=p&(p-1); } return flag; }

Once again this can be written more compactly as:

Int32 parity(Int32 p) { int flag=0; while(p!=0) { flag ^= 1; p &=(p-1); } return flag; }

Second Method

With a fast parity function the method described above can return a hash in m*b operations where b is the number of bits set in the intermediate result - say m^2/2 on average.

The second method can be more efficient but it makes use of the columns in the matrices rather than the rows. So in this case we need to generate m bit values to represent the columns - but why bother? It costs the same to generate to generate 32 bit positive values.

We can simply truncate the result at the end to m bits and get the same result as if we had truncated the columns to m bits before the calculation.

To avoid interacting with the previous code we can change the constructor to create a second random matrix in column order:

Int32[] hashmatrix2 =new Int32[31];for (int i = 0; i < 31; i++) { hashmatrix2[i] = R.Next(); }

Now the elements of the array, or at least the first m bits are regarded as the columns of the matrix.

Now to work out the hash we select the columns that correspond to the bits set in the data and exclusive or them together:

public Int32 Hash2(int x) { Int32 sum = 0; for (int i = 0; i < 31; i++) { if ((x & 0x01) == 0x01) { sum ^=hashmatrix2[i]; } x >>= 1; } return sum>>(31-M); }

At the end we truncate the result to m bits by right shifting the result down to leave only m bits.

With this in place you can now try out the alternative method of calculation.

Int32 h2 = hashobj.Hash2(x);

Notice that you get a different result because they are using two different matrices.

Can these routines be improved?

The answer is yes but how exactly depends on the language and architecture you are working with.

The most obvious improvement is to unroll any fixed sized loop to eliminate the loop structure. A more complex optimization would be to use a very long binary word representation or to use a GPU to parallelize the algorithms. Notice that either algorithm lends itself to being implemented in hardware quite easily.

If you need a universal family of hash functions to try out another algorithm then either of the two methods works well.

Comments

or email your comment to: comments@i-programmer.info

Programmer's Guide To Theory - Splitting the Bit

Information theory – perhaps one of the most remarkable inventions of the twentieth century - naturally leads on to the consideration of how information can be coded and hence coding theory.

+ Full Article

Data Structures - Trees

Classic data structures produce classic tutorials. In this edition of Babbage's Bag we investigate the advanced ecology of trees - perfectly balanced trees, AVL trees and B-Trees.

+ Full Article

Recent Articles

Recent Book Reviews

Popular Articles

Parity

Second Method

Further reading

Related Articles

Comments