Universal Hashing
Written by Mike James   
Friday, 27 January 2023
Article Index
Universal Hashing
Parity

Parity

This looks like a very easy and very efficient way of creating an m bit hash needing only m ands and m shifts.

The problem is that there is still the parity function to write.

As this needs to examine each bit to effectively count the number of ones in the data this seems to need m operations. In fact the job can be done in a number of operations that equals the number of bits set.

This algorithm is attributed Brian Kernigan but its based on a technique well worth knowing.

How do you zero the least significant set bit?

At first it seems impossibly difficult to do without using shifts and tests to find the first bit set. However if you start out with a value  and subtract 1 then this always zeros the least significant bit that was set - think about it...

For example:

  101110 - 1 = 101101

or

  101100 - 1 = 101011

and so on.  At this point you might think that no progress has been made because while you have zeroed the least significant set bit you have set other lower order bits. Of course the trick is that the original data already had these bits zeroed. So if you And the new value with the original then the result will have all of those bits zeroed in addition to the least significant set bit.

Thus to zero the least significant set bit in x all you have to do is:

x= x & (x-1);

If you keep doing this then eventually you will zero all of the bits and x will be zero.

Now you can put this to use in a parity function that only iterates the number of set bits times:

Int32 parity(Int32 p)
{
 int flag=0;
 while(p!=0)
 {
  flag=flag ^ 1;
  p=p&(p-1);
 }
 return flag;
}

Once again this can be written more compactly as:

Int32 parity(Int32 p)
{
 int flag=0;
 while(p!=0)
 {
  flag ^= 1;
  p &=(p-1);
 }
 return flag;
}

hash3

Second Method

With a fast parity function the method described above can return a hash in m*b operations where b is the number of bits set in the intermediate result - say m^2/2 on average.

The second method can be more efficient  but it makes use of the columns in the matrices rather than the  rows. So in this case we need to generate m bit values to represent the columns - but why bother? It costs the same to generate to generate 32 bit positive values.

We can simply truncate the result at the end to m bits and get the same result as if we had truncated the columns to m bits before the calculation.

To avoid interacting with the previous code we can change the constructor to create a second random matrix in column order:

Int32[] hashmatrix2 =new Int32[31];
for (int i = 0; i < 31; i++)
{
 hashmatrix2[i] = R.Next();
}

Now the elements of the array, or at least the first m bits are regarded as the columns of the matrix.

Now to work out the hash we select the columns that correspond to the bits set in the data and exclusive or them together:

public Int32 Hash2(int x)
{
 Int32 sum = 0;
 for (int i = 0; i < 31; i++)
 {
  if ((x & 0x01) == 0x01)
  {
   sum ^=hashmatrix2[i];
  }
  x >>= 1;
 }
 return sum>>(31-M);
}

At the end we truncate the result to m bits by right shifting the result down to leave only m bits.

With this in place you can now try out the alternative method of calculation.

Int32 h2 = hashobj.Hash2(x);

Notice that you get a different result because they are using two different matrices.

Can these routines be improved?

The answer is yes but how exactly depends on the language and architecture you are working with.

The most obvious improvement is to unroll any fixed sized loop to eliminate the loop structure. A more complex optimization would be to use a very long binary word representation or to use a GPU to parallelize the algorithms. Notice that either algorithm lends itself to being implemented in hardware quite easily.

If you need a universal family of hash functions to try out another algorithm then either of the two methods works well.

Further reading

Random Matrix Hashing

 

You can download the code for both the Windows Forms version and the WPF version in a single ZIP file from the CodeBin (note you have to register first).

Related Articles

Hashing

Hashing solves one of the basic problems of computing - finding something that you have stored somewhere.

Advanced Hashing

Extensible hashing and perfect hashing are ideas that are worth exploring.

Universal Hashing

A look at a way of creating a set of hash functions.

The Bloom Filter

The Invertible Bloom Filter

Storage Mapping Functions       

    

 

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

pico book

 

Comments




or email your comment to: comments@i-programmer.info

 
Banner


Compilers, Interpreters, VMs and JIT

The distinction between a compiler and an interpreter is one that can cause controversy. One programmer's compiler is another's interpreter and the whole subject gets very murky when you throw in the  [ ... ]



Inside Random Numbers

We often refer to things that are unpredictable as being "random" but this is not the same as truly random behavior - which is something we have to work hard to achieve. Put another way - how can a lo [ ... ]


Other Articles
 

 .

<ASIN:0071460519>

<ASIN:3540431012>



Last Updated ( Friday, 27 January 2023 )