Page 2 of 2
A Bloom filter in C#
To make sure that you follow the way the Bloom filter actually works let's implement a simple version in C#. This is not a production version and it certainly isn't optimized. Optimization would depend very much on what you were using it for. What this implementation does is focus your attention on some of the difficulties of implementing a good Bloom filter.
The first simplicity is that instead of custom crafting a bit array we can use the supplied BitArray object. This can be found in the Collections namespace:
using System.Collections;
We can also create a Bloom filter class:
class Bloom { const int m = 1000; const int k = 5; BitArray bloom = new BitArray(m);
The constants m and k set the size of the bit array and the number of hash functions to use.
Before we can do anything else we need to solve the problem of finding k hash functions. This is difficult because finding one hash function is tough to find k is very tough. Fortunately there are a number of easy ways out of the problem. The first is that we can use the built in GetHashCode method that is built into every string. While this is not likely to be optimized it is supposed to be good enough for hash storage.
This gives us one hash function. The simplest way to generate k1 more is to use the hash value as the seed for a random number generator and use the next k1 random numbers as additional hash values. This isn't ideal and in practice it tends to produce a Bloom filter that is about as good in terms of false positives as using k/2 true hash values.
An alternative way of doing that job that is nearly as efficient is to generate two independent hash values h0 and h1 and then forming:
h[k]=h0+i*h1;
for i = 2,,, k1. This is easy but it needs two independent hash functions and this would mean having to actually code one in C#.
The simple random number solution to finding k hash values is very easy to implement:
private int[] hashk(string s, int k) { int[] hashes = new int[k]; hashes[0] = Math.Abs(s.GetHashCode()); Random R = new Random(hashes[0]); for (int i = 1; i < k; i++) { hashes[i] = R.Next(); } return hashes; } }
When the method returns we have k hash values in an int array ready to be used.
The Bloom class needs two other methods. One to add a data value:
public void AddData(string s) { int[] hashes = hashk(s, k); for (int i = 0; i < k; i++) { bloom.Set(hashes[i] % m,true); } }
Notice the way that the BitArray uses a Set method to set the bit at the correct location to either true i.e. 1 or false i.e. 0. When this method returns all of the locations indicated by the hash array have been set to 1.
The final method is the one that looks up a value to see it if is already in the filter:
public Boolean LookUp(string s) { int[] hashes = hashk(s, k); for (int i = 0; i < k; i++) { if (bloom[hashes[i] % m] == false) return false; } return true; }
This has to simply scan the locations indicated by the hash array and return true if they are all true. Notice that the only certainty is if you find a false i.e.0 BitArray element because then you can be sure that the string has never been entered into the filter.
To try it out you would do something like:
Bloom MyBloom = new Bloom(); string data = "Hello Bloom Filter"; MyBloom.AddData(data);
and to lookup a string:
string data = "Hello Bloom Filter"; MessageBox.Show( MyBloom.LookUp(data).ToString());
Of course in a real situation you would add lots of data and look up lots of data.
If you plan to make use of a Bloom filter then it is worth first writing two good independent hash functions. However you are also going to want to implement it using something closer to the machinelike C, if not assembler, to get the final drop of performance out of it.
Related articles
Hashing
Hashing solves one of the basic problems of computing  finding something that you have stored somewhere.
Advanced Hashing
Extensible hashing and perfect hashing are ideas that are worth exploring.
The Invertible Bloom Filter
A Bloom filter that you can delete, retrieve and list all data values.
Universal Hashing
A look at a way of creating a set of hash functions
More Information
Paper on using two hash values to create more:
Less Hashing, Same Performance…
MurmurHash  a good general purpose hash.
