You may never have heard of a Bloom Filter, but this ingenious algorithm is used in Google's BigTable database to save the time that would otherwise be wasted searching for data that isn't there.

In programming, and perhaps in life, there are certain well known trade-offs. You can usually trade space for time - the more storage you can throw at a problem, the faster you can make it run. There is also a lesser known and much more sophisticated trade-off: in general, you can trade certainty for time. This is the basis of many randomized algorithms, where the answer returned isn't certain to be correct, but it is delivered fast compared to a deterministic calculation of the same quantity.

These are ideas that fascinate many programmers who become completely absorbed by the study of algorithms for their own sake, but they also have practical application. Take, for example, the currently hot topic of the Bloom Filter. This might sound like something a creative photographer might put in front of a lens, but it is in fact an intriguing algorithm that trades both space and certainty for time. It can tell you in ultra quick time whether you have seen a particular data item before - but it might be wrong!

Bloom filters are used by Google's BigTable database to reduce lookups for data rows that haven't been stored. The Squid proxy server uses one to avoid looking up things that aren't in the cache and so on...

Using hash functions

The algorithm invented in 1970 by Burton Bloom is very simple but still ingenious. It relies on the use of a number of different hash functions.

A hash function is a function that takes an item of data and processes it to produce a value or key. For example, you could simply add up the code values for each character in a string and return the result mod some given value. A hash function always produces the same hash value from the same data, but it is possible, and in fact usual, for two different data values to produce the same hash value. That is, a hash value isn't unique to a given item of data, and you can't reverse the hash function to get the data back - it is a many-to-one deterministic function. A good hash function also has other desirable properties, such as spreading the hash values as evenly as possible over the output range, but for the moment let's just concentrate on the basic hash function.
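The character-summing hash just described can be written in a few lines of C#. This is a toy illustration, not a hash you would use in practice - the method and class names are mine:

```csharp
using System;

class HashDemo
{
    // A toy hash function: sum the character codes and reduce mod m.
    public static int SimpleHash(string s, int m)
    {
        int sum = 0;
        foreach (char c in s) sum += c;   // add up the code values
        return sum % m;                   // map into the range 0..m-1
    }

    static void Main()
    {
        Console.WriteLine(SimpleHash("abc", 100));  // 94
        Console.WriteLine(SimpleHash("cba", 100));  // 94 - a collision: same hash, different data
    }
}
```

Notice that "abc" and "cba" collide - the function is many-to-one, which is exactly why it can't be reversed.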

A Bloom filter starts off with a bit array Bloom[i] initialized to zero. To record a data value you compute k different hash values, treat them as indices into the array, and set each of the k array elements to 1. You repeat this for every data item that you encounter.

[Figure: three hash functions set three elements in the bit array for each data item - the two data items shown both set the 4th element.]

Now suppose a data item turns up and you want to know if you have seen it before. All you have to do is apply the k hash functions and look up the indicated array elements. If any of them are zero you can be 100% certain that you have never encountered the item before - if you had, the bit would have been set to 1.

However, even if all of them are 1, you still can't conclude that you have seen the data item before, because all of the bits could have been set by the k hash functions applied to other data items. All you can conclude is that it is likely that you have encountered the data item before.

Notice that it is impossible to remove an item from a Bloom filter. The reason is simply that you can't unset a bit that appears to belong to a data item because it might also be set by another data item.

If the bit array is mostly empty, i.e. set to zero, and the k hash functions are independent of one another, then the probability of a false positive, i.e. concluding that we have seen a data item when we actually haven't, is low.

For example, if there are only k bits set, the probability of a false positive is very close to zero, as the only possibility of error is that you query a data item that happens to produce exactly the same k hash values as the one already entered - which is unlikely as long as the hash functions are independent.

As the bit array fills up the probability of a false positive slowly increases. Of course when the bit array is full every data item queried is identified as having been seen before. So clearly you can trade space for accuracy as well as for time.
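You can watch the false positive rate climb with a quick simulation. This sketch uses integers as the data items and a seeded random generator as a stand-in for k independent hash functions - the class name, constants and trial count are all my own choices:

```csharp
using System;
using System.Collections;

// Measure the false positive rate of a small Bloom filter as it fills up.
class FillDemo
{
    const int m = 256;   // bits in the filter
    const int k = 4;     // hash functions per item

    // k pseudo-hash values for integer item i, via a seeded random generator.
    public static int[] Hashes(int item)
    {
        var r = new Random(item);
        var h = new int[k];
        for (int j = 0; j < k; j++) h[j] = r.Next(m);
        return h;
    }

    public static double FalsePositiveRate(int stored)
    {
        var bits = new BitArray(m);
        for (int i = 0; i < stored; i++)                 // store items 0..stored-1
            foreach (int h in Hashes(i)) bits.Set(h, true);

        int falsePositives = 0, trials = 1000;
        for (int i = stored; i < stored + trials; i++)   // query items never stored
        {
            bool allSet = true;
            foreach (int h in Hashes(i)) if (!bits[h]) { allSet = false; break; }
            if (allSet) falsePositives++;
        }
        return falsePositives / (double)trials;
    }

    static void Main()
    {
        // The measured rate rises steeply as more items are stored.
        foreach (int n in new[] { 10, 40, 80, 160 })
            Console.WriteLine(n + " items: " + FalsePositiveRate(n));
    }
}
```

With 256 bits the rate is near zero for 10 items but substantial by 160 - the filter is drowning in ones.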

Interestingly, a Bloom filter can also trade accuracy for space. Storing an n byte string normally takes n bytes, but in a Bloom filter it takes only k bits and k comparisons - at the price of possible false positives. As k is increased, the storage needed and the number of comparisons increase, and the probability of a false positive decreases.

As with any trade-off situation there are optimal values. The approximate false positive (i.e. error) rate is:

p ≈ (1 - e^(-kn/m))^k

where k is the number of hash functions, m is the size of the bit array and n is the number of items stored in the bit array.

Using this you can work out the optimal k for given m and n, which is:

k = (m/n) ln 2

So for example, if the bit array has 1000 elements and you have already stored 10 data items, then the optimum k is around 70. You can also work out the size of bit array that gives any desired probability of error for a fixed n (assuming an optimum k):

m = -n ln p / (ln 2)^2

So for example, if you are storing 10 data items and you want the probability of a false positive to be around 0.001, you need an array with around 144 bits and a k of around 10.

A more realistic example is storing 100,000 items with a false positive probability of around 0.000001 - then you need an array with around 350KBytes of storage and a k of around 20.
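Plugging the formulas into a few lines of C# lets you check these numbers for yourself. The method and class names are mine; the formulas are the standard ones given above:

```csharp
using System;

class BloomSizing
{
    // Optimal number of hash functions for m bits and n items: k = (m/n) ln 2
    public static double OptimalK(double m, double n) => (m / n) * Math.Log(2);

    // Bits needed for n items at false positive rate p: m = -n ln p / (ln 2)^2
    public static double BitsNeeded(double n, double p) =>
        -n * Math.Log(p) / (Math.Log(2) * Math.Log(2));

    static void Main()
    {
        Console.WriteLine(OptimalK(1000, 10));                  // ~69.3 - the "k is 70" example
        Console.WriteLine(BitsNeeded(10, 0.001));               // ~144 bits
        Console.WriteLine(BitsNeeded(100000, 1e-6) / 8 / 1024); // ~351 - the 350KByte example
    }
}
```

Rounding aside, the figures agree with the worked examples in the text.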

A Bloom filter in C#

To make sure that you follow the way the Bloom filter actually works let's implement a simple version in C#. This is not a production version and it certainly isn't optimized. Optimization would depend very much on what you were using it for. What this implementation does is focus your attention on some of the difficulties of implementing a good Bloom filter.

The first simplification is that instead of custom crafting a bit array we can use the supplied BitArray class, which can be found in the Collections namespace:

using System.Collections;

We can also create a Bloom filter class:

class Bloom
{
    const int m = 1000;
    const int k = 5;
    BitArray bloom = new BitArray(m);

The constants m and k set the size of the bit array and the number of hash functions to use.

Before we can do anything else we need to solve the problem of finding k hash functions. This is difficult because finding even one good hash function is tough - finding k is very tough. Fortunately there are a number of easy ways out of the problem. The first is to use the GetHashCode method that is built into every string. While this is not likely to be optimal, it is supposed to be good enough for hash storage.

This gives us one hash function. The simplest way to generate k-1 more values is to use the hash value as the seed for a random number generator and take the next k-1 random numbers as additional hash values. This isn't ideal, and in practice it tends to produce a Bloom filter that is about as good, in terms of false positives, as using k/2 true hash values.

An alternative way of doing the job that is nearly as efficient is to generate two independent hash values h0 and h1 and then form:

h[i] = h0 + i*h1;

for i = 2, ..., k-1. This is easy, but it needs two independent hash functions, and that would mean having to actually code one in C#.
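A sketch of this double-hashing scheme is below. The two base hashes here - an FNV-1a variant and a multiplicative char-sum - are stand-ins of my own choosing, and I index i = 0..k-1 (a common variant in which h0 and h0+h1 are the first two values):

```csharp
using System;

class DoubleHashDemo
{
    // Stand-in hash 1: FNV-1a, masked to a non-negative int.
    static int Fnv1a(string s)
    {
        unchecked
        {
            uint h = 2166136261;
            foreach (char c in s) { h ^= c; h *= 16777619; }
            return (int)(h & 0x7FFFFFFF);
        }
    }

    // Stand-in hash 2: a simple multiplicative character hash.
    static int CharSum(string s)
    {
        int sum = 0;
        unchecked { foreach (char c in s) sum = sum * 31 + c; }
        return sum & 0x7FFFFFFF;
    }

    // Derive k indices in 0..m-1 using h_i = h0 + i*h1 (mod m).
    public static int[] HashK(string s, int k, int m)
    {
        int h0 = Fnv1a(s);
        int h1 = CharSum(s);
        var hashes = new int[k];
        for (int i = 0; i < k; i++)
            hashes[i] = (int)(((long)h0 + (long)i * h1) % m);  // long avoids overflow
        return hashes;
    }

    static void Main()
    {
        foreach (int h in HashK("Hello Bloom Filter", 5, 1000))
            Console.Write(h + " ");
    }
}
```

Only two real hash computations are needed per item, however large k is - which is why this trick is popular in practice.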

The simple random number solution to finding k hash values is very easy to implement:

private int[] hashk(string s, int k)
{
    int[] hashes = new int[k];
    hashes[0] = s.GetHashCode() & 0x7FFFFFFF;  // mask to non-negative; Math.Abs throws on int.MinValue
    Random R = new Random(hashes[0]);
    for (int i = 1; i < k; i++)
    {
        hashes[i] = R.Next();
    }
    return hashes;
}

When the method returns we have k hash values in an int array ready to be used.

The Bloom class needs two other methods. One to add a data value:

public void AddData(string s)
{
    int[] hashes = hashk(s, k);
    for (int i = 0; i < k; i++)
    {
        bloom.Set(hashes[i] % m, true);
    }
}

Notice the way that the BitArray Set method sets the bit at the given location to either true, i.e. 1, or false, i.e. 0. When this method returns, all of the locations indicated by the hash array have been set to 1.

The final method is the one that looks up a value to see if it is already in the filter:

public Boolean LookUp(string s)
{
    int[] hashes = hashk(s, k);
    for (int i = 0; i < k; i++)
    {
        if (bloom[hashes[i] % m] == false) return false;
    }
    return true;
}

This simply scans the locations indicated by the hash array and returns true if they are all set. Notice that the only certainty is when you find a false, i.e. 0, BitArray element, because then you can be sure that the string has never been entered into the filter.

To try it out you would do something like:

Bloom MyBloom = new Bloom();
string data = "Hello Bloom Filter";
MyBloom.AddData(data);

and to lookup a string:

string data = "Hello Bloom Filter";
MessageBox.Show(MyBloom.LookUp(data).ToString());

Of course in a real situation you would add lots of data and look up lots of data.
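Pulling the pieces together, here is the whole filter as one runnable console sketch. I've added a small Demo class and, as above, masked the hash with & 0x7FFFFFFF instead of using Math.Abs:

```csharp
using System;
using System.Collections;

// The complete filter from the article, assembled so it compiles and runs as-is.
class Bloom
{
    const int m = 1000;               // bits in the filter
    const int k = 5;                  // hash values per item
    BitArray bloom = new BitArray(m); // all bits start at 0

    int[] HashK(string s)
    {
        var hashes = new int[k];
        hashes[0] = s.GetHashCode() & 0x7FFFFFFF;  // non-negative seed hash
        var r = new Random(hashes[0]);             // remaining hashes from seeded generator
        for (int i = 1; i < k; i++) hashes[i] = r.Next();
        return hashes;
    }

    public void AddData(string s)
    {
        foreach (int h in HashK(s)) bloom.Set(h % m, true);
    }

    public bool LookUp(string s)
    {
        foreach (int h in HashK(s))
            if (!bloom[h % m]) return false;  // a zero bit: definitely never seen
        return true;                          // all ones: probably seen
    }
}

class Demo
{
    static void Main()
    {
        var filter = new Bloom();
        filter.AddData("Hello Bloom Filter");
        Console.WriteLine(filter.LookUp("Hello Bloom Filter")); // True - always
        Console.WriteLine(filter.LookUp("never added"));        // almost certainly False
    }
}
```

An item that was added is always found; an item that wasn't is rejected with high probability, but never with certainty.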

If you plan to make use of a Bloom filter, then it is worth first writing two good independent hash functions. However, you are also probably going to want to implement it in something closer to the machine, like C if not assembler, to get the last drop of performance out of it.
