The Invertible Bloom Filter
Written by Mike James
Thursday, 29 December 2016
If you think that the Bloom filter is magic, wait until you see the invertible Bloom filter. This not only keeps a record of data, it allows you to add, delete and make a list of the data you have stored.
There is something special about using hash functions to manage storage. It seems to give you magical powers at no cost. The Bloom filter, for example, can tell you almost instantly if you have ever encountered an item of data before. The catch is that if it tells you that you have never seen the data then it is always 100% correct, but if it tells you that you have then it might be wrong. You can make the probability of a false positive as small as you like, but it is the price you pay for the lightning lookup time.
Bloom Filter Basics
The Bloom filter is easy to describe, but if you want a full account, including a C# implementation demonstrating how it works, then see: The Bloom Filter.
For a quick summary:
Assuming you have k hash functions h1,h2 .. hk and a bit array B, then when an item of data arrives you set the bits stored in the bit array at h1(d), h2(d) .. hk(d).
That is, after the update, B[h1(d)], B[h2(d)] .. B[hk(d)] are all set to 1.
When a new item of data x arrives and you want to know if you have encountered it before, you simply work out the hash functions h1(x), h2(x) .. hk(x) and look in the corresponding locations in the bit array B. If any one is zero then you can conclude with certainty that you have not encountered the data before - if you had, the bit would have been set.
If all of the bits are set to one you can't conclude with certainty that you have seen the data item. This is in the nature of a hash function: it can map two different data items to the same location. In other words, for some data items, a and b, it occasionally happens that h(a)=h(b). This is usually referred to as a hash collision.
What this means is that other data might have set some of the bits.
The method can tolerate a few bits that are accidentally set, but it is possible for them all to be set by data other than x.
However, by using a big enough bit array and a suitable number of hash functions you can make the probability of a false positive as small as you like. You trade off the slight chance that you get a false positive for the speed and storage economy offered by a Bloom filter.
In general, Bloom filters are ideal when you need to check for the presence/absence of some data element and the cost of getting the presence test wrong is low.
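The summary above can be sketched in a few lines of Python. This is only a minimal illustration, not the C# implementation referred to earlier: the class name `BloomFilter`, the method names and the default sizes are assumptions, and the k hash functions are simulated by salting SHA-256 with the index i.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted SHA-256 hashes over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item):
        # Assumption: hash function h_i is SHA-256 of the item prefixed with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set the bits at h1(d), h2(d) .. hk(d).
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False means "definitely never added"; True means "probably added".
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
print(bf.might_contain("apple"))  # False: nothing has been added yet
bf.add("apple")
print(bf.might_contain("apple"))  # True: an added item is never reported missing
```

A `True` result for an item that was never added is the false positive discussed above; a `False` result is always correct.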
An Invertible Bloom Filter
The principle of Bloom filters is both clever and satisfying but has some drawbacks. In particular you can't remove a data item from a filter because you might zero a bit that was also set by another data item and so mark it as not being in the filter as well.
You also cannot use a Bloom filter to make a list of what is stored in the filter or retrieve an item of data based on a key. Sometimes not being able to retrieve a value is a good thing from the point of view of security or privacy, but other applications need retrieval.
The invertible Bloom filter works in more or less the same way as a basic Bloom filter, but it works with key value pairs (x,y) and instead of a bit array it uses a three-component data structure that can store the key x, the value y and a count.
So B[i].count is the number of times B[i] has been used, B[i].key is the key and B[i].value is the value stored.
When a key value pair (x,y) needs to be stored all you do is compute the hash functions on the key, h1(x), h2(x) .. hk(x), store x and y in each of those locations and increment the corresponding count.
Of course, this being a Bloom filter, the hash functions will result in storing multiple data items in the same location.
So how do we cope with this?
You can't simply store x and then store z in the same location because this would wipe out all trace of x. The solution is to use a reversible storage function. For example, if you store a value in B by adding it, B = B + x, you can remove it by subtracting it, B = B - x.
If B already stored a value before you added x then when you subtract x you get the value back again.
You can use addition but a much easier function to use is XOR. If you XOR a value with another then XORing it a second time returns you to the original value. For example, 0101 XOR 0011 gives 0110. Do the same operation on the result, 0110 XOR 0011, and you get 0101, the number you started with.
In other words, (A XOR B) XOR B = A, and XOR is its own inverse operation.
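Both reversible storage functions are easy to try out. The short Python sketch below uses a single integer variable, `cell`, to stand in for one location of B; the variable names and the values 12 and 7 are arbitrary choices made for illustration.

```python
x, z = 12, 7

# Reversible storage by addition: subtracting z recovers x.
cell = 0
cell += x
cell += z
cell -= z
print(cell)  # 12

# Reversible storage by XOR: the same operation both stores and removes.
cell = 0
cell ^= x
cell ^= z    # cell now holds x XOR z (11), neither value on its own
cell ^= z
print(cell)  # 12
```

With XOR there is no need for a separate "remove" operation, which is what makes it the easier choice.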
To create the invertible Bloom filter all we do is XOR the key and the value into the key and value elements of each location selected by the hash functions, and add one to its count.
This is the complete algorithm for storing an element and it corresponds to the operation:
INSERT(x; y): insert the key-value pair, (x; y), into B.
This operation always succeeds, assuming that all keys are distinct.
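As a rough sketch of INSERT in Python, under some stated assumptions: keys and values are integers so that they can be XORed directly, the names `Cell`, `InvertibleBloomFilter` and `_locations` are invented for this example, and the k hash functions are simulated with salted SHA-256. One further assumption is that the array is split into k equal sub-tables, one per hash function, so that a single key can never hit the same cell twice and cancel itself out.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Cell:
    count: int = 0   # number of pairs stored in this cell
    key: int = 0     # XOR of all keys stored here
    value: int = 0   # XOR of all values stored here

class InvertibleBloomFilter:
    def __init__(self, m=30, k=3):
        if m % k != 0:
            raise ValueError("m must be a multiple of k")
        self.k = k
        self.sub = m // k                     # size of each sub-table
        self.B = [Cell() for _ in range(m)]

    def _locations(self, x):
        # Hash function i maps into sub-table i (an assumption of this sketch).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield i * self.sub + int.from_bytes(digest[:8], "big") % self.sub

    def insert(self, x, y):
        for i in self._locations(x):
            cell = self.B[i]
            cell.count += 1
            cell.key ^= x
            cell.value ^= y

ibf = InvertibleBloomFilter()
ibf.insert(42, 7)
print([(c.count, c.key, c.value) for c in ibf.B if c.count])
# [(1, 42, 7), (1, 42, 7), (1, 42, 7)]
```

After a single insert the pair sits untouched in k cells, each with a count of one. Further inserts may land on some of the same cells, raising their count and mixing the stored keys and values.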
This is all fairly simple but notice that due to hash collisions some elements of the array may be used to store more than one data element. You can tell which these are by the value of the count field. Any array element that has a count field greater than one cannot be used to retrieve the data as the value and key are storing something that results from XORing with multiple data values.
However, any item of data is stored in k different locations and for there to be no location with a count of one would mean that all of the hash functions had been subject to a collision - and that's not very likely.
So what this means is we can implement a probabilistic retrieval operation.
The overall idea is to search each of the locations indicated by each hash function. If any location has a count of zero then the key was never stored and you can return a null. If you find an entry that has a count of 1, check that the keys match and return the value; if the keys don't match, return a null to indicate that the item isn't in the filter. If every location has a count greater than one, return not_found.
Why do we need to check the key value?
The answer is that, just as in the case of the standard Bloom filter, it is possible that due to collisions all of the entries could have been generated by other data items. So even if all of the entries you check have non-zero counts they could have been generated by chance due to collisions and the only way to rule this out is to store and check the key.
Of course there is a price to pay for this fast retrieval. It is possible that the data item is stored in the array but hash collisions have stored other data items in the same locations. In this case the method returns a not_found to indicate that the value may be in the filter even though it cannot be retrieved.
This corresponds to the following operation:
GET(x): return the value y such that there is a key-value pair, (x; y), in B. If null is returned, then (x; y) is not in B for any value of y. With low (but constant) probability, this operation may fail, returning a “not found” error condition. In this case there may or may not be a key-value pair (x; y) in B.
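A possible Python sketch of GET follows. It makes its own assumptions: integer keys and values, salted SHA-256 standing in for the k hash functions, one sub-table per hash function, and the invented names `Cell`, `InvertibleBloomFilter` and `NOT_FOUND`. Python's `None` plays the role of null and a separate sentinel object plays the role of the "not found" condition.

```python
import hashlib
from dataclasses import dataclass

NOT_FOUND = object()  # sentinel: "may be present, but cannot be retrieved"

@dataclass
class Cell:
    count: int = 0
    key: int = 0
    value: int = 0

class InvertibleBloomFilter:
    def __init__(self, m=30, k=3):
        self.k, self.sub = k, m // k          # k sub-tables of equal size
        self.B = [Cell() for _ in range(self.sub * k)]

    def _locations(self, x):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield i * self.sub + int.from_bytes(digest[:8], "big") % self.sub

    def insert(self, x, y):
        for i in self._locations(x):
            cell = self.B[i]
            cell.count += 1
            cell.key ^= x
            cell.value ^= y

    def get(self, x):
        for i in self._locations(x):
            cell = self.B[i]
            if cell.count == 0:
                return None        # an empty cell: x was never stored
            if cell.count == 1:
                # A lone entry: either it is x, or x is not in the filter.
                return cell.value if cell.key == x else None
        return NOT_FOUND           # every cell is shared with other keys

ibf = InvertibleBloomFilter()
ibf.insert(42, 7)
print(ibf.get(42))                 # 7
print(ibf.get(99))                 # None

tiny = InvertibleBloomFilter(m=3, k=3)  # one cell per hash: everything collides
tiny.insert(1, 10)
tiny.insert(2, 20)
print(tiny.get(1) is NOT_FOUND)    # True
```

The deliberately undersized `tiny` filter forces every cell to hold two pairs, which is the situation in which a stored key cannot be retrieved.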
The next operation to consider is how to delete or remove an entry. This is surprisingly easy as it corresponds to an insert operation but one that decrements the count.
All you have to do is compute the hash functions and XOR the key and the value again, but subtract one from the count.
This works because the XOR undoes the previous XOR operation - recall that it is its own inverse.
Notice that deleting a data value might reduce an element's count back to one and so it undoes any previous collisions. This is, in fact, also the way you can list as many values as possible from those that have been stored in the filter.
The operation corresponds to:
DELETE(x; y): delete the key-value pair, (x; y), from B. This operation always succeeds, provided (x; y) is in B.
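DELETE can be sketched in Python as the mirror image of INSERT. The assumptions are integer keys and values, salted SHA-256 as the hash functions, one sub-table per hash function, and the invented names `Cell`, `InvertibleBloomFilter` and `_update`. The example uses a deliberately tiny filter with one cell per hash function so that the two pairs are certain to collide.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Cell:
    count: int = 0
    key: int = 0
    value: int = 0

class InvertibleBloomFilter:
    def __init__(self, m=30, k=3):
        self.k, self.sub = k, m // k          # k sub-tables of equal size
        self.B = [Cell() for _ in range(self.sub * k)]

    def _locations(self, x):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield i * self.sub + int.from_bytes(digest[:8], "big") % self.sub

    def _update(self, x, y, delta):
        # The XOR is identical for insert and delete; only the count differs.
        for i in self._locations(x):
            cell = self.B[i]
            cell.count += delta
            cell.key ^= x
            cell.value ^= y

    def insert(self, x, y):
        self._update(x, y, +1)

    def delete(self, x, y):
        # Only valid if (x, y) really is in the filter.
        self._update(x, y, -1)

ibf = InvertibleBloomFilter(m=3, k=3)
ibf.insert(1, 10)
ibf.insert(2, 20)
print(ibf.B[0])   # Cell(count=2, key=3, value=30)
ibf.delete(2, 20)
print(ibf.B[0])   # Cell(count=1, key=1, value=10)
```

Deleting the second pair brings the count back to one and leaves the first pair readable again, which is the effect described above.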
Last Updated ( Thursday, 29 December 2016 )