The Invertible Bloom Filter
Written by Mike James   
Friday, 15 July 2022
Article Index
The Invertible Bloom Filter
Retrieval

Retrieval

This is all fairly simple but notice that due to hash collisions some elements of the array may be used to store more than one data element. You can tell which these are by the value of the count field. Any array element that has a count field greater than one cannot be used to retrieve the data as the value and key are storing something that results from XORing with multiple data values.

However, any item of data is stored in k different locations and for there to be no location with a count of one would mean that all of the hash functions had been subject to a collision - and that's not very likely. 

So what this means is we can implement a probabilistic retrieval operation:

for each hi(x) do
 if B[hi(x)].count=0 return null
 if B[hi(x)].count=1 AND
        B[hi(x)].key=x return B[hi(x)].value
 else return null
end for
return not_found

You can see the overall idea search each of the locations indicated by each has function for an entry that has a count of 1. Check that the keys match and return the value. If you can't find such an entry or if the keys don't match then return a null to indicated that that the item isn't in the filter.

Why do we need to check the key value?

The answer is that just as in the case of the standard Bloom filter is is possible that due to collisions all of the entries could have been generated by other data items. So even if all of the entries you check have non-zero counts they could have been generated by chance due to collisions and the only way to rule this out is to store and check the key.

Of course there is a price to pay for this fast retrieval. It is possible that the data item is stored in the array but hash collisions have stored other data items in the same locations. In this case the method returns a not_found to indicate that the value may be in the filter even though it cannot be retrieved.

This corresponds to the following operation:

GET(x): return the value y such that there is a key-value pair, (x; y), in B. If  null is returned, then (x; y) is not in B for any value of y. With low (but constant) probability, this operation may fail, returning a “not found” error condition. In this case there may or may not be a key-value pair (x; y) in B.


Delete

The next operation to consider is how to delete or remove an entry. This is surprisingly easy as it corresponds to an insert operation but one that decrements the count.

All you have to do is compute the hash functions and XOR the values again but subtract one from the count:

for each hi(x) do
 XOR B[hi(x)].key with x
 XOR B[hi(x)].value with y

 subtract
one from B[hi(x)].count
end for

This works because the XOR undoes the previous XOR operation - recall that it is its own inverse.

Notice that deleting a data value might reduce an element's  count back to one and so it undoes any previous collisions. This is, in fact, also the way you can list as many values as possible from those that have been stored in the filter.

The operation corresponds to:

DELETE(x; y): delete the key-value pair, (x; y), from B. This operation always succeeds, provided (x; y) is in B.

 

bloomlogo

List entries

Listing is the final operation and if you have been following the descriptions of the other operations you should be able to guess how it is going to work.

First all of the elements in the filter with a count of one have valid data and key storage elements. So the first task is to scan the storage and extract all of the elements with B[i].count=1. You can then add

(B[i].key,B[i].value) 

to the output list.

Notice there will be duplicates but this doesn't matter as you can remove duplicate keys.

This isn't the end of what you can do, however, because if you remove all such entries from the filter then there is a chance that it will undo any collisions that they caused. If this reduces the count of any element to one then another data element can be retrieved.

So the full algorithm is:

while there is an B[i].count=1 do
 add (B[i].key,B[i].value) to the output list
 perform DELETE(B[i].key,B[i].value)
end while
if the list is not empty set incomplete_list
return output_list

You can see that if all of the collisions can be undone then the list should be empty at the end of the operation,

This corresponds to the operation:

LISTENTRIES(): list all the key-value pairs being stored in B. With low (inverse polynomial in t) probability, this operation may return a partial list along with an incomplete_list status.

So how well does this work?

In terms of performance, if you are using k hash functions and the array has t elements then insertions, deletions and lookups take O(k) and listing takes O(t).

The probability that a GET operation will return a "not found" error, i.e. the data might be in the table but due to collisions it cannot be found,  when the data actually is in the table is the same as the false positive rate for the corresponding Bloom filter. 

form2

where k is the number of hash functions, m is the size of the bit array and n is the number of items stored in the bit array.

This can be made as small as you like by increasing the number of hash functions in use and the size of the table. The same is true of the probability of getting an incomplete listing of the data from the table - roughly O(t-k+2) as long as some conditions are met.

The invertible Bloom filter has lots of surprising uses including working out if two databases or two sets store the same elements. Basically what you do is create an invertible filter from one set, then delete all the elements of the other set and the elements that are left at the end of the operation are those that were only in one of the two sets.

You can even discover what the elements are and which set they belonged to, but this is another story.

bloomicon

Further Reading

The Bloom Filter

Invertible Bloom Lookup Tables

Related articles

Hashing

Hashing solves one of the basic problems of computing - finding something that you have stored somewhere.

Advanced Hashing

Extensible hashing and perfect hashing are ideas that are worth exploring.

Universal Hashing

A look at a way of creating a set of hash functions.

Killer Algorithms

killercover

Contents

  1. Algorithms - clever ways of doing things
  2. Finding Things With Binary Search
  3. Hashing - The Greatest Idea In Programming 
  4. Advanced Hashing*
  5. Universal Hashing*
  6. QuickSort Exposed*
  7. Quick Median*
  8. The Fast Fourier Transform
  9. Backprop
  10. The Bloom Filter 
  11. The Invertible Bloom Filter !!!NEW
  12. Kalman Filter
  13. Public Key Encryption*
  14. Linear Programming

* First Draft

espbook

 

Comments




or email your comment to: comments@i-programmer.info

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Banner


Virtual Memory

Virtual memory is a way of pretending that your computer has more memory than it really has. But like all good things it comes at a cost. Virtual memory is an example of trading speed for storage.



Cache Memory And The Caching Principle

The caching principle is very general but it is best known for its use in speeding up the CPU. We take a look a the basics of cache memory, how it works and what governs how big it needs to be to do i [ ... ]


Other Articles

 



Last Updated ( Friday, 15 July 2022 )