Bloom filters are one of the really ingenious and simple building blocks for randomized data structures. A great summary is the paper by Broder and Mitzenmacher. In this post I will briefly review its key ideas since it forms the basis of the Count-Min sketch of Cormode and Muthukrishnan, it will also be necessary for an accelerated version of the graph kernel of Shervashidze and Borgwardt, and finally, a similar structure will be needed to compute data streams over time for a real-time sketching service.
At its heart a bloom filter uses a bit vector of length N and a set of k hash functions mapping arbitrary keys x into their hash values (h_i(x) in [1 .. N]) where (i in 1 .. k\) denotes the hash function. The Bloom filter allows us to perform approximate set membership tests where we have no false negatives but we may have a small number of false positives.
Initialize(b): Set all (b[i] = 0)
Insert(b,x): For all (i in 1 .. k\) set (b[h_i(x)] = 1)
Query(b, x): Return true if (b[h_i(x)] = 1) for all (i in 1 .. k\), false otherwise
Furthermore, unions and intersections between sets are easily achieved by performing bit-wise OR and AND operations on the bloom hashes of the corresponding sets respectively.
It is clear that if we inserted x into the Bloom filter the query will return true, since all relevant bits in b are 1. To analyze the probability of a false positive take the probability of a bit being 1. After inserting m items using k hash functions on a range of N we have
$$Pr(b[i] = mathrmTRUE) = 1 – (1 – frac1N)^k m approx 1 – e^-frackmN$$
For a false positive to occur we need to have all k bits associated with the hash functions to be 1. Ignoring the fact that the hash functions might collide the probability of false positives is given by
$$p approx (1 – e^-frackmN)^k$$
Taking derivatives with respect to (frackmN) shows that the minimum is obtained for (log 2), that is (k = fracNm log 2).
One of the really nice properties of the Bloom filter is that all memory is used to store the information about the set rather than an index structure storing the keys of the items. The downside is that it is impossible to read out b without knowing the queries. Also note that it is impossible to remove items from the Bloom filter once they’ve been inserted. After all, we do not know whether some of the bits might have collided with another key, hence setting the corresponding bits to 0 would cause false negatives.