Bloom filter - Brendan Shih

# Bloom filter ![rw-book-cover](https://readwise-assets.s3.amazonaws.com/static/images/article0.00998d930354.png) ## Metadata - Author: [[Wikimedia Foundation, Inc.]] - Full Title: Bloom filter - Category: #articles - Summary: In computing, a Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. ## Highlights - [False positive](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors) matches are possible, but [false negatives](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors) are not – in other words, a query returns either "possibly in set" or "definitely not in set". ([View Highlight](https://read.readwise.io/read/01jr3atv2nc055k2jr03m79cf7)) - Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if "conventional" error-free [hashing](https://en.wikipedia.org/wiki/Hash_function) techniques were applied. ([View Highlight](https://read.readwise.io/read/01jr3awn20phvw3bhk1ksz2dev)) - An *empty Bloom filter* is a [bit array](https://en.wikipedia.org/wiki/Bit_array) of m bits, all set to 0. It is equipped with k different [hash functions](https://en.wikipedia.org/wiki/Hash_function), which map set elements to one of the m possible array positions. ([View Highlight](https://read.readwise.io/read/01jr3axv3wpqfjza91wy1wzrkb)) - To *add* an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1. ([View Highlight](https://read.readwise.io/read/01jr3az4hh6mst1b4j1bwkpcxs)) - Removing an element from this simple Bloom filter is impossible because there is no way to tell which of the k bits it maps to should be cleared. ([View Highlight](https://read.readwise.io/read/01jr3b062w754azq0ycq69a6sp)) - Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. ([View Highlight](https://read.readwise.io/read/01jr3hm477wcpsd1f1pj8jj754)) - Assume that a [hash function](https://en.wikipedia.org/wiki/Hash_function) selects each array position with equal probability. If *m* is the number of bits in the array, the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is 1 − 1 m . {\displaystyle 1-{\frac {1}{m}}.} ![{\displaystyle 1-{\frac {1}{m}}.}](https://wikimedia.org/api/rest_v1/media/math/render/svg/ba586e62be77ee25ad8bd3be82b5e301befca2b1) ([View Highlight](https://read.readwise.io/read/01jr3b35pp3mt67yxgym5qwrnm)) - Now test membership of an element that is not in the set. Each of the *k* array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the [algorithm](https://en.wikipedia.org/wiki/Algorithm) to erroneously claim that the element is in the set, is often given as ε = ( 1 − [ 1 − 1 m ] k n ) k ≈ ( 1 − e − k n / m ) k . {\displaystyle \varepsilon =\left(1-\left[1-{\frac {1}{m}}\right]^{kn}\right)^{k}\approx \left(1-e^{-kn/m}\right)^{k}.} ![{\displaystyle \varepsilon =\left(1-\left[1-{\frac {1}{m}}\right]^{kn}\right)^{k}\approx \left(1-e^{-kn/m}\right)^{k}.}](https://wikimedia.org/api/rest_v1/media/math/render/svg/de73929baec5fd76dde95874189051648c635b1d) ([View Highlight](https://read.readwise.io/read/01jr3b23x9gejrd2ppvzzbjmx5)) - This is not strictly correct as it assumes independence for the probabilities of each bit being set. However, assuming it is a close approximation we have that the probability of false positives decreases as *m* (the number of bits in the array) increases, and increases as *n* (the number of inserted elements) increases. ([View Highlight](https://read.readwise.io/read/01jr3b2wz7dk0wje5x7w5sktq0)) - The number of hash functions, *k*, must be a positive integer. Putting this constraint aside, for a given *m* and *n*, the value of *k* that minimizes the false positive probability is k = m n ln ⁡ 2. {\displaystyle k={\frac {m}{n}}\ln 2.} ![{\displaystyle k={\frac {m}{n}}\ln 2.}](https://wikimedia.org/api/rest_v1/media/math/render/svg/fabc2770225ac59fe42a78f75ea89de650f0130c) ([View Highlight](https://read.readwise.io/read/01jr3ctp87k2rtwd22p4gwbpwt)) - Some kinds of [superimposed code](https://en.wikipedia.org/wiki/Superimposed_code) can be seen as a Bloom filter implemented with physical [edge-notched cards](https://en.wikipedia.org/wiki/Edge-notched_card). An example is [Zatocoding](https://en.wikipedia.org/wiki/Zatocoding), invented by [Calvin Mooers](https://en.wikipedia.org/wiki/Calvin_Mooers) in 1947, in which the set of categories associated with a piece of information is represented by notches on a card, with a random pattern of four notches for each category. ([View Highlight](https://read.readwise.io/read/01jr3b3nwzr3sn46y2bchdf5pp))