Bloom Filters Explained

Probabilistic Data Structure To Check Membership

2023-03-06 2229 words 11 minutes

/bloom-filters-explained/bloom-filter.webp

Contents

The target audience for this article falls into the following roles:

Tech workers
Students
Engineering managers

The prerequisite to reading this article is fundamental knowledge of data structures and algorithms. This article does not cover an in-depth guide on probabilistic data structures.

Get the powerful template to approach system design for FREE on newsletter sign-up:

Introduction

Data structures such as HashSet can be used in a small data set to test if an item is a member of the data set. However, the operation to test if an item is a member of a large dataset can be expensive. The time complexity and space complexity can be linear in the worst case.

Probabilistic data structures offer constant time complexity and constant space complexity at the expense of providing an answer that is non-deterministic.

Requirements

Design a data structure with the following characteristics:

constant time complexity to test membership
a small amount of memory to test membership
insert and query operations should be parallelizable
test membership can yield approximate results

What is a bloom filter?

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an item is a member of a set. The bloom filter will always say yes if an item is a set member. However, the bloom filter might still say yes although an item is not a member of the set (false positive). The items can be added to the bloom filter but the items cannot be removed. The bloom filter supports the following operations:

adding an item to the set
test the membership of an item in the set

How does a bloom filter work?

The bloom filter data structure is a bit array of length n as shown in Figure 1. The position of the buckets is indicated by the index (0–9) for a bit array of length ten. All the bits in the bloom filter are set to zero when the bloom filter is initialized (an empty bloom filter). The bloom filter discards the value of the items but stores only a set of bits identified by the execution of hash functions on the item.

How to add an item to the bloom filter?

The following operations are executed to add an item to the bloom filter:

the item is hashed through k hash functions
the modulo n (length of bit array) operation is executed on the output of the hash functions to identify the k array positions (buckets)
the bits at all identified buckets are set to one

There is a probability that some bits on the array are set to one multiple times due to hash collisions.

Figure 2: Adding an item to the Bloom Filter

In Figure 2, the items red and blue are added to the bloom filter. The buckets that should be set to one for the item red are identified by the execution of the modulo operator on the computed hash value.

h1(red) mod 10 = 1
h2(red) mod 10 = 3
h3(red) mod 10 = 5

The buckets that should be set to one for the item blue are identified by the execution of the modulo operator on the computed hash value.

h1(blue) mod 10 = 4
h2(blue) mod 10 = 5
h3(blue) mod 10 = 9

The bucket at position five is set to one by distinct items red and blue.

How to check the membership of an item in the bloom filter?

The following operations are executed to check if an item is a member of the bloom filter:

the item is hashed through the same k-hash functions
the modulo n (length of bit array) operation is executed on the output of the hash functions to identify the k array positions (buckets)
verify if all the bits at identified buckets are set to one

Figure 3: Check the membership of an item in the bloom filter

If any of the identified bits are set to zero, the item is not a member of the bloom filter. If all the bits are set to one, the item might be a member of the bloom filter. The uncertainty about the membership of an item is due to the possibility of some bits being set to one by different items or due to hash function collisions.

In Figure 3, the bloom filter is queried to check the membership of item blue. The buckets that should be checked are identified by the execution of the modulo operator on the computed hash value.

h1(blue) mod 10 = 4
h2(blue) mod 10 = 5
h3(blue) mod 10 = 9

The item blue might be a member of the bloom filter as all the bits are set to one.

Figure 4: Check the membership of an item not in the bloom filter

In Figure 4, the bloom filter is queried to check the membership of item black, which is not a member of the bloom filter.

h1(black) mod 10 = 0
h2(black) mod 10 = 3
h3(black) mod 10 = 6

The item black is not a member of the bloom filter as the bit at position zero is set to zero. The verification of the remaining bits can be skipped.

Bloom filter false positive

Figure 5: Bloom filter yielding a false positive

In Figure 5, the bloom filter is queried to check the membership of item green, which is not a member of the bloom filter.

h1(green) mod 10 = 3
h2(green) mod 10 = 4
h3(green) mod 10 = 5

The bloom filter will say yes although the item green is not a member of the bloom filter as the bits were set to one by items blue and red.

Further system design learning resources

Get the powerful template to approach system design for FREE on newsletter sign-up:

What is the asymptotic complexity of the bloom filter?

The performance of the bloom filter depends on the hash functions used. The faster the computation of the hash function, the quicker the overall time of each operation on the bloom filter.

Time complexity

Operation	Time complexity
add item	O(k) or constant
membership query	O(k) or constant

where k is the number of hash functions.

The time complexity of the bloom filter is independent of the number of items already in the bloom filter. The k lookups in the bloom filter are independent and can be parallelized.

Space complexity

Regardless of the number of items in the bloom filter, the bloom filter requires a constant number of bits by reserving a few bits per item. The bloom filter does not store the data items yielding a constant space complexity of O(1).

Bloom filter calculator

The hur.st bloom filter calculator can be used to choose an optimal size for the bloom filter and explore how the different parameters interact. The accuracy of the bloom filter depends on the following:

number of hash functions (k)
quality of hash functions
length of the bit array (n)
number of items stored in the bloom filter

The properties of an optimal hash function for the bloom filter are the following:

fast
independent
uniformly distributed

Advantages of bloom filter

The advantages of the bloom filter are the following:

constant time complexity
constant space complexity
operations are parallelizable
no false negatives
enables privacy by not storing actual items

Bloom filter disadvantages

The limitations of the bloom filter are the following:

bloom filter doesn’t support the delete operation
false positives rate can’t be reduced to zero
bloom filter on disk requires random access due to random indices generated by hash functions

Removing an item from the bloom filter is not supported because there is no possibility to identify the bits that should be cleared. There might be other items that map onto the same bits and clearing any of the bits would introduce the possibility of false negatives.

Bloom filter use cases

Figure 6: Bloom filter used to reduce disk lookups in a database

The bloom filters are useful in high-ingestion systems to prevent expensive disk seeks. In Figure 6, for instance, the server performing a lookup of an item in a large table on the disk can degrade the throughput of the service. The bloom filter in memory can be used to serve the lookups and reduce the unnecessary disk I/O except when the bloom filter returns a false positive. The applications of the bloom filter are the following:

reducing disk lookups for the non-existing keys in a database
determining whether a user ID is already taken
filtering out previously shown posts on recommendation engines
checking words for misspellings and profanity with a spellchecker
identifying malicious URLs, blocked IPs, and fraudulent transactions
Log-structured merge tree (LSM) storage engine used in databases such as Cassandra uses a bloom filter to check if the key exists in the SSTable
MapReduce uses the bloom filter for the efficient processing of large datasets

Bloom filter implementation

RedisBloom module provides bloom filter support out of the box. The bloom filter can take as little as 2% of the memory a set requires and are typically faster than sets. The node.js implementation of a bloom filter for academic purposes can be found on github.com/guyroyse. Non-cryptographic hash functions such as MurmurHash or FNV-1a can be used in bloom filters.

Extensions of bloom filter

Counting bloom filter

The counting bloom filter includes an array of counters for each bit in the bloom filter array. The counter array is initialized to zeros. The counter for associated bits is incremented when an item is added to the counting bloom filter. The membership query is the same as the classical bloom filter. The counting bloom filter supports the delete operation. The following operations are executed to delete an item from the bloom filter:

the item is hashed through the same k-hash functions
the modulo n (length of bit array) operation is executed on the output of the hash functions to identify the k array positions (buckets)
the counter of the associated bits is decreased
the corresponding bit is unset to zero if the associated counter is decremented to zero

Naturally, Counting Bloom filters are much bigger than classical Bloom filters because additional memory has to be allocated for the counters even if most of them are zeros. Therefore, it is important to estimate how large such counters can become and how their size depends on the filter’s length m and the number of hash functions k.

Scalable bloom filter

The bloom filters cannot be rebalanced on saturation because it is not possible to identify the items that are members of the bloom filter. The bloom filters can be stacked for scalability. Once a bloom filter is saturated, a new bloom filter with a higher capacity is created on top of the current bloom filter. Membership queries require inspecting each bloom filter in the stack. The addition of a new item must check each bloom filter before adding the item to the top bloom filter.

Striped bloom filter

The striped bloom filter is implemented using an array of unsigned 64-bit integers and is divided into shards for concurrency. Each shard contains its mutex, allowing for a high degree of concurrent insert and lookup throughput. A restriction is that the bloom filter size must be a multiple of the number of shards.

Variants of bloom filter

There are data structures that offer similar functionality as the bloom filter. The popular variants of the bloom filter are the following:

Summary

The bloom filters are a useful data structure in high-ingestion systems. The bloom filter offers constant time complexity and constant space complexity. The tradeoff is the bloom filter yielding a probabilistic result.

What to learn next?

Get the powerful template to approach system design for FREE on newsletter sign-up:

License

CC BY-NC-ND 4.0: This license allows reusers to copy and distribute the content in this article in any medium or format in unadapted form only, for noncommercial purposes, and only so long as attribution is given to the creator. The original article must be backlinked.

References

RedisBloom, redis.io
Deduplication with redis, redis.com
Rob Edwards, Bloom Filters (2018), youtube.com
Suresh Kondamudi, How To Remove Duplicates In A Large Dataset Reducing Memory Requirements By 99% (2016), highscalability.com
Bloom Filter Datatype for Redis (2022), redis.com
Burton H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors (1970), uta.edu
Paulo Almeida et al., Scalable Bloom Filters (2007)
Andrei Broder, Michael Mitzenmacher, Network Applications of Bloom Filters: A Survey (2004), harvard.edu
Alfredo Luque, Hyperbloom (2014), GitHub.com
Bonomi, F., et al., An Improved Construction for Counting Bloom Filters (2006), harvard.edu
Tarkoma, S., et al., Theory and Practice of Bloom Filters for Distributed Systems (2012), ieee.org
Guy Royse, How to Use Bloom Filters in Redis (2020), youtube.com
Akash Jain, Probabilistic Data Structures for Big Data and Streaming Applications (2021), octo.vmware.com
Assoc. Prof. Milko Marinov, A Bloom Filter Application for Processing Big Datasets through MapReduce Framework (2021), infotech-bg.com

Contents

Bloom Filters Explained

Probabilistic Data Structure To Check Membership

Introduction

Requirements

What is a bloom filter?

How does a bloom filter work?

How to add an item to the bloom filter?

How to check the membership of an item in the bloom filter?

Bloom filter false positive

Further system design learning resources

What is the asymptotic complexity of the bloom filter?

Time complexity

Space complexity

Bloom filter calculator

Advantages of bloom filter

Bloom filter disadvantages

Bloom filter use cases

Bloom filter implementation

Extensions of bloom filter

Counting bloom filter

Scalable bloom filter

Striped bloom filter

Variants of bloom filter

Summary

What to learn next?

License

References

Join the System Design Journey

And get the powerful template to approach system design for FREE