Contents

Quotient Filter Explained

Probabilistic Data Structure To Check Membership


The target audience for this article falls into the following roles:

  • Tech workers
  • Students
  • Engineering managers

The prerequisite to reading this article is fundamental knowledge of data structures and algorithms. This article does not cover an in-depth guide on probabilistic data structures.



Get the powerful template to approach system design for FREE on newsletter sign-up:




Introduction

Data structures such as HashSet can be used in a small data set to test if an item is a member of the data set. However, the operation to test if an item is a member of a large dataset can be expensive. The time complexity and space complexity can be linear in the worst case.

Probabilistic data structures offer constant time complexity and constant space complexity at the expense of providing a probabilistic answer.




Requirements

Design a data structure with the following characteristics:

  • constant time complexity to test membership of items in a large dataset
  • constant space complexity to test membership of items in a large dataset
  • test membership may yield approximate results
  • items can be deleted
  • the data structure can be resized



What is a quotient filter?

A quotient filter is a space-efficient probabilistic data structure that is used to test whether an item is a member of a set. The quotient filter will always say yes if an item is a set member. However, the quotient filter might still say yes although an item is not a member of the set (false positive). The quotient filter stores only a part of the item’s hash fingerprint along with additional metadata bits. The quotient filter can be resized on demand. The quotient filter supports the following operations:

  • adding an item to the set
  • test the membership of an item in the set
  • deletion of an item from the set



How does a quotient filter work?

The quotient filter data structure is a bit array of length n with additional metadata bits. The hash function generates a p-bits fingerprint as shown in Figure 1. The r least significant bits are called the remainder while the q = p - r most significant bits are called the quotient. The quotient filter has n = 2^q buckets.

Figure 1: Quotient filter partitioning
Figure 1: Quotient filter partitioning

Assume the hash function output is 8-bits long. In Figure 2, the bucket position is indicated by the quotient bits (0–7) in binary number format. All the bits in the quotient filter are set to zero on initialization (an empty quotient filter). The quotient filter utilizes only a hash function such as MurmurHash, unlike the k-hash functions in the bloom filter.

Figure 2: Empty quotient filter
Figure 2: Empty quotient filter

The hash output is split into two parts: quotient and remainder. The quotient filter does not store the item but only a part of the generated hash output. The quotient is used to identify the bucket position while the remainder gets stored in the bucket. For instance, if the output of the hash function is 8 bits (p), and the quotient filter size is eight, a power of two, n = 2^q = 2^3. The quotient (q) is the most significant bits (q = 3) and the remainder (r) is p - q = 8 - 3 = 5 bits.

The original bucket to which the quotient of the hash output points is known as the canonical bucket. The set of fingerprints having the same quotient is known as run. The additional metadata bits in the quotient filter perform the following function:

  • is_occupied: set when a bucket is a canonical bucket for some item
  • is_continuation: set when a bucket is occupied but not by the first remainder in a run
  • is_shifted: set when the remainder in a bucket is not in its canonical bucket



How to add an item to the quotient filter?

In Figure 3, the hash output is 8 bits (p). The size of the quotient filter is 8 (n) yielding a quotient (q) of 3 bits and the remainder (r) of 5 bits. The following operations are executed to add an item to the quotient filter:

  1. the item is hashed through a hash function
  2. the quotient (q) and the remainder (r) are calculated
  3. the quotient is used to identify the bucket position
  4. the remainder is stored in the corresponding bucket
Figure 3: Adding an item to the quotient filter
Figure 3: Adding an item to the quotient filter

In Figure 3, item one is added to the quotient filter. The bucket that should store the fingerprint for item one is identified by the quotient in the hash output.

quotient filter partitioning
hash(one) = 00001010 = 000 (quotient) | 01010 (remainder)

The binary number 000 (quotient) represents the zeroth bucket. The remainder 01010 is stored in the zeroth bucket.

Hard collisions occur in the quotient filter when multiple items result in the same hash output. The hash function output must be uniformly distributed and independent to prevent hard collisions.

Soft collisions occur in the quotient filter when multiple items have the same quotient. The metadata bits are used in the quotient filter due to the following reasons:

  • resolve soft collisions in the quotient filter
  • enable recovery of the original hash output

In Figure 4, item one is added to the quotient filter. The bucket that should store the fingerprint for item one is identified by the quotient of the hash output. The following operations are executed to add item one to the quotient filter:

  1. calculate hash(one) = 00001010 = 000 (quotient) | 01010 (remainder)
  2. 000 (quotient) is used to identify the bucket position
  3. 01010 (remainder) is stored in the bucket
  4. set metadata bit is_occupied to 1 as the item one hashed into the bucket
Figure 4: Soft collisions in quotient filter
Figure 4: Soft collisions in quotient filter

However, items two and three have the same quotient resulting in a soft collision. The following operations are executed to add item two to the quotient filter:

  1. calculate hash(two) = 00100111 = 001 (quotient) | 00111 (remainder)
  2. 001 (quotient) is used to identify the bucket position
  3. 00111 (remainder) is stored in the bucket
  4. set metadata bit is_occupied to 1 as item two hashed into the bucket

The following operations are executed to add item three to the quotient filter:

  1. calculate hash(three) = 00110001 = 001 (quotient) | 10001 (remainder)
  2. 001 (quotient) is used to identify the bucket position
  3. bucket at position 001 is already taken because the is_occupied metadata bit is set to 1
  4. linear probing is performed to identify the subsequent (on the right) empty bucket at position 010
  5. 10001 (remainder) is stored in the bucket at position 010
  6. metadata bit is_occupied is unset as the bucket at position 010 is not the canonical bucket of item three
  7. metadata bit is_continuation is set to 1 as item two also hashed into the same bucket as item three
  8. metadata bit is_shifted is set to 1 as item three is not stored in the original bucket at position 001



Further system design learning resources

Get the powerful template to approach system design for FREE on newsletter sign-up:




How to check the membership of an item in the quotient filter?

The following operations are executed to check if item four is a member of the quotient filter:

  1. item four is hashed through the same hash function
  2. calculate hash(four) = 01100000 = 011 (quotient) | 00000 (remainder)
  3. verify if the metadata bit is_occupied at the bucket position 011 (quotient) is set
Figure 5: Check the membership of an item not in the quotient filter
Figure 5: Check the membership of an item not in the quotient filter

In Figure 5, the metadata bit is_occupied is unset, which shows that item four is not a member of the quotient filter. The following operations are executed to test if item two is a member of the filter:

  1. item two is hashed through the same hash function
  2. calculate hash(two) = 00100111 = 001 (quotient) | 00111 (remainder)
  3. verify if the metadata bit is_occupied at the bucket position 001 is set
Figure 6: Check the membership of an item in the quotient filter
Figure 6: Check the membership of an item in the quotient filter

In Figure 6, the metadata bit is_occupied is set to 1, which shows that item two might be a member of the quotient filter. The uncertainty about the membership of an item is due to the possibility of hard collisions on the hash function output. In case, the is_occupied bit is unset but the other metadata bits are set on the corresponding bucket, perform a linear probing operation to test the membership of an item.




Quotient filter false positive

The probability of a false positive can be reduced by increasing the storage size of the quotient filter. The probability of false positives increases when the load factor (the number of occupied buckets) increases.




How to resize the quotient filter?

Figure 7: Resizing the quotient filter
Figure 7: Resizing the quotient filter

The most significant bit from the remainder is moved to the least significant bit position of the quotient to double the size of the quotient filter.




What is the asymptotic complexity of the quotient filter?

The performance of the quotient filter depends on the hash function used. The faster the computation of the hash function, the quicker the overall time of each operation on the quotient filter.


Time complexity

The quotient filter offers a constant time complexity of O(1) on average for most operations. However, in the worst case, all operations might take logarithmic time complexity O(log(n)), where n is the count of buckets in the quotient filter.


Space complexity

The quotient filter does not store the data items but a part of the fingerprint yielding a constant space complexity of O(1).




Advantages of quotient filter

The benefits of the quotient filter are the following:

  • items can be deleted
  • quotient filter can be resized without rehashing data
  • disk-based data structure due to sequential disk access pattern
  • constant time complexity
  • constant space complexity
  • no false negatives



Quotient filter disadvantages

The limitations of the bloom filter are the following:

  • quotient filters consume around 20% more space than bloom filters due to metadata bits
  • false positives rate can’t be reduced to zero
  • quotient filter is relatively complex to implement



Quotient filter use cases

Figure 8: Quotient filter used to reduce disk lookups in a database
Figure 8: Quotient filter used to reduce disk lookups in a database

The quotient filters are useful in high-ingestion systems to prevent expensive disk lookups. In Figure 8, for instance, the server performing a lookup of an item in a large table on the disk can degrade the throughput of the service. The quotient filter can be used to serve the lookups and reduce the unnecessary disk I/O on the database server except when the quotient filter returns a false positive. The applications of the quotient filter are the following:

  • reducing disk lookups for the non-existing keys in a database
  • Log-structured merge tree (LSM) storage engine used in search engines such as ElasticSearch uses the quotient filter to check if the key exists in the SSTable



Quotient filter implementation

The C language implementation of a quotient filter for academic purposes can be found on GitHub.com/vedantk. In practice, a hash function with 64-bit (p) output can be used to generate a quotient filter, which can hold around 1 billion items. The quotient takes the majority of the hash output bits on partitioning.




Variants of quotient filter

There are more data structures that offer similar functionality as the quotient filter. The popular variants of the quotient filter are the following:




Summary

The quotient filters are a useful probabilistic data structure in high-ingestion systems. The quotient filter offers constant time complexity and constant space complexity. The tradeoff is the quotient filter yielding a probabilistic result and consuming relatively more space than the bloom filter.




What to learn next?

Get the powerful template to approach system design for FREE on newsletter sign-up:




License

CC BY-NC-ND 4.0: This license allows reusers to copy and distribute the content in this article in any medium or format in unadapted form only, for noncommercial purposes, and only so long as attribution is given to the creator. The original article must be backlinked.




References