Approximate Membership Query Filter

From HandWiki

Approximate Membership Query Filter (AMQ-Filter) is a group of space-efficient probabilistic data structures that supports approximate membership queries. An approximate membership query answers if an element is in a set or not with a false positive rate of [math]\displaystyle{ \epsilon }[/math]. Bloom filters are the most known AMQ-Filter, but there are other AMQ-Filters that support additional operations or have different space requirements.

AMQ-Filters have numerous applications, mainly in distributed systems and databases. There, they are often used to avoid network request or I/O operations that result from requesting elements that do not exist.

Approximate membership query problem

The approximate membership query problem is to store information about a set of elements S in a space-efficient way. The goal is to answer queries whether an element x is in the set S or not while allowing false positives with a maximal probability of [math]\displaystyle{ \epsilon }[/math]. All AMQ-Filter support this operation lookup. Dynamic AMQ-Filters allow insertions at any time whereas static AMQ-Filters have to be rebuilt after inserting additional elements. Some AMQ-Filter support additional operations like deleting elements or merging two filters.

Lookup

A lookup can determine if an element is in definitely not in the set or if an element is probably in the set:

[math]\displaystyle{ s \in S }[/math]: always returns true.

[math]\displaystyle{ s \notin S }[/math]: return false with a probability of [math]\displaystyle{ 1-\epsilon }[/math].

A false positive is a lookup of an element that is not part of the set, but the lookup returns true anyway. The probability of this happening is the false positive rate [math]\displaystyle{ \epsilon }[/math]. False negatives (the lookup returns false although the element is part of the set) are not allowed for AMQ-Filters.

Insertion

After an element is inserted the lookup for this element must return true. Dynamic AMQ-Filters support inserting elements one at a time without rebuilding the data structure. Other AMQ-Filters have to be rebuilt after each insertion. Those are called static AMQ-Filters.

False positive rate vs. space

There is a tradeoff between storage size and the false positive rate [math]\displaystyle{ \epsilon }[/math]. Increasing the storage space reduces the false positive rate. The theoretical lower bound is [math]\displaystyle{ log_2 (1/\epsilon) }[/math] bits for each element.[1] Dynamic AMQ-Filters cannot reach this lower bound. They need at least [math]\displaystyle{ n \log_2 (1/\epsilon)(1+ o(1)) }[/math] bits for [math]\displaystyle{ n }[/math] insertions.[2] Different AMQ-Filter have different ranges of false positive rates and space requirements. Choosing the best AMQ-Filter depends on the application.

Data structures

There are different ways to solve the approximate membership query problem. The most known data structure are Bloom filters, but there are other data structures that perform better for some false positive rates and space requirements, support additional operations, or have other insertion and lookup times. Some well known AMQ-Filters are:

Bloom filter

A Bloom filter is a bit array of [math]\displaystyle{ m }[/math] bits with [math]\displaystyle{ k }[/math] hash functions. Each hash function maps an element to one of the [math]\displaystyle{ m }[/math] positions in the array. In the beginning, all bits of the array are set to zero. To insert an element, all hash functions are calculated and all corresponding bits in the array are set to one. To lookup an element, all [math]\displaystyle{ k }[/math] hash functions are calculated. If all corresponding bits are set, true is returned. To reduce the false positive rate, the number of hash functions and [math]\displaystyle{ m }[/math] can be increased.

Quotient filter

The idea of Quotient filters is to hash an element and to split its fingerprint into the [math]\displaystyle{ r }[/math] least significant bits called the remainder [math]\displaystyle{ d_R }[/math] and the most significant bits called the quotient [math]\displaystyle{ d_Q }[/math]. The quotient determines where in the hash table the remainder is stored. Additional three bits for every slot in the hash table are used to resolve soft collisions (same quotient but different remainders).

The space used by Quotient filters is comparable to Bloom filters, but Quotient filters can be merged without affecting their false positive rate.

Cuckoo filter

Cuckoo filters are based on Cuckoo hashing, but only fingerprints of the elements are stored in the hash table. Each element has two possible locations. The second location is calculated based on the first location and the fingerprint of the element. This is necessary to enable moving already inserted elements if both possible slots for an element are full.

After reaching a load threshold the insertion speed of Cuckoo filter degrades. It is possible that an insertion fails, and the table must be rehashed. Whereas Bloom filters have always constant insertion time, but as the load factor increases the false positive rate increases as well.

A Cuckoo filter supports deleting elements if we know for certain that the element was inserted. This is an advantage over Bloom filters and Quotient filters which do not support this operation.

Xor filter

Xor filters[3] are static AMQ-Filters that are based on a Bloomier filter and use the idea perfect hash tables. Similar to Cuckoo filters they save fingerprints of the elements in a hash table. The idea is that a query for an element [math]\displaystyle{ x }[/math] is true if the xor of three given hash functions [math]\displaystyle{ h_0, h_1, h_2 }[/math] is the fingerprint of [math]\displaystyle{ x }[/math]. While building the hash table, each element is assigned one of its three slots in a way that no other elements are assigned to this slot. After all elements are assigned, we set for each element the value of its slot to the xor of the two other (not assigned) slots of the element and the fingerprint of the element. This construction algorithm can fail and no dynamic insertions are possible without rebuilding the hash table. This hash table can be constructed using only [math]\displaystyle{ 1.23 \log_2 (1/\epsilon) }[/math] bits per element.

The disadvantage of this filter is that the data structure has to be rebuilt if additional elements are added. They are used in applications where no elements have to added afterwards and space is of importance.

Application

Typical applications of AMQ-Filters are distributed systems and database systems. The AMQ-Filter functions as a proxy to the set of keys of a database or remote memory. Before a presumable slow query to the database or the remote memory is performed the AMQ-Filter is used to approximate if the key is in the database or remote memory. The slow query is only performed if the AMQ-Filter returns true. Only a false positive result leads to an unnecessary I/O operation or remote access. The applications are numerous and include package and resource routing, P2P networks, and distributed caching:[4]

AMQ-Filters are often used as an in-memory data structure to avoid expensive disk accesses. One application is Log-structured merge-trees or LSM-Trees. They have a fast in-memory component and one or multiple components on a disk which are trees themself. Elements are inserted into the in-memory component until it reaches its maximal size, than the in-memory component is merged with the disk components. To speed up the lookup many LSM-Trees implement AMQ-Filters like Bloom filters or Quotient filters. Those filters approximate for each component what elements are stored in it. LSM-Trees are used in databases like Apache AsterixDB, Bigtable, HBase, LevelDB, SQLite4.

Networks offer a lot of applications for AMQ-Filters. They are used to approximate a set of data that is located on a different servers. In many cases those AMQ-Filters can be seen as immutable. Even if the set on the remote server changes the AMQ-Filter is often not updated right away, but some false positives are tolerated. One example of this application is Web cache sharing. If a proxy has a cache miss it wants to determine if another proxy has the requested data. Therefore, the proxy must know or at least approximate if another proxy holds the requested web page. This can be archived by periodically broadcasting a static AMQ-Filter of the URLs of the web pages a proxy has cached instead of broadcasting URL lists. In this setting, false negatives can occur if the cache changed in between periodic updates.

The same concept can be applied to P2P networks. AMQ-Filters can be used to approximate what is stored at each node of the network. The filter can be filled with ids or keywords of the actual documents of the nodes. False positives only lead to some unnecessary requests. AMQ-Filters have further applications in P2P networks for example finding the difference or intersection between sets stored on different nodes.

References

  1. Carter; Larry (1978). "Exact and approximate membership testers". Proceedings of the tenth annual ACM symposium on Theory of computing - STOC '78. pp. 59–65. doi:10.1145/800133.804332. https://www.researchgate.net/publication/221590962. 
  2. Lovett; Shachar (2010). "A Lower Bound for Dynamic Approximate Membership Data Structures". 2010 IEEE 51st Annual Symposium on Foundations of Computer Science. pp. 797–804. doi:10.1109/FOCS.2010.81. ISBN 978-1-4244-8525-3. https://ieeexplore.ieee.org/document/5671358. 
  3. Graf; Lemire (2020). "Xor Filters". ACM Journal of Experimental Algorithmics 25: 1–16. doi:10.1145/3376122. 
  4. Broder, Andrei; Mitzenmacher, Michael (2002). "Network Applications of Bloom Filters: A Survey". Internet Mathematics: 636–646.