Set intersection oracle

From HandWiki

A set intersection oracle (SIO) is a data structure which represents a collection of sets and can quickly answer queries about whether the set intersection of two given sets is non-empty.

The input to the problem is n finite sets. The sum of the sizes of all sets is N (which also means that there are at most N distinct elements). The SIO should quickly answer any query of the form:

"Does the set Si intersect the set Sk"?

Minimum memory, maximum query time

Without any pre-processing, a query can be answered by inserting the elements of Si into a temporary hash table and then checking for each element of Sk whether it is in the hash table. The query time is [math]\displaystyle{ O(|S_i|+|S_j|)=O(N) }[/math].

Maximum memory, minimum query time

Alternatively, we can pre-process the sets and create an n-by-n table where the intersection information is already entered. Then the query time is [math]\displaystyle{ O(1) }[/math], but the memory required is [math]\displaystyle{ O(n^2) }[/math].

A compromise

Define a "large set" as a set with at least [math]\displaystyle{ \sqrt{N} }[/math] elements. Obviously there are at most [math]\displaystyle{ \sqrt{N} }[/math] such sets. Create a table of intersection data between every large set to every other large set. This requires [math]\displaystyle{ O(N) }[/math] memory. Additionally, for each large set, keep a hash table of all its elements. This requires additional [math]\displaystyle{ O(N^{3/2}) }[/math] memory.

Given two sets, there are three possible cases:

  1. Both sets are large. Then just read the answer to the intersection query from the table, in time [math]\displaystyle{ O(1) }[/math].
  2. Both sets are small. Then insert the elements of one of them into a hash table and check the elements of the other one; because the sets are small, the required time is [math]\displaystyle{ O(\sqrt{N}) }[/math].
  3. One set is large and one set is small. Loop over all elements in the small set and check them against the hash table of the large set. The required time is again [math]\displaystyle{ O(\sqrt{N}) }[/math].

In general, if we define a "large set" as a set with at least [math]\displaystyle{ N^c }[/math] elements, then the number of large set is at most [math]\displaystyle{ N^{1-c} }[/math] so the memory required is [math]\displaystyle{ O(N^{2-c}) }[/math], and the query time is [math]\displaystyle{ O(N^c) }[/math].

Reduction to approximate distance oracle

The SIO problem can be reduced to the approximate distance oracle (DO) problem, in the following way.[1]

  • Build an undirected bipartite graph where one part contains a node for each of the n sets, and the other part contains a node for each of the (at most) N elements contained in the sets.
  • There is an edge between a set and an element, iff the set contains the element.

This graph has the following properties:

  • If two sets intersect, the distance between them is 2 (from one set, to an element in the intersection, to the other set).
  • If two sets do not intersect, the distance between them is at least 4.

So, with a DO whose approximation factor of less than 2, we can solve the SIO problem.

It is believed that the SIO problem does not have a non-trivial solution. I.e., it requires [math]\displaystyle{ \Omega(n^2) }[/math] space to answer queries in time [math]\displaystyle{ O(1) }[/math]. If this conjecture is true, this implies that there is no DO with an approximation factor of less than 2 and a constant query time.[1]

References

  1. 1.0 1.1 Patrascu, M.; Roditty, L. (2010). "Distance Oracles beyond the Thorup–Zwick Bound". 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS). pp. 815. doi:10.1109/FOCS.2010.83. ISBN 978-1-4244-8525-3.