Distance oracle

From HandWiki

In computing, a distance oracle (DO) is a data structure for calculating distances between vertices in a graph.

Introduction

Let G(V,E) be an undirected, weighted graph, with n = |V| nodes and m = |E| edges. We would like to answer queries of the form "what is the distance between the nodes s and t?".

One way to do this is just run the Dijkstra algorithm. This takes time [math]\displaystyle{ O(m + n \log n) }[/math], and requires no extra space (besides the graph itself).

In order to answer many queries more efficiently, we can spend some time in pre-processing the graph and creating an auxiliary data structure.

A simple data structure that achieves this goal is a matrix which specifies, for each pair of nodes, the distance between them. This structure allows us to answer queries in constant time [math]\displaystyle{ O(1) }[/math], but requires [math]\displaystyle{ O(n^2) }[/math] extra space. It can be initialized in time [math]\displaystyle{ O(n^3) }[/math] using an all-pairs shortest paths algorithm, such as the Floyd–Warshall algorithm.

A DO lies between these two extremes. It uses less than [math]\displaystyle{ O(n^2) }[/math] space in order to answer queries in less than [math]\displaystyle{ O(m + n \log n) }[/math] time. Most DOs have to compromise on accuracy, i.e. they don't return the accurate distance but rather a constant-factor approximation of it.

Approximate DO

Thorup and Zwick[1] describe more than 10 different DOs. They then suggest a new DO that, for every k, requires space [math]\displaystyle{ O(k n^{1+1/k}) }[/math], such that any subsequent distance query can be approximately answered in time [math]\displaystyle{ O(k) }[/math]. The approximate distance returned is of stretch at most [math]\displaystyle{ 2k-1 }[/math], that is, the quotient obtained by dividing the estimated distance by the actual distance lies between 1 and [math]\displaystyle{ 2k-1 }[/math]. The initialization time is [math]\displaystyle{ O(kmn^{1/k}) }[/math].

Some special cases include:

  • For [math]\displaystyle{ k=1 }[/math] we get the simple distance matrix.
  • For [math]\displaystyle{ k=2 }[/math] we get a structure using [math]\displaystyle{ O(n^{1.5}) }[/math] space which answers each query in constant time and approximation factor at most 3.
  • For [math]\displaystyle{ k=\lfloor \log n \rfloor }[/math], we get a structure using [math]\displaystyle{ O(n \log n) }[/math] space, query time [math]\displaystyle{ O(\log n) }[/math], and stretch [math]\displaystyle{ O(\log n) }[/math].

Higher values of k do not improve the space or preprocessing time.

DO for general metric spaces

The oracle is built of a decreasing collection of k+1 sets of vertices:

  • [math]\displaystyle{ A_0 = V }[/math]
  • For every [math]\displaystyle{ i=1,\ldots,k-1 }[/math]: [math]\displaystyle{ A_i }[/math] contains each element of [math]\displaystyle{ A_{i-1} }[/math], independently, with probability [math]\displaystyle{ n^{-1/k} }[/math]. Note that the expected size of [math]\displaystyle{ A_i }[/math] is [math]\displaystyle{ n^{1-i/k} }[/math]. The elements of [math]\displaystyle{ A_i }[/math] are called i-centers.
  • [math]\displaystyle{ A_k = \emptyset }[/math]

For every node v, calculate its distance from each of these sets:

  • For every [math]\displaystyle{ i=0,\ldots,k-1 }[/math]: [math]\displaystyle{ d(A_i,v) = \min{(d(w,v)|w\in A_i} }[/math] and [math]\displaystyle{ p_i(v) = \arg \min{(d(w,v)|w\in A_i} }[/math]. I.e., [math]\displaystyle{ p_i(v) }[/math] is the i-center nearest to v, and [math]\displaystyle{ d(A_i,v) }[/math] is the distance between them. Note that for a fixed v, this distance is weakly increasing with i. Also note that for every v, [math]\displaystyle{ d(A_0,v)=0 }[/math] and [math]\displaystyle{ p_0(v)=v }[/math].
  • [math]\displaystyle{ d(A_k,v)=\infty }[/math].

For every node v, calculate:

  • For every [math]\displaystyle{ i=0,\ldots,k-1 }[/math]: [math]\displaystyle{ B_i(v) = \{w\in A_i\setminus A_{i+1}|d(w,v)\lt d(A_{i+1},v)\} }[/math]. [math]\displaystyle{ B_i(v) }[/math] contains all vertices in [math]\displaystyle{ A_i }[/math] which are strictly closer to v than all vertices in [math]\displaystyle{ A_{i+1} }[/math]. The partial unions of [math]\displaystyle{ B_i }[/math]s are balls in increasing diameter, that contain vertices with distances up to the first vertex of the next level.

For every v, compute its bunch:

  • [math]\displaystyle{ B(v) = \bigcup_{i=0}^{k-1}B_i(v). }[/math]

It is possible to show that the expected size of [math]\displaystyle{ B(v) }[/math] is at most [math]\displaystyle{ kn^{1/k} }[/math].

For every bunch [math]\displaystyle{ B(v) }[/math], construct a hash table that holds, for every [math]\displaystyle{ w\in B(V) }[/math], the distance [math]\displaystyle{ d(w,v) }[/math].

The total size of the data structure is [math]\displaystyle{ O(kn+\Sigma|B(v)|)=O(kn+nkn^{1/k})=O(kn^{1+1/k}) }[/math]

Having this structure initialized, the following algorithm finds the distance between two nodes, u and v:

  • [math]\displaystyle{ w:=u, i:=0 }[/math]
  • while [math]\displaystyle{ w\notin B(v) }[/math]:
    • [math]\displaystyle{ i:=i+1 }[/math]
    • [math]\displaystyle{ (u,v):=(v,u) }[/math] (swap the two input nodes; this does not change the distance between them since the graph is undirected).
    • [math]\displaystyle{ w := p_i(u) }[/math]
  • return [math]\displaystyle{ d(w,u)+d(w,v) }[/math]

It is possible to show that, in each iteration, the distance [math]\displaystyle{ d(w,u) }[/math] grows by at most [math]\displaystyle{ d(v,u) }[/math]. Since [math]\displaystyle{ A_{k-1}\subseteq B(v) }[/math], there are at most k-1 iterations, so finally [math]\displaystyle{ d(w,u)\leq (k-1)d(v,u) }[/math]. Now by the triangle inequality, [math]\displaystyle{ d(w,v)\leq d(w,u)+d(u,v)\leq kd(v,u) }[/math], so the distance returned is at most [math]\displaystyle{ (2k-1)d(u,v) }[/math].

Improvements

The above result was later improved by Patrascu and Roditty[2] who suggest a DO of size [math]\displaystyle{ O(n^{4/3} m^{1/3}) }[/math] which returns a factor 2 approximation.

Reduction from set intersection oracle

If there is a DO with an approximation factor of at most 2, then it is possible to build a set intersection oracle (SIO) with query time [math]\displaystyle{ O(1) }[/math] and space requirements [math]\displaystyle{ O(N+n) }[/math], where n is the number of sets and N the sum of their sizes; see set intersection oracle#Reduction to approximate distance oracle.

It is believed that the SIO problem does not have a non-trivial solution. I.e., it requires [math]\displaystyle{ \Omega(n^2) }[/math] space to answer queries in time [math]\displaystyle{ O(1) }[/math], e.g. using an n-by-n table with the intersection between each two sets. If this conjecture is true, this implies that there is no DO with an approximation factor of less than 2 and a constant query time.[2]

References

  1. Thorup, M.; Zwick, U. (2005). "Approximate distance oracles". Journal of the ACM 52: 1–24. doi:10.1145/1044731.1044732. 
  2. 2.0 2.1 Patrascu, M.; Roditty, L. (2010). "Distance Oracles beyond the Thorup–Zwick Bound". 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS). pp. 815. doi:10.1109/FOCS.2010.83. ISBN 978-1-4244-8525-3.