Proof of O(log*n) time complexity of union–find

From HandWiki
Revision as of 19:34, 14 June 2021 by imported>Scavis (url)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


In computer science, Union Find is an algorithm for doing certain operations on sets. This page is about proof of O(log*n) amortized time [1] of Union Find[2][3][4]

Statement: If m operations, either Union or Find, are applied to n elements, the total run time is O(m log*n), where log* is the iterated logarithm.

Proof

Lemma 1: As the find function follows the path along to the root, the rank of node it encounters is increasing.

Proof: claim that as Find and Union operations are applied to the data set, this fact remains true over time. Initially when each node is the root of its own tree, it's trivially true. The only case when the rank of a node might be changed is when the Union by Rank operation is applied. In this case, a tree with smaller rank will be attached to a tree with greater rank, rather than vice versa. And during the find operation, all nodes visited along the path will be attached to the root, which has larger rank than its children, so this operation won't change this fact either.

Lemma 2: A node u which is root of a subtree with rank r has at least 2r nodes.

Proof: Initially when each node is the root of its own tree, it's trivially true. Assume that a node u with rank r has at least 2r nodes. Then when two trees with rank r Union by Rank and form a tree with rank r + 1, the new node has at least 2r + 2r = 2r + 1 nodes.
ProofOflogstarnRank.jpg

Lemma 3: The maximum number of nodes of rank r is at most n/2r.

Proof: From lemma 2, we know that a node u which is root of a subtree with rank r has at least 2r nodes. We will get the maximum number of nodes of rank r when each node with rank r is the root of a tree that has exactly 2r nodes. In this case, the number of nodes of rank r is n/2r

For convenience, we define "bucket" here: a bucket is a set that contains vertices with particular ranks.

We create some buckets and put vertices into the buckets according to their ranks inductively. That is, vertices with rank 0 go into the zeroth bucket, vertices with rank 1 go into the first bucket, vertices with ranks 2 and 3 go into the second bucket. If the Bth bucket contains vertices with ranks from interval [r, 2r − 1] = [r, R - 1] then the (B+1)st bucket will contain vertices with ranks from interval [R, 2R − 1].

Proof of [math]\displaystyle{ O(\log^*n) }[/math] Union Find

We can make two observations about the buckets.

  1. The total number of buckets is at most log*n
    Proof: When we go from one bucket to the next, we add one more two to the power, that is, the next bucket to [B, 2B − 1] will be [2B, 22B − 1]
  2. The maximum number of elements in bucket [B, 2B – 1] is at most 2n/2B
    Proof: The maximum number of elements in bucket [B, 2B – 1] is at most n/2B + n/2B+1 + n/2B+2 + … + n/22B – 1 ≤ 2n/2B

Let F represent the list of "find" operations performed, and let

[math]\displaystyle{ T_1 = \sum_F\text{(link to the root)} }[/math]
[math]\displaystyle{ T_2 = \sum_F\text{(number of links traversed where the buckets are different)} }[/math]
[math]\displaystyle{ T_3 = \sum_F\text{(number of links traversed where the buckets are the same).} }[/math]

Then the total cost of m finds is T = T1 + T2 + T3

Since each find operation makes exactly one traversal that leads to a root, we have T1 = O(m).

Also, from the bound above on the number of buckets, we have T2 = O(mlog*n).

For T3, suppose we are traversing an edge from u to v, where u and v have rank in the bucket [B, 2B − 1] and v is not the root (at the time of this traversing, otherwise the traversal would be accounted for in T1). Fix u and consider the sequence v1,v2,...,vk that take the role of v in different find operations. Because of path compression and not accounting for the edge to a root, this sequence contains only different nodes and because of Lemma 1 we know that the ranks of the nodes in this sequence are strictly increasing. By both of the nodes being in the bucket we can conclude that the length k of the sequence (the number of times node u is attached to a different root in the same bucket) is at most the number of ranks in the buckets B, i.e. at most 2B − 1 − B < 2B.

Therefore, [math]\displaystyle{ T_3 \le \sum_{[B, 2^B - 1]} \sum_{u} 2^B. }[/math]

From Observations 1 and 2, we can conclude that [math]\displaystyle{ T_3\le\sum_{B} 2^B\frac{2n}{2^B}\le 2n\log^*n. }[/math]

Therefore, T = T1 + T2 + T3 = O(m log*n).

References

  1. Raimund Seidel, Micha Sharir. "Top-down analysis of path compression", SIAM J. Comput. 34(3):515–525, 2005
  2. Tarjan, Robert Endre (1975). "Efficiency of a Good But Not Linear Set Union Algorithm". Journal of the ACM 22 (2): 215–225. doi:10.1145/321879.321884. http://portal.acm.org/citation.cfm?id=321884. 
  3. Hopcroft, J. E.; Ullman, J. D. (1973). "Set Merging Algorithms". SIAM Journal on Computing 2 (4): 294–303. doi:10.1137/0202024. 
  4. Robert E. Tarjan and Jan van Leeuwen. Worst-case analysis of set union algorithms. Journal of the ACM, 31(2):245–281, 1984.