K-sorted sequence

From HandWiki

In computer science, a nearly-sorted sequence, also known as roughly-sorted sequence and as [math]\displaystyle{ k }[/math]-sorted sequence is a sequence which is almost ordered. By almost ordered, it is meant that no element of the sequence is very far away from where it would be if the sequence were perfectly ordered. It is still possible that no element of the sequence is at the place where it should be if the sequence were perfectly ordered.

The nearly-sorted sequences are particularly useful when the exact order of element has little importance. For example Twitter[1] nearly sort tweets, up to the second, as there is no need for more precision. Actually, given the impossibility to exactly synchronize all computers, an exact sorting of all tweets according to the time at which they are posted is impossible. This idea led to the creation of Snowflake IDs.

[math]\displaystyle{ k }[/math]-sorting is the operation of reordering the elements of a sequence so that it becomes [math]\displaystyle{ k }[/math]-sorted. [math]\displaystyle{ k }[/math]-sorting is generally more efficient than sorting. Similarly, sorting a sequence is easier if it is known that the sequence is [math]\displaystyle{ k }[/math]-sorted. So if a program needs only to consider [math]\displaystyle{ k }[/math]-sorted sequences as input or output, considering [math]\displaystyle{ k }[/math]-sorted sequences may save time.

The radius of a sequence is a measure of presortedness, that is, its value indicate how much the elements in the list has to be moved to get a totally sorted value. In the above example of tweets which are sorted up to the second, the radius is bounded by the number of tweets in a second.

Definition

Given a positive number [math]\displaystyle{ k }[/math], a sequence [math]\displaystyle{ [a_1,\dots, a_n] }[/math] is said to be [math]\displaystyle{ k }[/math]-sorted if for each [math]\displaystyle{ 1\le i }[/math] and for each [math]\displaystyle{ i+k\le j\le n }[/math], [math]\displaystyle{ a_i\le a_j }[/math]. That is, the sequence has to be ordered only for pairs of elements whose distance is at least [math]\displaystyle{ k }[/math].

The radius of the sequence [math]\displaystyle{ \alpha }[/math], denoted [math]\displaystyle{ \text{ROUGH}(\alpha) }[/math][2] or [math]\displaystyle{ \text{Par}(\alpha) }[/math][3] is the smallest [math]\displaystyle{ k }[/math] such that the sequence is [math]\displaystyle{ k }[/math]-sorted. The radius is a measure of presortedness.

A sequence is said to be nearly-sorted or roughly-sorted if its radius is small compared to its length.

Equivalent definition

A sequence [math]\displaystyle{ [a_1,\dots, a_n] }[/math] is [math]\displaystyle{ k }[/math]-sorted if and only if each range of length [math]\displaystyle{ 2k+2 }[/math], [math]\displaystyle{ [a_i, a_{i+1},\dots, a_{i+2k+2}] }[/math] is [math]\displaystyle{ k }[/math]-sorted.

Properties

All sequences of length [math]\displaystyle{ n }[/math] are [math]\displaystyle{ (n-1) }[/math]-sorted, that is, [math]\displaystyle{ 0\le \text{Par}([a_1,\dots,a_n])\lt n }[/math]. A sequence is [math]\displaystyle{ 0 }[/math]-sorted if and only if it is sorted. A [math]\displaystyle{ k }[/math]-sorted sequence is automatically [math]\displaystyle{ (k+1) }[/math]-sorted but not necessarily [math]\displaystyle{ (k-1) }[/math]-sorted.

Relation with sorted sequences

Given a sequence a [math]\displaystyle{ k }[/math]-sorted sequence [math]\displaystyle{ [a_1,\dots, a_n] }[/math] and its sorted permutation [math]\displaystyle{ [a_{\sigma_1},\dots, a_{\sigma_n}] }[/math], [math]\displaystyle{ |i-\sigma_i| }[/math] is at most [math]\displaystyle{ k }[/math].

Algorithms

Deciding whether a sequence is [math]\displaystyle{ k }[/math]-sorted

Deciding whether a sequence [math]\displaystyle{ [a_1,\dots, a_n] }[/math] is [math]\displaystyle{ k }[/math]-sorted can be done in linear time and constant space by a streaming algorithm. It suffices, for each [math]\displaystyle{ 1\le i \lt n-k }[/math], to keep track of [math]\displaystyle{ \max(a_j\mid j\le i) }[/math] and to check that [math]\displaystyle{ a_{i+k}\ge\max(a_j\mid j\le i) }[/math].

Computing the radius of a sequence

Computing the radius of a sequence can be computed in linear time and space. This follows from the fact that it can be defined as [math]\displaystyle{ \max(i-j\mid \min(a_k\mid k\ge i) \lt \max(a_k\mid k\le j)) }[/math].

Halving the radius of a sequence

Given a [math]\displaystyle{ 2k }[/math]-sorted sequence [math]\displaystyle{ \alpha=[a_1,\dots, a_n] }[/math], it is possible to compute a [math]\displaystyle{ (k-1) }[/math]-sorted permutation [math]\displaystyle{ \alpha' }[/math] of [math]\displaystyle{ \alpha }[/math] in linear time and constant space.

First, given a sequence [math]\displaystyle{ \beta=[b_1,\dots,b_{2k}] }[/math], lets say that this sequence is partitioned if the [math]\displaystyle{ k }[/math]-smaller elements are in [math]\displaystyle{ [b_1,\dots,b_k] }[/math] and the [math]\displaystyle{ k }[/math]-greater elements are in [math]\displaystyle{ [b_{k+1},\dots, b_{2k}] }[/math]. Lets call partitioning the action of reordering the sequence [math]\displaystyle{ \beta }[/math] into a partitioned permutation. This can be done in linear time by first finding the median of [math]\displaystyle{ b }[/math] and then moving elements to the first or second half depending on whether they are smaller or greater than the median.

The sequence [math]\displaystyle{ \alpha' }[/math] can be obtained by partitioning the blocks of elements at position [math]\displaystyle{ [2k(i-1)+1, \dots, 2ki] }[/math], then by partitioning the blocks of elements at position [math]\displaystyle{ [2k(i-1)+k+1, \dots, 2ki+k] }[/math], and then again the elements at position [math]\displaystyle{ [2k(i-1)+1, \dots, 2ki] }[/math] for each number [math]\displaystyle{ i }[/math] such that those sequences are defined.

Using [math]\displaystyle{ \frac{n}{2k} }[/math] processors, with no shared read nor write access to memory, the same algorithm can be applied in [math]\displaystyle{ O(k) }[/math] time, since each partition of a sequence can occur in parallel.

Merging two [math]\displaystyle{ k }[/math]-sorted sequences

Merging two [math]\displaystyle{ k }[/math]-sorted sequences [math]\displaystyle{ \alpha^1=[a^1_1,\dots, a^1_n] }[/math] and [math]\displaystyle{ \alpha=[a^2_1,\dots, a^2_m] }[/math] can be done in linear time and constant space.

First, using the preceding algorithm, both sequences should be transformed into [math]\displaystyle{ k/2 }[/math]-sorted sequences.

Let's construct iteratively an output sequence [math]\displaystyle{ \omega }[/math] by removing content from both [math]\displaystyle{ \alpha^i }[/math] and adding it in [math]\displaystyle{ \omega }[/math].

If both [math]\displaystyle{ \alpha^i }[/math]'s are empty, then it suffices to return [math]\displaystyle{ \omega }[/math]. Otherwise, let us assume that [math]\displaystyle{ \alpha^1 }[/math] is empty and not [math]\displaystyle{ \alpha^2 }[/math], it suffices to remove the content of [math]\displaystyle{ \alpha^2 }[/math] and append it to [math]\displaystyle{ \omega }[/math]. The case where [math]\displaystyle{ \alpha^2 }[/math] is empty and not [math]\displaystyle{ \alpha^1 }[/math] is similar by symmetry.

Let us consider the case where neither [math]\displaystyle{ \alpha^i }[/math] is empty. Let us call [math]\displaystyle{ A^1=\max(a^1_i\mid 1\le i\le k/2) }[/math] and [math]\displaystyle{ A^2=\max(a^2_i\mid 1\le i\le k/2) }[/math], they are the greatest of the [math]\displaystyle{ k/2 }[/math]-firsts elements of each list. Let us assume that [math]\displaystyle{ A^1\lt A^2 }[/math], the other case is similar by symmetry. Remove [math]\displaystyle{ a^1_1,\dots, a^1_{k/2} }[/math] from [math]\displaystyle{ \alpha^1 }[/math] and remove [math]\displaystyle{ \{a^2_j\mid 1\le j\le k/2, a^2_j\le A^1\} }[/math] from [math]\displaystyle{ \alpha^2 }[/math] and add them to [math]\displaystyle{ \omega }[/math].

Sorting a [math]\displaystyle{ k }[/math]-sorted sequence

A [math]\displaystyle{ k }[/math]-sorted sequence can be sorted by applying the halving algorithm given above [math]\displaystyle{ \log_2(k) }[/math] times. This takes [math]\displaystyle{ O(n\log k) }[/math] time on a sequential machine, or [math]\displaystyle{ O(k\log k) }[/math] time using [math]\displaystyle{ O(n) }[/math] processors.

[math]\displaystyle{ k }[/math]-sorting a sequence

Since each sequence [math]\displaystyle{ \alpha=[a_1,\dots, a_n] }[/math] is necessarily [math]\displaystyle{ n }[/math]-sorted, it suffices to apply the halving algorithm [math]\displaystyle{ \log_{2}\left(\frac{n}{k}\right) }[/math]-times. Thus, [math]\displaystyle{ k }[/math]-sorting can be done in [math]\displaystyle{ O(n\log (n/k)) }[/math]-time. This algorithm is Par-optimal, that is, there exists no sequential algorithm with a better worst-case complexity.

References

  1. @rk. "Announcing Snowflake". https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html. 
  2. Altman, Tom; Igarashi, Yoshihide (1989-08-25). "Roughly Sorting: Sequential and Parallel Approach". Journal of Information Processing 12 (2): 154–158. 
  3. Estivill-Castro, Vladimir; Wood, Derick (October 1989). "A new measure of presortdness". Information and Computation 83 (1): 111–119. doi:10.1016/0890-5401(89)90050-3.