Weighted median

From HandWiki
The top chart shows a list of elements with values indicated by height and the median element shown in red. The lower chart shows the same elements with weights as indicated by the width of the boxes. The weighted median is shown in red and is different than the ordinary median.

In statistics, a weighted median of a sample is the 50% weighted percentile.[1][2][3] It was first proposed by F. Y. Edgeworth in 1888.[4][5] Like the median, it is useful as an estimator of central tendency, robust against outliers. It allows for non-uniform statistical weights related to, e.g., varying precision measurements in the sample.

Definition

General case

For [math]\displaystyle{ n }[/math] distinct ordered elements [math]\displaystyle{ x_1, x_2,...,x_n }[/math] with positive weights [math]\displaystyle{ w_1, w_2,...,w_n }[/math] such that [math]\displaystyle{ \sum_{i=1}^n w_i = 1 }[/math], the weighted median is the element [math]\displaystyle{ x_k }[/math] satisfying

[math]\displaystyle{ \sum_{i = 1}^{k - 1} w_i \le 1/2 }[/math] and [math]\displaystyle{ \sum_{i = k + 1}^{n} w_i \le 1/2 }[/math]

Special case

Consider a set of elements in which two of the elements satisfy the general case. This occurs when both element's respective weights border the midpoint of the set of weights without encapsulating it; Rather, each element defines a partition equal to [math]\displaystyle{ 1/2 }[/math]. These elements are referred to as the lower weighted median and upper weighted median. Their conditions are satisfied as follows:

Lower Weighted Median

[math]\displaystyle{ \sum_{i = 1}^{k - 1} w_i \lt 1/2 }[/math] and [math]\displaystyle{ \sum_{i = k + 1}^{n} w_i = 1/2 }[/math]

Upper Weighted Median

[math]\displaystyle{ \sum_{i = 1}^{k - 1} w_i = 1/2 }[/math] and [math]\displaystyle{ \sum_{i = k + 1}^{n} w_i \lt 1/2 }[/math]

Ideally, a new element would be created using the mean of the upper and lower weighted medians and assigned a weight of zero. This method is similar to finding the median of an even set. The new element would be a true median since the sum of the weights to either side of this partition point would be equal.
Depending on the application, it may not be possible or wise to create new data. In this case, the weighted median should be chosen based on which element keeps the partitions most equal. This will always be the weighted median with the lowest weight.
In the event that the upper and lower weighted medians are equal, the lower weighted median is generally accepted as originally proposed by Edgeworth.[6]

Properties

The sum of weights in each of the two partitions should be as equal as possible.

If the weights of all numbers in the set are equal, then the weighted median reduces down to the median.

Examples

For simplicity, consider the set of numbers [math]\displaystyle{ \{1, 2, 3, 4, 5\} }[/math] with each number having weights [math]\displaystyle{ \{0.15, 0.1, 0.2, 0.3, 0.25\} }[/math] respectively. The median is 3 and the weighted median is the element corresponding to the weight 0.3, which is 4. The weights on each side of the pivot add up to 0.45 and 0.25, satisfying the general condition that each side be as even as possible. Any other weight would result in a greater difference between each side of the pivot.

Consider the set of numbers [math]\displaystyle{ \{1, 2, 3, 4\} }[/math] with each number having uniform weights [math]\displaystyle{ \{0.25, 0.25, 0.25, 0.25\} }[/math] respectively. Equal weights should result in a weighted median equal to the median. This median is 2.5 since it is an even set. The lower weighted median is 2 with partition sums of 0.25 and 0.5, and the upper weighted median is 3 with partition sums of 0.5 and 0.25. These partitions each satisfy their respective special condition and the general condition. It is ideal to introduce a new pivot by taking the mean of the upper and lower weighted medians when they exist. With this, the set of numbers is [math]\displaystyle{ \{1, 2, 2.5, 3, 4\} }[/math] with each number having weights [math]\displaystyle{ \{0.25, 0.25, 0, 0.25, 0.25\} }[/math] respectively. This creates partitions that both sum to 0.5. It can easily be seen that the weighted median and median are the same for any size set with equal weights.

Similarly, consider the set of numbers [math]\displaystyle{ \{1, 2, 3, 4\} }[/math] with each number having weights [math]\displaystyle{ \{0.49, 0.01, 0.25, 0.25\} }[/math] respectively. The lower weighted median is 2 with partition sums of 0.49 and 0.5, and the upper weighted median is 3 with partition sums of 0.5 and 0.25. In the case of working with integers or non-interval measures, the lower weighted median would be accepted since it is the lower weight of the pair and therefore keeps the partitions most equal. However, it is more ideal to take the mean of these weighted medians when it makes sense instead. Coincidentally, both the weighted median and median are equal to 2.5, but this will not always hold true for larger sets depending on the weight distribution.

Algorithm

The weighted median can be computed by sorting the set of numbers and finding the smallest set of numbers which sum to half the weight of the total weight. This algorithm takes [math]\displaystyle{ O(n \log n) }[/math] time. There is a better approach to find the weighted median using a modified selection algorithm.[1]

// Main call is WeightedMedian(a, 1, n)
// Returns lower median
WeightedMedian(a[1..n], p, r)
    // Base case for single element
    if r = p then
        return a[p]
    // Base case for two elements
    // Make sure we return the mean in the case that the two candidates have equal weight
    if r-p = 1 then
        if a[p].w == a[r].w
            return (a[p] + a[r])/2
        if a[p].w > a[r].w
            return a[p]
        else 
            return a[r]
    // Partition around pivot r
    q = partition(a, p, r)
    wl, wg = sum weights of partitions (p, q-1), (q+1, r)
    // If partitions are balanced then we are done
    if wl and wg both < 1/2 then
        return a[q]
    else
        // Increase pivot weight by the amount of partition we eliminate
        if wl > wg then
            a[q].w += wg
            // Recurse on pivot inclusively 
            WeightedMedian(a, p, q)
        else
            a[q].w += wl
            WeightedMedian(a, q, r)

Software/source code

  • A fast weighted median algorithm is implemented in a C extension for Python in the Robustats Python package.
  • R has many implementations, including matrixStats::weightedMedian(), spatstat::weighted.median(), and others.[7]

See also

References

  1. 1.0 1.1 Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). Introduction to Algorithms. ISBN 9780262032933. https://books.google.com/books?id=NLngYyWFl_YC&q=weighted+median&pg=PA194. 
  2. Horowitz, Ellis; Sahni, Sartaj; Rajasekaran, Sanguthevar (1996-12-15). Computer Algorithms C++: C++ and Pseudocode Versions. ISBN 9780716783152. https://books.google.com/books?id=q1FKJ6l0_zUC&q=weighted+median&pg=PA693. 
  3. Bovik, Alan C (2010-07-21). Handbook of Image and Video Processing. ISBN 9780080533612. https://books.google.com/books?id=UM_GCfJe88sC&q=weighted+median&pg=PA111. 
  4. Edgeworth, F. Y. (1888). "On a New Method of Reducing Observations Relating to Several Quantities". Philosophical Magazine 25 (154): 184–191. doi:10.1080/14786448808628170. https://zenodo.org/record/1431185. 
  5. Edgeworth, F. Y. (1887). "On Observations Relating to Several Quantities". Hermathena (Trinity College Dublin) 6 (13): 279–285. 
  6. Lange, Kenneth (15 June 2010). Numerical Analysis for Statisticians (second ed.). Springer. p. 313. ISBN 978-1-4419-5944-7. https://books.google.com/books?id=AtiDhx2bsiMC&q=edgeworth+weighted+median&pg=PA313. 
  7. Is there a weighted.median() function?