# Kendall rank correlation coefficient


In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient (after the Greek letter τ, tau), is a statistic used to measure the ordinal association between two measured quantities. A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897.[2]

Intuitively, the Kendall correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different for a correlation of −1) rank between the two variables.

Both Kendall's $\displaystyle{ \tau }$ and Spearman's $\displaystyle{ \rho }$ can be formulated as special cases of a more general correlation coefficient. The notions of concordance and discordance that underlie Kendall's $\displaystyle{ \tau }$ also appear in other areas of statistics, such as the Rand index in cluster analysis.

## Definition

Figure (caption): All points in the gray area are concordant and all points in the white area are discordant with respect to point $\displaystyle{ (X_1, Y_1) }$. With $\displaystyle{ n=30 }$ points, there are a total of $\displaystyle{ \binom{30}{2} = 435 }$ possible point pairs. In this example there are 395 concordant point pairs and 40 discordant point pairs, leading to a Kendall rank correlation coefficient of $\displaystyle{ (395-40)/435 \approx 0.816 }$.

Let $\displaystyle{ (x_1,y_1), ..., (x_n,y_n) }$ be a set of observations of the joint random variables X and Y, such that all the values of ($\displaystyle{ x_i }$) and ($\displaystyle{ y_i }$) are unique (ties are neglected for simplicity). Any pair of observations $\displaystyle{ (x_i,y_i) }$ and $\displaystyle{ (x_j,y_j) }$, where $\displaystyle{ i \lt j }$, are said to be concordant if the sort order of $\displaystyle{ (x_i,x_j) }$ and $\displaystyle{ (y_i,y_j) }$ agrees: that is, if either both $\displaystyle{ x_i\gt x_j }$ and $\displaystyle{ y_i\gt y_j }$ holds or both $\displaystyle{ x_i\lt x_j }$ and $\displaystyle{ y_i\lt y_j }$; otherwise they are said to be discordant.

The Kendall τ coefficient is defined as:

$\displaystyle{ \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{ (\text{number of pairs}) } = 1- \frac{2 (\text{number of discordant pairs})}{ {n \choose 2} } . }$[3]

where $\displaystyle{ {n \choose 2} = {n (n-1) \over 2} }$ is the binomial coefficient for the number of ways to choose two items from n items.

The number of discordant pairs is equal to the number of inversions of the y-sequence once the observations are sorted by x, i.e. the number of adjacent swaps needed to bring the y-sequence into the same order as the x-sequence.
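As a minimal sketch, the definition can be computed directly by enumerating all pairs (the function name and data values here are illustrative):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau from the definition: (concordant - discordant) / total pairs.
    Assumes all values within x and within y are distinct (no ties)."""
    pairs = list(combinations(range(len(x)), 2))
    # (x_i - x_j)(y_i - y_j) > 0 iff the pair is concordant, < 0 iff discordant
    concordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    discordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (concordant - discordant) / len(pairs)

# Two adjacent swaps in y give 2 discordant pairs out of 10:
print(kendall_tau([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))  # 1 - 2*2/10 = 0.6
```

The sign-product test $(x_i-x_j)(y_i-y_j) > 0$ is equivalent to the concordance condition stated above.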

### Properties

The denominator is the total number of pair combinations, so the coefficient must be in the range −1 ≤ τ ≤ 1.

• If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has value 1.
• If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other) the coefficient has value −1.
• If X and Y are independent and not constant, then the expectation of the coefficient is zero.
• An explicit expression for Kendall's rank coefficient is $\displaystyle{ \tau= \frac{2}{n(n-1)}\sum_{i\lt j} \sgn(x_i-x_j)\sgn(y_i-y_j) }$.

## Hypothesis test

The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any assumptions on the distributions of X or Y or the distribution of (X,Y).

Under the null hypothesis of independence of X and Y, the sampling distribution of τ has an expected value of zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated exactly for small samples; for larger samples, it is common to use an approximation to the normal distribution, with mean zero and variance $\displaystyle{ \frac{2(2n+5)}{9n(n-1)} }$.[4]

A proof of the following result is given in Valz & McLeod (1990;[5] 1995[6]).

Asymptotic normality — At the $\displaystyle{ n\to \infty }$ limit, $\displaystyle{ z_A = \frac{\tau_A}{\sqrt{Var[\tau_A]}} = {n_C - n_D \over \sqrt{n(n-1)(2n+5)/18} } }$ converges in distribution to the standard normal distribution.

## Case of standard normal distributions

If $\displaystyle{ (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) }$ are IID samples from the same jointly normal distribution with a known Pearson correlation coefficient $\displaystyle{ r }$, then the expectation of Kendall rank correlation has a closed-form formula.[8]

Greiner’s equality — If $\displaystyle{ X, Y }$ are jointly normal, with correlation $\displaystyle{ r }$, then $\displaystyle{ r = \sin{\left(\frac\pi 2 E[\tau_A]\right)} }$

The name is credited to Richard Greiner (1909)[9] by P. A. P. Moran.[10]
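Greiner's equality can be checked numerically. The following is a Monte Carlo sketch, not a proof; the seed, sample size, and target correlation r = 0.5 are arbitrary choices:

```python
import math
import random
from itertools import combinations

random.seed(0)
r, n = 0.5, 500

# Draw (X, Y) jointly normal with Pearson correlation r.
xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    ys.append(r * x + math.sqrt(1 - r * r) * random.gauss(0, 1))
    xs.append(x)

# Kendall's tau_A from the definition (continuous data, so no ties).
num = sum(1 if (xs[i] - xs[j]) * (ys[i] - ys[j]) > 0 else -1
          for i, j in combinations(range(n), 2))
tau = num / (n * (n - 1) / 2)

print(math.sin(math.pi * tau / 2))  # close to r = 0.5
```

The printed value differs from r only by sampling noise, consistent with $r = \sin(\frac\pi 2 E[\tau_A])$.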


## Accounting for ties

A pair $\displaystyle{ \{ (x_{i},y_{i}),(x_{j},y_{j}) \} }$ is said to be tied if and only if $\displaystyle{ x_{i} = x_{j} }$ or $\displaystyle{ y_{i} = y_{j} }$; a tied pair is neither concordant nor discordant. When tied pairs arise in the data, the coefficient may be modified in a number of ways to keep it in the range [−1, 1]:

### Tau-a

The Tau-a statistic tests the strength of association of the cross tabulations; both variables have to be ordinal. Tau-a makes no adjustment for ties. It is defined as:

$\displaystyle{ \tau_A = \frac{n_c-n_d}{n_0} }$

where nc, nd and n0 are defined as in the next section.

### Tau-b

The Tau-b statistic, unlike Tau-a, makes adjustments for ties.[12] Values of Tau-b range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.

The Kendall Tau-b coefficient is defined as:

$\displaystyle{ \tau_B = \frac{n_c-n_d}{\sqrt{(n_0-n_1)(n_0-n_2)}} }$

where

$\displaystyle{ \begin{align} n_0 & = n(n-1)/2\\ n_1 & = \sum_i t_i (t_i-1)/2 \\ n_2 & = \sum_j u_j (u_j-1)/2 \\ n_c & = \text{Number of concordant pairs} \\ n_d & = \text{Number of discordant pairs} \\ t_i & = \text{Number of tied values in the } i^\text{th} \text{ group of ties for the first quantity} \\ u_j & = \text{Number of tied values in the } j^\text{th} \text{ group of ties for the second quantity} \end{align} }$
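These quantities can be computed directly from the data; a minimal sketch (the function name and data are illustrative):

```python
from collections import Counter
from itertools import combinations

def kendall_tau_b(x, y):
    """Tau-b with tie adjustment, following the formulas above."""
    n = len(x)
    n0 = n * (n - 1) // 2
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())  # ties in x
    n2 = sum(u * (u - 1) // 2 for u in Counter(y).values())  # ties in y
    nc = nd = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            nc += 1
        elif s < 0:
            nd += 1
        # s == 0: tied pair, neither concordant nor discordant
    return (nc - nd) / ((n0 - n1) * (n0 - n2)) ** 0.5

# One tie group in x and one in y: n_c = 4, n_d = 0, n_1 = n_2 = 1
print(kendall_tau_b([1, 1, 2, 3], [1, 2, 2, 3]))  # 4 / sqrt(5 * 5) = 0.8
```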

A simple algorithm developed in BASIC computes the Tau-b coefficient using an alternative formula.[13]

Be aware that some statistical packages, e.g. SPSS, use alternative formulas for computational efficiency, with double the 'usual' number of concordant and discordant pairs.[14]

### Tau-c

Tau-c (also called Stuart-Kendall Tau-c)[15] is more suitable than Tau-b for the analysis of data based on non-square (i.e. rectangular) contingency tables.[15][16] Use Tau-b if the underlying scale of both variables has the same number of possible values (before ranking) and Tau-c if they differ. For instance, one variable might be scored on a 5-point scale (very good, good, average, bad, very bad), whereas the other might be based on a finer 10-point scale.

The Kendall Tau-c coefficient is defined as:[16]

$\displaystyle{ \tau_C = \frac{2 (n_c-n_d)}{n^2 \frac{(m-1)}{m}} = \tau_A \frac{n-1}{n} \frac{m}{m-1} }$

where

$\displaystyle{ \begin{align} n_c & = \text{Number of concordant pairs} \\ n_d & = \text{Number of discordant pairs} \\ r & = \text{Number of rows} \\ c & = \text{Number of columns} \\ m & = \min(r, c) \end{align} }$
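A minimal sketch, taking the numbers of rows and columns of the contingency table to be the numbers of distinct values observed (the function name and data are illustrative):

```python
from itertools import combinations

def kendall_tau_c(x, y):
    """Stuart-Kendall tau-c; m = min(#distinct x values, #distinct y values)."""
    n = len(x)
    m = min(len(set(x)), len(set(y)))
    nc = nd = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        nc += s > 0
        nd += s < 0
    return 2 * m * (nc - nd) / (n * n * (m - 1))

# n = 4, m = 3, n_c - n_d = 4:
print(kendall_tau_c([1, 1, 2, 3], [1, 2, 2, 3]))  # 2*3*4 / (16*2) = 0.75
```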

## Significance tests

When two quantities are statistically dependent, the distribution of $\displaystyle{ \tau }$ is not easily characterizable in terms of known distributions. However, for $\displaystyle{ \tau_A }$ the following statistic, $\displaystyle{ z_A }$, is approximately distributed as a standard normal when the variables are statistically independent:

$\displaystyle{ z_A = {n_c - n_d \over \sqrt{\frac{1}{18}v_0} } }$

where $\displaystyle{ v_0 = n(n-1)(2n+5) }$.

Thus, to test whether two variables are statistically dependent, one computes $\displaystyle{ z_A }$, and finds the cumulative probability for a standard normal distribution at $\displaystyle{ -|z_A| }$. For a 2-tailed test, multiply that number by two to obtain the p-value. If the p-value is below a given significance level, one rejects the null hypothesis (at that significance level) that the quantities are statistically independent.
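The procedure above can be sketched with Python's standard library (the data are illustrative; the normal approximation is rough for such a small n):

```python
import math
from itertools import combinations
from statistics import NormalDist

x = list(range(10))
y = [0, 2, 1, 3, 5, 4, 6, 8, 7, 9]   # mostly agrees with the order of x

# Count concordant and discordant pairs (no ties in this data).
nc = nd = 0
for i, j in combinations(range(len(x)), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    nc += s > 0
    nd += s < 0

n = len(x)
v0 = n * (n - 1) * (2 * n + 5)                # 2250 for n = 10
z_a = (nc - nd) / math.sqrt(v0 / 18)
p_value = 2 * NormalDist().cdf(-abs(z_a))     # two-tailed p-value
print(z_a, p_value)                           # z_a ≈ 3.49, p ≈ 0.0005
```

At the usual 0.05 level this would reject the null hypothesis of independence.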

When accounting for ties, several adjustments to $\displaystyle{ z_A }$ are required. The following statistic, $\displaystyle{ z_B }$, plays the same role for $\displaystyle{ \tau_B }$, and is again approximately standard normal when the quantities are statistically independent:

$\displaystyle{ z_B = {n_c - n_d \over \sqrt{ v } } }$

where

$\displaystyle{ \begin{array}{ccl} v & = & \frac{1}{18} v_0 - (v_t + v_u)/18 + (v_1 + v_2) \\ v_0 & = & n (n-1) (2n+5) \\ v_t & = & \sum_i t_i (t_i-1) (2 t_i+5)\\ v_u & = & \sum_j u_j (u_j-1)(2 u_j+5) \\ v_1 & = & \sum_i t_i (t_i-1) \sum_j u_j (u_j-1) / (2n(n-1)) \\ v_2 & = & \sum_i t_i (t_i-1) (t_i-2) \sum_j u_j (u_j-1) (u_j-2) / (9 n (n-1) (n-2)) \end{array} }$
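These variance terms can be sketched directly in code, using a tiny illustrative tied data set (far too small for the normal approximation to be meaningful, but enough to exercise the formulas):

```python
import math
from collections import Counter
from itertools import combinations

x = [1, 1, 2, 3]
y = [1, 2, 2, 3]
n = len(x)
t = list(Counter(x).values())   # sizes of tie groups in x
u = list(Counter(y).values())   # sizes of tie groups in y

nc = nd = 0
for i, j in combinations(range(n), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    nc += s > 0
    nd += s < 0

v0 = n * (n - 1) * (2 * n + 5)
vt = sum(ti * (ti - 1) * (2 * ti + 5) for ti in t)
vu = sum(uj * (uj - 1) * (2 * uj + 5) for uj in u)
v1 = sum(ti * (ti - 1) for ti in t) * sum(uj * (uj - 1) for uj in u) \
     / (2 * n * (n - 1))
v2 = sum(ti * (ti - 1) * (ti - 2) for ti in t) \
     * sum(uj * (uj - 1) * (uj - 2) for uj in u) / (9 * n * (n - 1) * (n - 2))
v = (v0 - vt - vu) / 18 + v1 + v2
z_b = (nc - nd) / math.sqrt(v)
print(z_b)
```

Groups of size 1 contribute zero to every sum, so including them (as Counter does) is harmless.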

This is sometimes referred to as the Mann-Kendall test.[6]

## Algorithms

The direct computation of the numerator $\displaystyle{ n_c - n_d }$ involves two nested iterations, as characterized by the following pseudocode:

    numer := 0
    for i := 2..N do
        for j := 1..(i − 1) do
            numer := numer + sign(x[i] − x[j]) × sign(y[i] − y[j])
    return numer

Although quick to implement, this algorithm is $\displaystyle{ O(n^2) }$ in complexity and becomes very slow on large samples. A more sophisticated algorithm[17] built upon the Merge Sort algorithm can be used to compute the numerator in $\displaystyle{ O(n \cdot \log{n}) }$ time.

Begin by ordering the data points by the first quantity, $\displaystyle{ x }$, and secondarily (among ties in $\displaystyle{ x }$) by the second quantity, $\displaystyle{ y }$. With this initial ordering, $\displaystyle{ y }$ is not sorted, and the core of the algorithm consists of computing how many swaps a Bubble Sort would take to sort this initial $\displaystyle{ y }$. An enhanced Merge Sort algorithm, with $\displaystyle{ O(n \log n) }$ complexity, can be applied to compute the number of swaps, $\displaystyle{ S(y) }$, that would be required by a Bubble Sort to sort $\displaystyle{ y_i }$. Then the numerator for $\displaystyle{ \tau }$ is computed as:

$\displaystyle{ n_c-n_d = n_0 - n_1 - n_2 + n_3 - 2 S(y), }$

where $\displaystyle{ n_3 }$ is computed like $\displaystyle{ n_1 }$ and $\displaystyle{ n_2 }$, but with respect to the joint ties in $\displaystyle{ x }$ and $\displaystyle{ y }$.

A Merge Sort partitions the data to be sorted, $\displaystyle{ y }$, into two roughly equal halves, $\displaystyle{ y_\mathrm{left} }$ and $\displaystyle{ y_\mathrm{right} }$, then sorts each half recursively, and finally merges the two sorted halves into a fully sorted vector. The number of Bubble Sort swaps is equal to:

$\displaystyle{ S(y) = S(y_\mathrm{left}) + S(y_\mathrm{right}) + M(Y_\mathrm{left},Y_\mathrm{right}) }$

where $\displaystyle{ Y_\mathrm{left} }$ and $\displaystyle{ Y_\mathrm{right} }$ are the sorted versions of $\displaystyle{ y_\mathrm{left} }$ and $\displaystyle{ y_\mathrm{right} }$, and $\displaystyle{ M(\cdot,\cdot) }$ characterizes the Bubble Sort swap-equivalent for a merge operation. $\displaystyle{ M(\cdot,\cdot) }$ is computed as depicted in the following pseudo-code:

    function M(L[1..n], R[1..m]) is
        i := 1
        j := 1
        nSwaps := 0
        while i ≤ n and j ≤ m do
            if R[j] < L[i] then
                nSwaps := nSwaps + n − i + 1
                j := j + 1
            else
                i := i + 1
        return nSwaps

A side effect of the above steps is that you end up with both a sorted version of $\displaystyle{ x }$ and a sorted version of $\displaystyle{ y }$. With these, the factors $\displaystyle{ t_i }$ and $\displaystyle{ u_j }$ used to compute $\displaystyle{ \tau_B }$ are easily obtained in a single linear-time pass through the sorted arrays.
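The merge-based scheme above can be sketched in code. This sketch assumes no ties (so $\displaystyle{ n_1 = n_2 = n_3 = 0 }$ and the numerator reduces to $\displaystyle{ n_0 - 2S(y) }$); the function names are illustrative:

```python
def merge_count(left, right):
    """Merge two sorted lists, counting the Bubble Sort-equivalent swaps."""
    merged, swaps = [], 0
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            swaps += len(left) - i      # right[j] jumps over the rest of left
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, swaps

def sort_count(a):
    """Merge sort that also returns the Bubble Sort swap count S(a)."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, s_left = sort_count(a[:mid])
    right, s_right = sort_count(a[mid:])
    merged, s_merge = merge_count(left, right)
    return merged, s_left + s_right + s_merge

def kendall_numerator(x, y):
    """n_c - n_d in O(n log n), assuming no ties in x or y."""
    n = len(x)
    ys = [yi for _, yi in sorted(zip(x, y))]   # order y by x
    _, swaps = sort_count(ys)
    return n * (n - 1) // 2 - 2 * swaps

print(kendall_numerator([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))  # 10 - 2*2 = 6
```

With ties present, the $\displaystyle{ n_1 }$, $\displaystyle{ n_2 }$, and $\displaystyle{ n_3 }$ corrections from the formula above would also be needed; they are omitted here for brevity.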

## Software implementations

• R implements the test for $\displaystyle{ \tau_B }$ as cor.test(x, y, method = "kendall") in its "stats" package (cor(x, y, method = "kendall") also works, but does not return the p-value). All three versions of the coefficient are available in the "DescTools" package along with confidence intervals: KendallTauA(x,y,conf.level=0.95) for $\displaystyle{ \tau_A }$, KendallTauB(x,y,conf.level=0.95) for $\displaystyle{ \tau_B }$, and StuartTauC(x,y,conf.level=0.95) for $\displaystyle{ \tau_C }$.
• For Python, the SciPy library implements the computation of $\displaystyle{ \tau_B }$ in scipy.stats.kendalltau.
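A minimal usage sketch of the SciPy routine (the data set is an arbitrary small example with ties; the call returns the $\displaystyle{ \tau_B }$ statistic and a two-sided p-value):

```python
from scipy import stats

x = [12, 2, 1, 12, 2]
y = [1, 4, 7, 1, 0]
tau, p_value = stats.kendalltau(x, y)
print(tau)   # -0.4714..., i.e. -4 / sqrt(72) by the tau_B formula
```

Here $\displaystyle{ n_c = 2 }$, $\displaystyle{ n_d = 6 }$, $\displaystyle{ n_1 = 2 }$, $\displaystyle{ n_2 = 1 }$, giving $\displaystyle{ \tau_B = -4/\sqrt{8 \cdot 9} }$.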

## References

1. Kendall, M. (1938). "A New Measure of Rank Correlation". Biometrika 30 (1–2): 81–89. doi:10.1093/biomet/30.1-2.81.
2. Kruskal, William H. (1958). "Ordinal Measures of Association". Journal of the American Statistical Association 53 (284): 814–861. doi:10.2307/2281954.
3. Hazewinkel, Michiel, ed. (2001), "Kendall tau metric", Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 978-1-55608-010-4
4. Hazewinkel, Michiel, ed. (2001), "Kendall coefficient of rank correlation", Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 978-1-55608-010-4
5. Valz, Paul D.; McLeod, A. Ian (February 1990). "A Simplified Derivation of the Variance of Kendall's Rank Correlation Coefficient" (in en). The American Statistician 44 (1): 39–40. doi:10.1080/00031305.1990.10475691. ISSN 0003-1305.
6. Valz, Paul D.; McLeod, A. Ian; Thompson, Mary E. (February 1995). "Cumulant Generating Function and Tail Probability Approximations for Kendall's Score with Tied Rankings". The Annals of Statistics 23 (1): 144–160. doi:10.1214/aos/1176324460. ISSN 0090-5364.
7. Hoeffding, Wassily (1992), Kotz, Samuel; Johnson, Norman L., eds., "A Class of Statistics with Asymptotically Normal Distribution" (in en), Breakthroughs in Statistics: Foundations and Basic Theory, Springer Series in Statistics (New York, NY: Springer): pp. 308–334, doi:10.1007/978-1-4612-0919-5_20, ISBN 978-1-4612-0919-5, retrieved 2024-01-19
8. Kendall, M. G. (1949). "Rank and Product-Moment Correlation". Biometrika 36 (1/2): 177–193. doi:10.2307/2332540. ISSN 0006-3444.
9. Richard Greiner, (1909), Ueber das Fehlersystem der Kollektiv-maßlehre, Zeitschrift für Mathematik und Physik, Band 57, B. G. Teubner, Leipzig, pages 121-158, 225-260, 337-373.
10. Moran, P. A. P. (1948). "Rank Correlation and Product-Moment Correlation". Biometrika 35 (1/2): 203–206. doi:10.2307/2332641. ISSN 0006-3444.
11. Berger, Daniel (2016). "A Proof of Greiner's Equality" (in en). SSRN Electronic Journal. doi:10.2139/ssrn.2830471. ISSN 1556-5068.
12. Agresti, A. (2010). Analysis of Ordinal Categorical Data (Second ed.). New York: John Wiley & Sons. ISBN 978-0-470-08289-8.
13. Alfred Brophy (1986). "An algorithm and program for calculation of Kendall's rank correlation coefficient". Behavior Research Methods, Instruments, & Computers 18: 45–46. doi:10.3758/BF03200993.
14. IBM (2016). IBM SPSS Statistics 24 Algorithms. IBM. p. 168. Retrieved 31 August 2017.
15. Berry, K. J.; Johnston, J. E.; Zahran, S.; Mielke, P. W. (2009). "Stuart's tau measure of effect size for ordinal variables: Some methodological considerations". Behavior Research Methods 41 (4): 1144–1148. doi:10.3758/brm.41.4.1144. PMID 19897822.
16. Stuart, A. (1953). "The Estimation and Comparison of Strengths of Association in Contingency Tables". Biometrika 40 (1–2): 105–110. doi:10.2307/2333101.
17. Knight, W. (1966). "A Computer Method for Calculating Kendall's Tau with Ungrouped Data". Journal of the American Statistical Association 61 (314): 436–439. doi:10.2307/2282833.