# Permutation test

A **permutation test** (also called a re-randomization test) is an exact statistical hypothesis test that makes use of proof by contradiction: the distribution of the test statistic under the null hypothesis is obtained by calculating the value of the test statistic for every possible rearrangement of the observed data. Permutation tests are, therefore, a form of resampling.

Permutation tests can be understood as surrogate data testing where the surrogate data under the null hypothesis are obtained through permutations of the original data.^{[1]}

In other words, the method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels; see also exchangeability. Confidence intervals can then be derived from the tests. The theory has evolved from the works of Ronald Fisher and E. J. G. Pitman in the 1930s.

Permutation tests should not be confused with randomized tests.^{[2]}

## Method

To illustrate the basic idea of a permutation test, suppose we observe a random variable for each individual in two groups [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math], denoted [math]\displaystyle{ X_A }[/math] and [math]\displaystyle{ X_B }[/math], with sample means [math]\displaystyle{ \bar{x}_{A} }[/math] and [math]\displaystyle{ \bar{x}_{B} }[/math], and that we want to know whether [math]\displaystyle{ X_A }[/math] and [math]\displaystyle{ X_B }[/math] come from the same distribution. Let [math]\displaystyle{ n_{A} }[/math] and [math]\displaystyle{ n_{B} }[/math] be the sample sizes of the two groups. The permutation test is designed to determine whether the observed difference between the sample means is large enough to reject, at some significance level, the null hypothesis [math]\displaystyle{ H_{0} }[/math] that the data drawn from [math]\displaystyle{ A }[/math] are from the same distribution as the data drawn from [math]\displaystyle{ B }[/math].

The test proceeds as follows. First, the difference in means between the two samples is calculated: this is the observed value of the test statistic, [math]\displaystyle{ T_\text{obs} }[/math].

Next, the observations of groups [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] are pooled, and the difference in sample means is calculated and recorded for every possible way of dividing the pooled values into two groups of size [math]\displaystyle{ n_{A} }[/math] and [math]\displaystyle{ n_{B} }[/math] (i.e., for every permutation of the group labels A and B). The set of these calculated differences is the exact distribution of possible differences (for this sample) under the null hypothesis that group labels are exchangeable (i.e., are randomly assigned).

The one-sided p-value of the test is calculated as the proportion of permutations where the difference in means was greater than or equal to [math]\displaystyle{ T_\text{obs} }[/math]. The two-sided p-value of the test is calculated as the proportion of permutations where the absolute difference was greater than or equal to [math]\displaystyle{ |T_\text{obs}| }[/math]. Counting ties with the observed value ensures that the observed labelling itself is included, so the p-value can never be zero.
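The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `exact_permutation_test` and the data are made up, and the sketch counts ties with the observed value as "at least as extreme". It is only practical for small samples, because the number of relabellings grows combinatorially.

```python
from fractions import Fraction
from itertools import combinations


def exact_permutation_test(a, b):
    """Exact two-sample permutation test on the difference in means.

    Returns (one_sided_p, two_sided_p). Ties with the observed value are
    counted as "at least as extreme" (an assumption of this sketch).
    """
    # Fractions keep the arithmetic exact, so ties are detected reliably.
    pooled = [Fraction(x) for x in list(a) + list(b)]
    n_a, n_b = len(a), len(b)
    total = sum(pooled)

    def mean_difference(sum_a):
        # Difference in group means, given the sum of the values labelled A.
        return sum_a / n_a - (total - sum_a) / n_b

    t_obs = mean_difference(sum(pooled[:n_a]))

    # Every way of choosing which n_a pooled values receive label A.
    diffs = [
        mean_difference(sum(pooled[i] for i in idx))
        for idx in combinations(range(len(pooled)), n_a)
    ]
    one_sided = sum(d >= t_obs for d in diffs) / len(diffs)
    two_sided = sum(abs(d) >= abs(t_obs) for d in diffs) / len(diffs)
    return one_sided, two_sided


# Made-up data: 6 pooled values give C(6, 3) = 20 relabellings.
one_sided, two_sided = exact_permutation_test([12, 15, 14], [8, 9, 10])
print(one_sided, two_sided)  # 0.05 0.1
```

In this toy example group A holds the three largest values, so only the observed labelling reaches the observed difference (1 of 20), and only it and its mirror image reach the observed absolute difference (2 of 20).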

Alternatively, if the only purpose of the test is to reject or not reject the null hypothesis, one could sort the recorded differences and then check whether [math]\displaystyle{ T_\text{obs} }[/math] is contained within the middle [math]\displaystyle{ (1 - \alpha) \times 100 }[/math]% of them, for some significance level [math]\displaystyle{ \alpha }[/math]. If it is not, we reject the hypothesis of identical distributions at the [math]\displaystyle{ \alpha\times100\% }[/math] significance level.

## Relation to parametric tests

Permutation tests are a subset of non-parametric statistics. Assuming that the experimental data are measurements from two treatment groups, the method simply generates the distribution of mean differences under the assumption that the two groups are not distinct in terms of the measured variable. One then uses the observed statistic ([math]\displaystyle{ T_\text{obs} }[/math] above) to see to what extent it is unusual, i.e., how probable a value of that magnitude (or larger) would be if the treatment labels had simply been randomized after treatment.

In contrast to permutation tests, the distributions underlying many popular "classical" statistical tests, such as the *t*-test, *F*-test, *z*-test, and *χ*^{2} test, are obtained from theoretical probability distributions. Fisher's exact test is an example of a commonly used permutation test for evaluating the association between two dichotomous variables. When sample sizes are very large, Pearson's chi-squared test will give accurate results. For small samples, the chi-squared reference distribution cannot be assumed to give a correct description of the probability distribution of the test statistic, and in this situation the use of Fisher's exact test becomes more appropriate.

Permutation tests exist in many situations where parametric tests do not (e.g., when deriving an optimal test when losses are proportional to the size of an error rather than its square). All simple and many relatively complex parametric tests have a corresponding permutation test version that is defined by using the same test statistic as the parametric test, but obtains the p-value from the sample-specific permutation distribution of that statistic, rather than from the theoretical distribution derived from the parametric assumption. For example, it is possible in this manner to construct a permutation *t*-test, a permutation *χ*^{2} test of association, a permutation version of Aly's test for comparing variances and so on.
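As a hedged illustration of this construction, the sketch below computes a two-sided permutation *t*-test: it uses the pooled-variance Student *t* statistic but reads the p-value from the permutation distribution instead of the *t* distribution. The function names `student_t` and `permutation_p_value`, the tolerance `tol`, and the data are assumptions made for the example; any other two-sample statistic could be passed in the same way.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Pooled-variance two-sample t statistic (same as the classical test)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


def permutation_p_value(a, b, statistic, tol=1e-12):
    """Two-sided p-value of `statistic` from its exact permutation distribution."""
    pooled = list(a) + list(b)
    indices = range(len(pooled))
    t_obs = abs(statistic(list(a), list(b)))
    extreme = total = 0
    for idx in combinations(indices, len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in indices if i not in chosen]
        total += 1
        # tol guards against floating-point noise when values tie with t_obs.
        if abs(statistic(group_a, group_b)) >= t_obs - tol:
            extreme += 1
    return extreme / total


# Made-up data; 20 relabellings in total.
p = permutation_p_value([12, 15, 14], [8, 9, 10], student_t)
print(p)  # 0.1
```

For two groups, the absolute *t* statistic is a monotone function of the absolute mean difference across relabellings of the same pooled data, so this example gives the same p-value as a permutation test on the difference in means.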

The major drawbacks to permutation tests are that they

- Can be computationally intensive and may require "custom" code for difficult-to-calculate statistics. This must be rewritten for every case.
- Are primarily used to provide a p-value. The inversion of the test to get confidence regions/intervals requires even more computation.

## Advantages

Permutation tests exist for any test statistic, regardless of whether or not its distribution is known. Thus one is always free to choose the statistic which best discriminates between hypothesis and alternative and which minimizes losses.

Permutation tests can be used for analyzing unbalanced designs^{[3]} and for combining dependent tests on mixtures of categorical, ordinal, and metric data (Pesarin, 2001). They can also be used to analyze qualitative data that have been quantitized (i.e., turned into numbers). Permutation tests may be ideal for analyzing quantitized data that do not satisfy the statistical assumptions underlying traditional parametric tests (e.g., t-tests, ANOVA);^{[4]} see PERMANOVA.

Before the 1980s, the burden of creating the reference distribution was overwhelming except for data sets with small sample sizes.

Since the 1980s, the confluence of relatively inexpensive fast computers and the development of new sophisticated path algorithms applicable in special situations made the application of permutation test methods practical for a wide range of problems. It also initiated the addition of exact-test options in the main statistical software packages and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and computing test-based "exact" confidence intervals.

## Limitations

An important assumption behind a permutation test is that the observations are exchangeable under the null hypothesis. An important consequence of this assumption is that tests of difference in location (like a permutation t-test) require equal variance under the normality assumption. In this respect, the permutation t-test shares the same weakness as the classical Student's t-test (the Behrens–Fisher problem). An alternative in this situation is to use a bootstrap-based test. Good (2005) explains the difference between permutation tests and bootstrap tests the following way: "Permutations test hypotheses concerning distributions; bootstraps test hypotheses concerning parameters. As a result, the bootstrap entails less-stringent assumptions." Bootstrap tests are not exact. In some cases, a permutation test based on a properly studentized statistic can be asymptotically exact even when the exchangeability assumption is violated.^{[5]}

## Monte Carlo testing

An asymptotically equivalent permutation test can be created when there are too many possible orderings of the data to allow complete enumeration in a convenient manner. This is done by generating the reference distribution by Monte Carlo sampling, which takes a small (relative to the total number of permutations) random sample of the possible replicates.
The realization that this could be applied to any permutation test on any dataset was an important breakthrough in the area of applied statistics. The earliest known references to this approach are Eden and Yates (1933) and Dwass (1957).^{[6]}^{[7]}
This type of permutation test is known under various names: *approximate permutation test*, *Monte Carlo permutation test*, or *random permutation test*.^{[8]}
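A Monte Carlo version of the two-sample test can be sketched as follows. The function name, the default of 10,000 resamples, the fixed seed, and the data are all illustrative assumptions. The sketch reports the plain proportion of sampled relabellings that are at least as extreme as the observed one; some authors instead add one to both numerator and denominator so that the estimate is never zero.

```python
import random
from statistics import mean


def monte_carlo_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Estimate the two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    pooled = list(a) + list(b)
    n_a = len(a)
    t_obs = abs(mean(a) - mean(b))
    exceed = 0
    for _ in range(n_resamples):
        # A random shuffle followed by a split is a random relabelling.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Small tolerance so floating-point noise does not hide ties.
        if abs(diff) >= t_obs - 1e-12:
            exceed += 1
    return exceed / n_resamples


# Made-up data; full enumeration of the 20 relabellings gives exactly 0.1.
p_hat = monte_carlo_permutation_test([12, 15, 14], [8, 9, 10])
print(p_hat)
```

With samples this small, full enumeration is cheaper than sampling; the Monte Carlo approach pays off when the number of relabellings is too large to list.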

After [math]\displaystyle{ N }[/math] random permutations, it is possible to obtain a confidence interval for the p-value based on the binomial distribution (see Binomial proportion confidence interval). For example, if after [math]\displaystyle{ N = 10000 }[/math] random permutations the p-value is estimated to be [math]\displaystyle{ \widehat{p}=0.05 }[/math], then a 99% confidence interval for the true [math]\displaystyle{ p }[/math] (the one that would result from trying all possible permutations) is [math]\displaystyle{ \left[\widehat{p}-z\sqrt{\frac{0.05(1-0.05)}{10000}}, \widehat{p}+z\sqrt{\frac{0.05(1-0.05)}{10000}} \right]\approx[0.044, 0.056] }[/math], where [math]\displaystyle{ z \approx 2.576 }[/math] is the 0.995 quantile of the standard normal distribution.
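The arithmetic of this normal-approximation interval can be checked with a short script. The function name `p_value_confidence_interval` is an assumption for illustration; `NormalDist` comes from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist


def p_value_confidence_interval(p_hat, n_resamples, level=0.99):
    """Normal-approximation (Wald) interval for a Monte Carlo p-value."""
    # Two-sided interval: use the (1 + level) / 2 quantile of N(0, 1).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n_resamples)
    # Clip to [0, 1], since a p-value cannot leave that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


low, high = p_value_confidence_interval(0.05, 10_000)
print(f"[{low:.4f}, {high:.4f}]")  # [0.0444, 0.0556]
```

The half-width is about 0.0056, so the interval straddles 0.05 almost symmetrically.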

On the other hand, the purpose of estimating the p-value is most often to decide whether [math]\displaystyle{ p \leq \alpha }[/math], where [math]\displaystyle{ \alpha }[/math] is the threshold at which the null hypothesis will be rejected (typically [math]\displaystyle{ \alpha=0.05 }[/math]). In the example above, the confidence interval only tells us that there is roughly a 50% chance that the p-value is smaller than 0.05, i.e. it is completely unclear whether the null hypothesis should be rejected at a level [math]\displaystyle{ \alpha=0.05 }[/math].

If it is only important to know whether [math]\displaystyle{ p \leq \alpha }[/math] for a given [math]\displaystyle{ \alpha }[/math], it is logical to continue simulating until the statement [math]\displaystyle{ p \leq \alpha }[/math] can be established to be true or false with a very low probability of error. Given a bound [math]\displaystyle{ \epsilon }[/math] on the admissible probability of error (the probability of finding that [math]\displaystyle{ \widehat{p} \gt \alpha }[/math] when in fact [math]\displaystyle{ p \leq \alpha }[/math] or vice versa), the question of how many permutations to generate can be seen as the question of when to stop generating permutations, based on the outcomes of the simulations so far, in order to guarantee that the conclusion (which is either [math]\displaystyle{ p \leq \alpha }[/math] or [math]\displaystyle{ p \gt \alpha }[/math]) is correct with probability at least as large as [math]\displaystyle{ 1-\epsilon }[/math]. ([math]\displaystyle{ \epsilon }[/math] will typically be chosen to be extremely small, e.g. 1/1000.) Stopping rules to achieve this have been developed^{[9]} which can be incorporated with minimal additional computational cost. In fact, depending on the true underlying p-value it will often be found that the number of simulations required is remarkably small (e.g. as low as 5 and often not larger than 100) before a decision can be reached with virtual certainty.
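The idea of simulating until a decision is reached can be illustrated with a deliberately simple rule. The sketch below is not the cited stopping rule: it is a cruder scheme based on Hoeffding's inequality with a union bound over steps, so it is valid but typically needs many more draws than the numbers quoted above. The names `sequential_decision` and `draw_exceeds`, and the cap `max_draws`, are assumptions made for this example.

```python
import random
from math import log, sqrt


def sequential_decision(draw_exceeds, alpha=0.05, epsilon=0.001, max_draws=100_000):
    """Draw random permutations until p <= alpha or p > alpha can be asserted.

    draw_exceeds() must return True when one fresh random permutation gives a
    statistic at least as extreme as the observed one. The probability of a
    wrong decision is at most epsilon (Hoeffding bound plus union bound).
    """
    exceed = 0
    for n in range(1, max_draws + 1):
        exceed += bool(draw_exceeds())
        p_hat = exceed / n
        # Error budget for step n; these budgets sum to epsilon over all n.
        eps_n = epsilon / (n * (n + 1))
        # Hoeffding: P(|p_hat - p| >= margin) <= eps_n for this margin.
        margin = sqrt(log(2 / eps_n) / (2 * n))
        if abs(p_hat - alpha) >= margin:
            return ("p <= alpha" if p_hat < alpha else "p > alpha"), n
    return "undecided", max_draws


# Simulated stand-in for a real permutation draw: true p-value is 0.5.
rng = random.Random(1)
decision, n_used = sequential_decision(lambda: rng.random() < 0.5)
print(decision, n_used)
```

A p-value far from the threshold is resolved after few draws, while a p-value close to the threshold may exhaust `max_draws` and return "undecided"; purpose-built stopping rules are considerably more efficient.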

## Example tests

## Literature

Original references:

- Fisher, R.A. (1935) *The Design of Experiments*, New York: Hafner
- Pitman, E. J. G. (1937) "Significance tests which may be applied to samples from any population", *Royal Statistical Society Supplement*, 4: 119–130 and 225–32 (parts I and II). JSTOR 2984124 JSTOR 2983647
- Pitman, E. J. G. (1938). "Significance tests which may be applied to samples from any population. Part III. The analysis of variance test". *Biometrika* **29** (3–4): 322–335. doi:10.1093/biomet/29.3-4.322.

Modern references:

- Collingridge, D.S. (2013). "A Primer on Quantitized Data Analysis and Permutation Testing". *Journal of Mixed Methods Research* **7** (1): 79–95. doi:10.1177/1558689812454457.
- Edgington, E.S. (1995) *Randomization tests*, 3rd ed. New York: Marcel-Dekker
- Good, Phillip I. (2005) *Permutation, Parametric and Bootstrap Tests of Hypotheses*, 3rd ed., Springer ISBN:0-387-98898-X
- Good, P (2002). "Extensions of the concept of exchangeability and their applications". *Journal of Modern Applied Statistical Methods* **1** (2): 243–247. doi:10.22237/jmasm/1036110240.
- Lunneborg, Cliff. (1999) *Data Analysis by Resampling*, Duxbury Press. ISBN:0-534-22110-6.
- Pesarin, F. (2001). *Multivariate Permutation Tests: With Applications in Biostatistics*, John Wiley & Sons. ISBN:978-0471496700
- Welch, W. J. (1990). "Construction of permutation tests". *Journal of the American Statistical Association* **85** (411): 693–698. doi:10.1080/01621459.1990.10474929.

Computational methods:

- Mehta, C. R.; Patel, N. R. (1983). "A network algorithm for performing Fisher's exact test in r x c contingency tables". *Journal of the American Statistical Association* **78** (382): 427–434. doi:10.1080/01621459.1983.10477989.
- Mehta, C. R.; Patel, N. R.; Senchaudhuri, P. (1988). "Importance sampling for estimating exact probabilities in permutational inference". *Journal of the American Statistical Association* **83** (404): 999–1005. doi:10.1080/01621459.1988.10478691.
- Gill, P. M. W. (2007). "Efficient calculation of p-values in linear-statistic permutation significance tests". *Journal of Statistical Computation and Simulation* **77** (1): 55–61. doi:10.1080/10629360500108053. http://rsc.anu.edu.au/%7Epgill/papers/103Fisher.pdf.

### Current research on permutation tests

- Good, P.I. (2012) Practitioner's Guide to Resampling Methods.
- Good, P.I. (2005) Permutation, Parametric, and Bootstrap Tests of Hypotheses
- Bootstrap Sampling tutorial
- Hesterberg, T. C., D. S. Moore, S. Monaghan, A. Clipson, and R. Epstein (2005): Bootstrap Methods and Permutation Tests, software.
- Moore, D. S., G. McCabe, W. Duckworth, and S. Sclove (2003): Bootstrap Methods and Permutation Tests
- Simon, J. L. (1997): Resampling: The New Statistics.
- Yu, Chong Ho (2003): Resampling methods: concepts, applications, and justification. *Practical Assessment, Research & Evaluation*, 8(19). *(statistical bootstrapping)*
- Resampling: A Marriage of Computers and Statistics (ERIC Digests)

## References

1. Moore, Jason H. "Bootstrapping, permutation testing and the method of surrogate data." *Physics in Medicine & Biology* 44.6 (1999): L11.
2. Onghena, Patrick (2017-10-30), Berger, Vance W., ed., "Randomization Tests or Permutation Tests? A Historical and Terminological Clarification", *Randomization, Masking, and Allocation Concealment* (Boca Raton, FL: Chapman and Hall/CRC): pp. 209–228, doi:10.1201/9781315305110-14, ISBN 978-1-315-30511-0, https://www.taylorfrancis.com/books/9781315305103/chapters/10.1201/9781315305110-14, retrieved 2021-10-08
3. "Invited Articles". *Journal of Modern Applied Statistical Methods* **1** (2): 202–522. Fall 2011. http://tbf.coe.wayne.edu/jmasm/vol1_no2.pdf.
4. Collingridge, Dave S. (11 September 2012). "A Primer on Quantitized Data Analysis and Permutation Testing". *Journal of Mixed Methods Research* **7** (1): 81–97. doi:10.1177/1558689812454457.
5. Chung, EY; Romano, JP (2013). "Exact and asymptotically robust permutation tests". *The Annals of Statistics* **41** (2): 487–507. doi:10.1214/13-AOS1090.
6. Eden, T; Yates, F (1933). "On the validity of Fisher's z test when applied to an actual example of non-normal data. (With five text-figures.)". *The Journal of Agricultural Science* **23** (1): 6–17. doi:10.1017/S0021859600052862. https://www.cambridge.org/core/journals/journal-of-agricultural-science/article/abs/on-the-validity-of-fishers-z-test-when-applied-to-an-actual-example-of-nonnormal-data-with-five-textfigures/6232D2A79D698995B23E1A1AF4CEA8AB. Retrieved 3 June 2021.
7. Dwass, Meyer (1957). "Modified Randomization Tests for Nonparametric Hypotheses". *Annals of Mathematical Statistics* **28** (1): 181–187. doi:10.1214/aoms/1177707045.
8. Thomas E. Nichols, Andrew P. Holmes (2001). "Nonparametric Permutation Tests For Functional Neuroimaging: A Primer with Examples". *Human Brain Mapping* **15** (1): 1–25. doi:10.1002/hbm.1058. PMID 11747097. PMC 6871862. http://www.fil.ion.ucl.ac.uk/spm/doc/papers/NicholsHolmes.pdf.
9. Gandy, Axel (2009). "Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk". *Journal of the American Statistical Association* **104** (488): 1504–1511. doi:10.1198/jasa.2009.tm08368.

Original source: https://en.wikipedia.org/wiki/Permutation test.