Small samples

Many statistical methods and theorems are considerably simplified in the asymptotic limit of large amounts of data, very precise measurements, or linear models. Unfortunately, research workers accustomed to the simplified results may be unaware of the complications that arise in the general case, i.e. for small data samples. These complications have various origins:

Differences between the Bayesian and non-Bayesian approaches to statistics (→ Bayesian Statistics), which are negligible for large data samples, become important for small samples, where the treatment of prior knowledge has a large effect on the quoted results.
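
As an illustration, the sketch below (assuming SciPy; the confidence level and event counts are invented toy values) compares an exact frequentist upper limit on a Poisson mean with Bayesian credible limits under a flat and a Jeffreys prior:

```python
# Toy comparison of frequentist and Bayesian upper limits on a
# Poisson mean mu, given n observed events (assumes SciPy).
from scipy.stats import gamma

def freq_upper(n, cl=0.90):
    # Exact classical upper limit: the mu for which P(X <= n; mu) = 1 - cl,
    # which for a Poisson count is the cl-quantile of a Gamma(n + 1).
    return gamma.ppf(cl, n + 1)

def bayes_upper(n, cl=0.90, prior="flat"):
    # Posterior for mu is Gamma(n + 1) under a flat prior and
    # Gamma(n + 1/2) under the Jeffreys prior 1/sqrt(mu).
    a = n + 1.0 if prior == "flat" else n + 0.5
    return gamma.ppf(cl, a)

for n in (0, 3, 100):
    print(n, freq_upper(n), bayes_upper(n),
          bayes_upper(n, prior="jeffreys"))
```

For this particular (Poisson, flat-prior, upper-limit) case the Bayesian and frequentist limits happen to coincide, while the Jeffreys prior gives a visibly different limit at small n; for n around 100 all three agree to within a percent or so.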

Many methods assume that certain underlying distributions are Gaussian because of the central limit theorem, whereas for small samples this may not be true. For example, χ² tests on histograms are valid only when there are enough events per bin (→ Binning) so that the distribution in each bin is approximately Gaussian. For small samples, one must fall back on the multinomial distribution, which is much harder to handle.
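
To see the breakdown concretely, the following sketch (assuming SciPy and NumPy; the bin probabilities and counts are made-up toy values) compares the asymptotic χ² p-value with an exact p-value obtained by enumerating the full multinomial distribution, which is feasible only because the example is tiny:

```python
# Asymptotic chi-square vs exact multinomial p-value for a
# 4-bin histogram with only 8 events (toy values, assumes SciPy).
import itertools
import numpy as np
from scipy.stats import chi2, multinomial

probs = np.array([0.25, 0.25, 0.25, 0.25])  # hypothesized bin probabilities
obs = np.array([5, 1, 1, 1])                # observed counts, N = 8
N = int(obs.sum())

def chisq(counts):
    expected = N * probs
    return float(((counts - expected) ** 2 / expected).sum())

t_obs = chisq(obs)
p_chi2 = chi2.sf(t_obs, df=len(probs) - 1)  # asymptotic p-value

# Exact p-value: total multinomial probability of every outcome
# whose chi-square statistic is at least as large as observed.
p_exact = 0.0
for c in itertools.product(range(N + 1), repeat=len(probs)):
    if sum(c) == N and chisq(np.array(c)) >= t_obs:
        p_exact += multinomial.pmf(c, N, probs)

print(p_chi2, p_exact)  # the two can differ appreciably at N = 8
```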

Related to the above is the problem of giving correct confidence limits on variables when only a few events are available. In the non-Bayesian theory, exact confidence limits are given for the Poisson distribution (for cross-sections) in Regener51, and for the binomial distribution (for branching ratios) in James80. The observed value of a variable is often used in place of its expected value in statistical calculations, and this approximation may be poor for small samples. For example, the standard deviation of a Poisson distribution is exactly the square root of the expected value, but only approximately the square root of the observed value. Since the expected value depends on the hypothesis (or on the parameters of the fit), it is more convenient to use the observed value, which stays constant during a fit and does not depend on the hypothesis. This introduces a bias for small samples, since a negative fluctuation is then assigned a smaller error, and hence a larger weight, than a positive fluctuation.
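
Both points can be demonstrated in a few lines. The sketch below (assuming SciPy and NumPy; μ = 5 and the sample size are toy choices) first computes exact central Poisson confidence limits through the standard chi-square-quantile construction, then shows by Monte Carlo how weighting by the observed count, 1/n, instead of the expected 1/μ, pulls a weighted mean downward:

```python
# Exact central Poisson confidence limits, plus a Monte Carlo
# demonstration of the observed-vs-expected weighting bias
# (assumes SciPy and NumPy; mu and sample size are toy values).
import numpy as np
from scipy.stats import chi2

def poisson_limits(n, cl=0.90):
    # Standard exact construction via chi-square quantiles.
    alpha = 1.0 - cl
    lo = 0.0 if n == 0 else 0.5 * chi2.ppf(alpha / 2, 2 * n)
    hi = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (n + 1))
    return lo, hi

print(poisson_limits(0), poisson_limits(3))

rng = np.random.default_rng(0)
mu = 5.0
counts = rng.poisson(mu, size=100_000)
counts = counts[counts > 0]            # 1/n is undefined at n = 0
weighted = np.average(counts, weights=1.0 / counts)
print(counts.mean(), weighted)         # the weighted mean sits well below mu
```

Here the 1/n-weighted average comes out near 3.9 rather than 5: the low fluctuations get the large weights, exactly the bias described above.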

In calculating the usual χ², the observed number of events in a bin should be compared with the integral of the expectation over the bin, but one usually takes the value of the expectation in the middle of the bin and multiplies by the bin width. For small samples, bins may be wide and this approximation may be poor if the expectation varies considerably across the bin.
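
The size of this effect is easy to check. In the sketch below (assuming SciPy; the exponential shape and bin edges are toy choices), the midpoint-times-width approximation to a steeply falling expectation is compared with the exact bin integral:

```python
# Midpoint-times-width vs exact integral of the expectation over
# one wide bin, for a toy steeply falling shape f(x) = exp(-x).
import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-x)
lo, hi = 0.0, 2.0                          # one deliberately wide bin
midpoint = f(0.5 * (lo + hi)) * (hi - lo)  # value at bin center * width
exact, _ = quad(f, lo, hi)
print(midpoint, exact)  # about 0.736 vs 0.865, a ~15% understatement
```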

When errors are propagated from measured quantities to derived quantities, the usual formula takes account only of the linear terms in the transformation of variables. This is a good approximation only when the transformation really is linear or when the errors are small. The calculation for the more general case can be quite complicated and generally gives rise to non-Gaussian errors on the derived quantity, even when the original measured quantities have Gaussian errors. An example of such a calculation by Monte Carlo methods is given in James81.
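
As a small illustration in the spirit of such Monte Carlo calculations (the quadratic transformation and the 50% relative error are invented toy values), the sketch below compares the linear propagation formula with a direct Monte Carlo for y = x², where x is Gaussian:

```python
# Linear error propagation vs Monte Carlo for y = x^2 with a
# Gaussian x of large relative error (toy values, assumes NumPy).
import numpy as np

mu_x, sig_x = 1.0, 0.5                  # measured value and its error
sig_y_linear = abs(2.0 * mu_x) * sig_x  # |dy/dx| * sigma_x = 1.0

rng = np.random.default_rng(1)
y = rng.normal(mu_x, sig_x, size=1_000_000) ** 2
print(sig_y_linear)                     # linear formula: 1.0
print(y.mean(), y.std())                # MC: mean ~1.25, spread ~1.06, skewed
```

The linear formula not only understates the spread but also misses the shift of the mean and the strong skewness of y, all of which the Monte Carlo reproduces automatically.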