DMelt:Statistics/1 General Topics
Statistics of 1D arrays
The package jhplot.stat can be used for descriptive analysis of random distributions in 1D. Similarly, the cern.jet.stat.Descriptive package contains methods to calculate many statistical characteristics.
Consider also several other packages:
- Descriptive statistics package
- Major statistical distributions
- Colt statistics package
But before using such packages, first check the data containers such as jhplot.P1D or jhplot.H1D. They already have many useful methods to access statistical information on data.
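For instance, here is a minimal sketch (it assumes that the P0D constructor accepts an array of numbers, which Jython builds from a Python list; the methods used are shown later in this tutorial):

from jhplot import *

p0=P0D([1.,2.,3.,4.,5.])   # Jython converts the Python list to a Java array
print p0.variance()        # variance of the sample
print p0.getStat()         # dictionary with many statistical characteristics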
Some statistical questions have already been described before. Below we will consider several advanced topics.
Moments of a distribution
DataMelt can be used to determine statistical characteristics of an arbitrary frequency distribution. Moments of this distribution can be calculated up to the 6th order. Read more about Moment (mathematics).
from jhplot import *
from jhplot.math.StatisticSample import *

a=randomLogNormal(1000,0,10)  # generate 1000 random numbers from a log-normal distribution (mu=0, sigma=10)
p0=P0D(a)                     # convert it to an array
print p0.getStatString()      # print detailed characteristics
Run this script and you will get very detailed information about this distribution (rather self-explanatory):
Size: 1000
Sum: 2.0795326321690155E11
SumOfSquares: 1.722072831288292E22
Min: 4.3681673233597326E-14
Max: 1.187289072883721E11
Mean: 2.0795326321690154E8
RMS: 4.1497865382309628E9
Variance: 1.7194678431631995E19
Standard deviation: 4.14664664899627E9
Standard error: 1.3112848062732975E8
Geometric mean: 0.7193930848008395
Product: 9.252494313364321E-144
Harmonic mean: 2.2976022239249118E-11
Sum of inversions: 4.352363475222163E13
Skew: 25.65476598759878
Kurtosis: 694.7699433878664
Sum of powers(3): 1.839916709064571E33
Sum of powers(4): 2.0782654881247146E44
Sum of powers(5): 2.4093597349729484E55
Sum of powers(6): 2.8286717081193334E66
Moment(0,0): 1.0
Moment(1,0): 2.0795326321690154E8
Moment(2,0): 1.722072831288292E19
Moment(3,0): 1.839916709064571E30
Moment(4,0): 2.0782654881247147E41
Moment(5,0): 2.409359734972948E52
Moment(6,0): 2.8286717081193336E63
Moment(0,mean()): 1.0
Moment(1,mean()): 4.931390285491944E-7
Moment(2,mean()): 1.7177483753200437E19
Moment(3,mean()): 1.8291913748162454E30
Moment(4,mean()): 2.0630054468429083E41
Moment(5,mean()): 2.3878300421487077E52
Moment(6,mean()): 2.798744135044988E63
25%, 50%, 75% Quantiles: 0.0012310644573427145, 0.9530465118707188, 535.0653267374155
quantileInverse(median): 0.5005
Distinct elements & frequencies not printed (too many).
Let us continue with this example. Now we would like to return all statistical characteristics of the sample as a dictionary. We can do this by appending the following lines that 1) create a dictionary "stat" with key/value pairs; 2) retrieve the variance of the sample using the key "variance".
stat=p0.getStat()
print "Variance=",stat["variance"]
which will print a line such as "Variance= 757.3" (the exact value depends on the generated sample). If you are not sure about the names of the keys, simply print the whole dictionary with "print stat".
One can create histograms that capture the most basic characteristics of data. This is especially useful if there is no particular reason to keep the complete data arrays. We can easily do this with the log-normal sample generated above:
h=p0.getH1D(10, 0, 100)
print h.getStat()
The code converts the array into a histogram with 10 equidistant bins in the range 0-100, and then it prints the map with statistical characteristics.
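The exact key names in this map may vary; a simple way to inspect them is to loop over the returned map, as in this short sketch:

stat=h.getStat()
for key in stat:
    print key, '=', stat[key]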
You can also visualize the random numbers in the form of a histogram, as shown in the following example: we generate random numbers, fill a histogram and plot it.
from jhplot import *
from jhplot.math.StatisticSample import *

a=randomLogNormal(1000,0,10)
h=H1D("LogNormal",40,-10,10)   # make a histogram with 40 bins
h.fill(a)
c1 = HPlot()
c1.setGTitle("LogNormal")
c1.visible()
c1.setAutoRange()
c1.draw(h)
Statistics with 2D arrays
You can get detailed statistics on 2D data represented by jhplot.P1D arrays using the method getStat(axis), where axis=0 for X and axis=1 for Y. It returns a map (for Java) or a Python dictionary (for Jython) where each statistical characteristic can be accessed using a key, such as mean, RMS, variance, error on the mean, etc. Assuming that the P1D data are represented by a "p1" object, try this code:
stat=p1.getStat(0)   # get a Python dictionary with statistics for X (axis=0)
for key in stat:
    print key, '=', stat[key]
This will print the following values:
error = 0.996592835069
rms = 5.05682000584
mean = 4.42857142857
variance = 6.95238095238
stddev = 2.63673679998
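If you do not have data at hand, the following minimal sketch builds a small P1D and prints its X-axis statistics (the data values are purely illustrative):

from jhplot import *

p1=P1D("example data")
for x,y in [(1,2),(2,3),(3,5),(4,4),(5,7),(6,8),(7,2)]:
    p1.add(x,y)                # add a (X,Y) point
stat=p1.getStat(0)             # statistics for the X axis
for key in stat:
    print key, '=', stat[key]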
Here is a more detailed example:
from jhplot import *
from jhplot.math.StatisticSample import *

a=randomLogNormal(1000,0,1)
# get statistics
p0=P0D(a)
print p0.getStat()
# make a histogram
h=H1D("LogNormal",30,0,1)
h.fill(a)
c1 = HPlot("LogNormal")
c1.setGTitle("LogNormal")
c1.setRange(0,1,0,100)
c1.visible()
c1.draw(h)
Comparing two histograms
Comparison of two histograms tests the hypothesis that the two histograms represent identical distributions. Both jhplot.H1D and jhplot.H2D histograms have a method called "compareChi2" to perform Pearson's chi-squared test. It calculates the chi2 between 2 histograms, taking into account the errors on the heights of the bins. The number chi2/ndf gives the estimate: values smaller than or close to 1 indicate similarity between the 2 histograms. Here is an example:
d=h1.compareChi2(h2)   # h1, h2 are H1D or H2D histograms defined above
chi2=d["chi2"]         # chi2
ndf =d["ndf"]          # number of degrees of freedom
p   =d["p-value"]      # probability (p-value)
Two histograms are identical if chi2=0. Make sure that both histograms have errors (or set them to small values).
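For instance, here is a minimal self-contained sketch that fills two histograms from the same Gaussian, so chi2/ndf should come out close to 1 (it assumes Poissonian bin errors are assigned during filling, as the note above requires):

from jhplot import *
from java.util import Random

r=Random()
h1=H1D("first",20,-3.0,3.0)
h2=H1D("second",20,-3.0,3.0)
for i in range(1000):
    h1.fill(r.nextGaussian())
    h2.fill(r.nextGaussian())
d=h1.compareChi2(h2)
print "chi2/ndf=",d["chi2"]/d["ndf"]," p-value=",d["p-value"]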
A similar method also exists for jhplot.P1D data points. The comparison is done for Y-values, assuming symmetric errors on Y. However, the data should be ordered in X for a correct comparison, as shown in the sketch below.
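A hedged sketch of such a comparison (it assumes that P1D.compareChi2 returns the same map of keys as for histograms, and that the three-argument add(x,y,error) sets a symmetric error on Y):

from jhplot import *

p1=P1D("first")
p2=P1D("second")
for i in range(10):            # points are ordered in X
    p1.add(i, 2.0*i, 0.5)      # X, Y, symmetric error on Y (assumed signature)
    p2.add(i, 2.1*i, 0.5)
d=p1.compareChi2(p2)
print "chi2/ndf=",d["chi2"]/d["ndf"]," p-value=",d["p-value"]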
Comparing histograms with a function
Similarly, one can run Pearson's chi-squared test on a histogram and a function. The following example shows how to do this:
f1=F1D("x*x") d=h1.compareChi2(f1) chi2=d["chi2"] # chi2 ndf =d["ndf"] # number of degrees of freedom p =d["p-value"] # probability (p-value)
Again, this test returns chi2 and the p-value.
Linear regression analysis
There are several tools to perform linear regression and estimate the slope and intercept (with statistical uncertainties), as well as the prediction and confidence bands. Let us make a short example that generates 2D data and performs a linear regression fit:
from jhplot import *
from jhplot.stat import LinReg
from java.awt import Color

# create input data
data="""2 3 4 2
8 2 3 1
7 1 2 3
7 2 3 4
2 3 4 5
1 2 2 4
3 3 4 5
2 2 1 3
8 8 8 3
1 3 5 1
2 3 1 3
4 5 6 1
2 1 0 0
2 3 4 5
6 7 8 2
1 2 3 5
2 3 1 2
1 2 4 2
3 5 5 5
1 2 3 4
"""
file = open("pnd.d", "w")
file.write(data)
file.close()

# read data from the external file
pn=PND('data','pnd.d')
print pn.toString()
p0=pn.getP0D(1)            # extract 2nd column and put it into a 1D array
print p0.getStat()         # print statistics
print p0.variance()
h1=p0.getH1D(10)           # make a histogram with 10 bins
p1=pn.getP1D(1,2)          # extract columns 2 and 3
c1=HPlot("Plot",600,400,2,1)
c1.visible()
c1.cd(1,1)                 # go to the first drawing region
c1.setAutoRange()
c1.draw(h1)
c1.cd(2,1)                 # go to the second drawing region
c1.setAutoRange()
r = LinReg(p1)
print "Intercept=",r.getIntercept(), "+/-",r.getInterceptError()
print "Slope=",r.getSlope(),"+/-",r.getSlopeError()
# create a string with an a*x+b function
func='%4.2f*x+%4.2f' % (r.getSlope(),r.getIntercept())
f1=F1D(func, p1.getMin(0), p1.getMax(0))  # define a function in the range of the data
p=r.getPredictionBand(Color.green)        # calculate the prediction band
p.setLegend(False)                        # do not show the legend
p.setErrColor(Color.green)                # color for error bars
c1.draw(p)
c1.draw(p1)                # redraw the data and the function
c1.draw(f1)
c1.export("tutorial_dmin1.eps")  # make an image file (EPS)
p2=pn.getP2D(0,1,2)        # extract columns 1,2,3
p3=pn.getP2D(0,2,3)        # extract columns 1,3,4
c2=HPlot3D("Plot",600,400,2,1)
c2.visible()
c2.setAutoRange()
c2.cd(1,1)
c2.setBoxColor(Color.white)
c2.draw(p2)
c2.cd(2,1)
c2.setBoxColor(Color(200,210,210))
c2.setAutoRange()
c2.draw(p3)
c2.export("tutorial_dmin2.eps")
The output of this script gives the fit values:
<file> Intercept= 0.0637647629564 +/- 0.0656878315703 Slope= 0.1024396794 +/- 0.331874315937 </jcode>
Normalised Factorial Moments
Normalized factorial moments (NFM) are used to measure deviations of a multiplicity distribution from a Poisson distribution. As an example, let us consider the calculation of NFM for several distributions. They are defined as
[math]\displaystyle{ F_q = \langle n (n-1) \cdots (n-q+1) \rangle / \langle n \rangle^{q} }[/math].
where "n" is an integer number. According to this definition, Poisson_distribution has all moments equal to 1. Distributions broader than a Poisson distribution have moments larger than one.
Let us calculate the NFM up to the 4th order for a Poisson distribution, a binomial distribution and a negative binomial distribution:
# Defined as F_q=<n*(n-1) .. (n-q+1)> / <n>^q
# authors: S.Chekanov
from jhplot import *
from cern.jet.random.engine import *
from cern.jet.random import *
from jhplot.shapes import Line
from java.awt import Color
from jhpro.stat import *

c1 = HPlot("Canvas",600,400)
c1.visible(1)
c1.setNameX("NFM order")
c1.setNameY("Values")
c1.setRange(0,5,0,2)

# show a horizontal line at F_q=1 in the USER coordinate system
line = Line(0.0,1, 5., 1.)
line.setPosCoord("USER")
line.setColor(Color.gray)
line.setTransparency(0.5)
c1.add(line)

# build a random engine
engine=MersenneTwister()
poisson=Poisson(10,engine)
m=MomentsFacNorm(4)        # calculate moments up to the 4th order
# use Poissonian random numbers
for i in range(100):
    m.process( poisson.nextInt())
p1=m.getResults()
p1.setTitle("NFM for Poisson")
p1.setSymbol(4)
p1.setSymbolSize(10)
c1.draw(p1)
print(p1.toString())

## Binomial distribution
binomial=Binomial(10, 0.2, engine)
m=MomentsFacNorm(4)        # calculate moments up to the 4th order
for i in range(200):
    m.process( binomial.nextInt())
p2=m.getResults()
p2.setTitle("NFM for Binomial")
p2.setSymbol(5)
p2.setSymbolSize(10)
c1.draw(p2)
print(p2.toString())

## Negative binomial distribution (NBD)
nbinom=NegativeBinomial(10, 0.4, engine)
m=MomentsFacNorm(4)
for i in range(300):
    m.process( nbinom.nextInt())
p3=m.getResults()
p3.setTitle("NFM for NBD")
p3.setSymbol(6)
p3.setSymbolSize(10)
c1.draw(p3)
print(p3.toString())
The output shows the NFM for all three distributions together with their statistical errors.
Correlation coefficients
Correlation coefficients between two Python lists can be obtained using several methods.
Correlation-coefficient calculations are included in the statlib.stats package. One can obtain the covariance, Pearson correlation and Spearman rank-order correlation using the methods shown in this example:
from statlib.stats import *
from random import *

ran=Random()
mu,sigma=2.0,3.0
x,y=[],[]
for i in range(100):
    t=ran.gauss(mu,sigma)
    x.append(t)
    y.append(t*2+ran.gauss(mu,sigma))

print stdev(x),'+/-',sterr(x)
print mean(y),'+/-',sem(y)
print 'Covariance=',lcov(x,y)
print 'Pearson (correlation coeff. prob)=',lpearsonr(x,y)
print 'Spearman rank-order correlation=',lspearmanr(x,y)
Setting limits
This section describes how to set a limit on the observation of a signal in the presence of a background distribution, including the case where the background is affected by a systematic uncertainty. We consider how to estimate the statistical significance of an observation in the presence of statistical and systematic errors (in the case of an observation), and how to set a 95% CL exclusion limit with the correct treatment of statistical errors (in the case of no observation).
Please go to DMelt:Statistics/4_Statistical_Limits