DScience:Histograms



   


A histogram is a summary graph that shows counts of data points falling into various ranges; it thus gives an approximation of the frequency distribution of the data. It is an elegant tool to project multidimensional data to lower dimensions and display such projections for visual inspection. The histogram shows data in the form of a bar graph in which the bar heights display event frequencies. Events are measured on the horizontal axis, "X", which has to be binned. The larger the number of bins, the higher the chance that a fine structure of the data can be resolved. The binning, however, destroys the fine-grain information of the original data.

In this respect, the histogram representation is useful if one needs to create a statistical snapshot of a large data sample in a compact form. Let us illustrate this. Assume we have [math]N[/math] numbers, each representing a single measurement. One can store such data in the form of Java or Python lists. Thus, one needs to store [math]8\times N[/math] bytes (assuming 8 bytes to keep one number). In the case of a large number of measurements, we need to be ready to store a very big output file, as the size of this file is proportional to the number of events. Instead, one can keep for future use only the most important statistical summary of the data, such as the shape of the frequency distribution and the total number of events. The information which needs to be stored is proportional to the number of bins, so the file size does not depend on the size of the original data. An example of a histogram is shown in the following DMelt example:

DMelt example: One dimensional histogram filled with Gaussian numbers

Each bin of the histogram has an associated statistical uncertainty, shown with a vertical line. The size of each error bar is equal to the square root of the number of entries in that bin.


1D histogram

To create a histogram in one dimension (1D), one needs to define the number of bins, Nbins, and the minimum (Min) and the maximum (Max) value for a certain variable. The bin width is given by (Max-Min)/Nbins. If the bins are too wide (Nbins is small), important information might get lost. On the other hand, if the bins are too narrow (Nbins is large), what may appear to be meaningful information really may be due to statistical variations of data entries in bins. To determine whether the bin width is set to an appropriate size, different bin sizes should be tried.

DMelt histograms are designed on the basis of the JAIDA FreeHEP library. For one-dimensional histograms, use the class jhplot.H1D. To initialize an empty histogram, the following constructor can be used:

The code above creates a 1D histogram with the title “data”, the number of bins Nbins=100, and the range of axis [math]X[/math] to be binned, which is defined by the minimum and the maximum values, Min=0 and Max=20. Thus, the bin width of this histogram is fixed to 0.2. The bin size and the number of bins are given by the following methods:

d=h1.getBinSize() 
i=h1.getBins() 

We should note that a fixed-size binning is used. In the following sections, we will consider a more general case when the histogram bin size is not fixed to a single value.
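The constructor call itself does not appear on this page; from the description it presumably has the form H1D("data",100,0,20), which is an assumption here. The arithmetic behind fixed-size binning can be sketched in plain Python without DMelt. The helper name bin_index and the convention that the upper edge is excluded are illustrative choices, not taken from the H1D class.

```python
# Plain-Python sketch of fixed-width binning (not the jhplot.H1D implementation).
nbins, xmin, xmax = 100, 0.0, 20.0   # Nbins, Min and Max quoted in the text

bin_width = (xmax - xmin) / nbins    # (Max-Min)/Nbins


def bin_index(x):
    """Return the 0-based index of the bin containing x.

    Returns None when x lies outside [xmin, xmax); treating the upper
    edge as out of range is an assumption of this sketch.
    """
    if x < xmin or x >= xmax:
        return None
    return int((x - xmin) / bin_width)


print(bin_width)
print(bin_index(5.1))
```

With these settings each bin covers 0.2 units of X, so the value 5.1 lands in the bin with index 25 (counting from zero).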

The method fill(d) fills a histogram with a single value, where “d” is a double number. The histograms can be displayed on the jhplot.HPlot canvas using the standard draw(h1) method.

Let us give a complete example of how to fill a histogram with Gaussian random numbers:

The example fills a histogram with Max Gaussian random numbers with a mean m and a standard deviation sd. To do this, we use the class java.util.Random.

In the simplest case, the code that makes the figure shown above looks like this:

We have used two new methods of the jhplot.H1D class in the script shown above. Due to their importance, we will discuss them here:

setFill(b)
fill a histogram area when [math]b=1[/math] (boolean “True”). When [math]b=0[/math] (boolean “False”), the area is not filled;
setFillColor(c)
color (Java AWT class) for filling the histogram area.

The histograms have the following features:

  • The height of bins for each histogram depends on the bin size. Even when the number of entries is the same, histograms are difficult to compare in shape when the histogram bins are different, see Fig. [h1d] (left).
  • The relative size of the errors decreases as the number of entries increases.
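The second feature follows from the square-root rule for counting uncertainties. A minimal plain-Python sketch, assuming a bin with n unweighted entries has the uncertainty sqrt(n); the function name relative_error is hypothetical:

```python
import math


def relative_error(n):
    """Relative statistical uncertainty sqrt(n)/n of a bin with n entries."""
    return math.sqrt(n) / n


# The relative error shrinks as 1/sqrt(n) when the bin content grows.
for n in (100, 10000):
    print(relative_error(n))
```

Increasing the bin content by a factor of 100 reduces the relative uncertainty by a factor of 10.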

We will show a few basic manipulations useful for examining the shapes of histograms, assuming that the underlying mechanism for occurrence of events reveals itself in shapes of event distributions, rather than in the overall statistics or chosen bin size. The shape of the distributions is very important as it conveys information about the probability distribution of event samples.

First of all, let us get rid of the bin dependence. To do this, we will divide each bin height by the bin size. Assuming that “h1” represents a histogram, this can be done as:

width=h1.getBinSize()
h1.scale(1/width) 

After this operation, all histogram entries (including statistical uncertainties) will be divided by the bin width. You may still want to keep a copy of the original histogram using the method h2=h1.copy(), so one can come back to the original histogram if needed.

Different histograms can have different normalizations, so a visual comparison of histograms might be difficult. If we are interested only in histogram shapes, we can divide each bin height by the total number of histogram entries. This involves another scaling:

entries=h1.allEntries()
h1.scale(1.0/entries) # 1.0 forces floating-point division, since entries is an integer

Obviously, both operations above can be done using one line:

h1.scale(1/(h1.getBinSize()*h1.allEntries())) 
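The effect of the two scalings can be checked on a toy example in plain Python. This is only a sketch: the bin contents are invented, the helper scale mimics the idea of multiplying every bin height by a factor, and all entries are assumed to fall inside the histogram range.

```python
# Toy histogram: invented bin contents with a fixed bin width.
counts = [2, 5, 3]
bin_width = 0.5
entries = sum(counts)        # assumes no underflow or overflow entries


def scale(heights, factor):
    """Multiply every bin height by the same factor."""
    return [h * factor for h in heights]


# Use 1.0 rather than 1 so the division is floating-point even where
# "/" performs integer division on integers (Python 2 semantics).
density = scale(counts, 1.0 / (bin_width * entries))

# After both scalings the area under the histogram should be one.
area = sum(h * bin_width for h in density)
print(area)
```

The area under the scaled histogram comes out as one (up to rounding), which is what makes histograms with different binning and statistics comparable in shape.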

The second step in comparing our histograms would be to shift the bins of the third histogram. Normally, we do not know the exact shift (what should be done in this case will be considered later). For the moment, for the sake of simplicity, we assume that the shift is known and equals -2. There is a special histogram operation which performs such a shift:

h2.shift(-2) # shift all bins by -2

Now we are ready to modify all the histograms and to compare them. Look at the example below:

After execution of this script, you will find three overlaid histograms. The shapes of all histograms will be totally consistent with each other, i.e. all bin heights will agree within their statistical uncertainties.

There are many situations in which one may be interested in how well histogram shapes agree with each other. For example, let us assume that each histogram represents the number of days with rainfall measured during one year for one state. If the distributions are shown as histograms, it is obvious that bigger states have a larger number of days with rainfall compared to small states. This means that all histograms are shifted (roughly by a factor proportional to the area of the states, ignoring other geographical differences). The measurements could be done by different weather stations and the bin widths could be rather different, assuming that there is no agreement between the weather stations about how the histograms should be defined. Moreover, the measurements could be done during different time intervals, so the histograms could have rather different numbers of entries. How can one compare the results from different weather stations, if we are only interested in some regularities in the rainfall distributions? The answer to this question is in the example above: all histograms have to be 1) normalized, 2) shifted, and 3) freed of the bin-size dependence.

The only open question is how to find the horizontal shifts, since the normalization is straightforward and can be done with the method discussed above. This problem will be addressed in the following chapters, where we discuss a statistical test that evaluates the “fit” of a hypothesis to a sample.

Probability distribution and probability density

The examples above show that several quantities can be derived from a histogram. One can extract a probability distribution by dividing histogram entries by the total number of entries. The second important quantity is a probability density, obtained when the probability distribution is divided by the bin width, so that the total area of the rectangles sums to one (which is, in fact, the definition of the probability density).

Both the probability distribution and the density distribution can be obtained after dividing histogram entries as discussed above. However, these two characteristics can be obtained more easily by calling the following methods:

h2=h1.getProbability()
h2=h1.getDensity() 

which return two new H1D objects: the first represents the probability distribution and the second returns the probability density. In addition to the obvious simplicity, such methods are very useful for variable-bin-size histograms, since this case is taken into account automatically during the division by bin widths. You can check the density distribution using this statement:

print h1.integral()

which prints “1” if the histogram is properly normalized.

Note the following: one can save computation time when calculating probability distributions if the total number of events (or entries) [math]N_{tot}[/math] is known beforehand. In this case, one can obtain the probability distribution using the weight [math]w1 = 1.0/N_{tot}[/math] in the method fill(x,w1), without a subsequent call to the method getProbability(). After the end of the fill, the histogram will represent the probability distribution normalized to unity by definition. In addition, one can remove the bin dependence by specifying another weight as [math]w2 = 1.0/bsize[/math], where [math]bsize[/math] is the size of the bin. Finally, the density distribution can be obtained using the weight [math]w3=w1*w2[/math].
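The weights w1, w2 and w3 can be tried out with a small plain-Python stand-in for a histogram. This is a sketch under stated assumptions: the data values, the binning and the helper fill are invented for illustration and are not the H1D implementation.

```python
# Hypothetical fixed-size binning and data (all values inside the range).
xmin, xmax, nbins = 0.0, 2.0, 4
bsize = (xmax - xmin) / nbins
data = [0.1, 0.6, 0.7, 1.2, 1.9]

ntot = len(data)             # total number of entries, known beforehand
w1 = 1.0 / ntot              # weight giving a probability distribution
w2 = 1.0 / bsize             # weight removing the bin-size dependence
w3 = w1 * w2                 # weight giving a probability density


def fill(heights, x, w):
    """Add the weight w to the bin containing x; ignore out-of-range x."""
    if xmin <= x < xmax:
        heights[int((x - xmin) / bsize)] += w


prob = [0.0] * nbins
dens = [0.0] * nbins
for x in data:
    fill(prob, x, w1)
    fill(dens, x, w3)

total_prob = sum(prob)                    # sum of bin heights
area = sum(h * bsize for h in dens)       # sum of height times width
print(total_prob)
print(area)
```

Both printed values should be one up to rounding: the probability histogram sums to unity, and the density histogram has unit area, without any scaling step after the fill.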

Histogram characteristics

This section continues our discussion of important characteristics of the jhplot.H1D histogram class. The most popular characteristics of a histogram are the mean and the standard deviation (RMS). Assuming that h1 represents a H1D histogram, both (double) values can be obtained as:

d=h1.mean()
d=h1.rms() 

We already know that one can obtain the number of entries with the method allEntries(). However, some values could fall outside of the selected range during the fill() method. Luckily, the histogram class has the following list of methods to access the number of entries:

i=h1.allEntries()   # all entries 
i=h1.entries()      # number of entries in the range
i=h1.extraEntries() # under and overflow entries 
i=h1.getUnderflow() # underflow entries 
i=h1.getOverflow()  # overflow entries 

All methods above return integer numbers.
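The bookkeeping behind these counters can be sketched in plain Python. The data values are invented, and counting a value equal to the upper edge as overflow is an assumption of this sketch rather than documented H1D behavior.

```python
xmin, xmax = 0.0, 10.0
data = [-1.0, 2.5, 7.0, 10.5, 12.0, 3.3]   # hypothetical fill values

underflow = sum(1 for x in data if x < xmin)    # below the range
overflow = sum(1 for x in data if x >= xmax)    # above the range (assumed)
extra = underflow + overflow                    # underflow plus overflow
in_range = len(data) - extra                    # entries inside the range
all_entries = in_range + extra                  # every value passed to fill

print(in_range)
print(extra)
```

Here three values fall inside the range and three outside, so the two counters add up to the six values that were filled.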

Another useful characteristic is the histogram entropy. It is defined as the negative of the sum, over all bins, of the probability associated with each bin multiplied by the base-2 logarithm of that probability. One can get the value of the entropy with the method:

print "Entropy=",h1.getEntropy() 

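The entropy definition given above can be written out in plain Python as a cross-check. This is a sketch of the formula only, not the code behind getEntropy(); skipping empty bins (the usual convention that 0*log(0) counts as zero) is an assumption.

```python
import math


def entropy(heights):
    """Return -sum(p*log2(p)) over bins, with p = height / sum of heights."""
    total = float(sum(heights))
    result = 0.0
    for h in heights:
        if h > 0:                        # empty bins contribute nothing
            p = h / total
            result -= p * math.log(p, 2)
    return result


print(entropy([1, 1, 1, 1]))   # four equally populated bins
print(entropy([4, 0, 0, 0]))   # all entries in a single bin
```

Four equally populated bins give an entropy of two bits, while a histogram with all entries in one bin gives zero.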
Initialization and filling methods

Previously, it has been shown how to initialize a histogram with fixed bin sizes. One can also create a histogram using a simpler constructor, and then using a sequence of methods to set histogram characteristics:

h1=H1D("Title")
h1.setMin(min); h1.setMax(max); h1.setBins(bins) 

which are used to set the minimum, maximum and the number of bins. These methods can also be useful to redefine these histogram characteristics after the histogram was created using the usual approach.

One can also build a variable bin-size histogram by passing a list with the bin edges as shown in this example:

bins=[0,10,100,1000]
h1=H1D("Title",bins) 

This creates a histogram with three bins whose edges are given by the input list. This constructor is handy when the data follow a bell-shaped or falling distribution; in this case it is important to increase the bin size in certain regions (tails) in order to reduce statistical fluctuations.
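How a value is assigned to a bin when the edges are not equally spaced can be sketched with the standard bisect module. The function name find_bin and the half-open interval convention are assumptions of this sketch, not a description of the H1D internals.

```python
import bisect

edges = [0, 10, 100, 1000]   # the bin edges from the example above


def find_bin(x):
    """Return the 0-based bin index of x, or None outside [first, last)."""
    if x < edges[0] or x >= edges[-1]:
        return None
    return bisect.bisect_right(edges, x) - 1


# Bin widths differ from bin to bin for a variable-size binning.
widths = [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]

print(widths)
print(find_bin(500))
```

The three bins have widths 10, 90 and 900, and the value 500 falls into the last bin (index 2, counting from zero).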

As we already know, to fill a histogram with numbers, use the method fill(d). More generally, one can assign a weight “w” to each value as

h1.fill(d, w) 

where “w” is any arbitrary number representing a weight for a value “d”. The original method fill(d) assumes that all weights are 1.

But why do we need these weights? We have already discussed in Sect. [h1d:prob] that the weights are useful to reduce the computational time when the expected final answer should be either a probability distribution or a density distribution. There are other cases when the weights are useful. We should note again that a histogram object stores the sum of all weights in each bin. This sum equals the number of entries in a bin only when all weights are set to 1. Events may be given smaller weights if they are relatively unimportant compared to other events. It is up to you to make this decision, since it depends on the concrete situation.

The method fill(d) is slow in Jython when it is used inside loops; therefore, it is more efficient to fill a histogram at once using the method fill(list), where list is an array with numbers passed from another program or file. As before, fill(list, wlist) can be used to fill a histogram from two lists. Each number in list has a corresponding weight given by the second argument.

Instead of Jython (or Java) lists, one can pass a jhplot.P0D Java array:

h1.fill(p0d)

where p0d represents the P0D class.

Analogously, one can fill a histogram by passing a jhplot.PND multidimensional array. This can be done again with the method fill(pnd), where pnd is an array with any size or dimension. One can also specify weights in the form of an additional PND object passed as a second argument to the method fill(pnd, w).

Histograms can be filled with weights which are inversely proportional to the bin size. As was shown in the previous section, removing the bin-size dependence is one of the most common operations:

h1.fillInvBinSizeWeight(d)

It should be noted that this method works even when histograms have irregular binning.

Finally, one can set the bin contents (bin heights and their errors) from an external source as shown below:

h1.setContents(values, errors)
h1.setMeanAndRms(mean,rms)

where values and errors are input arrays. Together with the settings for the bin content, the second line of the above example shows how to set the global histogram characteristics, such as the mean and the standard deviation. There are more methods dealing with external arrays; advanced users can find appropriate methods in the API documentation of the class H1D or using the code assist.

One can create a histogram object from a function. Here is a simple example that shows how to do this:

Accessing histogram values

One-dimensional histograms based on the jhplot.H1D class can easily be viewed using the following convenient methods designed for visual inspection:

toString()
- convert a H1D histogram into a string
print()
- print a histogram
toTable()
- show a histogram as a table

Integration

Histogram integration is similar to the F1D functions considered in the previous chapters: We simply sum up all bin heights. This can be done using the method integral(). More often, however, it is necessary to sum up heights in a certain bin region, say between a bin “i1” and “i2”. Then use this method:

sum=h1.integral(i1,i2) 

We should note that the integral is not just the number of events between these two bins: the summation is performed using the bin heights. However, if the weights for the method fill() are all set to one, then the integral is equivalent to the summation of numbers of events.

The integration shown above does not include multiplication by the bin width. If one needs to calculate an integral over the bins by multiplying the bin content by the bin width (which can be either fixed or variable), use the method:

sum=h1.integral(i1,i2,1)

where the last parameter should be set to 1 (or to “true” in case of Java coding).

The next question is how to integrate a region in [math]X[/math] by translating [math]X[/math]-coordinates into the bin indexes. This can be done by calling the method findBin(x), which returns an index of the bin corresponding to a coordinate [math]X[/math]. One can call this method whenever a bin index has to be identified before calling the method integral(). Alternatively, this can be done in one line as:

sum=h1.integralRegion(xmin,xmax,b)

The method returns a value of the integral between two coordinates, xmin and xmax. The bin content will be multiplied by the bin width if the boolean value b is set to 1 (boolean “true” in Java).
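The difference between summing bin heights and summing heights multiplied by bin widths can be seen in a plain-Python sketch. The edges and heights are invented, and the 0-based, inclusive index range used here is an assumption; the index convention of the H1D methods may differ.

```python
edges = [0.0, 1.0, 2.0, 4.0]     # hypothetical variable-size bin edges
heights = [3.0, 5.0, 2.0]        # hypothetical bin heights


def integral(i1, i2, times_width=False):
    """Sum bin heights from bin i1 to bin i2 (inclusive, 0-based).

    When times_width is True, each height is multiplied by its bin width.
    """
    total = 0.0
    for i in range(i1, i2 + 1):
        width = edges[i + 1] - edges[i] if times_width else 1.0
        total += heights[i] * width
    return total


print(integral(0, 2))          # plain sum of heights
print(integral(0, 2, True))    # sum of height times bin width
```

The plain sum gives 10, while the width-weighted sum gives 12 because the last bin is two units wide.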

Histogram operations

Histograms can be added, subtracted, multiplied and divided. Assuming that we have filled two histograms, h1 and h2, all operations can be done using the following generic method:

h1.oper(h2,"NewTitle","operation")

where “operation” is a string which takes the following values: “+” (add), “-” (subtract), “*” (multiply) and “/” (divide). The operations are applied to the histogram h1 using the histogram h2 as an input. One can skip the string with a new title if one has to keep the same title as for the original histogram. In this case, the additive operation will look as h1.oper(h2,“+”).

To create an exact copy of a histogram, use the method copy(). Previously, we have already discussed the scale(d) and shift(d) operations.

A histogram can be smoothed. This topic will be described in Sect. [sec_smoothing], since smoothing and interpolation are widely used data-analysis techniques. Here we just mention that a smoothing of histogram can be done using this method:

h1=h1.operSmooth(b,k)

This is done by averaging over a moving window. The integer parameter “k” defines the window size as “2*k + 1”. If “b=1”, the bins are weighted using a triangular weighting scheme favoring bins near the central bin; use “b=0” for ordinary (unweighted) smoothing. For instance, for “k=2” with triangular weighting, the central bin has weight 1/3, the adjacent bins 2/9, and the next adjacent bins 1/9. For all these operations, the errors are kept the same as for the original (non-smoothed) histogram.
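The moving-window average with triangular weights can be sketched in plain Python. This illustrates the weighting scheme only and is not the operSmooth implementation; in particular, renormalizing the weights near the histogram edges, where the window is truncated, is an assumption of this sketch.

```python
def smooth(heights, k, triangular=False):
    """Average each bin over a window of 2*k+1 bins.

    With triangular=True the weights fall off linearly from k+1 at the
    central bin to 1 at the window border (3,2,1 for k=2, i.e. 1/3, 2/9,
    1/9 after normalization). Windows truncated at the histogram edges
    are renormalized, which is an assumption of this sketch.
    """
    n = len(heights)
    result = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for j in range(i - k, i + k + 1):
            if 0 <= j < n:
                w = float(k + 1 - abs(i - j)) if triangular else 1.0
                num += w * heights[j]
                den += w
        result.append(num / den)
    return result


spike = [0, 0, 9, 0, 0]
print(smooth(spike, 2, True))
```

For the single spike of height 9, the central bin becomes 3 after smoothing, which corresponds to the central weight of 1/3 quoted above.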

One can also create a Gaussian smoothed version of a H1D histogram. Each band of the histogram is smoothed by a discrete convolution with a kernel approximating a Gaussian impulse response with the specified standard deviation.

h2=h1.operSmoothGauss(rms)

where rms is a double value representing a standard deviation of the Gaussian smoothing kernel (must be non-negative).

One useful technique is histogram re-binning, i.e. joining groups of bins together. This approach can be used if the statistics in the bins are low; in this case, it makes sense to make the bins larger in order to reduce the relative statistical uncertainty for entries inside bins (recall that in the case of counting experiments, such uncertainty is [math]\sqrt{N}[/math], where [math]N[/math] is the number of entries). The method which implements this operation is called rebin(group), where group defines how many bins should be merged together. This method returns a new histogram with a smaller number of bins. However, there is one restriction: the method rebin cannot be used for histograms with non-constant bin sizes.
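The idea of merging neighboring bins can be sketched in plain Python. This is not the rebin implementation of H1D: the bin contents are invented, and dropping trailing bins that do not form a complete group is an assumption of this sketch.

```python
import math


def rebin(counts, group):
    """Merge every `group` neighboring bins by summing their contents.

    Trailing bins that do not form a complete group are dropped
    (an assumption of this sketch).
    """
    usable = len(counts) - len(counts) % group
    return [sum(counts[i:i + group]) for i in range(0, usable, group)]


counts = [4, 9, 1, 16, 0, 5]       # hypothetical bin contents
merged = rebin(counts, 2)
print(merged)

# Relative uncertainty sqrt(N)/N of the first bin before and after merging.
before = math.sqrt(counts[0]) / counts[0]
after = math.sqrt(merged[0]) / merged[0]
print(before > after)
```

Merging pairs of bins turns six bins into three, and the relative uncertainty of the first bin drops from 0.5 to about 0.28.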

Accessing low-level Jaida classes

The H1D class is based on the two classes, IAxis and Histogram1D of the Jaida FreeHep library. Assuming h1 represents a H1D object, these two Jaida classes can be obtained as:

a=h1.getAxis() # get IAxis object
h=h1.get()     # get Histogram1D class 

Both objects are rather useful. Although they do not contain graphical attributes, they have many methods for histogram manipulations, which are not present for the higher-level H1D class. The description of these Jaida classes is beyond the scope of this book. Please look at the Java documentation of these classes or use the code assist.

Graphical attributes

Histograms can be shown either by lines (default) or by using symbols. For the default option (lines), one can either fill the histogram area or keep this area empty. The following methods can be useful:

h1.setFill(b) 
h1.setFillColor(c)

For the first method, Jython boolean “b=1” means to fill the histogram, while “b=0” (false) keeps the histogram empty. If the histogram area has to be filled, you may want to select an appropriate color by specifying a Java AWT Color object “c”. How to find an appropriate color has been discussed in Sect. [hplot_func].

Histograms can be shown using symbols as:

h1.setStyle("p") 

The style can be set back to the default style (histogram bars). This can be done by passing the string “h” instead of “p”. One can also use symbols connected by lines; in this case, use the character “l” (draw lines) or the string “lp” (draw lines and symbols).

This tutorial is provided under this license agreement.