Distributions

Most variables in science are, in fact, random variables (or variates) if only because they are subject to measurement errors. This means that individual measurements will occupy random positions within a range, the boundaries of which will not, in general, be precisely defined. Variates may be grouped into two distinct classes, discrete and continuous. An example of a discrete variate is the score obtained on a single roll of a die, while an example of a continuous variate is the ages of men at death.

In order to describe the properties of a given variate we imagine that our samples are taken from a theoretical population of infinite number; so, while our die might come up with a score of five, we take it to be one representative of an infinite number of dice that all share common properties.

The primary way of mathematically describing the properties of this imaginary parent population is the distribution function, which may defined in words by:

The distribution function F(x) of a variate v is the probability of v taking on a value less than or equal to the number x.

In mathematical notation we write this as:

F(x) = Prob {v x}

For the example of the score of a single roll of the die, the distribution function is of the form: The properties of F(x) are that it starts from zero at the left, it never decreases towards the right and ends up at 1.

As an example of a continuous case F(x) is the age at death of UK males: A more familiar way of depicting the distribution is in the form of the density (or frequency) function, which better illustrates how the random numbers are concentrated. It is known as f(x) and is the slope of F(x). There are clear difficulties in representing f(x) for discrete variables without the aid of more advanced mathematical concepts (delta functions) but for the death rates above f(x) is of the form: Properties of the density function are that the total area under the curve is unity and that it tends to zero to the left and right. We can get over the difficulties of representing the density function for discrete variates by using a bar chart format and adopting the convention that the numbers against the vertical axis apply to the area of the bar rather than the height. We can then treat discrete and continuous variable in the same way. Thus the density function for the fair die can be drawn as: In mathematical terms the properties of the functions are:

F(x) is monotonic The ideal function f(x) is estimated by a normalised histogram of observations, where the term normalised means that we divide by the total number in the sample, so that the area is unity. In presenting data it is better not to normalise, in order to preserve the information about the magnitude of the numbers involved.

Here is a histogram (from Sorry, wrong number!) of breast cancer mortalities in 99 different UK hospitals, with a fitted normal curve. It shows  just normal random variation and nothing can be deduced from it, but that did not stop them from trying. Footnote: it seems that does not appear as the less-than-or-equals sign in some operating systems.