What has the weakest link to do with fallacies in medical statistics?

Theory

A chain is as strong as its weakest link – a truism that we learn in childhood, yet one that is ignored in a wide range of human activities. The concept leads to a branch of extreme value statistics that applies to many problems, particularly those involving failure, such as mechanical structures, electrical insulation and human life.

If we make up a chain from a number, n, of independent links, each with probability of failure F1, then the probability of the chain failing (under given conditions of tension and time) is given by:

Fn = 1 - (1 - F1)^n

This is known as the smallest value transformation. It can be derived directly from the binomial distribution (the easy way is to subtract the probability of no failures from 1).
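As a rough check on that derivation, the sketch below computes the chain failure probability both ways: directly from the formula, and by summing the binomial probabilities of one or more failures. The function names are illustrative only, and the links are assumed to fail independently.

```python
from math import comb

def chain_failure_probability(f1, n):
    """Smallest value transformation: chance that at least one of n links fails."""
    return 1.0 - (1.0 - f1) ** n

def chain_failure_via_binomial(f1, n):
    """Same quantity, summed over the binomial probabilities of 1..n failures."""
    return sum(
        comb(n, k) * f1 ** k * (1.0 - f1) ** (n - k)
        for k in range(1, n + 1)
    )

# Assumed illustrative values: three links, each with a 5% chance of failing.
print(chain_failure_probability(0.05, 3))
print(chain_failure_via_binomial(0.05, 3))
```

The two printed values should agree to within floating-point rounding.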

F will be a function of stress and time, but for the present purpose we will regard the stress as constant and treat F as a function of time only, that is, the distribution F(t).

Data dredge

In an epidemiological survey, if the researchers look at one disease and one potential cause, they determine the incidence of the disease and then quote a level of significance either as the probability of the number having occurred by accident or as a confidence interval.

If they are looking at two diseases and count either as an event, they are statistically testing a chain of two links. Whichever crosses the given threshold first determines that the event has occurred. By reference to the general population or to a control population they determine the probability that the rate of occurrence is significantly unlikely, i.e. less than a predetermined threshold. More often than not this threshold is chosen to be the rather unsatisfactory value of 0.05 and we get the iconic P<0.05.

The data dredge fallacy arises from looking at more than one disease but treating each as though it were the subject of an independent survey. Only the ones that cross the significance threshold are counted and the rest are discarded. The researchers claim a particular value of P, but the reality is that a larger value applies, and it is determined by the smallest value transformation. Likewise the confidence interval, which in a sense is the mirror image of P, is affected and is subject to the largest value transformation. P and CI are simply different ways of expressing the (often dubious) claim of a one in twenty chance of being wrong. Using the formula above with F1 = 0.05, we can tabulate what P and CI ought to be for any number of diseases, n, rather than the 0.05 or 95% respectively that are almost invariably quoted.

 n    P      CI%
 1    0.050  95.00
 2    0.098  90.25
 3    0.143  85.74
 4    0.185  81.45
 5    0.226  77.38
 6    0.265  73.51
 7    0.302  69.83
 8    0.337  66.34
 9    0.370  63.02
10    0.401  59.87
11    0.431  56.88
12    0.460  54.04
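The figures follow from the smallest value transformation with a nominal threshold of 0.05 per disease. A minimal sketch for recomputing them is given below; it assumes the n comparisons are independent, and the function name is purely illustrative.

```python
def dredge_table(alpha=0.05, max_n=12):
    """Return (n, true P, true CI%) rows for 1..max_n independent comparisons."""
    rows = []
    for n in range(1, max_n + 1):
        coverage = (1.0 - alpha) ** n  # chance that none of the n crosses by accident
        rows.append((n, 1.0 - coverage, 100.0 * coverage))
    return rows

print(" n    P      CI%")
for n, p, ci in dredge_table():
    print(f"{n:2d}    {p:.3f}  {ci:.2f}")
```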

And so it goes on. In a very large data dredge, such as the Harvard Nurses' Health Study, with hundreds of combinations of disease and potential cause, the chance of getting one accidental correlation is tantamount to a certainty as, indeed, is the probability of dozens; yet each result is trotted out with P<0.05 or a CI of 95% as though each came from an independent trial.
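To get a feel for the scale, the sketch below uses the binomial distribution to estimate the chance of at least k accidental "findings" among n comparisons when nothing real is going on. The figure of 300 comparisons is a hypothetical stand-in, not a number taken from any actual study, and the comparisons are assumed to be independent.

```python
from math import comb

def prob_at_least(k, n, alpha=0.05):
    """P(at least k of n independent null comparisons cross the alpha threshold)."""
    below_k = sum(
        comb(n, j) * alpha ** j * (1.0 - alpha) ** (n - j)
        for j in range(k)
    )
    return 1.0 - below_k

n_tests = 300  # hypothetical number of disease/cause combinations
print(prob_at_least(1, n_tests))   # chance of at least one accidental correlation
print(n_tests * 0.05)              # expected number of accidental correlations
```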

Premature termination

When the Tamoxifen trial was prematurely closed and unblinded by the American contingent, it was treated as a scandal and an outrage by the European participants. Now they are all at it. In drug trials several diseases are monitored and as soon as one crosses the threshold the trial is abandoned. They claim the same old one in twenty chance of accident, but the actual probability depends on the number of diseases being monitored.
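The effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the monitored outcomes are independent, the treatment has no real effect on any of them, each interim look adds one standard normal increment to each outcome's running total, and the trial stops as soon as any outcome's standardised total exceeds 1.96 in either direction (a nominal two-sided 0.05). The function name, the number of looks and the simulation size are all hypothetical choices.

```python
import math
import random

def false_stop_rate(n_outcomes, n_looks, n_trials=2000, z_crit=1.96, seed=0):
    """Fraction of simulated null trials stopped because some outcome crossed z_crit."""
    rng = random.Random(seed)
    stops = 0
    for _ in range(n_trials):
        sums = [0.0] * n_outcomes
        stopped = False
        for look in range(1, n_looks + 1):
            # Each outcome drifts as a random walk with no underlying effect.
            for i in range(n_outcomes):
                sums[i] += rng.gauss(0.0, 1.0)
            # Stop the trial if any standardised total crosses the threshold.
            if any(abs(s) / math.sqrt(look) > z_crit for s in sums):
                stopped = True
                break
        stops += stopped
    return stops / n_trials

print(false_stop_rate(1, 1))    # one outcome, one analysis
print(false_stop_rate(12, 1))   # a dozen outcomes, one analysis
print(false_stop_rate(12, 5))   # a dozen outcomes, five interim looks
```

With a single outcome and a single analysis the rate should sit near the nominal one in twenty; adding outcomes, and then adding interim looks, pushes it well above that.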

Here is a comment from The Epidemiologists:

The policy of cancelling arouses a number of concerns. First, there are security worries. Do we really believe in the efficacy of those “Chinese walls” that are supposed to stop different departments in financial institutions from leaking information to each other? Are there no innuendoes exchanged in the lap-dancing clubs after work? Second, the progress of such a trial in terms of, say, relative risk is a random walk, wandering up and down but, if the trial is long enough, gradually settling down to an equilibrium value. If it is terminated before that equilibrium is reached, can the result be regarded as significant? Would the trend have drifted the other way given time? Third, who prescribes the standards by which the action to terminate will be judged? It is rather disturbing that in at least one case the terminating condition involved a 95% Confidence Interval that embraced the value of relative risk of 1.0, which means there is no effect. Fourth, the very act of termination endows a study with much greater significance than it would otherwise be granted. Following the 2004 announcement of the termination of a Scandinavian HRT and breast cancer trial, the headline was Breast cancer fears force doctors to axe second trial. Yet this trial involved a mere 174 women. Furthermore, it was formed from the combination of two trials, one of which was producing “evidence” that HRT protected against cancer. Fifth, the whole thing involves the extreme value fallacy. If a dozen diseases are being monitored it only needs one of them to cross the arbitrary threshold for the trial to be terminated, yet for one of the others the treatment could have turned out to be wonderfully beneficial or devastatingly malign.

The next HRT trial was abandoned on the grounds of risk of stroke. Breast cancer did not figure. By April 2004, more than half the women on HRT had abandoned it, when yet another study appeared, exonerating it.

Asymptotic distributions

The other question of relevance is the shape of distributions of extreme values. Just as averages tend towards the normal distribution by virtue of the central limit theorem, so extreme value distributions tend towards certain shapes. These may be derived from an idea known as the stability postulate, that the asymptotic distributions are such that they do not change their shape under the smallest (or largest) value transformation. It turns out that there are only six possible types. In medical statistics the two important ones are the exponential distribution (also known as the Poisson traffic law) that governs time to failure or initiation of disease (as in the Vioxx trial) and the Gompertz distribution, which applies to human life duration.

Simple worked example

An electronic component has a constant failure rate of a per hour. What is the failure rate of a system with n such essential components?

The system has failed if one component fails.

The distribution of times to failure for one component is given by F1(t) = 1 - exp(-at), see links.

Plugging this into the smallest value transformation above, we get Fn(t) = 1 - (exp(-at))^n = 1 - exp(-nat). The shape of the distribution is unchanged, illustrating the stability postulate, and the new failure rate is na.

The calculation is the same for n diseases, assuming the statistical properties are similar.
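The worked example can be checked numerically. The sketch below takes the component distribution to be the usual exponential form, 1 - exp(-at) with a minus sign in the exponent, applies the smallest value transformation, and compares the result with an exponential distribution of rate na. The values of a and n are arbitrary illustrative choices, and the components are assumed to fail independently.

```python
import math

def component_cdf(t, a):
    """Probability that one component with constant failure rate a has failed by time t."""
    return 1.0 - math.exp(-a * t)

def system_cdf(t, a, n):
    """Smallest value transformation applied to n independent components."""
    return 1.0 - (1.0 - component_cdf(t, a)) ** n

a, n = 0.001, 20  # hypothetical: 0.001 failures per hour, 20 essential components
for t in (10.0, 100.0, 1000.0):
    # The system behaves like a single component with failure rate n * a.
    print(t, system_cdf(t, a, n), component_cdf(t, n * a))
```

For each time the two printed probabilities should match to within rounding, which is the stability postulate in miniature.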