Presented for your amusement
Mar. 27th, 2009 04:30 pm

This is a pretty hilarious attempt at logic.
“Orac” over at Respectful Insolence has a writing style that’s fairly prone to offend—definitely pugnacious, and very fond of side swipes at those he dislikes (primarily alternative medicine
quacks)—and I don’t blame him for his distaste, which in fact I share, but it does sometimes make his essays a bit harder to slog through. (He also has an inordinate fondness for beginning sentences with Indeed. This is one area where I can tentatively claim superiority: I can also be pugnacious and come off as offensive, but while I am no less prone than Orac to complicated sentence structure, I’ve never been accused of any such repetitive verbal tic.)
However, those foibles aside, he has written some very good stuff (he’s on my list of blogs I read daily for a reason), and this article, summarising and explaining the work of one John Ioannidis, was very interesting indeed. The claim it looks at is a very interesting and puzzling one: Given a set of published clinical studies reporting positive outcomes, all with a confidence level of 95%, we should expect more than 5% to give wrong results; and, furthermore, studies of phenomena with low prior probability are more likely to give false positives than studies where the prior probabilities are high. He has often cited this result as a reason why we should be even more skeptical of trials of quackery like homeopathy than the confidence levels and study powers suggest, but I have to confess I never quite understood it.
I would suggest that you go read the article (or this take, referenced therein), but at the risk of being silly in summarising what is essentially a summary to begin with…here’s the issue, along with some prefatory matter for the non-statisticians:
A Type I error is a false positive: We seem to see an effect where there is no effect, simply due to random chance. This sort of thing does happen. Knowing how dice work, I may hypothesise that if you throw a pair of dice, you are not likely to throw two sixes, but one time out of every 36 (¹/₆×¹/₆), you will. I can confidently predict that you won’t roll double sixes twice in a row, but about one time in 1,296, you will. Any time we perform any experiment, we may get this sort of effect, so a statistical test, such as a medical trial, has a confidence level, where a confidence level of 95% means there’s a 5% chance of a Type I error when there is in fact no effect to find.
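If you’d rather see those odds checked by brute force than trust my arithmetic, here’s a quick Python sketch (a toy simulation of my own, not anything from Orac or Ioannidis):

```python
# Toy Monte Carlo check of the dice figures above:
# P(double sixes) = 1/36, P(double sixes twice in a row) = 1/1296.
import random

TRIALS = 1_000_000

def double_six():
    """Throw a pair of dice once; True if both come up six."""
    return random.randint(1, 6) == 6 and random.randint(1, 6) == 6

# Estimated frequency of a single double six...
once = sum(double_six() for _ in range(TRIALS)) / TRIALS
# ...and of two consecutive throws both coming up double sixes.
twice = sum(double_six() and double_six() for _ in range(TRIALS)) / TRIALS

print(f"double sixes once:   {once:.4f}   (expected {1/36:.4f})")
print(f"double sixes twice:  {twice:.5f}  (expected {1/1296:.5f})")
```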
There’s also a Type II error, or false negative, where the hypothesis is true but the results just aren’t borne out on this occasion. The closest equivalent of the confidence level here is the study’s statistical power (the chance of detecting an effect that really is there), but unlike the 5% convention for Type I errors there’s no single standard value for it; it depends on things like sample size and effect size.
This latter observation is a bit problematic, and leads into what Ioannidis observed:
Suppose there are 1000 possible hypotheses to be tested. There are an infinite number of false hypotheses about the world and only a finite number of true hypotheses so we should expect that most hypotheses are false. Let us assume that of every 1000 hypotheses 200 are true and 800 false.
It is inevitable in a statistical study that some false hypotheses are accepted as true. In fact, standard statistical practice [i.e. using a confidence level of 95%] guarantees that at least 5% of false hypotheses are accepted as true. Thus, out of the 800 false hypotheses 40 will be accepted as "true," i.e. statistically significant.
It is also inevitable in a statistical study that we will fail to accept some true hypotheses (Yes, I do know that a proper statistician would say "fail to reject the null when the null is in fact false," but that is ugly). It's hard to say what the probability is of not finding evidence for a true hypothesis because it depends on a variety of factors such as the sample size but let's say that of every 200 true hypotheses we will correctly identify 120 or 60%. Putting this together we find that of every 160 (120+40) hypotheses for which there is statistically significant evidence only 120 will in fact be true or a rate of 75% true.
Did you see that magic? Our confidence level was 95%, no statistics were abused, no mistakes were made (beyond the ones falling into that 5% gap, which we accounted for), and yet we were only 75% correct.
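For anyone who’d rather see the bookkeeping spelled out, here is the same calculation as a few lines of Python (my own restatement of the numbers quoted above, nothing more):

```python
# Reproduce the quoted example: 1000 hypotheses, 20% of them true,
# tested at a 95% confidence level (5% false-positive rate) with 60% power.
total      = 1000
true_hyps  = 200
false_hyps = total - true_hyps          # 800
alpha      = 0.05                       # Type I error rate
power      = 0.60                       # chance of detecting a true effect

false_positives = alpha * false_hyps    # 40 false hypotheses accepted as "true"
true_positives  = power * true_hyps     # 120 true hypotheses correctly identified

significant = true_positives + false_positives   # 160 "positive" studies in total
print(f"true among significant results: {true_positives / significant:.0%}")  # 75%
```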
The root of the problem is, of course, ubiquitous publication bias: Researchers like to publish positive-outcome studies rather than negative ones, people like to read them, and journals like to print them, because a journal detailing a long list of ideas that turned out to be wrong isn’t very exciting. The problem, obviously, is that published studies are therefore biased in favour of positive outcomes. (If negative results were published just as readily, the 760 correctly rejected false hypotheses would sit in print alongside the 40 false positives, and the distortion would disappear.)
Definition time again: A prior probability is essentially a plausibility measure before we run an experiment. Plausibility sounds very vague and subjective, but can be pretty concrete. If I know that it rains on (say) 50% of all winter days in Vancouver, I can get up in the morning and assign a prior probability of 50% to the hypothesis that it’s raining. (I can then run experiments, e.g. by looking out a window, and modify my assessment based on new evidence to come up with a posterior probability.)
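Here is a concrete version of that window experiment as a short Python sketch using Bayes’ rule. Beyond the 50% prior, the numbers are entirely made up for illustration (how often a window looks wet when it is or isn’t raining):

```python
# Bayes' rule on the rain example: start from the 50% prior and update on
# one piece of evidence. The two likelihoods below are invented for illustration.
prior_rain = 0.50            # it rains on ~50% of winter days (the prior)

p_wet_given_rain    = 0.90   # assumed: window looks wet when it is raining
p_wet_given_no_rain = 0.20   # assumed: window looks wet when it is not raining

# Total probability of seeing a wet-looking window at all.
p_wet = (p_wet_given_rain * prior_rain
         + p_wet_given_no_rain * (1 - prior_rain))

# Posterior probability of rain, given that the window looks wet.
posterior_rain = p_wet_given_rain * prior_rain / p_wet
print(f"P(rain | window looks wet) = {posterior_rain:.0%}")   # ~82%
```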
Now we can go on to look at why Orac is so fond of holding hypotheses with low prior probabilities to higher standards. It’s pretty simple, really: Recall that the reason we ended up with so many false positives above—the reason false positives were such a large proportion of the published results—is that there were more false hypotheses than true hypotheses. The more conservative we are in generating hypotheses, the less outrageous we make them, the more likely we are to be correct, and the fewer false hypotheses we will have (in relation to true hypotheses). Put slightly differently, we’re more likely to be right in medical diagnoses if we go by current evidence and practice than if we make wild guesses.
Now we see that modalities with very low prior probability, such as ones with no plausible mechanism, should be regarded as more suspect. Recall that above, we started out with 800 false hypotheses (out of 1000 total hypotheses), ended up accepting 5% = 40 of them, and that
It's hard to say what the probability is of not finding evidence for a true hypothesis because it depends on a variety of factors such as the sample size but let's say that of every 200 true hypotheses we will correctly identify 120 or 60%. Putting this together we find that of every 160 (120+40) hypotheses for which there is statistically significant evidence only 120 will in fact be true or a rate of 75% true.
That is, the proportion of true hypotheses to false hypotheses affects the accuracy of our answer. This is very easy to see—let’s suppose that only half of the hypotheses were false; now we accept 5% of 500, that is 25 false studies, and keeping the same proportions,
…Let's say that of every 500 true hypotheses we will correctly identify 300 or 60%. Putting this together we find that of every 325 (300+25) hypotheses for which there is statistically significant evidence only 300 will in fact be true or a rate of 92% true.
We’re still short of that 95% measure, but we’re way better than the original 75%, simply by making more plausible guesses (within each individual study, the chances of Type I and Type II errors were exactly the same as before). The less plausible an idea is, the higher the proportion of false hypotheses among all the hypotheses the idea generates: in other words, the lower its ratio of true to false hypotheses. Wild or vague ideas (homeopathy, reiki, …) are very likely to generate false hypotheses along with any true ones they might conceivably generate. More conventional ideas will tend to generate a higher proportion of true hypotheses—if we know from long experience that Aspirin relieves pain, it’s very likely that a similar drug does likewise.
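To make the dependence on plausibility explicit, here is a small sketch of my own that repeats the same bookkeeping for several different proportions of true hypotheses (keeping the 5% false-positive rate and 60% power from above):

```python
# Fraction of statistically significant results that are actually true,
# as a function of how many of the tested hypotheses were true to begin with.
def true_positive_share(frac_true, alpha=0.05, power=0.60):
    """Share of 'significant' findings that reflect real effects."""
    true_hits  = power * frac_true          # true hypotheses we detect
    false_hits = alpha * (1 - frac_true)    # false hypotheses we accept anyway
    return true_hits / (true_hits + false_hits)

for frac_true in (0.05, 0.20, 0.50, 0.80):
    print(f"{frac_true:.0%} of hypotheses true -> "
          f"{true_positive_share(frac_true):.0%} of positive results true")
# Prints roughly: 5% -> 39%, 20% -> 75%, 50% -> 92%, 80% -> 98%
```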
This is not to say that no wild ideas are ever right. Of course they sometimes are (though of course they usually aren’t). What it does mean is that not only should we be skeptical and demand evidence for them, there are sound statistical reasons to set the bar of evidence even higher for implausible than for plausible modalities.
It is also a good argument for the move away from strict EBM (evidence-based medicine) to SBM (science-based medicine), where things like prior probability are taken into account. Accepting double-blind trials at the 95% confidence level at face value isn’t good enough.
I just finished reading The Elegant Universe: Superstrings, Hidden Dimensions, and the Quest for the Ultimate Theory by Brian Greene. It was a good book, it was well-written, and it made superstring theory about as comprehensible as I imagine it can be made to someone with my very limited knowledge of mathematics (a math minor; some linear algebra and basic multivariate calculus years ago). If you’re curious about superstring physics but don’t have the maths background to read a very technical account, I’d recommend it.
That said, the book leaves me with two reflections on why the theory so fails to capture my interest and conviction—the theory, I stress, not the book. Bear in mind my limitations: Someone who knows more physics than I do might view things very differently.
The first reason is simply that the theory is extremely mathematical. I can qualitatively explain the reality of time dilation with nothing more complex than a stick to draw some lines in sand, and someone with vastly less mathematics than myself has no trouble grasping it. String theory isn’t like that; it can’t be discussed without going into higher-dimensional geometry and making reference to very abstruse realisations (I don’t know how many times that book used the term Calabi-Yau manifold—if you were curious, Wikipedia informs me that they are sometimes defined as compact Kähler manifolds whose canonical bundle is trivial, though many other similar but inequivalent definitions are sometimes used). Even when the discussion is clear, it’s littered with footnotes to help mathematically inclined readers actually get it. Since I’m not that mathematically inclined, I have a sour taste of ex cathedra in my mouth: I understand much more of what the theory claims than I did before I read the book, but I don’t understand nearly as much of the wherefores as I should like, and memorising facts is not what learning science is about.
Of course, that’s a consequence of my own limitations as much as it is of the theory. There’s no reason why the laws of the universe should be constrained to my comprehension, even if I do like the idea, variously attributed to Feynman and Einstein, that if you can’t explain it to a six year old, you don’t really understand it.
The second objection I have is that I feel, as I have long felt, that string theory is oversold. Oh, it may very well be the theory with the greatest potential to explain reality we have ever known—but we don’t know whether it is. True, it can be made to generate predictions that match what we know from traditional point-particle quantum mechanics, but that’s post hoc and therefore vastly less impressive. Of all the horribly abstruse mathematical theories of physics anyone could possibly think up, it’s obvious that only ones that agree with known facts will be kept around; but that doesn’t tell us whether they are correct in areas where they don’t just tell us what we already know.
String theory is the theology of physics, in a somewhat narrow sense: Like theology, it’s a lofty framework with many grand implications; like theology, we just don’t have any evidence that it’s true. Of course I do not think that they are equally credible. Since most of the world’s most brilliant theoretical physicists seem to feel fairly confident about it, and since they obviously know more about it than I do, on top of being much smarter than I am, and being in a profession where checking your results against nature is the highest goal, it’s probably true. But I am not prepared to go out and say that it is true until it’s generated some honest-to-god falsifiable predictions (pun intended). And from all that I have heard, and all that the book has taught me, I’m still not excited: There are some fairly out-there things that string theory predicts might happen (and if we see them, it’s almost certainly correct), but then again they might not (so if we don’t see them, we still can’t discard it). This, again, brings theology to mind.
If any string physicist wanted to impress me, he should come up with a falsifiable prediction—“If string theory is correct, we should see this; if we don’t find that result, then string theory is wrong”—and of course it would have to be a result we can check by experiment, not just yet another agreement with existing theory. Ideally, we should then perform the experiment, which is why sentences starting with “If we could build a particle accelerator the size of our solar system…” don’t impress me, either.
I understand why it is referred to as a ‘theory’—it’s too complex and comprehensive a framework to be accurately summarised as a single hypothesis. But I also find it problematic, or at least annoying, in that we usually reserve the term ‘scientific theory’ for frameworks that are supported by falsifiable evidence. Thus far, string theory is not. It’s an extremely impressive edifice, and it may well be a tower that takes us to the stars, but it might yet turn out to be built on sand.