
During my coffee break, I read an article in Scientific American Mind called Knowing Your Chances (available online). I think it is an outstanding article, and you should read it. The most evocative part may have been a simple example:

Consider a woman who has just received a positive result from a mammogram and asks her doctor: Do I have breast cancer for sure, or what are the chances that I have the disease? In a 2007 continuing education course for gynecologists, Gigerenzer asked 160 of these practitioners to answer that question given the following information about women in the region:

  • The probability that a woman has breast cancer (prevalence) is 1 percent.
  • If a woman has breast cancer, the probability that she tests positive (sensitivity) is 90 percent.
  • If a woman does not have breast cancer, the probability that she nonetheless tests positive (false-positive rate) is 9 percent.

What is the best answer to the patient’s query?

  A. The probability that she has breast cancer is about 81 percent.
  B. Out of 10 women with a positive mammogram, about nine have breast cancer.
  C. Out of 10 women with a positive mammogram, about one has breast cancer.
  D. The probability that she has breast cancer is about 1 percent.

Before you read on, take a brief moment to think about it, but also note your gut feeling. Done? Let’s continue:

Gynecologists could derive the answer from the statistics above, or they could simply recall what they should have known anyhow. In either case, the best answer is C; only about one out of every 10 women who test positive in screening actually has breast cancer. The other nine are falsely alarmed. Prior to training, most (60 percent) of the gynecologists answered 90 percent or 81 percent, thus grossly overestimating the probability of cancer. Only 21 percent of physicians picked the best answer—one out of 10.

Doctors would more easily be able to derive the correct probabilities if the statistics surrounding the test were presented as natural frequencies. For example:

  • Ten out of every 1,000 women have breast cancer.
  • Of these 10 women with breast cancer, nine test positive.
  • Of the 990 women without cancer, about 89 nonetheless test positive.

Thus, 98 women test positive, but only nine of those actually have the disease. After learning to translate conditional probabilities into natural frequencies, 87 percent of the gynecologists understood that one in 10 is the best answer.
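
For readers who like to see the arithmetic spelled out, here is a minimal sketch in Python of the same calculation, using the three figures quoted above (the function name is my own, chosen for clarity rather than taken from the article):

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability of having the disease given a positive test (Bayes' theorem)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# The article's numbers: 1% prevalence, 90% sensitivity, 9% false-positive rate.
ppv = positive_predictive_value(prevalence=0.01,
                                sensitivity=0.90,
                                false_positive_rate=0.09)
print(f"P(cancer | positive mammogram) = {ppv:.1%}")  # about 9%, i.e. roughly 1 in 10
```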

I’m happy to say that I did get it right on the first try, but I strongly agree with the authors’ opinion that it is not intuitive when the statistics are cited as probabilities rather than natural frequencies. The reason I got it right is that I’ve done a bit of math and a wee bit of stats, I enjoy reading some blogs that talk about medical statistics, I know some of the not-quite-obvious ground rules of probabilities, and I know what Type I and Type II errors are (even if I occasionally mix them up)…

…And, perhaps crucially, I’ve spent time thinking about false positives in medical testing before. When I get my periodic routine screenings for STIs (I’ve never had symptoms or tested positive for any, I’m glad to say, but I feel a responsible person should get tested anyway!), I’ve asked myself the hypothetical question What if it did show positive for, say, HIV? What are the odds that I would actually have it? (It turns out that if you’re a heterosexual male, and if you test positive for HIV, there’s about a 50% chance that you don’t have it! You should play it safe, but get re-tested and don’t panic. Some people commit suicide when they get positive test results, even though they’re as likely as not to be healthy.)
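
To see how a roughly 50-50 result like that can come about, here is a purely illustrative sketch. The numbers below are made up for the sake of the example and are not real HIV test statistics; the point is only that when the prevalence in your risk group is about as small as the test’s false-positive rate, about half of all positive results are false alarms.

```python
# Illustrative numbers only: NOT real HIV test statistics.
prevalence = 0.0001           # assume 1 in 10,000 infected in a low-risk group
sensitivity = 0.999           # assume the test almost never misses a true case
false_positive_rate = 0.0001  # assume 1 in 10,000 healthy people test positive

true_pos = prevalence * sensitivity
false_pos = (1 - prevalence) * false_positive_rate
print(f"P(infected | positive test) = {true_pos / (true_pos + false_pos):.0%}")
# -> about 50%: when prevalence is about equal to the false-positive rate,
#    roughly half of the positives are false alarms
```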

Still, while my gut told me the answer was not A (wherein I did better than most of the gynecologists), I had to think about it for a minute to figure out which was the proper answer. People need to be educated on this stuff. Meanwhile, if you haven’t had the benefit of statistical education, keep this one thing in mind: The obvious answer is not always correct, so if you’re unsure, ask someone who can do the maths. And, sadly, even your doctor may not know. I actually find it rather sad that even after learning to translate conditional probabilities into natural frequencies, only 87 percent of the gynecologists understood that one in 10 is the best answer: even with the simplification, more than one in ten gynecologists still didn’t get it. Your doctor can spot the symptoms and order the right tests, but you may need a mathematically inclined friend to actually calculate the risks.


“Orac” over at Respectful Insolence has a writing style that’s fairly prone to offend—definitely pugnacious, and very fond of side swipes at those he dislikes (primarily alternative medicine quacks)—and I don’t blame him for his distaste, which in fact I share, but it does sometimes make his essays a bit harder to slog through. (He also has an inordinate fondness for beginning sentences with Indeed. This is one area where I can tentatively claim superiority: I can also be pugnacious and come off as offensive, but while I am no less prone than Orac to complicated sentence structure, I’ve never been accused of any such repetitive verbal tic.)

However, those foibles aside, he has written some very good stuff (he’s on my list of blogs I read daily for a reason), and this article, summarising and explaining the work of one John Ioannidis, was very interesting indeed. The claim it looks at is a very interesting and puzzling one: Given a set of published clinical studies reporting positive outcomes, all with a confidence level of 95%, we should expect more than 5% to give wrong results; and, furthermore, studies of phenomena with low prior probability are more likely to give false positives than studies where the prior probabilities are high. He has often cited this result as a reason why we should be even more skeptical of trials of quackery like homeopathy than the confidence levels and study powers suggest, but I have to confess I never quite understood it.

I would suggest that you go read the article (or this take, referenced therein), but at the risk of being silly in summarising what is essentially a summary to begin with…here’s the issue, along with some prefatory matter for the non-statisticians:

A Type I error is a false positive: We seem to see an effect where there is no effect, simply due to random chance. This sort of thing does happen. Knowing how dice work, I may hypothesise that if you throw a pair of dice, you are not likely to throw two sixes, but one time out of every 36 (¹/₆×¹/₆), you will. I can confidently predict that you won’t roll double sixes twice in a row, but about one time in 1,296, you will. Any time we perform any experiment, we may get this sort of effect, so a statistical test, such as a medical trial, has a confidence level, where a confidence level of 95% means there’s a 5% chance of a Type I error when there is in fact no real effect.
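
If you would rather convince yourself of those dice figures empirically than trust the arithmetic, a quick simulation sketch will do (the 1-in-36 and 1-in-1,296 figures are exact; the simulation merely approximates them):

```python
import random

def double_sixes():
    """Roll a pair of dice; True if both come up six (probability 1/36)."""
    return random.randint(1, 6) == 6 and random.randint(1, 6) == 6

trials = 1_000_000
once = sum(double_sixes() for _ in range(trials))
twice = sum(double_sixes() and double_sixes() for _ in range(trials))

print(f"double sixes:   {once / trials:.4f}  (exact: {1 / 36:.4f})")
print(f"twice in a row: {twice / trials:.5f} (exact: {1 / 1296:.5f})")
```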

There’s also a Type II error, or false negative, where the hypothesis is true but the results just aren’t borne out on this occasion. There is no fixed conventional equivalent of the confidence level for Type II errors; the closest analogue is a study’s statistical power (its probability of detecting a real effect), which depends on things like sample size.

This latter observation is a bit problematic, and leads into what Ioannidis observed:

Suppose there are 1000 possible hypotheses to be tested. There are an infinite number of false hypotheses about the world and only a finite number of true hypotheses so we should expect that most hypotheses are false. Let us assume that of every 1000 hypotheses 200 are true and 800 false.

It is inevitable in a statistical study that some false hypotheses are accepted as true. In fact, standard statistical practice [i.e. using a confidence level of 95%] guarantees that at least 5% of false hypotheses are accepted as true. Thus, out of the 800 false hypotheses 40 will be accepted as "true," i.e. statistically significant.

It is also inevitable in a statistical study that we will fail to accept some true hypotheses (Yes, I do know that a proper statistician would say "fail to reject the null when the null is in fact false," but that is ugly). It's hard to say what the probability is of not finding evidence for a true hypothesis because it depends on a variety of factors such as the sample size but let's say that of every 200 true hypotheses we will correctly identify 120 or 60%. Putting this together we find that of every 160 (120+40) hypotheses for which there is statistically significant evidence only 120 will in fact be true or a rate of 75% true.

Did you see that magic? Our confidence level was 95%, no statistics were abused, no mistakes were made (beyond the ones falling into that 5% gap, which we accounted for), and yet only 75% of our statistically significant results were correct.
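
Here is the same calculation as a small parameterised sketch, so you can plug in your own guesses; the 5% Type I rate and the 60% detection rate for true hypotheses are the figures assumed in the quote above:

```python
def share_of_significant_results_that_are_true(n_true, n_false,
                                                alpha=0.05,   # Type I error rate (95% confidence level)
                                                power=0.60):  # assumed chance of detecting a true hypothesis
    """Of the hypotheses that come out 'statistically significant', what fraction are true?"""
    false_positives = alpha * n_false  # false hypotheses accepted as "true"
    true_positives = power * n_true    # true hypotheses correctly identified
    return true_positives / (true_positives + false_positives)

# The scenario in the quote: 1,000 hypotheses, 200 true and 800 false.
print(share_of_significant_results_that_are_true(n_true=200, n_false=800))  # -> 0.75
```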

The root of the problem is, of course, the ubiquitous problem of publication bias: Researchers like to publish positive-outcome studies rather than negative ones, and because people like to read them, journals like to print them; a journal detailing a long list of ideas that turned out to be wrong isn’t very exciting. The problem is, obviously, that published studies are therefore biased in favour of positive outcomes. (If there were no such bias, all 800 studies of false hypotheses would have been published, negative results and all, and the problem would disappear.)

Definition time again: A prior probability is essentially a plausibility measure before we run an experiment. Plausibility sounds very vague and subjective, but can be pretty concrete. If I know that it rains on (say) 50% of all winter days in Vancouver, I can get up in the morning and assign a prior probability of 50% to the hypothesis that it’s raining. (I can then run experiments, e.g. by looking out a window, and modify my assessment based on new evidence to come up with a posterior probability.)
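
As a toy sketch of that prior-to-posterior step: the 50% prior is the one from the example, but the two likelihoods below (how reliable a quick glance out a rain-streaked window is) are made-up numbers, purely for illustration.

```python
# Toy Bayes update for the rain example. Only the 50% prior comes from the text;
# the two likelihoods are assumed values for how reliable a quick glance is.
prior_rain = 0.5           # P(rain) on a winter day in Vancouver
p_looks_wet_if_rain = 0.8  # assumed: the glance suggests rain when it is raining
p_looks_wet_if_dry = 0.1   # assumed: the glance suggests rain when it is not

# Bayes' theorem: P(rain | the street looks wet)
posterior = (prior_rain * p_looks_wet_if_rain) / (
    prior_rain * p_looks_wet_if_rain + (1 - prior_rain) * p_looks_wet_if_dry)
print(f"posterior P(rain | it looks wet out there) = {posterior:.0%}")  # ~89%
```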

Now we can go on to look at why Orac is so fond of holding hypotheses with low prior probabilities to higher standards. It’s pretty simple, really: Recall that the reason why we ended up with so many false positives above—the reason why false positives were such a large proportion of the published results—is that there were more false hypotheses than true hypotheses. The more conservative we are in generating hypotheses, the less outrageous we make them, the more likely we are to be correct, and the fewer false hypotheses we will have in relation to true ones. Put slightly differently, we’re more likely to be right in medical diagnoses if we go by current evidence and practice than if we make wild guesses.

Now we see that modalities with very low prior probability, such as ones with no plausible mechanism, should be regarded as more suspect. Recall that above, we started out with 800 false hypotheses (out of 1000 total hypotheses), ended up accepting 5% = 40 of them, and that

It's hard to say what the probability is of not finding evidence for a true hypothesis because it depends on a variety of factors such as the sample size but let's say that of every 200 true hypotheses we will correctly identify 120 or 60%. Putting this together we find that of every 160 (120+40) hypotheses for which there is statistically significant evidence only 120 will in fact be true or a rate of 75% true.

That is, the proportion of true hypotheses to false hypotheses affects the accuracy of our answer. This is very easy to see—let’s suppose that only half of the hypotheses were false; now we accept 5% of 500, that is 25 false studies, and keeping the same proportions,

…Let's say that of every 500 true hypotheses we will correctly identify 300 or 60%. Putting this together we find that of every 325 (300+25) hypotheses for which there is statistically significant evidence only 300 will in fact be true or a rate of 92% true.

We’re still short of that 95% measure, but we’re way better than the original 75%, simply by making more plausible guesses (within each individual study, the chances of Type I and Type II errors were unchanged). The less plausible an idea is, the higher the proportion of false hypotheses among all the hypotheses it generates; in other words, the lower its ratio of true to false hypotheses. Wild or vague ideas (homeopathy, reiki, …) are very likely to generate false hypotheses along with any true ones they might conceivably generate. More conventional ideas will tend to generate a higher proportion of true hypotheses—if we know from long experience that Aspirin relieves pain, it’s very likely that a similar drug does likewise.
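
The same little calculation as before makes the comparison explicit (again assuming a 5% Type I error rate and a 60% chance of detecting each true hypothesis):

```python
# Compare the two mixes of true and false hypotheses used above.
alpha, power = 0.05, 0.60

for n_true, n_false in [(200, 800), (500, 500)]:
    true_pos = power * n_true    # true hypotheses correctly identified
    false_pos = alpha * n_false  # false hypotheses accepted as "true"
    share_true = true_pos / (true_pos + false_pos)
    print(f"{n_true} true / {n_false} false -> "
          f"{share_true:.0%} of significant results are true")
# -> 75% and 92%, matching the figures above
```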

This is not to say that no wild ideas are ever right. Of course they sometimes are (though of course they usually aren’t). What it does mean is that not only should we be skeptical and demand evidence for them, but there are also sound statistical reasons to set the bar of evidence higher for implausible modalities than for plausible ones.

It is also a good argument for the move away from strict EBM (evidence-based medicine) toward SBM (science-based medicine), where things like prior probability are taken into account. Taking double-blind trials with a 95% confidence level at face value isn’t good enough.
