haggholm: (someone is wrong on the internet)

I came across an article the other day (originally found via Pharyngula), Where is Everyone?, which aims to track the trend in how information is found and accessed over the past 200 years. This is a very ambitious and very interesting project. The article starts with a colourful graph, and proceeds to analyse the implications of the graph:

[Graph: Information flow]

Of course, somebody asked the author (a Thomas Baekdal) what data the chart was based on, and he freely answered:

The graph was based on combination of a lot of things, a number of interviews, general study, general trend movements, my experience etc. I cannot give you a specific source though, because I used none specifically.

The graphs before 1990 are all based on interviews, and a large number of Google searches to learn about the history of Newspaper, TV and Radio - and more specifically, what people uses in the past. The graphs from 1998 and up to today, is based on all the things that have happened in the past 11 years, of which I have probably seen 1000 surveys (it is what I do for a living). And the graph from 2009 and forward is based on what I, and many other people predict will happen in the years to come.

One very important thing though, this is not a reflection of my opinion. This is the result a careful analysis. There are always variations, and different types of people. But I believe that this graph accurately reflects consumer focus.

…Have you ever seen such tripe in your life? Merely using different sources and methods for different time periods would introduce uncertainty into the results, but at least that much would be unavoidable. Given this guy’s process, though, it doesn’t even register as a problem. He cannot give you a specific source though, because [he] used none specifically. He has probably seen 1000 surveys of the trends of the past 11 years, but can’t be bothered to cite even one. The x-axis has an arbitrarily compressed scale, skewing the apparent shape and speed of the trends. And, most damning of all, he gives no indication of how he measures the ‘information’ metric! (The only truly objective measure of information that I know of is Shannon’s ‘bits of entropy’, which is certainly very concrete—but there’s no indication that this is what’s meant, nor can I think of any way in which a person’s total information input can be objectively measured in these units.) How on Earth can anyone analyse the graph critically without knowing what the numbers measure?

The answer to the last question, of course, is that it’s impossible, except in the sense that I am analysing it critically here: Calling it bunk. By presenting a graph, he gives himself a veneer of scientific responsibility (Look, I have data!), but since the graph doesn’t actually objectively represent anything (so far as the reader can tell), it’s really just a distraction, an attempt to gain enough credibility in the reader’s mind that the purported analysis that follows is swallowed whole.

And he has the gall to claim that

this is not a reflection of my opinion. This is the result a careful analysis.

If he hadn’t pretended this (id est, if he had said up front that this is a mashup of various analyses of a practically unquantifiable commodity, but that he hopes that his argument, once followed through, will persuasively show a genuine trend), I might have given him some respect, but given what he actually did, he is either a fool or a liar. Either option should persuade you not to take him seriously.


To address the article as though it weren’t total bunk, his extrapolation into the future is on shaky ground for reasons that should be painfully obvious even to someone who does buy into the graph: By extrapolating current trends into the future, he seems to be ignoring the fact that the big new things of recent years—social networks, social news, etc.—came out of nowhere and took internet culture by storm. What the internet does—the most important thing it does—is enable distribution of information to vast numbers of people with virtually no marginal cost. Logistically, I can reach a thousand people as easily as one; a million almost as easily as a thousand; a billion with only a little more difficulty than a million. When someone does come up with the next killer idea—the next Facebook or Twitter or Google, or whatever it may be—it can explode at an incredible rate. On the internet, where no one is limited by broadcast range, print batch size, or radio band constraints, the primary limiting factor is user interest. The Next Great Thing may grow slowly and incrementally, or it may explode geometrically, as fast as server capability can handle (and how fast that is depends on what the Next Great Thing is, which of course we don’t know).

In other words, even if I try to buy into the general idea, I think that his predictions are about as reliable as any ever are in futurology, and if I view the whole thing critically, it’s bunk. Either way, I can’t say I am impressed.

haggholm: (Default)

During my coffee break, I read an article in Scientific American Mind called Knowing Your Chances (available online). I think it is an outstanding article, and you should read it. The most evocative part may have been a simple example:

Consider a woman who has just received a positive result from a mammogram and asks her doctor: Do I have breast cancer for sure, or what are the chances that I have the disease? In a 2007 continuing education course for gynecologists, Gigerenzer asked 160 of these practitioners to answer that question given the following information about women in the region:

  • The probability that a woman has breast cancer (prevalence) is 1 percent.
  • If a woman has breast cancer, the probability that she tests positive (sensitivity) is 90 percent.
  • If a woman does not have breast cancer, the probability that she nonetheless tests positive (false-positive rate) is 9 percent.

What is the best answer to the patient’s query?

  A. The probability that she has breast cancer is about 81 percent.
  B. Out of 10 women with a positive mammogram, about nine have breast cancer.
  C. Out of 10 women with a positive mammogram, about one has breast cancer.
  D. The probability that she has breast cancer is about 1 percent.

Before you read on, take a brief moment to think about it, but also note your gut feeling. Done? Let’s continue:

Gynecologists could derive the answer from the statistics above, or they could simply recall what they should have known anyhow. In either case, the best answer is C; only about one out of every 10 women who test positive in screening actually has breast cancer. The other nine are falsely alarmed. Prior to training, most (60 percent) of the gynecologists answered 90 percent or 81 percent, thus grossly overestimating the probability of cancer. Only 21 percent of physicians picked the best answer—one out of 10.

Doctors would more easily be able to derive the correct probabilities if the statistics surrounding the test were presented as natural frequencies. For example:

  • Ten out of every 1,000 women have breast ­cancer.
  • Of these 10 women with breast cancer, nine test positive.
  • Of the 990 women without cancer, about 89 nonetheless test positive.

Thus, 98 women test positive, but only nine of those actually have the disease. After learning to translate conditional probabilities into natural frequencies, 87 percent of the gynecologists understood that one in 10 is the best answer.
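If you want to check the arithmetic yourself, here is a little Python sketch (mine, not the article’s) that computes the answer both ways, directly with Bayes’ theorem and by counting natural frequencies, using the three numbers quoted above:

    # Numbers quoted in the article: prevalence, sensitivity, false-positive rate.
    prevalence = 0.01
    sensitivity = 0.90
    false_positive_rate = 0.09

    # Conditional-probability route: Bayes' theorem for P(cancer | positive test).
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    ppv = sensitivity * prevalence / p_positive
    print(f"P(cancer | positive) = {ppv:.1%}")  # about 9%

    # Natural-frequency route: the same calculation per 1,000 women.
    women = 1000
    with_cancer = women * prevalence                                # 10
    true_positives = with_cancer * sensitivity                      # 9
    false_positives = (women - with_cancer) * false_positive_rate   # about 89
    print(f"{true_positives:.0f} of {true_positives + false_positives:.0f} "
          f"positive tests are real")                               # 9 of about 98

Either way, the answer comes out to roughly one woman in ten.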

I’m happy to say that I did get it right on the first try, but I strongly agree with the authors’ opinion that it is not intuitive when the statistics are cited as probabilities rather than natural frequencies. The reason I got it right is that I’ve done a bit of math and a wee bit of stats; I enjoy reading blogs that talk about medical statistics; I know some of the not-quite-obvious ground rules of probability; and I know what Type I and Type II errors are (even if I occasionally mix them up)…

…And, perhaps crucially, I’ve spent time thinking about false positives in medical testing before. When I get my periodic routine screenings for STIs (I’ve never had symptoms or tested positive for any, I’m glad to say, but I feel a responsible person should get tested anyway!), I’ve asked myself the hypothetical question What if it did show positive for, say, HIV? What are the odds that I would actually have it? (It turns out that if you’re a heterosexual male, and if you test positive for HIV, there’s about a 50% chance that you don’t have it! You should play it safe, but get re-tested and don’t panic. Some people commit suicide when they get positive test results, even though they’re as likely as not to be healthy.)
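The exact figure depends on prevalence and test characteristics that I won’t vouch for here, but the shape of the calculation is the same as in the mammogram example. As a purely illustrative sketch with made-up (though not unreasonable-looking) numbers:

    # Illustrative numbers only -- assumed for the sake of the example, not actual
    # figures for any particular HIV test or population.
    prevalence = 1 / 10_000           # assumed rate among low-risk heterosexual men
    sensitivity = 0.997               # assumed: the test catches nearly every real case
    false_positive_rate = 1 / 10_000  # assumed: 1 uninfected person in 10,000 tests positive

    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + false_positive_rate * (1 - prevalence))
    print(f"P(infected | positive) = {ppv:.0%}")  # about 50%

The point is structural: when a condition is about as rare as the test’s false-positive rate, roughly half of all positive results are false alarms.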

Still, while my gut told me the answer was not A (wherein I did better than most of the gynecologists), I had to think about it for a minute to figure out which was the proper answer. People need to be educated on this stuff. Meanwhile, if you haven’t had the benefit of statistical education, keep this one thing in mind: The obvious answer is not always correct, so if you’re unsure, ask someone who can do the maths. And, sadly, even your doctor may not know. I actually find it rather sad that even after learning to translate conditional probabilities into natural frequencies, only 87 percent of the gynecologists understood that one in 10 is the best answer: even with the simplification, more than 1 in 10 gynecologists didn’t get it. Your doctor can spot the symptoms and order the right tests, but you may need a mathematically inclined friend to actually calculate the risks.

haggholm: (Default)

Autism is a pretty mysterious condition. No one really knows what causes it (all we really know for sure, after all this testing, is that whatever else it does, the MMR vaccine definitely doesn’t cause it…), but it’s thought to be part genetic, part environmental. A Swedish study on indoor air pollutants has now suggested that, although the data are very tentative, vinyl flooring may increase the risk of autism!

The researchers found four environmental factors associated with autism: vinyl flooring, the mother's smoking, family economic problems and condensation on windows, which indicates poor ventilation.

Infants or toddlers who lived in bedrooms with vinyl, or PVC, floors were twice as likely to have autism five years later, in 2005, than those with wood or linoleum flooring.

Whether the link is real is, as the researchers very frankly point out, as yet unknown, and only further studies can tell. I find this interesting to consider, however, as a case study in how easy it is to get the wrong impression from results like these. There are a number of interesting traps to fall into.

  1. It’s fairly likely that someone will report on this, or already has, under a headline like Research finds link between vinyl flooring and autism, giving the impression that it’s clear-cut, whereas the single most clear-cut message of this study is that it ain’t so.

  2. Correlation does not imply causation, and even when there’s causation, we have to make sure we get it the right way around. As one commenter to that article pointed out, autistic children tend to be extremely preoccupied with textures. Even if there’s a direct link between vinyl flooring and autism, that doesn’t mean that the former causes the latter. Maybe families with autistic children prefer vinyl flooring because it makes the children happier, and so in a sense, autism might cause vinyl flooring!

  3. Notice that they found four, that’s four environmental factors associated with autism: vinyl flooring, the mother's smoking, family economic problems and condensation on windows. However, these variables were not controlled for, and may not be independent.

    What does this mean? Well, it may be that any or all of these variables are connected: Maybe poorer people are more likely to smoke, less likely to afford good ventilation, and less likely to afford nice hardwood floors. If any of these things really does increase the risk of autism, the other variables will be associated with it: If, say, the mother’s smoking causes autism, and more poor mothers than wealthy mothers smoke, then vinyl floors and everything else associated with poor people shows up as associated with autism in the statistics. But while the correlation is there and is real, there is (in my example) no causative relationship at all.

    This sort of thing is always a problem with any study, especially (I believe) when randomisation is poor or sample sizes are small. These are four known and named variables that may reasonably be correlated. What would we have thought of this article if they hadn’t mentioned smoking, wealth, or ventilation? It would have painted a very different picture. And it’s not necessarily dishonesty or editorial brevity that leaves variables out of the equation: Sometimes relevant data just aren’t measured—what if the study hadn’t asked about wealth or smoking?

    I’m reminded of the very poorly thought-out article I read a little while back that claimed that light pollution at night from all the street lights and so forth led to—I don’t recall: Some health problem or other. However, light pollution goes with industrialisation, and the number of variables you introduce when you compare a more industrial to a more agricultural country is ridiculously large. The article made no mention of those at all, but spoke as though there had to be a direct causal link from light pollution to the health issue at hand (which is why I consider it such a poor article).

  4. The study was not designed to look for these data, which means that we must suspect data mining. Data mining refers to digging through a set of data looking for any relationships, whether the ones originally examined or not. The problem is that some relationship will always be found.

    Suppose, for instance, that a study is in some global sense 99% reliable. What does this mean? It means that we set out to discover whether X causes Y, and if the study says yes, we can be 99% certain that we’re right. On the flip side, there’s a 1% chance that we’re wrong. Now suppose that, since we have all these statistics anyway, we decide to check if X causes Z, or A causes B…and so on. For every single one of these, we may (very generously) be 99% certain that it’s correct, but if we look for 100 different relationships, we know that we’ve probably got at least one wrong!

    In fact, assuming the tests are independent, we’re about 63% likely to have got at least one wrong (1 − 0.99^100 ≈ 0.63; there’s a short sketch of this calculation after the list), and that’s with a 99% confidence level and the very generous assumption that the data are as reliable in unknown areas. In reality, I expect that will often not be the case: Even if I design my study to control for a lot of variables surrounding the hypothesis I set out to explore, I can’t possibly do the same for a bunch of hypotheses someone constructs from my data after the fact.

    This is why data mining is frowned upon in scientific studies. We can look at data like that and find correlations that intrigue us, and use those correlations to inspire new studies—just as this Swedish study means that it might not be a bad idea to look at possible connections between vinyl flooring (and phthalates) and autism…but we shouldn’t be fooled into thinking that they necessarily mean anything, because we know that if we look hard enough at any set of statistics, we will be able to find some spurious connections.
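As promised in point 4 above, here is a small sketch of that multiple-comparisons arithmetic. It has nothing to do with the Swedish data themselves; it just shows, assuming for simplicity that the tests are independent, how quickly the chance of at least one spurious ‘finding’ grows with the number of relationships you test:

    # Chance of at least one false positive when testing several independent
    # relationships, each at the same confidence level.
    def p_at_least_one_false_positive(confidence: float, tests: int) -> float:
        return 1 - confidence ** tests

    for tests in (1, 10, 100):
        p = p_at_least_one_false_positive(0.99, tests)
        print(f"{tests:3d} tests at 99% confidence: {p:.0%} chance of a spurious hit")
    # 1 test: 1%; 10 tests: 10%; 100 tests: about 63%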

haggholm: (Default)

“Orac” over at Respectful Insolence has a writing style that’s fairly prone to offend—definitely pugnacious, and very fond of side swipes at those he dislikes (primarily alternative medicine quacks)—and I don’t blame him for his distaste, which in fact I share, but it does sometimes make his essays a bit harder to slog through. (He also has an inordinate fondness for beginning sentences with Indeed. This is one area where I can tentatively claim superiority: I can also be pugnacious and come off as offensive, but while I am no less prone than Orac to complicated sentence structure, I’ve never been accused of any such repetitive verbal tic.)

However, those foibles aside, he has written some very good stuff (he’s on my list of blogs I read daily for a reason), and this article, summarising and explaining the work of a John Ioannidis, was very interesting indeed. The claim it looks at is a very interesting and puzzling one: Given a set of published clinical studies reporting positive outcomes, all with a confidence level of 95%, we should expect more than 5% to give wrong results; and, furthermore, studies of phenomena with low prior probability are more likely to give false positives than studies where the prior probabilities are high. He has often cited this result as a reason why we should be even more skeptical of trials of quackery like homeopathy than the confidence intervals and study powers suggest, but I have to confess I never quite understood it.

I would suggest that you go read the article (or this take, referenced therein), but at the risk of being silly in summarising what is essentially a summary to begin with…here’s the issue, along with some prefatory matter for the non-statisticians:

A Type I error is a false positive: We seem to see an effect where there is no effect, simply due to random chance. This sort of thing does happen. Knowing how dice work, I may hypothesise that if you throw a pair of dice, you are not likely to throw two sixes, but one time out of every 36 (¹/₆×¹/₆), you will. I can confidently predict that you won’t roll double sixes twice in a row, but about one time in 1,296, you will. Any time we perform any experiment, we may get this sort of effect, so a statistical test, such as a medical trial, has a confidence level, where a confidence level of 95% means there’s a 5% chance of a Type I error.

There’s also a Type II error, or false negative, where the hypothesis is true but the results just aren’t borne out on this occasion. The corresponding measure here is a study’s statistical power (the probability of detecting an effect that really is there), but unlike the 95% confidence level, it isn’t pinned down by convention and depends heavily on things like sample size.

This latter observation is a bit problematic, and leads into what Ioannidis observed:

Suppose there are 1000 possible hypotheses to be tested. There are an infinite number of false hypotheses about the world and only a finite number of true hypotheses so we should expect that most hypotheses are false. Let us assume that of every 1000 hypotheses 200 are true and 800 false.

It is inevitable in a statistical study that some false hypotheses are accepted as true. In fact, standard statistical practice [i.e. using a confidence level of 95%] guarantees that at least 5% of false hypotheses are accepted as true. Thus, out of the 800 false hypotheses 40 will be accepted as "true," i.e. statistically significant.

It is also inevitable in a statistical study that we will fail to accept some true hypotheses (Yes, I do know that a proper statistician would say "fail to reject the null when the null is in fact false," but that is ugly). It's hard to say what the probability is of not finding evidence for a true hypothesis because it depends on a variety of factors such as the sample size but let's say that of every 200 true hypotheses we will correctly identify 120 or 60%. Putting this together we find that of every 160 (120+40) hypotheses for which there is statistically significant evidence only 120 will in fact be true or a rate of 75% true.

Did you see that magic? Our confidence level was 95%, no statistics were abused, no mistakes were made (beyond the ones falling into that 5% gap, which we accounted for), and yet we were only 75% correct.
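To make the bookkeeping explicit, here is a short sketch (mine, not Ioannidis’s) that reproduces the arithmetic of the quoted example:

    # The quoted example: 1,000 hypotheses, 200 true and 800 false, tested at a
    # 95% confidence level (5% false-positive rate) with an assumed 60% power.
    true_hypotheses = 200
    false_hypotheses = 800

    alpha = 0.05   # chance of accepting a false hypothesis
    power = 0.60   # assumed chance of detecting a true hypothesis

    false_positives = false_hypotheses * alpha      # 40
    true_positives = true_hypotheses * power        # 120
    significant = true_positives + false_positives  # 160

    print(f"{true_positives / significant:.0%} of significant results are true")  # 75%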

The root of it all is, of course, the ubiquitous problem of publication bias: Researchers like to publish positive-outcome studies rather than negative ones, and people like to read them, so journals like to print them; a journal detailing a long list of ideas that turned out to be wrong isn’t very exciting. The problem is, obviously, that published studies are therefore biased in favour of positive outcomes. (If negative results were published just as readily, the 760 correctly rejected false hypotheses would sit in print right next to the 40 false positives, and the skew would be obvious.)

Definition time again: A prior probability is essentially a plausibility measure before we run an experiment. Plausibility sounds very vague and subjective, but can be pretty concrete. If I know that it rains on (say) 50% of all winter days in Vancouver, I can get up in the morning and assign a prior probability of 50% to the hypothesis that it’s raining. (I can then run experiments, e.g. by looking out a window, and modify my assessment based on new evidence to come up with a posterior probability.)
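To make that concrete with a toy example (the likelihoods here are numbers I’m simply assuming for illustration): say that when it is raining, the view out the window looks rainy 95% of the time, and when it isn’t raining it still looks rainy 10% of the time (wet streets, grey skies). Bayes’ theorem then turns the 50% prior into a posterior of roughly 90%:

    # Prior from the long-run winter rain frequency; the two likelihoods below
    # are assumed purely for illustration.
    prior_rain = 0.50
    p_looks_rainy_given_rain = 0.95     # assumed
    p_looks_rainy_given_no_rain = 0.10  # assumed (wet streets, grey skies, ...)

    p_looks_rainy = (p_looks_rainy_given_rain * prior_rain
                     + p_looks_rainy_given_no_rain * (1 - prior_rain))
    posterior_rain = p_looks_rainy_given_rain * prior_rain / p_looks_rainy
    print(f"P(rain | looks rainy) = {posterior_rain:.0%}")  # about 90%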

Now we can go on to look at why Orac is so fond of holding hypotheses with low prior probabilities to higher standards. It’s pretty simple, really: Recall that the reason why we ended up with so many false positives above—the reason why false positives were such a large proportion of the published results—is that there were more false hypotheses than true hypotheses. The more conservative we are in generating hypotheses, the less outrageous we make them, the more likely we are to be correct, and the fewer false hypotheses we will have (in relation to true hypotheses). Put slightly differently, we’re more likely to be right in medical diagnoses if we go by current evidence and practice than if we make wild guesses.

Now we see that modalities with very low prior probability, such as ones with no plausible mechanism, should be regarded as more suspect. Recall that above, we started out with 800 false hypotheses (out of 1000 total hypotheses), ended up accepting 5% = 40 of them, and that

It's hard to say what the probability is of not finding evidence for a true hypothesis because it depends on a variety of factors such as the sample size but let's say that of every 200 true hypotheses we will correctly identify 120 or 60%. Putting this together we find that of every 160 (120+40) hypotheses for which there is statistically significant evidence only 120 will in fact be true or a rate of 75% true.

That is, the proportion of true hypotheses to false hypotheses affects the accuracy of our answer. This is very easy to see—let’s suppose that only half of the hypotheses were false; now we accept 5% of 500, that is 25 false studies, and keeping the same proportions,

…Let's say that of every 500 true hypotheses we will correctly identify 300 or 60%. Putting this together we find that of every 325 (300+25) hypotheses for which there is statistically significant evidence only 300 will in fact be true or a rate of 92% true.

We’re still short of that 95% measure, but we’re way better than the original 75%, simply by making more plausible guesses (within each study, the Type I and Type II error rates were exactly the same as before). The less plausible an idea is, the higher the proportion of false hypotheses among all the hypotheses the idea generates; that is, the worse its ratio of true to false hypotheses. Wild or vague ideas (homeopathy, reiki, …) are very likely to generate false hypotheses along with any true ones they might conceivably generate. More conventional ideas will tend to generate a higher proportion of true hypotheses—if we know from long experience that Aspirin relieves pain, it’s very likely that a similar drug does likewise.
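The whole argument fits in one small function, which makes the dependence on prior plausibility easy to see. This is just the arithmetic from above, parameterised by the proportion of hypotheses that are true; the 5% row is an extra value I’ve thrown in to stand for a highly implausible modality:

    # Proportion of 'statistically significant' results that are actually true,
    # given the prior proportion of true hypotheses, the false-positive rate
    # (alpha), and the power (chance of detecting a true effect).
    def proportion_true(prior_true: float, alpha: float = 0.05, power: float = 0.60) -> float:
        true_positives = power * prior_true
        false_positives = alpha * (1 - prior_true)
        return true_positives / (true_positives + false_positives)

    for prior in (0.05, 0.20, 0.50):
        print(f"prior {prior:.0%}: {proportion_true(prior):.0%} of positive results are true")
    # prior 5%: about 39%; prior 20%: 75%; prior 50%: 92%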

This is not to say that no wild ideas are ever right. Of course they sometimes are (though of course they usually aren’t). What it does mean is that not only should we be skeptical and demand evidence for them, there are sound statistical reasons to set the bar of evidence even higher for implausible than for plausible modalities.

It is also a good argument for the move away from strict EBM (evidence-based medicine) to SBM (science-based medicine), where things like prior probability are taken into account. Accepting double-blind trials at a 95% confidence level at face value isn’t good enough.
