Occasional reflections on Life, the World, and Mathematics

Posts tagged ‘statistics’

Medical hype and under-hype

New heart treatment is biggest breakthrough since statins, scientists say

I just came across this breathless headline, published in the Guardian last year. On the one hand, this is just one study, the effect was barely statistically significant, and experience suggests a fairly high likelihood that this will ultimately have no effect on general medical practice or on human health and mortality rates. I understand the exigencies of the daily newspaper publishing model, but it’s problematic that the “new research study” has been defined as the event on which to hang a headline. The only people who need that level of up-to-the-minute detail are those professionally involved in winnowing out the new ideas and turning them into clinical practice. We would all be better served if newspapers instead reported on what new treatments have actually had an effect over the last five years. That would be just as novel to the general readership, and far less erratic.

On the other hand, I want to comment on one point of what I see as exaggerated skepticism: The paragraph that summarises the study results says

For patients who received the canakinumab injections the team reported a 15% reduction in the risk of a cardiovascular event, including fatal and non-fatal heart attacks and strokes. Also, the need for expensive interventional procedures, such as bypass surgery and inserting stents, was cut by more than 30%. There was no overall difference in death rates between patients on canakinumab and those given placebo injections, and the drug did not change cholesterol levels.

There is then a quote:

Prof Martin Bennett, a cardiologist from Cambridge who was not involved in the study, said the trial results were an important advance in understanding why heart attacks happen. But, he said, he had concerns about the side effects, the high cost of the drug and the fact that death rates were not better in those given the drug.

In principle, I think this is a good thing. There are far too many studies that show a treatment scraping out a barely significant reduction in mortality due to one cause, which is highlighted, but a countervailing mortality increase due to other causes, netting out to essentially no improvement. Then you have to say, we really should be aiming to reduce mortality, not to reduce a cause of mortality. (I remember many years ago, a few years after the US started raising the age for purchasing alcohol to 21, reading of a study that was heralded as showing the success of this approach, having found that the number of traffic fatalities attributed to alcohol had decreased substantially. Unfortunately, the number of fatalities not attributed to alcohol had increased by a similar amount, suggesting that some amount of recategorisation was going on.) Sometimes researchers will try to distract attention from a null result for mortality by pointing to a secondary endpoint — improved results on a blood test linked to mortality, for instance — which needs to be viewed with some suspicion.

In this case, though, I think the skepticism is unwarranted. There is no doubt that before the study the researchers would have predicted a reduction in mortality from cardiovascular causes, no reduction due to any other cause, and likely an increase due to infection. The worry would be that the increase due to infection — or to some unanticipated side effect — would outweigh the benefits.

The results confirmed the best-case predictions. Cardiovascular mortality was reduced — possibly a lot, possibly only slightly. Deaths due to infections increased significantly in percentage terms, but the numbers were small relative to the cardiovascular improvements. The one big surprise was a very substantial reduction in cancer mortality. The researchers are open about not having predicted this, and not having a clear explanation. In such a case, it would be wrong to put much weight on the statistical “significance”, because it is impossible to quantify the class of hypotheses that are implicitly being ignored. The proper thing is to highlight this observation for further research, which is what they have done.
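
To see why a surprise finding like the cancer reduction should be treated as a hypothesis rather than a confirmed effect, here is a minimal simulation sketch. The cause categories, trial size, and death rates are invented for illustration; the point is only that examining many cause-specific comparisons after the fact makes some spurious “significance” almost inevitable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_sims = 10_000      # simulated trials with no true effect anywhere
n_causes = 20        # cause-of-death categories examined post hoc (hypothetical)
n_per_arm = 5_000    # patients per arm (hypothetical)
base_rate = 0.01     # true death rate per cause, identical in both arms

false_alarm_trials = 0
for _ in range(n_sims):
    treated = rng.binomial(n_per_arm, base_rate, size=n_causes)
    placebo = rng.binomial(n_per_arm, base_rate, size=n_causes)
    pooled = (treated + placebo) / (2 * n_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    se[se == 0] = np.inf                      # no events in either arm: z becomes 0
    z = (treated - placebo) / n_per_arm / se  # two-proportion z statistic
    pvals = 2 * stats.norm.sf(np.abs(z))
    if (pvals < 0.05).any():
        false_alarm_trials += 1

# Roughly 1 - 0.95**20, about 64%, of these null trials flag at least one
# cause as "significant" at the 5% level purely by chance.
print(f"Trials with at least one spurious 'significant' cause: "
      f"{false_alarm_trials / n_sims:.0%}")
```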

When you set aside these three groups of causes — cardiovascular, infections, cancer — you are left with approximately equal mortality rates in the placebo and treatment groups, as expected. So there is no reason to be “concerned” that overall mortality was not improved in those receiving the drug. First of all, overall mortality was better in the treatment group. It’s just that the improvement in CV mortality — as predicted — while large enough to be clearly non-random relative to the overall number of CV deaths, was not large compared with the much larger total number of deaths. This is no more “concerning” than it would be, when reviewing a programme for improving airline safety, to discover that it did not appreciably change the total number of transportation-related fatalities.

The Silver Standard 4: Reconsideration

After writing in praise of the honesty and accuracy of fivethirtyeight’s results, I felt uncomfortable about the asymmetry in the way I’d treated Democrats and Republicans in the evaluation. In the plots I made, low-probability Democratic predictions that went wrong pop out on the left-hand side, whereas low-probability Republican predictions that went wrong would get buried in the smooth glide down to zero on the right-hand side. So I decided that what I’m really interested in is all low-probability predictions, and that I should treat them symmetrically.

For each district there is a predicted loser (PL), with probability smaller than 1/2. In about one third of the districts the PL was assigned a probability of 0. The expected number of PLs (EPL) who would win is simply the sum of all the predicted win probabilities that are smaller than 1/2. (Where multiple candidates from the same party are in the race, I’ve combined them.) The 538 EPL was 21.85. The actual number of winning PLs was 13.

What I am testing is whether 538 made enough wrong predictions. This is the opposite of the usual evaluation, which gives points for getting predictions right. But measured against their own stated probabilities, the number of districts that went the opposite of the way they predicted was considerably lower than their model implied. That is prima facie evidence that the PL win probabilities were being padded somewhat. To be more precise, under the 538 model the number of winning PLs should be approximately Poisson distributed with parameter 21.85, meaning that the probability of 13 or fewer PLs winning is 0.030. Which is kind of low, but still pretty impressive, given all the complications of the prediction game.
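
As a quick sketch of the arithmetic in the last two paragraphs (the forecast probabilities below are random placeholders; the real inputs were 538’s per-district “Deluxe” probabilities): the expected number of winning predicted losers is just the sum of the sub-1/2 win probabilities, and the tail probability treats that count as approximately Poisson.

```python
import numpy as np
from scipy import stats

# Placeholder forecasts for illustration; the real inputs were 538's
# per-district "Deluxe" win probabilities for the Democratic candidate.
rng = np.random.default_rng(1)
p_dem = rng.beta(0.3, 0.3, size=435)       # fake U-shaped forecasts

p_loser = np.minimum(p_dem, 1 - p_dem)     # probability the predicted loser (PL) wins
expected_pl_wins = p_loser.sum()           # the 538 value was 21.85

# Under the model, the count of winning PLs is a sum of many nearly
# independent low-probability events, hence approximately Poisson.
prob_13_or_fewer = stats.poisson.cdf(13, 21.85)

print(f"Expected PL wins (fake forecasts): {expected_pl_wins:.2f}")
print(f"P(13 or fewer PL wins | mean 21.85) = {prob_13_or_fewer:.3f}")  # ≈ 0.030
```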

Below I show plots of the errors for various scenarios, measuring the cumulative error for these symmetric low predictions. (I’ve added an “Extra Tarnished” scenario, with the transformation based on the even more extreme beta(.25,.25).) I show it first without adjusting for the total number of predicted winning PLs:

[Figure: cumulative prediction error for the symmetric low-probability (predicted-loser) predictions under the various scenarios, without adjusting for the total number of predicted PL wins]

We see that the tarnished predictions forecast a lot more PL victories than we actually observe. The actual predictions overshoot only slightly, but the errors are suspiciously one-sided — that is, all in the direction of overpredicting PL victories, consistent with padding the margins slightly, erring in the direction of claiming uncertainty.

And here is an image more like the ones I had before, where all the predictions are normalised to correspond to the same number of predicted wins:

[Figure: the same cumulative errors, with all predictions normalised to the same number of predicted PL wins]

 

The Silver Standard, Part 3: The Reckoning

One of the accusations most commonly levelled against Nate Silver and his enterprise is that probabilistic predictions are unfalsifiable. “He never said the Democrats would win the House. He only said there was an 85% chance. So if they don’t win, he has an out.” This is true only if we focus on the top-level prediction, and ignore all the smaller predictions that went into it. (Except in the trivial sense that you can’t say it’s impossible that a fair coin just happened to come up heads 20 times in a row.)

So, since Silver can be tested, I thought I should see how 538’s predictions stood up in the 2018 US House election. I took their predictions of the probability of victory for a Democratic candidate in all 435 congressional districts (I used their “Deluxe” prediction) from the morning of 6 November. (I should perhaps note here that one third of the districts had estimates of 0 (31 districts) or 1 (113 districts), so a victory for the wrong candidate in any one of these districts would have been a black mark for the model.) I ordered the districts by the predicted probability, to compute the cumulative predicted number of seats, starting from the smallest. I plotted this against the cumulative actual number of seats won, taking the current leader as the winner in the 11 districts where there is no definite decision yet.

[Figure: cumulative predicted number of Democratic seats against cumulative actual seats won, districts ordered by predicted probability]

The predicted number of seats won by Democrats was 231.4, impressively close to the actual 231 won. But that’s not the standard we are judging them by, and in this plot (and the ones to follow) I have normalised the predicted and observed totals to be the same. I’m looking at the cumulative fractions of a seat contributed by each district. If the predicted probabilities are accurate, we would expect the plot (in green) to lie very close to the line with slope 1 (dashed red). It certainly does look close, but the scale doesn’t make it easy to see the differences. So here is the plot of the prediction error, the difference between the red dashed line and the green curve, against the cumulative prediction:

[Figure: cumulative prediction error (predicted minus actual) against the cumulative prediction]

There certainly seems to have been some overestimation of Democratic chances at the low end, leading to a maximum cumulative overprediction of about 6 (which comes at district 155, that is, the 155th most Republican district). It’s not obvious whether these differences are worse than you would expect. So in the next plot we make two comparisons. The red curve replaces the true outcomes with simulated outcomes, where we assume the 538 probabilities are exactly right. This is the best-case scenario. (We only plot it out to 100 cumulative seats, because the action is all at the low end; the last 150 districts have essentially no randomness.) The red curve and the green curve look very similar (except for the direction of the error, which is random). The most extreme error in the simulated election result is a bit more than 5.
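
For anyone who wants to reproduce this kind of plot, here is a minimal sketch of the construction, with invented forecasts standing in for the real 538 probabilities and simulated outcomes standing in for the actual results: sort the districts by forecast probability, accumulate predicted and observed wins, normalise the totals, and look at the running difference.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented forecasts standing in for 538's per-district probabilities;
# with the real data you would also plug in the true 2018 outcomes.
p = np.sort(rng.beta(0.3, 0.3, size=435))       # P(Democratic win), ordered ascending

def cumulative_error(p, outcomes):
    """Cumulative predicted minus cumulative observed Democratic seats,
    after normalising both to the same total (as in the plots above)."""
    predicted = np.cumsum(p)
    observed = np.cumsum(outcomes)
    predicted = predicted * observed[-1] / predicted[-1]
    return predicted - observed

# "Best case" comparison: outcomes simulated from the model's own probabilities.
simulated_outcomes = rng.random(435) < p
error_curve = cumulative_error(p, simulated_outcomes)
print(f"Maximum cumulative error under the model: {np.abs(error_curve).max():.2f}")
```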

What would the curve look like if Silver had cheated, by trying to make his predictions all look less certain, to give himself an out when they go wrong? We imagine an alternative psephologist, call him Nate Tarnished, who has access to the exact true probabilities for Democrats to win each district, but who hedges his bets by reporting a probability closer to 1/2. (As an example, we take the cumulative beta(1/2,1/2) distribution function. This leaves 0, 1/2, and 1 unchanged, but .001 gets pushed up to .02, .05 is pushed up to .14, and .2 becomes .3. Similarly, .999 becomes .98 and .8 drops to .7. Not huge changes, but enough to create more wiggle room after the fact.)
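
The hedging transformation described in the parenthesis is just the beta(1/2,1/2) (arcsine) CDF applied to each forecast probability; a quick check, under that reading, reproduces the example values quoted above.

```python
from scipy import stats

def tarnish(p, a=0.5, b=0.5):
    """Push a probability toward 1/2 by passing it through the beta(a, b) CDF.
    Leaves 0, 1/2 and 1 fixed; a = b = 0.25 gives the 'Extra Tarnished' version."""
    return stats.beta.cdf(p, a, b)

for p in (0.001, 0.05, 0.2, 0.8, 0.999):
    print(f"{p:>6} -> {tarnish(p):.3f}")
# 0.001 -> 0.020, 0.05 -> 0.144, 0.2 -> 0.295, 0.8 -> 0.705, 0.999 -> 0.980
```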

In this case, we would expect to accumulate much more excess cumulative predicted probability on the left side. And this is exactly what we illustrate with the blue curve, where the error repeatedly rises nearly to 10, before slowly declining to 0.

[Figure: cumulative prediction error for the actual 538 predictions (green), outcomes simulated from the model (red), and the hedged “Tarnished” predictions (blue)]

I’d say the performance of the 538 models in this election was impressive. A better test would be to look at the predicted vote shares in all 435 districts. This would require that I manually enter all of the results, since they don’t seem to be available to download. Perhaps I’ll do that some day.

Obesity and cancer

The Guardian has prominently posted a report by Cancer Research UK with a frightening headline:

Obesity to eclipse smoking as biggest cause of cancer in UK women by 2043

That’s pretty sensational. I was intrigued, because the mortality effects of obesity have long puzzled me. It seems like I’ve been hearing claims for decades, loudly trumpeted in the press, that obesity is turning into a health crisis, with the mortality crisis just around the corner. It seems plausible, and yet every time I try to dig into one of these reports, to find out what the estimates are based on, I come up empty. Looking at the data naively, it seems that the shift from BMI 20 to BMI 25 — the threshold of official “overweight” designation — has been associated in the past with a reduction in all-cause mortality. Passing through overweight to “obesity” at BMI 30 raises mortality rates only very slightly. Major increases in mortality seem to be associated with BMI over 35 or 40, but even under current projections those levels remain rare in nearly all populations.

There is a chain of reasoning that goes from obesity to morbid symptoms like high blood pressure and diabetes, to mortality, but this is fairly indirect, and ignores the rapid improvement in treatments for these secondary symptoms, as well as the clear historical association between increasing childhood nutrition and improved longevity. Concerned experts often attribute the reduction in mortality at low levels of “overweight” to errors in study design — such as confusing weight loss due to illness with healthy low weight — which has indeed been a problem, while the negative health consequences attributable to weight-loss diets tend to be ignored. All in all, it has always seemed to be a murky question, leaving me genuinely puzzled by the quantitative certainty with which catastrophe is predicted. Clearly increasing obesity isn’t helping people’s health — the associated morbidity is a real thing, even if it isn’t shortening people’s lives much — but I’m perplexed by the quantitative claims about mortality.

So, I thought, if obesity is causing cancer, as much as tobacco is, that’s a pretty convincing piece of the mortality story. And then I followed up the citations, and the sand ran through my fingers. Here are some problems:

  1. Just to begin with, the convergence of cancers attributable to smoking with cancers attributable to obesity is almost entirely due to the reduction in smoking. “By 2043 smoking may have been reduced to the point that it is no longer the leading cause of cancer in women” seems like a less alarming possible headline. Here’s the plot from the CRUK report:
    [Figure: plot from the CRUK report projecting cancers attributable to smoking and to obesity]
  2. The report entirely conflates the categories “overweight” and “obese”. The formula they cite refers to different levels of exposure, so it is likely they have separated them out in their calculations, but it is not made clear.
  3. The relative risk numbers seem to derive primarily from this paper. There we see a lot of other causes of cancer, such as occupation, alcohol consumption, and exposure to UV radiation, all of which are of similar magnitude to weight. Occupational exposure is about as significant for men as obesity, and more amenable to political control, but is ignored in this report. Again, the real story is that the number of cancers attributable to smoking may be expected to decline over the next quarter century, to something more like the number caused by multiple existing moderate causes.
  4.  Breast cancer makes up a huge part of women’s cancer risk, hence a huge part of the additional risk attributed to overweight, hence presumably makes up the main explanation for why women’s additional risk due to overweight is so much higher than men’s. The study seems to estimate the additional breast cancer risk due to smoking at 0. This seems implausible. No papers are cited on breast cancer risk and smoking, possibly because of the focus on British statistics, but here is a very recent study finding a very substantial increase. And here is a meta-analysis.
  5. The two most common cancers attributable to obesity in women — cancer of the breast and uterus — are among the most survivable, with ten-year survival above 75%. (Survival rates here.) The next two on the list would be bowel and bladder cancer, with ten-year survival above 50%. The cancers caused by smoking, on the other hand, are primarily lung cancer, with ten-year survival around 7%, followed by oesophageal (13%), pancreatic (1%), bowel and bladder. Combining all of these different neoplasms into a risk of “cancer”, and then comparing the risk due to obesity with that due to smoking, is deeply misleading; a rough back-of-the-envelope comparison follows this list.
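
Here is the rough back-of-the-envelope comparison promised in point 5. The case counts are invented round numbers, not CRUK’s estimates; the fatality figures are simply one minus the ten-year survival rates quoted above. Equal numbers of “attributable cancers” translate into very different numbers of deaths.

```python
# Invented case counts, for illustration only; fatality = 1 - ten-year survival
# (survival figures as quoted above).
obesity_cases = {"breast": 6000, "uterus": 2000, "bowel": 1000, "bladder": 1000}
smoking_cases = {"lung": 7000, "oesophagus": 1500, "pancreas": 500,
                 "bowel": 500, "bladder": 500}
fatality = {"breast": 0.25, "uterus": 0.25, "bowel": 0.50, "bladder": 0.50,
            "lung": 0.93, "oesophagus": 0.87, "pancreas": 0.99}

def attributable_deaths(cases):
    return sum(count * fatality[site] for site, count in cases.items())

print("Obesity: 10,000 attributable cancers ->",
      f"{attributable_deaths(obesity_cases):,.0f} deaths")   # ≈ 3,000
print("Smoking: 10,000 attributable cancers ->",
      f"{attributable_deaths(smoking_cases):,.0f} deaths")   # ≈ 8,800
```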

UPDATE: My letter to the editor appeared in The Guardian.

Fraud detection and statistics

Elizabeth Holmes, founder of Theranos, has now been formally indicted for criminal fraud. I’ve commented on the company before, and on the journalistic conventions around intellectuals that fostered her rise. But now that the Theranos story is coming to an end, I feel a need to comment on how utterly unnecessary this all was.

At its peak, Theranos was valued at $9 billion and employed 800 people. Yet according to John Carreyrou, the Wall Street Journal reporter whose investigations exposed Theranos’s fraud, the company is down to just 20 employees who are trying to close up shop.

All credit to Carreyrou, who by all accounts has done an excellent job investigating and reporting on this fiasco, but literally any statistician — anyone who has been through and understood a first-year statistics course — could have said from the start that this was sheer nonsense. That’s presumably why the board was made up mainly of politicians and generals.

The promise of Theranos was that they were going to revolutionise medicine by performing a hundred random medical tests on a drop of blood, and give patients a complete readout of their state of health, independent of medical recommendation of specific tests. But any statistician knows — and every medical practitioner should know — that the reason we don’t do lots of random tests without any specific indication isn’t that they’re too expensive — many aren’t — or that they require too much blood, but that the more tests you do, the more false positives you’re going to accumulate.

If you do a hundred tests on an average person, you’re going to find at least a few questionable results — either from measurement error, or because most tests aren’t all that specific — requiring followups and expensive investigations, and possibly unnecessary treatments.
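
A minimal sketch of that arithmetic (the 5% false-positive rate and the 100 independent tests are illustrative assumptions, not a description of Theranos’s actual panel):

```python
n_tests = 100    # tests run on one drop of blood (illustrative)
fp_rate = 0.05   # false-positive rate per test in a healthy person (illustrative)

expected_false_positives = n_tests * fp_rate
p_at_least_one = 1 - (1 - fp_rate) ** n_tests   # assumes independent tests

print(f"Expected false positives per healthy person: {expected_false_positives:.0f}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")   # ≈ 0.994
```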

Of course, if I had to evaluate the proposal for such a company, I would keep an open mind about the possibility of a conceptual breakthrough that would allow them to control the false positives. But I would have demanded very clear evidence and explanations. The fact that the fawning news reports back in 2013-15 raved about the genius new biomedical technology, but never even claimed that the company had produced (or found) any innovative statistical methodology, made me pretty sure that they had no idea what they were doing. In the end, it turned out that the biomedical innovations were also fake, which I probably should have guessed. But if the greedhead generals — among them the current secretary of defense, who definitely should be questioned about this, and probably ought to resign — had asked a statistician, they could have saved a lot of people a lot of unpleasantness, and maybe helped save Elizabeth Holmes from herself.

Statistics, politics, and causal theories

A new headline from the Trump era:

Fewer Immigrants Are Reporting Domestic Abuse. Police Blame Fear of Deportation.

Compare it to this headline from a few months ago:

Arrests along Mexico border drop sharply under Trump, new statistics show

This latter article goes on to comment

The figures show a sharp drop in apprehensions immediately after President Trump’s election win, possibly reflecting the deterrent effect of his rhetoric on would-be border crossers.

It must be noted that these two interpretations of declining enforcement numbers are diametrically opposed: In the first case, declining reports to police are taken as evidence of nothing other than declining reports, whereas the latter analysis eschews such a naive interpretation, suggesting that the decline in apprehensions is actually evidence of a decline in the number of offences (in this case, illegal border crossings).

I don’t mean to criticise the conventional wisdom, which seems to me eminently sensible. I just think it’s interesting how little the statistical “facts” are able to speak for themselves. The same facts could mean that the election of Trump was associated with a decline in domestic violence in immigrant communities, and also with a reduction in border patrol effectiveness. It’s hard to come up with a causal argument for either of these — Did immigrant men look at Trump with revulsion and decide, abusing women is for the gringos? Did ICE get so caught up with the fun of splitting up families in midwestern towns and harassing Spanish speakers in Montana, that they stopped paying attention to the southern border? — so we default to the opposite conclusion.

Natural frequencies and individual propensities

I’ve just been reading Gerd Gigerenzer’s book Reckoning with Risk, about risk communication, mainly a plaidoyer for the use of “natural frequencies” in place of probabilities: Statements of the form “In how many cases out of 100 similar cases of X would you expect Y to happen”. He cites one study in which forensic psychiatry experts were presented with a case study and asked to estimate the likelihood of the individual being violent in the next six months. Half the subjects were asked “What is the probability that this person will commit a violent act in the next six months?” The other half were asked “How many out of 100 women like this patient would commit a violent act in the next six months?” Looking at these questions, it was obvious to me that the latter question would elicit lower estimates. Which is indeed what happened: The average response to the first question was about 0.3; the average response to the second was about 20.

What surprised me was that Gigerenzer seemed perplexed by this consistent difference in one direction (though, obviously, not by the fact that the experts were confused by the probability statement). He suggested that those answering the first question were thinking about the same patient being released multiple times, which didn’t make much sense to me.

What I think is that the experts were thinking of the individual probability as a hidden fact, not a statistical statement. Asked to estimate this unknown probability, it seems natural that they would be cautious: thinking it’s somewhere between 10 and 30 percent, they would not want to underestimate this individual’s probability, and so would conservatively state the upper end. This is perfectly consistent with them thinking that, averaged over 100 cases, they could confidently state that about 20 would commit a violent act.

Male nurses and politically incorrect comments on gender

I was just reading this article by journalist Conor Friedersdorf, complaining about how Canadian psychologist Jordan Peterson is being unfairly treated by journalists, who try to twist his subtle anti-feminist arguments into crude anti-feminist slurs. He certainly has a point. But then one comes to comments like this:

[Interviewer]: Is gender equality desirable?

Peterson: If it means equality of outcome then it is almost certainly undesirable. That’s already been demonstrated in Scandinavia. Men and women won’t sort themselves into the same categories if you leave them to do it of their own accord. It’s 20 to 1 female nurses to male, something like that. And approximately the same male engineers to female engineers. That’s a consequence of the free choice of men and women in the societies that have gone farther than any other societies to make gender equality the purpose of the law. Those are ineradicable differences––you can eradicate them with tremendous social pressure, and tyranny, but if you leave men and women to make their own choices you will not get equal outcomes.

20 to 1? That seems really high. For nurses and for engineers. So I decided to do something rude, and check the numbers. For nurses, I found these statistics. There’s a lot of variation in Scandinavia. In Denmark it seems like about 20:1 female to male. But in Norway it’s 9:1. In Iceland it’s 100:1. Looking further afield, in Israel and Italy 20% of nurses are male. And in the Netherlands nearly 25%. This does not look like an ineradicable difference to me. It looks like path dependence and social context.

What about engineers? Here Peterson is, to use the technical term, talking out of his ass. There is no country in the EU with such an extreme gender imbalance for engineers: The most extreme is the UK, with about a 10:1 male to female ratio. In Sweden it’s 3:1, in Norway 4:1, and in Denmark 5:1. In Latvia the fraction of female engineers is up to 30%.
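
To put the quoted figures on a common scale, here is the quick conversion from an X:1 ratio to a minority share; the ratios and percentages are simply the ones cited in the two preceding paragraphs.

```python
def minority_share(ratio_to_one):
    """Convert an X:1 majority-to-minority ratio into the minority share."""
    return 1 / (ratio_to_one + 1)

# Nurses: share of men implied by the ratios quoted above
for country, ratio in [("Denmark", 20), ("Norway", 9), ("Iceland", 100)]:
    print(f"{country:8s} nurses: {minority_share(ratio):5.1%} male")
# Compare: Israel and Italy ~20% male, the Netherlands ~25% male.

# Engineers: share of women implied by the male:female ratios quoted above
for country, ratio in [("UK", 10), ("Sweden", 3), ("Norway", 4), ("Denmark", 5)]:
    print(f"{country:8s} engineers: {minority_share(ratio):5.1%} female")
# Compare: Latvia ~30% female.
```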

I think, if you want to make provocative “I’m just trying to be rational here” public arguments, you kind of have an obligation not to make up your supporting facts.

Why people hate statisticians

Andrew Dilnot, former head of the UK Statistics Authority and current warden (no really!) of Nuffield College, gave a talk here last week, at our annual event honouring Florence Nightingale qua statistician. The ostensible title was “Numbers and Public policy: Why statistics really matter”, but the title should have been “Why people hate statisticians”. This was one of the most extreme versions I’ve ever seen of a speaker shopping trite (mostly right-wing) political talking points by dressing them up in statistics to make the dubious assertions seem irrefutable, and to make the trivially obvious look ingenious.

I don’t have the slides from the talk, but video of a similar talk is available here. He spent quite a bit of his talk trying to debunk the Occupy Movement’s slogan that inequality has been increasing. The 90:10 ratio bounced along near 3 for a while, then rose to 4 during the 1980s (the Thatcher years… who knew?!), and hasn’t moved much since. Case closed. Oh, but wait, what about other measures of inequality, you may ask. And since you might ask, he had to set up some straw men to knock down. He showed the same pattern for five other measures of inequality. Case really closed.

Except that these five were all measuring the same thing, more or less. The argument people like Piketty have been making is not that the 90th percentile has been doing so much better than the 10th percentile, but that increases in wealth have been concentrated in ever smaller fractions of the population. None of the measures he looked at was designed to capture that process. The Gini coefficient, which looks like it measures the whole distribution, is actually extremely insensitive to extreme concentration at the high end, because it is a population average. Suppose the top 1% has 20% of the income. Changes of distribution within the top 1% cannot shift the Gini coefficient by more than about 3% of its current value. He also showed the 95:5 ratio, and lo and behold, that kept rising through the 90s, then stopped. All consistent with the main critique of rising income inequality.
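
That claim is easy to check with a small simulation on an invented income distribution (not UK data): hold the top 1%’s total share fixed at 20% and compare the Gini coefficient when that share is spread evenly across the top 1% with the Gini when a single household takes nearly all of it. The change comes out to only a few percent of the coefficient’s value.

```python
import numpy as np

def gini(incomes):
    """Gini coefficient: mean absolute difference divided by twice the mean."""
    x = np.sort(np.asarray(incomes, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

rng = np.random.default_rng(3)
n_bottom, n_top = 99_000, 1_000          # bottom 99% and top 1% of households

# Invented bottom-99% incomes; the top 1% is then given 20% of total income.
bottom = rng.lognormal(mean=10.0, sigma=0.5, size=n_bottom)
top_total = bottom.sum() * 0.20 / 0.80   # so the top 1% holds 20% of all income

spread_evenly = np.full(n_top, top_total / n_top)
one_winner = np.full(n_top, 1.0)
one_winner[0] = top_total - (n_top - 1)  # nearly everything to one household

g_even = gini(np.concatenate([bottom, spread_evenly]))
g_concentrated = gini(np.concatenate([bottom, one_winner]))
print(f"Gini, top 1% share spread evenly: {g_even:.3f}")
print(f"Gini, top 1% share concentrated:  {g_concentrated:.3f}")
print(f"Relative change: {(g_concentrated - g_even) / g_even:.1%}")
```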

Since he’s obviously not stupid, and obviously understands economics much better than I do, it’s hard to avoid thinking that this was all smoke and mirrors, intended to lull people to sleep about rising inequality, under the cover of technocratic expertise. It’s a well-known trick: Ignore the strongest criticism of your point of view, and give lots of details about weak arguments. Mathematical details are best. “Just do the math” is a nice slogan. Sometimes simple (or complex) calculations can really shed light on a problem that looks to be inextricably bound up with political interests and ideologies. But sometimes not. And sometimes you just have to accept that a political economic argument needs to be melded with statistical reasoning, and you have to be open about the entirety of the argument.

Small samples

New York Republican Representative Lee Zeldin was asked by reporter Tara Golshan how he felt about the fact that polls seem to show that a large majority of Americans — and even of Republican voters — oppose the Republican plan to reduce corporate tax rates. His response:

What I have come in contact with would reflect different numbers. So it would be interesting to see an accurate poll of 100 million Americans. But sometimes the polls get done of 1,000 [people].

Yes, that does seem suspicious, only asking 1,000 people… The 100 million people he has come in contact with are probably more typical.
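
For the record, the sampling arithmetic behind a poll of 1,000 (generic binomial sampling, not any particular poll): the standard error of an estimated proportion near 50% is about 1.6 percentage points, giving a margin of error of roughly ±3 points.

```python
import math

n = 1_000   # poll sample size
p = 0.5     # worst case: a proportion near 50%

standard_error = math.sqrt(p * (1 - p) / n)
margin_of_error = 1.96 * standard_error        # 95% confidence

print(f"Standard error:  {standard_error:.3f}")       # ≈ 0.016
print(f"Margin of error: ±{margin_of_error:.1%}")     # ≈ ±3.1%
```

The hard part of polling is getting a representative sample, not a large one.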
