The return of quota sampling

Everyone knows about the famous Dewey Defeats Truman headline fiasco, and that the Chicago Daily Tribune was inspired to its premature announcement by erroneous pre-election polls. But why were the polls so wrong?

The Social Science Research Council set up a committee to investigate the polling failure. Their report, published in 1949, listed a number of faults, including disparaging the very notion of trying to predict the outcome of a close election. But one important methodological criticism — and the one that significantly influenced the later development of political polling, and became the primary lesson in statistics textbooks — was the critique of quota sampling. (An accessible summary of lessons from the 1948 polling fiasco by the renowned psychologist Rensis Likert was published just a month after the election in Scientific American.)

Serious polling at the time was divided between two general methodologies: random sampling and quota sampling. Random sampling, as the name implies, works by attempting to select from the population of potential voters entirely at random, with each voter equally likely to be selected. This was still considered too theoretically novel to be widely used, whereas quota sampling had been established by Gallup since the mid-1930s. In quota sampling the voting population is modelled by demographic characteristics, based on census data, and each interviewer is assigned a quota of respondents to fill in each category: 51 women and 49 men, say, a certain number in the age range 21-34, or specific numbers in each “economic class” — of which Roper, for example, had five, one of which in the 1940s was “Negro”. The interviewers were allowed great latitude in filling their quotas, finding people at home or on the street.

In a sense, we have returned to quota sampling, in the more sophisticated version of “weighted probability sampling”. Since hardly anyone responds to a survey — response rates are typically no more than about 5% — there’s no way the people who do respond can be representative of the whole population. So pollsters model the population — or the supposed voting population — and reweight the responses they do get proportionately, according to demographic characteristics. If Black women over age 50 are thought to be equally common in the voting population as white men under age 30, but we have twice as many of the former as the latter, we count the responses of the latter twice as much as the former in the final estimates. It’s just a way of making a quota sample after the fact, without the stress of specifically looking for representatives of particular demographic groups.
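
Concretely, the after-the-fact reweighting amounts to something like the following sketch. This is only an illustration of the idea: the demographic cells, counts, population shares and responses below are all invented.

```python
# Toy post-hoc reweighting (post-stratification) sketch with invented numbers.
# Each cell's weight is its assumed share of the voting population divided by
# its share of the actual respondents; responses are then averaged with weights.

population_share = {"black_women_over_50": 0.10, "white_men_under_30": 0.10}  # assumed equal shares
respondents = {"black_women_over_50": 200, "white_men_under_30": 100}         # twice as many of the former
support = {"black_women_over_50": 0.60, "white_men_under_30": 0.40}           # observed support in each cell

total = sum(respondents.values())
sample_share = {g: n / total for g, n in respondents.items()}
weights = {g: population_share[g] / sample_share[g] for g in respondents}

weighted_estimate = (
    sum(weights[g] * respondents[g] * support[g] for g in respondents)
    / sum(weights[g] * respondents[g] for g in respondents)
)
print(weights)            # the two cells' weights come out in a 1:2 ratio,
                          # so each under-30 white man counts twice as much
print(weighted_estimate)  # 0.5: both cells now contribute equally
```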

Consequently, it has most of the deficiencies of a quota sample. The difficulty of modelling the electorate is one that has gotten quite a bit of attention in the modern context: We know fairly precisely how demographic groups are distributed in the population, but we can only theorise about how they will be distributed among voters at the next election. At the same time, it is straightforward to construct these theories, to describe them, and to test them after the fact. The more serious problem — and the one that was emphasised in the commission report in 1948, but has been less emphasised recently — is in the nature of how the quotas are filled. The reason for probability sampling is that taking whichever respondents are easiest to get — a “sample of convenience” — is sure to give you a biased sample. If you sample people from telephone directories in 1936 then it’s easy to see how they end up biased against the favoured candidate of the poor. If you take a sample of convenience within a small demographic group, such as middle-income people, then it won’t be easy to recognise how the sample is biased, but it may still be biased.

For whatever reason, in the 1930s and 1940s, within each demographic group the Republicans were easier for the interviewers to contact than the Democrats. Maybe they were just culturally more like the interviewers, so easier for them to walk up to on the street. And it may very well be that within each demographic group today Democrats are more likely to respond to a poll than Republicans. And if there is such an effect, it’s hard to correct for it, except by simply discounting Democrats by a certain factor based on past experience. (In fact, these effects can be measured in polling fluctuations, where events in the news lead one side or the other to feel discouraged, and to be less likely to respond to the polls. Studies have suggested that this effect explains much of the short-term fluctuation in election polls during a campaign.)

Interestingly, one of the problems that the commission found with the 1948 polling with relevance for the Trump era was the failure to consider education as a significant demographic variable.

All of the major polling organizations interviewed more people with college education than the actual proportion in the adult population over 21 and too few people with grade school education only.

Putting Covid-19 mortality into context

[Cross-posted with Statistics and Biodemography Research Group blog.]

The age-specific estimates of fatality rates for Covid-19 produced by Riou et al. in Bern have gotten a lot of attention:

Age group           0-9    10-19   20-29   30-39   40-49   50-59   60-69   70-79   80+    Total
Deaths per 1000     .094   .22     .91     1.8     4.0     13      46      98      180    16
Estimated fatality in deaths per thousand cases (symptomatic and asymptomatic)

These numbers looked somewhat familiar to me, having just lectured a course on life tables and survival analysis. Recent one-year mortality rates in the UK are in the table below:

Age group           0-9    10-19   20-29   30-39   40-49   50-59   60-69   70-79   80-89
Deaths per 1000     .012   .17     .43     .80     1.8     4.2     10      28      85
One-year mortality probabilities in the UK, in deaths per thousand population. Neonatal mortality has been excluded from the 0-9 class, and the over-80 class has been cut off at 89.

Depending on how you look at it, the Covid-19 mortality is shifted by a decade, or about double the usual one-year mortality probability for an average UK resident (corresponding to the fact that mortality rates double about every 9 years). If you accept the estimates that around half of the population in most of the world will eventually be infected, and if these mortality rates remain unchanged, this means that effectively everyone will get a double dose of mortality risk this year. Somewhat lower (as a comparison of the two tables suggests) for the younger folk, whereas the over-50s get more like a triple dose.
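
For what it's worth, the ratios can be read straight off the two tables above. The snippet below simply divides one row by the other; I omit the 0-9 class, where both rates are tiny, and match the UK 80-89 class with the Covid 80+ class.

```python
# Ratio of estimated Covid-19 fatality (per thousand cases) to ordinary
# one-year UK mortality (per thousand population), using the tables above.

covid_per_1000 = {"10-19": 0.22, "20-29": 0.91, "30-39": 1.8, "40-49": 4.0,
                  "50-59": 13, "60-69": 46, "70-79": 98, "80+": 180}
uk_annual_per_1000 = {"10-19": 0.17, "20-29": 0.43, "30-39": 0.80, "40-49": 1.8,
                      "50-59": 4.2, "60-69": 10, "70-79": 28, "80+": 85}

for band in covid_per_1000:
    ratio = covid_per_1000[band] / uk_annual_per_1000[band]
    print(f"{band}: {ratio:.1f}x ordinary one-year mortality")
# Roughly a doubling for the under-50s, and three times or more for ages 50-79.
```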

Buddhist causal networks

A little-publicised development in statistics over the past two decades has been the admission of causality into respectable statistical discourse, spearheaded by the computer scientist Judea Pearl. Pearl’s definition (joint with Joseph Halpern) of causation (“X having setting x caused effect E”) has been formulated approximately as follows (a toy illustration in code follows the list):

  • X=x and E occurs.
  • But for the fact that X=x, E would not have occurred.
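
Here is a toy version of that two-condition ("but-for") test in a minimal structural model. It is only a sketch: the full Halpern-Pearl definition also allows contingencies and imposes a minimality condition, neither of which appears here.

```python
# Toy "but-for" check in a tiny structural causal model (illustration only).
# The effect E occurs when either X or Y is set.

def effect(x, y):
    return x or y          # structural equation for E

def but_for_cause(x_actual, y_actual, x_alternative):
    # Condition 1: X = x_actual and E actually occurs.
    actually_occurs = effect(x_actual, y_actual)
    # Condition 2: had X been x_alternative instead (all else unchanged),
    # E would not have occurred.
    counterfactual = effect(x_alternative, y_actual)
    return actually_occurs and not counterfactual

print(but_for_cause(x_actual=True, y_actual=False, x_alternative=False))  # True: X=1 caused E
print(but_for_cause(x_actual=True, y_actual=True,  x_alternative=False))  # False: E is overdetermined by Y,
                                                                          # which is why the full definition
                                                                          # needs contingencies
```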

Of course, Pearl is not the first person to think carefully about causality. He would certainly recognise the similarity to Koch’s postulates on demonstrating disease causation by a candidate microbe:

  1. No disease without presence of the organism;
  2. The organism must be isolated from a host containing the disease;
  3. The disease must arise when the organism is introduced into a healthy animal;
  4. The organism isolated from that animal must be identified as the same original organism.

I was reminded of this recently in reading the Buddhist Assutava Sutta, the discourse on “dependent co-arising”, where this formula (that also appears in very similar wording in a wide range of other Buddhist texts) is stated:

When this is, that is;

This arising, that arises;

When this is not, that is not;

This ceasing, that ceases.

Trump supporters are ignoring the base (rate) — Or, Ich möcht’ so gerne wissen, ob Trumps erpressen

One of the key insights from research on decision-making — from Tversky and Kahneman, Gigerenzer, and others — is the “base rate fallacy”: in judging new evidence people tend to ignore the underlying (prior) likelihood of various outcomes. A famous example, beloved of probability texts and lectures, is the reasonably accurate — 99% chance of a correct result — test for a rare disease (1 in 10,000 in the population). A randomly selected person with a positive test has a 99% chance of not having the disease, since correct positive tests on the 1 in 10,000 infected individuals are far less common than false positive tests on the other 9,999.
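
The calculation behind that claim is a short application of Bayes' rule, using the figures quoted above (99% accuracy for both positive and negative results, prevalence 1 in 10,000):

```python
# Base-rate computation: P(disease | positive test).

prevalence = 1 / 10_000
sensitivity = 0.99          # P(positive | disease)
false_positive_rate = 0.01  # 1 - specificity

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(p_disease_given_positive)       # about 0.0098
print(1 - p_disease_given_positive)   # about 0.99: a positive test still leaves
                                      # a ~99% chance of NOT having the disease
```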

This seems to fit into a more general pattern of prioritising new and/or private information over public information that may be more informative, or at least more accessible. Journalists are conspicuously prone to this bias. For instance, as Brexit blogger Richard North has lamented repeatedly, UK journalists would breathlessly hype the latest leaks of government planning documents revealing the extent of adjustments that would be needed for phytosanitary checks at the border, or for aviation, when the same information had been available for a year in official planning documents on the European Commission website. This psychological bias was famously exploited by WWII British intelligence operatives in Operation Mincemeat, where they dropped a corpse stuffed with fake plans for an invasion of Greece and Sardinia (rather than the actual target, Sicily) into the sea, where they knew it would wind up on the shore in Spain. They knew that the Germans would take the information much more seriously if they thought they had found it covertly. In my own experience of undergraduate admissions at Oxford I have found it striking the extent to which people consider what they have seen in a half-hour interview to be the deep truth about a candidate, outweighing the evidence of examinations and teacher evaluations.

Which brings us to Donald Trump, who has been accused of colluding with foreign governments to defame his political opponents. He has colluded both in private and in public. He famously announced in a speech during the 2016 election campaign, “Russia, if you’re listening, I hope you’re able to find the 30,000 emails that are missing. I think you will probably be rewarded mightily by our press.” And just the other day he said “I would think that if [the Ukrainian government] were honest about it, they’d start a major investigation into the Bidens. It’s a very simple answer. They should investigate the Bidens because how does a company that’s newly formed—and all these companies—and by the way, likewise, China should start an investigation into the Bidens because what happened in China is just about as bad as what happened with Ukraine.”

It seems pretty obvious. But no, that’s public information. Trump has dismissed his appeal to Russia as “a joke”, and just yesterday Senator Marco Rubio contended that the fact that the appeal to China was so blatant and public shows that it probably wasn’t “real”, that Trump was “just needling the press knowing that you guys are going to get outraged by it.” The private information is, of course, being kept private, and there seems to be a process by which formerly shocking secrets are moved into the public sphere gradually, so that they slide imperceptibly from being “shocking if true” to “well-known, hence uninteresting”.

I am reminded of the epistemological conundrum posed by the Weimar-era German cabaret song, “Ich möcht’ so gern wissen, ob sich die Fische küssen”:

Ich möcht’ so gerne wissen
Ob sich die Fische küssen –
Unterm Wasser sieht man’s nicht
Na, und überm Wasser tun sie’s nicht!

I would so like to know
if fish sometimes kiss.
Underwater we can’t see it.
And out of the water they never do it.

The power of baselines

From today’s Guardian:


It took decades to establish that smoking causes lung cancer. Heavy smoking increases the risk of lung cancer by a factor of about 11, the largest risk ratio for any common risk factor for any disease. But that doesn’t make it peculiar that there should be any non-smokers with lung cancer.

As with my discussion of the horrified accounts of obesity someday overtaking smoking as a cause of cancer, the main driver here is the change in the baseline level of smoking. As fewer people smoke, and as non-smokers stubbornly continue to age and die, the proportional mortality of non-smokers will inevitably increase.

It is perfectly reasonable to say we should consider diverting public-health resources from tobacco toward other causes of disease, as the fraction of disease caused by smoking declines. And it’s particularly of concern for physicians, who tend toward essentialism in their view of risk factors — “lung cancer is a smoker’s disease” — to the neglect of base rates. But the Guardian article frames the lung cancer deaths in non-smokers as a worrying “rise”:

They blame the rise on car fumes, secondhand smoke and indoor air pollution, and have urged people to stop using wood-burning stoves because the soot they generate increases risk… About 6,000 non-smoking Britons a year now die of the disease, more than lose their lives to ovarian or cervical cancer or leukaemia, according to research published on Friday in the Journal of the Royal Society of Medicine.

While the scientific article they are reporting on never explicitly says that lung cancer incidence in non-smokers [LCINS] is increasing, certainly some fault for the confusion may be found there:

the absolute numbers and rates of lung cancers in never-smokers are increasing, and this does not appear to be confounded by passive smoking or misreported smoking status.

This sounds like a serious matter. Except, the source they cite a) doesn’t provide much evidence of this and b) is itself 7 years old, and only refers to evidence that dates back well over a decade. It cites one study that found an increase in LCINS in Swedish males in the 1970s and 1980s, a much larger study that found no change over time in LCINS in the US between 1959 and 2004, and a French study that found rates increasing in women and decreasing in men, concluding finally

An increase in LCINS incidence could be real, or the result of the decrease in the proportion of ever smokers in some strata of the general population, and/or ageing within these categories.

What proportion of lung cancers should we expect to be found in non-smokers? With the 11:1 risk ratio and the current 15% smoking rate in the UK adult population, non-smokers should account for about 85/(85 + 15×11) ≈ 34% of lung cancers, roughly a third. Why is the observed share only about a sixth, then? Because the effect of smoking on lung cancer is delayed: it is estimated that lung cancer typically develops after about 30 years of smoking. If we look back at the roughly 35% smoking prevalence of the mid-1980s, we get an estimate of about 65/(65 + 35×11) ≈ 14%, much closer to what we see.
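
The arithmetic can be packaged in a couple of lines; the 11:1 relative risk and the two smoking prevalences are the figures quoted above.

```python
# Expected share of lung cancers occurring in never-smokers, given the relative
# risk for smokers and the smoking prevalence in the population.

def never_smoker_share(smoking_prevalence, relative_risk=11):
    smokers, non_smokers = smoking_prevalence, 1 - smoking_prevalence
    return non_smokers / (non_smokers + smokers * relative_risk)

print(never_smoker_share(0.15))  # ~0.34: what today's 15% smoking rate would imply
print(never_smoker_share(0.35))  # ~0.14: using the ~35% prevalence of the mid-1980s,
                                 # close to the roughly one-sixth share observed
```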

Medical hype and under-hype

New heart treatment is biggest breakthrough since statins, scientists say

I just came across this breathless headline published in the Guardian from last year. On the one hand, this is just one study, the effect was barely statistically significant, and experience suggests a fairly high likelihood that this will ultimately have no effect on general medical practice or on human health and mortality rates. I understand the exigencies of the daily newspaper publishing model, but it’s problematic that the “new research study” has been defined as the event on which to hang a headline. The only people who need that level of up-to-the-minute detail are those professionally involved in winnowing out the new ideas and turning them into clinical practice. We would all be better served if newspapers instead reported on what new treatments have actually had an effect over the last five years. That would be just as novel to the general readership, and far less erratic.

On the other hand, I want to comment on one point of what I see as exaggerated skepticism: The paragraph that summarises the study results says

For patients who received the canakinumab injections the team reported a 15% reduction in the risk of a cardiovascular event, including fatal and non-fatal heart attacks and strokes. Also, the need for expensive interventional procedures, such as bypass surgery and inserting stents, was cut by more than 30%. There was no overall difference in death rates between patients on canakinumab and those given placebo injections, and the drug did not change cholesterol levels.

There is then a quote:

Prof Martin Bennett, a cardiologist from Cambridge who was not involved in the study, said the trial results were an important advance in understanding why heart attacks happen. But, he said, he had concerns about the side effects, the high cost of the drug and the fact that death rates were not better in those given the drug.

In principle, I think this is a good thing. There are far too many studies that show a treatment scraping out a barely significant reduction in mortality due to one cause, which is highlighted, but a countervailing mortality increase due to other causes, netting out to essentially no improvement. Then you have to say, we really should be aiming to reduce mortality, not to reduce a cause of mortality. (I remember many years ago, a few years after the US started raising the age for purchasing alcohol to 21, reading of a study that was heralded as showing the success of this approach, having found that the number of traffic fatalities attributed to alcohol had decreased substantially. Unfortunately, the number of fatalities not attributed to alcohol had increased by a similar amount, suggesting that some amount of recategorisation was going on.) Sometimes researchers will try to distract attention from a null result for mortality by pointing to a secondary endpoint — improved results on a blood test linked to mortality, for instance — which needs to be viewed with some suspicion.

In this case, though, I think the skepticism is unwarranted. There is no doubt that before the study the researchers would have predicted reduction in mortality from cardiovascular causes, no reduction due to any other cause, and likely an increase due to infection. The worry would be that the increase due to infection — or to some unanticipated side effect — would outweigh the benefits.

The results confirmed the best-case predictions. Cardiovascular mortality was reduced — possibly a lot, possibly only slightly. Deaths due to infections increased significantly in percentage terms, but the numbers were small relative to the cardiovascular improvements. The one big surprise was a very substantial reduction in cancer mortality. The researchers are open about not having predicted this, and not having a clear explanation. In such a case, it would be wrong to put much weight on the statistical “significance”, because it is impossible to quantify the class of hypotheses that are implicitly being ignored. The proper thing is to highlight this observation for further research, as they have properly done.

When you set aside these three groups of causes — cardiovascular, infections, cancer — you are left with approximately equal mortality rates in the placebo and treatment groups, as expected. So there is no reason to be “concerned” that overall mortality was not improved in those receiving the drug. First of all, overall mortality was better in the treatment group. It’s just that the improvement in CV mortality — as predicted — while large enough to be clearly non-random when compared with the overall number of CV deaths, was not large compared with the much larger total number of deaths. This is no more “concerning” than it would be, when reviewing a programme for improving airline safety, to discover that it did not appreciably change the total number of transportation-related fatalities.

The Silver Standard 4: Reconsideration

After writing in praise of the honesty and accuracy of fivethirtyeight’s results, I felt uncomfortable about the asymmetry in the way I’d treated Democrats and Republicans in the evaluation. In the plots I made, low-probability Democratic predictions that went wrong pop out on the left-hand side, whereas low-probability Republican predictions  that went wrong would get buried in the smooth glide down to zero on the right-hand side. So I decided, what I’m really interested in are all low-probability predictions, and I should treat them symmetrically.

For each district there is a predicted loser (PL), with probability smaller than 1/2. In about one third of the districts the PL was assigned a probability of 0. The expected number of PLs (EPL) who would win is simply the sum of all the predicted win probabilities that are smaller than 1/2. (Where multiple candidates from the same party are in the race, I’ve combined them.) The 538 EPL was 21.85. The actual number of winning PLs was 13.

What I am testing is whether 538 made enough wrong predictions. This is the opposite of the usual evaluation, which gives points for getting predictions right. But measured against their own stated probabilities, the number of districts that went the opposite of the way they said was distinctly lower than their model implied. That is prima facie evidence that the PL win probabilities were being padded somewhat. To be more precise, under the 538 model the number of winning PLs should be approximately Poisson distributed with parameter 21.85, which puts the probability of 13 or fewer PLs winning at about 0.030. That is fairly unlikely under their own forecast, though the performance is still pretty impressive, given all the complications of the prediction game.
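
The tail probability is a one-liner. The 21.85 is the sum of the predicted-loser probabilities described above (the district-level probabilities themselves are not reproduced here), and 13 is the observed number of upsets.

```python
# How surprising are so few upsets, if the forecast probabilities are honest?
# Under a Poisson approximation with mean equal to the expected number of
# upsets, compute P(13 or fewer upsets).
from scipy.stats import poisson

expected_upsets = 21.85   # sum of predicted-loser win probabilities (538 Deluxe)
observed_upsets = 13

print(poisson.cdf(observed_upsets, expected_upsets))   # ~0.03
```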

Below I show plots of the errors for various scenarios, measuring the cumulative error for these symmetric low predictions. (I’ve added an “Extra Tarnished” scenario, with the transformation based on the even more extreme beta(.25,.25).) I show it first without adjusting for the total number of predicted winning PLs:

[Plot: cumulative prediction error for low-probability (predicted-loser) predictions under each scenario, without normalising the totals]

We see that tarnished predictions predict a lot more PL victories than we actually see. The actual 538 predictions overshoot only slightly more than you would expect, but suspiciously one-sidedly — that is, all in the direction of overpredicting PL victories, consistent with padding the margins slightly, erring in the direction of claiming uncertainty.

And here is an image more like the ones I had before, where all the predictions are normalised to correspond to the same number of predicted wins:

[Plot: the same comparison with every scenario normalised to the same total number of predicted PL wins]

The Silver Standard, Part 3: The Reckoning

One of the accusations most commonly levelled against Nate Silver and his enterprise is that probabilistic predictions are unfalsifiable. “He never said the Democrats would win the House. He only said there was an 85% chance. So if they don’t win, he has an out.” This is true only if we focus on the top-level prediction, and ignore all the smaller predictions that went into it. (Except in the trivial sense that you can’t say it’s impossible that a fair coin just happened to come up heads 20 times in a row.)

So, since Silver can be tested, I thought I should see how 538’s predictions stood up in the 2018 US House election. I took their predictions of the probability of victory for a Democratic candidate in all 435 congressional districts (I used their “Deluxe” prediction) from the morning of 6 November. (I should perhaps note here that one third of the districts had estimates of 0 (31 districts) or 1 (113 districts), so a victory for the wrong candidate in any one of these districts would have been a black mark for the model.) I ordered the districts by the predicted probability, to compute the cumulative predicted number of seats, starting from the smallest. I plot them against the cumulative actual number of seats won, taking the current leader for the winner in the 11 districts where there is no definite decision yet.

[Plot: cumulative predicted vs actual Democratic seats, districts ordered from most to least Republican]

The predicted number of seats won by Democrats was 231.4, impressively close to the actual 231 won. But that’s not the standard we are judging them by, and in this plot (and the ones to follow) I have normalised the predicted and observed totals to be the same. I’m looking at the cumulative fractions of a seat contributed by each district. If the predicted probabilities are accurate, we would expect the plot (in green) to lie very close to the line with slope 1 (dashed red). It certainly does look close, but the scale doesn’t make it easy to see the differences. So here is the plot of the prediction error, the difference between the red dashed line and the green curve, against the cumulative prediction:
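
For concreteness, the construction might look something like the sketch below, with `prob` and `won` standing in for the actual 538 "Deluxe" probabilities and the election results; the values shown are invented placeholders.

```python
# Sketch of the cumulative-prediction construction described above.
import numpy as np

prob = np.array([0.02, 0.10, 0.45, 0.80, 0.97])   # predicted P(Democratic win), hypothetical values
won = np.array([0, 0, 1, 1, 1])                   # actual outcomes (1 = Democrat won)

order = np.argsort(prob)                  # most Republican districts first
pred = np.cumsum(prob[order])             # cumulative predicted Democratic seats
actual = np.cumsum(won[order])            # cumulative actual Democratic seats

# Normalise so the predicted and observed totals agree, as in the plots,
# then look at the prediction error along the way.
pred_norm = pred * actual[-1] / pred[-1]
error = pred_norm - actual
print(error)
```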

[Plot: cumulative prediction error (predicted minus actual) against cumulative predicted seats]

There certainly seems to have been some overestimation of Democratic chances at the low end, leading to a maximum cumulative overprediction of about 6 (which comes at district 155, that is, the 155th most Republican district). It’s not obvious whether these differences are worse than you would expect. So in the next plot we make two comparisons. The red curve replaces the true outcomes with simulated outcomes, where we assume the 538 probabilities are exactly right. This is the best-case scenario. (We only plot it out to 100 cumulative seats, because the action is all at the low end; the last 150 districts have essentially no randomness.) The red curve and the green curve look very similar, except that in the simulation the direction of the error is random. The most extreme error in the simulated election result is a bit more than 5.

What would the curve look like if Silver had cheated, by trying to make his predictions all look less certain, to give himself an out when they go wrong? We imagine an alternative psephologist, call him Nate Tarnished, who has access to the exact true probabilities for Democrats to win each district, but who hedges his bets by reporting a probability closer to 1/2. (As an example, we take the cumulative beta(1/2,1/2) distribution function. This leaves 0, 1/2, and 1 unchanged, but .001 would get pushed up to .02, .05 is pushed up to .14, and .2 becomes .3. Similarly, .999 becomes .98 and .8 drops to .7. Not huge changes, but enough to create more wiggle room after the fact.)
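
In code, the hedging transformation is just the beta(1/2,1/2) CDF applied to each forecast probability; an "Extra Tarnished" variant would use beta(.25,.25) instead. A minimal sketch:

```python
# Push each forecast probability toward 1/2 via the beta(1/2, 1/2) CDF.
import numpy as np
from scipy.stats import beta

def tarnish(p, a=0.5, b=0.5):
    return beta.cdf(p, a, b)

probs = np.array([0.001, 0.05, 0.2, 0.5, 0.8, 0.999])
print(np.round(tarnish(probs), 3))
# approximately [0.02, 0.144, 0.295, 0.5, 0.705, 0.98], the values quoted above
```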

In this case, we would expect to accumulate much more excess cumulative predicted probability on the left side. And this is exactly what we illustrate with the blue curve, where the error repeatedly rises nearly to 10, before slowly declining to 0.

[Plot: cumulative prediction error for the actual 538 predictions (green), a simulation assuming the 538 probabilities are exactly right (red), and the hedged "Nate Tarnished" predictions (blue)]

I’d say the performance of the 538 models in this election was impressive. A better test would be to look at the predicted vote shares in all 435 districts. This would require that I manually enter all of the results, since they don’t seem to be available to download. Perhaps I’ll do that some day.

Obesity and cancer

The Guardian has prominently posted a report by Cancer Research UK with a frightening headline:

Obesity to eclipse smoking as biggest cause of cancer in UK women by 2043

That’s pretty sensational. I was intrigued, because the mortality effects of obesity have long interested me. It seems like I’ve been hearing claims for decades, loudly trumpeted in the press, that obesity is turning into a health crisis, with the mortality crisis just around the corner. It seems plausible, and yet every time I try to dig into one of these reports, to find out what the estimates are based on, I come up empty. Looking at the data naively, it seems that the shift from BMI 20 to BMI 25 — the threshold of official “overweight” designation — has been associated in the past with a reduction in all-cause mortality. Passing through overweight to “obesity” at BMI 30 raises mortality rates only very slightly. Major increases in mortality seem to be associated with BMI over 35 or 40, but even under current projections those levels remain rare in nearly all populations.

There is a chain of reasoning that goes from obesity to morbid symptoms like high blood pressure and diabetes, to mortality, but this is fairly indirect, and ignores the rapid improvement in treatments for these secondary symptoms, as well as the clear historical association between increasing childhood nutrition and improved longevity. Concerned experts often attribute the reduction in mortality at low levels of “overweight” to errors in study design — such as confusing weight loss due to illness with healthy low weight — which has indeed been a problem, while negative health consequences attributable to weight-loss diets tend to be ignored. All in all, it has always seemed to be a murky question, leaving me genuinely puzzled by the quantitative certainty with which catastrophe is predicted. Clearly increasing obesity isn’t helping people’s health — the associated morbidity is a real thing, even if it isn’t shortening people’s lives much — but I’m perplexed by the quantitative claims about mortality.

So, I thought, if obesity is causing cancer, as much as tobacco is, that’s a pretty convincing piece of the mortality story. And then I followed up the citations, and the sand ran through my fingers. Here are some problems:

  1. Just to begin with, the convergence of cancers attributable to smoking with cancers attributable to obesity is almost entirely attributable to the reduction in smoking. “By 2043 smoking may have been reduced to the point that it is no longer the leading cause of cancer in women” seems like a less alarming possible headline. Here’s the plot from the CRUK report:
    [Plot from the CRUK report: projected numbers of cancers in UK women attributable to smoking and to excess weight]
  2. The report entirely conflates the categories “overweight” and “obese”. The formula they cite refers to different levels of exposure, so it is likely they have separated them out in their calculations, but it is not made clear.
  3. The relative risk numbers seem to derive primarily from this paper. There we see a lot of other causes of cancer, such as occupation, alcohol consumption, and exposure to UV radiation, all of which are of similar magnitude to weight. Occupational exposure is about as significant for men as obesity, and more amenable to political control, but is ignored in this report. Again, the real story is that the number of cancers attributable to smoking may be expected to decline over the next quarter century, to something more like the number caused by multiple existing moderate causes.
  4. Breast cancer makes up a huge part of women’s cancer risk, hence a huge part of the additional risk attributed to overweight, and hence presumably is the main explanation for why women’s additional risk due to overweight is so much higher than men’s. The study seems to estimate the additional breast cancer risk due to smoking at 0. This seems implausible. No papers are cited on breast cancer risk and smoking, possibly because of the focus on British statistics, but here is a very recent study finding a very substantial increase. And here is a meta-analysis.
  5. The two most common cancers attributable to obesity in women — cancer of the breast and uterus — are among the most survivable, with ten-year survival above 75%. (Survival rates here.) The next two on the list would be bowel and bladder cancer, with ten-year survival above 50%. The cancer caused by smoking, on the other hand, is primarily lung cancer, with ten-year survival around 7%, followed by oesophageal (13%), pancreatic (1%), bowel and bladder. Combining all of these different neoplasms into a risk of “cancer”, and then comparing the risk due to obesity with that due to smoking, is deeply misleading.

UPDATE: My letter to the editor appeared in The Guardian.