The first principle of statistical inference

When I first started teaching basic statistics, I thought about how to explain the importance of statistical hypothesis testing. I focused on a textbook example (specifically, Freedman, Pisani, Purves, Statistics, 3rd ed., sec. 28.2) of a data set that seems to show that women are more likely than men to be right-handed. I pointed out that we could think of many possible explanations: girls are pressured more to conform; women are more rational, hence left-brain-centred. But before we invest too much time and credibility in abstruse theories to explain the phenomenon, we should first make sure that the phenomenon is real, that it’s not just the kind of fluctuation that could happen by accident. (It turns out that the phenomenon is real. I don’t know if either of my explanations is valid, or if anyone has a more plausible theory.)

I thought of this when I heard about the strange Oxford-AstraZeneca vaccine serendipity that was announced this week. It was the third vaccine success announced in as many weeks: the researchers reported about 70% efficacy, which is good, but not nearly as impressive as the 95% efficacy of the mRNA vaccines announced earlier in the month. But the strange thing was, they found that a subset of the test subjects, who received only a half dose at the first injection and a full dose later, showed 90% efficacy. Experts have been all over the news media trying to explain how some weird idiosyncrasies of the human immune system and the chimpanzee adenovirus vector could make a smaller dose more effective. Here’s a summary from Science:

Researchers especially want to know why the half-dose prime would lead to a better outcome. The leading hypothesis is that people develop immune responses against adenoviruses, and the higher first dose could have spurred such a strong attack that it compromised the adenovirus’ ability to deliver the spike gene to the body with the booster shot. “I would bet on that being a contributor but not the whole story,” says Adrian Hill, director of Oxford’s Jenner Institute, which designed the vaccine…

Some evidence also suggests that slowly escalating the dose of a vaccine more closely mimics a natural viral infection, leading to a more robust immune response. “It’s not really mechanistically pinned down exactly how it works,” Hill says.

Because the different dosing schemes likely led to different immune responses, Hill says researchers have a chance to suss out the mechanism by comparing vaccinated participants’ antibody and T cell levels. The 62% efficacy, he says, “is a blessing in disguise.”

Others have pointed out that the populations receiving the full dose and the half dose were substantially different: The half dose was given by accident to a couple of thousand subjects at the start of the British arm of the study. These were exclusively younger, healthier individuals, something that could also explain the higher efficacy, in a less benedictory fashion.

But before we start arguing over these very interesting explanations, much less trying to use them to “suss out the mechanisms”, the first question to ask is: is the effect real? The Science article quotes immunologist John Moore asking “Was that a real, statistically robust 90%?” To ask that question is to answer it resoundingly: No.

They haven’t provided much data, but the AstraZeneca press release does give enough clues:

One dosing regimen (n=2,741) showed vaccine efficacy of 90% when AZD1222 was given as a half dose, followed by a full dose at least one month apart, and another dosing regimen (n=8,895) showed 62% efficacy when given as two full doses at least one month apart. The combined analysis from both dosing regimens (n=11,636) resulted in an average efficacy of 70%. All results were statistically significant (p<=0.0001).

Note two tricks they play here. First, they give those (n=big number) figures, which make it seem reassuringly as though they have an impressively big study. But these are the numbers of people vaccinated, which is completely irrelevant for judging the uncertainty in the estimate of efficacy. The reason you need such huge numbers of subjects is so that you can get moderately large numbers where it counts: the number of subjects who become infected. Second, while it is surely true that the “results” were highly statistically significant (that is, the efficacy in each individual group was not zero), this tells us nothing about whether we can be confident that the efficacy is actually higher than what has been considered the minimum acceptable level of 50%, or, crucially for the point at issue here, whether the two groups were different from each other.

They report a total of 131 cases. They don’t say how many cases were in each group, but if we assume that equal numbers of subjects received the vaccine and the placebo in each group, then we can back-calculate the rest. We end up with 98 cases in the full-dose group (of which 27 received the vaccine) and 33 cases in the half-dose group, of which 3 received the vaccine. Just 33! Using the Clopper-Pearson exact method, we obtain 90% confidence intervals of (.781, .975) for the efficacy of the half dose and (.641, .798) for the efficacy of the full dose. Clearly some overlap there, and not much to justify drawing substantive conclusions from the difference between the two groups, which may actually be zero, or close to it.
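For anyone who wants to check the arithmetic, here is a minimal sketch in Python of the back-calculation, assuming 1:1 allocation between vaccine and placebo within each group. The function names and the transformation to the efficacy scale are my own choices; the exact interval endpoints depend on which scale you compute them on, so they may differ somewhat from the figures above.

    from scipy.stats import beta

    def clopper_pearson(x, n, conf=0.90):
        # Exact (Clopper-Pearson) confidence interval for a binomial
        # proportion, from x successes out of n trials.
        alpha = 1 - conf
        lo = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
        return lo, hi

    def efficacy_interval(vaccine_cases, total_cases, conf=0.90):
        # With 1:1 allocation, the number of vaccine cases among all cases is
        # Binomial(total, p) with p = RR / (1 + RR), where RR is the relative
        # risk; hence efficacy = 1 - RR = (1 - 2p) / (1 - p).
        lo_p, hi_p = clopper_pearson(vaccine_cases, total_cases, conf)
        eff = lambda p: (1 - 2 * p) / (1 - p)
        return eff(hi_p), eff(lo_p)  # efficacy decreases as p increases

    print(efficacy_interval(3, 33))   # half-dose group: 3 of 33 cases vaccinated
    print(efficacy_interval(27, 98))  # full-dose group: 27 of 98 cases vaccinated

The point estimates come out to exactly 90% and about 62%, matching the press release, and the intervals overlap substantially however the endpoints are computed.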

The return of quota sampling

Everyone knows about the famous Dewey Defeats Truman headline fiasco, and that the Chicago Daily Tribune was inspired to its premature announcement by erroneous pre-election polls. But why were the polls so wrong?

The Social Science Research Council set up a committee to investigate the polling failure. Their report, published in 1949, listed a number of faults, including disparaging the very notion of trying to predict the outcome of a close election. But one important methodological criticism — and the one that significantly influenced the later development of political polling, and became the primary lesson in statistics textbooks — was the critique of quota sampling. (An accessible summary of lessons from the 1948 polling fiasco by the renowned psychologist Rensis Likert was published just a month after the election in Scientific American.)

Serious polling at the time was divided between two general methodologies: random sampling and quota sampling. Random sampling, as the name implies, works by attempting to select from the population of potential voters entirely at random, with each voter equally likely to be selected. This was still considered too theoretically novel to be widely used, whereas quota sampling had been established by Gallup since the mid-1930s. In quota sampling the voting population is modelled by demographic characteristics, based on census data, and each interviewer is assigned a quota of respondents to fill in each category: 51 women and 49 men, say, or a certain number in the age range 21-34, or specific numbers in each “economic class” (Roper, for example, had five, one of which in the 1940s was “Negro”). The interviewers were allowed great latitude in filling their quotas, finding people at home or on the street.

In a sense, we have returned to quota sampling, in the more sophisticated guise of “weighted probability sampling”. Since hardly anyone responds to a survey (response rates are typically no more than about 5%), there’s no way the people who do respond can be representative of the whole population. So pollsters model the population, or the supposed voting population, and reweight the responses they do get proportionately, according to demographic characteristics. If Black women over age 50 are thought to be as common in the voting population as white men under age 30, but we have twice as many of the former as the latter in our sample, we count each response of the latter twice as much as the former in the final estimates, as in the sketch below. It’s just a way of making a quota sample after the fact, without the stress of specifically looking for representatives of particular demographic groups.
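Here is a minimal sketch of that reweighting arithmetic in Python, with made-up group names, population shares, and survey numbers purely for illustration:

    # Each respondent is weighted by
    # (population share of their group) / (sample share of their group).
    population_share = {"black_women_over_50": 0.10, "white_men_under_30": 0.10}
    sample_counts    = {"black_women_over_50": 200,  "white_men_under_30": 100}
    support          = {"black_women_over_50": 0.60, "white_men_under_30": 0.40}

    n = sum(sample_counts.values())
    weights = {g: population_share[g] / (sample_counts[g] / n)
               for g in sample_counts}

    # Weighted estimate of, say, the share supporting a candidate:
    estimate = (sum(weights[g] * sample_counts[g] * support[g] for g in weights)
                / sum(weights[g] * sample_counts[g] for g in weights))
    print(weights, estimate)

With these numbers the under-sampled white men get twice the weight of the over-sampled Black women, and the weighted estimate comes out to 50%, where the raw sample would have said about 53%.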

Consequently, it has most of the deficiencies of a quota sample. The difficulty of modelling the electorate is one that has received quite a bit of attention in the modern context: We know fairly precisely how demographic groups are distributed in the population, but we can only theorise about how they will be distributed among voters at the next election. At the same time, it is straightforward to construct these theories, to describe them, and to test them after the fact. The more serious problem, the one that was emphasised in the committee’s report on the 1948 polls but has been less emphasised recently, is in the nature of how the quotas are filled. The reason for probability sampling is that taking whichever respondents are easiest to get (a “sample of convenience”) is sure to give you a biased sample. If you sample people from telephone directories, as the Literary Digest famously did in 1936, it’s easy to see how the sample ends up biased against the favoured candidate of the poor. If you take a sample of convenience within a small demographic group, such as middle-income people, it won’t be easy to recognise how the sample is biased, but it may still be biased.

For whatever reason, in the 1930s and 1940s, within each demographic group the Republicans were easier for the interviewers to contact than the Democrats. Maybe they were just culturally more like the interviewers, so easier for them to walk up to on the street. And it may very well be that within each demographic group today Democrats are more likely to respond to a poll than Republicans. And if there is such an effect, it’s hard to correct for it, except by simply discounting Democrats by a certain factor based on past experience. (In fact, these effects can be measured in polling fluctuations, where events in the news lead one side or the other to feel discouraged, and to be less likely to respond to the polls. Studies have suggested that this effect explains much of the short-term fluctuation in election polls during a campaign.)

Interestingly, one of the problems that the committee found with the 1948 polling, one with obvious relevance for the Trump era, was the failure to consider education as a significant demographic variable:

All of the major polling organizations interviewed more people with college education than the actual proportion in the adult population over 21 and too few people with grade school education only.

Exotic animal farming

I remember when people were muttering about Covid-19 being all the fault of the weird Chinese and their weird obsession with eating weird animals like pangolins.

So now we have a second version of Covid, one that may start a completely novel pandemic, and it comes from the weird Europeans and their weird obsession with wearing the fur of weird animals like minks. Apparently, it was well known that Covid was spreading widely among the minks, but the animals were too valuable to give up on, so they tried to get away with culling just the obviously sick ones. And now we can only hope that they can get this new plague out of Denmark under control before it becomes a second pandemic.

But the people who advocate just giving up on eating and wearing animals are still treated as something between dreamy mystics and lunatics…