The model didn’t fail us, we failed the model

THE ALGORITHM. It’s all anyone can talk about when they’re talking about universities these days. It illustrates the unique ability of the current UK government to take a challenging societal problem in hand and transform it into a flaming chaos that simultaneously exacerbates divisions and satisfies no one.

In this case, it’s about the assignment of marks for A-levels (taken by 18-year-olds), released last week, and GCSEs (taken by 16-year-olds), still ahead this week. Scotland had its own small version of the fiasco play out earlier in the week over its Scottish Higher exams, but the UK government, responsible for English A-levels, not only failed to learn from the Scottish situation and change course early, it managed to parlay the political challenge into a systemic disaster for higher education that will now roll on for at least the next year or two.

Like any great governing disaster, this one has been years in the making. Pupils doing A-levels used to have intermediate exams — AS levels — after the first year of their two-year course, as well as significant amounts of coursework that counted for a substantial portion of their final marks. Over the past decade AS levels were progressively eliminated and coursework reduced in England (but not Wales), as the Conservatives seem to have believed that today’s pupils were being inappropriately coddled by having too little stress and uncontrollable randomness in their lives. That left several weeks of exams right at the end of the course as the sole determinant of the marks that would decide the high-stakes competition for university places. Then, in a panicked response to the first wave of Covid-19, they cancelled the exams. Leaving them with nothing.

Weirdly, it’s not as though they don’t have frequent exams during their school (and university) time. But these exams are called “mock exams”, and don’t count for anything in the end.

Which brings us to THE ALGORITHM. How do you assign marks to students when you don’t have any exams? Teachers have quite a lot of information, even if it doesn’t formally count for anything in the regular process. (Weirdly, teachers are regularly expected to produce “predicted grades” based on mock exams, coursework, and general impressions, because the official marks arrive too late for university admissions.) But on average they tend to be overly optimistic — or, one might also say, either generous or strategic, since the university admission offer that results from an overpredicted A-level grade is not necessarily withdrawn when the exam result falls short of it, whereas the university place that is lost to an underprediction is almost impossible to make good.

If you were a mindless machine-learning bot trying to optimise the accuracy of prediction of missing marks in an overall minimum-mean-error way, you would take data about each student’s family income, ethnicity, sex, parents’ occupations, and region, all of which are likely to be correlated to exam scores. But that would seem outrageously biased: Why should the young person with wealthy parents get higher A-level marks than the one with poor parents, after they had the same mock exam grades? The machine-learning answer is, because that’s what’s happened with real grades in the past. The wealthy family is likely to provide more support, maybe tutors, a more stable environment for studying for the exams. The child of the poor family may have been working hard since year 12, but there’s a much higher chance that the family would have had a crisis — maybe a parent losing a job, illness, homelessness — that would have distracted from exam preparation and led to underperformance at the exam. And since that might have happened in reality, that needs to be reflected in our optimal prediction algorithm.

But that looks bad, so the Ofqual boffins used past school performance as a proxy. Effectively, they said that each school gets the same marks this year that they got last year. Teacher evaluations were used to rank the students in each subject, to decide which students get the school’s quota of A*’s, etc. If you made the bad life choice to go to a low-performing school where no one in living memory has scored better than B in chemistry, then B is the ceiling for your marks, no matter what scores you may personally have been achieving on your mockeries.
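The mechanism described above can be sketched in a few lines. This is a deliberately simplified illustration, not Ofqual’s actual model (which also adjusted for the cohort’s prior attainment and applied a national correction); the function name and the toy data are invented.

```python
def assign_by_school_history(ranked_students, historical_grades):
    """Hand this year's cohort last year's grade distribution.

    ranked_students: teacher ranking, best first.
    historical_grades: last year's grades at this school in this
        subject, best first.
    """
    n = len(ranked_students)
    results = {}
    for i, student in enumerate(ranked_students):
        # Stretch last year's distribution over this year's ranking.
        j = min(i * len(historical_grades) // n, len(historical_grades) - 1)
        results[student] = historical_grades[j]
    return results

cohort = ["Asha", "Ben", "Chloe", "Dev"]   # teacher ranking, invented names
last_year = ["B", "B", "C", "D"]           # no one got above a B last year
results = assign_by_school_history(cohort, last_year)
print(results)   # Asha is capped at B, whatever her mocks said
```

Note the key property: no student can receive a grade that their school did not produce last year.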

Averaged over the whole population of English students, your misfortune is just a small blemish on an overall excellent prediction.

It’s a good illustration of the problems of ethical machine learning. People say, if you don’t want your algorithm to be biased based on gender, don’t include gender information in the dataset. But if you instead include height information, say, the algorithm will learn all the gender bias in the training set and assign it to the height variable.
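A small simulation makes the proxy problem concrete. All the numbers here are invented: the outcome depends only on a protected attribute, and a regression that “fairly” omits that attribute still reproduces a large share of the bias through a correlated proxy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)                      # protected attribute (0 or 1)
height = 165 + 12 * group + rng.normal(0, 6, n)    # proxy, correlated with group
score = 60 + 5 * group + rng.normal(0, 3, n)       # true 5-point bias by group

# "Fairness through unawareness": regress on height only, group omitted.
slope, intercept = np.polyfit(height, score, 1)
pred = slope * height + intercept

gap = pred[group == 1].mean() - pred[group == 0].mean()
print(f"predicted between-group gap: {gap:.2f}")
# Roughly half of the 5-point bias survives, now attributed to height.
```

The bias is attenuated by the noise in the proxy, but it does not disappear; the only way the model can exploit the group-score correlation in the training data is to load it onto height, so it does.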

Just to rub salt in the wounds, there was an extra fillip for students in small — heavily private — schools: Since average performance fluctuates more in small groups, courses with 15 or fewer students had their (generally higher) teacher predictions more heavily weighted in their final marks, and those with 5 or fewer received their teacher predictions unfiltered.
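The size-dependent rule might be sketched like this; the thresholds (15 and 5) are as reported, but the linear ramp between them is my own invention for illustration.

```python
def teacher_weight(cohort_size):
    """Weight given to teacher-assessed grades vs. the statistical model.

    Thresholds of 5 and 15 are as reported; the linear ramp between
    them is invented for illustration.
    """
    if cohort_size <= 5:
        return 1.0                      # teacher predictions used unfiltered
    if cohort_size <= 15:
        return (16 - cohort_size) / 11  # sliding weight for small cohorts
    return 0.0                          # statistical model only

print(teacher_weight(4), teacher_weight(10), teacher_weight(30))
```

Small private-school classes thus sat on the generous end of this ramp, while a 200-pupil comprehensive cohort got the model output unalloyed.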

Now, this way of using past school performance seems… surprising, to those of us who have been involved in UK university admissions in the past, given the extent of government and public outrage every year when the elite universities once again draw their intake from a very small sliver of UK secondary schools, predominantly private schools. You might think that this outrage reflects a belief that the differences in average exam performance, which drive most of the differential in university admissions, are unfair — that they do not accurately represent student ability, performance, and potential. If you believed that, you might propose a very different way of using school performance to assign marks, namely: every school gets the same proportion of A*, A, B, etc., to be allocated within each subject according to the teacher rankings. I’m not advocating this method, but it is no more extreme, in its own way, than the application of past school performance that was actually implemented.
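That alternative is just as easy to state as code as the one that was used. In this sketch the national grade shares are invented round numbers, and the function name is mine:

```python
# Hypothetical national grade shares (must sum to 1); every school
# receives the same shares, allocated by teacher ranking.
NATIONAL_SHARES = [("A*", 0.08), ("A", 0.17), ("B", 0.26), ("C", 0.25), ("D", 0.24)]

def assign_equal_proportions(ranked_students):
    """Give every school the same grade shares, by teacher ranking."""
    n = len(ranked_students)
    cutoffs, boundary = [], 0.0
    for grade, share in NATIONAL_SHARES:
        boundary += share
        cutoffs.append((grade, boundary))
    grades = {}
    for i, student in enumerate(ranked_students):
        frac = (i + 0.5) / n   # this student's position in the ranking
        grades[student] = next(
            (g for g, cut in cutoffs if frac <= cut), cutoffs[-1][0]
        )
    return grades

ranked = ["s%d" % i for i in range(1, 11)]   # any school's ranked cohort
print(assign_equal_proportions(ranked))
```

Under this scheme the top-ranked pupil at the struggling comprehensive gets an A* exactly as often as the top-ranked pupil at Eton; the school’s history plays no role at all.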

To the extent that A-level marks are primarily a tool for sorting graduates for university admissions, this would function somewhat similarly to the practice of some US states of guaranteeing admission to their state universities to a certain percentile of every high school in the state. This leverages housing and school segregation to promote equality, rather than to entrench it.

The fact that my algorithm seems obviously unfair to individuals, while the other algorithm was seen as not only credible but actually self-evident, reflects nothing but naked ideology about the nature of class.

Education minister (a position whose relationship to that of education secretary confuses me) Nick Gibb responded to the fiasco thus:

“So the model itself was fair, it was very popular, it was widely consulted upon. The problem arose in the way in which the three phases of the application of that model – the historic data of the school, the prior attainment of the cohort of pupils at the school, and then the national standard correction – it’s that element of the application of the model that I think there is a concern.”

The minister went on: “The application of the model is a regulatory approach and it’s the development of that that emerged on the Thursday when the algorithm was published. And at that stage it became clear that there were some results that were being published on Thursday and Friday that were just not right and they were not what the model had intended.”

The poor misunderstood beast. It meant well…