Occasional reflections on Life, the World, and Mathematics

Archive for the ‘Technical’ Category

Your shadow genetic profile

So, the “Golden State Killer” has been caught, after forty years. Good news, to be sure, and it’s exciting to hear of the police using modern data systems creatively:

Investigators used DNA from crime scenes that had been stored all these years and plugged the genetic profile of the suspected assailant into an online genealogy database. One such service, GEDmatch, said in a statement on Friday that law enforcement officials had used its database to crack the case. Officers found distant relatives of Mr. DeAngelo’s and, despite his years of eluding the authorities, traced their DNA to his front door.

And yet… This is just another example of how all traditional notions of privacy are crumbling in the face of the twin assaults from information technology and networks. We see this in the way Facebook generates shadow profiles with information provided by your friends and acquaintances, even if you’ve never had a Facebook account. It doesn’t matter how cautious you are about protecting your own data: As long as you are connected to other people, quite a lot can be inferred about you from your network connections, or assembled from bits that you share with people to whom you are connected.

Nowhere is this more true than with genetic data. When DNA identification started being used by police, civil-liberties and privacy activists in many countries forced stringent restrictions on whose DNA could be collected, and under what circumstances it could be kept and catalogued. But now, effectively, everyone’s genome is public. It was noticed a few years back that it was possible to identify (or de-anonymize) participants in the Personal Genome Project by drawing on patterns of information in their phenotypes. Here’s a more recent discussion of the issue. But those people had knowingly allowed their genotypes to be recorded and made publicly available. In the Golden State Killer case we see that random samples of genetic material can be attributed to individuals purely on the basis of their biological links to other people who volunteered to be genotyped.

The next step will be, presumably, “shadow genetic profiles”: A company like GEDmatch — or the FBI — could generate imputed genetic profiles for anyone in the population, based solely on knowledge of their relationships to other people in their database, whether voluntarily (for the private company) or compulsorily (FBI).
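
How much could such an imputed profile capture? The arithmetic of relatedness gives a rough sense of the raw material: the expected fraction of the genome shared identically by descent roughly halves with each degree of relationship. Here is a back-of-the-envelope sketch (my own illustration, not anything GEDmatch or the FBI publishes):

```python
# Rough expected fraction of autosomal DNA shared identically by descent,
# by degree of relationship -- the raw material for this kind of imputation.
def expected_ibd_fraction(degree):
    """degree 1 = parent/child or full sibling, 2 = grandparent, aunt/uncle or
    half-sibling, 3 = first cousin, and so on; the share halves with each degree."""
    return 0.5 ** degree

for degree, label in [(1, "parent / full sibling"),
                      (2, "grandparent / aunt / half-sibling"),
                      (3, "first cousin"),
                      (5, "second cousin")]:
    print(f"{label:35s} ~{expected_ibd_fraction(degree):.1%}")
```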

Natural frequencies and individual propensities

I’ve just been reading Gerd Gigerenzer’s book Reckoning with Risk, about risk communication, mainly a plea for the use of “natural frequencies” in place of probabilities: statements of the form “In how many cases out of 100 similar cases of X would you expect Y to happen?” He cites one study in which forensic psychiatry experts were presented with a case study and asked to estimate the likelihood of the individual being violent in the next six months. Half the subjects were asked “What is the probability that this person will commit a violent act in the next six months?” The other half were asked “How many out of 100 women like this patient would commit a violent act in the next six months?” Looking at these questions, I found it obvious that the latter would elicit lower estimates. Which is indeed what happened: the average response to the first question was about 0.3; the average response to the second was about 20.

What surprised me was that Gigerenzer seemed perplexed by this consistent difference in one direction (though, obviously, not by the fact that the experts were confused by the probability statement). He suggested that those answering the first question were thinking about the same patient being released multiple times, which didn’t make much sense to me.

What I think is that the experts were thinking of the individual probability as a hidden fact, not a statistical statement. Asked to estimate this unknown probability, it seems natural that they would be cautious: thinking it is somewhere between 10 and 30 percent, they would not want to underestimate this individual’s propensity, and so would conservatively state the upper end. This is perfectly consistent with their being confident that, averaged over 100 such cases, about 20 would commit a violent act.
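
A toy simulation (my own sketch, with made-up numbers, not data from the study) makes this mechanism concrete: suppose each expert privately believes the propensity lies somewhere in an interval, reports the cautious upper end when asked for “the probability”, and reports the average case when asked “how many out of 100”.

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts = 1_000

# Each expert's privately held range of plausible propensities (illustrative numbers)
lower = rng.uniform(0.10, 0.20, n_experts)
upper = lower + rng.uniform(0.10, 0.20, n_experts)

prob_answers = upper                        # "What is the probability?" -> cautious upper end
freq_answers = 100 * (lower + upper) / 2    # "How many out of 100?" -> the average case

print(f"Mean answer to the probability question: {prob_answers.mean():.2f}")             # about 0.30
print(f"Mean answer to the frequency question:   {freq_answers.mean():.0f} out of 100")  # about 22
```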

Quoth the raven, Never Trump

Carl Hempel famously crystallised an obstruction to the formalisation of inductive reasoning as the “Raven paradox”: Suppose I am an ornithologist, concerned to prove my world-shaking hypothesis, “All ravens are black”. I could go out into the field with binoculars and observe ravens. Suppose that over the course of the week I see 198 black ravens, 0 white ravens, 0 green ravens, and so on. These are strong data in favour of my hypothesis, and my publication in the Journal of Chromo-ornithology is assured. (And if they turn it down, I’ve heard there are numerous black studies journals…) But it gets cold out in the field, and sometimes damp, so I could reason as follows: “‘All ravens are black’ is equivalent to ‘all non-black objects are not ravens’.” And in my warm and dry study there may be no ravens, but there are many non-black objects. So I catalogue all the pink erasers and yellow textbooks and white mugs, and list them all as evidence for my hypothesis.
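
The equivalence that drives the paradox is just the contrapositive, and a few lines of code can check it on a toy universe (my own illustration, obviously not Hempel’s):

```python
from dataclasses import dataclass

@dataclass
class Thing:
    kind: str
    colour: str

# A made-up universe of objects, purely to illustrate the equivalence
universe = [
    Thing("raven", "black"), Thing("raven", "black"),
    Thing("eraser", "pink"), Thing("textbook", "yellow"), Thing("mug", "white"),
]

all_ravens_black = all(t.colour == "black" for t in universe if t.kind == "raven")
all_nonblack_nonravens = all(t.kind != "raven" for t in universe if t.colour != "black")

# The two statements are contrapositives, so they agree on every possible universe...
assert all_ravens_black == all_nonblack_nonravens
# ...which is why the pink erasers and white mugs end up "confirming" the hypothesis.
```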

The status of this charming story as a paradox depends on the belief that no one would actually make such an inference. Behold, the president of the United States: Last week the special counsel for matters related to Russian interference in the 2016 US election released an indictment of 13 Russians. None of them had worked with the Trump campaign. Trump’s response:

In other words, while it is proving too difficult to collect proof of the contention “No anti-American voter fraud was performed by Trump,” he is collecting evidence that “There were actions not performed by Trump that were anti-American voter fraud.”

The EU OS

Twenty years ago I had a short visit from a college friend* who had just discovered the technical utopia. Completely enthralled. The Internet was going to upend all power relations, make all governments irrelevant, make censorship impossible. I was fascinated, but I did ask, How is The Internet going to clean the sewers?

But there was something else that intrigued me. He was very much on the nonscience side as a student, but he had just been learning some programming. And he had discovered something amazing: When your computer looks like it isn’t doing anything, it’s actually constantly active, checking whether any input has come. The user interface is a metaphorical desktop, inert and passive until you prod it, but beneath the surface a huge amount of complicated machinery is thrumming to generate this placid illusion.

I thought of this when reading The European Union: A Very Short Introduction. The European Union is complicated. For instance, in EU governance there is the European Council and the Council of the European Union, which are distinct, and neither one is the same as the Council of Europe (which is not part of the EU at all). There is a vast amount of work for lawyers, diplomats, economists, and various other specialists — “bureaucrats” in the common parlance — to give form and reality to certain comprehensible goals, the famous “four freedoms” — free movement of goods, capital, services, and labour. The four freedoms are the user interface of the EU, if you will, and the bureaucracy is the machinery thrumming away beneath the surface to keep that interface looking simple and placid.

There’s a lot of legacy code in the EU. In the absence of a further world war to flatten the institutions and allow a completely new constitution to be created, EU institutions had to be made backward compatible with existing nation states. There is a great deal of human work involved in carrying out these compatibility tasks. When people complain that the EU is “bureaucratic”, that’s more or less what they mean. And when they complain about “loss of sovereignty” what they mean is that their national operating system has been repurposed to run the EU code, so that some of the action of national parliaments has become senseless on its own terms.

Some people look at complicated but highly useful structures with a certain kind of awe. When these were social constructs, the people who advised treating them with care used to be called “conservatives”. The people who call themselves Conservative these days, faced with complicated structures that they can’t understand, feel only an irresistible urge to smash them.

* German has a word — Kommilitone — for exactly this relationship (fellow student), which English lacks; I say “college friend” because “former fellow student” is awkward.

Horse thieves and inverse probabilities

Reading Ron Chernow’s magisterial new biography of Ulysses Grant, I came across this very correct piece of statistical inverse reasoning from the celebrated journalist Horace Greeley (whose role in the high-school history curriculum has been reduced to the phrase “Go West, young man”, which he denied having invented):

All Democrats are not horse thieves, but all horse thieves are Democrats.

This seems like an ironic bon mot, but after he became the Democratic candidate for president against Grant in 1872 he tried to use a milder version unironically as a defence of his new party colleagues:

I never said all Democrats were saloon keepers. What I said was all saloon keepers are Democrats.

Presumably he meant to add that if we knew the base rate of saloonkeeping (or horse thievery) in the population at large, we could calculate from the Democratic vote share the exact fraction of Democrats (and of Republicans) who are saloonkeepers (or horse thieves).
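
The calculation he is gesturing at is a one-line application of Bayes’ rule. With made-up base rates (mine, not Greeley’s or Chernow’s), it looks like this:

```python
# Made-up inputs: the base rate of horse thievery and the Democratic share of the population
p_thief = 0.001
p_democrat = 0.50
p_democrat_given_thief = 1.0          # "all horse thieves are Democrats"

# Bayes' rule: P(thief | Democrat) = P(Democrat | thief) * P(thief) / P(Democrat)
p_thief_given_democrat = p_democrat_given_thief * p_thief / p_democrat
p_thief_given_republican = (1 - p_democrat_given_thief) * p_thief / (1 - p_democrat)

print(f"Fraction of Democrats who are horse thieves:   {p_thief_given_democrat:.2%}")   # 0.20%
print(f"Fraction of Republicans who are horse thieves: {p_thief_given_republican:.2%}") # 0.00%
```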

Why people hate statisticians

Andrew Dilnot, former head of the UK Statistics Authority and current warden (no really!) of Nuffield College, gave a talk here last week, at our annual event honouring Florence Nightingale qua statistician. The ostensible title was “Numbers and Public policy: Why statistics really matter”, but the title should have been “Why people hate statisticians”. This was one of the most extreme versions I’ve ever seen of a speaker shopping trite (mostly right-wing) political talking points by dressing them up in statistics to make the dubious assertions seem irrefutable, and to make the trivially obvious look ingenious.

I don’t have the slides from the talk, but video of a similar talk is available here. He spent quite a bit of his talk trying to debunk the Occupy Movement’s slogan that inequality has been increasing. The 90:10 ratio bounced along near 3 for a while, then rose to 4 during the 1980s (the Thatcher years… who knew?!), and hasn’t moved much since. Case closed. Oh, but wait, what about other measures of inequality, you may ask. And since you might ask, he had to set up some straw men to knock down. He showed the same pattern for five other measures of inequality. Case really closed.

Except that these five were all measuring the same thing, more or less. The argument people like Piketty have been making is not that the 90th percentile has been doing so much better than the 10th percentile, but that increases in wealth have been concentrated in ever smaller fractions of the population. None of the measures he looked at was designed to capture that process. The Gini coefficient, which looks as though it measures the whole distribution, is in fact extremely insensitive to concentration at the high end, because it is a population average. Suppose the top 1% has 20% of the income. Changes of distribution within the top 1% cannot shift the Gini coefficient by more than about 3% of its current value. He also showed the 95:5 ratio, and lo and behold, that kept rising through the 90s, then stopped. All consistent with the main critique of rising income inequality.
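
The insensitivity is easy to demonstrate numerically. Here is a minimal simulation (my own sketch, with an arbitrary income distribution, not figures from the talk): holding the top 1%’s total share fixed at 20%, redistributing income within that top 1% barely moves the Gini coefficient.

```python
import numpy as np

def gini(x):
    """Gini coefficient of a sample (0 = perfect equality, 1 = one person has everything)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * cum.sum() / cum[-1]) / n

rng = np.random.default_rng(0)
n = 100_000
n_top = n // 100                               # the top 1% of the population

# Bottom 99%: an arbitrary spread of incomes, normalised to hold 80% of total income
bottom = rng.lognormal(mean=0.0, sigma=0.5, size=n - n_top)
bottom *= 0.80 / bottom.sum()

# Scenario A: the top 1% holds 20% of income, spread evenly within the group
top_equal = np.full(n_top, 0.20 / n_top)

# Scenario B: the same 20%, but with half of it going to a single person
top_skewed = np.full(n_top, 0.10 / (n_top - 1))
top_skewed[-1] = 0.10

g_a = gini(np.concatenate([bottom, top_equal]))
g_b = gini(np.concatenate([bottom, top_skewed]))
print(f"Gini, top share spread evenly:          {g_a:.3f}")
print(f"Gini, half the top share to one person: {g_b:.3f}")
print(f"Relative change: {abs(g_b - g_a) / g_a:.1%}")        # well under the ~3% bound
```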

Since he’s obviously not stupid, and obviously understands economics much better than I do, it’s hard to avoid thinking that this was all smoke and mirrors, intended to lull people to sleep about rising inequality, under the cover of technocratic expertise. It’s a well-known trick: Ignore the strongest criticism of your point of view, and give lots of details about weak arguments. Mathematical details are best. “Just do the math” is a nice slogan. Sometimes simple (or complex) calculations can really shed light on a problem that looks to be inextricably bound up with political interests and ideologies. But sometimes not. And sometimes you just have to accept that a political economic argument needs to be melded with statistical reasoning, and you have to be open about the entirety of the argument.

Small samples

New York Republican Representative Lee Zeldin was asked by reporter Tara Golshan how he felt about the fact that polls seem to show that a large majority of Americans — and even of Republican voters — oppose the Republican plan to reduce corporate tax rates. His response:

What I have come in contact with would reflect different numbers. So it would be interesting to see an accurate poll of 100 million Americans. But sometimes the polls get done of 1,000 [people].

Yes, that does seem suspicious, only asking 1,000 people… The 100 million people he has come in contact with are probably more typical.
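
For what it’s worth, the reason pollsters stop at about 1,000 respondents is that the sampling error is already small at that point. A minimal back-of-the-envelope calculation, assuming simple random sampling:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Worst-case 95% margin of error for a simple random sample of n people."""
    return z * math.sqrt(p * (1 - p) / n)

print(f"n = 1,000:       ±{margin_of_error(1_000):.1%}")          # about ±3 percentage points
print(f"n = 100,000,000: ±{margin_of_error(100_000_000):.3%}")    # vanishingly small
```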

Parliamentary mortality

An article in the New Statesman raised the question of whether the Conservatives could lose their hold on power via by-elections over the next few years, only to dismiss the possibility because by-elections simply don’t happen frequently enough. The reason? Reduced mortality rates. Quite sensible, but then this strange claim was made:

In 1992-7, the last time that the Conservatives had seven by-elections in a parliament, life expectancy was 15 years lower than it is today.

Ummm… If life expectancy had increased by 15 years over the last 20 years, we’d be getting close to achieving mortality escape velocity. In fact, the increase has been about 5 years for men and 4 years for women.

But that raised for me the somewhat morbid question: How many MPs would be expected to die in the next 5 years? The approximate age distribution of MPs is available here. It’s for the last parliament, but I’ll assume it remains pretty similar. It’s interesting that Labour had twice as large a proportion of MPs in the over-60 category (25%, vs 12% for the Conservatives). In addition, I’ll make the following assumptions:

  1. Within coarse age categories the distribution is the same between parties. (This is required to deal with the fact that the numbers by party are only divided into three age categories.)
  2. Since I don’t have detailed mortality data by class or occupation, I’ll simply treat them as being 5 years younger than their calendar age, since that’s the difference in median age at death between men in managerial occupations and average occupations.
  3. I assume women to have the same age distribution as men.
  4. I’m using 2013 mortality rates from the Human Mortality Database.

My calculations say that the expected number of deaths over the next 5 years is about 6.4 Conservatives and 6.5 Labour. So we can estimate that the probability of at least 7 by-elections due to deceased Tory MPs is just a shade under 50%.
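
For anyone who wants to reproduce this kind of estimate, here is a rough sketch of the shape of the calculation with placeholder inputs (the real age counts come from the table linked above and the real death probabilities from the Human Mortality Database; neither is copied here):

```python
import math
import numpy as np

# Placeholder inputs -- substitute the real MPs-by-age counts and the 2013 UK
# death probabilities q[x] from the Human Mortality Database.
ages = np.arange(35, 80)                 # calendar ages of one party's MPs
mps_at_age = np.full(len(ages), 2)       # e.g. two MPs at each age (made up)
q = {x: 0.0004 * math.exp(0.09 * (x - 30)) for x in range(20, 100)}  # toy hazard curve

def expected_deaths(ages, counts, q, years=5, age_offset=-5):
    """Expected number of deaths over `years`, treating each MP as `age_offset`
    years younger than their calendar age (the occupational-class adjustment)."""
    total = 0.0
    for age, n in zip(ages, counts):
        survive = 1.0
        for t in range(years):
            survive *= 1 - q[age + age_offset + t]
        total += n * (1 - survive)
    return total

print(f"Expected deaths over 5 years (placeholder data): {expected_deaths(ages, mps_at_age, q):.1f}")

# With the post's figure of about 6.4 expected Conservative deaths, a Poisson
# approximation gives the probability of at least 7 by-elections from deaths:
lam = 6.4
p_at_least_7 = 1 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(7))
print(f"P(at least 7 deaths): {p_at_least_7:.2f}")   # roughly 0.46, a shade under 50%
```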

Learning to count

The US espionage services promised last year to reveal roughly how many Americans have been illegally spied upon through “accidents” under the warrantless surveillance law that is nominally restricted to communications by foreigners overseas.

Last month the promise was retracted.

“The NSA has made Herculean, extensive efforts to devise a counting strategy that would be accurate,” Dan Coats, a career Republican politician appointed by Republican President Donald Trump as the top U.S. intelligence official, testified to a Senate panel on Wednesday.

Coats said “it remains infeasible to generate an exact, accurate, meaningful, and responsive methodology that can count how often a U.S. person’s communications may be collected” under the law known as Section 702 of the Foreign Intelligence Surveillance Act.

So we’re supposed to believe that the NSA is capable of making brilliant use of the full depth of private communications to map out threats to US national security… but isn’t capable of counting them. Presumably these “Herculean” efforts involve a strong helping of Cretan Bull—-.

I am reminded of this passage in Der Mann ohne Eigenschaften [The Man without Qualities]:

Es gibt also in Wirklichkeit zwei Geistesverfassungen, die einander nicht nur bekämpfen, sondern die gewöhnlich, was schlimmer ist, nebeneinander bestehen, ohne ein Wort zu wechseln, außer daß sie sich gegenseitig versichern, sie seien beide wünschenswert, jede auf ihrem Platz. Die eine begnügt sich damit, genau zu sein, und hält sich an die Tatsachen; die andere begnügt sich damit nicht, sondern schaut immer auf das Ganze und leitet ihre Erkenntnisse von sogenannten ewigen und großen Wahrheiten her. Die eine gewinnt dabei an Erfolg, und die andere an Umfang und Würde. Es ist klar, daß ein Pessimist auch sagen könnte, die Ergebnisse der einen seien nichts wert und die der anderen nicht wahr. Denn was fängt man am Jüngsten Tag, wenn die menschlichen Werke gewogen werden, mit drei Abhandlungen über die Ameisensäure an, und wenn es ihrer dreißig wären?! Andererseits, was weiß man vom Jüngsten Tag, wenn man nicht einmal weiß, was alles bis dahin aus der Ameisensäure werden kann?!

Thus there are in fact two distinct mental types, which not only battle each other, but which, even worse, generally coexist side by side without ever exchanging a word, other than to assure each other that they are each desirable in their own place. The one contents itself with being exact and keeping to the facts; the other is not content with that, but always looks at the big picture and derives its knowledge from so-called great eternal truths. The one gains success thereby, the other gains scope and dignity. Of course, a pessimist could always say that the results of the one have no value, while those of the other have no truth. After all, when the Last Judgment comes and human works are weighed, what are we going to do with three treatises on formic acid, and so what if we even had thirty of them?! On the other hand, what could we possibly know about the Last Judgment, if we don’t even know what could happen by then with formic acid?!

The NSA is focusing on the big picture…

Worst mathematics metaphor ever?

I’ve come to accept “growing exponentially” — though I once had to bite my tongue at a cancer researcher claiming that “exponential growth” of cancer rates began at age 50, because earlier the rates were just generally low — and didn’t say anything when someone recently referred to having “lots of circles to square”. But here’s a really new bad mathematics metaphor: the Guardian editorialises that after Brexit

Europe will be less than the sum of its remaining parts.

“More than the sum of its parts” or “less than” is something you say when you’re adding things together, and pointing out either that you don’t actually get as much extra as you’d think or, on the contrary, that you get more. That you get less when you take something away really doesn’t need much explanation and, in any case, it’s not about the sum of the parts. Whether what remains of Europe is more or less than the sum of its remaining parts seems kind of irrelevant to whatever argument is being suggested.
