
Wednesday 31 January 2018

Numbers aren't neutral

A S Paneerselvan in The Hindu



Analysing data without providing sufficient context is dangerous


An inherent challenge in journalism is to meet deadlines without compromising on quality, while sticking to the word limit. However, brevity takes a toll when it comes to reporting on surveys, indexes, and big data. Let me examine three sets of stories which were based on surveys and carried prominently by this newspaper, to understand the limits of presenting data without providing comprehensive context.

Three reports

The Annual Status of Education Report (ASER), Oxfam’s report titled ‘Reward Work, Not Wealth’, and the World Bank’s ease of doing business (EoDB) rankings have been widely reported, commented on, and editorialised. In most cases, the numbers and rankings were presented as neutral evaluations; they were not seen as data originating from institutions that have political underpinnings. Data become meaningful only when the methodology of data collection is spelt out in clear terms.

Every time I read surveys, indexes, and big data, I look for at least three basic parameters to understand the numbers: the sample size, the sample questionnaire, and the methodology. The sample size indicates the robustness of the study, the questionnaire reveals whether there are leading questions, and the methodology reveals the rigour of the study. As a reporter, I sometimes failed to mention these details in my resolve to stick to the word limit. Those were my mistakes.

The ASER study, covering specific districts in States, is about children’s schooling status. It attempts to measure children’s abilities in basic reading and writing. It is a significant study, as it gives us an insight into some of the problems with our educational system. However, we must remember that these figures are restricted to the districts in which the survey was conducted. They cannot be extrapolated into a State-wide sample, nor is it fair to rank States based on how specific districts fare in the study. A news item, “Report highlights India’s digital divide” (Jan. 19, 2018), conflated these figures.

For instance, the district surveyed in Kerala was Ernakulam, which is an urban district; in West Bengal it was South 24 Parganas, a complex district that stretches from metropolitan Kolkata to remote villages at the mouth of the Bay of Bengal. How can we compare these two districts with Odisha’s Khordha, Jharkhand’s Purbi Singhbhum and Bihar’s Muzaffarpur? It can be tempting for a reporter who has accessed the data to paint a larger picture based on these specific numbers. But we learn little when we compare apples and oranges.

Questionable methodology


Oxfam, in the ‘Reward Work, Not Wealth’ report, used a methodology that has been questioned by many economists. Inequality is calculated on the basis of “net assets”. Economists point out that by this measure, the poorest are not those living on very little, but young professionals who own no assets and carry large education loans. Inequality is the elephant in the room, which we cannot ignore. But Oxfam’s figures seem to mimic the huge notional loss figures put out by the Comptroller and Auditor General of India. Readers should know that Oxfam’s study has drawn its figures from disparate sources: the Global Wealth Report by Credit Suisse; the Forbes billionaires list, with last year’s figures adjusted using the average annual U.S. Consumer Price Index inflation rate from the U.S. Bureau of Labor Statistics; the World Bank’s household survey data; and an online survey in 10 countries.

When the World Bank announced the EoDB index last year, there was euphoria in India. However, this newspaper’s editorial “Moving up” (Nov. 2, 2017), which looked at India’s surge in the latest World Bank ranking from the 130th position to the 100th in a year, struck a note of caution and asked the government, which has great orators in its ranks, to be a better listener. In hindsight, this position was vindicated when the World Bank’s chief economist, Paul Romer, said that he could no longer defend the integrity of changes made to the methodology and that the Bank would recalculate the national rankings of business competitiveness going back at least four years. Readers would have appreciated the FAQ section (“Recalculating ease of doing business”, Jan. 25), which explained this controversy in some detail, even more had it also looked at India’s ranking under the old methodology.

Thursday 23 March 2017

Momentum, a convenient sporting myth

Suresh Menon in The Hindu

So Australia has the “momentum” going into the final Test match in Dharamsala.

At least, their skipper Steve Smith thinks so. Had Virat Kohli said that India have the momentum, he would have been right too. The reason is quite simple. “Momentum” does not exist, so you can pour into the word any meaning you want. Sportsmen do it all the time. It is as uplifting as the thought: “I am due a big score” or “the rivals are due a defeat”. Sport does not work that way, but there is consolation in thinking that it does.

“Momentum” is one of our most comforting sporting myths, the favourite of television pundits and newspaper columnists as well as team coaches everywhere. It reaffirms what we love to believe about sport: that winning is a habit, set to continue if unchecked; that confidence is everything, and players carry it from one victory to the next; and above all, that randomness, which is a more fundamental explanation, is anathema. It is at once the loser’s solace and the winner’s excuse. Few streaks transcend random processes. Of course streaks occur – that is the nature of sport. But that is no guide to future performance.

Momentum, momentum, who’s got the momentum? is a popular sport-within-a-sport. It is a concept that hovers on the verge of meaning, and it sounds better than “I have a feeling about this.”

A study in the 1980s by Thomas Gilovich and Amos Tversky raised the question of “hot hands” or streaks in the NBA. They studied the Philadelphia 76ers and found no evidence of momentum. Immediate past success had no bearing on future attempts, just as a coin might fall heads or tails regardless of what the previous toss might have been.
That and later studies – including one on whether the winner of the fourth set in a tennis match was more likely to win the fifth – confirmed what a coin-tossing logician might have suspected: that momentum, like the unicorn, does not exist.
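
To make the coin-tossing comparison concrete, here is a minimal simulation sketch (mine, not part of the studies cited): it tosses a fair coin over a hypothetical 50-match “season” and records the longest winning streak, showing that impressive-looking streaks arise routinely in sequences that contain no momentum at all.

# A rough illustration, not the Gilovich-Tversky analysis: simulate fair coin
# tosses and measure the longest run of wins, to show that streaks appear in
# purely random sequences.
import random

def longest_streak(outcomes):
    """Length of the longest run of consecutive wins (True values)."""
    best = current = 0
    for won in outcomes:
        current = current + 1 if won else 0
        best = max(best, current)
    return best

random.seed(1)
n_matches = 50        # one hypothetical 50-contest "season"
n_seasons = 10_000    # repeat many times to see typical behaviour

streaks = [longest_streak(random.random() < 0.5 for _ in range(n_matches))
           for _ in range(n_seasons)]
print(sum(streaks) / n_seasons)  # average longest winning streak; typically around 5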

Statistics and mythology are strange bedfellows, wrote the late Stephen Jay Gould, evolutionary biologist and baseball fan. One can lead to the other over the course of an entire series or even through a single over in cricket.

Gould has also explained the attraction of patterns, and how we are hard-wired to see patterns in randomness. In many cases, patterns can be discerned in retrospect anyway, but only in retrospect. “Momentum” is usually recognised after the event, and seems to be born of convenience rather than logic.

The momentum in the current series was with India before the matches began. Then they lost the first Test in Pune, and the momentum swung to Australia for the Bengaluru Test which then India won, grabbing the momentum again.

The third Test was drawn, so the momentum is either with Australia for plucking a draw from the jaws of defeat or with India for pushing Australia to the edge. Such simplistic analyses have kept pundits in business and given “momentum” a respectability and false importance in competitive sport. There is something romantic too in the idea, and many find that irresistible.

Momentum is such a vital component of sport that it has assumed the contours of a tangible object. Fans can reach out and touch it. Teams have it, they carry it, they ride it, they take great comfort from it and work hard to ensure that the opposition does not steal it from them. They carry it from venue to venue like they might their bats and boots and helmets.

To be fair to Steve Smith, what he actually said was “If there’s anything called momentum, it’s with us at the moment,” giving us a glimpse into a measured skepticism. If it exists, then we have it.

Does Peter Handscomb have momentum on his side, after a match-saving half-century in Ranchi? By the same token, does Ravindra Jadeja, after a half-century and nine wickets in the same match? Is team momentum the sum total of all the individual momentums? Will Ravi Ashwin, in that case, begin the final Test with negative momentum, having been below his best on the final day in Ranchi? How long before someone decides that momentum is temporary, but skill is permanent?

It is convenient to believe that either one team or the other has the momentum going into the final Test. Yet it is equally possible that those who swing the match with their performance might be players who haven’t been a great success in the series so far.

Someone like fast bowler Pat Cummins, or Virat Kohli himself. A whole grocery list of attributes then becomes more important than momentum: motivation, attitude, desperation, and imponderables that cannot be easily packaged and labeled.

Whichever team wins, momentum will have nothing to do with it. But that will not stop the next captain from telling us that the momentum is with his side. It might seem like blasphemy to disagree with him, so deeply is the concept grouted into our sporting consciousness.

Tuesday 7 February 2017

The hi-tech war on science fraud

Stephen Buranyi in The Guardian


One morning last summer, a German psychologist named Mathias Kauff woke up to find that he had been reprimanded by a robot. In an email, a computer program named Statcheck informed him that a 2013 paper he had published on multiculturalism and prejudice appeared to contain a number of incorrect calculations – which the program had catalogued and then posted on the internet for anyone to see. The problems turned out to be minor – just a few rounding errors – but the experience left Kauff feeling rattled. “At first I was a bit frightened,” he said. “I felt a bit exposed.”

Kauff wasn’t alone. Statcheck had read some 50,000 published psychology papers and checked the maths behind every statistical result it encountered. In the space of 24 hours, virtually every academic active in the field in the past two decades had received an email from the program, informing them that their work had been reviewed. Nothing like this had ever been seen before: a massive, open, retroactive evaluation of scientific literature, conducted entirely by computer.

Statcheck’s method was relatively simple, more like the mathematical equivalent of a spellchecker than a thoughtful review, but some scientists saw it as a new form of scrutiny and suspicion, portending a future in which the objective authority of peer review would be undermined by unaccountable and uncredentialed critics.
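
To give a sense of what such a “statistical spellchecker” does, here is a simplified sketch (my own illustration, not Statcheck’s actual code): given a reported t-statistic, degrees of freedom and p-value, it recomputes the p-value and flags the result if the two disagree.

# Simplified sketch of a Statcheck-style consistency check (not the real tool):
# recompute the two-sided p-value implied by a reported t-statistic and degrees
# of freedom, and flag the result if it disagrees with the reported p-value.
from scipy import stats

def check_t_test(t_value, df, reported_p, tol=0.01):
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)  # two-sided p-value
    consistent = abs(recomputed_p - reported_p) <= tol
    return recomputed_p, consistent

# Hypothetical reported result: "t(28) = 2.20, p = .04"
recomputed, ok = check_t_test(t_value=2.20, df=28, reported_p=0.04)
print(f"recomputed p = {recomputed:.3f}, consistent with report: {ok}")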

Susan Fiske, the former head of the Association for Psychological Science, wrote an op-ed accusing “self-appointed data police” of pioneering a new “form of harassment”. The German Psychological Society issued a statement condemning the unauthorised use of Statcheck. The intensity of the reaction suggested that many were afraid that the program was not just attributing mere statistical errors, but some impropriety, to the scientists.

The man behind all this controversy was a 25-year-old Dutch scientist named Chris Hartgerink, based at Tilburg University’s Meta-Research Center, which studies bias and error in science. Statcheck was the brainchild of Hartgerink’s colleague Michèle Nuijten, who had used the program to conduct a 2015 study that demonstrated that about half of all papers in psychology journals contained a statistical error. Nuijten’s study was written up in Nature as a valuable contribution to the growing literature acknowledging bias and error in science – but she had not published an inventory of the specific errors it had detected, or the authors who had committed them. The real flashpoint came months later, when Hartgerink modified Statcheck with some code of his own devising, which catalogued the individual errors and posted them online – sparking uproar across the scientific community.

Hartgerink is one of only a handful of researchers in the world who work full-time on the problem of scientific fraud – and he is perfectly happy to upset his peers. “The scientific system as we know it is pretty screwed up,” he told me last autumn. Sitting in the offices of the Meta-Research Center, which look out on to Tilburg’s grey, mid-century campus, he added: “I’ve known for years that I want to help improve it.” Hartgerink approaches his work with a professorial seriousness – his office is bare, except for a pile of statistics textbooks and an equation-filled whiteboard – and he is appealingly earnest about his aims. His conversations tend to rapidly ascend to great heights, as if they were balloons released from his hands – the simplest things soon become grand questions of ethics, or privacy, or the future of science.

“Statcheck is a good example of what is now possible,” he said. The top priority, for Hartgerink, is something much more grave than correcting simple statistical miscalculations. He is now proposing to deploy a similar program that will uncover fake or manipulated results – which he believes are far more prevalent than most scientists would like to admit.

When it comes to fraud – or in the more neutral terms he prefers, “scientific misconduct” – Hartgerink is aware that he is venturing into sensitive territory. “It is not something people enjoy talking about,” he told me, with a weary grin. Despite its professed commitment to self-correction, science is a discipline that relies mainly on a culture of mutual trust and good faith to stay clean. Talking about its faults can feel like a kind of heresy. In 1981, when a young Al Gore led a congressional inquiry into a spate of recent cases of scientific fraud in biomedicine, the historian Daniel Kevles observed that “for Gore and for many others, fraud in the biomedical sciences was akin to pederasty among priests”.

The comparison is apt. The exposure of fraud directly threatens the special claim science has on truth, which relies on the belief that its methods are purely rational and objective. As the congressmen warned scientists during the hearings, “each and every case of fraud serves to undermine the public’s trust in the research enterprise of our nation”.

But three decades later, scientists still have only the most crude estimates of how much fraud actually exists. The current accepted standard is a 2009 study by the Stanford researcher Daniele Fanelli that collated the results of 21 previous surveys given to scientists in various fields about research misconduct. The studies, which depended entirely on scientists honestly reporting their own misconduct, concluded that about 2% of scientists had falsified data at some point in their career.

If Fanelli’s estimate is correct, it seems likely that thousands of scientists are getting away with misconduct each year. Fraud – including outright fabrication, plagiarism and self-plagiarism – accounts for the majority of retracted scientific articles. But, according to RetractionWatch, which catalogues papers that have been withdrawn from the scientific literature, only 684 were retracted in 2015, while more than 800,000 new papers were published. If even just a few of the suggested 2% of scientific fraudsters – which, relying on self-reporting, is itself probably a conservative estimate – are active in any given year, the vast majority are going totally undetected. “Reviewers and editors, other gatekeepers – they’re not looking for potential problems,” Hartgerink said.
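
Using only the figures quoted above, the mismatch is stark (a back-of-the-envelope comparison, with the caveat that one rate counts papers in a single year and the other counts scientists over a whole career):

\[
\frac{684\ \text{retractions}}{800{,}000\ \text{new papers}} \approx 0.09\%,
\qquad \text{against the}\ \sim 2\%\ \text{of scientists who admit to having falsified data.}
\]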

But if none of the traditional authorities in science are going to address the problem, Hartgerink believes that there is another way. If a program similar to Statcheck can be trained to detect the traces of manipulated data, and then make those results public, the scientific community can decide for itself whether a given study should still be regarded as trustworthy.

Hartgerink’s university, which sits at the western edge of Tilburg, a small, quiet city in the southern Netherlands, seems an unlikely place to try and correct this hole in the scientific process. The university is best known for its economics and business courses and does not have traditional lab facilities. But Tilburg was also the site of one of the biggest scientific scandals in living memory – and no one knows better than Hartgerink and his colleagues just how devastating individual cases of fraud can be.

In September 2010, the School of Social and Behavioral Science at Tilburg University appointed Diederik Stapel, a promising young social psychologist, as its new dean. Stapel was already popular with students for his warm manner, and with the faculty for his easy command of scientific literature and his enthusiasm for collaboration. He would often offer to help his colleagues, and sometimes even his students, by conducting surveys and gathering data for them.

As dean, Stapel appeared to reward his colleagues’ faith in him almost immediately. In April 2011 he published a paper in Science, the first study the small university had ever landed in that prestigious journal. Stapel’s research focused on what psychologists call “priming”: the idea that small stimuli can affect our behaviour in unnoticed but significant ways. “Could being discriminated against depend on such seemingly trivial matters as garbage on the streets?” Stapel’s paper in Science asked. He proceeded to show that white commuters at the Utrecht railway station tended to sit further away from visible minorities when the station was dirty. Similarly, Stapel found that white people were more likely to give negative answers on a quiz about minorities if they were interviewed on a dirty street, rather than a clean one.

Stapel had a knack for devising and executing such clever studies, cutting through messy problems to extract clean data. Since becoming a professor a decade earlier, he had published more than 100 papers, showing, among other things, that beauty product advertisements, regardless of context, prompted women to think about themselves more negatively, and that judges who had been primed to think about concepts of impartial justice were less likely to make racially motivated decisions.

His findings regularly reached the public through the media. The idea that huge, intractable social issues such as sexism and racism could be affected in such simple ways had a powerful intuitive appeal, and hinted at the possibility of equally simple, elegant solutions. If anything united Stapel’s diverse interests, it was this Gladwellian bent. His studies were often featured in the popular press, including the Los Angeles Times and New York Times, and he was a regular guest on Dutch television programmes.

But as Stapel’s reputation skyrocketed, a small group of colleagues and students began to view him with suspicion. “It was too good to be true,” a professor who was working at Tilburg at the time told me. (The professor, whom I will call Joseph Robin, asked to remain anonymous so that he could frankly discuss his role in exposing Stapel.) “All of his experiments worked. That just doesn’t happen.”

A student of Stapel’s had mentioned to Robin in 2010 that some of Stapel’s data looked strange, so that autumn, shortly after Stapel was made Dean, Robin proposed a collaboration with him, hoping to see his methods first-hand. Stapel agreed, and the data he returned a few months later, according to Robin, “looked crazy. It was internally inconsistent in weird ways; completely unlike any real data I had ever seen.” Meanwhile, as the student helped get hold of more datasets from Stapel’s former students and collaborators, the evidence mounted: more “weird data”, and identical sets of numbers copied directly from one study to another.
In August 2011, the whistleblowers took their findings to the head of the department, Marcel Zeelenberg, who confronted Stapel with the evidence. At first, Stapel denied the charges, but just days later he admitted what his accusers suspected: he had never interviewed any commuters at the railway station, no women had been shown beauty advertisements and no judges had been surveyed about impartial justice and racism.

Stapel hadn’t just tinkered with numbers, he had made most of them up entirely, producing entire datasets at home in his kitchen after his wife and children had gone to bed. His method was an inversion of the proper scientific method: he started by deciding what result he wanted and then worked backwards, filling out the individual “data” points he was supposed to be collecting.

On 7 September 2011, the university revealed that Stapel had been suspended. The media initially speculated that there might have been an issue with his latest study – announced just days earlier, showing that meat-eaters were more selfish and less sociable – but the problem went much deeper. Stapel’s students and colleagues were about to learn that his enviable skill with data was, in fact, a sham, and his golden reputation, as well as nearly a decade of results that they had used in their own work, were built on lies.

Chris Hartgerink was studying late at the library when he heard the news. The extent of Stapel’s fraud wasn’t clear by then, but it was big. Hartgerink, who was then an undergraduate in the Tilburg psychology programme, felt a sudden disorientation, a sense that something solid and integral had been lost. Stapel had been a mentor to him, hiring him as a research assistant and giving him constant encouragement. “This is a guy who inspired me to actually become enthusiastic about research,” Hartgerink told me. “When that reason drops out, what remains, you know?”

Hartgerink wasn’t alone; the whole university was stunned. “It was a really difficult time,” said one student who had helped expose Stapel. “You saw these people on a daily basis who were so proud of their work, and you know it’s just based on a lie.” Even after Stapel resigned, the media coverage was relentless. Reporters roamed the campus – first from the Dutch press, and then, as the story got bigger, from all over the world.

On 9 September, just two days after Stapel was suspended, the university convened an ad-hoc investigative committee of current and former faculty. To help determine the true extent of Stapel’s fraud, the committee turned to Marcel van Assen, a statistician and psychologist in the department. At the time, Van Assen was growing bored with his current research, and the idea of investigating the former dean sounded like fun to him. Van Assen had never much liked Stapel, believing that he relied more on the force of his personality than reason when running the department. “Some people believe him charismatic,” Van Assen told me. “I am less sensitive to it.”

Van Assen – who is 44, tall and rangy, with a mop of greying, curly hair – approaches his work with relentless, unsentimental practicality. When speaking, he maintains an amused, half-smile, as if he is joking. He once told me that to fix the problems in psychology, it might be simpler to toss out 150 years of research and start again; I’m still not sure whether or not he was serious.

To prove misconduct, Van Assen said, you must be a pitbull: biting deeper and deeper, clamping down not just on the papers, but the datasets behind them, the research methods, the collaborators – using everything available to bring down the target. He spent a year breaking down the 45 studies Stapel produced at Tilburg and cataloguing their individual aberrations, noting where the effect size – a standard measure of the difference between the two groups in an experiment – seemed suspiciously large, where sequences of numbers were copied, where variables were too closely related, or where variables that should have moved in tandem instead appeared adrift.
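
For readers unfamiliar with the term, one widely used effect-size measure (a standard example, not necessarily the exact statistic used in the Stapel investigation) is Cohen’s d, the difference between two group means scaled by their pooled standard deviation:

\[
d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}.
\]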

The committee released its final report in October 2012 and, based largely on its conclusions, 55 of Stapel’s publications were officially retracted by the journals that had published them. Stapel also returned his PhD to the University of Amsterdam. He is, by any measure, one of the biggest scientific frauds of all time. (RetractionWatch has him third on their all-time retraction leaderboard.) The committee also had harsh words for Stapel’s colleagues, concluding that “from the bottom to the top, there was a general neglect of fundamental scientific standards”. “It was a real blow to the faculty,” Jacques Hagenaars, a former professor of methodology at Tilburg, who served on the committee, told me.

By extending some of the blame to the methods and attitudes of the scientists around Stapel, the committee situated the case within a larger problem that was attracting attention at the time, which has come to be known as the “replication crisis”. For the past decade, the scientific community has been grappling with the discovery that many published results cannot be reproduced independently by other scientists – in spite of the traditional safeguards of publishing and peer-review – because the original studies were marred by some combination of unchecked bias and human error.

After the committee disbanded, Van Assen found himself fascinated by the way science is susceptible to error, bias, and outright fraud. Investigating Stapel had been exciting, and he had no interest in returning to his old work. Van Assen had also found a like mind, a new professor at Tilburg named Jelte Wicherts, who had a long history working on bias in science and who shared his attitude of upbeat cynicism about the problems in their field. “We simply agree, there are findings out there that cannot be trusted,” Van Assen said. They began planning a new sort of research group: one that would investigate the very practice of science.

Van Assen does not like assigning Stapel too much credit for the creation of the Meta-Research Center, which hired its first students in late 2012, but there is an undeniable symmetry: he and Wicherts have created, in Stapel’s old department, a platform to investigate the sort of “sloppy science” and misconduct that very department had been condemned for.

Hartgerink joined the group in 2013. “For many people, certainly for me, Stapel launched an existential crisis in science,” he said. After Stapel’s fraud was exposed, Hartgerink struggled to find “what could be trusted” in his chosen field. He began to notice how easy it was for scientists to subjectively interpret data – or manipulate it. For a brief time he considered abandoning a future in research and joining the police.



Van Assen, who Hartgerink met through a statistics course, helped put him on another path. Hartgerink learned that a growing number of scientists in every field were coming to agree that the most urgent task for their profession was to establish what results and methods could still be trusted – and that many of these people had begun to investigate the unpredictable human factors that, knowingly or not, knocked science off its course. What was more, he could be a part of it. Van Assen offered Hartgerink a place in his yet-unnamed research group. All of the current projects were on errors or general bias, but Van Assen proposed they go out and work closer to the fringes, developing methods that could detect fake data in published scientific literature.

“I’m not normally an expressive person,” Hartgerink told me. “But I said: ‘Hell, yes. Let’s do that.’”

Hartgerink and Van Assen believe not only that most scientific fraud goes undetected, but that the true rate of misconduct is far higher than 2%. “We cannot trust self reports,” Van Assen told me. “If you ask people, ‘At the conference, did you cheat on your fiancee?’ – people will very likely not admit this.”

Uri Simonsohn, a psychology professor at University of Pennsylvania’s Wharton School who gained notoriety as a “data vigilante” for exposing two serious cases of fraud in his field in 2012, believes that as much as 5% of all published research contains fraudulent data. “It’s not only in the periphery, it’s not only in the journals people don’t read,” he told me. “There are probably several very famous papers that have fake data, and very famous people who have done it.”
But as long as it remains undiscovered, there is a tendency for scientists to dismiss fraud in favour of more widely documented – and less seedy – issues. Even Arturo Casadevall, an American microbiologist who has published extensively on the rate, distribution, and detection of fraud in science, told me that despite his personal interest in the topic, my time would be better served investigating the broader issues driving the replication crisis. Fraud, he said, was “probably a relatively minor problem in terms of the overall level of science”.

This way of thinking goes back at least as far as scientists have been grappling with high-profile cases of misconduct. In 1983, Peter Medawar, the British immunologist and Nobel laureate, wrote in the London Review of Books: “The number of dishonest scientists cannot, of course, be known, but even if they were common enough to justify scary talk of ‘tips of icebergs’, they have not been so numerous as to prevent science’s having become the most successful enterprise (in terms of the fulfilment of declared ambitions) that human beings have ever engaged upon.”

From this perspective, as long as science continues doing what it does well – as long as genes are sequenced and chemicals classified and diseases reliably identified and treated – then fraud will remain a minor concern. But while this may be true in the long run, it may also be dangerously complacent. Furthermore, scientific misconduct can cause serious harm, as, for instance, in the case of patients treated by Paolo Macchiarini, a doctor at Karolinska Institute in Sweden who allegedly misrepresented the effectiveness of an experimental surgical procedure he had developed. Macchiarini is currently being investigated by a Swedish prosecutor after several of the patients who received the procedure later died.

Even in the more mundane business of day-to-day research, scientists are constantly building on past work, relying on its solidity to underpin their own theories. If misconduct really is as widespread as Hartgerink and Van Assen think, then false results are strewn across scientific literature, like unexploded mines that threaten any new structure built over them. At the very least, if science is truly invested in its ideal of self-correction, it seems essential to know the extent of the problem.

But there is little motivation within the scientific community to ramp up efforts to detect fraud. Part of this has to do with the way the field is organised. Science isn’t a traditional hierarchy, but a loose confederation of research groups, institutions, and professional organisations. Universities are clearly central to the scientific enterprise, but they are not in the business of evaluating scientific results, and as long as fraud doesn’t become public they have little incentive to go after it. There is also the questionable perception, although widespread in the scientific community, that there are already measures in place that preclude fraud. When Gore and his fellow congressmen held their hearings 35 years ago, witnesses routinely insisted that science had a variety of self-correcting mechanisms, such as peer-review and replication. But, as the science journalists William Broad and Nicholas Wade pointed out at the time, the vast majority of cases of fraud are actually exposed by whistleblowers, and that holds true to this day.
And so the enormous task of keeping science honest is left to individual scientists in the hope that they will police themselves, and each other. “Not only is it not sustainable,” said Simonsohn, “it doesn’t even work. You only catch the most obvious fakers, and only a small share of them.” There is also the problem of relying on whistleblowers, who face the thankless and emotionally draining prospect of accusing their own colleagues of fraud. (“It’s like saying someone is a paedophile,” one of the students at Tilburg told me.) Neither Simonsohn nor any of the Tilburg whistleblowers I interviewed said they would come forward again. “There is no way we as a field can deal with fraud like this,” the student said. “There has to be a better way.”

In the winter of 2013, soon after Hartgerink began working with Van Assen, they began to investigate another social psychology researcher who they noticed was reporting suspiciously large effect sizes, one of the “tells” that doomed Stapel. When they requested that the researcher provide additional data to verify her results, she stalled – claiming that she was undergoing treatment for stomach cancer. Months later, she informed them that she had deleted all the data in question. But instead of contacting the researcher’s co-authors for copies of the data, or digging deeper into her previous work, they opted to let it go.

They had been thoroughly stonewalled, and they knew that trying to prosecute individual cases of fraud – the “pitbull” approach that Van Assen had taken when investigating Stapel – would never expose more than a handful of dishonest scientists. What they needed was a way to analyse vast quantities of data in search of signs of manipulation or error, which could then be flagged for public inspection without necessarily accusing the individual scientists of deliberate misconduct. After all, putting a fence around a minefield has many of the same benefits as clearing it, with none of the tricky business of digging up the mines.

As Van Assen had earlier argued in a letter to the journal Nature, the traditional approach to investigating other scientists was needlessly fraught – since it combined the messy task of proving that a researcher had intended to commit fraud with a much simpler technical problem: whether the data underlying their results was valid. The two issues, he argued, could be separated.

Scientists can commit fraud in a multitude of ways. In 1974, the American immunologist William Summerlin famously tried to pass off a patch of skin on a mouse, darkened with permanent marker pen, as a successful interspecies skin graft. But most instances are more mundane: the majority of fraud cases in recent years have emerged from scientists either falsifying images – deliberately mislabelling scans and micrographs – or fabricating or altering their recorded data. And scientists have used statistical tests to scrutinise each other’s data since at least the 1930s, when Ronald Fisher, the father of biostatistics, used a basic chi-squared test to suggest that Gregor Mendel, the father of genetics, had cherrypicked some of his data.
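
As a sketch of the kind of test Fisher applied (my own illustration with made-up counts, not Mendel’s historical data), a chi-squared goodness-of-fit test asks how well observed counts match an expected 3:1 Mendelian ratio; Fisher’s argument was that, across many experiments, Mendel’s data fitted expectation suspiciously well.

# Illustrative chi-squared goodness-of-fit test in the spirit of Fisher's
# analysis (made-up counts, not Mendel's data): do the observed counts fit
# an expected 3:1 ratio?
from scipy.stats import chisquare

observed = [752, 248]                    # hypothetical dominant vs recessive counts
total = sum(observed)
expected = [0.75 * total, 0.25 * total]  # the 3:1 Mendelian ratio

chi2, p = chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
# Fits that are "too good", test after test, were the basis of Fisher's
# suggestion that the data had been cherrypicked.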

In 2014, Hartgerink and Van Assen started to sort through the variety of tests used in ad-hoc investigations of fraud in order to determine which were powerful and versatile enough to reliably detect statistical anomalies across a wide range of fields. After narrowing down a promising arsenal of tests, they hit a tougher problem. To prove that their methods work, Hartgerink and Van Assen have to show they can reliably distinguish false from real data. But research misconduct is relatively uncharted territory. Only a handful of cases come to light each year – a dismally small sample size – so it’s hard to get an idea of what constitutes “normal” fake data, what its features and particular quirks are. Hartgerink devised a workaround, challenging other academics to produce simple fake datasets, a sort of game to see if they could come up with data that looked real enough to fool the statistical tests, with an Amazon gift card as a prize.

By 2015, the Meta-Research group had expanded to seven researchers, and Hartgerink was helping his colleagues with a separate error-detection project that would become Statcheck. He was pleased with the study that Michèle Nuijten published that autumn, which used Statcheck to show that something like half of all published psychology papers appeared to contain calculation errors, but as he tinkered with the program and the database of psychology papers they had assembled, he found himself increasingly uneasy about what he saw as the closed and secretive culture of science.
When scientists publish papers in journals, they release only the data they wish to share. Critical evaluation of the results by other scientists – peer review – takes place in secret and the discussion is not released publicly. Once a paper is published, all comments, concerns, and retractions must go through the editors of the journal before they reach the public. There are good, or at least defensible, arguments for all of this. But Hartgerink is part of an increasingly vocal group that believes that the closed nature of science, with authority resting in the hands of specific gatekeepers – journals, universities, and funders – is harmful, and that a more open approach would better serve the scientific method.

Hartgerink realised that with a few adjustments to Statcheck, he could make public all the statistical errors it had exposed. He hoped that this would shift the conversation away from talk of broad, representative results – such as the proportion of studies that contained errors – and towards a discussion of the individual papers and their mistakes. The critique would be complete, exhaustive, and in the public domain, where the authors could address it; everyone else could draw their own conclusions.

In August 2016, with his colleagues’ blessing, he posted the full set of Statcheck results publicly on the anonymous science message board PubPeer. At first there was praise on Twitter and science blogs, which skew young and progressive – and then, condemnations, largely from older scientists, who feared an intrusive new world of public blaming and shaming. In December, after everyone had weighed in, Nature, a bellwether of mainstream scientific thought for more than a century, cautiously supported a future of automated scientific scrutiny in an editorial that addressed the Statcheck controversy without explicitly naming it. Its conclusion seemed to endorse Hartgerink’s approach, that “criticism itself must be embraced”.

In the same month, the Office of Research Integrity (ORI), an obscure branch of the US National Institutes of Health, awarded Hartgerink a small grant – about $100,000 – to pursue new projects investigating misconduct, including the completion of his program to detect fabricated data. For Hartgerink and Van Assen, who had not received any outside funding for their research, it felt like vindication.

Yet change in science comes slowly, if at all, Van Assen reminded me. The current push for more open and accountable science, of which they are a part, has “only really existed since 2011”, he said. It has captured an outsize share of the science media’s attention, and set laudable goals, but it remains a small, fragile outpost of true believers within the vast scientific enterprise. “I have the impression that many scientists in this group think that things are going to change,” Van Assen said. “Chris, Michèle, they are quite optimistic. I think that’s bias. They talk to each other all the time.”

When I asked Hartgerink what it would take to totally eradicate fraud from the scientific process, he suggested that scientists make all of their data public, register the intentions of their work before conducting experiments to prevent post-hoc reasoning, and have their results checked by algorithms during and after the publishing process.

To any working scientist – currently enjoying nearly unprecedented privacy and freedom for a profession that is in large part publicly funded – Hartgerink’s vision would be an unimaginably draconian scientific surveillance state. For his part, Hartgerink believes the preservation of public trust in science requires nothing less – but in the meantime, he intends to pursue this ideal without the explicit consent of the entire scientific community, by investigating published papers and making the results available to the public.

Even scientists who have done similar work uncovering fraud have reservations about Van Assen and Hartgerink’s approach. In January, I met with Dr John Carlisle and Dr Steve Yentis at an anaesthetics conference that took place in London, near Westminster Abbey. In 2012, Yentis, then the editor of the journal Anaesthesia, asked Carlisle to investigate data from a researcher named Yoshitaka Fujii, who the community suspected was falsifying clinical trials. In time, Carlisle demonstrated that 168 of Fujii’s trials contained dubious statistical results. Yentis and the other journal editors contacted Fujii’s employers, who launched a full investigation. Fujii currently sits at the top of the RetractionWatch leaderboard with 183 retracted studies. By sheer numbers he is the biggest scientific fraud in recorded history.


Carlisle, who, like Van Assen, found that he enjoyed the detective work (“it takes a certain personality, or personality disorder”, he said), showed me his latest project, a larger-scale analysis of the rate of suspicious clinical trial results across multiple fields of medicine. He and Yentis discussed their desire to automate these statistical tests – which, in theory, would look a lot like what Hartgerink and Van Assen are developing – but they have no plans to make the results public; instead they envision that journal editors might use the tests to screen incoming articles for signs of possible misconduct.

“It is an incredibly difficult balance,” said Yentis, “you’re saying to a person, ‘I think you’re a liar.’ We have to decide how many fraudulent papers are worth one false accusation. How many is too many?”

With the introduction of programs such as Statcheck, and the growing desire to conduct as much of the critical conversation as possible in public view, Yentis expects a stormy reckoning with those very questions. “That’s a big debate that hasn’t happened,” he said, “and it’s because we simply haven’t had the tools.”

For all their dispassionate distance, when Hartgerink and Van Assen say that they are simply identifying data that “cannot be trusted”, they mean flagging papers and authors that fail their tests. And, as they learned with Statcheck, for many scientists, that will be indistinguishable from an accusation of deceit. When Hartgerink eventually deploys his fraud-detection program, it will flag up some very real instances of fraud, as well as many unintentional errors and false positives – and present all of the results in a messy pile for the scientific community to sort out. Simonsohn called it “a bit like leaving a loaded gun on a playground”.

When I put this question to Van Assen, he told me it was certain that some scientists would be angered or offended by having their work and its possible errors exposed and discussed. He didn’t want to make anyone feel bad, he said – but he didn’t feel bad about it. Science should be about transparency, criticism, and truth.

“The problem, also with scientists, is that people think they are important, they think they have a special purpose in life,” he said. “Maybe you too. But that’s a human bias. I think when you look at it objectively, individuals don’t matter at all. We should only look at what is good for science and society.”

Thursday 19 January 2017

How statistics lost their power

William Davies in The Guardian


In theory, statistics should help settle arguments. They ought to provide stable reference points that everyone – no matter what their politics – can agree on. Yet in recent years, divergent levels of trust in statistics have become one of the key schisms that have opened up in western liberal democracies. Shortly before the November presidential election, a study in the US discovered that 68% of Trump supporters distrusted the economic data published by the federal government. In the UK, a research project by Cambridge University and YouGov looking at conspiracy theories discovered that 55% of the population believes that the government “is hiding the truth about the number of immigrants living here”.

Rather than defusing controversy and polarisation, it seems as if statistics are actually stoking them. Antipathy to statistics has become one of the hallmarks of the populist right, with statisticians and economists chief among the various “experts” that were ostensibly rejected by voters in 2016. Not only are statistics viewed by many as untrustworthy; there appears to be something almost insulting or arrogant about them. Reducing social and economic issues to numerical aggregates and averages seems to violate some people’s sense of political decency.

Nowhere is this more vividly manifest than with immigration. The thinktank British Future has studied how best to win arguments in favour of immigration and multiculturalism. One of its main findings is that people often respond warmly to qualitative evidence, such as the stories of individual migrants and photographs of diverse communities. But statistics – especially regarding the alleged benefits of migration to Britain’s economy – elicit quite the opposite reaction. People assume that the numbers are manipulated and dislike the elitism of resorting to quantitative evidence. When people are presented with official estimates of how many immigrants are in the country illegally, a common response is to scoff. Far from increasing support for immigration, British Future found, pointing to its positive effect on GDP can actually make people more hostile to it. GDP itself has come to seem like a Trojan horse for an elitist liberal agenda. Sensing this, politicians have now largely abandoned discussing immigration in economic terms.

All of this presents a serious challenge for liberal democracy. Put bluntly, the British government – its officials, experts, advisers and many of its politicians – does believe that immigration is on balance good for the economy. The British government did believe that Brexit was the wrong choice. The problem is that the government is now engaged in self-censorship, for fear of provoking people further.

This is an unwelcome dilemma. Either the state continues to make claims that it believes to be valid and is accused by sceptics of propaganda, or else, politicians and officials are confined to saying what feels plausible and intuitively true, but may ultimately be inaccurate. Either way, politics becomes mired in accusations of lies and cover-ups.

The declining authority of statistics – and the experts who analyse them – is at the heart of the crisis that has become known as “post-truth” politics. And in this uncertain new world, attitudes towards quantitative expertise have become increasingly divided. From one perspective, grounding politics in statistics is elitist, undemocratic and oblivious to people’s emotional investments in their community and nation. It is just one more way that privileged people in London, Washington DC or Brussels seek to impose their worldview on everybody else. From the opposite perspective, statistics are quite the opposite of elitist. They enable journalists, citizens and politicians to discuss society as a whole, not on the basis of anecdote, sentiment or prejudice, but in ways that can be validated. The alternative to quantitative expertise is less likely to be democracy than an unleashing of tabloid editors and demagogues to provide their own “truth” of what is going on across society.


Is there a way out of this polarisation? Must we simply choose between a politics of facts and one of emotions, or is there another way of looking at this situation? One way is to view statistics through the lens of their history. We need to try to see them for what they are: neither unquestionable truths nor elite conspiracies, but tools designed to simplify the job of government, for better or worse. Viewed historically, we can see what a crucial role statistics have played in our understanding of nation states and their progress. This raises the alarming question of how – if at all – we will continue to have common ideas of society and collective progress, should statistics fall by the wayside.

In the second half of the 17th century, in the aftermath of prolonged and bloody conflicts, European rulers adopted an entirely new perspective on the task of government, focused upon demographic trends – an approach made possible by the birth of modern statistics. Since ancient times, censuses had been used to track population size, but these were costly and laborious to carry out and focused on citizens who were considered politically important (property-owning men), rather than society as a whole. Statistics offered something quite different, transforming the nature of politics in the process.

Statistics were designed to give an understanding of a population in its entirety, rather than simply to pinpoint strategically valuable sources of power and wealth. In the early days, this didn’t always involve producing numbers. In Germany, for example (from where we get the term Statistik), the challenge was to map disparate customs, institutions and laws across an empire of hundreds of micro-states. What characterised this knowledge as statistical was its holistic nature: it aimed to produce a picture of the nation as a whole. Statistics would do for populations what cartography did for territory.

Equally significant was the inspiration of the natural sciences. Thanks to standardised measures and mathematical techniques, statistical knowledge could be presented as objective, in much the same way as astronomy. Pioneering English demographers such as William Petty and John Graunt adapted mathematical techniques to estimate population changes, for which they were hired by Oliver Cromwell and Charles II.

The emergence in the late 17th century of government advisers claiming scientific authority, rather than political or military acumen, represents the origins of the “expert” culture now so reviled by populists. These path-breaking individuals were neither pure scholars nor government officials, but hovered somewhere between the two. They were enthusiastic amateurs who offered a new way of thinking about populations that privileged aggregates and objective facts. Thanks to their mathematical prowess, they believed they could calculate what would otherwise require a vast census to discover.

There was initially only one client for this type of expertise, and the clue is in the word “statistics”. Only centralised nation states had the capacity to collect data across large populations in a standardised fashion and only states had any need for such data in the first place. Over the second half of the 18th century, European states began to collect more statistics of the sort that would appear familiar to us today. Casting an eye over national populations, states became focused upon a range of quantities: births, deaths, baptisms, marriages, harvests, imports, exports, price fluctuations. Things that would previously have been registered locally and variously at parish level became aggregated at a national level.

New techniques were developed to represent these indicators, which exploited both the vertical and horizontal dimensions of the page, laying out data in matrices and tables, just as merchants had done with the development of standardised book-keeping techniques in the late 15th century. Organising numbers into rows and columns offered a powerful new way of displaying the attributes of a given society. Large, complex issues could now be surveyed simply by scanning the data laid out geometrically across a single page.

These innovations carried extraordinary potential for governments. By simplifying diverse populations down to specific indicators, and displaying them in suitable tables, governments could circumvent the need to acquire broader detailed local and historical insight. Of course, viewed from a different perspective, this blindness to local cultural variability is precisely what makes statistics vulgar and potentially offensive. Regardless of whether a given nation had any common cultural identity, statisticians would assume some standard uniformity or, some might argue, impose that uniformity upon it.

Not every aspect of a given population can be captured by statistics. There is always an implicit choice in what is included and what is excluded, and this choice can become a political issue in its own right. The fact that GDP only captures the value of paid work, thereby excluding the work traditionally done by women in the domestic sphere, has made it a target of feminist critique since the 1960s. In France, it has been illegal to collect census data on ethnicity since 1978, on the basis that such data could be used for racist political purposes. (This has the side-effect of making systemic racism in the labour market much harder to quantify.)

Despite these criticisms, the aspiration to depict a society in its entirety, and to do so in an objective fashion, has meant that various progressive ideals have been attached to statistics. The image of statistics as a dispassionate science of society is only one part of the story. The other part is about how powerful political ideals became invested in these techniques: ideals of “evidence-based policy”, rationality, progress and nationhood grounded in facts, rather than in romanticised stories.

Since the high-point of the Enlightenment in the late 18th century, liberals and republicans have invested great hope that national measurement frameworks could produce a more rational politics, organised around demonstrable improvements in social and economic life. The great theorist of nationalism, Benedict Anderson, famously described nations as “imagined communities”, but statistics offer the promise of anchoring this imagination in something tangible. Equally, they promise to reveal what historical path the nation is on: what kind of progress is occurring? How rapidly? For Enlightenment liberals, who saw nations as moving in a single historical direction, this question was crucial.

The potential of statistics to reveal the state of the nation was seized in post-revolutionary France. The Jacobin state set about imposing a whole new framework of national measurement and national data collection. The world’s first official bureau of statistics was opened in Paris in 1800. Uniformity of data collection, overseen by a centralised cadre of highly educated experts, was an integral part of the ideal of a centrally governed republic, which sought to establish a unified, egalitarian society.

From the Enlightenment onwards, statistics played an increasingly important role in the public sphere, informing debate in the media, providing social movements with evidence they could use. Over time, the production and analysis of such data became less dominated by the state. Academic social scientists began to analyse data for their own purposes, often entirely unconnected to government policy goals. By the late 19th century, reformers such as Charles Booth in London and WEB Du Bois in Philadelphia were conducting their own surveys to understand urban poverty.


 

To recognise how statistics have been entangled in notions of national progress, consider the case of GDP. GDP is an estimate of the sum total of a nation’s consumer spending, government spending, investments and trade balance (exports minus imports), which is represented in a single number. This is fiendishly difficult to get right, and efforts to calculate this figure began, like so many mathematical techniques, as a matter of marginal, somewhat nerdish interest during the 1930s. It was only elevated to a matter of national political urgency by the second world war, when governments needed to know whether the national population was producing enough to keep up the war effort. In the decades that followed, this single indicator, though never without its critics, took on a hallowed political status, as the ultimate barometer of a government’s competence. Whether GDP is rising or falling is now virtually a proxy for whether society is moving forwards or backwards.
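
In the expenditure approach described here, the single number is simply a sum (standard textbook notation, not the article’s own):

\[
\text{GDP} = C + I + G + (X - M),
\]

where \(C\) is consumer spending, \(I\) investment, \(G\) government spending, \(X\) exports and \(M\) imports.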

Or take the example of opinion polling, an early instance of statistical innovation occurring in the private sector. During the 1920s, statisticians developed methods for identifying a representative sample of survey respondents, so as to glean the attitudes of the public as a whole. This breakthrough, which was first seized upon by market researchers, soon led to the birth of opinion polling. This new industry immediately became the object of public and political fascination, as the media reported on what this new science told us about what “women” or “Americans” or “manual labourers” thought about the world.
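
The statistical point behind that breakthrough is that, for a properly random sample, the uncertainty of an estimate depends on the sample size rather than on the size of the population. A rough sketch of the familiar margin-of-error calculation (illustrative only, not a reconstruction of the 1920s pollsters' actual methods):

    import math

    def margin_of_error(p, n, z=1.96):
        """Approximate 95% margin of error for an estimated proportion p
        from a simple random sample of size n."""
        return z * math.sqrt(p * (1 - p) / n)

    # About 1,000 respondents pin down a 50/50 split to within roughly 3 points,
    # regardless of whether the population is a city or a whole country.
    print(round(margin_of_error(0.5, 1000), 3))  # ~0.031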

Nowadays, the flaws of polling are endlessly picked apart. But this is partly due to the tremendous hopes that have been invested in polling since its origins. It is only to the extent that we believe in mass democracy that we are so fascinated or concerned by what the public thinks. But for the most part it is thanks to statistics, and not to democratic institutions as such, that we can know what the public thinks about specific issues. We underestimate how much of our sense of “the public interest” is rooted in expert calculation, as opposed to democratic institutions.

As indicators of health, prosperity, equality, opinion and quality of life have come to tell us who we are collectively and whether things are getting better or worse, politicians have leaned heavily on statistics to buttress their authority. Often, they lean too heavily, stretching evidence too far, interpreting data too loosely, to serve their cause. But that is an inevitable hazard of the prevalence of numbers in public life, and need not necessarily trigger the type of wholehearted rejections of expertise that we have witnessed recently.

In many ways, the contemporary populist attack on “experts” is born out of the same resentment as the attack on elected representatives. In talking of society as a whole, in seeking to govern the economy as a whole, both politicians and technocrats are believed to have “lost touch” with how it feels to be a single citizen in particular. Both statisticians and politicians have fallen into the trap of “seeing like a state”, to use a phrase from the anarchist political thinker James C Scott. Speaking scientifically about the nation – for instance in terms of macroeconomics – is an insult to those who would prefer to rely on memory and narrative for their sense of nationhood, and are sick of being told that their “imagined community” does not exist.

On the other hand, statistics (together with elected representatives) did an adequate job of supporting a credible public discourse for decades, if not centuries. What changed?

The crisis of statistics is not quite as sudden as it might seem. For roughly three and a half centuries, the great achievement of statisticians has been to reduce the complexity and fluidity of national populations into manageable, comprehensible facts and figures. Yet in recent decades, the world has changed dramatically, thanks to the cultural politics that emerged in the 1960s and the reshaping of the global economy that began soon after. It is not clear that the statisticians have always kept pace with these changes. Traditional forms of statistical classification and definition are coming under strain from more fluid identities, attitudes and economic pathways. Efforts to represent demographic, social and economic changes in terms of simple, well-recognised indicators are losing legitimacy.

Consider the changing political and economic geography of nation states over the past 40 years. The statistics that dominate political debate are largely national in character: poverty levels, unemployment, GDP, net migration. But the geography of capitalism has been pulling in somewhat different directions. Plainly globalisation has not rendered geography irrelevant. In many cases it has made the location of economic activity far more important, exacerbating the inequality between successful locations (such as London or San Francisco) and less successful locations (such as north-east England or the US rust belt). The key geographic units involved are no longer nation states. Rather, it is cities, regions or individual urban neighbourhoods that are rising and falling.

The Enlightenment ideal of the nation as a single community, bound together by a common measurement framework, is harder and harder to sustain. If you live in one of the towns in the Welsh valleys that was once dependent on steel manufacturing or mining for jobs, politicians talking of how “the economy” is “doing well” are likely to breed additional resentment. From that standpoint, the term “GDP” fails to capture anything meaningful or credible.

When macroeconomics is used to make a political argument, this implies that the losses in one part of the country are offset by gains somewhere else. Headline-grabbing national indicators, such as GDP and inflation, conceal all sorts of localised gains and losses that are less commonly discussed by national politicians. Immigration may be good for the economy overall, but this does not mean that there are no local costs at all. So when politicians use national indicators to make their case, they implicitly assume some spirit of patriotic mutual sacrifice on the part of voters: you might be the loser on this occasion, but next time you might be the beneficiary. But what if the tables are never turned? What if the same city or region wins over and over again, while others always lose? On what principle of give and take is that justified?

In Europe, the currency union has exacerbated this problem. The indicators that matter to the European Central Bank (ECB), for example, are those representing half a billion people. The ECB is concerned with the inflation or unemployment rate across the eurozone as if it were a single homogeneous territory, at the same time as the economic fate of European citizens is splintering in different directions, depending on which region, city or neighbourhood they happen to live in. Official knowledge becomes ever more abstracted from lived experience, until that knowledge simply ceases to be relevant or credible.

The privileging of the nation as the natural scale of analysis is one of the inbuilt biases of statistics that years of economic change have eaten away at. Another inbuilt bias that is coming under increasing strain is classification. Part of the job of statisticians is to classify people by putting them into a range of boxes that the statistician has created: employed or unemployed, married or unmarried, pro-Europe or anti-Europe. So long as people can be placed into categories in this way, it becomes possible to discern how far a given classification extends across the population.

This can involve somewhat reductive choices. To count as unemployed, for example, a person has to report to a survey that they are involuntarily out of work, even if the reality is more complicated than that. Many people move in and out of work all the time, for reasons that might have as much to do with health and family needs as labour market conditions. But thanks to this simplification, it becomes possible to identify the rate of unemployment across the population as a whole.
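
A minimal sketch of how such reductive classification yields a single headline figure (the records and category names here are invented for illustration, not any statistical office's actual definitions):

    # Everyone is forced into exactly one box; the headline rate follows mechanically.
    people = [
        {"name": "A", "status": "employed"},
        {"name": "B", "status": "unemployed"},  # involuntarily out of work and seeking it
        {"name": "C", "status": "inactive"},    # e.g. caring responsibilities or ill health
        {"name": "D", "status": "employed"},
    ]

    labour_force = [p for p in people if p["status"] in ("employed", "unemployed")]
    unemployed = [p for p in labour_force if p["status"] == "unemployed"]

    # Unemployment rate = unemployed / (employed + unemployed)
    rate = len(unemployed) / len(labour_force)
    print(f"Unemployment rate: {rate:.0%}")  # 33% in this toy example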

Here’s a problem, though. What if many of the defining questions of our age are not answerable in terms of the extent of people encompassed, but the intensity with which people are affected? Unemployment is one example. The fact that Britain got through the Great Recession of 2008-13 without unemployment rising substantially is generally viewed as a positive achievement. But the focus on “unemployment” masked the rise of underemployment, that is, people not getting a sufficient amount of work or being employed at a level below that which they are qualified for. This currently accounts for around 6% of the “employed” labour force. Then there is the rise of the self-employed workforce, where the divide between “employed” and “involuntarily unemployed” makes little sense.

This is not a criticism of bodies such as the Office for National Statistics (ONS), which does now produce data on underemployment. But so long as politicians continue to deflect criticism by pointing to the unemployment rate, the experiences of those struggling to get enough work or to live on their wages go unrepresented in public debate. It wouldn’t be all that surprising if these same people became suspicious of policy experts and the use of statistics in political debate, given the mismatch between what politicians say about the labour market and the lived reality.

The rise of identity politics since the 1960s has put additional strain on such systems of classification. Statistical data is only credible if people will accept the limited range of demographic categories that are on offer, which are selected by the expert not the respondent. But where identity becomes a political issue, people demand to define themselves on their own terms, where gender, sexuality, race or class is concerned.

Opinion polling may be suffering for similar reasons. Polls have traditionally captured people’s attitudes and preferences, on the reasonable assumption that people will behave accordingly. But in an age of declining political participation, it is not enough simply to know which box someone would prefer to put an “X” in. One also needs to know whether they feel strongly enough about doing so to bother. And when it comes to capturing such fluctuations in emotional intensity, polling is a clumsy tool.

Statistics have faced criticism regularly over their long history. The challenges that identity politics and globalisation present to them are not new either. Why then do the events of the past year feel quite so damaging to the ideal of quantitative expertise and its role in political debate?

In recent years, a new way of quantifying and visualising populations has emerged that potentially pushes statistics to the margins, ushering in a different era altogether. Statistics, collected and compiled by technical experts, are giving way to data that accumulates by default, as a consequence of sweeping digitisation. Traditionally, statisticians have known which questions they wanted to ask regarding which population, then set out to answer them. By contrast, data is automatically produced whenever we swipe a loyalty card, comment on Facebook or search for something on Google. As our cities, cars, homes and household objects become digitally connected, the amount of data we leave in our trail will grow even greater. In this new world, data is captured first and research questions come later.

In the long term, the implications of this will probably be as profound as the invention of statistics was in the late 17th century. The rise of “big data” provides far greater opportunities for quantitative analysis than any amount of polling or statistical modelling. But it is not just the quantity of data that is different. It represents an entirely different type of knowledge, accompanied by a new mode of expertise.

First, there is no fixed scale of analysis (such as the nation) nor any settled categories (such as “unemployed”). These vast new data sets can be mined in search of patterns, trends, correlations and emergent moods. It becomes a way of tracking the identities that people bestow upon themselves (such as “#ImwithCorbyn” or “entrepreneur”) rather than imposing classifications upon them. This is a form of aggregation suitable to a more fluid political age, in which not everything can be reliably referred back to some Enlightenment ideal of the nation state as guardian of the public interest.
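
The contrast with the old classification logic can be shown in a few lines: rather than assigning respondents to boxes fixed in advance, the analyst simply aggregates whatever labels people attach to themselves. A toy sketch with invented posts:

    import re
    from collections import Counter

    posts = [
        "Proud entrepreneur, building something new #startuplife",
        "Off to the rally #ImwithCorbyn",
        "Another long day, but worth it #ImwithCorbyn",
        "New podcast episode for fellow entrepreneurs",
    ]

    # Emergent categories: whatever hashtags people use, not boxes chosen by an expert.
    tags = Counter(tag.lower() for post in posts for tag in re.findall(r"#\w+", post))
    print(tags.most_common())  # [('#imwithcorbyn', 2), ('#startuplife', 1)]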

Second, the majority of us are entirely oblivious to what all this data says about us, either individually or collectively. There is no equivalent of an Office for National Statistics for commercially collected big data. We live in an age in which our feelings, identities and affiliations can be tracked and analysed with unprecedented speed and sensitivity – but there is nothing that anchors this new capacity in the public interest or public debate. There are data analysts who work for Google and Facebook, but they are not “experts” of the sort who generate statistics and who are now so widely condemned. The anonymity and secrecy of the new analysts potentially makes them far more politically powerful than any social scientist.

A company such as Facebook has the capacity to carry out quantitative social science on hundreds of millions of people, at very low cost. But it has very little incentive to reveal the results. In 2014, when Facebook researchers published results of a study of “emotional contagion” that they had carried out on their users – in which they altered news feeds to see how it affected the content that users then shared in response – there was an outcry that people were being unwittingly experimented on. So, from Facebook’s point of view, why go to all the hassle of publishing? Why not just do the study and keep quiet?

What is most politically significant about this shift from a logic of statistics to one of data is how comfortably it sits with the rise of populism. Populist leaders can heap scorn upon traditional experts, such as economists and pollsters, while trusting in a different form of numerical analysis altogether. Such politicians rely on a new, less visible elite, who seek out patterns from vast data banks, but rarely make any public pronouncements, let alone publish any evidence. These data analysts are often physicists or mathematicians, whose skills are not developed for the study of society at all. This, for example, is the worldview propagated by Dominic Cummings, former adviser to Michael Gove and campaign director of Vote Leave. “Physics, mathematics and computer science are domains in which there are real experts, unlike macro-economic forecasting,” Cummings has argued.

Figures close to Donald Trump, such as his chief strategist Steve Bannon and the Silicon Valley billionaire Peter Thiel, are closely acquainted with cutting-edge data analytics techniques, via companies such as Cambridge Analytica, on whose board Bannon sits. During the presidential election campaign, Cambridge Analytica drew on various data sources to develop psychological profiles of millions of Americans, which it then used to help Trump target voters with tailored messaging.

This ability to develop and refine psychological insights across large populations is one of the most innovative and controversial features of the new data analysis. As techniques of “sentiment analysis”, which detect the mood of large numbers of people by tracking indicators such as word usage on social media, become incorporated into political campaigns, the emotional allure of figures such as Trump will become amenable to scientific scrutiny. In a world where the political feelings of the general public are becoming this traceable, who needs pollsters?
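
In its crudest, lexicon-based form, sentiment analysis is little more than counting mood-laden words across a stream of posts. A minimal sketch, with an invented word list and invented posts (real systems use far larger lexicons or trained models):

    # Toy sentiment scorer: positive hits minus negative hits, averaged per post.
    POSITIVE = {"great", "win", "hope", "love", "strong"}
    NEGATIVE = {"rigged", "fear", "crooked", "weak", "disaster"}

    def mood_score(posts):
        score = 0
        for post in posts:
            words = set(post.lower().split())
            score += len(words & POSITIVE) - len(words & NEGATIVE)
        return score / len(posts)

    print(round(mood_score(["What a great win", "The system is rigged", "So much hope today"]), 2))  # 0.67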

Few social findings arising from this kind of data analytics ever end up in the public domain. This means that it does very little to help anchor political narrative in any shared reality. With the authority of statistics waning, and nothing stepping into the public sphere to replace it, people can live in whatever imagined community they feel most aligned to and willing to believe in. Where statistics can be used to correct faulty claims about the economy or society or population, in an age of data analytics there are few mechanisms to prevent people from giving way to their instinctive reactions or emotional prejudices. On the contrary, companies such as Cambridge Analytica treat those feelings as things to be tracked.

But even if there were an Office for Data Analytics, acting on behalf of the public and government as the ONS does, it is not clear that it would offer the kind of neutral perspective that liberals today are struggling to defend. The new apparatus of number-crunching is well suited to detecting trends, sensing the mood and spotting things as they bubble up. It serves campaign managers and marketers very well. It is less well suited to making the kinds of unambiguous, objective, potentially consensus-forming claims about society that statisticians and economists are paid for.

In this new technical and political climate, it will fall to the new digital elite to identify the facts, projections and truth amid the rushing stream of data that results. Whether indicators such as GDP and unemployment continue to carry political clout remains to be seen, but if they don’t, it won’t necessarily herald the end of experts, still less the end of truth. The question to be taken more seriously, now that numbers are being constantly generated behind our backs and beyond our knowledge, is where the crisis of statistics leaves representative democracy.

On the one hand, it is worth recognising the capacity of long-standing political institutions to fight back. Just as “sharing economy” platforms such as Uber and Airbnb have recently been thwarted by legal rulings (Uber being compelled to recognise drivers as employees, Airbnb being banned altogether by some municipal authorities), privacy and human rights law represents a potential obstacle to the extension of data analytics. What is less clear is how the benefits of digital analytics might ever be offered to the public, in the way that many statistical data sets are. Bodies such as the Open Data Institute, co-founded by Tim Berners-Lee, campaign to make data publicly available, but have little leverage over the corporations where so much of our data now accumulates. Statistics began life as a tool through which the state could view society, but gradually developed into something that academics, civic reformers and businesses had a stake in. But for many data analytics firms, secrecy surrounding methods and sources of data is a competitive advantage that they will not give up voluntarily.

A post-statistical society is a potentially frightening proposition, not because it would lack any forms of truth or expertise altogether, but because it would drastically privatise them. Statistics are one of many pillars of liberalism, indeed of the Enlightenment. The experts who produce and use them have come to be painted as arrogant and oblivious to the emotional and local dimensions of politics. No doubt there are ways in which data collection could be adapted to reflect lived experiences better. But the battle that will need to be waged in the long term is not between an elite-led politics of facts and a populist politics of feeling. It is between those still committed to public knowledge and public argument and those who profit from the ongoing disintegration of those things.

Thursday 13 October 2016

The numbers behind dropped catches and missed stumpings

Charles Davis in Cricinfo 

For all its bewildering array of data, cricket statistics still has a few blind spots. One of the most obvious is in the area of missed chances, where there have been few extensive studies. Gerald Brodribb in Next Man In mentioned that statistician RH Campbell estimated that 30% of catches were missed in Tests in the 1920s. I have seen a figure of more than 30 dropped catches by West Indies in Australia in 1968-69, a team plagued by poor fielding. But really basic questions like "Overall, what percentage of chances are dropped?" lack answers.

For a number of years I have collected all the missed chances I could find in ESPNcricinfo's ball-by-ball texts for Test matches. Since the site does not always use standard terms to describe missed chances, and different ball-by-ball commentators have their own ways of expressing themselves, I searched the text for 40 or more words and phrases that might indicate a miss, from "drop" and "dolly" to "shell", "grass" and "hash". The process generally flagged about 100 to 200 lines in the commentary for each Test, which I then searched manually to identify real chances. For some Tests, I also confirmed data by checking match reports and other ball-by-ball sources.
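
To make that first, automated pass concrete, here is a minimal sketch in Python of the kind of keyword filter described above. The word list shows only the handful of terms mentioned in this piece (the real list ran to 40-plus words and phrases), and the commentary lines are invented for illustration; every flagged line still needs a manual read.

    # Flag ball-by-ball commentary lines that might describe a missed chance.
    MISS_WORDS = ["drop", "dolly", "shell", "grass", "hash"]  # "drop" also matches "dropped", "drops"

    def flag_possible_misses(commentary_lines):
        """Return (line_number, text) pairs that deserve a manual check."""
        flagged = []
        for number, line in enumerate(commentary_lines, start=1):
            text = line.lower()
            if any(word in text for word in MISS_WORDS):
                flagged.append((number, line))
        return flagged

    commentary = [
        "43.2: short and wide, slashed hard, edged and taken at slip!",
        "43.3: put down! a regulation dolly goes to ground at mid-on",
        "44.1: driven firmly, grassed at short cover, a real let-off there",
    ]
    for number, line in flag_possible_misses(commentary):
        print(number, line)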

While the commentary goes back to 1999, the textual detail can be patchy in the early years. I logged missed chances from late 2000 onwards, but consider that data to be substantially complete only from 2003. I have compiled a list of over 4000 missed chances in Tests from this century; about a third of all Tests (635) are represented.

Unavoidably, there are caveats. Sometimes opinions may vary as to whether a chance should be considered a miss. I take a hard line: "half", "technical" and "academic" chances are included, and I try to include any chances where the fielder failed to touch the ball but should have done so, if they can be identified. Edges passing between the wicketkeeper and first slip are considered chances even if no one has touched the ball. Since 2005, I have divided chances into two categories, "normal" and "difficult", according to how they are described. About half fall into each category.

There will always be uncertainty about some dropped catches, as there is always the possibility that some others have been overlooked. However, as long as the collection method is as consistent and exhaustive as possible, I would argue that a great majority of misses have been identified and that the data can be collated into useful statistics.
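
Once the chances are logged, collating them is simple arithmetic: for any grouping (team, year, bowler, fielding position), the miss rate is misses divided by total chances, i.e. misses plus catches and stumpings taken. A minimal sketch, using invented records purely to show the calculation:

    from collections import defaultdict

    # Each record: (fielding_team, outcome), where outcome is "taken" or "missed".
    chances = [
        ("Australia", "taken"), ("Australia", "missed"), ("Australia", "taken"),
        ("Bangladesh", "missed"), ("Bangladesh", "taken"), ("Bangladesh", "missed"),
    ]

    totals = defaultdict(lambda: {"taken": 0, "missed": 0})
    for team, outcome in chances:
        totals[team][outcome] += 1

    for team, t in sorted(totals.items()):
        miss_rate = t["missed"] / (t["missed"] + t["taken"])
        print(f"{team}: {miss_rate:.1%} of chances missed")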

So back to the original question: how many chances are dropped? The answer is about one-quarter; typically seven missed chances per Test. Here is a table showing missed chances by country.




Percentage of catches and stumpings missed, 2003 to 2015

Fielding team     2003-2009   2010-2015
New Zealand         23.6%       21.4%
South Africa        20.9%       21.6%
Australia           23.2%       21.8%
England             25.5%       24.8%
West Indies         30.5%       25.4%
Sri Lanka           25.3%       26.8%
India               24.6%       27.2%
Pakistan            30.8%       30.2%
Zimbabwe            27.1%       31.9%
Bangladesh          33.3%       33.1%


The difference between the top three countries in the last five years is not significant; however, there are more substantial differences down the list. Generally, Bangladesh have had the weakest catching record since they started in Test cricket, although there are recent signs of improvement. Other countries have had fluctuating fortunes. West Indies had a miss rate of over 30% from 2003 to 2009, but have tightened up their game in the last couple of years. India have seen a rate of 33% in 2013 fall to 23% in 2015, and Sri Lanka have also improved their catching significantly in just the last two years.

In some years, countries like Australia, South Africa and New Zealand have seen their rates drop below 20%; the best single-year result was 16.9% by South Africa in 2013, when they were the No. 1-ranked team in Tests. In good years, the proportion of dropped catches rated as "difficult" generally increases; good teams still miss the hard ones but drop fewer easy ones. Typically, two-thirds of Australia's missed chances are rated as difficult, but the same applies to only one-third of Bangladesh's missed chances.

The lucky

As a batsman's innings progresses, the odds of him offering a chance increase. About 72% of batsmen reaching 50 do so without giving a chance, but only 56% of century-makers are chance-free in their first 100 runs, and only 33% of double-century-makers in their first 200. The highest entirely chanceless innings is 374 by Mahela Jayawardene in Colombo; Lara's 400 in Antigua contained a couple of "academic" chances.

The most expensive missed chance since the start of 2000 is 297 runs for Inzamam-ul-Haq, who made 329 after being missed on 32 in Lahore in 2002. Historically there have been more expensive misses: Mark Taylor (334 not out) was dropped on 18 and 27 by Saeed Anwar, and there was a missed stumping on 40 for Len Hutton (364) in 1938. Perhaps even luckier was Kumar Sangakkara, who made 270 in Bulawayo after being dropped on 0. Sachin Tendulkar was dropped on 0 when he made his highest score, 248 not out in Dhaka. Mike Hussey gave a possible chance first ball at the Gabba in 2010, and went on to make 195. Graham Gooch was famously dropped by Kiran More when on 36 at Lord's in 1990. He went on to make 333.
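
The "cost" of a miss in these examples is simply the runs added after the reprieve: the batsman's final score minus his score when the chance went down (with multiple lives, the earliest miss gives the largest figure). For Inzamam that is 329 minus 32, or 297. As a trivial check:

    def reprieve_cost(final_score, score_when_missed):
        """Runs a missed chance ended up costing the fielding side."""
        return final_score - score_when_missed

    print(reprieve_cost(329, 32))  # 297 -- Inzamam-ul-Haq, Lahore 2002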

The data turns up four batsmen who have been dropped five times in an innings: one was Andy Blignaut, whose 84 not out in Harare in 2005 included an extremely rare hat-trick of dropped catches; Zaheer Khan was the unhappy bowler. (There was also a hat-trick of missed chances at Old Trafford in 1972, when two batsmen survived against Geoff Arnold.) The others who have been missed five times are Hashim Amla (253 in Nagpur, 2010), Taufeeq Umar (135 in St Kitts, 2011) and Kane Williamson (242 not out in Wellington, 2014). Nothing in this century quite matches the seven or eight missed catches (reports vary) off George Bonnor when he made 87 in Sydney in 1883, or six misses off Bill Ponsford in his 266 at The Oval in 1934 (as recorded by veteran scorer Bill Ferguson). Wavell Hinds was dropped twice at the MCG in 2000, and still made a duck.

The batsman with the most reprieves in the study period is Virender Sehwag, missed 68 times, just one ahead of Sangakkara. About 37% of the chances Sehwag offered were dropped, which is well above average and probably a testament to the power of his hitting.

The unlucky

Broadly, spin bowlers suffer more from dropped catches and (of course) missed stumpings. Chances at short leg, along with caught and bowled, have the highest miss rates among fielding positions, and these positions happen to feature more strongly among spinners' wickets than pace bowlers'. Overall, 27% of chances off spin bowlers are missed, as against 23% of chances off pace bowlers.

In the study period, the bowlers with the most missed chances in Tests are Harbhajan Singh (99) and Danish Kaneria (93). Harbhajan has had 26 chances missed at short leg alone. Bear in mind that these bowlers' careers are not fully covered; the data for about 10% of Harbhajan's career is not available to analyse. Pace bowlers with the most misses, as of January 2016, are Jimmy Anderson (89) and Stuart Broad (85).

Spare a thought for James Tredwell, who has played only two Tests but suffered ten missed chances, including seven on debut, the most for any bowler since the start of 2000. Most were very difficult, with three of them missed by the bowler himself. Also worth mentioning is Zulfiqar Babar, who has had 30 chances missed in his Test career and only 28 catches (and stumpings) taken.

At the other end of the scale, Adil Rashid has had eight catches taken off his bowling with no misses (as of August 2016). Neil Wagner of New Zealand has had only seven misses out of 63 chances, a rate of 11%.

Two bowlers have had catches missed off their first ball in Test cricket: David Warner (Dean Brownlie dropped by James Pattinson, Brisbane, 2011) and RP Singh (Shoaib Malik dropped by Anil Kumble, Faisalabad, 2006).

There is an intriguing case from 1990. Against West Indies in Lahore, Wasim Akram took four wickets in five balls: W, W, 1, W, W. In a surviving scorebook, the single, by Ian Bishop, is marked as a dropped catch at mid-on. If so, Akram came within a hair's breadth of five wickets in five balls, since the batsmen crossed, and Bishop did not face again. (Wisden, it should be noted, says that the catch was out of reach.)

The guilty

When it comes to catching, some positions are much more challenging than others. That will come as no surprise, but putting some numbers to this is an interesting exercise. The table below shows the miss rate for different field positions.
 

Chances by position, December 2008 to January 2016

Position      Chances   % Missed
Keeper (ct)     2188        15%
Stumping         254        36%
Slip            2062        29%
Gully            404        30%
Third man         36        17%
Point            371        29%
Cover            319        23%
Mid-off          253        20%
Bowler           378        47%
Mid-on           340        22%
Midwicket        455        23%
Short leg        518        38%
Square leg       286        19%
Fine leg         170        30%



The highest miss rates are seen for caught-and-bowled chances and for catches at short leg. Bowlers lack the luxury of setting themselves for catches, while short leg has the least time of any position to react to a well-hit ball. Many of the chances there are described as half-chances or technical. Slip catches are twice as likely to be dropped as wicketkeepers' catches, a measure of the advantage of gloves.

However, it would be unwise to read too much into the table above. Slip fielders or short-leg fielders are not inferior to those at mid-off; they get much more difficult chances. In the period of the study, Alastair Cook missed more chances than any other non-wicketkeeper, some 62 misses, but since many of his misses came at short leg, his miss rate doesn't look so bad.

While comparing lapse rates of different fielders is risky, it is worth mentioning Graeme Smith, whose drop rate of only 14% is the best among long-serving players by a considerable margin. Between August 2012 and February 2013, Smith took 25 catches and recorded no missed chances. Other slip fielders with outstanding catching records include Andrew Strauss and Ross Taylor on 20%, Michael Clarke on 21%, and Ricky Ponting on 22%. Elsewhere in the field, Warner at one stage took 20 consecutive chances that came to him.

Of those who have recorded more drops than catches, Umar Gul leads the list, with 11 catches and 14 misses. In 2014, Mushfiqur Rahim missed ten consecutive chances that came his way. Oddly enough, he caught his next 13 chances. Kevin Pietersen came to Test cricket with a fine catching reputation, but he dropped the first seven chances that came to him. He then caught his next 16 chances. The most missed chances in a match for one team, in this data set, is 12 by India against England in Mumbai in 2006. The most missed chances in an innings is nine by Pakistan against England in Faisalabad in 2005, and also by Bangladesh against Pakistan in Dhaka in 2011. In Karachi in 2009, Mahela Jayawardene (240) was dropped on 17 and 43, Thilan Samaraweera (231) was dropped on 73 and 77, and Younis Khan (313) was dropped on 92. The combined cost of all the missed chances in the match was 1152 runs, or 684 runs based on "first" drops off each of the batsmen.

There is some evidence of "contagious" butterfingers in teams. In the second Test of the 1985 series in Colombo, India dropped seven catches against Sri Lanka on the first day, on which the only wicket to fall came from a run-out. India also dropped six catches in the space of ten overs in Rawalpindi in 2004, five of them coming in the first hour of the fourth day. It is rare enough for six chances to be offered in the space of ten overs at all, let alone for all of them to be missed.

Behind the stumps


Here is some data on the miss rates, including stumpings, of various wicketkeepers of the 21st century. Not all are listed, but those with particularly low or high drop rates are given. 



Missed chances by keepers

Keeper            Chances   % Missed
Mark Boucher          364        10%
BJ Watling            119        11%
Tatenda Taibu          57        11%
AB de Villiers         94        11%
Adam Gilchrist        357        12%
Kamran Akmal          203        20%
Sarfraz Ahmed          63        21%
Dinesh Karthik         50        22%
Adnan Akmal            77        22%
Mushfiqur Rahim        86        32%


The wicketkeeper with the most misses is MS Dhoni with 66 (18%).
In his defence, Dhoni had to deal with a high proportion of spin bowling, which presents a much greater challenge for keepers. Miss rates for leading wicketkeepers average around 30% off spinners, for both catches and stumpings, but only about 10% for catches off pace bowlers. It can certainly be argued that keeping to spinners is the true test of a keeper.

It is not uncommon for keepers to start with a bang but fade later in their careers. Boucher, Watling, Gilchrist and de Villiers all had miss rates in single digits earlier in their careers. Gilchrist's miss rate rose in the last couple of years before his retirement. Others with very low rates, who did not qualify for the table, include Peter Nevill and Chris Read, on 7%. Read, to my eye, was one of the best modern wicketkeepers, but he did not get very many opportunities since he was unable to score enough runs to hold his place.


A short history of dropped catches

In addition to the data for the 21st century I have gathered data from other periods of Test history, using scorebooks that recorded dropped catches. The best sources are scorebooks by Bill Ferguson in the 1910s and 1920s, and by Bill Frindall from the early 1970s to the late 1990s. I have also used a limited number of other sources, including scores by Irving Rosenwater and some by Pakistan TV scorers. I have extracted data from about 200 Test scores in all, dating from before 1999.

Again, there must be caveats. We cannot be sure that the judging of dropped catches was on the same terms throughout, and we cannot be sure of the effect of TV replays on these assessments. I would say, however, that in the case of Frindall we have a meticulous observer with a very consistent style over multiple decades.

Once again, it would be unwise to read too much into each little blip in the data, but in general there is a trend toward lower rates of missed chances. The trend would probably be steeper if the data was limited to Australia and England, as the recent data includes countries such as Bangladesh that have had little or no coverage in earlier decades.

I might add an opinion from decades of observation: I believe that the greatest area of improvement has been with weaker fielders. Today everyone, including those with limited skills, has to do extensive fielding drills and take that part of the game very seriously. This has been one effect of the one-day game. In past decades many took fielding seriously. Jack Hobbs, Don Bradman and Neil Harvey worked hard at it, and I doubt if any player today works as hard on fielding as Colin Bland did in the 1960s (Bland would spend hours picking up and throwing a ball at a single stump: his record of run-outs is superior to that of anyone today). However, there were also players who did much less work on their fielding skills. In the modern game there is nowhere to hide, and everyone must put in the training effort. As a result, overall standards have risen.