Showing posts with label replicate.

Monday 15 November 2021

Are swathes of prestigious financial academic research statistically bogus?

Robin Wigglesworth in The FT

It may sound like a low-budget Blade Runner rip-off, but over the past decade the scientific world has been gripped by a “replication crisis” — the findings of many seminal studies cannot be repeated, with huge implications. Is investing suffering from something similar? 

That is the incendiary argument of Campbell Harvey, professor of finance at Duke University. He reckons that at least half of the 400 supposedly market-beating strategies identified in top financial journals over the years are bogus. Worse, he worries that many fellow academics are in denial about this. 

“It’s a huge issue,” he says. “Step one in dealing with the replication crisis in finance is to accept that there is a crisis. And right now, many of my colleagues are not there yet.” 

Harvey is not some obscure outsider or performative contrarian attempting to gain attention through needless controversy. He is the former editor of the Journal of Finance, a former president of the American Finance Association, and an adviser to investment firms like Research Affiliates and Man Group. 

He has written more than 150 papers on finance, several of which have won prestigious prizes. In fact, Harvey’s 1986 PhD thesis was the first to show that the shape of the bond market’s yield curve can predict recessions. In other words, this is not like a child saying the emperor has no clothes. Harvey’s escalating criticism of the rigour of financial academia since 2015 is more akin to the emperor regretfully proclaiming his own nudity. 

To understand what the ‘replication crisis’ is, how it has happened and its implications for finance, it helps to start at its broader genesis. 

In 2005, Stanford medical professor John Ioannidis published a bombshell essay titled “Why Most Published Research Findings Are False”, which noted that the results of many medical research papers could not be replicated by other researchers. Subsequently, several other fields have turned a harsh eye on themselves and come to similar conclusions. The heart of the issue is a phenomenon that researchers call “p-hacking”. 

In statistics, a p-value gauges how likely it is that a result at least as striking as the one observed would turn up by pure chance (a simple data oddity, like the correlation of Nicolas Cage films with US swimming pool drownings) rather than reflect something “statistically significant”. P-values are what indicate whether a certain drug really does help, or whether cheap stocks really do outperform over time. 

P-hacking is when researchers overtly or subconsciously twist the data to find a superficially compelling but ultimately spurious relationship between variables. It can be done by cherry-picking what metrics to measure, or subtly changing the time period used. Just because something is narrowly statistically significant, does not mean it is actually meaningful. A trading strategy that looks golden on paper might turn up nothing but lumps of coal when actually implemented. 
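
How a spurious result can emerge from nothing but repeated testing is easy to demonstrate. The sketch below (my own illustration, not from the article) backtests 400 purely random trading signals against purely random returns; with a conventional 5% significance threshold, roughly twenty of them will look “statistically significant” by luck alone. All names and parameter values are invented for the example.

```python
# A minimal illustration of p-hacking: test many random "signals" against
# random returns and count how many look statistically significant by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_days = 2500          # roughly ten years of daily data
n_strategies = 400     # candidate signals, none of which has any real edge

returns = rng.normal(0, 0.01, n_days)      # pure-noise market returns
false_hits = 0
for _ in range(n_strategies):
    signal = rng.choice([-1, 1], n_days)   # a random long/short rule
    strategy_returns = signal * returns    # "backtest" of the rule
    t_stat, p_value = stats.ttest_1samp(strategy_returns, 0.0)
    if p_value < 0.05:
        false_hits += 1

print(f"{false_hits} of {n_strategies} worthless strategies pass at p < 0.05")
```

Run enough candidate strategies through the same history and some will clear the bar; report only those, and the published record fills up with findings that were never really there.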

Harvey attributes the scourge of p-hacking to incentives in academia. Getting a paper with a sensational finding published in a prestigious journal can earn an ambitious young professor the ultimate prize — tenure. Wasting months of work on a theory that does not hold up to scrutiny would frustrate anyone. It is therefore tempting to torture the data until it yields something interesting, even if other researchers are later unable to duplicate the results. 

Obviously, the stakes of the replication crisis are much higher in medicine, where lives can be in play. But it is not something that remains confined to the ivory towers of business schools, as investment groups often smell an opportunity to sell products based on apparently market-beating factors, Harvey argues. “It filters into the real world,” he says. “It definitely makes it into people’s portfolios.” 

AQR, a prominent quant investment group, is also sceptical that there are hundreds of durable and successful factors that can help investors beat markets, but argues that the “replication crisis” brouhaha is overdone. Earlier this year it published a paper that concluded that not only could the majority of the studies it examined be replicated, they still worked “out of sample” — in actual live trading — and were actually further corroborated by international data. 
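
What “out of sample” means in practice can be shown with a small, hypothetical check: estimate a factor’s edge on one slice of history, then ask whether it survives on data the original study never touched. The simulated returns and numbers below are purely illustrative and are not drawn from the AQR paper.

```python
# A hypothetical out-of-sample check: does a factor's edge, estimated on an
# in-sample period, persist on a later period it was never fitted to?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated daily factor returns with a small genuine edge (about 5 bps/day).
factor_returns = rng.normal(0.0005, 0.01, 5000)
in_sample, out_of_sample = factor_returns[:2500], factor_returns[2500:]

for label, sample in [("in-sample", in_sample), ("out-of-sample", out_of_sample)]:
    t_stat, p_value = stats.ttest_1samp(sample, 0.0)
    print(f"{label}: mean={sample.mean():.5f}  t={t_stat:.2f}  p={p_value:.3f}")

# A data-mined, spurious factor will typically look strong in-sample and
# evaporate out-of-sample; a genuine one should hold up in both periods.
```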

Harvey is unconvinced by the riposte, and will square up to the AQR paper’s authors at the American Finance Association’s annual meeting in early January. “That’s going to be a very interesting discussion,” he promises. Many of the industry’s geekier members will be rubbing their hands at the prospect of a gladiatorial, if cerebral, showdown to kick off 2022.

Thursday 22 September 2016

Why does bad science persist?

From The Economist



IN 1962 Jacob Cohen, a psychologist at New York University, reported an alarming finding. He had analysed 70 articles published in the Journal of Abnormal and Social Psychology and calculated their statistical “power” (a mathematical estimate of the probability that an experiment would detect a real effect). He reckoned most of the studies he looked at would actually have detected the effects their authors were looking for only about 20% of the time—yet, in fact, nearly all reported significant results. Scientists, Cohen surmised, were not reporting their unsuccessful research. No surprise there, perhaps. But his finding also suggested some of the papers were actually reporting false positives, in other words noise that looked like data. He urged researchers to boost the power of their studies by increasing the number of subjects in their experiments.
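
Cohen’s notion of power is easy to make concrete by simulation. In the sketch below (mine, not Cohen’s), a two-group experiment with 20 subjects per group looks for a modest true effect of 0.4 standard deviations; only around a fifth to a quarter of such experiments reach significance at the conventional 5% level, which is roughly the shortfall Cohen was lamenting.

```python
# Estimating statistical power by simulation: how often does a small two-group
# experiment detect a modest but real effect at p < 0.05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

effect_size = 0.4       # true group difference, in standard-deviation units
n_per_group = 20        # small samples, typical of the studies Cohen reviewed
n_experiments = 10_000

detections = 0
for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(effect_size, 1.0, n_per_group)
    _, p_value = stats.ttest_ind(treated, control)
    if p_value < 0.05:
        detections += 1

print(f"Estimated power: {detections / n_experiments:.0%}")   # comes out near 24%
```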

Wind the clock forward half a century and little has changed. In a new paper, this time published in Royal Society Open Science, two researchers, Paul Smaldino of the University of California, Merced, and Richard McElreath at the Max Planck Institute for Evolutionary Anthropology, in Leipzig, show that published studies in psychology, neuroscience and medicine are little more powerful than in Cohen’s day.

They also offer an explanation of why scientists continue to publish such poor studies. That dodgy methods which seem to produce results are perpetuated, because those who publish prodigiously prosper, might easily have been predicted. More worrying is their finding that replication, the process by which published results are tested anew, is incapable of correcting the situation no matter how rigorously it is pursued.


The preservation of favoured places

First, Dr Smaldino and Dr McElreath calculated that the average power of papers culled from 44 reviews published between 1960 and 2011 was about 24%. This is barely higher than Cohen reported, despite repeated calls in the scientific literature for researchers to do better. The pair then decided to apply the methods of science to the question of why this was the case, by modelling the way scientific institutions and practices reproduce and spread, to see if they could nail down what is going on.

They focused in particular on incentives within science that might lead even honest researchers to produce poor work unintentionally. To this end, they built an evolutionary computer model in which 100 laboratories competed for “pay-offs” representing prestige or funding that result from publications. They used the volume of publications to calculate these pay-offs because the length of a researcher’s CV is a known proxy of professional success. Labs that garnered more pay-offs were more likely to pass on their methods to other, newer labs (their “progeny”).

Some labs were better able to spot new results (and thus garner pay-offs) than others. Yet these labs also tended to produce more false positives—their methods were good at detecting signals in noisy data but also, as Cohen suggested, often mistook noise for a signal. More thorough labs took time to rule these false positives out, but that slowed down the rate at which they could test new hypotheses. This, in turn, meant they published fewer papers.

In each cycle of “reproduction”, all the laboratories in the model performed and published their experiments. Then one—the oldest of a randomly selected subset—“died” and was removed from the model. Next, the lab with the highest pay-off score from another randomly selected group was allowed to reproduce, creating a new lab with a similar aptitude for creating real or bogus science.
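
The authors’ actual model has more moving parts, but the selection dynamic described above can be sketched in a few lines. Everything below (the pay-off rule, the effort-versus-output trade-off, the parameter values) is a simplified illustration, not the paper’s code.

```python
# A simplified sketch of the selection dynamic: labs that spend less effort
# weeding out false positives publish more, accumulate more "pay-off", and
# pass their methods on when new labs are founded. Parameters are illustrative.
import random

random.seed(0)
N_LABS, GENERATIONS = 100, 2000

def new_lab(effort):
    return {"effort": effort, "payoff": 0.0}

# Start with a mix of careful (high-effort) and corner-cutting labs.
labs = [new_lab(random.uniform(0.1, 1.0)) for _ in range(N_LABS)]

for _ in range(GENERATIONS):
    # Publish: lower effort means more papers, and pay-off tracks volume.
    for lab in labs:
        lab["payoff"] += 1.0 / lab["effort"]

    # Death: remove one lab (the paper removes the oldest of a random subset;
    # a random lab is close enough for this sketch).
    labs.pop(random.randrange(len(labs)))

    # Reproduction: the highest-payoff lab in a random subset founds a new lab
    # that inherits a slightly mutated version of its effort level.
    parent = max(random.sample(labs, 10), key=lambda lab: lab["payoff"])
    child_effort = min(1.0, max(0.05, parent["effort"] + random.gauss(0, 0.02)))
    labs.append(new_lab(child_effort))

mean_effort = sum(lab["effort"] for lab in labs) / len(labs)
print(f"Mean effort after selection: {mean_effort:.2f}")   # drifts toward the low end
```

Even this toy version shows the ratchet: nothing in the pay-off rule rewards being right, only being prolific, so lineages that cut corners gradually take over the population.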

Sharp-eyed readers will notice that this process is similar to that of natural selection, as described by Charles Darwin, in “The Origin of Species”. And lo! (and unsurprisingly), when Dr Smaldino and Dr McElreath ran their simulation, they found that labs which expended the least effort to eliminate junk science prospered and spread their methods throughout the virtual scientific community.

Their next result, however, was surprising. Though more often honoured in the breach than in the execution, the process of replicating the work of people in other labs is supposed to be one of the things that keeps science on the straight and narrow. But the two researchers’ model suggests it may not do so, even in principle.

Replication has recently become all the rage in psychology. In 2015, for example, over 200 researchers in the field repeated 100 published studies to see if the results of these could be reproduced (only 36% could). Dr Smaldino and Dr McElreath therefore modified their model to simulate the effects of replication, by randomly selecting experiments from the “published” literature to be repeated.

A successful replication would boost the reputation of the lab that published the original result. Failure to replicate would result in a penalty. Worryingly, poor methods still won—albeit more slowly. This was true in even the most punitive version of the model, in which labs received a penalty 100 times the value of the original “pay-off” for a result that failed to replicate, and replication rates were high (half of all results were subject to replication efforts).
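
Here is a hedged sketch of how that replication step might be bolted on to a model like the one above: each published result has some chance of being chosen for replication, and a false positive that fails to replicate costs the originating lab a penalty. The 100-fold penalty and 50% replication rate mirror the punitive setting described here; the rule generating false positives (simply one minus effort) is my own crude stand-in for the paper’s more careful treatment.

```python
# One publishing cycle's pay-off for a lab, with replication penalties added.
# The false-positive rule below is a deliberately crude placeholder.
import random

def cycle_payoff(effort, replication_rate=0.5, penalty=100.0, per_paper=1.0):
    """Pay-off for one cycle: credit per paper, minus heavy penalties for
    false positives that happen to be selected for replication."""
    papers = max(1, round(1.0 / effort))       # corner-cutting yields more output
    false_positive_rate = 1.0 - effort         # sloppier methods, more false positives
    total = 0.0
    for _ in range(papers):
        total += per_paper                     # credit for the publication itself
        fails_replication = (random.random() < false_positive_rate
                             and random.random() < replication_rate)
        if fails_replication:
            total -= penalty * per_paper       # public failure to replicate
    return total
```

Whether corner-cutting still wins under a rule like this depends on exactly how false positives are generated, which is where the toy sketch and the real model part company; the finding reported above is that, in the authors’ fuller simulation, it did, even at these punitive settings.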

The researchers’ conclusion is therefore that when the ability to publish copiously in journals determines a lab’s success, then “top-performing laboratories will always be those who are able to cut corners”—and that is regardless of the supposedly corrective process of replication.

Ultimately, therefore, the way to end the proliferation of bad science is not to nag people to behave better, or even to encourage replication, but for universities and funding agencies to stop rewarding researchers who publish copiously over those who publish fewer, but perhaps higher-quality papers. This, Dr Smaldino concedes, is easier said than done. Yet his model amply demonstrates the consequences for science of not doing so.

Saturday 29 August 2015

Psychology experiments are failing the replication test – for good reason

John Ioannidis in The Guardian


‘The replication failure rate of psychology seems to be in the same ballpark as those rates in observational epidemiology, cancer drug targets and preclinical research, and animal experiments.’ Photograph: Sebastian Kaulitzki/Alamy


Science is the best thing that has happened to humankind because its results can be questioned, retested, and demonstrated to be wrong. Science is not about proving some preconceived dogma at all costs. Conversely, religious devotees, politicians, soccer fans, and pseudo-science quacks won’t allow their doctrines, promises, football clubs or bizarre claims to be proven illogical, exaggerated, second-rate or just absurd.

Despite this clear superiority of the scientific method, we researchers are still fallible humans. This week, an impressive collaboration of 270 investigators working for five years published in Science the results of their efforts to replicate 100 important results that had been previously published in three top psychology journals. The replicators worked closely with the original authors to make the repeat experiments close replicas of the originals. The results were bleak: 64% of the experiments could not be replicated.

We often feel uneasy about having our results probed for possible debunking. We don’t always exactly celebrate when we are proven wrong. For example, retracting published papers can take many years and many editors, lawyers, and whistleblowers – and most debunked published papers are never retracted.

Moreover, with fierce competition for limited research funds and with millions of researchers struggling to make a living (publish, get grants, get promoted), we are under immense pressure to make “significant”, “innovative” discoveries. Many scientific fields are thus being flooded with claimed discoveries that nobody ever retests. Retesting (called replication) is discouraged. In most fields, no funding is given for what is pooh-poohed as me-too efforts. We are forced to hasten from one “significant” paper to the next without ever reassessing our previously claimed successes.

Multiple lines of evidence suggest this is a recipe for disaster, leading to a scientific literature littered with long chains of irreproducible results. Irreproducibility is rarely an issue of fraud. Simply having millions of hardworking scientists searching fervently and creatively in billions of analyses for something statistically significant can lead to very high rates of false-positives (red-herring claims about things that don’t exist) or inflated results.

This is more likely to happen in fields that chase subtle, complex phenomena, in those that have more noise in measurement, and where there is more room for subjective choices to be introduced in designing and running experiments and crunching the data. Ten years ago I tried to model these factors. These models predicted that in most scientific fields and settings the majority of published research findings may be false. They also anticipated that the false rates could vary greatly (from almost 0% to almost 100%), depending on the features of a scientific discipline and how scientists run their work.
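
The core of that 2005 model fits in a few lines. In its simplest, bias-free form, the probability that a nominally significant finding is actually true (the positive predictive value, PPV) depends on the study’s power, its significance threshold, and the prior odds R that the hypotheses being tested are true. The sketch below uses that simple formula with illustrative numbers; the published model adds further terms for bias and for multiple teams chasing the same question.

```python
# The bias-free core of the 2005 argument: how likely is a "significant"
# finding to be true, given power, the alpha threshold, and the prior odds R
# that hypotheses in a field are true? Numbers are illustrative.
def positive_predictive_value(power, alpha, prior_odds):
    """PPV = power * R / (power * R + alpha), the no-bias case."""
    return (power * prior_odds) / (power * prior_odds + alpha)

for power, prior_odds in [(0.80, 1.0), (0.20, 0.25), (0.20, 0.05)]:
    ppv = positive_predictive_value(power, alpha=0.05, prior_odds=prior_odds)
    print(f"power={power:.0%}, prior odds R={prior_odds}: PPV={ppv:.0%}")

# Well-powered studies of plausible hypotheses yield mostly true findings;
# underpowered studies fishing among unlikely hypotheses yield mostly false ones.
```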

Probably the failure rate in the Science data would have been higher for work published in journals of lesser quality. There are tens of thousands of journals in the scientific-publishing market, and most will publish almost anything submitted to them. The failure rate may also be higher for studies that are so complex that none of the collaborating replicators offered to attempt a replication. This group accounted for one-third of the studies published in the three top journals. So the replication failure rate for psychology at large may be 80% or more overall.

This performance is even worse than I would have predicted: in 2012 I published an anticipated replication failure rate of 53% for psychology at large. Compared with other empirical studies, the failure rate of psychology seems to be in the same ballpark as replication failure rates in observational epidemiology, cancer drug targets and preclinical research, and animal experiments.

However, I think it is important to focus on the positive side. The Science paper shows that large-scale replication efforts of high quality are doable even in fields like psychology, where until recently there was no strong replication culture. Hopefully this successful, highly informative paradigm will help improve research practices in the field. Many other scientific fields without strong replication cultures may now be prompted to embrace replication and reproducible research practices. Thus these seemingly disappointing results offer a great opportunity to strengthen scientific investigation. I look forward to celebrating the day when my claim that most published research findings are false is thoroughly refuted across most, if not all, scientific fields.