Thursday, September 3, 2015

Reproducibility project: A front row seat

A recent paper in Science reports the results of a large-scale effort to test reproducibility in psychological science. The results have caused much discussion (as well they should) in both general public and science forums. I thought I would offer my perspective as the lead author of one of the studies that was included in the reproducibility analysis. I had heard about the project even before being contacted to participate and one of the things that appealed to me about it was that they were trying to be unbiased in their selection of studies for replication: all papers published in three prominent journals in 2008.  Jim Magnuson and I had published a paper in one of those journals (Journal of Experimental Psychology: Learning, Memory, & Cognition) in 2008 (Mirman & Magnuson, 2008), so I figured I would hear from them sooner or later. 

In 2012 I was contacted by one of the members of the Open Science Collaboration requesting either our original experiment files or details of the procedure so they could replicate it as closely as possible. I provided the experiment files and we had a little email discussion during which I provided the details of our data analysis procedure (exclusion of error trials and reaction time outliers, etc.) and verified which of the effects from our original paper was the critical one for replication -- an inhibitory effect of near semantic neighbors on visual word recognition. They conducted a power analysis and ran their final data collection plans by me for my input. I flagged some minor issues, but didn't see anything that would be a significant problem. It was great to be informed every step of the way -- it felt like a true replication effort, independent and transparent, but one where I could flag any significant problems.

My key finding was statistically significant in their replication, though the effect size was smaller than in my original report. Thanks to the project's open sharing of the data and analysis code, I was able to make a version of their Figure 3 with my study identified (X and arrow):
My experience with the reproducibility project was that they were extremely careful and professional. The studies for replication were selected systematically rather than out of particular skepticism about (or confidence in) the original findings, and I was consulted at every step of the replication of my study, which allowed me both to help make it a true replication and to raise any concerns about differences between my original study and the replication.

The hand-wringing

Most of the discussion about the Science paper has been about whether or not there is a crisis in psychology, or whether psychology is a "real" science -- as if physics, chemistry, and maybe biology were the "real" sciences.

First, science is a method, not a content area. One can apply the scientific method to the behavior of atoms, molecules, organisms, or human behavior. Each of those domains has its own challenges, but the science is in the method, not in the content. 

Second, part of that method is replication. Not replicability, which is a property of a particular phenomenon, but replication, which is a methodological strategy. Observing a phenomenon once is intriguing; observing it repeatedly makes it something worth explaining. Each individual scientific report should be treated as provisional: Jim and I observed an effect that we reported in that 2008 paper, but it could have been a random coincidence or a bizarre property of the context of our experiment. This replication gives me more confidence in the result, and I have separately found the effect in a different task and two different populations (Mirman, 2011; Mirman & Graziano, 2013), which makes me more confident in our underlying theory. To my mind, the bigger problem is that there is very little incentive for running replication studies. Journals and funding agencies want to see innovative science, and replications are literally the opposite of innovative. People have proposed various clever ways of encouraging and sharing replication studies, and some journals have started publishing replication reports. I hope this trend continues and the academic culture begins to accept and reward replications.

Third, much discussion has focused on the fact that a high proportion of studies did not "replicate" -- an effect that was originally statistically significant was not statistically significant in the replication -- and that the replication effect sizes were generally smaller than the originally reported effect sizes. The latter was true of my study: the effect replicated, but the replication effect size was smaller than the effect size in our original report, which is reflected by our data point being below the diagonal in the figure. The replication issue is a straightforward consequence of the effect size issue: even assuming that an effect exists in the population (not just in the original sample), if the population effect size is smaller than the one in the original sample, then power analysis based on the original sample effect size will produce under-powered studies that will, sometimes, fail to detect the effect in the population. So the relevant issue is that reported sample effect sizes tend to be larger than population effect sizes, but this is a direct consequence of the "statistical significance filter", also known as "publication bias": statistically significant effects can be published but null results are very rarely published. For example, Jim and I may not have been the only people to test for a near semantic neighbor effect, but maybe the effects in the other studies were smaller and not statistically significant, so they were never published (probably not even submitted for publication). When you chop off the low end of the effect size distribution, the average of the trimmed distribution will necessarily be larger than the average of the full distribution.
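The significance filter is easy to see in a minimal simulation. The numbers below are purely illustrative (a hypothetical true effect of 0.2 and studies of 40 participants -- not values from the paper or from my study): each simulated study observes the true effect plus sampling noise, and we compare the average effect across all studies to the average across only the "publishable" (significant) ones.

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.2     # hypothetical population effect size
n = 40                # hypothetical participants per study
se = 1 / np.sqrt(n)   # standard error of each study's effect estimate
n_studies = 100_000

# Each simulated study observes the true effect plus sampling noise
observed = rng.normal(true_effect, se, n_studies)

# Two-tailed z-test at alpha = .05: "significant" if |effect / se| > 1.96
significant = np.abs(observed / se) > 1.96

print(f"True effect:               {true_effect:.2f}")
print(f"Mean effect, all studies:  {observed.mean():.2f}")
print(f"Mean effect, significant:  {observed[significant].mean():.2f}")
print(f"Power (share significant): {significant.mean():.2f}")
```

Under these settings the significant studies are under-powered, and averaging only over them roughly doubles the apparent effect size: the trimmed distribution has a higher mean than the full one, exactly the pattern of replication effect sizes falling below the diagonal.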

Where do we go from here? 

I think we need two major changes:

(1) We need to start encouraging and rewarding replication studies. Not just when we think someone is wrong, but as a matter of course, as part of going about the business of psychological science. I've heard many good ideas -- using replication studies as assignments in research methods courses, publishing them as online supplements to the original studies or having an online repository of replications -- these and/or other ideas need to become part of how we do psychological science. 

(2) We need to accept that we're dealing with variable effects and that each new result should be treated as provisional until it is thoroughly replicated. There are lots of aspects to this, but I think the most important one is not to take it personally or get defensive when someone raises doubts or fails to replicate our work. One can run a perfectly good experiment, do all of the analyses the best possible way, and come up with something that is true for the sample but not true for the population. I think it is important to be very aware of how big a leap we are making when we see a phenomenon in 20 college students and draw conclusions about fundamental aspects of human cognition.
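How big is that leap? A quick sketch, with made-up but plausible numbers (a hypothetical 20 ms true effect and 80 ms of between-subject variability -- not figures from any real study), shows how much the observed effect can swing across samples of 20 participants:

```python
import numpy as np

rng = np.random.default_rng(2)

true_effect_ms = 20   # hypothetical true effect
sd_ms = 80            # hypothetical between-subject variability
n = 20                # participants per study

# Mean effect observed in each of 10,000 simulated 20-person studies
study_means = rng.normal(true_effect_ms, sd_ms, (10_000, n)).mean(axis=1)

print(f"True effect: {true_effect_ms} ms")
print(f"Middle 95% of observed effects: "
      f"{np.percentile(study_means, 2.5):.0f} to "
      f"{np.percentile(study_means, 97.5):.0f} ms")
```

With these numbers, individual studies can plausibly observe anything from a null (or even reversed) effect to one more than twice the true size -- all without any error in the experiment or the analysis.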

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349 (6251). DOI: 10.1126/science.aac4716
Mirman, D., & Magnuson, J. (2008). Attractor dynamics and semantic neighborhood density: Processing is slowed by near neighbors and speeded by distant neighbors. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34 (1), 65-79. DOI: 10.1037/0278-7393.34.1.65
Mirman, D. (2011). Effects of near and distant semantic neighbors on word production. Cognitive, Affective, & Behavioral Neuroscience, 11 (1), 32-43. DOI: 10.3758/s13415-010-0009-7
Mirman, D., & Graziano, K. (2013). The neural basis of inhibitory effects of semantic and phonological neighbors in spoken word production. Journal of Cognitive Neuroscience, 25 (9), 1504-1516. DOI: 10.1162/jocn_a_00408
