Forthcoming article:
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J., Sun, J., Washburn, A. N., Wong, K., Yantis, C. A., & Skitka, L. J. (in press). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology. (preprint)
Brief Introduction
Since JPSP published incredible evidence for mental time travel (Bem, 2011), the credibility of social psychological research has been questioned. There is talk of a crisis of confidence, a replication crisis, or a credibility crisis. However, hard data on the credibility of empirical findings published in social psychology journals are scarce.
There have been two approaches to examining the credibility of social psychology. One approach relies on replication studies, in which authors attempt to replicate original studies as closely as possible. The most ambitious replication project was carried out by the Open Science Collaboration (Science, 2015), which replicated one study from each of 100 articles; 54 of these articles were classified as social psychology. For original studies that reported a significant result, only about a quarter of the replication studies produced a significant result. This estimate of replicability suggests that researchers conduct many more studies than are published and that effect sizes in published articles are inflated by sampling error, which makes them difficult to replicate. One concern about the OSC results is that replicating original studies exactly can be difficult. For example, a bilingual study in California may not produce the same results as a bilingual study in Canada. It is therefore possible that the poor outcome is partially due to problems of reproducing the exact conditions of the original studies.
A second approach is to estimate the replicability of published results using statistical methods. The advantage of this approach is that the estimates are predictions for exact replications of the original studies, because they are based on the statistical results reported in the original studies themselves. This is the approach used by Motyl et al.
The authors sampled 30% of the articles published in 2003-2004 (pre-crisis) and 2013-2014 (post-crisis) in four major social psychology journals (JPSP, PSPB, JESP, and PS). For each study, coders identified one focal hypothesis and recorded the statistical result. The bulk of the statistics were t-values from t-tests or regression analyses and F-tests from ANOVAs. Only 19 statistics were z-tests. The authors applied various statistical methods that test for the presence of publication bias or assess whether the studies have evidential value (i.e., reject the null hypothesis that all published results are false positives). For the purpose of estimating replicability, the most important statistic is the R-Index.
The R-Index has two components. First, it uses the median observed power of studies as an estimate of replicability (i.e., the percentage of studies that should produce a significant result if all studies were replicated exactly). Second, it computes the percentage of studies with a significant result. In an unbiased set of studies, median observed power and the percentage of significant results should match. Publication bias and questionable research practices produce more significant results than predicted by median observed power. The discrepancy is called the inflation rate. The R-Index subtracts the inflation rate from median observed power because median observed power is an inflated estimate of replicability when bias is present. The R-Index is not itself a replicability estimate; that is, an R-Index of 30% does not mean that 30% of studies will produce a significant result. However, a set of studies with an R-Index of 30 will have fewer successful replications than a set of studies with an R-Index of 80. An exception is an R-Index of 50, which is equivalent to a replicability estimate of 50%. If the R-Index is below 50, one would expect more replication failures than successes.
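To make the computation concrete, here is a minimal Python sketch of the R-Index for results that are available as absolute z-scores (t- and F-values would first be converted to z-scores via their p-values). This is my own illustration of the definition above, not code from Motyl et al.

```python
# A minimal sketch of the R-Index, assuming results are given as absolute z-scores;
# not the code used by Motyl et al.
import numpy as np
from scipy.stats import norm

CRIT_Z = norm.ppf(1 - 0.05 / 2)  # critical z for p < .05, two-sided (~1.96)

def r_index(z_scores):
    z = np.abs(np.asarray(z_scores, dtype=float))
    # Observed power: probability that an exact replication yields |z| > 1.96,
    # treating each observed z-score as the true noncentrality.
    observed_power = norm.sf(CRIT_Z - z) + norm.cdf(-CRIT_Z - z)
    median_power = np.median(observed_power)
    success_rate = np.mean(z > CRIT_Z)        # proportion of significant results
    inflation = success_rate - median_power   # excess success due to selection/QRPs
    return median_power - inflation           # R-Index = median observed power - inflation

# Example: mostly just-significant results produce an R-Index far below 50%.
print(round(r_index([2.0, 2.1, 2.2, 2.3, 2.6, 3.5]), 2))
```

In this toy example, a set of mostly just-significant results yields an R-Index of about .23, well below the 50% benchmark.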
Motyl et al. computed the R-Index separately for the 2003/2004 and the 2013/2014 results and found “the R-index decreased numerically, but not statistically over time, from .62 [CI95% = .54, .68] in 2003-2004 to .52 [CI95% = .47, .56] in 2013-2014. This metric suggests that the field is not getting better and that it may consistently be rotten to the core.”
I think this interpretation of the R-Index results is too harsh. I consider an R-Index below 50 an F (fail). An R-Index in the 50s is a D, and an R-Index in the 60s is a C. An R-Index greater than 80 is considered an A. So, clearly there is a replication crisis, but social psychology is not rotten to the core.
The R-Index is a simple tool, but it is not designed to estimate replicability. Jerry Brunner and I developed a method that can estimate replicability, called z-curve. All test statistics are converted into absolute z-scores and a kernel density distribution is fitted to the histogram of z-scores. Then a mixture model of normal distributions is fitted to the density distribution, and the means of the normal distributions are converted into power values. The weights of the components are used to compute the weighted average power. When this method is applied only to significant results, the weighted average power is the replicability estimate; that is, the percentage of significant results that one would expect if the set of significant studies were replicated exactly. Motyl et al. did not have access to this statistical tool. They kindly shared their data and I was able to estimate replicability with z-curve. For this analysis, I used all t-tests, F-tests, and z-tests (k = 1,163). The Figure shows two results. The left panel uses all z-scores greater than 2 for estimation (all values on the right side of the vertical blue line). The right panel uses only z-scores greater than 2.4. The reason is that just-significant results may be compromised by questionable research practices, which can bias the estimates.
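As an illustration of the first step, the following sketch converts reported t- and F-statistics into absolute z-scores via their two-sided p-values. The kernel-density and mixture-model estimation steps are more involved and are not reproduced here; the function names are mine, not part of the z-curve software.

```python
# Illustrative conversion of t- and F-statistics to absolute z-scores via p-values;
# function names are mine, not part of the z-curve software.
from scipy import stats

def t_to_z(t, df):
    p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value of the t-test
    return stats.norm.isf(p / 2)     # absolute z-score with the same two-sided p-value

def f_to_z(f, df1, df2):
    p = stats.f.sf(f, df1, df2)      # p-value of the F-test
    return stats.norm.isf(p / 2)     # absolute z-score, treating the test as two-sided

# F(1, 38) = 2.10**2 carries the same evidence as t(38) = 2.10, so both return ~2.03.
print(round(t_to_z(2.10, 38), 2), round(f_to_z(4.41, 1, 38), 2))
```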
The key finding is the replicability estimate. Both estimation methods produce similar results (48% vs. 49%). Even with over 1,000 observations there is uncertainty in these estimates; the 95% CI ranges from 45% to 54% when all significant results are used. Based on this finding, about half of these results would be expected to produce a significant result again in an exact replication study.
However, it is important to note that there is considerable heterogeneity in replicability across studies. As z-scores increase, the strength of evidence becomes stronger, and results are more likely to replicate. This is shown with average power estimates for bands of z-scores at the bottom of the figure. In the left panel, z-scores between 2 and 2.5 (~ .01 < p < .05) have a replicability of only 31%, and even z-scores between 2.5 and 3 have a replicability below 50%. It requires z-scores greater than 4 to reach a replicability of 80% or more. Similar results are obtained for actual replication studies in the OSC reproducibility project. Thus, researchers should take the strength of evidence of a particular study into account. Studies with p-values in the .01 to .05 range are unlikely to replicate without larger sample sizes. Studies with p-values less than .001 are likely to replicate even with the same sample size.
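As a rough sanity check on these band boundaries (my own back-of-the-envelope calculation, not a z-curve estimate), the snippet below prints the two-sided p-values at z = 2.0 and z = 2.5 together with the naive observed power at those values. Even this uncorrected observed power is only about 52% and 71%, and the selection-corrected estimate of 31% for this band is lower still.

```python
# Back-of-the-envelope check of the band boundaries (not a z-curve estimate).
from scipy.stats import norm

for z in (2.0, 2.5):
    p = 2 * norm.sf(z)                                    # two-sided p-value at this z-score
    obs_power = norm.sf(1.96 - z) + norm.cdf(-1.96 - z)   # power if the observed z were the true effect
    print(f"z = {z:.1f}: p = {p:.3f}, observed power = {obs_power:.2f}")
```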
Independent Replication Study
Schimmack and Brunner (2016) applied z-curve to the original studies in the OSC reproducibility project. For this purpose, I coded all studies in the project, whereas the actual replication project often picked only one study from articles with multiple studies. The 54 social psychology articles reported a total of 173 studies. The focal hypothesis test of each study was used to compute absolute z-scores, which were analyzed with z-curve.
The two estimation methods (using z > 2.0 or z > 2.4) produced very similar replicability estimates (53% vs. 52%). The estimates are only slightly higher than those for Motyl et al.’s data (48% & 49%) and the confidence intervals overlap. Thus, this independent replication study closely replicates the estimates obtained with Motyl et al.’s data.
Automated Extraction Estimates
Hand-coding of focal hypothesis tests is labor intensive and subject to coding biases. Studies often report more than one hypothesis test, and it is not trivial to pick one of them for further analysis. An alternative approach is to automatically extract all test statistics from articles. This also makes it possible to base estimates on a much larger sample of test results. The downside of automated extraction is that articles also report statistical analyses for trivial or non-focal tests (e.g., manipulation checks). The extraction of non-significant results is irrelevant because they are not used by z-curve to estimate replicability. I have reported the results of this method for various social psychology journals covering the years from 2010 to 2016 and posted powergraphs for all journals and years (2016 Replicability Rankings). Further analyses replicated the finding from the OSC reproducibility project that results published in cognitive journals are more replicable than those published in social journals. The Figure below shows that the average replicability estimate for social psychology is 61%, with an encouraging upward trend in 2016. This estimate is about 10 percentage points above the estimates based on hand-coded focal hypothesis tests in the two datasets above. The discrepancy may be due to the inclusion of less focal and more trivial statistical tests in the automated analysis. However, a 10-percentage-point difference is not dramatic. Neither 50% nor 60% replicability justifies the claim that social psychology is rotten to the core, but neither meets the expectation that researchers should plan studies with 80% power to detect a predicted effect.
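For readers unfamiliar with automated extraction, here is a bare-bones sketch of the idea: a regular expression that pulls t-, F-, and z-statistics out of plain article text. The pattern and the example sentence are mine and purely illustrative; real extraction software has to handle many more reporting formats.

```python
# A bare-bones sketch of automated test-statistic extraction; the pattern and the
# example text are illustrative, not the actual extraction software.
import re

TEST_PATTERN = re.compile(
    r"\b(?P<stat>[tFz])\s*"                                   # statistic label: t, F, or z
    r"(?:\(\s*(?P<df1>\d+)\s*(?:,\s*(?P<df2>\d+))?\s*\))?"    # optional degrees of freedom
    r"\s*=\s*(?P<value>-?\d+\.?\d*)"                          # reported value
)

text = "The effect was significant, t(38) = 2.10, p = .04, and F(1, 112) = 6.53."
for m in TEST_PATTERN.finditer(text):
    print(m.group("stat"), m.group("df1"), m.group("df2"), m.group("value"))
```

Running the example prints the two test statistics with their degrees of freedom, while ignoring the reported p-value.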
Moderator Analyses
Motyl et al. (in press) did extensive coding of the studies. This makes it possible to examine potential moderators (predictors) of higher or lower replicability. As noted earlier, the strength of evidence is an important predictor. Studies with higher z-scores (smaller p-values) are, on average, more replicable. The strength of evidence is a direct function of statistical power. Thus, studies with larger population effect sizes and smaller sampling error are more likely to replicate.
It is well known that larger samples have less sampling error. Not surprisingly, there is a correlation between sample size and the absolute z-scores (r = .3). I also examined the R-Index for different ranges of sample sizes. The R-Index was lowest for sample sizes between N = 40 and 80 (R-Index = 43), increased for N = 80 to 200 (R-Index = 52), and increased further for sample sizes between 200 and 1,000 (R-Index = 69). Interestingly, the R-Index for small samples with N < 40 was 70. This is explained by the fact that research designs also influence replicability and that studies with small samples often use more powerful within-subject designs.
A moderator analysis with design as moderator confirms this. The R-Index is lowest for between-subject designs (R-Index = 48), followed by mixed designs (R-Index = 61) and within-subject designs (R-Index = 75). This pattern was also found in the OSC reproducibility project and partially accounts for the higher replicability of cognitive studies, which often employ within-subject designs.
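A simple power calculation illustrates why design matters. The numbers below are my own illustrative choices, not values from the article: a standardized effect of d = .4, 40 observations per cell, and a correlation of r = .5 between repeated measures, which raises power from about .42 in the between-subject design to about .69 in the within-subject design.

```python
# Illustrative only: power of a between-subject vs. a within-subject (paired) design
# for the same standardized effect size and the same per-cell sample size.
from statsmodels.stats.power import TTestIndPower, TTestPower

d, n, r = 0.4, 40, 0.5                                    # assumed effect size, cell size, correlation
between = TTestIndPower().power(effect_size=d, nobs1=n, alpha=0.05)    # two groups of 40
d_paired = d / (2 * (1 - r)) ** 0.5                       # effect size for difference scores
within = TTestPower().power(effect_size=d_paired, nobs=n, alpha=0.05)  # 40 pairs of observations

print(f"between-subject power: {between:.2f}, within-subject power: {within:.2f}")
```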
Another possibility is that articles with more studies contain smaller and less replicable studies. However, the number of studies in an article was not a notable moderator: 1 study, R-Index = 53; 2 studies, R-Index = 51; 3 studies, R-Index = 60; 4 studies, R-Index = 52; 5 studies, R-Index = 53.
Conclusion
Motyl et al. (in press) coded a large and representative sample of results published in social psychology journals. Their article complements the OSC reproducibility project, which used actual replications but a much smaller number of studies. The two approaches produce different results. Actual replication studies produced only 25% successful replications, whereas statistical estimates of replicability are around 50%. Due to the small number of actual replications in the OSC reproducibility project, it is important to be cautious in interpreting this difference. However, one plausible explanation for lower success rates in actual replication studies is that it is practically impossible to redo a study exactly. This may even be true when researchers conduct three similar studies in their own lab and only one of these studies produces a significant result. Some non-random, but also non-reproducible, factor may have helped to produce a significant result in this study. Statistical models assume that a study can be redone exactly and may therefore overestimate the success rate of actual replication studies. Thus, the 50% estimate is an optimistic estimate for the unlikely scenario that a study can be replicated exactly. This means that even though optimists may see the 50% estimate as “the glass half full,” social psychologists need to increase statistical power and pay more attention to the strength of evidence of published results to build a robust and credible science of social behavior.