diff --git a/config/_default/menus.toml b/config/_default/menus.toml index 573f0f533a7..f271d5cc6e5 100644 --- a/config/_default/menus.toml +++ b/config/_default/menus.toml @@ -331,22 +331,28 @@ weight = 69 parent = "replications" +[[main]] + name = "Replication Network Blog" + url = "/replication-hub/blog" + weight = 70 + parent = "replications" + [[main]] name = "Replication Research Journal" url = "https://replicationresearch.org" - weight = 70 + weight = 71 parent = "replications" [[main]] name = "Replication Manuscript Templates" url = "https://osf.io/brxtd/" - weight = 71 + weight = 72 parent = "replications" [[main]] name = "Submit a replication to FReD" url = "/replication-hub/submit" - weight = 72 + weight = 73 parent = "replications" diff --git a/content/replication-hub/blog/_index.md b/content/replication-hub/blog/_index.md new file mode 100644 index 00000000000..c9e386dcf22 --- /dev/null +++ b/content/replication-hub/blog/_index.md @@ -0,0 +1,21 @@ +--- +title: "Replication Network Blog" +date: 2025-11-04 +type: blog +url: "/replication-hub/blog/" +--- + +Welcome to the Replication Network Blog, a collection of guest posts, perspectives, and discussions on replication research, reproducibility, and open science practices. + +Browse through our archive of articles covering topics including: +- Replication studies and methodologies +- Statistical considerations in replication research +- Peer review and publishing practices +- Meta-science and research quality +- Teaching and learning about replications + +The blog features contributions from researchers, statisticians, and practitioners who share their insights and experiences with replication research across various disciplines. + +--- + +## Recent Blog Posts diff --git a/content/replication-hub/blog/anderson-maxwell-there-s-more-than-one-way-to-conduct-a-replication-study-six-in-fact.md b/content/replication-hub/blog/anderson-maxwell-there-s-more-than-one-way-to-conduct-a-replication-study-six-in-fact.md new file mode 100644 index 00000000000..e9377f556a1 --- /dev/null +++ b/content/replication-hub/blog/anderson-maxwell-there-s-more-than-one-way-to-conduct-a-replication-study-six-in-fact.md @@ -0,0 +1,75 @@ +--- +title: "ANDERSON & MAXWELL: There’s More than One Way to Conduct a Replication Study – Six, in Fact" +date: 2017-02-28 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "confidence intervals" + - "equivalence tests" + - "p-value" + - "replication" + - "significance testing" +draft: false +type: blog +--- + +###### *NOTE: This entry is based on the article, “There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance” (Psychological Methods, 2016, Vol. 21, No. 1, 1-12)* + +###### Following a large-scale replication project in economics (Chang & Li, 2015) that successfully replicated only a third of 67 studies, a recent headline boldly reads, “The replication crisis has engulfed economics” (Ortman, 2015). Several fields are suffering from a “crisis of confidence” (Pashler & Wagenmakers, 2012, p. 528), as widely publicized replication projects in psychology and medicine have shown similarly disappointing results (e.g., Open Science Collaboration, 2015; Prinz, Schlange, & Asadullah, 2011). There are certainly a host of factors contributing to the crisis, but there is a silver lining: the recent increase in attention toward replication has allowed researchers to consider various ways in which replication research can be improved.
Our article (Anderson & Maxwell, 2016, *Psychological Methods*) sheds light on one potential way to broaden the effectiveness of replication research. + +###### In our article, we take the perspective that replication has often been narrowly defined. Namely, if a replication study is statistically significant, it is considered successful, whereas if the replication study does not meet the significance threshold, it is considered a failure. However, replication need not only be defined by this significant/non-significant distinction. We posit that what constitutes a successful replication can vary based on a researcher’s specific goal. We outline six replication goals and provide details on the statistical analysis for each, noting that these goals are by no means exhaustive. + +###### Deeming a replication successful when the result is statistically significant is indeed merited in a number of situations (Goal 1). For example, consider the case where two competing theories are pitted against each other. In this situation, we argue that it is the direction of the effect, rather than its magnitude, that validates one theory over the other. Significance-based replication can be quite informative in these cases. However, even in this situation, a nonsignificant result should not be taken to mean that the replication was a failure. Researchers who wish to provide evidence that a reported effect is null can consider Goal 2. + +###### In Goal 2, researchers are interested in showing that an effect does not exist. Although some researchers seem to be aware that this is a valid goal, their choice of analysis often only fails to reject the null, which is rather weak evidence for nonreplication. We encourage researchers who would like to show that a claimed effect is null to use an equivalence test or Bayesian methods (e.g., ROPE, Kruschke, 2011; Bayes-factors, Rouder & Morey, 2012), both of which can reliably show an effect is essentially zero, rather than simply that it is not statistically significant. + +###### Goal 3 involves accurately estimating the magnitude of a claimed effect. Research has shown that effect sizes in published research are upwardly biased (Lane & Dunlap, 1978; Maxwell, 2004), and effect sizes from underpowered studies may have wide confidence intervals. Thus, a replication researcher may have reason to question the reported effect size of a study and desire to obtain a more accurate estimate of the effect. Researchers with this goal in mind can use accuracy in parameter estimation (AIPE; Maxwell, Kelley, & Rausch, 2008) approaches to plan their sample sizes so that a desired degree of precision in the effect size estimate can be achieved. In the analysis phase, we encourage these researchers to report a confidence interval around the replication effect size. Thus, successful replication for Goal 3 is defined by the degree of precision in estimating the effect size. + +###### Goal 4 involves combining data from a replication study with a published original study, effectively conducting a small meta-analysis on the two studies. Importantly, access to the raw data from the original study is often not necessary. This approach is in keeping with the idea of continuously cumulating meta-analysis (CCMA; Braver, Thoemmes, & Rosenthal, 2014), wherein each new replication can be incorporated into the existing knowledge base. Researchers can report a confidence interval around the average (weighted) effect size of the two studies (e.g., Bonett, 2009).
This goal begins to correct some of the issues associated with underpowered studies, even when only a single replication study is involved. For example, Braver and colleagues (2014) illustrate a situation in which the *p*-value combining original and replication studies (*p* = .016) was smaller than both the original study (*p* = .033) and the replication study (*p* = .198), emphasizing the power advantage of this technique. + +###### In Goal 5, researchers aim to show that a replication effect size is inconsistent with that of the original study. A simple difference in statistical significance is not suited for this goal. In fact, the difference between a statistically significant and nonsignificant finding is not necessarily statistically significant (Gelman & Stern, 2006). Rather, we encourage researchers to consider testing the difference in effect sizes between the two studies, using a confidence interval approach (e.g., Bonett, 2009). Although some authors declare a replication to be a failure when the replication effect size is smaller in magnitude than that reported by the original study, testing the difference in effect sizes for significance is a much more precise indicator of replication success in this situation. Specifically, a nominal difference in effect sizes does not imply that the effects differ statistically (Bonett & Wright, 2007). + +###### Finally, Goal 6 involves showing that a replication effect is consistent with the original effect. Combining the recommended analyses for Goals 2 and 5, we recommend conducting an equivalence test on the difference in effect sizes. Authors who declare their replication study successful when the effect size appears similar to the original study could benefit from knowledge of these analyses, as descriptively similar effect sizes may statistically differ. + +###### We hope that the broader view of replication that we present in our article allows researchers to expand their goals for replication research as well as utilize more precise indicators of replication success and non-success. Although recent replication attempts have painted a grim picture in many fields, we are confident that the recent emphasis on replication will bring about a literature in which readers can be more confident, in economics, psychology, and beyond. + +###### *Scott Maxwell is Professor and Matthew A. Fitzsimon Chair in the Department of Psychology at the University of Notre Dame. Samantha Anderson is a PhD student, also in the Department of Psychology at Notre Dame. Correspondence about this blog should be addressed to her at Samantha.F.Anderson.350@nd.edu.* + +###### **REFERENCES** + +###### Bonett, D. G. (2009). Meta-analytic interval estimation for standardized and unstandardized mean differences. *Psychological Methods, 14*, 225–238. doi:10.1037/a0016619 + +###### Bonett, D. G., & Wright, T. A. (2007). Comments and recommendations regarding the hypothesis testing controversy. *Journal of Organizational Behavior, 28*, 647–659. doi:10.1002/job.448 + +###### Braver, S. L., Thoemmes, F. J., & Rosenthal, R. (2014). Continuously cumulating meta-analysis and replicability. *Perspectives on Psychological Science, 9*, 333–342. doi:10.1177/1745691614529796 + +###### Chang, A. C., & Li, P. (2015). Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say “Usually Not.” *Finance and Economics Discussion Series 2015-083*.
Washington: Board of Governors of the Federal Reserve System, doi:10.17016/FEDS.2015.083 + +###### Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. *The American Statistician, 60*, 328–331. doi:10.1198/000313006X152649 + +###### Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. *Perspectives on Psychological Science, 6*, 299–312. doi:10.1177/1745691611406925 + +###### Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. *British Journal of Mathematical and Statistical Psychology, 31*, 107–112. doi:10.1111/j.2044-8317.1978.tb00578.x + +###### Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. *Psychological Methods, 9*, 147–163. doi:10.1037/1082-989X.9.2.147 + +###### Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. *Annual Review of Psychology, 59*, 537–563. doi:10.1146/annurev.psych.59.103006.093735 + +###### Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. *Science, 349,* aac4716. doi:10.1126/science.aac4716 + +###### Ortman, A. (2015, November 2). *The replication crisis has engulfed economics*. Retrieved from + +###### Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? *Perspectives on Psychological Science, 7,* 528–530. doi:10.1177/1745691612465253 + +###### Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? *Nature Reviews Drug Discovery*, *10*, 712–713. + +###### Rouder, J. N., & Morey, R. D. (2012). Default Bayes factors for model selection in regression. *Multivariate Behavioral Research, 47*, 877–903. doi:10.1080/00273171.2012.734737 + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/02/28/anderson-maxwell-theres-more-than-one-way-to-conduct-a-replication-study-six-in-fact/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/02/28/anderson-maxwell-theres-more-than-one-way-to-conduct-a-replication-study-six-in-fact/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-comparing-human-only-ai-assisted-and-ai-led-teams-on-assessing-research-reproducibility-in-quant.md b/content/replication-hub/blog/aoi-comparing-human-only-ai-assisted-and-ai-led-teams-on-assessing-research-reproducibility-in-quant.md new file mode 100644 index 00000000000..fb9c2b94ad4 --- /dev/null +++ b/content/replication-hub/blog/aoi-comparing-human-only-ai-assisted-and-ai-led-teams-on-assessing-research-reproducibility-in-quant.md @@ -0,0 +1,43 @@ +--- +title: "AoI*: “Comparing Human-Only, AI-Assisted, and AI-Led Teams on Assessing Research Reproducibility in Quantitative Social Science” by Brodeur et al.
(2025)" +date: 2025-01-18 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "AI" + - "AI-assisted research" + - "AI-led analysis" + - "Artificial Intelligence" + - "Human vs AI collaboration" + - "Quantitative social science" + - "Reproducibility assessment" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from** [***the article***](https://www.econstor.eu/bitstream/10419/308508/1/I4R-DP195.pdf)**)** + +“This study evaluates the effectiveness of varying levels of human and artificial intelligence (AI) integration in reproducibility assessments of quantitative social science research.” + +“We computationally reproduced quantitative results from published articles in the social sciences with 288 researchers, randomly assigned to 103 teams across three groups — human-only teams, AI-assisted teams and teams whose task was to minimally guide an AI to conduct reproducibility checks (the “AI-led” approach).” + +“Findings reveal that when working independently, human teams matched the reproducibility success rates of teams using AI assistance, while both groups substantially outperformed AI-led approaches (with human teams achieving 57 percentage points higher success rates than AI-led teams, 𝒑 < 0.001).” + +“Human teams were particularly effective at identifying serious problems in the analysis: they found significantly more major errors compared to both AI-assisted teams (0.7 more errors per team, 𝒑 = 0.017) and AI-led teams (1.1 more errors per team, 𝒑 < 0.001). AI-assisted teams demonstrated an advantage over more automated approaches, detecting 0.4 more major errors per team than AI-led teams (𝒑 = 0.029), though still significantly fewer than human-only teams. Finally, both human and AI-assisted teams significantly outperformed AI-led approaches in both proposing (25 percentage points difference, 𝒑 = 0.017) and implementing (33 percentage points difference, 𝒑 = 0.005) comprehensive robustness checks.” + +“These results underscore both the strengths and limitations of AI assistance in research reproduction and suggest that despite impressive advancements in AI capability, key aspects of the research publication process still require substantial human involvement.” + +**REFERENCE** + +[Brodeur, Abel et al. (2025) : Comparing Human-Only, AI-Assisted, and AI-Led Teams on Assessing Research Reproducibility in Quantitative Social Science, I4R Discussion Paper Series, No. 195, Institute for Replication (I4R), s.l.](https://www.econstor.eu/bitstream/10419/308508/1/I4R-DP195.pdf) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/01/18/aoi-comparing-human-only-ai-assisted-and-ai-led-teams-on-assessing-research-reproducibility-in-quantitative-social-science-by-brodeur-et-al-2025/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/01/18/aoi-comparing-human-only-ai-assisted-and-ai-led-teams-on-assessing-research-reproducibility-in-quantitative-social-science-by-brodeur-et-al-2025/?share=facebook) + +Like Loading...
\ No newline at end of file diff --git a/content/replication-hub/blog/aoi-conventional-wisdom-meta-analysis-and-research-revision-in-economics-by-gechert-et-al-2023.md b/content/replication-hub/blog/aoi-conventional-wisdom-meta-analysis-and-research-revision-in-economics-by-gechert-et-al-2023.md new file mode 100644 index 00000000000..607a99105d6 --- /dev/null +++ b/content/replication-hub/blog/aoi-conventional-wisdom-meta-analysis-and-research-revision-in-economics-by-gechert-et-al-2023.md @@ -0,0 +1,38 @@ +--- +title: "AoI*: “Conventional Wisdom, Meta-Analysis, and Research Revision in Economics” by Gechert  et al. (2023)" +date: 2023-12-27 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Conventional wisdom" + - "maer-net" + - "Meta-analysis" + - "publication bias" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report excerpts of recent research related to replication and research integrity.]* + +**EXCERPT (taken from the [article](https://www.econstor.eu/bitstream/10419/280745/1/Meta-analysis-review.pdf))** + +“The purpose of this study is to compare the findings of influential meta-analyses to the ‘conventional wisdom’ about the same economic question or issue. What have we learned from meta-analyses of economics? How do their results differ from the conventional, textbook understanding of economics?” + +“We identify ‘influential’ meta-analyses as those with at least 100 citations that were published in 2000 or later, and those that were recommended by a survey of members of the Meta-Analysis of Economics Research Network (MAER-Net)” + +“Out of the full sample of 360 studies, 72 studies cover a general interest topic in economics and include original empirical estimates for a certain effect size. We narrow down further to those meta-analyses that provide both a simple mean of the original effect size and a corrected mean, controlling for publication bias or other biases. This gives us a final list of 24 studies covering the fields of growth and development, finance, public finance, education, international, labor, behavioral, gender, environmental, and regional/urban economics.” + +“We compare the central findings of the meta-analyses to ‘conventional wisdom’ as classified by: (1) a widely recognized seminal paper or authoritative literature review; (2) the assessment of an artificial intelligence (AI), the GPT-4 Large Language Model (LLM); and (3) the simple unweighted average of reported effects included in the metaanalysis.” + +“For 17 of these 24 studies, the corrected effect size is substantially closer to zero than commonly thought, or even switches sign. Statistically significant publication bias is prevalent in 17 of the 24 studies. Overall, we find that 16 of 24 studies show both a clear reduction in effect size and a statistically significant publication bias. Comparing the best estimate from the meta-analysis with the conventional wisdom from the reference study, the GPT-4 estimate, or the simple unweighted average, the relative reduction in the effect size is in the range of 45-60% in all three comparison cases.” + +REFERENCE: [Gechert, S., Mey, B., Opatrny, M., Havranek, T., Stanley, T. D., Bom, P. R., … & Rachinger, H. J. (2023). Conventional Wisdom, Meta-Analysis, and Research Revision in Economics](https://www.econstor.eu/bitstream/10419/280745/1/Meta-analysis-review.pdf). 
+ +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2023/12/27/aoi-conventional-wisdom-meta-analysis-and-research-revision-in-economics-by-gechert-et-al-2023/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2023/12/27/aoi-conventional-wisdom-meta-analysis-and-research-revision-in-economics-by-gechert-et-al-2023/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-decisions-decisions-decisions-an-ethnographic-study-of-researcher-discretion-in-practice-by-van-.md b/content/replication-hub/blog/aoi-decisions-decisions-decisions-an-ethnographic-study-of-researcher-discretion-in-practice-by-van-.md new file mode 100644 index 00000000000..38349400e26 --- /dev/null +++ b/content/replication-hub/blog/aoi-decisions-decisions-decisions-an-ethnographic-study-of-researcher-discretion-in-practice-by-van-.md @@ -0,0 +1,36 @@ +--- +title: "AoI*: “Decisions, Decisions, Decisions: An Ethnographic Study of Researcher Discretion in Practice” by van Drimmelen et al. (2024)" +date: 2025-03-05 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Ethnographic study" + - "Pre-Analysis plans" + - "Research practice" + - "Researcher discretion" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from *[the article](https://link.springer.com/article/10.1007/s11948-024-00481-5)*)** + +“This paper is a study of the decisions that researchers take during the execution of a research plan: their researcher discretion. Flexible research methods are generally seen as undesirable, and many methodologists urge to eliminate these so-called ‘researcher degrees of freedom’ from the research practice. However, what this looks like in practice is unclear.” + +“Based on twelve months of ethnographic fieldwork in two end-of-life research groups in which we observed research practice, conducted interviews, and collected documents, we explore when researchers are required to make decisions, and what these decisions entail.” + +“Our ethnographic study of research practice suggests that researcher discretion is an integral and inevitable aspect of research practice, as many elements of a research protocol will either need to be further operationalised or adapted during its execution. Moreover, it may be difficult for researchers to identify their own discretion, limiting their effectivity in transparency.” + +**REFERENCE** + +[van Drimmelen, T., Slagboom, M.N., Reis, R. *et al.* Decisions, Decisions, Decisions: An Ethnographic Study of Researcher Discretion in Practice. *Sci Eng Ethics* **30**, 59 (2024). https://doi.org/10.1007/s11948-024-00481-5](https://link.springer.com/article/10.1007/s11948-024-00481-5) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/03/05/aoi-decisions-decisions-decisions-an-ethnographic-study-of-researcher-discretion-in-practice-by-van-drimmelen-et-al-2024/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/03/05/aoi-decisions-decisions-decisions-an-ethnographic-study-of-researcher-discretion-in-practice-by-van-drimmelen-et-al-2024/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/aoi-do-experimental-asset-market-results-replicate-high-powered-preregistered-replications-of-17-cla.md b/content/replication-hub/blog/aoi-do-experimental-asset-market-results-replicate-high-powered-preregistered-replications-of-17-cla.md new file mode 100644 index 00000000000..48f0b09be8b --- /dev/null +++ b/content/replication-hub/blog/aoi-do-experimental-asset-market-results-replicate-high-powered-preregistered-replications-of-17-cla.md @@ -0,0 +1,34 @@ +--- +title: "AoI*: “Do experimental asset market results replicate? High powered preregistered replications of 17 claims” by Huber et al. (2024)" +date: 2025-01-08 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Behavioral Economics" + - "Bubbles" + - "Cognitive Skills" + - "Experimental asset markets" + - "Gender" + - "replication" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from [the article](https://www.econstor.eu/bitstream/10419/307733/1/1912217104.pdf?fbclid=IwY2xjawHZRpVleHRuA2FlbQIxMAABHb_waLnAegpC1bUW7VU9jFBjjxxhxWpBVHG6jMe25jJeEUt8UbgtUFvWXA_aem_xOkLhu33L84tsgvqStoP7w))** + +“Experimental asset markets provide a controlled approach to studying financial markets. We attempt to replicate 17 key results from four prominent studies, collecting new data from 166 markets with 1,544 participants. Only 3 of the 14 original results reported as statistically significant were successfully replicated, with an average replication effect size of 2.9% of the original estimates. We fail to replicate findings on emotions, self-control, and gender differences in bubble formation but confirm that experience reduces bubbles and cognitive skills explain trading success. Our study demonstrates the importance of replications in enhancing the credibility of scientific claims in this field.” + +**REFERENCE** + +[Huber, C., Holzmeister, F., Johannesson, M., König-Kersting, C., Dreber, A., Huber, J., & Kirchler, M. (2024). *Do experimental asset market results replicate? High-powered preregistered replications of 17 claims*. University of Innsbruck Working Papers in Economics and Statistics, Working Papers in Economics and Statistics, No. 2024-12.](https://www.econstor.eu/bitstream/10419/307733/1/1912217104.pdf?fbclid=IwY2xjawHZRpVleHRuA2FlbQIxMAABHb_waLnAegpC1bUW7VU9jFBjjxxhxWpBVHG6jMe25jJeEUt8UbgtUFvWXA_aem_xOkLhu33L84tsgvqStoP7w) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/01/08/aoi-do-experimental-asset-market-results-replicate-high-powered-preregistered-replications-of-17-claims-by-huber-et-al-2024/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/01/08/aoi-do-experimental-asset-market-results-replicate-high-powered-preregistered-replications-of-17-claims-by-huber-et-al-2024/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/aoi-do-pre-registration-and-pre-analysis-plans-reduce-p-hacking-and-publication-bias-evidence-from-1.md b/content/replication-hub/blog/aoi-do-pre-registration-and-pre-analysis-plans-reduce-p-hacking-and-publication-bias-evidence-from-1.md new file mode 100644 index 00000000000..41061dfe22a --- /dev/null +++ b/content/replication-hub/blog/aoi-do-pre-registration-and-pre-analysis-plans-reduce-p-hacking-and-publication-bias-evidence-from-1.md @@ -0,0 +1,49 @@ +--- +title: "AoI*: “Do Pre-Registration and Pre-Analysis Plans Reduce p-Hacking and Publication Bias? Evidence from 15,992 Test Statistics and Suggestions for Improvement” by Brodeur et al. (2023)" +date: 2024-01-10 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "economics" + - "p-hacking" + - "Pre-Analysis plans" + - "Pre-registration" + - "publication bias" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report excerpts of recent research related to replication and research integrity.]* + +**EXCERPTS (taken from the [article](https://www.econstor.eu/bitstream/10419/280894/1/GLO-DP-1147pre.pdf))** + +“Pre-registration is regarded as an important contributor to research credibility. We investigate this by analyzing the pattern of test statistics from the universe of randomized controlled trials (RCT) studies published in 15 leading economics journals.” + +“We draw two conclusions:” + +“(a) Pre-registration frequently does not involve a pre-analysis plan (PAP), or sufficient detail to constrain meaningfully the actions and decisions of researchers after data is collected. Consistent with this, we find no evidence that pre-registration in itself reduces p-hacking and publication bias.” + +“(b) When pre-registration is accompanied by a PAP we find evidence consistent with both reduced p-hacking and publication bias.” + +“…we proceed with notions of pre-registration and PAPs as practiced in economics, or at least as operationalized by the largest and most influential professional association in the discipline, the AEA.” + +“In their discussion relating to psychology, Nosek et al. (2018) contend that: “An effective solution is to define the research questions and analysis plan before observing the research outcomes – a process called preregistration,” which implies that pre-registration and the existence of a PAP are one and the same thing (see also Simmons et al. (2021) for a similar contention). This is far from how things work in economics, as we will show here.” + +“We contend that many readers believe, wrongly according to our analysis, that pre-registration in itself implies enhanced research credibility, which would explain the weight apparently attached to whether a study is pre-registered or not in assessing its likely validity. Our finding is that credibility is enhanced only with inclusion of a PAP.” + +**REFERENCES:** + +[Brodeur, Abel; Cook, Nikolai M.; Hartley, Jonathan S.; Heyes, Anthony (2023) : Do Pre-Registration and Pre-Analysis Plans Reduce p-Hacking and Publication Bias?: Evidence from 15,992 Test Statistics and Suggestions for Improvement, GLO Discussion Paper, No. 1147 [pre.], Global Labor Organization (GLO), Essen](https://www.econstor.eu/bitstream/10419/280894/1/GLO-DP-1147pre.pdf) + +[Nosek, B. A., Ebersole, C. R., DeHaven, A. C. and Mellor, D.
T.: 2018, The Preregistration Revolution, Proceedings of the National Academy of Sciences 115(11), 2600–2606.](https://www.pnas.org/cdi/doi/10.1073/pnas.1708274114) + +[Simmons, P. J., Nelson, L. D. and Simonsohn, U.: 2021, Pre-Registration: Why and How, Journal of Consumer Psychology 31(1), 151–162.](https://myscp.onlinelibrary.wiley.com/doi/abs/10.1002/jcpy.1208?casa_token=il4lxJaxU4IAAAAA:M6HqA7p9y7JgrWRTFdlLFc6AEt9jOGQOgNyiZDXcz__Tb5d5Cn_hbsQ-JfqU22OcBjlR6pFiDijmzG_m) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/01/10/aoi-do-pre-registration-and-pre-analysis-plans-reduce-p-hacking-and-publication-bias-evidence-from-15992-test-statistics-and-suggestions-for-improvement-by-brodeur-et-al/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/01/10/aoi-do-pre-registration-and-pre-analysis-plans-reduce-p-hacking-and-publication-bias-evidence-from-15992-test-statistics-and-suggestions-for-improvement-by-brodeur-et-al/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-estimating-the-extent-of-selective-reporting-an-application-to-economics-by-bruns-et-al-2024.md b/content/replication-hub/blog/aoi-estimating-the-extent-of-selective-reporting-an-application-to-economics-by-bruns-et-al-2024.md new file mode 100644 index 00000000000..3a25130af25 --- /dev/null +++ b/content/replication-hub/blog/aoi-estimating-the-extent-of-selective-reporting-an-application-to-economics-by-bruns-et-al-2024.md @@ -0,0 +1,32 @@ +--- +title: "AoI*: “Estimating the Extent of Selective Reporting: An Application to Economics” by Bruns et al. (2024)" +date: 2024-02-27 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Meta-analyses" + - "p-values" + - "Publication selection" + - "Selective reporting" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from *[the article](https://onlinelibrary.wiley.com/doi/pdf/10.1002/jrsm.1711)*)** + +“Using a sample of 70,399 published p-values from 192 meta-analyses, we empirically estimate the counterfactual distribution of p-values in the absence of any biases. Comparing observed p-values with counterfactually expected p-values allows us to estimate how many p-values are published as being statistically significant when they should have been published as non-significant. We estimate the extent of selectively reported p-values to range between 57.7% and 71.9% of the significant p-values. The counterfactual p-value distribution also allows us to assess shifts of p-values along the entire distribution of published p-values, revealing that particularly very small p-values (p < 0.001) are unexpectedly abundant in the published literature. Subsample analysis suggests that the extent of selective reporting is reduced in research fields that use experimental designs, analyze microeconomics research questions, and have at least some adequately powered studies.” + +**REFERENCES:** + +[Bruns, S. B., Deressa, T. K., Stanley, T. D., Doucouliagos, C., & Ioannidis, J. P. (2024). Estimating the extent of selective reporting: An application to economics.
Research Synthesis Methods.](https://onlinelibrary.wiley.com/doi/pdf/10.1002/jrsm.1711) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/02/27/aoi-estimating-the-extent-of-selective-reporting-an-application-to-economics-by-bruns-et-al-2024/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/02/27/aoi-estimating-the-extent-of-selective-reporting-an-application-to-economics-by-bruns-et-al-2024/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-high-replicability-of-newly-discovered-social-behavioural-findings-is-achievable-by-protzko-et-a.md b/content/replication-hub/blog/aoi-high-replicability-of-newly-discovered-social-behavioural-findings-is-achievable-by-protzko-et-a.md new file mode 100644 index 00000000000..4e34b2aebe5 --- /dev/null +++ b/content/replication-hub/blog/aoi-high-replicability-of-newly-discovered-social-behavioural-findings-is-achievable-by-protzko-et-a.md @@ -0,0 +1,33 @@ +--- +title: "AoI*: “High replicability of newly discovered social-behavioural findings is achievable” by Protzko et al. (2023)" +date: 2023-11-18 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Confirmatory tests" + - "Large sample sizes" + - "Methodological transparency" + - "Preregistration" + - "replication" + - "Replication rates" + - "Rigour-enhancing practices" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from the article)** + +“Failures to replicate evidence of new discoveries have forced scientists to ask whether this unreliability is due to suboptimal implementation of methods or whether presumptively optimal methods are not, in fact, optimal. This paper reports an investigation by four coordinated laboratories of the prospective replicability of 16 novel experimental findings using rigour-enhancing practices: confirmatory tests, large sample sizes, preregistration and methodological transparency. In contrast to past systematic replication efforts that reported replication rates averaging 50%, replication attempts here produced the expected effects with significance testing (P < 0.05) in 86% of attempts, slightly exceeding the maximum expected replicability based on observed effect sizes and sample sizes. When one lab attempted to replicate an effect discovered by another lab, the effect size in the replications was 97% that in the original study. This high replication rate justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries.” + +Reference: [Protzko, J., Krosnick, J., Nelson, L., Nosek, B. A., Axt, J., Berent, M., … & Schooler, J. W. (2023). High replicability of newly discovered social-behavioural findings is achievable. Nature Human Behaviour, 1-9](https://www.nature.com/articles/s41562-023-01749-9) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2023/11/18/aoi-high-replicability-of-newly-discovered-social-behavioural-findings-is-achievable-by-protzko-et-al-2023/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2023/11/18/aoi-high-replicability-of-newly-discovered-social-behavioural-findings-is-achievable-by-protzko-et-al-2023/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/aoi-introducing-synchronous-robustness-reports-by-bartos-et-al-2025.md b/content/replication-hub/blog/aoi-introducing-synchronous-robustness-reports-by-bartos-et-al-2025.md new file mode 100644 index 00000000000..27e2ef774ae --- /dev/null +++ b/content/replication-hub/blog/aoi-introducing-synchronous-robustness-reports-by-bartos-et-al-2025.md @@ -0,0 +1,41 @@ +--- +title: "AoI*: “Introducing Synchronous Robustness Reports” by Bartos et al. (2025)" +date: 2025-03-20 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "FAIR data principles" + - "Journal policies" + - "Many-Analysts Approach" + - "Methodological diversity" + - "Publication workflow" + - "Robustness in scientific research" + - "TOP (Transparency and Openness Promotion) guidelines" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +NOTE: The article is behind a firewall. + +**ABSTRACT (taken from *[the article](https://www.nature.com/articles/s41562-025-02129-1)*)** + +“Most empirical research articles feature a single primary analysis that is conducted by the authors. However, different analysis teams usually adopt different analytical approaches and frequently reach varied conclusions. We propose synchronous robustness reports [SRRs] — brief reports that summarize the results of alternative analyses by independent experts — to strengthen the credibility of science.” + +“To integrate SRRs seamlessly into the publication process, we suggest the framework outlined as a flowchart in Fig. 2. As the flowchart shows, the SRRs form a natural extension to the standard review process.” + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2025/03/image.webp) + +**REFERENCE** + +[Bartoš, F., Sarafoglou, A., Aczel, B. *et al.* Introducing synchronous robustness reports. *Nat Hum Behav* (2025). https://doi.org/10.1038/s41562-025-02129-1](https://doi.org/10.1038/s41562-025-02129-1) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/03/20/aoi-introducing-synchronous-robustness-reports-by-bartos-et-al-2025/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/03/20/aoi-introducing-synchronous-robustness-reports-by-bartos-et-al-2025/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-mass-reproducibility-and-replicability-a-new-hope-by-brodeur-et-al-2024.md b/content/replication-hub/blog/aoi-mass-reproducibility-and-replicability-a-new-hope-by-brodeur-et-al-2024.md new file mode 100644 index 00000000000..6a969b3f0ab --- /dev/null +++ b/content/replication-hub/blog/aoi-mass-reproducibility-and-replicability-a-new-hope-by-brodeur-et-al-2024.md @@ -0,0 +1,37 @@ +--- +title: "AoI*: “Mass Reproducibility and Replicability: A New Hope” by Brodeur et al. 
(2024)" +date: 2024-04-10 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "110 reproductions/replications" + - "economics" + - "Journals" + - "Many analysts" + - "Open Science" + - "political science" + - "Re-analysis" + - "replication" + - "Reproduction" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from *[the article](https://www.econstor.eu/bitstream/10419/289437/1/I4R-DP107.pdf)*)** + +“This study pushes our understanding of research reliability by producing and replicating claims from 110 papers in leading economic and political science journals. The analysis involves computational reproducibility checks and robustness assessments. It reveals several patterns. First, we uncover a high rate of fully computationally reproducible results (over 85%). Second, excluding minor issues like missing packages or broken pathways, we uncover coding errors for about 25% of studies, with some studies containing multiple errors. Third, we test the robustness of the results to 5,511 re-analyses. We find a robustness reproducibility of about 70%. Robustness reproducibility rates are relatively higher for re-analyses that introduce new data and lower for re-analyses that change the sample or the definition of the dependent variable. Fourth, 52% of re-analysis effect size estimates are smaller than the original published estimates and the average statistical significance of a re-analysis is 77% of the original. Lastly, we rely on six teams of researchers working independently to answer eight additional research questions on the determinants of robustness reproducibility. Most teams find a negative relationship between replicators’ experience and reproducibility, while finding no relationship between reproducibility and the provision of intermediate or even raw data combined with the necessary cleaning codes.” + +**REFERENCE** + +[Brodeur, Abel et al. (2024) : Mass Reproducibility and Replicability: A New Hope, I4R Discussion Paper Series, No. 107, Institute for Replication (I4R), s.l.](https://www.econstor.eu/bitstream/10419/289437/1/I4R-DP107.pdf) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/04/10/aoi-mass-reproducibility-and-replicability-a-new-hope-by-brodeur-et-al-2024/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/04/10/aoi-mass-reproducibility-and-replicability-a-new-hope-by-brodeur-et-al-2024/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-open-minds-tied-hands-awareness-behavior-and-reasoning-on-open-science-and-irresponsible-researc.md b/content/replication-hub/blog/aoi-open-minds-tied-hands-awareness-behavior-and-reasoning-on-open-science-and-irresponsible-researc.md new file mode 100644 index 00000000000..f67787e2703 --- /dev/null +++ b/content/replication-hub/blog/aoi-open-minds-tied-hands-awareness-behavior-and-reasoning-on-open-science-and-irresponsible-researc.md @@ -0,0 +1,36 @@ +--- +title: "AoI*: “Open minds, tied hands: Awareness, behavior, and reasoning on open science and irresponsible research behavior” by Wiradhany et al.
(2025)" +date: 2025-02-25 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Irresponsible Research Behavior (IRB)" + - "Open Science Practices (OSP)" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from *[the article](https://www.tandfonline.com/doi/full/10.1080/08989621.2025.2457100)*)** + +“Knowledge on Open Science Practices (OSP) has been promoted through responsible conduct of research training and the development of open science infrastructure to combat Irresponsible Research Behavior (IRB). Yet, there is limited evidence for the efficacy of OSP in minimizing IRB.” + +“We asked N=778 participants to fill in questionnaires that contain OSP and ethical reasoning vignettes, and report self-admission rates of IRB and personality traits. We found that against our initial prediction, even though OSP was negatively correlated with IRB, this correlation was very weak, and upon controlling for individual differences factors, OSP neither predicted IRB nor was this relationship moderated by ethical reasoning.” + +“On the other hand, individual differences factors, namely dark personality triad, and conscientiousness and openness, contributed more to IRB than OSP knowledge.” + +“Our findings suggest that OSP knowledge needs to be complemented by the development of ethical virtues to encounter IRBs more effectively.” + +**REFERENCE** + +[Wiradhany, W., Djalal, F. M., & de Bruin, A. B. H. (2025). Open minds, tied hands: Awareness, behavior, and reasoning on open science and irresponsible research behavior. *Accountability in Research*, 1–24. https://doi.org/10.1080/08989621.2025.2457100](https://www.tandfonline.com/doi/full/10.1080/08989621.2025.2457100) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/02/25/aoi-open-minds-tied-hands-awareness-behavior-and-reasoning-on-open-science-and-irresponsible-research-behavior-by-wiradhany-et-al-2025/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/02/25/aoi-open-minds-tied-hands-awareness-behavior-and-reasoning-on-open-science-and-irresponsible-research-behavior-by-wiradhany-et-al-2025/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-predicting-the-replicability-of-social-and-behavioural-science-claims-in-covid-19-preprints-by-m.md b/content/replication-hub/blog/aoi-predicting-the-replicability-of-social-and-behavioural-science-claims-in-covid-19-preprints-by-m.md new file mode 100644 index 00000000000..b857f732a6e --- /dev/null +++ b/content/replication-hub/blog/aoi-predicting-the-replicability-of-social-and-behavioural-science-claims-in-covid-19-preprints-by-m.md @@ -0,0 +1,35 @@ +--- +title: "AoI*: “Predicting the replicability of social and behavioural science claims in COVID-19 preprints” by Marcoci et al. 
(2024)" +date: 2025-01-06 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Covid-19" + - "Experts' vs beginners' predictions" + - "Fast science" + - "Forecasting replicability" + - "Nature Human Behaviour" + - "Prediction Markets" + - "RepliCATS" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from [the article](https://www.nature.com/articles/s41562-024-01961-1))** + +“We elicited judgements from participants on 100 claims from preprints about an emerging area of research (COVID-19 pandemic) using an interactive structured elicitation protocol, and we conducted 29 new high-powered replications. After interacting with their peers, participant groups with lower task expertise (‘beginners’) updated their estimates and confidence in their judgements significantly more than groups with greater task expertise (‘experienced’). For experienced individuals, the average accuracy was 0.57 (95% CI: [0.53, 0.61]) after interaction, and they correctly classified 61% of claims; beginners’ average accuracy was 0.58 (95% CI: [0.54, 0.62]), correctly classifying 69% of claims. The difference in accuracy between groups was not statistically significant and their judgements on the full set of claims were correlated (r(98) = 0.48, P < 0.001). These results suggest that both beginners and more-experienced participants using a structured process have some ability to make better-than-chance predictions about the reliability of ‘fast science’ under conditions of high uncertainty. However, given the importance of such assessments for making evidence-based critical decisions in a crisis, more research is required to understand who the right experts in forecasting replicability are and how their judgements ought to be elicited.” + +**REFERENCE** + +[Marcoci, A., Wilkinson, D. P., Vercammen, A., Wintle, B. C., Abatayo, A. L., Baskin, E., … & van der Linden, S. (2024). Predicting the replicability of social and behavioural science claims in COVID-19 preprints. *Nature Human Behaviour*, 1-18.](https://www.nature.com/articles/s41562-024-01961-1) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/01/06/aoi-predicting-the-replicability-of-social-and-behavioural-science-claims-in-covid-19-preprints-by-marcoci-et-al-2024/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/01/06/aoi-predicting-the-replicability-of-social-and-behavioural-science-claims-in-covid-19-preprints-by-marcoci-et-al-2024/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-promoting-reproducibility-and-replicability-in-political-science-by-brodeur-et-al-2024.md b/content/replication-hub/blog/aoi-promoting-reproducibility-and-replicability-in-political-science-by-brodeur-et-al-2024.md new file mode 100644 index 00000000000..7cacfc079b5 --- /dev/null +++ b/content/replication-hub/blog/aoi-promoting-reproducibility-and-replicability-in-political-science-by-brodeur-et-al-2024.md @@ -0,0 +1,32 @@ +--- +title: "AoI*: “Promoting Reproducibility and Replicability in Political Science” by Brodeur et al. 
(2024)" +date: 2024-01-24 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Journal policies" + - "political science" + - "Replicability" + - "Reproducibility" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from [*the article*](https://www.econstor.eu/bitstream/10419/281135/1/I4R-DP100.pdf))** + +“This article reviews and summarizes current reproduction and replication practices in political science. We first provide definitions for reproducibility and replicability. We then review data availability policies for 28 leading political science journals and present the results from a survey of editors about their willingness to publish comments and replications. We discuss new initiatives that seek to promote and generate high-quality reproductions and replications. Finally, we make the case for standards and practices that may help increase data availability, reproducibility, and replicability in political science.” + +**REFERENCES:** + +[Brodeur, Abel et al. (2024). Promoting Reproducibility and Replicability in Political Science, I4R Discussion Paper Series, No. 100, Institute for Replication (I4R)](https://www.econstor.eu/bitstream/10419/281135/1/I4R-DP100.pdf) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/01/24/aoi-promoting-reproducibility-and-replicability-in-political-science-by-brodeur-et-al-2024/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/01/24/aoi-promoting-reproducibility-and-replicability-in-political-science-by-brodeur-et-al-2024/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-prosocial-motives-underlie-scientific-censorship-by-scientists-a-perspective-and-research-agenda.md b/content/replication-hub/blog/aoi-prosocial-motives-underlie-scientific-censorship-by-scientists-a-perspective-and-research-agenda.md new file mode 100644 index 00000000000..f6650475fa4 --- /dev/null +++ b/content/replication-hub/blog/aoi-prosocial-motives-underlie-scientific-censorship-by-scientists-a-perspective-and-research-agenda.md @@ -0,0 +1,31 @@ +--- +title: "AoI*: “Prosocial motives underlie scientific censorship by scientists: A perspective and research agenda” by Clark et al. (2023)" +date: 2023-11-28 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Peer scholars" + - "Prosocial concerns" + - "Science" + - "Scientific censorship" + - "Self-protection" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from the article)** + +“Science is among humanity’s greatest achievements, yet scientific censorship is rarely studied empirically. We explore the social, psychological, and institutional causes and consequences of scientific censorship (defined as actions aimed at obstructing particular scientific ideas from reaching an audience for reasons other than low scientific quality). Popular narratives suggest that scientific censorship is driven by authoritarian officials with dark motives, such as dogmatism and intolerance.
Our analysis suggests that scientific censorship is often driven by scientists, who are primarily motivated by self-protection, benevolence toward peer scholars, and prosocial concerns for the well-being of human social groups.” + +Reference: [Clark, C. J., Jussim, L., Frey, K., Stevens, S. T., Al-Gharbi, M., Aquino, K., … & von Hippel, W. (2023). Prosocial motives underlie scientific censorship by scientists: A perspective and research agenda. Proceedings of the National Academy of Sciences, 120(48), e2301642120.](https://www.pnas.org/doi/10.1073/pnas.2301642120) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2023/11/28/aoi-prosocial-motives-underlie-scientific-censorship-by-scientists-a-perspective-and-research-agenda-by-clark-et-al-2023/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2023/11/28/aoi-prosocial-motives-underlie-scientific-censorship-by-scientists-a-perspective-and-research-agenda-by-clark-et-al-2023/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-reproduction-and-replication-at-scale-by-brodeur-et-al-2024.md b/content/replication-hub/blog/aoi-reproduction-and-replication-at-scale-by-brodeur-et-al-2024.md new file mode 100644 index 00000000000..3afcb8982a8 --- /dev/null +++ b/content/replication-hub/blog/aoi-reproduction-and-replication-at-scale-by-brodeur-et-al-2024.md @@ -0,0 +1,37 @@ +--- +title: "AoI*: “Reproduction and Replication at Scale” by Brodeur et al. (2024)" +date: 2024-01-27 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Nature Human Behavior" + - "replication" + - "Replication Games" + - "Replication Initiative" + - "Reproduction" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report excerpts of recent research related to replication and research integrity.]* + +**EXCERPTS (taken from the [*article*](https://www.nature.com/articles/s41562-023-01807-2))** + +“We are thrilled to announce that we are broadening our focus to new disciplines through a collaboration with Nature Human Behaviour. As part of this collaboration, we will be reproducing and replicating as many studies as possible of those that are published in Nature Human Behaviour (from 2023 and going forward), including in the fields of anthropology, epidemiology, economics, management, politics and psychology.” + +“Furthermore, we will organize multiple replication games dedicated to reproducing and replicating articles in Nature Human Behaviour. All replication enthusiasts are invited to participate and will be granted co-authorship to a meta-paper that combines all reproductions and replications. The plan is for this meta-paper to then be considered for publication as a research article in Nature Human Behaviour (subject to peer review).” + +“We are enthusiastic about this collaboration with Nature Human Behaviour and are actively looking for replicators. Please contact us by email at [instituteforreplication@gmail.com](mailto:instituteforreplication@gmail.com) if you would like to join our initiative!” + +**REFERENCES:** + +[Brodeur, A., Dreber, A., Hoces de la Guardia, F. et al. Reproduction and replication at scale. 
Nat Hum Behav 8, 2–3 (2024)](https://www.nature.com/articles/s41562-023-01807-2) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/01/27/aoi-reproduction-and-replication-at-scale-by-brodeur-et-al-2024/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/01/27/aoi-reproduction-and-replication-at-scale-by-brodeur-et-al-2024/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-same-data-different-analysts-variation-in-effect-sizes-due-to-analytical-decisions-in-ecology-an.md b/content/replication-hub/blog/aoi-same-data-different-analysts-variation-in-effect-sizes-due-to-analytical-decisions-in-ecology-an.md new file mode 100644 index 00000000000..57d0eb1e585 --- /dev/null +++ b/content/replication-hub/blog/aoi-same-data-different-analysts-variation-in-effect-sizes-due-to-analytical-decisions-in-ecology-an.md @@ -0,0 +1,44 @@ +--- +title: "AoI*: “Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology” by Gould et al. (2025)" +date: 2025-03-08 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Ecology and evolutionary biology" + - "Effect size variation" + - "Many-analyst study" + - "Meta-analysis" + - "replication crisis" + - "Reproducibility" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from *[the article](https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-024-02101-x)*)** + +“We [implemented] a large-scale empirical exploration of the variation in effect sizes and model predictions generated by the analytical decisions of different researchers in ecology and evolutionary biology.” + +“We used two unpublished datasets, one from evolutionary ecology (blue tit, *Cyanistes caeruleus*, to compare sibling number and nestling growth) and one from conservation ecology (*Eucalyptus*, to compare grass cover and tree seedling recruitment). The project leaders recruited 174 analyst teams, comprising 246 analysts, to investigate the answers to prespecified research questions.” + +“We found substantial heterogeneity among results for both datasets, although the patterns of variation differed between them. For the blue tit analyses, the average effect was convincingly negative, with less growth for nestlings living with more siblings, but there was near continuous variation in effect size from large negative effects to effects near zero, and even effects crossing the traditional threshold of statistical significance in the opposite direction.” + +“In contrast, the average relationship between grass cover and *Eucalyptus* seedling number was only slightly negative and not convincingly different from zero, and most effects ranged from weakly negative to weakly positive, with about a third of effects crossing the traditional threshold of significance in one direction or the other. 
However, there were also several striking outliers in the *Eucalyptus* dataset, with effects far from zero.” + +“…analyses with results that were far from the mean were no more or less likely to have dissimilar variable sets, use random effects in their models, or receive poor peer reviews than those analyses that found results that were close to the mean.” + +“The existence of substantial variability among analysis outcomes raises important questions about how ecologists and evolutionary biologists should interpret published results, and how they should conduct analyses in the future.” + +**REFERENCE** + +[Gould, E., Fraser, H.S., Parker, T.H. *et al.* Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology. *BMC Biol* **23**, 35 (2025). https://doi.org/10.1186/s12915-024-02101-x](https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-024-02101-x) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/03/08/aoi-same-data-different-analysts-variation-in-effect-sizes-due-to-analytical-decisions-in-ecology-and-evolutionary-biology-by-gould-et-al-2025/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/03/08/aoi-same-data-different-analysts-variation-in-effect-sizes-due-to-analytical-decisions-in-ecology-and-evolutionary-biology-by-gould-et-al-2025/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-the-robustness-reproducibility-of-the-american-economic-review-by-campbell-et-al-2024.md b/content/replication-hub/blog/aoi-the-robustness-reproducibility-of-the-american-economic-review-by-campbell-et-al-2024.md new file mode 100644 index 00000000000..409f283254d --- /dev/null +++ b/content/replication-hub/blog/aoi-the-robustness-reproducibility-of-the-american-economic-review-by-campbell-et-al-2024.md @@ -0,0 +1,35 @@ +--- +title: "AoI*: “The Robustness Reproducibility of the American Economic Review” by Campbell et al. (2024)" +date: 2024-05-27 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "American Economic Review" + - "economics" + - "Non-expeirmental papers" + - "p-hacking" + - "publication bias" + - "robustness" + - "The Institute for Replication (I4R)" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from [the article](https://www.econstor.eu/bitstream/10419/295222/1/I4R-DP124.pdf))** + +“We estimate the robustness reproducibility of key results from 17 non-experimental AER papers published in 2013 (8 papers) and 2022/23 (9 papers). We find that many of the results are not robust, with no improvement over time. The fraction of significant robustness tests (p<0.05) varies between 17% and 88% across the papers with a mean of 46%. The mean relative t/z-value of the robustness tests varies between 35% and 87% with a mean of 63%, suggesting selective reporting of analytical specifications that exaggerate statistical significance. A sample of economists (n=359) overestimates robustness reproducibility, but predictions are correlated with observed reproducibility.” + +**REFERENCE** + +[Campbell, Douglas et al. (2024) : The Robustness Reproducibility of the American Economic Review, I4R Discussion Paper Series, No. 
124, Institute for Replication (I4R), s.l](https://www.econstor.eu/bitstream/10419/295222/1/I4R-DP124.pdf). + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/05/27/aoi-the-robustness-reproducibility-of-the-american-economic-review-by-campbell-et-al-2024/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/05/27/aoi-the-robustness-reproducibility-of-the-american-economic-review-by-campbell-et-al-2024/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-the-significance-of-data-sharing-policy-by-azkarov-et-al-2023.md b/content/replication-hub/blog/aoi-the-significance-of-data-sharing-policy-by-azkarov-et-al-2023.md new file mode 100644 index 00000000000..b0c80a674bf --- /dev/null +++ b/content/replication-hub/blog/aoi-the-significance-of-data-sharing-policy-by-azkarov-et-al-2023.md @@ -0,0 +1,32 @@ +--- +title: "AoI*: “The Significance of Data-Sharing Policy” by Azkarov et al. (2023)" +date: 2023-11-17 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Data-sharing" + - "economics" + - "Excess statistical significance (ESS)" + - "Journal policies" + - "Publicaton bias" + - "Statistical significancce" +draft: false +type: blog +--- + +*[\*AoI = “**Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.*] + +**ABSTRACT (taken from the article)** + +“We assess the impact of mandating data-sharing in economics journals on two dimensions of research credibility: statistical significance and excess statistical significance (ESS). ESS is a necessary condition for publication selection bias. Quasi-experimental difference-in-differences analysis of 20,121 estimates published in 24 general interest and leading field journals shows that data-sharing policies have reduced reported statistical significance and the associated *t*-values. The magnitude of this reduction is large and of practical significance. We also find suggestive evidence that mandatory data-sharing reduces ESS and hence decreases publication bias.” + +Reference: [Askarov, Z., Doucouliagos, A., Doucouliagos, H., & Stanley, T. D. (2023). The significance of data-sharing policy. *Journal of the European Economic Association*, *21*(3), 1191-1226](https://academic.oup.com/jeea/article/21/3/1191/6706852) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2023/11/17/aoi-the-significance-of-data-sharing-policy-by-azkarov-et-al-2023/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2023/11/17/aoi-the-significance-of-data-sharing-policy-by-azkarov-et-al-2023/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/aoi-the-sources-of-researcher-variation-in-economics-by-huntington-klein-et-al-2025.md b/content/replication-hub/blog/aoi-the-sources-of-researcher-variation-in-economics-by-huntington-klein-et-al-2025.md new file mode 100644 index 00000000000..4e5fd45f3c4 --- /dev/null +++ b/content/replication-hub/blog/aoi-the-sources-of-researcher-variation-in-economics-by-huntington-klein-et-al-2025.md @@ -0,0 +1,44 @@ +--- +title: "AoI*: “The Sources of Researcher Variation in Economics” by Huntington-Klein et al. 
(2025)" +date: 2025-03-14 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Causal Inference" + - "Data Cleaning" + - "Many-Analysts Approach" + - "Research design" + - "Researcher degrees of freedom" + - "Researcher Variation" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from *[the article](https://www.econstor.eu/bitstream/10419/312260/1/I4R-DP209.pdf)*)** + +“We use a rigorous three-stage many-analysts design to assess how different researcher decisions—specifically data cleaning, research design, and the interpretation of a policy question—affect the variation in estimated treatment effects.” + +“A total of 146 research teams each completed the same causal inference task three times each: first with few constraints, then using a shared research design, and finally with pre-cleaned data in addition to a specified design.” + +“We find that even when analyzing the same data, teams reach different conclusions. In the first stage, the interquartile range (IQR) of the reported policy effect was 3.1 percentage points, with substantial outliers.” + +“Surprisingly, the second stage, which restricted research design choices, exhibited slightly higher IQR (4.0 percentage points), largely attributable to imperfect adherence to the prescribed protocol. By contrast, the final stage, featuring standardized data cleaning, narrowed variation in estimated effects, achieving an IQR of 2.4 percentage points.” + +“Reported sample sizes also displayed significant convergence under more restrictive conditions, with the IQR dropping from 295,187 in the first stage to 29,144 in the second, and effectively zero by the third.” + +“Our findings underscore the critical importance of data cleaning in shaping applied microeconomic results and highlight avenues for future replication efforts.” + +**REFERENCE** + +[Huntington-Klein, Nick et al. (2025) : The Sources of Researcher Variation inEconomics, I4R Discussion Paper Series, No. 209, Institute for Replication (I4R), s.l.](https://www.econstor.eu/bitstream/10419/312260/1/I4R-DP209.pdf) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/03/14/aoi-the-sources-of-researcher-variation-in-economics-by-huntington-klein-et-al-2025/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/03/14/aoi-the-sources-of-researcher-variation-in-economics-by-huntington-klein-et-al-2025/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/aoi-what-is-the-false-discovery-rate-in-empirical-research-by-engsted-2024.md b/content/replication-hub/blog/aoi-what-is-the-false-discovery-rate-in-empirical-research-by-engsted-2024.md new file mode 100644 index 00000000000..3bbaa35ebbd --- /dev/null +++ b/content/replication-hub/blog/aoi-what-is-the-false-discovery-rate-in-empirical-research-by-engsted-2024.md @@ -0,0 +1,37 @@ +--- +title: "AoI*: “What Is the False Discovery Rate in Empirical Research?” by Engsted (2024)" +date: 2024-04-04 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Bayesian analysis" + - "Econ Journal Watch" + - "economics" + - "False Discovery Rate" + - "Hypothesis testing" + - "null hypothesis significance testing" + - "p-value" + - "Prior probabilities" + - "replication crisis" +draft: false +type: blog +--- + +*[\*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.]* + +**ABSTRACT (taken from*****[the article](https://econjwatch.org/articles/what-is-the-false-discovery-rate-in-empirical-research)***) + +“A scientific discovery in empirical research, e.g., establishing a causal relationship between two variables, is typically based on rejecting a statistical null hypothesis of no relationship. What is the probability that such a rejection is a mistake? This probability is not controlled by the significance level of the test which is typically set at 5 percent. Statistically, the ‘False Discovery Rate’ (FDR) is the fraction of null rejections that are false. FDR depends on the significance level, the power of the test, and the prior probability that the null is true. All else equal, the higher the prior probability, the higher is the FDR. Economists have different views on how to assess this prior probability. I argue that for both statistical and economic reasons, the prior probability of the null should in general be quite high and, thus, the FDR in empirical economics is high, i.e., substantially higher than 5 percent. This may be a contributing factor behind the replication crisis that also haunts economics. Finally, I discuss conventional and newly proposed stricter significance thresholds and, more generally, the problems that passively observed economic and social science data pose for the traditional statistical testing paradigm.” + +**REFERENCE** + +[Engsted, T. (2024). What Is the False Discovery Rate in Empirical Research?. *Econ Journal Watch*, *21*(1), 92-112](https://econjwatch.org/articles/what-is-the-false-discovery-rate-in-empirical-research). + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/04/04/aoi-what-is-the-false-discovery-rate-in-empirical-research-by-engsted-2024/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/04/04/aoi-what-is-the-false-discovery-rate-in-empirical-research-by-engsted-2024/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/arjen-van-witteloostuijn-a-manifesto-and-a-petition.md b/content/replication-hub/blog/arjen-van-witteloostuijn-a-manifesto-and-a-petition.md new file mode 100644 index 00000000000..03af6502a02 --- /dev/null +++ b/content/replication-hub/blog/arjen-van-witteloostuijn-a-manifesto-and-a-petition.md @@ -0,0 +1,60 @@ +--- +title: "ARJEN VAN WITTELOOSTUIJN: A Manifesto, and a Petition" +date: 2015-10-09 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Economics Journals" + - "EDAWAX" + - "Journal Data Policies" + - "Vlaeminck" +draft: false +type: blog +--- + +###### Science is a community of human beings of the *homo sapiens* species: bipedals with the capacity to be self-reflexive. This implies that science as a community is subject to all the same behavioral patterns that all human communities are, including a plethora of biases at both the individual and collective level. + +###### Examples of well-known individual-level biases are hubris, confirmatory preference, and desire for novelty (or the reverse: fear of the new). This implies, for instance, that “When an experiment is not blinded, the chances are that the experimenters will see what they ‘should’ see” (*The Economist*, 2013). Together, these biases lead to Type I and Type II errors in judging research, both our own and that of others. As a result, without correcting mechanisms, published research will be heavily biased in favor of evidence that is in line with the theory. + +###### Science’s first line of defense is the micro-level reviewing process. Regrettably, the reviewing process, double-blinded or not, is anything but flawless, but rather full of biases itself.  This is not surprising, as the reviewing process is carried out by exemplars of the very same *homo sapiens* species that cannot escape from all these biases referred to above (plus quite a few others). + +###### Ample evidence abounds that current reviewing practices fail to provide the effective filtering mechanism they are claimed to provide.  Take the revealing study of Callaham and McCulloch (2011). On the basis of a 14-year sample of 14,808 reviews by 1,499 reviewers rated by 84 editors, they conclude that the quality scores deteriorated steadily over time, with the rate of deterioration being positively correlated with reviewers’ experience. This is mirrored in the well-established finding that reviewers, on average, fail to detect fatal errors in manuscripts, which reinforces the publication of false positives (Callaham & Tercier, 2007; Schroter et al., 2008). + +###### Hence, given these unavoidable biases associated with the working of the human brain, the scientific community should adhere, as a collective, to a set of macro-level correcting principles as a second line of defense. Probably the most famous among these is Popper’s falsifiability principle. Key to Karl Popper’s (1959) incredibly influential philosophy of science is his argument that scientific progress evolves on the back of the falsification principle. + +###### We, as researchers, should try, time and again, to prove that we are wrong. If we find evidence that indeed our theory is incorrect, we can further work on developing new theory that does fit with the data. Hence, we should teach the younger generation of researchers that instead of being overly discouraged, they should be happy if they cannot confirm their hypotheses. 
+
###### This quest for falsification is critical because, in the words of Ioannidis (2012: 646), “Efficient and unbiased replication mechanisms are essential for maintaining high levels of scientific credibility.” The falsification principle requires a tradition of replication studies in combination with the publication of non-significant and counter-results, or so-called nulls and negatives, backed by systematic meta-analyses. + +###### Current publication practices are overwhelmingly anti-Popperian.  No one is really interested in replicating anything, and meta-analyses are few and far between. Indeed, only a tiny fraction of published studies involve a replication effort or meta-analysis. Moreover, journal authors, editors, reviewers and readers are not interested in seeing nulls and negatives in print. + +###### This replication defect and publication bias crisis implies that Popper’s critical falsification principle is actually thrown into the scientific community’s dustbin. We, as a collective, violate basic scientific principles by (a) mainly publishing positive findings (i.e., those that are in support of our hypotheses) and (b) rarely engaging in replication studies (being obsessed with novelty). Behind the façade of all these so-called new discoveries, false positives abound, as do questionable research practices. + +###### In my recently published Manifesto [“***What Happened to Popperian Falsification?”***](https://www.tilburguniversity.edu/upload/7e8059a1-2401-4f4e-b4a9-f431f3f79e81_Manifesto_Pro_Falsification.pdf), I argue what I believe is wrong, why that is so, and what we might do about this. This Manifesto is primarily directed at the worldwide Business and Management scholarly community. However, clearly, Business and Management is not the only discipline in crisis. + +###### If you share the concerns expressed in my Manifesto, I encourage you to signal your support. For that purpose, I opened a petition webpage at ***[change.org](https://www.change.org/p/the-scientific-community-change-the-way-we-conduct-report-and-publish-our-research)***.  This can be signed, and used to start exchanging ideas. + +###### To kick-start this dialogue, I provide a tentative suggestion regarding a new and dynamic way of conducting, reporting, reviewing and publishing research, for now referred to as Scientific Wikipedia. My hope is that by initiating this dialogue, a few of the measures suggested in the Manifesto will be implemented; and others – perhaps far more effective ones – will be added over time. + +###### + +###### Callaham, M. and C. McCulloch (2011). Longitudinal Trends in the Performance of Scientific Peer Reviewers, *Annals of Emergency Medicine*, 57: 141-148. + +###### Callaham, M. L. and J. Tercier (2007). The Relationship of Previous Training and Experience of Journal Peer Reviewers to Subsequent Review Quality, *PLoS Medicine*, 4: 0032-0040. + +###### Ioannidis, J. P. A. (2012). Why Science Is Not Necessarily Self-Correcting, *Perspectives on Psychological Science*, 7: 645-654. + +###### Popper, K. (1959). *The Logic of Scientific Discovery*. Oxford: Routledge. + +###### Schroter, S., N. Black, S. Evans, F. Godlee, L. Osorio, and R. Smith (2008). What Errors Do Peer Reviewers Detect, and Does Training Improve their Ability to Detect Them?, *Journal of the Royal Society of Medicine*, 101: 507-514. + +###### *The Economist* (2013). Trouble at the Lab, (accessed on July 30 2015). 
+ +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/10/09/arjen-van-witteloostuijn-a-manifesto-and-a-petition/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/10/09/arjen-van-witteloostuijn-a-manifesto-and-a-petition/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/b-d-carrats-et-al-lessons-from-replicating-an-rct.md b/content/replication-hub/blog/b-d-carrats-et-al-lessons-from-replicating-an-rct.md new file mode 100644 index 00000000000..c8f51879ec1 --- /dev/null +++ b/content/replication-hub/blog/b-d-carrats-et-al-lessons-from-replicating-an-rct.md @@ -0,0 +1,128 @@ +--- +title: "BÉDÉCARRATS et al.: Lessons from Replicating an RCT" +date: 2019-04-30 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Al Amana" + - "American Economic Journal: Applied Economics" + - "Development Economics" + - "IREE" + - "Microcredit" + - "Morocco" + - "Randomized controlled trials (RCTs)" + - "replication" + - "Trimming" + - "Verification tests" +draft: false +type: blog +--- + +###### In 2015, Crépon, Devoto, Duflo and Pariente (2015, henceforth CDDP), published the results of a randomized control trial (RCT) in a special issue of the *AEJ: Applied Economics*. CDDP evaluated the impact of a microcredit program conducted in Morocco with Al Amana, Morocco’s largest microcredit institution. Their total sample consisted of 5,551 households spread across 162 rural villages. They concluded that microcredit had substantial, significant impacts on self-employment assets, outputs, expenses and profits. + +###### We replicated their paper and identified a number of issues that challenge their conclusions. In this blog, we briefly summarize the results of our analysis and then offer ten lessons learned from this research effort. Greater detail about our replication can be found in [***our recently published paper in the International Journal for the Re-Views of Empirical Economics***](https://www.iree.eu/publications/publications-in-iree/estimating-microcredit-impact-with-low-take-up-contamination-and-inconsistent-data-a-replication-study-of-crepon-devoto-duflo-and-pariente-american-economic-journal-applied-economics-2015/)**.** + +###### **A Summary of Key Results from Our Replication** + +###### We found that CDDP’s results depend heavily on how one trims the data. CDDP used two different trimming criteria for their baseline and endline samples. We illustrate the fragility of their results by showing that they are not robust to small changes in the trimming thresholds at endline. Using a slightly looser criterion produces insignificant results for self-employment outputs (sales and home consumption) and profits. Applying a slightly stricter criterion generates significant positive impacts on expenses, significant negative impacts on investment, and insignificant impacts on profits. The latter results defy a coherent interpretation. + +###### We found substantial and significant imbalances in the baseline for a number of important variables, including on the outcome variables of this RCT. + +###### Perhaps relatedly, we estimated implausible “treatment effects” on some variables: For example, we found significant “treatment effects” for household head gender and spoken language. + +###### We documented numerous coding errors. The identified coding errors altered  approximately 80% of the observations. 
Correcting these substantially modifies the estimated average treatment effects. + +###### There were substantial inconsistencies between survey and administrative data. For example, the administrative data used by CDDP identified 435 households as clients, yet 241 of these said they had not borrowed from Al Amana. Another 152 households self-reported having a loan from Al Amana, but were not listed as borrowers in Al Amana’s records. + +###### We found sampling errors. For example, the sex and age composition for approximately 20% of the households interviewed at baseline and supposedly re-interviewed at endline differs to such an extent that it is implausible that the same units were re-interviewed in these cases. + +###### We show in our paper that correcting these data problems substantially affects CDDP’s results. + +###### In addition, we found that CDDP’s sample characteristics differed in important ways from population characteristics, raising questions about the representativeness of the sample, and hence, external validity. + +###### **Ten Lessons Learned** + +###### The following are ten lessons that we learned as a result of our replication, with a focus on development economics. + +###### 1) Peer review cannot be relied upon to prevent suspect data analyses from being published, even at top journals such as the *AEJ:AE*. While some of the data issues we document would be difficult to identify without a careful re-working of the data, others were more obvious and should have been spotted by reviewers. + +###### 2) Replication, and more specifically, verification tests (Clemens, 2017), should play a more prominent role in research. Sukhtankar (2017) systematically reviewed development economics articles published in ten top-ranking journals since 2000. Of 120 RCTs, he found 15 had been replicated. Only two of these had been subjected to verification tests, in which the original data are examined for data, coding, and/or sampling flaws. This suggests that development economists generally assume that the data, sampling and programming code that underlie published research are reliable. A corollary is that multiple replications/verification tests may be needed to uncover problems in a study. For example, CDDP has been subject to two previous replications involving verification testing (Dahal & Fiala 2018; Kingi et al. 2018). These missed the errors we identified in our replication. + +###### 3) The discipline should do more to encourage better data analysis, as separate from econometric methodology. Researchers rarely receive formal training in programming and data handling. However, generic recommendations exist (Wickham 2014; Peng 2011; Cooper et al. 2017). These should be better integrated into researcher training. + +###### 4) Empirical studies should publish the *raw* data used for their analyses whenever possible. Our replication was feasible because the authors and the journal shared the data and code used to produce the published results. Although the *AEJ:AE* data availability policy[[1]](#_ftn1) states that raw data should be made available, this is not always the case. Raw data were available for just three of the six RCTs in the *AEJ:AE* 7(1) special issue on microcredit (Crépon et al. 2015; Attanasio et al. 2015; Augsburg et al. 2015). A subset of pre-processed data was available for two other RCTs (Banerjee et al. 2015; Angelucci, Karlan, & Zinman 2015). While the Banerjee et al. article included a URL link to the raw data, the corresponding website no longer exists. 
+ +###### 5) Survey practices for RCTs should be improved. Data quality and sampling integrity are systematically analyzed for standard surveys (such as the Demographic and Health Surveys and Living Standards Measurement Surveys) and are reported in the survey reports’ appendices. Survey methods and practices used for RCTs should be aligned with the quality standards established for household surveys conducted by national statistical systems (Deaton 1997; United Nations Statistical Division 2005). This implies adopting sound unit definitions (household, economic activity, etc.), drawing on nationally tried-and-tested questionnaire models, working with professional statisticians with experience of quality surveys in the same country (ideally nationals), properly training and closely supervising survey interviewers and data entry clerks, and analyzing and reporting measurement and sampling errors. + +###### 6) RCT reviews should pay greater attention to imbalances at baseline. Many RCTs do not collect individual-level baseline surveys (4 in 6 did so in the *AEJ:AE* special issue on microcredit, Meager 2015), and some randomization proponents go so far as to recommend dropping baseline surveys to concentrate more on running larger endline surveys (Muralidharan 2017). RCTs need to include baseline surveys that offer the same statistical power as their endline surveys to ensure that results at endline are not due to sampling bias at baseline. + +###### 7) RCT reviews should also pay close attention to implausible impacts at endline in order to detect sampling errors, such as the household identification errors observed in CDDP. This can also reveal flaws in experiment integrity, such as co-intervention and data quality issues. + +###### 8) Best practice should be followed with respect to trimming. Deaton and Cartwright (A. Deaton & Cartwright 2016: 1) issued the following warning about trimming in RCTs, “*When there are outlying individual treatment effects, the estimate depends on whether the outliers are assigned to treatments or controls, causing massive reductions in the effective sample size. Trimming of outliers would fix the statistical problem, but only at the price of destroying the economic problem; for example, in healthcare, it is precisely the few outliers that make or break a programme.*” In general, setting fixed cut-offs for trimming lacks objectivity and is a source of bias, as it does not take into account the structure of the data distribution. Best practice for trimming experimental data consists of using a factor of standard deviation and, ideally, defining this factor based on sample size (Selst & Jolicoeur 1994). + +###### 9) RCTs should place their findings in the context of related, non-RCT studies. In their article, CDDP cite 17 references: nine RCTs, four on econometric methodology, three non-RCT empirical studies from India and one economic theory paper. No reference is made to other studies on Morocco, microcredit particularities or challenges encountered with this particular RCT. This is especially surprising since this RCT was the subject of debate in a number of published papers prior to CDDP, all seeking to constructively comment on and contextualize this Moroccan RCT (Bernard, Delarue, & Naudet 2012; Doligez et al. 2013; Morvant-Roux et al. 2014). These references help explain a number of the shortcomings that we identified in our replication. + +###### 10) RCTs are over-weighted in systematic reviews. Currently, RCTs dominate systematic reviews. 
The CDDP paper has already been cited 248 times and is considered a decisive contribution with respect to a long-standing debate on the subject (Ogden 2017). The substantial concerns we raise suggest that CDDP should not *a priori* be regarded as more reliable than the 154 non-experimental impact evaluations on microcredit that preceded it (Bédécarrats 2012; Duvendack et al. 2011). + +###### [[1]](#_ftnref1) [www.aeaweb.org/journals/policies/data-availability-policy](http://www.aeaweb.org/journals/policies/data-availability-policy) + +###### *Florent Bédécarrats works in the evaluation unit of the French Development Agency (AFD). Isabelle Guérin and François Roubaud are both senior research fellows of the French national Research Institute for Sustainable Development (IRD). Isabelle is a member of the Centre for Social Science Studies on the African, American and Asian Worlds and François is a member of the Joint Research Unit on Development, Institutions and Globalization (DIAL). Solène Morvant-Roux is Assistant Professor at the Institute of Demography and Socioeconomics at the University of Geneva. The opinions expressed are those of the authors and are not attributable to the AFD, the IRD or the University of Geneva. Correspondence can be directed to Florent Bédécarrats at [bedecarratsf@afd.fr](mailto:bedecarratsf@afd.fr)* + +###### **References** + +###### Angelucci, Manuela, Dean Karlan, Jonathan Zinman. 2015. « Microcredit impacts: Evidence from a randomized microcredit program placement experiment by Compartamos Banco ». *American Economic Journal: Applied Economics* 7 (1): 151-82 [***[available online](https://www.povertyactionlab.org/sites/default/files/publications/182_61%20Angelucci%20et%20al%20Mexico%20Jan2015.pdf)***]. + +###### Attanasio, Orazio, Britta Augsburg, Ralph De Haas, Emla Fitzsimons, Heike Harmgart. 2015. « The impacts of microfinance: Evidence from joint-liability lending in Mongolia ». *American Economic Journal: Applied Economics* 7 (1): 90-122 [***[available online](https://www.povertyactionlab.org/sites/default/files/publications/487%20Attanasio%20et%20al%20Mongolia%20Jan2015.pdf)***]. + +###### Augsburg, Britta, Ralph De Haas, Heike Harmgart, Costas Meghir. 2015. « The impacts of microcredit: Evidence from Bosnia and Herzegovina ». *American Economic Journal: Applied Economics* 7 (1): 183-203 [***[available online](https://www.ebrd.com/documents/oce/the-impacts-of-microcredit-evidence-from-bosnia-and-herzegovina.pdf)***]. + +###### Banerjee, Abhijit, Esther Duflo, Rachel Glennerster, Cynthia Kinnan. 2015. « The miracle of microfinance? Evidence from a randomized evaluation ». *American Economic Journal: Applied Economics* 7 (1): 22-53 [***[available online](https://economics.mit.edu/files/5993)***]. + +###### Bédécarrats, Florent. 2012. « L’impact de la microfinance : un enjeu politique au prisme de ses controverses scientifiques ». *Mondes en développement* 158: 127‑42 [***[available online](https://doi.org/10.3917/med.158.0127)***]. + +###### Bédécarrats, Florent, Isabelle Guérin, Solène Morvant-Roux, François Roubaud. 2019. « Estimating microcredit impact with low take-up, contamination and inconsistent data. A replication study of Crépon, Devoto, Duflo, and Pariente (American Economic Journal: Applied Economics, 2015) ». *International Journal for Re-Views in Empirical Economics* 3 (2019‑3) [***[available online](https://doi.org/10.18718/81781.12)***]. + +###### Bernard, Tanguy, Jocelyne Delarue, Jean-David Naudet. 2012. 
« Impact evaluations: a tool for accountability? Lessons from experience at Agence Française de Développement ». *Journal of Development Effectiveness* 4 (2): 314-327 [***[available online](https://doi.org/10.1080/19439342.2012.686047)***]. + +###### Clemens, Michael 2017. « The meaning of failed replications: A review and proposal ». *Journal of Economic Surveys* 31 (1): 326-342  [***[available online](http://ftp.iza.org/dp9000.pdf)***]. + +###### Cooper, Natalie, Pen-Yuan Hsing, Mike Croucher, Laura Graham, Tamora James, Anna Krystalli, Francois Michonneau. 2017. « A guide to reproducible code in ecology and evolution ». *British Ecological Society* [***[available online](https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf)***]. + +###### Crépon, Bruno, Florencia Devoto, Esther Duflo, William Parienté. 2015. « Estimating the impact of microcredit on those who take it up: Evidence from a randomized experiment in Morocco ». *American Economic Journal: Applied Economics* 7 (1): 123-50 [***[available online](https://economics.mit.edu/files/6659)***]. + +###### Dahal, Mahesh, Nathan Fiala. 2018. « What do we know about the impact of microfinance? The problems of power and precision ». Ruhr Economic Papers [***[available online](http://dx.doi.org/10.4419/86788880)***]. + +###### Deaton, Angus, Nancy Cartwright. 2016. « The limitations of randomized controlled trials ». *VOX: CEPR’s Policy Portal* (blog). 9 novembre 2016 [***[available online](https://voxeu.org/article/limitations-randomised-controlled-trials)***]. + +###### Deaton, Angus. 1997. *The Analysis of Household Surveys: A Microeconometric Approach to Development Policy*. Baltimore, MD: World Bank Publications [***[available online](http://documents.worldbank.org/curated/en/593871468777303124/The-Analysis-of-Household-Surveys-A-Microeconometric-Approach-to-Development-Policy)***]. + +###### Doligez, François, Florent Bédécarrats, Emmanuelle Bouquet, Cécile Lapenu, Betty Wampfler. 2013. « Évaluer l’impact de la microfinance : Sortir de la “double impasse” ». *Revue Tiers Monde*, no 213: 161‑78 [***[available online](https://doi.org/10.3917/rtm.213.0161)***]. + +###### Duvendack, Maren, Richard Palmer-Jones, James Copestake, Lee Hooper, Yoon Loke, Nitya Rao. 2011. *What is the Evidence of the Impact of Microfinance on the Well-Being of Poor People?* Londres: EPPI-University of London [***[available online](https://eppi.ioe.ac.uk/cms/Portals/0/PDF%20reviews%20and%20summaries/Microfinance%202011Duvendack%20report.pdf?ver=2011-10-28-162132-813)***]. + +###### Kingi, Hautahi, Flavio Stanchi, Lars Vilhuber, Sylverie Herbert. 2018. « The Reproducibility of Economics Research:  A Case Study ». presented at  Berkeley Initiative for Transparency in the Social Sciences Annual Meeting, Berkeley [***[available online](https://hdl.handle.net/1813/60838)***]. + +###### Meager, Rachael. 2015. « Understanding the Impact of Microcredit Expansions: A Bayesian Hierarchical Analysis of 7 Randomised Experiments ». *arXiv:1506.06669*[***[available online](https://arxiv.org/abs/1506.06669)***]. + +###### Morvant-Roux, Solène, Isabelle Guérin, Marc Roesch, Jean-Yves Moisseron. 2014. « Adding Value to Randomization with Qualitative Analysis: The Case of Microcredit in Rural Morocco ». *World Development* 56 (avril): 302‑12 [***[available online](http://hal.ird.fr/ird-01471911/document)***]. + +###### Muralidharan, Karthik. 2017. « Field Experiments in Education in Developing Countries ». 
In *Handbook of Economic Field Experiments*. Elsevier. + +###### Ogden, Timothy. 2017. *Experimental Conversations: Perspectives on Randomized Trials in Development Economics*. Cambridge, Massachusetts: The MIT Press. + +###### Peng, Roger. 2011. « Reproducible research in computational science ». *Science* 334 (6060): 1226-1227. + +###### Selst, Mark Van, Pierre Jolicoeur. 1994. « A solution to the effect of sample size on outlier elimination ». *The quarterly journal of experimental psychology* 47 (3): 631-650. + +###### United Nations Statistical Division. 2005. *Household Surveys in Developing and Transition Countries*. United Nations Publications [***[available online](https://unstats.un.org/UNSD/hhsurveys/)***]. + +###### Wickham, Hadley. 2014. « Tidy data ». *Journal of Statistical Software* 59 (10): 1-23 [***[available online](http://dx.doi.org/10.18637/jss.v059.i10)***]. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/04/30/bedecarrats-et-al-lessons-from-replicating-an-rct/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/04/30/bedecarrats-et-al-lessons-from-replicating-an-rct/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/b-d-mccullough-the-reason-so-few-replications-get-published-is.md b/content/replication-hub/blog/b-d-mccullough-the-reason-so-few-replications-get-published-is.md new file mode 100644 index 00000000000..9a016436d69 --- /dev/null +++ b/content/replication-hub/blog/b-d-mccullough-the-reason-so-few-replications-get-published-is.md @@ -0,0 +1,18 @@ +--- +title: "B.D.MCCULLOUGH: The Reason so Few Replications Get Published Is…." +date: 2015-07-30 +author: "The Replication Network" +draft: false +type: blog +--- + +###### When preparing to give a talk at a conference recently, I decided to update some information I had published a few years ago.  In McCullough (2009), I estimated that 16 economics journals had a mandatory data/code archive (archives that require only data do not support replication — see McCullough, McGeary and Harrison (2008)).  Vlaeminck (2013) counted 26 journals with a mandatory data/code archive. This is a non-trivial increase, since in 2004 only four economics journals had such a policy.  One might think that this increase bodes well for replicability in the economic science, but such is not the case.  It is all well and good to make data and code available for replications, but if there is no place for researchers to publish these replications, then all the mandatory data/code archives in the world will amount to only so much window dressing. The problem is that editors do not want to admit that they publish unreplicable research, nor do they want to be bothered ensuring that the research they publish is replicable.  The fact is that very few journals will publish replications and the top-ranked journals only publish an infinitesimal number of replications.  Consequently, any editor is largely immune to the embarrassment that would arise if several of the articles he published were found to be not replicable.  Hence, editors have no incentive either to ensure the replicability of the articles they publish or to publish replications of the articles they do publish.  If researchers can’t get their replication articles published in decent journals, they won’t write the articles in the first place.  
And this seems to be the present state of equilibrium, sub-optimal though it may be.  Worse, there seems to be a tacit collusion between the editors, in that one editor will not publish an article that exposes another editor as publishing unreplicable research. Prima facie evidence of this sad state of affairs is the fact that Liebowitz’s failed replication of the JPE paper by Oberholzer-Gee and Strumpf still hasn’t been published, not by the JPE and not by any other journal.  Anyone interested in replication should go to SSRN and read the papers by Liebowitz on this topic.  In “How Reliable is the Oberholzer-Gee and Strumpf Paper on File Sharing”, Liebowitz capably demonstrates fatal flaws in the data handling and analysis of the Oberholzer-Gee and Strumpf paper.  Actually, time is precious; just take my word for it so that you don’t have to read it: Liebowitz demolishes the Oberholzer-Gee/Strumpf paper.  In “Sequel to Liebowitz’s Comment on the Oberholzer-Gee and Strumpf Paper on File Sharing”, Liebowitz describes his efforts to get his paper published in the JPE.  This is the paper to be read. So Kafkaesque was Liebowitz’s ordeal that journalist Norbert Haring, writing in the German financial newspaper Handelsblatt (the German equivalent of the Wall Street Journal), said, “Steven Levitt, Editor of the Journal of Political Economy, uses a questionable tactic to block an undesired comment.   The subject of the criticised article was a hot topic.  On closer look, everything about the case was unusual.”  One might think that another journal with an interest in file sharing would publish Liebowitz’s paper…. No one can read these papers by Liebowitz and think that “truth will out” in the economics journals.  Yet there is cause for hope. Third party organizations dedicated to replication have emerged in the past few years, such as 3ie (International Initiative for Impact Evaluation) and BITSS (Berkeley Initiative for Transparency in the Social Sciences) and EDAWAX (European Data Watch).  These organizations support replication without a necessary prospect of publication.  If these organizations can demonstrate that top journals are publishing non-replicable research, then the top journals might be embarrassed into admitting that their efforts to ensure replicability are insufficient.  And then Liebowitz’s article might finally get published. **References**: N. Haring, Handelsblatt, 23.06.2008; B. D. McCullough, “Open Access Economics Journals and the Market for Reproducible Economic Research,” *Economic Analysis and Policy* **39**(1), 117-126, 2009; B. D. McCullough, Kerry Anne McGeary and Teresa D. Harrison, “Do Economics Journal Archives Promote Replicable Research?” *Canadian Journal of Economics* **41**(4), 1406-1420, 2008; Vlaeminck, Sven, 2013. “Data Management in Scholarly Journals and Possible Roles for Libraries – Some Insights from EDaWaX,” EconStor Open Access Articles, ZBW – German National Library of Economics, pages 49-79. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/07/30/b-d-mccullough-the-reason-so-few-replications-get-published-is/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/07/30/b-d-mccullough-the-reason-so-few-replications-get-published-is/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/benjamin-wood-and-annette-brown-what-3ie-is-doing-in-the-replication-business.md b/content/replication-hub/blog/benjamin-wood-and-annette-brown-what-3ie-is-doing-in-the-replication-business.md new file mode 100644 index 00000000000..63f9e539264 --- /dev/null +++ b/content/replication-hub/blog/benjamin-wood-and-annette-brown-what-3ie-is-doing-in-the-replication-business.md @@ -0,0 +1,44 @@ +--- +title: "BENJAMIN WOOD and ANNETTE BROWN: What 3ie Is Doing in the Replication Business" +date: 2015-10-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "3ie" + - "Impact Evaluations" + - "Replication Paper Series" + - "replications" +draft: false +type: blog +--- + +###### What’s the *International Initiative for Impact Evaluation* ([***3ie***)](http://www.3ieimpact.org/) doing in the replication business? 3ie is mostly known in the development community as a funder of impact evaluations and systematic reviews. But our leadership always envisioned a role for replication research within 3ie’s mandate to provide high quality evidence for policymaking. We designed 3ie’s replication programme to encourage internal replication studies of influential, innovative, or controversial development-related impact evaluations. Through two rounds of replication windows we’ve funded 10 replication studies to date, including the highly discussed ***[replication research](http://www.3ieimpact.org/en/evaluation/impact-evaluation-replication-programme/replication-worms-identifying-impacts-education-and-health/)*** around deworming treatments in Kenya (for a related news item in TRN, ***[click here](https://replicationnetwork.com/2015/07/25/headline-news-two-economists-make-their-data-available/)***). So here’s what 3ie is doing in the replication business. + +###### Our processes are designed to address common criticisms of replication research (with this blog borrowing from a forthcoming paper we’re writing about it). Selection of replication-eligible studies is fraught with insinuations of improper selection, with replication researchers supposedly only choosing to replicate studies they feel confident they can disprove. We address that criticism by creating different eligibility mechanisms, such as choosing studies based on a crowdsourced ***[Candidate Studies List](http://www.3ieimpact.org/funding/replication-window/replication-candidate-studies/)*** and gathering a committee of experts to judge the policy relevance of each replication study proposal. + +###### One of the biggest concerns regarding replication research is researcher incentives. If bias exists for original authors to discover a new result, it can also exist for replication researchers to disprove the established result. To address this concern, we require all replication researchers to post replication plans, which allow readers to know how the researchers intended to undertake their replication study before starting the research. Ideally all robustness tests conducted in the replication paper will be publicly pre-specified in these replication plans. + +###### In an attempt to further defuse replication tensions, we encourage engagement between the replication researchers and the original authors. We require 3ie-funded replication researchers to include a “pure replication,” and to share these results early in the replication process. 
In the pure replication, the researchers attempt to reproduce the published results using the same data and methods as in the publication. We then require these replication researchers to share their findings with the original authors before completing their study, giving the original authors the opportunity to reply to the direct reproduction of their work before any results are finalized. + +###### Original authors are understandably sensitive to replication researchers who (in their opinion) solely aim to discredit their work. 3ie’s replication process includes multiple rounds of internal and external refereeing, including reviews of replication plans, pure replications, and draft final replication reports (see our ***[peer reviewing replication research blog](http://blogs.3ieimpact.org/how-to-peer-review-replication-research/)*** for more detail). + +###### Finally, replication researchers are concerned that they will spend a significant amount of time conducting their study and then have no place to publish it. And original authors are worried that they won’t have an opportunity to directly reply to the replication study. While we cannot guarantee publications, we created 3ie’s ***[Replication Paper Series](http://www.3ieimpact.org/publications/3ie-replication-paper-series/)*** (RPS) to partially address both of these concerns. The RPS provides an outlet for the replicating researchers to publish their work, and for original authors to respond to it. We view the RPS as a repository of replication research, including confirmatory studies that might struggle to find space in a journal. + +###### If you’re interested in replication research, here are a few ways to get involved with 3ie’s replication programme: + +###### – We’re planning another replication window. Send us the titles of recently published, policy-relevant development impact evaluation papers that you think should be considered for future replication to [replication@3eimpact.org](mailto:replication@3eimpact.org). + +###### – Apply for a replication award when we open our next window. + +###### – Volunteer to serve as an external reviewer of replication research. + +###### – Submit your replication paper, even if it wasn’t funded by 3ie, to our RPS (here are the ***[instructions](http://www.3ieimpact.org/en/evaluation/impact-evaluation-replication-programme/3ie-replication-paper-series-submission/)***). + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/10/15/benjamin-wood-and-annette-brown-what-3ie-is-doing-in-the-replication-business/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/10/15/benjamin-wood-and-annette-brown-what-3ie-is-doing-in-the-replication-business/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/bitss-journal-of-development-economics-pilots-pre-results-review.md b/content/replication-hub/blog/bitss-journal-of-development-economics-pilots-pre-results-review.md new file mode 100644 index 00000000000..e01cb4433cd --- /dev/null +++ b/content/replication-hub/blog/bitss-journal-of-development-economics-pilots-pre-results-review.md @@ -0,0 +1,59 @@ +--- +title: "BITSS: Journal of Development Economics Pilots Pre-Results Review" +date: 2018-05-14 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "BITSS" + - "Journal of Development Economics" + - "Journal policies" + - "Pre-Results Review" + - "Registered Reports" +draft: false +type: blog +--- + +###### The *Journal of Development Economics* (***[JDE](https://www.journals.elsevier.com/journal-of-development-economics/)***) is piloting a new approach in which authors have the opportunity to submit empirical research designs for review and approval *before* the results of the study are known. While the JDE is the first journal in economics to implement this approach—referred to as “pre-results review”—it joins over 100 ***[other journals](https://cos.io/rr/)*** from across the sciences. + +###### **What is Pre-Results Review?** + +###### Pre-results review splits the peer review process into two stages (see Figure 1 below). In Stage 1, authors submit a plan for a prospective research project, typically including a literature review, research question(s), hypotheses, and a detailed methodological framework. This submission is evaluated based on the significance of the research question(s), the soundness of the theoretical reasoning, and the credibility and feasibility of the research design. + +###### Positively evaluated submissions are *accepted based on pre-results review*. This constitutes a commitment by the journal to publish the full paper, regardless of the nature of the empirical results. Authors will then collect and analyze their data, and submit the final paper for final review and publication (Stage 2). The final Stage 2 review provides quality assurance and ensures alignment with the research design peer reviewed in Stage 1. + +###### *[Figure 1: the two-stage pre-results review process]* + +###### **Why Pre-Results Review?** + +###### In development economics, we have long argued for the use of rigorous evidence to inform decisions about public policies. However, incentives in academia and journal publishing often reward studies featuring novel, theoretically tidy, or statistically significant results. Papers that fail to report such findings often go ***[unpublished](http://science.sciencemag.org/content/sci/345/6203/1502.full.pdf?sid=299345a4-f312-42c7-b5e1-4d843d4d0c30)***, even if the studies are of high quality and address important questions. As a result, we are left with an evidence base comprised of papers that tell ‘neat’ and clean stories, but may not accurately represent the world. When such research serves as the foundation for public policies, this publication bias can be costly. + +###### In recent years, pre-results review has emerged as a potential alternative model to address publication bias. We hope that this pilot will help us understand the effectiveness of this approach and its sustainability for both the JDE and other social science journals. 
+ +###### **What’s in It For You?** + +###### – Publication decision earlier in the peer review process; + +###### – Constructive feedback from peer reviewers earlier in the publishing process, with the potential for helpful suggestions for research design before beginning data collection; + +###### – Editorial decisions that are not influenced by the results of a study; + +###### – Inclusion of JDE “acceptance based on pre-results review” on author’s CVs; and + +###### – The chance to be part of an exciting pilot effort in economics! + +###### **How to Submit** + +###### Submissions should be filed as ‘Registered Reports’ on the JDE’s regular ***[submissions portal](https://www.evise.com/profile/api/navigate/DEVEC)***. + +###### All submissions in this format will follow existing JDE policies, including the ***[Mandatory Replication Policy](http://www.elsevier.com/__data/promis_misc/devec%20130805_ReplicationPolicy.docx)***. For guidelines specific to pre-results review, please see the ***[JDE Registered Reports Author Guidelines](https://www.bitss.org/wp-content/uploads/2018/03/JDE_RR_Author_Guidelines.pdf)***. + +###### **Need Help?** + +###### The Berkeley Initiative for Transparency in the Social Sciences (***[BITSS](https://www.bitss.org/publishing/rr-jde-about/)***) supports authors with pre-registering their research designs and preparing JDE submissions. Please contact Aleks Bogdanoski at *[abogdanoski@berkeley.edu](mailto:abogdanoski@berkeley.edu)* with any questio*ns.* + +###### *Established by the Center for Effective Global Action ([CEGA](http://cega.berkeley.edu/)) in 2012, the Berkeley Initiative for Transparency in the Social Sciences (BITSS) works to strengthen the integrity of social science research and evidence used for policy-making. The initiative aims to enhance the practices of economists, psychologists, political scientists, and other social scientists in ways that promote research transparency, reproducibility, and openness. Visit [www.bitss.org](http://www.bitss.org/) and @UCBITSS on Twitter to learn more, find useful tools and resources, and contribute to the discussion.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/05/14/bogdanoski-journal-of-development-economics-pilots-pre-results-review/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/05/14/bogdanoski-journal-of-development-economics-pilots-pre-results-review/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/blanco-perez-brodeur-progress-in-publishing-negative-results.md b/content/replication-hub/blog/blanco-perez-brodeur-progress-in-publishing-negative-results.md new file mode 100644 index 00000000000..d23ad85de68 --- /dev/null +++ b/content/replication-hub/blog/blanco-perez-brodeur-progress-in-publishing-negative-results.md @@ -0,0 +1,53 @@ +--- +title: "BLANCO-PEREZ & BRODEUR: Progress in Publishing Negative Results?" +date: 2018-01-24 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "health economics" + - "negative findings" + - "p-hacking" + - "Statistical significance" + - "z-statistics" +draft: false +type: blog +--- + +###### *[From the working paper, “Publication Bias and Editorial Statement on Negative Findings” by Cristina Blanco-Perez and Abel Brodeur]* + +###### Prior research points out that there is a selection bias in favor of positive results by editors and referees. 
In other words, research articles rejecting the null hypothesis (i.e., finding a statistically significant effect) are more likely to get published than papers not rejecting the null hypothesis. This issue may lead policymakers and the academic community to believe more in studies that find an effect than in studies not finding an effect. + +###### Fortunately, innovations in social sciences are under way to improve research transparency. For instance, many scientific journals now ask the authors to share their codes and data to facilitate replication. Registration and pre-analysis plans are also becoming more popular for randomized controlled trials and lab experiments. + +###### In this study, we test the impact of a simple, low-cost, new transparent practice that aims to reduce the extent of publication bias.  In February 2015, the editors of eight health economics journals published on their journals’ websites an [***Editorial** **Statement on Negative Findings***](http://ashecon.org/american-journal-of-health-economics/editorial-statement-on-negative-findings/). In this statement, the editors express that: “well-designed, well-executed empirical studies that address interesting and important problems in health economics, utilize appropriate data in a sound and creative manner, and deploy innovative conceptual and methodological approaches […] have potential scientific and publication merit regardless of whether such studies’ empirical findings do or do not reject null hypotheses that may be specified.” + +###### The editors point out in the statement that it: “should reduce the incentives to engage in two forms of behavior that we feel ought to be discouraged in the spirit of scientific advancement: + +###### – Authors withholding from submission such studies that are otherwise meritorious but whose main empirical findings are highly likely `negative’ (e.g., null hypotheses not rejected). + +###### – Authors engaging in `data mining,’ `specification searching,’ and other such empirical strategies with the goal of producing results that are ostensibly `positive’ (e.g., null hypotheses reported as rejected).” + +###### We collect z -statistics from two of the eight health economics journals that sent out the editorial statement and compare the distribution of tests before and after the editorial statement. We find that test statistics in papers submitted and published after the editors sent out the editorial statement are less likely to be statistically significant. The figure below illustrates our results. + +![brodeur](/replication-network-blog/brodeur.webp) + +###### About 56%, 49% and 41% of z -statistics, respectively, are statistically significant at the 10%, 5% and 1% levels after the editorial in comparison with 61%, 55% and 49% of z -statistics before the editorial statement. Of note, we document that the impact of the statement intensifies over the time period studied. + +###### As a robustness check, we look at whether there was a similar shift in the distribution of z -statistics at the time of the editorial statement for a non-health economics journal. On the contrary, we find that the distribution of z -statistics shifted to the right after the editorial statement for our control journal, possibly due to the increasing pressure to publish. 
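###### As a rough illustration of the kind of before/after comparison described above, the sketch below runs a standard two-sample test of proportions on the share of z-statistics significant at the 5% level (49% after vs. 55% before, as quoted above). The counts of collected test statistics are hypothetical placeholders; the actual counts come from the authors' data and are not reproduced here.

```python
from math import sqrt, erf

def two_proportion_ztest(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * P(Z > |z|)
    return z, p_value

# Shares of tests significant at the 5% level, before vs. after the editorial
# statement (from the post); the numbers of z-statistics are made up.
z, p = two_proportion_ztest(p1=0.55, n1=2000,   # before the statement (hypothetical n)
                            p2=0.49, n2=2000)   # after the statement (hypothetical n)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```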
+ +###### Overall, our results provide suggestive evidence that the decrease in the share of tests significant at conventional levels is due to both a change in editors’ preferences for negative findings and a change in authors and/or referees’ behavior. + +###### Our results have interesting implications for editors and the academic community. They suggest that incentives may be aligned to promote a more transparent research and that editors may reduce the extent of publication bias quite easily. + +###### To read more, ***[click](https://osf.io/preprints/bitss/xq9nt/?platform=hootsuite) [here](https://osf.io/preprints/bitss/xq9nt/?platform=hootsuite)***. + +###### *Cristina Blanco-Perez is a Visiting Professor at the Economics Department of Ottawa. She can be contacted at* [*cblancop@uottawa.ca*](mailto:cblancop@uottawa.ca)*. Abel Brodeur is Assistant Professor of Economics at the University of Ottawa. He can be contacted at* [*abrodeur@uottawa.ca*](mailto:abrodeur@uottawa.ca)*.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/01/24/blanco-perez-brodeur-progress-in-publishing-negative-results/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/01/24/blanco-perez-brodeur-progress-in-publishing-negative-results/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/bob-reed-on-andrew-gelman-retractions-and-the-supply-and-demand-for-data-transparency.md b/content/replication-hub/blog/bob-reed-on-andrew-gelman-retractions-and-the-supply-and-demand-for-data-transparency.md new file mode 100644 index 00000000000..89b4d9f7fb1 --- /dev/null +++ b/content/replication-hub/blog/bob-reed-on-andrew-gelman-retractions-and-the-supply-and-demand-for-data-transparency.md @@ -0,0 +1,51 @@ +--- +title: "BOB REED: On Andrew Gelman, Retractions, and the Supply and Demand for Data Transparency" +date: 2016-05-23 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Andrew Gelman" + - "data and code" + - "replication" + - "retraction" + - "Sharing data" +draft: false +type: blog +--- + +###### In a ***[recent interview on Retraction Watch](http://retractionwatch.com/2016/05/19/retractions-arent-enough-why-science-has-bigger-problems/)***, Andrew Gelman reveals that what keeps him up at night isn’t scientific fraud, it’s “the sheer number of unreliable studies — uncorrected, unretracted — that have littered the literature.”  He then goes on to argue that retractions cannot be the answer.  His argument is simple.  The scales don’t match.  “Millions of scientific papers are published each year.  If 1% are fatally flawed, that’s thousands of corrections to be made.  And that’s not gonna happen.” + +###### Actually, if 1% of studies are fatally flawed, the problem is probably manageable.  Assuming a typical journal publishes 10 articles an issue, 4 issues a year, that means one retraction every two and a half years, which is certainly feasible for a journal.  Problems arise only when the percent substantially rises.  Gelman goes on to say that he personally thinks the error rate to be large as 50% in some journals, where “half the papers claim evidence that they don’t really have.”  At that point retractions are not the solution. 
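###### The back-of-the-envelope arithmetic behind that comparison is easy to make explicit, using the stylized 10-articles-per-issue, 4-issues-per-year journal from the paragraph above:

```python
# Stylized journal from the paragraph above: 10 articles/issue, 4 issues/year.
articles_per_year = 10 * 4
flawed_share = 0.01                                       # 1% fatally flawed
expected_retractions = flawed_share * articles_per_year   # 0.4 per year
years_per_retraction = 1 / expected_retractions           # 2.5 years
print(expected_retractions, years_per_retraction)         # 0.4 2.5

# At the 50% error rate Gelman suspects for some journals, the scale changes entirely:
print(0.5 * articles_per_year)  # 20 retractions per year for a single journal
```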
+ +###### If revealed preference is any indication, hopes for a solution appear centered on “data transparency.”  Data transparency means different things to different people, but a common core is that researchers make their data and programming code publicly available. + +###### The ***[Center for Open Science](https://cos.io/)***, ***[Dataverse](http://dataverse.org/)***, and ***[EUDAT](https://eudat.eu/what-eudat)*** are but a few examples of  the high-profile explosion in efforts to make research data more “open,” transparent and shareable.  In a ***[recent guest blog at The Replication Network](https://replicationnetwork.com/2016/05/19/stephanie-wykstra-on-data-reuse/)*** (reblogged from BITSS), Stephanie Wykstra promotes the related topic of data re-use. + +###### In an encouraging sign, these efforts appear to have had an impact.  A ***[recent survey article](https://econjwatch.org/articles/replications-in-economics-a-progress-report)*** by Duvendack et al. report that, of 333 journals categorized as “economics journals” by Thompson Reuter’s *Journal Citation Reports*, 27, or a little more than 8 percent, regularly published data and code to accompany empirical research studies.  As some of these journals are exclusively theory journals, the effective rate is somewhat higher. + +###### Noteworthy is that many of these journals only recently instituted a policy of publishing data and code.  So while one can argue whether the glass is, say, 20 percent full or 80 percent empty, the fact is that the glass used to contain virtually nothing.  That is progress. + +###### But making data more “open” does not, by itself, address the problem of scientific unreliability.  Researchers have to be motivated to go through these data, examine them carefully, and determine if they are sufficient to support the claims of the original study.  Further, they need to have an avenue to publicize their findings in a way that informs the literature. + +###### This is what replications are supposed to do.  Replications provide a way to confirm/disconfirm the results of other studies.  They are scalable to fit the size of the problem.  With so many studies potentially unreliable, researchers would prioritize the most important findings that are worthy of further analysis.  The self-selection mechanism of researchers’ time and interests would insure that the most important, most influential studies are appropriately vetted. + +###### But after obtaining their results, researchers need a place to publicize their findings. + +###### Unfortunately, on this dimension, the Duvendack et al. study is less encouraging.  They report that only 3 percent of “economics” journals explicitly state that that they publish replications.  Most of these are specialty/field journals, so that an author of a replication study only has a very few outlets, maybe as few as one or two, in which they can hope to publish their research. + +###### And just because a journal states that it publishes replication studies, doesn’t mean that it does it very often. Duvendack et al. report that 6 journals account for 60 percent of all replication studies ever published in Web of Science “economics” journals.  Further, only 10 journals have ever published more than 3 replication studies.  In their entire history. + +###### Without an outlet to publish their findings, researchers will be unmotivated to spend substantial effort re-analysing other researchers’ data.  
Or to put it differently, the open science/data sharing movement only addresses the supply side of the scientific market.  Unless the demand side is addressed, these efforts are unlikely to be successful in providing a solution to the problem of scientific unreliability. + +###### The irony is this: The problem has been identified.  There is a solution.  The pieces are all there.  But in the end, the gatekeepers of scientific findings, the journals, need to open up space to allow science to be self-correcting.  Until that happens, there’s not much hope of Professor Gelman getting any more sleep. + +###### *Bob Reed is Professor of Economics at the University of Canterbury in New Zealand, and co-organizer of The Replication Network.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/05/23/bob-reed-on-andrew-gelman-retractions-and-the-supply-and-demand-for-data-transparency/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/05/23/bob-reed-on-andrew-gelman-retractions-and-the-supply-and-demand-for-data-transparency/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/bob-reed-replications-and-peer-review.md b/content/replication-hub/blog/bob-reed-replications-and-peer-review.md new file mode 100644 index 00000000000..1a44da31ab9 --- /dev/null +++ b/content/replication-hub/blog/bob-reed-replications-and-peer-review.md @@ -0,0 +1,32 @@ +--- +title: "BOB REED: Replications and Peer Review" +date: 2016-07-11 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Ivan Oransky" + - "peer review" + - "replication" + - "Retraction Watch" +draft: false +type: blog +--- + +###### “Weekend Reads”, the weekly summary by IVAN ORANSKY of ***[Retraction Watch](http://retractionwatch.com/)***, recently listed two articles on Peer Review.  One, a blog by George Borjas, concerns the recent imbroglio at the *American Economic Review* involving an editor who oversaw the review of an article by one of her coauthors (***[read here](https://gborjas.org/2016/06/30/a-rant-on-peer-review/)***).  The other, a comment in *Nature* entitled “Let’s make peer review scientific” (***[read here](https://gborjas.org/2016/06/30/a-rant-on-peer-review/)***) reviews 30 years of progress, and lack of progress, in peer reviewing.  Both articles underscore the obvious to anybody who has even minimal experience with the reviewing process — peer review is a flawed process. + +###### What does this have to do with replications?  The real problem with peer review is thinking that it is the final arbiter of a paper’s value.  Peer review is but one step in a lengthy process.  It follows the circulation of a working paper for comments and the presentation of one’s research at seminars and conferences.  But the publication of a paper should not be the final stage in a paper’s review. + +###### If a paper is important and makes a valuable contribution, that research should be examined further.  Were the data handled correctly?  Would alternative formulations of the research question have given similar results? Were the results robust to reasonable perturbations in experimental design?  These are things that are difficult for reviewers to address, because they generally do not have access to a researcher’s data and code. 
+ +###### Even when a journal requires data and code to be made available, rarely do reviewers have access to these when they are doing their review.  They are only available after a paper has been accepted for publication.  And it is only after researchers have been able to go through a paper’s data and code that they can judge for themselves whether the paper’s conclusions are fragile or robust. + +###### The problem with peer review is not so much a problem with peer review.  The problem with peer review is the scientific community’s elevation of peer review in the review process.  Peer review should be thought of as an intermediate stage in the review of a paper.  As one part of the gauntlet that a paper needs to run to establish its scientific worth.  Until it becomes the norm for authors to provide their data and code when submitting their research to journals, it will inevitably be the case that the real “review” will have to be done in the post publication phase of a paper’s life.  Through replication. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/07/11/bob-reed-replications-and-peer-review/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/07/11/bob-reed-replications-and-peer-review/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/bob-reed-replications-can-make-things-worse-really.md b/content/replication-hub/blog/bob-reed-replications-can-make-things-worse-really.md new file mode 100644 index 00000000000..186af4a346b --- /dev/null +++ b/content/replication-hub/blog/bob-reed-replications-can-make-things-worse-really.md @@ -0,0 +1,31 @@ +--- +title: "BOB REED: Replications Can Make Things Worse? Really?" +date: 2016-05-03 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Adam Marcu" + - "Economics E-Journal" + - "Ivan Oransky" + - "Public Finance Review" + - "replication policy" + - "Retraction Watch" + - "Slate" +draft: false +type: blog +--- + +###### In a recent article in *Slate* entitled “The Unintended Consequences of Trying to Replicate Research,” IVAN ORANSKY and ADAM MARCUS from *Retraction Watch* argue that replications can exacerbate research unreliability.  The argument assumes that publication bias is more likely to favour confirming replication studies over disconfirming studies. To read more, ***[click here](http://www.slate.com/articles/technology/future_tense/2016/04/the_unintended_consequences_of_trying_to_replicate_scientific_research.html)***. This is the same argument that Michele Nuijten makes in her guest blog for *TRN,*which you can ***[read here](https://replicationnetwork.com/2016/01/05/michele-nuijten-the-replication-paradox/)***. + +###### Whether this is a real concern depends on the replication policies at journals.  At least two economics journals have publication policies that explicitly state they are neutral towards the conclusion of replication studies.  In their “Call for Replication Studies”, Burmann et al. state: “*Public Finance Review* will publish all … kinds of replication studies, those that validate and those that invalidate previous research” (***[see here](http://www.sagepub.com/sites/default/files/upm-binaries/36845_Replication_Studies11PFR10_787_793.pdf)***).  And the journal *Economics: The Open-Access, Open-Assessment E-Journal*states: “The journal will publish both confirmations and disconfirmations of original studies. 
The only consideration will be quality of the replicating study” (***[see here](http://www.economics-ejournal.org/special-areas/replications-1)***). + +###### Further, in their recent study, “Replications in Economics: A Progress Report” (***[see here](http://econjwatch.org/articles/replications-in-economics-a-progress-report)***), Duvendack et al. find that most published replication studies in economics disconfirm the original research.  So while it is possible that replications could make things worse, perhaps this is more a worry in theory than in practice.  At least in economics. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/05/03/replications-can-make-things-worse/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/05/03/replications-can-make-things-worse/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/bob-reed-the-problem-with-open-data-would-requiring-co-authorship-help.md b/content/replication-hub/blog/bob-reed-the-problem-with-open-data-would-requiring-co-authorship-help.md new file mode 100644 index 00000000000..18c7f7149c8 --- /dev/null +++ b/content/replication-hub/blog/bob-reed-the-problem-with-open-data-would-requiring-co-authorship-help.md @@ -0,0 +1,32 @@ +--- +title: "BOB REED: The Problem With Open Data: Would Requiring Co-Authorship Help?" +date: 2016-06-28 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Ecology Bits" + - "Margaret Kosmala" + - "Open data" + - "property rights" +draft: false +type: blog +--- + +###### There has been a huge amount of attention focused on “open data.”  A casual reading of the blogosphere is that ***[Open Data](https://en.wikipedia.org/wiki/Open_data)*** is good, ***[Secret Data](https://replicationnetwork.com/2015/12/31/john-cochrane-secret-data/)*** is bad. + +###### Remarkably, there has been very little discussion given to the property right issues associated with open data.  The Open Data Movement wants to turn a private good (datasets) into a public good.  Economists know something about public goods.  They tend to get under-produced.  This introduces a trade-off between the propagation of data for use by multiple researchers, a social good (***[though see here for a discussion where this is not necessarily so](https://replicationnetwork.com/2016/05/30/youtube-video-of-conference-session-on-open-science/)***), versus the disincentive this causes for producing data, a social bad.  How best to make this trade-off is unclear. + +###### In a ***[recent blog](http://ecologybits.com/index.php/2016/06/15/open-data-authorship-and-the-early-career-scientist/)*** entitled “Open data, authorship, and the early career scientist”, ***[MARGARET KOSMALA](http://ecologybits.com/index.php/author/mkosmala/)***, a postdoctoral fellow at Harvard University, argues that making one’s data available to others hurts the data-producing scholar, particularly younger scholars.  The argument is not so much that the data-producing scholar will be scooped by other scientists on the associated research.  Rather, it is that subsequent research projects that could have resulted in publications for the data-producing scholar will end up being undertaken by other scientists.  
And while Kosmala does not make this point explicitly, this serves as a disincentive for scientists to produce data, if only because  younger scholars may not be able to produce sufficient publications to get the funding and tenure they need to continue their careers. + +###### What is really interesting about this blog is that it led to a discussion between a reader and the author about the ethics of “requiring co-authorship” when authors use data produced by another scientist.  Missing from the discussion was the recognition that “requiring co-authorship” provides a potential solution to the problem of open data.  It is a way for the data-producing scientist to reap the rewards of data production, while still allowing other authors to use it. + +###### Of course, there are issues associated with implementing a policy like this.  Once data are released, how will the data-producing author be able to ensure that others who use the data will extend co-authorship to him/her?  And suppose the data-producing author does not wish to have their data used in a certain way.  Should they have the right to restrict its use?  While the answers are debatable, the questions are illuminating, because they make us realise that the debate over open data is just another application of the larger subject of ***[intellectual property rights](https://en.wikipedia.org/wiki/Intellectual_property)***. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/06/28/bob-reed-the-problem-with-open-data-would-requiring-co-authorship-help/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/06/28/bob-reed-the-problem-with-open-data-would-requiring-co-authorship-help/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/brodeur-launching-the-institute-for-replication-i4r.md b/content/replication-hub/blog/brodeur-launching-the-institute-for-replication-i4r.md new file mode 100644 index 00000000000..cb6739728aa --- /dev/null +++ b/content/replication-hub/blog/brodeur-launching-the-institute-for-replication-i4r.md @@ -0,0 +1,74 @@ +--- +title: "BRODEUR: Launching the Institute for Replication (I4R)" +date: 2022-01-08 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Abel Brodeur" + - "AEA Data Editor" + - "BITSS" + - "economics" + - "I4R" + - "Institute for Replication" + - "political science" + - "replication" + - "Replicators" + - "Social Science Reproduction Platform" +draft: false +type: blog +--- + +Replication is key to the credibility and confidence in research findings. As falsification checks of past evidence, replication efforts contribute in essential ways to the production of scientific knowledge. They allow us to assess which findings are robust, making science a self-correcting system, with major downstream effects on policy-making. Despite these benefits, reproducibility and replicability rates are surprisingly low, and direct replications rarely published. Addressing these challenges requires innovative approaches in how we conduct, reward, and communicate the outcomes of reproductions and replications. 
+ +That is why we are excited to announce the official launch of the **[*Institute for Replication*](https://i4replication.org/) (I4R), an institute working to improve the credibility of science by systematically reproducing and replicating research findings in leading academic journals.** Our team of collaborators supports researchers and aims to improve the credibility of science by + +– Reproducing, conducting sensitivity analysis and replicating results of studies published in leading journals. + +– Establishing an open access website to serve as a central repository containing the replications, responses by the original authors and documentation. + +– Developing and providing access to educational material on replication and open science. + +– Preparing standardized file structure and code and documentation aimed at facilitating reproducibility and replicability by the broader community. + +**How I4R works** + +Our primary goal is to promote and generate replications. Replications may be achieved using the same or different data and procedures/codes, and a variety of [***definitions***](https://i4replication.org/definitions.html) are being used. + +While I4R is not a journal, we are actively looking for replicators and have an [***ongoing list of studies***](https://i4replication.org/reports.html) we’re looking to be replicated. Once a set of studies has been selected by I4R, our team of collaborators will confirm that the codes and data provided by the selected studies are sufficient to reproduce their results. Once that has been established, our team recruits replicators to test the robustness of the main results of the selected studies. + +For their replication, replicators may use the [***Social Science Reproduction Platform***](https://www.socialsciencereproduction.org/)). We also developed a template for writing replications which is available [***here***](https://osf.io/8dkxc/). This template provides examples of robustness checks and how to report the replication results. Once the replication is completed, we will be sending a copy to the original author(s) who will have the opportunity to provide an answer. Both the replication and answer from the original author(s) will be simultaneously released on our ***[website](https://i4replication.org/reports.html)*** and working paper series. + +Replicators may decide to remain anonymous. The decision to remain anonymous can be made at any point during the process; initially, once completed or once the original author(s) provided an answer. See ***[Conflict of Interest](https://i4replication.org/conflict.html)*** page for more details. + +We will provide assistance for helping replicators publish their work. Replicators will also be invited to co-author a large meta-analysis paper which will combine the work of all replicators and answer questions such as which type of studies replicate and what characterizes results that replicate. For more on publishing replications, keep reading! + +**We need your help** + +I4R is open to all researchers interested in advancing the reproducibility and replicability of research. We need your help reproducing and replicating as many studies as possible. Please contact us if you are interested in helping out. We are also actively looking for researchers with large networks to serve on the editorial board, especially in the field of macroeconomics and international relations for political science. 
+
+Beyond helping out with replication efforts, you can help our community by bringing replication to your classroom. If you want to teach replication in class assignments, our team has developed some resources that might be of interest. A list of educational resources is available ***[here](https://bitss.github.io/ACRE/guidance-for-instructors-supervising-reproduction-assignments.html)***.
+
+A very useful resource is the ***[Social Science Reproduction Platform (SSRP)](https://www.socialsciencereproduction.org/)***, which was developed by our collaborators at the Berkeley Initiative for Transparency in the Social Sciences in collaboration with the AEA Data Editor. This is a platform for systematically conducting and recording reproductions of published social science research. The SSRP can be easily incorporated as a module in applied social science courses at graduate and undergraduate levels. Students can use the platform and materials with little to no supervision, covering learning activities such as assessing and improving the reproducibility of published work and applying good coding and data management practices. Guidance for instructors, such as how to select a paper, timelines, and grading strategy, is available ***[here](https://bitss.github.io/ACRE/guidance-for-instructors-supervising-reproduction-assignments.html)***.
+
+Reach out to us if you want to learn more about the SSRP and other teaching resources. We are here to help!
+
+**Where to publish replications**
+
+Incentives for replications are currently limited, with a small number of replications published in top journals. Moreover, reproducing or replicating others’ work can lead to disagreements with the original author(s) whose work is re-analyzed. One of I4R’s main objectives is to address these challenges and help researchers conduct and disseminate reproductions and replications.
+
+As a first step to better understand publication possibilities for replicators, our collaborators (Jörg Peters and Nathan Fiala) and the Chair, Abel Brodeur, have been contacting the editors of top economics, finance and political science journals, asking whether they are willing to publish comments on papers published in their journal and/or comments on studies published elsewhere. The answers are made publicly available on our ***[website](https://i4replication.org/publishing.html)***. We also highlight special issues/symposiums dedicated to replications and journals which strictly publish comments. Please contact us if you want to advertise other replication efforts or special issues related to open science and replications.
+
+**We will continue developing new and exciting features based on input from the community. Do not hesitate to reach out to us!**
+
+RESOURCES: Twitter @I4Replication
+
+*Abel Brodeur is Associate Professor in the Department of Economics, University of Ottawa, and founder and chair of the Institute for Replication (I4R). He can be reached at [abrodeur@uottawa.ca](mailto:abrodeur@uottawa.ca).*

### Share this:

* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2022/01/08/brodeur-launching-the-institute-for-replication-i4r/?share=twitter)
* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2022/01/08/brodeur-launching-the-institute-for-replication-i4r/?share=facebook)

Like Loading...
\ No newline at end of file diff --git a/content/replication-hub/blog/brown-how-to-conduct-a-replication-study-what-not-to-do.md b/content/replication-hub/blog/brown-how-to-conduct-a-replication-study-what-not-to-do.md new file mode 100644 index 00000000000..37f651e810c --- /dev/null +++ b/content/replication-hub/blog/brown-how-to-conduct-a-replication-study-what-not-to-do.md @@ -0,0 +1,55 @@ +--- +title: "BROWN: How to Conduct a Replication Study – What Not To Do" +date: 2018-11-16 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "3ie" + - "Annette N. Brown" + - "Don'ts" + - "FHI 360" + - "R&E Search for Evidence blog" + - "replication" + - "retraction" + - "Which tests" + - "Workshop" +draft: false +type: blog +--- + +###### *[This post is based on a presentation by Annette Brown at the****[Workshop on Reproducibility and Integrity in Scientific Research](https://replicationnetwork.com/2018/09/21/all-invited-workshop-on-reproducibility-and-integrity-in-scientific-research/)****, held at the University of Canterbury, New Zealand, on October 26, 2018. It is cross-published on FHI 360’s****[R&E](https://researchforevidence.fhi360.org/how-to-conduct-a-replication-study-which-tests-not-witch-hunts)[Search for Evidence blog](https://researchforevidence.fhi360.org/how-to-conduct-a-replication-study-what-not-to-do)****]* + +###### Two weeks ago, on Halloween, I wrote a ***[post](https://researchforevidence.fhi360.org/how-to-conduct-a-replication-study-which-tests-not-witch-hunts)*** about how to conduct a replication study using an approach that emphasizes which tests might be run in order to avoid the perception of a witch hunt. The post is based on my ***[paper](http://www.economics-ejournal.org/economics/journalarticles/2018-53)*** with ***[Benjamin D.K. Wood](https://sites.google.com/view/bdkwood/home)***, which I recently presented at the “Reproducibility and Integrity in Scientific Research” ***[workshop](https://replicationnetwork.com/2018/09/21/all-invited-workshop-on-reproducibility-and-integrity-in-scientific-research/)*** at the University of Canterbury. When Ben and I first submitted the paper to ***[Economics E-journal](http://www.economics-ejournal.org/)******,*** we received some great referee comments (all of which are ***[public](http://www.economics-ejournal.org/economics/discussionpapers/2017-77)***) including requests by an anonymous referee and ***[Andrew Chang](https://sites.google.com/site/andrewchristopherchang/)*** to include in the paper a list of what not to do – a list of don’ts. + +###### We spent some time thinking about this request. We realized that what the referees wanted was a list of statistical and econometric no-nos, especially drawing on the most controversial replication studies funded by the ***[International Initiative for Impact Evaluation](http://www.3ieimpact.org/en/evaluation/impact-evaluation-replication-programme/)*** (3ie) while we were both there. However, our role at 3ie was to be a neutral third party, at least as much as possible, and we didn’t want to abandon that now. + +###### At the same time, we did learn a lot of lessons about conducting replication research while at 3ie, and we agreed that some of those lessons would be appropriate don’ts. So we added a checklist of don’ts to the paper that was ultimately published. Here I summarize three of these don’ts. 
I should note that I’m talking here about internal replication studies, which is when the replication researcher uses the original data from a publication to check whether the published findings can be exactly reproduced and are robust, particularly those findings supporting conclusions and recommendations. + +###### When conducting a replication study, don’t confuse critiques of the original research with the replication tests or findings. Certainly, critiques of the original research can motivate the choice of replication exercises, and it is fine to present critiques in that context. But often there are critiques that are separate from what can be explored with the data. For example, a replication researcher might be concerned that the fact that treatment and controls groups were unblinded may mean the published findings are biased. + +###### This concern about the original research design may be valid, but it is not something that can be tested through replication exercises. Simply identifying this concern is not a replication finding. We saw many examples where replication researchers interspersed their critiques of the motivation or design of the original research with their replication exercises and results. Mixing these two types of analysis contributed to some of the biggest controversies that we witnessed. + +###### Don’t conduct measurement and estimation analysis (which some call robustness testing) before conducting a pure replication. (See ***[here](https://www.tandfonline.com/doi/full/10.1080/00220388.2018.1506582)*** and ***[here](https://www.tandfonline.com/eprint/PmF6PJdXzd8c33F9wn2K/full)*** for more on terminology and the 3ie replication program.) Often replication researchers begin a study motivated by questions of robustness and may even take for granted that a pure replication (which is applying the published analysis methods to the original data) would reproduce the published results. + +###### While skipping the pure replication may seem like a way to save time, conducting the pure replication often has the benefit of saving time. The pure replication is the best way for the replication researcher to familiarize herself with the data, methods, and findings of the original publication, and missing a problem at the pure replication stage is only going to confuse the measurement and estimation analysis. + +###### Even more to the point, some consider pure replication the only stage of the research that should be called “replication”, and therefore the only results that should be reported as replication results. It is important for a replication researcher to be able to make a clear statement about the results at this stage. + +###### Don’t present, post or publish replication results without first sharing them with the original authors. Replication research is, unfortunately, often a contentious undertaking. Replication researchers are advised to take the high road and communicate with original authors about their work – ideally from the beginning, even if the data are already publicly available. We saw cases where the replication researchers made mistakes that the original authors caught, so communication can save face on both sides. + +###### There is a real concern about the original authors scooping a replication study by posting a correction without citing the replication researchers. We have seen this happen. Some approaches to addressing it include publicly posting the replication plan in advance. 
This research transparency approach serves multiple purposes, but one is putting a name and timestamp on the work that might lead to corrections. Another approach is to document the dates and subjects of communications with original authors and include this information, as an acknowledgement or footnote, in the replication study. + +###### Perhaps one of our most important don’ts is don’t label the difference between a published result and a replication study results an “error” or “mistake” without identifying the source of the error. Just because the second estimate is different than the first does not make the second right. Ben and I already blogged about this don’t recommendation ***[here](http://blogs.worldbank.org/impactevaluations/when-error-not-error-guest-post-annette-n-brown-and-benjamin-d-k-wood)*** on the World Bank Development Impact blog. + +###### Recent revelations, such as last month’s ***[report of the retraction](https://www.vox.com/platform/amp/science-and-health/2018/9/19/17879102/brian-wansink-cornell-food-brand-lab-retractions-jama?__twitter_impression=true)*** of 15 articles by well-known Cornell food researcher, Brian Wansink, remind us that replication research is as important to the advancement of the natural and social sciences as ever. My hope is that more researchers accept the responsibility of conducting replication research as part of their contribution to science. The advice presented in the ***[which tests paper](http://www.economics-ejournal.org/economics/journalarticles/2018-53)*** and summarized in my ***[last post](https://researchforevidence.fhi360.org/how-to-conduct-a-replication-study-which-tests-not-witch-hunts)*** and this one is intended to help them get started. + +###### *Annette N. Brown, PhD is Principal Economist at FHI 360, where she leads efforts to increase and enhance evidence production and use across all sectors and regions. She previously worked at 3ie, where she directed the research transparency programs, including the replication program.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/11/16/brown-how-to-conduct-a-replication-study-what-not-to-do/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/11/16/brown-how-to-conduct-a-replication-study-what-not-to-do/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/brown-how-to-conduct-a-replication-study-which-tests-not-witch-hunts.md b/content/replication-hub/blog/brown-how-to-conduct-a-replication-study-which-tests-not-witch-hunts.md new file mode 100644 index 00000000000..59db70dd913 --- /dev/null +++ b/content/replication-hub/blog/brown-how-to-conduct-a-replication-study-which-tests-not-witch-hunts.md @@ -0,0 +1,68 @@ +--- +title: "BROWN: How to Conduct a Replication Study – Which Tests, Not Witch Hunts" +date: 2018-11-01 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "3ie" + - "Annette N. Brown" + - "Deworming study" + - "FHI 360" + - "replication" + - "Research & evaluation" + - "research methods" + - "worm wars" +draft: false +type: blog +--- + +###### *[This post is based on a presentation by Annette Brown at the **[Workshop on Reproducibility and Integrity in Scientific Research](https://replicationnetwork.com/2018/09/21/all-invited-workshop-on-reproducibility-and-integrity-in-scientific-research/)**, held at the University of Canterbury, New Zealand, on October 26, 2018. 
It is cross-published on FHI 360’s **[R&E](https://researchforevidence.fhi360.org/how-to-conduct-a-replication-study-which-tests-not-witch-hunts)** **[Search for Evidence blog](https://researchforevidence.fhi360.org/how-to-conduct-a-replication-study-which-tests-not-witch-hunts)**]* + +###### Last week I was treated to a great ***[workshop](https://replicationnetwork.com/2018/09/21/all-invited-workshop-on-reproducibility-and-integrity-in-scientific-research/)*** titled “Reproducibility and Integrity in Scientific Research” at the University of Canterbury where I presented my article (joint with ***[Benjamin D.K. Wood](https://sites.google.com/view/bdkwood)***), “***[Which tests not witch hunts: A diagnostic approach for conducting replication research](http://www.economics-ejournal.org/economics/journalarticles/2018-53)***.” The article provides tips and resources for researchers seeking a neutral approach to replication research. In honor of the workshop and Halloween, I thought I’d scare up a blog post summarizing the article. + +###### **Why conduct replication research?** + +###### Suppose you’ve read a study that you consider to be innovative or influential. Why might you want to conduct a replication study of it? Here when I say ‘replication study’, I mean internal replication (or desk replication), for which the researcher uses the study’s original data to reassess the study’s findings. There are three reasons you might want to conduct such a study: to prove it right, to learn from it, or to prove it wrong. We rarely see the first reason stated, making it a bit of phantom. However, I am a big fan of ***[conducting replication research to validate a study’s findings](https://www.tandfonline.com/doi/full/10.1080/19439342.2014.944555)*** for the purpose of policy making or program design. We see the second reason – to learn from it – more often, although often in the context of graduate school courses on quantitative methods. + +###### Instead, many fear that most replication studies are conducted with the desire to prove a study wrong. ***[Zimmerman (2015)](https://www.aeaweb.org/conference/2016/retrieve.php?pdfid=436)*** considers “turning replication exercises into witch hunts” to be an easy pitfall of replication research. ***[Gertler, Galiani, and Romero (2018)](https://www.nature.com/articles/d41586-018-02108-9)*** report that unnamed third parties “speculated” that researchers for a well-known replication study sought to overturn results. The specter of speculation aside, why might replication researchers look for faults in a study? + +###### One reason is publication bias. Experience shows that replication studies that question the results of original studies are more likely to be published, and Gertler, Galiani, and Romero (2018) provide evidence from a survey of editors of economics journals showing that editors are much more likely to publish a replication study that overturns results than one that confirms results. Regardless of publication bias, however, my experience funding replication studies while working at the ***[International Initiative for Impact Evaluation (3ie)](http://www.3ieimpact.org/en/evaluation/impact-evaluation-replication-programme/)*** is that not all replication researchers carry torches and pitchforks. Many just don’t know where to start when conducting replication research. 
Without some kind of template or checklist to work from, these researchers are often haunted by the academic norm of critical review and approach their replication work from that standpoint. + +###### To address this challenge, Ben Wood and I set out to develop a neutral approach to replication research based on elements of quantitative analysis and using examples from 3ie-funded replication studies. This approach is intended for researchers who want to dissect a study beyond just a pure replication (which is using the study’s methods and original data to simply reproduce the results in the published article). The diagnostic approach includes four categories: assumptions, data transformations, estimation, and heterogeneous outcomes. + +###### **Assumptions** + +###### The application of methods and models in conducting empirical research always involves making assumptions. Often these assumptions can be tested using the study data or using other data. Since my focus is often development impact evaluation, the assumptions I see most often are those supporting the identification strategy of a study. Examples include assuming no randomization failure in the case of random-assignment designs or assuming unobservables are time invariant in the case of difference-in-difference designs. Many other assumptions are also often necessary depending on the context of the research. For example, when looking at market interventions, researchers often assume that agents are small relative to the market (i.e., price takers). Even if the study data cannot be used to shed light on these assumptions, there may be other data that can. + +###### In the ***[Whitney, Cameron, and Winters (2018)](https://www.tandfonline.com/doi/abs/10.1080/00220388.2018.1506576)*** replication study of the ***[Galiani and Schargrodsky (2010)](https://www.sciencedirect.com/science/article/abs/pii/S0047272710000654)*** impact evaluation of a property rights policy change in Buenos Aires, the replication researchers note that the original authors provide balance tables for the full sample of 1,082 parcels but only conduct their analysis on a subset of 300 parcels. Whitney, et al. test the pre-program balance between program and comparison parcels on four key characteristics for the households in the analysis subset and find statistically significant differences for three of the four. Their further tests reveal that these imbalances do not change the ultimate findings of the study, however. + +###### **Data transformations** + +###### There is a lot of hocus pocus that goes into getting data ready for analysis. These spells determine what data are used, including decisions about whether to kill outliers, how to bring missing values back from the dead, and how to weight observations. We also often engage in potion making when we use data to construct new variables, including variables like aggregates (e.g., income and consumption) and indexes (e.g., empowerment and participation). Replication researchers can use the study data and sometimes outside data in order to answer questions about whether these choices are well supported and whether they make a difference to the analysis. 
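###### As a minimal sketch of what such a check can look like in practice, the snippet below re-estimates a simple treatment-effect regression with and without a sample-exclusion rule and compares the coefficients. The simulated data frame and variable names (`outcome`, `treatment`, `excluded_by_authors`) are hypothetical stand-ins, not variables from any of the studies discussed in this post.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "excluded_by_authors": rng.random(n) < 0.1,   # e.g., observations dropped for "idiosyncratic shocks"
})
df["outcome"] = 0.3 * df["treatment"] + rng.normal(size=n)

# Published specification: drop the flagged observations.
fit_published = smf.ols("outcome ~ treatment",
                        data=df[~df["excluded_by_authors"]]).fit()

# Replication check: keep the full sample and compare.
fit_full = smf.ols("outcome ~ treatment", data=df).fit()

for label, fit in [("excluding flagged obs", fit_published), ("full sample", fit_full)]:
    print(f"{label}: beta = {fit.params['treatment']:.3f}, "
          f"p = {fit.pvalues['treatment']:.3f}")
```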
+ +###### [Kuecken and Valfort (2018)](https://www.tandfonline.com/doi/full/10.1080/00220388.2018.1506575) question the decision by ***[Reinikka and Svensson (2005)](https://onlinelibrary.wiley.com/doi/abs/10.1162/jeea.2005.3.2-3.259)*** to exclude certain schools from the analysis dataset used for their study of how an anti-corruption newspaper campaign affects enrollment and learning. The original study includes a footnote that the excluded schools experienced reductions in enrollment due to “idiosyncratic shocks”, which the original authors argue should not be systematically correlated with the explanatory variable. Kuecken and Valfort resurrect the excluded schools and find that the published statistical significance of the findings is sensitive to the exclusion. + +###### **Estimation methods** + +###### There are two sets of replication questions around estimation methods. One is whether different methods developed for similar statistical tasks produce the same results. A well-known example is the replication study conducted by epidemiologists ***[Aiken, Davey, Hargreaves, and Hayes (2015)](https://academic.oup.com/ije/article/44/5/1572/2594560)***  (published as ***[two articles](https://academic.oup.com/ije/article/44/5/1581/2594562)***) of an impact evaluation of a health intervention conducted by economists ***[Miguel and Kremer (2004)](https://www.jstor.org/stable/3598853?seq=1#page_scan_tab_contents)***. This replication study combined with systematic review evidence resulted in the ***[worm wars](https://blogs.worldbank.org/impactevaluations/worm-wars-anthology)***, which were indeed spine-chilling. The second set of questions is how sensitive (or robust) the results are to parameters or other choices made when applying estimation methods. Many published studies include some sensitivity tests, but there are sometimes additional sensitivity tests that can be conducted. + +###### [Korte, Djimeu, and Calvo (2018)](https://www.tandfonline.com/doi/abs/10.1080/00220388.2018.1506580) do the converse of worm wars – they apply econometric methods to data from an epidemiology trial by ***[Bailey, et al. (2007)](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(07)60312-2/fulltext)*** testing whether male circumcision reduces incidence of HIV infection. For example, Korte, et al. exploit the panel nature of the data, that is, repeated observations of the same individuals over time, by running a fixed effects model, which controls for unobserved individual differences that don’t change over time. They find that the econometric methods produce very similar results as the biostatistical methods for the HIV infection outcome, but produce some different results for the tests of whether male circumcision increases risky sexual behavior. + +###### **Heterogeneous outcomes** + +###### Understanding whether the data from a published study point to heterogeneous outcomes can be important for using the study’s findings for program design or policy targeting. These further tests on a study’s data are likely to be exploratory rather than confirmatory. For example, one might separate a random-assignment sample into men and women for heterogeneous outcomes analysis even if the randomization did not occur for these two groups separately. Exploration of heterogeneous outcomes in a replication study should be motivated by theoretical or clinical considerations. 
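###### For a concrete template, the sketch below shows one common way to probe heterogeneity of this kind: interact the treatment indicator with a pre-specified baseline characteristic and inspect the interaction term. The simulated variables (`outcome`, `treatment`, `grew_cash_crops`) are hypothetical placeholders chosen to echo the example that follows, not the actual study data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "grew_cash_crops": rng.integers(0, 2, n),  # baseline characteristic
})
# Simulated effect only for households that did NOT grow cash crops at baseline.
df["outcome"] = (0.4 * df["treatment"] * (1 - df["grew_cash_crops"])
                 + rng.normal(size=n))

# Exploratory heterogeneity: treatment, subgroup, and their interaction.
fit = smf.ols("outcome ~ treatment * grew_cash_crops", data=df).fit()
print(fit.params[["treatment", "treatment:grew_cash_crops"]])
# A negative interaction would suggest the benefit is concentrated among
# households that did not grow cash crops before the program -- exploratory
# evidence to be interpreted with care, as the post emphasizes.
```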
+ +###### [Wood and Dong (2018)](https://www.tandfonline.com/doi/full/10.1080/00220388.2018.1506574) re-examine an agricultural commercialization impact evaluation conducted by ***[Ashraf, Giné, and Karlan (2009)](https://www.jstor.org/stable/20616255?seq=1#page_scan_tab_contents)***. The commercialization program included promoting certain export crops and making it easier to sell all crops. The original study explores heterogeneous outcomes by whether the sample farmers grew the export crops before the intervention or not and find that those who did not grow these crops are more likely to benefit. Wood and Dong use value chain theory to hypothesize that the benefits of the program come from bringing farmers to the market, that is getting them to sell any crops (domestic or export). They look at heterogeneous outcomes by whether farmers grew any cash crops before the program and find that only those who did not grow cash crops benefit from the program. + +###### Internal replication research provides validation of published results, which is especially important when those results are used for policy making and program design (***[Brown, Cameron, and Wood, 2014](https://www.tandfonline.com/doi/full/10.1080/19439342.2014.944555)***). It doesn’t need to be scary, and original authors don’t need to be spooked. The “which tests not witch hunts” paper provide tips and resources for each of the topics described above. The paper also provides a list of “don’ts” for replication research, which I’ll summarize in a separate post. Happy Halloween! + +###### *Annette N. Brown, PhD is Principal Economist at FHI 360, where she leads efforts to increase and enhance evidence production and use across all sectors and regions. She previously worked at 3ie, where she directed the research transparency programs, including the replication program.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/11/01/brown-how-to-conduct-a-replication-study-which-tests-not-witch-hunts/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/11/01/brown-how-to-conduct-a-replication-study-which-tests-not-witch-hunts/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/brown-is-the-evidence-we-use-in-international-development-verifiable-push-button-replication-provide.md b/content/replication-hub/blog/brown-is-the-evidence-we-use-in-international-development-verifiable-push-button-replication-provide.md new file mode 100644 index 00000000000..d5ecd60aa0f --- /dev/null +++ b/content/replication-hub/blog/brown-is-the-evidence-we-use-in-international-development-verifiable-push-button-replication-provide.md @@ -0,0 +1,79 @@ +--- +title: "BROWN: Is the Evidence We Use in International Development Verifiable? Push Button Replication Provides the Answer" +date: 2019-01-09 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "3ie" + - "3ie replication programme" + - "Annette N. 
Brown" + - "Benjamin Wood" + - "FHI 360" + - "Impact evaluation" + - "International development" + - "PLOS ONE" + - "Push button reproducibility" + - "R&E Search for Evidence" + - "replication" +draft: false +type: blog +--- + +###### *[This post is* *cross-published on FHI 360’s****[R&E](https://researchforevidence.fhi360.org/how-to-conduct-a-replication-study-which-tests-not-witch-hunts)[Search for Evidence blog](https://researchforevidence.fhi360.org/is-the-evidence-we-use-in-international-development-verifiable-push-button-replication-provides-the-answer)****]* + +###### There are many debates about the definitions and distinctions for replication research, particularly for *internal* replication research, which is conducted using the original dataset from an article or study. The debaters are concerned about what kinds of replication exercises are appropriate and about how (and whether) to make determinations of “success” and “failure” for a replication. + +###### What everyone seems to agree, however, is that the most basic test – the lowest bar for any publication to achieve – is that a third party can take the authors’ software code and data and apply the code to the data to reproduce the findings in the published article. This kind of verification should be a no-brainer, right? But it turns out, as reported in a ***[newly published article](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0209416)*** in *PLOS ONE*, only 25% of the articles ***[Benjamin D.K. Wood](https://sites.google.com/view/bdkwood/home)***, ***[Rui Müller](https://www.researchgate.net/profile/Rui_Mueller)***, and I tested met this bar. Only 25% are verifiable! + +###### I suspect (hope) that this finding raises a lot of questions in your mind, so let me try to answer them. + +###### **Why did you test this?** + +###### We embarked on the push button replication exercise in 2015 when Ben and I were still at the International Initiative for Impact Evaluation (***[3ie](http://www.3ieimpact.org/)***) running its ***[replication program(me)](http://www.3ieimpact.org/what-we-do/replication)***. At a consultation event we hosted, the discussion turned to the question of whether we should believe that published articles can all be replicated in this way. Some argued of course we should, but others maintained that we cannot take for granted that published articles are supported by clean data and code that easily reproduce the tables and graphs. So we decided to test it. + +###### In our quest to use replication terms that are self-explanatory (see ***[here](https://www.tandfonline.com/doi/abs/10.1080/19439342.2014.944555)*** for other examples) we decided to call it push button replication – you can push a button to replicate the findings presented in the tables and figures in a published study. As children of the Midwest United States, we also thought the PBR acronym was fun. + +###### **How did you test this?** + +###### We started by selecting a sample of articles to test. Our work at 3ie revolved around development impact evaluations, so we focused on these types of studies. One advantage of studying research within this set is that the studies use similar quantitative methods, but span many sectors and academic disciplines. We used the records in 3ie’s Impact Evaluation Repository for the period 2010 to 2012 to identify the ***[top ten journals publishing development impact evaluations](https://doi.org/10.1371/journal.pone.0209416.t002)***. 
Then we screened all the articles published in those ten journals in 2014 for those that qualified as development impact evaluations. We ended up with a sample of 109 articles. + +###### That was the easy part. We also developed a detailed protocol for conducting a push button replication. The protocol outlines very clear procedures for requesting data and code, pushing the button, and selecting a classification. We piloted and then revised the protocol a few times before finalizing it for the project. We also created a project in the ***[Open Science Framework](https://osf.io/)*** (OSF) and posted the protocol and other project documents ***[there](https://osf.io/yfbr8/)*** for transparency. + +###### To be clear, the journals in our sample had different replication data requirements in 2014. One journal required public replication files, and two others required that replication files be made available upon request. The rest had no requirements. We decided at the beginning that we did not just want to look for publicly available files as other studies like ours have done. We wanted to observe first whether the requirements that do exist are working and second whether articles in journals without any requirements are third-party verifiable. We have witnessed many researchers who are ahead of the journals in adopting research transparency practices, so we were hopeful that authors would hold themselves to a verifiability standard even if their journal did not require them to do so. + +###### With the sample and the protocol in hand, we set out to attempt push button replications for each article. At this point Rui joined the team and offered to take on all the economics impact evaluations in the sample as part of his master’s thesis. + +###### **What did you find?** + +###### We present the primary results in Figure 1 in the paper, which is copied below. For the majority of the articles in the sample (59 out of 109) the authors refused to provide the data and code for verification. They just said no. Even some who stated in their articles that replication data would be provided upon request just said no when we made that request. And even some who published in journals requiring that replication files be provided just said no. Not just some, a lot. The authors of ten of 20 articles from the *Journal of Development Economics* and 24 of 34 articles from *PLOS ONE*, both journals with requirements for providing replication files, refused to provide data and code for verification. + +###### *[Figure 1 from the paper: push button replication results for the 109 sampled articles]* + +###### But, you say, some of those data must be proprietary! Yes, some of the authors claimed that, but they needed to prove it to be classified as proprietary data. We rejected six unsubstantiated claims but did classify the three substantiated claims as having proprietary data (the turquoise squares). + +###### You might be saying to yourself, “why would authors give access to their code and data if they didn’t know what you were going to do with them?” But they did know. The push button replication protocol and the description of our project were publicly available, and we offered to sign whatever kind of confidentiality agreement regarding the data was necessary. You might also be objecting that we didn’t give them enough chances or enough time. But we far exceeded our stated protocol in terms of the number of reminders we sent and the length of time we waited. 
In fact, we would have accepted any replication files that came in before we finalized the results, so authors really had from our first request in 2016 until early 2018 to provide the data. + +###### We did receive data for 47 articles. For 15 of these, we received data and code but not enough to fully reproduce the tables and figures in the published articles. These are classified as incomplete (the royal blue squares). For the rest, the news is good. Of the 32 articles that we were able to push button replicate, 27 had comparable findings. Five had some minor differences, especially when focusing on the tables tied to the articles’ key results. + +###### **Do these findings matter?** + +###### You might look at the figure and conclude, “only five complete push button replications found minor differences”, so that’s good news! Well, yes, but I see it this way: for twenty of the 47 articles for which we received data, we know that the authors’ data and code cannot completely or comparably reproduce the published findings. That’s 43%. Is there any reason to believe that the rate is lower for those articles for which the authors refused to provide the files? I don’t think so. If anything, one might hypothesize the opposite. + +###### Our conclusion is that much of the evidence that we want to use for international development, evidence from both the health sciences and the social sciences, is not third-party verifiable. In the *PLOS ONE* article, we present additional results, including the classifications by each of the ten journals and the results according to some of the funders of these studies. + +###### **What do you recommend?** + +###### First, unfortunately, it is not enough for a journal to simply have a policy. Many academics do not respect policies that the journals do not enforce. The exception to this in our sample was *American Economic Journal: Applied Economics*. It had an upon-request policy, and we received the data and code for six out of eight articles, with the other two meeting the requirements to be classified as proprietary data. + +###### Second, many health scientists and social scientists are lagging not just in research transparency practices, but also in good research practices. Even for publications as recent as 2014, many authors did not maintain complete data and code to reproduce their published findings. Fifteen of the 47 for which we received files did not have complete files to send. In many fields there are formal and informal associations of researchers who are pushing for better practices, but I believe that a sea change will require firm action on the part of journals. + +###### *Annette N. Brown, PhD is Principal Economist at FHI 360, where she leads efforts to increase and enhance evidence production and use across all sectors and regions. She previously worked at 3ie, where she directed the research transparency programs, including the replication program.* + +### Share this: + +* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2019/01/09/brown-is-the-evidence-we-use-in-international-development-verifiable-push-button-replication-provides-the-answer/?share=twitter) +* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2019/01/09/brown-is-the-evidence-we-use-in-international-development-verifiable-push-button-replication-provides-the-answer/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/brown-lambert-wojan-at-the-intersection-of-null-findings-and-replication.md b/content/replication-hub/blog/brown-lambert-wojan-at-the-intersection-of-null-findings-and-replication.md new file mode 100644 index 00000000000..646510fe1a3 --- /dev/null +++ b/content/replication-hub/blog/brown-lambert-wojan-at-the-intersection-of-null-findings-and-replication.md @@ -0,0 +1,38 @@ +--- +title: "BROWN, LAMBERT, & WOJAN: At the Intersection of Null Findings and Replication" +date: 2018-08-23 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Conservation Reserve Program" + - "Effect sizes" + - "Hypothesis testing" + - "null results" + - "replication" + - "Statistical power" +draft: false +type: blog +--- + +###### Replication is an important topic in economic research, or any social science for that matter. This issue is most important when an analysis is undertaken to inform decisions by policymakers. Drawing inferences from null or insignificant findings is particularly problematic because it is often unclear when “not significant” can be interpreted as “no effect.” We recently wrestled with this issue in our paper, “***[The Effect of the Conservation Reserve Program on Rural Economies: Deriving a Statistical Verdict from a Null Finding](https://academic.oup.com/ajae/advance-article-abstract/doi/10.1093/ajae/aay046/5054572)***,” published in the *American Journal of Agricultural Economics*.  Below is a summary of our findings. + +###### While an inherent bias to publish research with significant findings is widely recognized, there are times when not finding an effect may be more important. For example, suggestive evidence that a policy may not work is arguably more consequential than statistical confirmation that it does. The conundrum produced by null findings is not having any statistical basis for determining whether the true effect is close to zero or if the test is underpowered—that is, unlikely to detect a substantive effect. Our paper developed a method for deriving probabilities for null findings by providing a valid *ex post* estimate of statistical power. This allows economists and policymakers to more confidently conclude when “not significant” can, in fact, be interpreted as “no substantive effect.” + +###### We demonstrate our method by replicating an analysis from the Economic Research Service’s (ERS) 2004 Report to Congress on the economic implications of the Conservation Reserve Program (CRP). The program, which was signed into law in 1985, was designed to remove environmentally vulnerable land from agricultural production. However, farm-dependent counties experienced both employment and population declines through the economically prosperous 1990s, raising concerns that the program might have cost jobs due to a reduction in agricultural production. Indeed, the ERS report identified worse employment growth in farm-dependent counties with high-CRP enrollments relative to their low-CRP enrollment peers. However, the report was unable to attribute lost employment to CRP enrollments. + +###### While the report failed to identify a statistically significant, negative long-term effect of the program on employment growth, the authors cautioned that the verdict of “no negative employment effect” was only valid if the econometric test was statistically powerful. Replicating the 2004 analysis using new statistical inference methods allowed us to determine whether the tentative 2004 conclusion was correct. 
Our replication addresses two critical deficiencies that prevent economists from estimating statistical power: 1) we posit a compelling effect size (the level of job losses that would raise concerns regarding the trade-off with environmental benefits), and 2) we estimate the variability of an unobserved alternative distribution using simulation methods. We conclude that the test used in the ERS report had high power for detecting employment effects of −1 percent or lower, equivalent to job losses that would reduce the program’s environmental benefits by a third. An unrestricted test in line with Congress’s charge to search for “any effect” had very low power. + +###### In many circumstances, economists do not have the opportunity to conduct power analysis before research starts. The approaches we suggest can be used to determine power for univariate analyses or multivariate regressions after the fact, provided the data-generating process can be replicated and the effect size of economic significance or policy relevance is stated. Given a range of posited effect sizes, our approach supplements an array of tools to inform decision making in the event of a null finding. + +###### In the spirit of replication, you can find our data and code in the supporting documentation of the article. If you are not able to access the article, the supplemental materials are also available ***[here](https://www.kansascityfed.org/~/media/files/publicat/reswkpap/pdf/blw_sup_files.zip?la=en)***. We hope that others confronted with the “null hypothesis lacking error probability” conundrum will consider using the methods as a tool for making null findings potentially more informative, and for making our toolkit of applied econometric methods more useful for decision-making. + +###### *Jason P. Brown is an assistant vice president and economist at the Federal Reserve Bank of Kansas City. Dayton M. Lambert is a professor and Willard Sparks Chair, Department of Agricultural Economics, Oklahoma State University. Timothy R. Wojan is a senior economist, USDA, Economic Research Service. The opinions expressed are those of the authors and are not attributable to the Federal Reserve Bank of Kansas City, the Federal Reserve System, Oklahoma State University, the Economic Research Service, or USDA. Correspondence can be directed to Jason Brown at Jason.Brown@kc.frb.org.* + +### Share this: + +* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2018/08/23/brown-lambert-wojan-at-the-intersection-of-null-findings-and-replication/?share=twitter) +* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2018/08/23/brown-lambert-wojan-at-the-intersection-of-null-findings-and-replication/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/butera-a-novel-approach-for-novel-results.md b/content/replication-hub/blog/butera-a-novel-approach-for-novel-results.md new file mode 100644 index 00000000000..7f1376a1f70 --- /dev/null +++ b/content/replication-hub/blog/butera-a-novel-approach-for-novel-results.md @@ -0,0 +1,41 @@ +--- +title: "BUTERA: A Novel Approach for Novel Results" +date: 2017-08-17 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Crisis of confidence in science" + - "experimental economics" + - "Luigi Butera" + - "Public goods game" + - "replication" +draft: false +type: blog +--- + +###### *[NOTE: This post refers to the article “An Economic Approach to Alleviate the Crises of Confidence in Science: With an Application to the Public Goods Game” by Luigi Butera and John List. The article is available as a working paper, which can be downloaded [here](http://s3.amazonaws.com/fieldexperiments-papers2/papers/00608.pdf).]* + +###### In the process of generating scientific knowledge, scholars sometimes stumble upon new and surprising results. Novel studies typically face a binary fate: either their relevance and validity are dismissed, or findings are embraced as important and insightful. Such judgements, however, commonly rely on statistical significance as the main criterion for acceptance. This poses two problems, especially when a study is the first of its kind. + +###### The first problem is that novel results may be false positives simply because of the mechanics of statistical inference. Similarly, new surprising results that suffer from low power, or marginal statistical significance, may sometimes be dismissed even though they point toward an economic association that is ultimately true. + +###### The second problem has to do with how people should update their beliefs based on unanticipated new scientific evidence. Given the mechanics of inference, it is difficult to provide a definite answer when such evidence is based on one single exploration. To fix ideas, suppose that before running an experiment, a Bayesian scholar had a prior of only 1% that a given result is true. After running the experiment and observing the significant results (significant at, say, the 5% level), the scholar should update his beliefs to 13.9%, a very large increase relative to the initial beliefs. Posterior beliefs can be easily computed, for any given prior, by dividing the probability that a true result is declared true by the probability that *any* result is declared true.  Even more dramatically, a second scholar who for instance had priors of 10%, instead of 1%, would update his posterior beliefs to 64%. The problem is clear: posterior beliefs generated from low priors are extremely volatile when they only depend on evidence provided by a single study. Finding a referee with priors of 10% or 1% can make or break a paper! + +###### The simple solution to this problem is of course to replicate the study: as evidence accumulates, posterior beliefs converge. Unfortunately, the incentives to replicate existing studies are rarely in place in the social sciences: once a paper is published, the original authors have little incentive to replicate their own work. Similarly, the incentives for other scholars to closely replicate existing work are typically very low. 
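###### A minimal sketch of the posterior updating described above, under assumptions the post does not state: every study has 80% power and a 5% false-positive rate (values that reproduce the 13.9% and 64% figures), and each independent replication also comes back significant.

```python
# Hedged illustration only: power = 0.80 and alpha = 0.05 are assumed values,
# chosen because they reproduce the 13.9% and 64% posteriors quoted above.
def posterior(prior, k, power=0.80, alpha=0.05):
    """P(result is true | k independent studies, all statistically significant)."""
    true_path = prior * power ** k          # a real effect detected k times
    false_path = (1 - prior) * alpha ** k   # no effect, k false positives in a row
    return true_path / (true_path + false_path)

for prior in (0.01, 0.10):
    print(prior, [round(posterior(prior, k), 3) for k in range(4)])
# 0.01 [0.01, 0.139, 0.721, 0.976]   <- skeptical referee
# 0.1  [0.1, 0.64, 0.966, 0.998]     <- sympathetic referee
```

###### After two or three successful replications the prior hardly matters anymore, which is the convergence that the preceding paragraph appeals to.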
+ +###### To address this issue, we proposed in our paper a simple incentive-compatible mechanism to promote replications, and generate mutually beneficial gains from trade between scholars. Our idea is simple: upon completion of a study that reports novel results, the authors make it available online as a working paper, but commit never to submit it to a peer-reviewed journal for publication. They instead calculate how many replications they need for beliefs to converge to a desired level, and then offer co-authorship for a second, yet to be written, paper to other scholars willing to independently replicate their study. Once the team of coauthors is established, but before replications begin, the first working paper is updated to include the list of coauthors and the experimental protocol is registered at the AEA RCT registry. This guarantees that all replications, both failed and successful, are accounted for in the second paper. The second paper will then reference the first working paper, include all replications, and will be submitted to a peer-reviewed journal for publication. + +###### We put our mechanism to work on our own experiment where we asked: can cooperation be sustained over time when the quality of a given public good cannot be precisely estimated? From charitable investments to social programs, uncertainty about the exact social returns from these investments is a pervasive characteristic. Yet we know very little about how people coordinate over ambiguous and uncertain social decisions. Surprisingly, we find that the presence of (Knightian) uncertainty about the quality of a public good does not harm, but rather increases cooperation. We interpret our finding through the lenses of conditional cooperation: when the value of a public good is observed with noise, conditional cooperators may be more tolerant to observed reductions in their payoffs, for instance because such reductions may be due, in part, to a lower-than-expected quality of the public good itself rather than solely to the presence of free-riders. However, we will wait until all replications are completed to draw more informed inference about the effect of ambiguity on social decisions. + +###### One final note: while we believe that replications are always desirable, we do not by any means suggest that all experiments, lab or field, necessarily need to follow our methodology. We believe that our approach is best suited for studies that find results that are unanticipated, and in some cases at odds with the current state of knowledge on a topic. This is because in these cases, priors are more likely to be low, and perhaps more sensitive to other factors such as the experience or rank of the investigator. As such, we believe that our approach would be particularly beneficial for scholars at the early stages of their careers, and we hope many will consider joining forces together. + +###### *Luigi Butera is a Post-Doctoral scholar in the Department of Economics at the University of Chicago. He can be contacted via email at lbutera@uchicago.edu.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/08/17/butera-a-novel-approach-for-novel-results/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/08/17/butera-a-novel-approach-for-novel-results/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/byington-felps-on-resolving-the-social-dilemmas-that-lead-to-non-credible-science.md b/content/replication-hub/blog/byington-felps-on-resolving-the-social-dilemmas-that-lead-to-non-credible-science.md new file mode 100644 index 00000000000..d72d0f256e6 --- /dev/null +++ b/content/replication-hub/blog/byington-felps-on-resolving-the-social-dilemmas-that-lead-to-non-credible-science.md @@ -0,0 +1,115 @@ +--- +title: "BYINGTON & FELPS: On Resolving the Social Dilemmas that Lead to Non-Credible Science" +date: 2016-08-24 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Digital Badges" + - "Eliza Byington" + - "HARP" + - "Management Science" + - "replication" + - "social dilemmas" + - "Will Felps" +draft: false +type: blog +--- + +###### In our forthcoming article “Solutions to the credibility crisis in Management science” (full text ***[available here](https://goo.gl/RnqvzE)***), we suggest that “social dilemmas” in the production of Management science put scholars and journal gatekeepers in a difficult position – pitting self-interest *against* the production of credible scientific claims. We argue that recognizing that the credibility crisis in Management science is at least partly a consequence of social dilemmas – and treating it as such – are foundational steps that can help move the field toward adopting the variety of credibility enhancing practices that scientists have been advocating for decades (e.g. Ceci & Walker, 1983; N. L. Kerr, 1998). + +###### Although we are Management scholars rather than economists, we suspect that the social dilemma dynamics we point out (and the solutions we propose) are relevant for improving the credibility of claims produced by many fields (e.g., economics, sociology, anthropology, psychology, criminology, education, political science, medicine, etc.). As such, we are grateful for the invitation from *The Replication Network* to share a summary of our article for your consideration. + +###### **ARTICLE SUMMARY:** + +###### **Credibility Problems in Management Science** + +###### The claims of primary studies in Management cannot be fully relied upon, as evidenced by the fact that a) results fail to replicate much more often than they should, and b) attempts to verify and replicate prior claims rarely appear in the literature (Hubbard & Vetter, 1996). + +###### There is reason to believe that the weak replicability of Management findings may be the result of four sets of troublingly prevalent researcher behaviors (see full manuscript for evidence of prevalence): + +###### — *Unacknowledged “Hypothesizing After the Results are Known”* (N. L. Kerr, 1998); + +###### *— Data manipulation* (also known as p-hacking), which involves exploiting researchers’ “degrees of freedom” – e.g. adding / dropping control variables, dropping uncooperative data points / conditions, using alternative measures / transformations – to find desired results (Goldfarb & King, 2016) – a short simulation after this list illustrates the mechanics; + +###### *— Data fraud*, which involves changing data points or generating data wholesale (John, Loewenstein, & Prelec, 2012); + +###### *— Data hoarding*, which involves an unwillingness to share data or research materials that would allow others to verify whether one’s data is consistent with one’s published conclusions (Wicherts, Bakker, & Molenaar, 2011). 
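###### A small simulation in the spirit of Simmons, Nelson, and Simonsohn (2011), not taken from the article itself, makes the data-manipulation point concrete: there is no true effect below, yet letting the analyst test two correlated outcome measures and add more observations when the first pass fails pushes the false-positive rate well above the nominal 5%.

```python
# Toy example with assumed numbers (n = 20 per group, r = .5 between the two
# outcome measures, 10 extra observations after a failed first pass); it is
# not drawn from Byington & Felps, only meant to illustrate the mechanics.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
cov = [[1.0, 0.5], [0.5, 1.0]]  # two correlated outcomes, no real group difference

def flexible_study(n=20, extra=10, alpha=0.05):
    g1 = rng.multivariate_normal([0, 0], cov, n)
    g2 = rng.multivariate_normal([0, 0], cov, n)

    def any_significant():
        # report whichever of the two outcome measures "worked"
        return min(stats.ttest_ind(g1[:, k], g2[:, k]).pvalue for k in (0, 1)) < alpha

    if any_significant():
        return True
    # optional stopping: collect a few more observations and test again
    g1 = np.vstack([g1, rng.multivariate_normal([0, 0], cov, extra)])
    g2 = np.vstack([g2, rng.multivariate_normal([0, 0], cov, extra)])
    return any_significant()

hits = sum(flexible_study() for _ in range(2000))
print(f"False-positive rate with flexible analysis: {hits / 2000:.1%}")  # well above 5%
```

###### Each individual t-test still has only a 5% error rate; it is the freedom to pick among tests, and to peek and then continue collecting data, that inflates it.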
+
###### As demonstrated in studies such as that of Simmons, Nelson, and Simonsohn (2011), such practices can dramatically increase the likelihood of producing “statistically significant” (but ultimately erroneous) findings. + +###### **Drivers of Non-Credible Research Practices** + +###### We argue that the reason these undesirable research behaviors are so prevalent in Management is that engaging in such behaviors can be beneficial for one’s career, since such behaviors facilitate the production of highly citable (i.e. novel, theory adding, statistically significant) research claims likely to be publishable in high status journals. Of course, engaging in these research behaviors comes with some risk of detection, but the current lack of verification / replication efforts would seem to make the chance of detection low. Thus, scholars are in a social dilemma, where what is good for them individually (i.e. producing highly citable claims) is at odds with what is good for society/science as a whole (producing credible, replicable claims). + +###### There are a variety of journal practices that would significantly decrease the career benefits associated with non-credible research practices, and thus lead to more credible Management science.  They include: + +###### *— Frequent publication of high-quality* *strict* *replications* via dedicated journal space, distinct reviewing criteria for replication studies, provision of replication protocols, and crowd-sourcing replication efforts; + +###### — Enabling *robustness checks* through in-house analysis checks and altered data submission policies; + +###### — Enabling the publication of *null results* through registered reports and results-blind review; + +###### — Adopting *Open Practice article badges* (Center for Open Science, 2015). + +###### However, adoption of these practices has been slow. We propose that one possible reason is that a journal that “sticks its neck out” and adopts these credibility supportive practices is likely to see its status decline.  For example, null findings and replications are rarely cited (Hubbard, 2015), and thus publishing them can reduce a journal’s impact factor. Similarly, requiring scholars to submit their data when competitor journals do not have such a requirement will make the “purist journal” a less attractive publication outlet for scholars, potentially reducing their pool of highly citable submissions. Indeed, each of these credibility enhancing journal practices is likely to lead to research that is both more reliable *and* less citable. This means that journal gatekeepers (editors and reviewers) are themselves trapped in a social dilemma, where what is good for the journal’s status (i.e. high impact factor relative to “competitor journals”) is at odds with what is good for society/science as a whole (i.e. adopting credibility enhancing practices that help ensure more reliable claims). + +###### **Resolving the Social Dilemmas** + +###### Fortunately, social science has accumulated a great deal of knowledge about how to resolve social dilemmas (Kollock, 1998; Van Lange, Balliet, Parks, & Vugt, 2013). Specifically, we suggest three *structural* social dilemma solutions, and two *motivational* social dilemma solutions. + +###### Structural social dilemma solutions involve changing the incentives for journal gatekeepers (Messick & Brewer, 1983). 
We suggest the following structural social dilemma interventions: + +###### *— Define small peer journal groups*: A prerequisite for conditional pledges (below) and other social dilemma solutions is identifying a population of peer (i.e. “competitor”) journals. + +###### *— Conditional pledges by editors*: These are public pledges to adopt certain credibility supportive journal practices if a substantial portion of peer journals also agree to the pledge. This approach is meant to mitigate “relative status costs” of a journal adopting credibility supportive practices. + +###### *— Reviewer pledges*: Credibility-minded reviewers could themselves publicly pledge to (only) review for journals that adopt credibility supportive journal practices to create an incentive for editors to sign onto a conditional pledge with their peer journals. + +###### Motivational social dilemma solutions increase the desire to generously cooperate with others without changing the underlying incentives (Messick & Brewer, 1983). We suggest the following motivational social dilemma interventions: + +###### *— Increase multi-journal communication*: Editors are more likely to cooperate with other journals in adopting credibility supportive journal practices if they discuss the field-level benefits of doing so face-to-face with their peers (i.e., other editors). + +###### *— Inject a moral frame*: Journal editors are more likely to adopt credibility supportive journal practices when such practices are framed as a moral imperative. + +###### Across many fields, there is a growing appetite for improving the way science is done. The social dilemma solutions presented in the article build on the belief that the best hope for resolving the credibility crisis in science is in pragmatic (re)consideration of scholars’ and journal gatekeepers’ incentives for producing credible scientific claims. Until then, we are merely rewarding A while hoping for B (S. Kerr, 1975). + +###### **REFERENCES** + +###### Byington, E. K., & Felps, W. (forthcoming). Solutions to the credibility crisis in management science. *Academy of Management Learning & Education*. + +###### Ceci, S. J., & Walker, E. (1983). Private archives and public needs. *American Psychologist*, *38*(4), 414–423. + +###### Center for Open Science. (2015, January 24). Badges to acknowledge open practices. Retrieved January 26, 2015. + +###### Goldfarb, B. D., & King, A. A. (2016). Scientific apophenia in strategic management research. *Strategic Management Journal*, *37*(1), 167–176. + +###### Hubbard, R. (2015). *Corrupt research: The case for reconceptualizing empirical management and social science*. Newcastle upon Tyne, UK: Sage. + +###### Hubbard, R., & Vetter, D. E. (1996). An empirical comparison of published replication research in accounting, economics, finance, management, and marketing. *Journal of Business Research*, *35*(2), 153–164. + +###### John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. *Psychological Science*, *23*(5), 524–532. + +###### Kepes, S., Banks, G. C., McDaniel, M., & Whetzel, D. L. (2012). Publication bias in the organizational sciences. *Organizational Research Methods*, *15*(4), 624–662. + +###### Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. *Personality and Social Psychology Review*, *2*(3), 196–217. + +###### Kerr, S. (1975). On the folly of rewarding A, while hoping for B. 
*Academy of Management Journal*, *18*(4), 769–783. + +###### Kollock, P. (1998). Social dilemmas: The anatomy of cooperation. *Annual Review of Sociology*, 183–214. + +###### Messick, D. M., & Brewer, M. B. (1983). Solving social dilemmas: A review. *Review of Personality and Social Psychology*, *4*(1), 11–44. + +###### Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. *Psychological Science*, *22*(11), 1359–1366. + +###### Van Lange, P. A. M., Balliet, D. P., Parks, C. D., & Vugt, M. van. (2013). *Social dilemmas: Understanding human cooperation*. Oxford, UK: Oxford University Press. + +###### Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. *PLOS ONE*, *6*(11), e26828. + +### Share this: + +* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2016/08/24/byington-felps-on-resolving-the-social-dilemmas-that-lead-to-non-credible-science/?share=twitter) +* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2016/08/24/byington-felps-on-resolving-the-social-dilemmas-that-lead-to-non-credible-science/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/campbell-is-the-aer-replicable-and-is-it-robust-evidence-from-a-class-project.md b/content/replication-hub/blog/campbell-is-the-aer-replicable-and-is-it-robust-evidence-from-a-class-project.md new file mode 100644 index 00000000000..809e4dbdf44 --- /dev/null +++ b/content/replication-hub/blog/campbell-is-the-aer-replicable-and-is-it-robust-evidence-from-a-class-project.md @@ -0,0 +1,63 @@ +--- +title: "CAMPBELL: Is the AER Replicable? And is it Robust? Evidence from a Class Project" +date: 2016-12-27 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "AER" + - "American Economic Review" + - "Douglas Campbell" + - "Edlin factor" + - "Geography" + - "Macroeconomics" + - "reanalysis" + - "replication" + - "robustness" +draft: false +type: blog +--- + +###### As part of a major replication and robustness project of articles in the American Economic Review, this fall I assigned students in my Master’s Macro course at the New Economic School (Moscow) to replicate and test robustness for Macro papers published in the AER. In our sample of AER papers, 66% had full data available online, and the replicated results were exactly the same as in the paper 72% of the time. However, in all the remaining cases where the data and code were available, the replicated results were approximately the same.  The robustness results were a bit less sanguine: students concluded that 65% of the papers were robust (primarily doing ***[reanalysis](https://replicationnetwork.com/2016/10/27/goldstein-more-replication-in-economics/)*** rather than extensions), while t-scores fell by 31% on average in the robustness checks.  While this work should be seen as preliminary (students had to work under tight deadlines), the results suggest that more work needs to be done on replication and robustness, which should be an integral part of the scientific process in economics. + +###### **The Assignment** + +###### First, each student could choose their own paper and try to replicate the results. 
The students were allowed to switch papers for any reason, such as if the data was not available or the code didn’t work. Then they had to write referee reports on their papers, suggesting robustness checks. Lastly, after a round of comments and suggestions from me, students were to implement their robustness checks and report their results. They were also required to submit their data and code. + +###### **Replication Results** + +###### 24 papers had the full data available, while 12 did not, for an impressive two-thirds ratio (similar to what ***[Chang and Li](https://www.federalreserve.gov/econresdata/feds/2015/files/2015083pap.pdf)***, who test whether economics research is replicable, find: 23/35 in their case). Unfortunately, this is almost certainly an upper-bound estimate, as there may be selection given that students likely chose papers which looked easy to replicate.  In addition, seven students switched away from their first-choice papers without necessarily reporting why in the Google spreadsheet, and others likely switched between several papers just before the deadline in search of papers which were easy to replicate. + +###### Next, in 23 out of 32 cases, when there was full or partial data available, the replication results were exactly the same. In the other nine cases, the results were “approximately” the same, for a fairly impressive 100% replicability ratio. While this is encouraging, a pessimist might note that in just 18 out of 32 papers was there full data and code available that gave exactly the same results as were found in the published version of the paper, and in 24/32 cases were there approximately the same results and full data. + +###### **Robustness Results** + +###### While virtually all the tables for which data were available replicated well, it cannot necessarily be said that the results proved particularly robust. Of the troopers in my class who made it through a busy quarter to test for robustness and filled in their results in the Google sheet, just 15 out of 23 subjectively called the results of the original AER paper “robust”, with the average t-score falling by 31% (similar to an ***[Edlin factor](http://andrewgelman.com/2014/02/24/edlins-rule-routinely-scaling-published-estimates/)***). + +###### While, admittedly, some students might have felt incentivized to overturn papers by hook or crook, for example by running hundreds of robustness tests, this does not appear to be what happened. This is particularly the case since many of the robustness checks were premeditated in the referee reports. On average, students who reversed their results reported doing so on the 8th attempt. + +###### If anything, with the exception of one or two cases, students seemed to be cautious about claiming studies were non-robust. One diligent student found a regression in a paper’s .do file – not reported in the main paper – in which the results were not statistically significant. However, the student also noted that the sample size in that particular regression shrank by one-third, and thus still gave the paper the benefit of the doubt. Other students often found that their papers had clearly heterogeneous impacts by subsample, and yet were cautious enough to still conclude that the key results were robust on the full sample, even if not on subsamples. And, indeed, having insignificant results on a subsample may or may not be problematic, but at a minimum suggests further study is warranted. 
+
###### **“Geographic” Data Papers: Breaking Badly?** + +###### For papers that have economic data arranged geographically, such as papers which look at local labor market effects of a particular shock, or cross-country data, or data from individuals in different areas, the results appeared to be more grim. It often happened that different geographic regions would yield quite different results (not unlike ***[this example](http://andrewgelman.com/2015/12/19/a-replication-in-economics-does-genetic-distance-to-the-us-predict-development/)*** from the QJE). Thus if one splits the sample, and then tests out-of-sample on the remaining data, the initial model often does not validate well. It might not be that the hypothesis is wrong, but it does make one wonder how well the results would test out of sample. The problem here seems to be that geography is highly nonrandom, so that if one regresses any variable y (say, cat ownership) on any other variable x (say, marijuana consumption), one will find a correlation. (This is likely the force which gave rise to the rainfall IV.) However, often these correlations will reverse signs in different regions. Having a strong intuitive initial hypothesis here is important. + +###### For example, one student chose a paper which argued for a large causal effect of inherited *trust* on economic growth – which *a priori* sounded to me like a dubious proposition. The student found that a simple dummy for former communist countries eliminated the significance of the result when added to one of the richer specifications in the paper. + +###### **Concluding Thoughts** + +###### Would this result, that 65% of the papers in the AER are robust, replicate? One wonders if the students had had more time, particularly enough time to do extensions in addition to reanalysis, or if the robustness checks had been carried out by experienced professionals in the field, whether as many papers would have proven robust. In addition, students were probably more likely to choose famous papers – which may or may not be more likely than others to replicate. Thus, in the future we would like to do a random selection of papers to test robustness. In addition, I suspected from the beginning that empirical macro papers are likely to be relatively low-hanging fruit in terms of the difficulty of critiquing the methodology. This suspicion proved correct. While some papers were hard to find faults in, other papers were missing intuitive fixed effects or didn’t cluster, and one paper ran a panel regression in levels of trending variables without controlling for panel-specific trends (which changed the results). + +###### I do believe this is a good exercise for students, conditional on not overburdening them, a mistake I believe I made. The assignment requires students to practice the same skills – coding, thinking hard about identification, and writing – that empirical researchers use when doing actual research. + +###### On the whole, research published in the AER appears to replicate well, but the jury is still out on how robust the AER is. In my view, a robustness ratio of 15/23 = 65% is actually very good, and is a bit better than my initial priors. The evidence from this Russian study does seem to suggest, however, that research using geographic data published in the American Economic Review is no more robust than the American electoral process. This is an institution in need of further fine-tuning. + +###### *Douglas Campbell is an Assistant Professor at the New Economic School in Moscow. 
His webpage can be found at ****.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/12/27/campbell-is-the-aer-replicable-and-is-it-robust-evidence-from-a-class-project/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/12/27/campbell-is-the-aer-replicable-and-is-it-robust-evidence-from-a-class-project/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/campbell-on-perverse-incentives-and-replication-in-science.md b/content/replication-hub/blog/campbell-on-perverse-incentives-and-replication-in-science.md new file mode 100644 index 00000000000..0197aa15c26 --- /dev/null +++ b/content/replication-hub/blog/campbell-on-perverse-incentives-and-replication-in-science.md @@ -0,0 +1,37 @@ +--- +title: "CAMPBELL: On Perverse Incentives and Replication in Science" +date: 2017-03-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Douglas Campbell" + - "Perverse Incentives" + - "replication" +draft: false +type: blog +--- + +###### [NOTE: This is a repost of a blog that Doug Campbell wrote for his blogsite at *douglaslcampbell.blogspot.co.nz*] + +###### ***[Stephen Hsu](http://infoproc.blogspot.ru/2017/02/perverse-incentives-and-replication-in.html)*** has a nice blog post on this topic. He writes about this common pattern: + +###### (1) Study reports results which reinforce the dominant, politically correct, narrative. + +###### (2) Study is widely cited in other academic work, lionized in the popular press, and used to advance real world agendas. + +###### (3) Study fails to replicate, but no one (except a few careful and independent thinkers) notices. + +###### #1 is spot-on for economics. Woe be to she who bucks the dominant narrative. In economics, something else happens. Following the study, there are 20 piggy-back papers which test for the same results on other data. The original authors typically get to referee these papers, so if you’re a young researcher looking for a publication, look no further. You’ve just guaranteed yourself the rarest of gifts — a friendly referee who will likely go to bat for you. Just make sure your results are similar to theirs. If not, you might want to shelve your project, or else try 100 other specifications until you get something that “works”. One trick I learned: You can bury a robustness check which overturns the main results deep in the paper, and your referee who is emotionally invested in the benchmark result for sure won’t read that far. Hsu then writes: “one should be highly skeptical of results in many areas of social science and even biomedical science (see link below). Serious researchers (i.e., those who actually aspire to participate in Science) in fields with low replication rates should (as a demonstration of collective intelligence!) do everything possible to improve the situation. Replication should be considered an important research activity, and should be taken seriously” + +###### That’s exactly right. Most researchers in Economics go their entire careers without criticizing anyone else in their field, except as an anonymous referee, where they tend to let out their pent-up aggression. Journals shy away from publishing comment papers, as I [found out first-hand](http://andrewgelman.com/2015/12/19/a-replication-in-economics-does-genetic-distance-to-the-us-predict-development/). 
In fact, many if not a majority of the papers published in top economics journals are probably wrong, and yet the field soldiers on like a drunken sailor. Often, many people “in the know” realize that many big papers have fatal flaws, but have every incentive not to point this out and create enemies, or to waste their time writing up something which journals don’t really want to publish (the editor doesn’t want to piss a colleague off either). As a result, many of these false results end up getting taught to generations of students. Indeed, I was taught a number of these flawed papers as both an undergraduate and a grad student. What can be done? Well, it would be nice to make replication sexy. I’m currently working on a major replication/robustness project of the AER. In the first stage, we are checking whether results are replicable, using the same data sets and empirical specifications. In the second stage, we plan to think up a collection of robustness checks and out-of-sample tests of papers, and then create an online betting market about which papers will be robust. We plan to let the original authors bet on their own work. Another long-term project is to make a journal ranking system which gives journals points for publishing comment papers. Adjustments could also be made for other journal policies, such as the extent to which a particular journal leeches off the academic community with high library subscription fees, submission fees, and long response times. The AEA should also come out with a new journal split between writing review articles (which tend to be highly cited), and comment papers (which tend not to be). In that case, they could do both well and good. As an individual, you can help the situation by writing a comment paper (maybe [light up somebody](http://andrewgelman.com/2015/12/19/a-replication-in-economics-does-genetic-distance-to-the-us-predict-development/) who isn’t in your main field, like I did). You can also help by citing comment papers, and by rewarding comment papers when you edit and serve as a referee. As an editor, do you really care more about your journal’s citations than truth? You could also engage in playful teasing of your colleagues who haven’t written any comment papers as people who aren’t doing their part to make economics a science. (You could also note that it’s also a form of soft corruption, but I digress…) + +###### *Doug Campbell is an Assistant Professor at the New Economic School in Moscow, Russia. You can follow him on Twitter at @lust4learning. Correspondence regarding this blog can be sent to him at [dolcampb@gmail.com](mailto:dolcampb@gmail.com).* + +### Share this: + +* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2017/03/15/campbell-on-perverse-incentives-and-replication-in-science/?share=twitter) +* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2017/03/15/campbell-on-perverse-incentives-and-replication-in-science/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/chu-henderson-and-wang-us-food-aid-good-intentions-bad-outcomes.md b/content/replication-hub/blog/chu-henderson-and-wang-us-food-aid-good-intentions-bad-outcomes.md new file mode 100644 index 00000000000..16670401481 --- /dev/null +++ b/content/replication-hub/blog/chu-henderson-and-wang-us-food-aid-good-intentions-bad-outcomes.md @@ -0,0 +1,41 @@ +--- +title: "CHU, HENDERSON, AND WANG: US Food Aid — Good Intentions, Bad Outcomes" +date: 2017-09-08 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Civil conflict" + - "Nunn and Qian (2014)" + - "replication" + - "US Food Aid" + - "USAID" +draft: false +type: blog +--- + +###### *[NOTE: This post is based on the paper, **[“](http://dx.doi.org/10.1002/jae.2558)****[The Robust Relationship between US Food Aid and Civil Conflict”](http://dx.doi.org/10.1002/jae.2558)****,** Journal of Applied Econometrics, 2017]* + +###### Replication can often be thought of as a useful tool to train graduate students or as a starting point for a new line of research, but sometimes replication is necessary as a means to check the robustness of results that can directly influence policy. Recently, Nunn and Qian (***[US Food Aid and Civil Conflict](http://dx.doi.org/10.1257/aer.104.6.1630)***, *American Economic Review* 2014; 104: 1630–1666) found that United States (US) food aid increases the incidence and duration of civil conflict in recipient countries. This paper has received significant attention and has even been noticed by the United States Agency for International Development (USAID). + +###### If the results of their study are robust, policymakers can attempt to minimize the predicted negative impacts. In our paper, we first were able to successfully replicate the results of Nunn and Qian (2014) using their data, but alternative software (R instead of Stata). We then attempted to further scrutinize one of their conclusions. Specifically, the authors claim that the adverse effect of US food aid on conflict does not vary across pre-determined characteristics of aid recipient countries (a seemingly strong assumption across sometimes vastly different nations). Nunn and Qian (2014) made attempts to allow for heterogeneity in their regression models by interacting US food aid with these pre-determined characteristics, but this simply amounted to group averages which may miss the underlying heterogeneity. + +###### In order to check for more sophisticated forms of heterogeneity, we used a semiparametric estimation procedure. While the results visually suggested the presence of some amount of heterogeneity, this could not be determined statistically as we were unable to formally reject any of the parametric specifications in Nunn and Qian (2014). The conclusion of such a replication is that their models cannot be rejected using their data and we argue that the results of their paper are robust. + +###### While we rightly criticize studies that cannot be replicated, we should also make note of those that can be replicated. It is typically a non-trivial task and while we were successful, we suggest that this study be further tested with samples from a different set of countries (both recipients and donors) and/or time periods. + +###### *Chi-Yang Chu is an assistant professor of economics at National Taipei University. Daniel J. Henderson is a professor of economics and the J. Weldon and Delores Cole Faculty Fellow at the University of Alabama. 
Le Wang is an associate professor of economics and the Chong K. Liew Chair in Economics at the University of Oklahoma. Correspondence about this blog should be directed to Daniel Henderson at djhender@culverhouse.ua.edu.* + +###### References + +###### [1]  N. Nunn and N. Qian, US Food Aid and Civil Conflict. *American Economic Review*. 104, 1630–1666 (2014) + +###### [2]  C.-Y. Chu, D. J. Henderson and L. Wang. The Robust Relationship between US Food Aid and Civil Conflict. *Journal of Applied Econometrics*. 32, 1027-1032 (2017) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/09/08/chu-henderson-and-wang-us-food-aid-good-intentions-bad-outcomes/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/09/08/chu-henderson-and-wang-us-food-aid-good-intentions-bad-outcomes/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/claire-boeing-reicher-crowdsourcing-a-journal-s-replication-policy.md b/content/replication-hub/blog/claire-boeing-reicher-crowdsourcing-a-journal-s-replication-policy.md new file mode 100644 index 00000000000..87653c59c45 --- /dev/null +++ b/content/replication-hub/blog/claire-boeing-reicher-crowdsourcing-a-journal-s-replication-policy.md @@ -0,0 +1,68 @@ +--- +title: "CLAIRE BOEING-REICHER: Crowdsourcing a Journal’s Replication Policy" +date: 2016-02-06 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "crowdsourcing" + - "Economics E-Journal" + - "Journals" + - "replication policy" +draft: false +type: blog +--- + +###### As reported in a ***[previous blog post](https://replicationnetwork.com/2016/01/27/claire-boeing-reicher-economics-e-journal-launches-a-new-replication-section/)***, the **[*Economics E-Journal*](http://www.economics-ejournal.org)** has launched a new replication section. As part of this initiative, we have developed a set of guidelines for replication submissions. + +###### These guidelines seek to strike a reasonable balance among the needs of replicating authors (a fair chance to publish replications), replicated authors (protection against poorly-done replication studies), and readers (who need to know in a timely manner whether or not economics research is robust). + +###### These guidelines for replication submissions are described at the ***[journal’s website](http://www.economics-ejournal.org/special-areas/replications-1)***. This blog consists of two parts.  In the first part, we provide a summary of our current guidelines.  In the second part, we ask readers for input. + +###### PART I: Current Guidelines + +###### The guidelines for a replication submission at *Economics E-Journal* are summarized as follows. + +###### 1) An assistant editor determines whether the submission is of sufficient merit to be sent through the refereeing process. If not, the paper is desk-rejected. + +###### 2) If a paper passes the first stage, it is sent on to a Co-Editor, who makes a similar determination about merit. If the paper is not of sufficient merit, the paper is desk-rejected. + +###### 3) Then, the replication is sent to the original author, who has a chance to reply within 60 days. This reply is then appended to the submission. + +###### 4) If the paper passes this stage, the paper (with reply) is published as a discussion paper (which is like a working paper). + +###### 5) It is then sent on to two or three anonymous referees, none of whom is the original author. 
These referees submit reports which are posted online.  Commenters can also comment during this time. + +###### 6) After the referee reports are posted, the author may reply to the referees or even update the paper. + +###### 7) After the author replies, a committee of three then makes a decision (generally to publish as a full-fledged journal article with or without specific revisions, or to reject). Any further exchanges between the replicating and original authors are then appended to the published article. + +###### Our complete guidelines for replicators can be found ***[here](http://www.economics-ejournal.org/special-areas/replications-1)***.  While the guidelines are mostly set, we are still seeking input into our procedures, and we will undoubtedly make changes as we gain more experience with replications. + +###### PART II:  How You Can Help + +###### We are seeking input on the following items: + +###### – In light of the guidelines for non-replication submissions, do you believe that the current guidelines for replication submissions are appropriate? + +###### – If not, what concrete suggestions do you have for improvement? + +###### – Do you believe that the 60-day embargo on the discussion paper (to wait for the original author’s reply) makes sense, or should the discussion paper be published as soon as a Co-Editor believes it has sufficient merit? + +###### – Should an embargo be placed instead on publication of the journal article? That is, should publication as a journal article wait until the original author has a chance to reply to the final version of the replication study? + +###### In addition, we are keen to hear any other ideas you have for improving the replication policy at *Economics E-Journal.* + +###### To provide feedback, comment directly on this blog page, or email Claire Boeing-Reicher (Kiel Institute for the World Economy), at Claire.Reicher@ifw-kiel.de . + +###### We look forward to hearing from you. + +###### + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/02/06/claire-boeing-reicher-crowdsourcing-a-journals-replication-policy/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/02/06/claire-boeing-reicher-crowdsourcing-a-journals-replication-policy/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/claire-boeing-reicher-economics-e-journal-opens-a-replication-section.md b/content/replication-hub/blog/claire-boeing-reicher-economics-e-journal-opens-a-replication-section.md new file mode 100644 index 00000000000..3a67b6f3d14 --- /dev/null +++ b/content/replication-hub/blog/claire-boeing-reicher-economics-e-journal-opens-a-replication-section.md @@ -0,0 +1,47 @@ +--- +title: "CLAIRE BOEING-REICHER: Economics E-Journal Opens a Replication Section" +date: 2016-01-27 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Economics E-Journal" + - "Journals" + - "publishing replications" +draft: false +type: blog +--- + +###### The **[*Economics E-Journal*](http://www.economics-ejournal.org)** announces the launch of a dedicated replication section. This initiative is a joint effort of the Kiel Institute for the World Economy (IfW) and the German National Library for Economics (ZBW).  It provides authors across all fields of economics an outlet for publishing replication studies. 
+ +###### This initiative is motivated by the difficulties that authors have had in submitting replication studies to other journals, and by the culture of secrecy within the profession around failed replications. We hope that our initiative can help begin to change these things. + +###### The journal has several unique characteristics that may make it an attractive outlet for researchers looking for outlets for their replication studies. These characteristics are driven by a principle of openness. + +###### This openness shows up in the desire for the journal to open up access to the submission and refereeing processes, speed up those processes while maintaining high standards, and open up access to articles for readers outside major academic institutions. + +###### So far, this openness has met with success. + +###### For instance, the journal currently has an impact factor of 0.644 (JCR Social Sciences Edition 2014), which places it between the *Southern Economic Journal* (IF = 0.683) and *Applied Economics* (IF = 0.613).  Further, *Economics E-Journal’s* impact factor is likely to be biased downward since the journal is not yet ten years old. + +###### The characteristics of the *Economics E-Journal* that make it an ideal outlet for replications are as follows. + +###### First of all, the journal’s electronic format implies that there are no space constraints.  The only constraint on the number of replications that can be published is the quantity and quality of the submitted replication studies. + +###### Secondly, the management of the journal occurs alongside, but independently from, the IfW’s and ZBW’s other journals, and from other journals in the economics profession. This ensures a degree of independence not found in some other journals. + +###### Thirdly, the journal is open access and open evaluation. Open access means that authors can reach a wide audience without running into a paywall.  Open evaluation means that reviewers, commenters, and editors adjudicate papers in a fair, transparent, and rapid way. + +###### Fourthly, the journal is a general interest journal, which means that it accepts submissions from all subfields of economics. + +###### Guidelines for replicators can be found ***[here](http://www.economics-ejournal.org/special-areas/replications-1)***.  While the guidelines are mostly set, we are still seeking input into our procedures.  In my next installment, I will be asking TRN readers for their comments and suggestions.  Stay tuned! + +###### -Claire Boeing-Reicher, Researcher, Kiel Institute for the World Economy. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/01/27/claire-boeing-reicher-economics-e-journal-launches-a-new-replication-section/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/01/27/claire-boeing-reicher-economics-e-journal-launches-a-new-replication-section/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/coffman-wilson-assessing-the-rate-of-replications-in-economics.md b/content/replication-hub/blog/coffman-wilson-assessing-the-rate-of-replications-in-economics.md new file mode 100644 index 00000000000..dac4342af0f --- /dev/null +++ b/content/replication-hub/blog/coffman-wilson-assessing-the-rate-of-replications-in-economics.md @@ -0,0 +1,53 @@ +--- +title: "COFFMAN & WILSON: Assessing the Rate of Replications in Economics" +date: 2017-05-31 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "American Economic Review" + - "economics" + - "replications" +draft: false +type: blog +--- + +###### In our *AER Papers and Proceedings* paper, ***[“Assessing the Rate of Replications in Economics”](https://www.aeaweb.org/articles?id=10.1257/aer.p20171119)*** we try to answer two questions. First, how often do economists attempt to replicate results? Second, how aware are we collectively of replication attempts that do happen? + +###### Going into this project, the two of us were concerned about the state of replication in the profession, but neither of us really knew for sure just how bad (or good) it might be. To get a better handle on the problem, we set out to quantify just how often results produced in subsequent work spoke to the veracity of the core insights in empirical papers (even if this was not the main goal of the follow up work). + +###### We couldn’t answer this for all work ever done, so we needed to limit the exercise to a meaningful subsample. To do this we chose a base set of papers from the AER’s 100th volume, published in 2010. This volume sample therefore represented important, general-interest ideas in economics, and gave all the papers at least 5 years since publication to accrue replication attempts. + +###### We wanted to be fairly comprehensive on the fields we included, but we also wanted to focus on “replication” in a very broad sense: had the core hypothesis of the previous paper been exposed to a retest and incorporated into the published literature? But this broad definition led to a problem on the coding, as we wanted the reader of each volume paper to be an expert in the field providing his or her opinion on whether something was a replication. To solve this, we put together a group of coauthors who possessed expertise across of an array of fields (adding James Berry, Rania Gihleb, and Douglas Hanley to the project). + +###### Assigning the volume papers by specialty, we read through and coded just over 1,500 papers citing one of the 70 empirical papers in our volume sample. For each paper we coded our subjective opinions on whether each was a replication and/or an extension for one of the original paper’s main hypotheses. Alongside this, we also coded more-objective definitions on the relationship of the data in each citing paper to the original, allowing us to compare our top-level replication coding to the definitions given by ***[Michael Clemens](http://onlinelibrary.wiley.com/doi/10.1111/joes.12139/full)***. + +###### The end results from our study indicate that only a quarter of the papers in our volume sample were replicated at least once, while 60 percent had either been replicated or extended at least once. While the replication figure is still lower than we would want, it was higher than we expected. 
Moreover, the papers that were replicated were the most important papers in our sample: Every single volume paper in our sample with 100 published citations had been replicated at least once. Given 50 published citations, the paper was more likely to have been replicated than not. While the quantitative rates differ slightly, this qualitative result is replicated by the findings in the session papers by ***[Daniel Hamermesh](https://www.aeaweb.org/articles?id=10.1257/aer.p20171121)*** and ***[Sandip Sukhtankar](https://www.aeaweb.org/articles?id=10.1257/aer.p20171120)*** (examining very well-cited papers in labor economics, and top-5/field publications in development economics, respectively.) + +###### While the replication rates that we found were certainly higher than we initially expected, one thing that we discovered from the coding exercise was how hard it was to find replications. Our coding exercise was an exhaustive search within all published economics papers citing one of our volume papers. In total we turned up 52 papers that we coded as a replication, where the vast majority of these were *positive* replications. But of these 52, only 18 actually explicitly presented themselves as replications. Simply searching for a paper with a keyword such as “replication” isn’t enough, as many of the replications we found were buried as sub-results within larger papers, for which the replication was not the main contribution. + +###### This hampers awareness of replications. Though one might expect that knowledge of replications is better distributed among the experts within each literature, in a survey we conducted of the volume-paper authors and a subsample of the citing authors, the main finding was substantial uncertainty on the degree to which papers and ideas had been replicated. + +###### Certainly the profession could do a far better job in organizing replications through a ***[market design approach](http://marketdesigner.blogspot.com/2017/05/replicability-and-credibility-of.html)***. In a ***[companion paper](https://www.aeaweb.org/articles?id=10.1257/aer.p20171122)*** to this one that we wrote with Muriel Niederle, we set out some modest proposals for better citation and republication incentives for doing so. But much, much more is possible. + +###### *Lucas Coffman is a Visiting Associate Professor of Economics at Harvard University. Alistair Wilson is an Assistant Professor of Economics at the University of Pittsburgh. Comments/feedback about this blog can be directed to Alistair at [alistair@pitt.edu](mailto:alistair@pitt.edu).* + +###### REFERENCES: + +###### – Berry, James , Lucas Coffman, Rania Gihleb, Douglas Hanley and Alistair J. Wilson. 2017. “Assessing the Rate of Replication in Economics” Am. Econ. Rev P&P, 107 (5): p.27-31 + +###### – Coffman, Lucas, Muriel Niederle and Alistair J. Wilson. 2017. “A Proposal to Incentivize, Promote, and Organize Replications” Am. Econ. Rev P&P, 107 (5): p.41-5 + +###### – Clemens, Michael. 2017. “The Meaning of Failed Replications: A Review and Proposal.” J. Econ. Surv. 31 (1): p.326–42 + +###### – Hamermesh, Daniel S. 2017. “What is Replication? The Possibly Exemplary Example of Labor Economics.” Am. Econ. Rev P&P, 107 (5): p.37-40. + +###### –Sukhtankar, Sandip. 2017. “Replications in Development Economics” Am. Econ. 
Rev P&P, 107 (5): p.32-6 + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/05/31/coffman-wilson-assessing-the-rate-of-replications-in-economics/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/05/31/coffman-wilson-assessing-the-rate-of-replications-in-economics/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/coup-are-replications-worth-it.md b/content/replication-hub/blog/coup-are-replications-worth-it.md new file mode 100644 index 00000000000..3db35662fa7 --- /dev/null +++ b/content/replication-hub/blog/coup-are-replications-worth-it.md @@ -0,0 +1,60 @@ +--- +title: "COUPÉ: Are Replications Worth it?" +date: 2016-12-13 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Political Analysis" + - "replications" + - "Retraction Watch" + - "Tom Coupe" +draft: false +type: blog +--- + +###### Does it make sense for an academic to put effort in replicating another study? While reading a paper in *Political Analysis* (Katz, 2001[[1]](#_ftn1)) in 2005, I noticed a strange thing. In that paper, the author uses simulations to check how biased estimates are if one estimates fixed effects in a logit model by including fixed effects dummies rather than doing conditional logit. + +###### However, the way the author described the fixed effects in the paper suggested that he assumed all fixed effects were equal. This, in fact, means there are no fixed effects, as equal fixed effects are just like a constant term. The author’s Stata code confirmed he indeed generated a ‘true’ model without fixed effects and hence the article’s interpretation was different from what it was actually doing. I fixed the code, re-ran the simulation and wrote up a correction which was also published in *Political Analysis* (Coupé, 2005[[2]](#_ftn2)). The author in his reply admitted the issue (Katz, 2005[[3]](#_ftn3)). + +###### These two articles, Katz (2001) and Coupé (2005) thus provide a clean experiment to assess how a successful replication affects citations of both the replication and the original paper. Both papers were published in the same journal. Katz’s reply (Katz, 2005) shows the author of the original paper agrees with the flaw in the analysis of Katz (2001) so there is no uncertainty about whether the replication or the original is incorrect. And the flaw is at the core of the analysis in Katz (2001). In most replications, only parts of the analyses are shown to be incorrect or not replicable so subsequent citations might refer to the ‘good’ parts of the paper. + +###### I used Google search to find citations of Katz (2001) and Coupé (2005) and then eliminated the citations coming from multiple versions of the same papers. The table below gives the results. + +![coupetable](/replication-network-blog/coupetable.webp) + +###### The table shows that even after publication of the correction, more than 70% of citing papers only cite the Katz study. This remains true even if one restricts the sample to citations from more than 5 years after the publication of the correction. + +###### I also investigated how those papers that cite both Katz (2001) and Coupé (2005) cite these papers. I could find the complete text for 13 out of 15 such papers. 
None indicates the issue with the Katz (2001) study, instead both studies are used as examples of studies that find one can include fixed effects dummies in a logit regression if the number of observations per individual is sufficiently big. While both studies indeed come to that conclusion, the Katz (2001) study could not make that claim based on the analysis it did. This suggest that even those people who at least knew about the Coupé (2005) article also did not really care about this fact. + +###### While the fact that many people continue to cite research that has been shown to be seriously flawed is possibly disappointing, this should not come as a surprise. ***[Retraction Watch (2015)](http://retractionwatch.com/the-retraction-watch-leaderboard/top-10-most-highly-cited-retracted-papers/)*** has a league table of citations given to papers after they have been retracted. + +###### Further, my experience is consistent with the results of ***[Hubbard and Armstrong (1994)](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1114&context=marketing_papers)***.  They find that “Published replications do not attract as many citations after publication as do the original studies, even when the results fail to support the original studies.”  In other words, even after the replication has been published, the original article continues to be cited more frequently than the replication.  This is true even when the results from the original study were overturned by the replication. + +###### Citations are only one measure of “worth.”  But my experience, and the evidence from Hubbard and Armstrong (1994), suggest that replicated research is not valued as highly by the discipline as original research.  Which may be one reason why so little replication research is done. + +###### **REFERENCES** + +###### Coupé, T. (2005). Bias in conditional and unconditional fixed effects logit estimation: A correction. *Political Analysis*, Vol. 13: 292-295. + +###### Hubbard, R. and Armstrong, J.S. (1994). Replications and extensions in marketing – rarely published but quite contrary.  *International Journal of Research in Marketing*, Vol. 11: 233-248. + +###### Katz, E. (2001). Bias in conditional and unconditional fixed effects logit estimation. *Political Analysis*, Vol. 9: 379-384. + +###### Katz, E. (2001). Response to Coupé. *Political Analysis*, Vol. 13: 296-296. + +###### *Tom Coupé is an Associate Professor of Economics at the University of Canterbury, New Zealand.* + +###### [[1]](#_ftnref1) Abstract of Katz (2001): “Fixed-effects logit models can be useful in panel data analysis, when N units have been observed for T time periods. There are two main estimators for such models: unconditional maximum likelihood and conditional maximum likelihood. Judged on asymptotic properties, the conditional estimator is superior. However, the unconditional estimator holds several practical advantages, and therefore I sought to determine whether its use could be justified on the basis of finite-sample properties. In a series of Monte Carlo experiments for T < 20, I found a negligible amount of bias in both estimators when T ≥ 16, suggesting that a researcher can safely use either estimator under such conditions. When T < 16, the conditional estimator continued to have a very small amount of bias, but the unconditional estimator developed more bias as T decreased.” + +###### [[2]](#_ftnref2) Abstract of Coupe (2005). 
“In a recent paper published in this journal, Katz (2001) compares the bias in conditional and unconditional fixed effects logit estimation using Monte Carlo Simulation. This note shows that while Katz’s (2001) specification has ‘‘wrong’’ fixed effects (in the sense that the fixed effects are the same for all individuals), his conclusions still hold if I correct his specification (so that the fixed effects do differ over individuals). This note also illustrates the danger, when using logit, of including dummies when no fixed effects are present”. + +###### [[3]](#_ftnref3) Katz’ (2005) reply. “I agree with the author’s main point. Although I tried to fit a fixed-effects model to the simulated data, those data were generated from a model without fixed effects. In my experiment, therefore, use of the unconditional estimator was perfectly confounded with misspecification of the model. I thank the author for catching this flaw.” + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/12/13/coupe-are-replications-worth-it/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/12/13/coupe-are-replications-worth-it/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/coup-i-tried-to-replicate-a-paper-with-chatgpt-4-here-is-what-i-learned.md b/content/replication-hub/blog/coup-i-tried-to-replicate-a-paper-with-chatgpt-4-here-is-what-i-learned.md new file mode 100644 index 00000000000..8983ce9c786 --- /dev/null +++ b/content/replication-hub/blog/coup-i-tried-to-replicate-a-paper-with-chatgpt-4-here-is-what-i-learned.md @@ -0,0 +1,83 @@ +--- +title: "COUPÉ: I Tried to Replicate a Paper with ChatGPT 4. Here is What I Learned." +date: 2024-04-08 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "AI" + - "ChatGPT" + - "Econometrics" + - "OLS" + - "Python" + - "replication" + - "Stata" +draft: false +type: blog +--- + +Recent research suggests ChatGPT ‘***[aced the test of understanding in college economics](https://journals.sagepub.com/doi/10.1177/05694345231169654)***’,   ChatGPT [‘](https://arxiv.org/abs/2308.06260)***[is effective in stock selection](https://arxiv.org/abs/2308.06260)***[’](https://arxiv.org/abs/2308.06260) , that it “***[can predict future interest rate decisions](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4572831)***” and that using ChatGPT “***[can yield more accurate predictions and enhance the performance of quantitative trading strategies](https://arxiv.org/pdf/2304.07619.pdf)***’. ChatGPT 4 also does ***[econometrics](https://www.timberlake.co.uk/news/chatgpt-4-0)***: when I submitted the dataset and description of one of my econometric case studies, ChatGPT was able to ‘read’ the document, run the regressions and correctly interpret the estimates. + +But can it solve the replication crisis?  That is, can you make ChatGPT 4 replicate a paper? + +To find out the answer to this question, I selected a paper I recently tried to replicate without the help of ChatGPT, so I knew the data needed to replicate the paper were publicly available and the techniques used were common techniques that most graduate students would be able to do.[[1]](#_ftn1) + +I started by asking ChatGPT “can you replicate this paper : . 
+ +ChatGPT answered : “I can’t directly access or replicate documents from external links.” + +I got a similar answer when I asked it to download the dataset used in this paper – while it did find the dataset online, when asked to get the data, ChatGPT answered: ‘I can’t directly access or download files from the internet’. + +****LIMITATION** 1: ChatGPT 4 cannot download papers or datasets from the internet.** + +So I decided to upload paper and the dataset myself – however ChatGPT informed me that ‘currently, the platform doesn’t support uploading files larger than 50 MB’. That can be problematic, the Life in Transition survey used for the paper, for example, is 200MB. + +****LIMITATION** 2: ChatGPT 4 cannot handle big datasets (>50MB).** + +To help ChatGPT, I selected, from the survey, the data needed to construct the variables used in the paper and supplied ChatGPT with this much smaller dataset. I then asked ‘can you use this dataset to replicate the paper’. Rather than replicating the paper, ChatGPT reminded me of the general steps needed to analyse the data, that ‘we’re limited in executing complex statistical models directly here’, demonstrated how to do some analysis in Python and warned that ‘For an IV model, while I can provide guidance, you would need to implement it in a statistical software environment that supports IV estimations, such as R or Stata’. + +While ChatGPT does provide R code when specifically asked for it, ChatGPT seems to prefer Python. Indeed, when I first tried to upload the dataset as an R dataset it answered [‘The current environment doesn’t support directly loading or manipulating R data files through Python libraries that aren’t available here, like **rpy2’**] So I then uploaded the data as a Stata dataset which it accepted. It’s also interesting ChatGPT recommends Stata and R for IV regressions even though IV regressions can be done in Python using the Statsmodels or linearmodels packages. What’s more, at a later stage ChatGPT did use Statsmodels to run the IV regression. + +This focus on Python also limits the useability of ChatGPT to replicate papers for which the code is available – when I supplied the Stata code and paper for one of my own papers, it failed to translate and run the code into Python. + +****LIMITATION** 3: ChatGPT 4 seems to prefer Python.** + +To make life easier for ChatGPT, I next shifted focus to one specific OLS regression: ‘can you try to replicate the first column of table 5 which is an OLS regression’. + +ChatGPT again failed. Rather than focusing on column I which had the first stage of an IV regression, it took the second column with the IV results. And rather than running the regression, it provided some example code as it seemed unable to use the labels of the variables to construct the variables mentioned in the table and the paper. It is true that in the dataset the variable names were not informative (f.e. q721) but the labels attached to each question were informative so I made that explicit in the next step: ‘can you use the variable labels to find the variables corresponding to the ones uses in table 1’? + +ChatGPT was still not able to create the variables and indicated that ‘Unfortunately, without direct access to the questionnaire or detailed variable labels and descriptions, I can provide only a general guide rather than specific variable names.’ + +I therefor upload the questionnaire itself. This helped ChatGPT a lot as it now discussed in more detail which variables were included. 
And while it still did not run the regression, it provided code in R rather than Python! Unfortunately, the code was still very far from what was needed: some needed variables were not included in the regression, some were included but not in the correct functional form, others that did not need to be included were included. ChatGPT clearly has difficulties to think about all the information mentioned in a paper when proposing a specification. + +****LIMITATION** 4: ChatGPT 4 has trouble creating the relevant variables from variable names and labels.** + +Given its trouble with R, I asked ChatGPT to do the analysis Python. But that just lead to more trouble: ‘It looks like there was an issue converting the q722 variable, which represents life satisfaction, directly to a float. This issue can occur if the variable includes non-numeric values or categories that cannot be easily converted to numbers (e.g., “Not stated” or other text responses).’ Papers often do not explicitly state how they handle missing values and ChatGPT did not suggest focusing on ‘meaningful’ observations only.  Once I indicated only values between 0 and 10 should be used, ChatGPT was able to use the life satisfaction variable but ran into trouble again when it checked other categorical variables. + +****LIMITATION** 5: ChatGPT 4 gets into trouble when some part of the data processing is not fully described.** + +I next checked some other explanatory variables. The ‘network’ variable was based on a combination of two variables. ChatGPT, rather than using the paper to find how to construct the variable, described how such variable can be generated in general. Only after I reminded ChatGPT that ‘the paper clearly describes how the network variable was created’, ChatGPT created the variable correctly. + +**LIMITATION 6: ChatGPT 4 needs to be reminded to see the ‘big picture’ and consider all the information provided in the paper.** + +Finally, for the ‘minority’ variable one needed to check whether the language spoken by the mother of the respondent was an official language of the country where the respondent lives. ChatGPT used its knowledge of official languages to create a variable that suggested 97% of the sample belonged to a minority (against about 14% according to the paper’s summary statistics) but realized this was probably a mistake – it noted ‘this high percentage of respondents classified as linguistic minorities might suggest a need to review the mapping of countries to their official languages or the accuracy and representation of mother’s language data ‘ + +After this I gave up and concluded that while ChatGPT 4 can read files, analyse datasets and even run and interpret regressions, it is still very far from being able to be of much help while replicating a paper. That’s bad news for the replication crisis, but good news for those doing replications: there is still some time before those doing replications will be out of jobs! + +**CONCLUSION: ChatGPT 4 does not destroy replicators’ jobs (yet)** + +Full transcripts of my conversation with ChatGPT can be found [***here***](https://github.com/dataisdifficult/ChatGPTColumn). + +*Tom Coupé is a Professor of Economics at the University of Canterbury, New Zealand. He can be contacted at tom.coupe@canterbury.ac.nz*. 
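For readers tempted by the Python route that ChatGPT kept deferring on, an IV regression really is only a few lines with the linearmodels package. The sketch below is purely illustrative — simulated data and made-up variable names, not the paper's actual specification:

```python
# Hedged sketch: a 2SLS regression in Python with linearmodels.
# All data are simulated and all variable names are hypothetical.
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                        # instrument
u = rng.normal(size=n)                        # unobserved confounder
x_endog = 0.8 * z + u + rng.normal(size=n)    # endogenous regressor
age = rng.integers(18, 90, n)                 # exogenous control
y = 1.0 + 0.5 * x_endog - 0.01 * age - u + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x_endog": x_endog, "z": z, "age": age})

# linearmodels formula syntax: the endogenous variable and its instrument
# go inside square brackets.
res = IV2SLS.from_formula("y ~ 1 + age + [x_endog ~ z]", data=df).fit(cov_type="robust")
print(res.params["x_endog"])  # should land near the true coefficient of 0.5
```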
+ +--- + +[[1]](#_ftnref1) For a paper analysing how wars affect happiness, my co-authors and I tried to replicate 5 papers, the results can be found [***here***](https://dataisdifficult.github.io/PAPERLongTermImpactofWaronLifeSatisfaction.html), + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/04/08/coupe-i-tried-to-replicate-a-paper-with-chatgpt-4-here-is-what-i-learned/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/04/08/coupe-i-tried-to-replicate-a-paper-with-chatgpt-4-here-is-what-i-learned/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/coup-why-you-should-add-a-specification-curve-analysis-to-your-replications-and-all-your-papers.md b/content/replication-hub/blog/coup-why-you-should-add-a-specification-curve-analysis-to-your-replications-and-all-your-papers.md new file mode 100644 index 00000000000..484db727e38 --- /dev/null +++ b/content/replication-hub/blog/coup-why-you-should-add-a-specification-curve-analysis-to-your-replications-and-all-your-papers.md @@ -0,0 +1,52 @@ +--- +title: "COUPÉ: Why You Should Add a Specification Curve Analysis to Your Replications – and All Your Papers!" +date: 2024-05-09 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Garden of Forking Paths" + - "Many analyst projects" + - "Specification curve analysis" + - "War and happiness" +draft: false +type: blog +--- + +When making a conclusion based on a regression, we typically need to assume that the specification we use is the ‘correct’ specification. That is, we include the right control variables, use the right estimation technique, apply the right standard errors, etc. Unfortunately, most of the time theory doesn’t provide us with much guidance about what is the correct specification. To address such ‘model uncertainty’, many papers include robustness checks that show that conclusions remain the same whatever changes one makes to the main specification. + +While when reading published papers, it’s rare to see specifications that do not support the main conclusions of a paper, many people who analyse data themselves quickly realize regression results often are much more fragile than what the published literature seems to suggest. For some, this even might lead to existential questions like “Why do I never get clean results, while everybody else does?”. + +The recent literature based on ***[‘many-analyst’ projects](https://www.iza.org/publications/dp/13233/the-influence-of-hidden-researcher-decisions-in-applied-microeconomics)*** confirms however that when different researchers are given the same research question and the same dataset, they often will come to different conclusions. Sometimes you can even observe such many-analyst project in real life: ***[in a recent paper](https://dataisdifficult.github.io/PAPERLongTermImpactofWaronLifeSatisfaction.html)***, my co-authors and I replicate several published papers that all use data from the same “Life in Transition Survey’ to estimate the long-term impact of war on life satisfaction. But while one paper concludes that there is a positive and significant effect, another concludes that there is a negative and significant effect, while a third one concludes there is no significant effect. + +Interestingly, we can replicate the findings of these three papers so these differing findings cannot be explained by coding errors. 
Instead, it’s how these authors choose to specify their model that drove these differing results. + +To illustrate the impact of specification choices on outcomes we use the [s](https://github.com/masurp/specr)***[pecr R-package](https://github.com/masurp/specr)***.  To use specr, you indicate what are reasonable choices for your dependent variable, for your independent variable of interest, for your control variables, for your estimation model and for your sample restrictions.  The general format is given below. + +[![](/replication-network-blog/image-2.webp)](https://replicationnetwork.com/wp-content/uploads/2024/05/image-2.webp) + +After specifying this snippet of code, specr will run all possible combinations and present them in two easy-to-understand graphs. For example, in our paper, we used 2 dependent variables (life satisfaction on a scale from 1-10 and life satisfaction on a scale from 1 to 5),  one main variable of interest (injured or having relatives injured or killed during World War II), 5 models (based on how fixed effects and clusters were defined in 5 different published papers), 8 sets of controls (basic controls, additional war variables, income variables, other additional controls, etc.) and 4 datasets (the full dataset, respondents under 65 years old, those living in countries heavily affected by World War II, and under 65s living in heavily affected countries). This gave a total of 320 regression specifications. + +The first graph produced by specr plots the specification curve, a curve showing all estimates of impact of the variable-of-interest on the outcome, and the standard errors, ordered from smallest to largest, giving an idea of the extent to which model uncertainty affects outcomes. + +[![](/replication-network-blog/image-1.webp)](https://replicationnetwork.com/wp-content/uploads/2024/05/image-1.webp) + +In the case of our paper, the specification curve showed a wide range of estimates of the impact of experiencing war on life satisfaction (from -0.5 to +0.25 on a scale of 1 to 5/10), with negative estimates often being significant (significant estimates are in red, grey is insignificant). + +The second graph shows estimates by each specification-choice, illustrating what drives the heterogeneity in outcomes. In the case of our paper, we found that from the moment we controlled for a measure of income the estimate of war on life satisfaction became less negative and insignificant! + +[![](/replication-network-blog/image-3.webp)](https://replicationnetwork.com/wp-content/uploads/2024/05/image-3.webp) + +Given the potential importance of choices the researchers make on outcomes, it makes sense, when replicating a paper, to not just exactly replicating the authors specifications. Robustness checks in papers typically check how changing the specification in one dimension affects the outcome. A specification curve, however, allows to illustrate what happens if we look at all possible combinations of the robustness checks done in a paper.  Moreover, programs like specr allow to easily check what happens if one adds other variables, include fixed effects or clusters at different level of aggregation, or restricts the sample in this or that way. In other words, you can illustrate the effects of model uncertainty in a much more comprehensive way than is typically done in a paper. + +And why restrict this to replication papers only? Why not add a comprehensive specification curve to all your papers, showing the true extent of robustness in your own analysis too? 
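For those who prefer to stay outside R, the same idea can be brute-forced in a short Python script: enumerate the reasonable choices, estimate every combination, and sort the estimates into a curve. The sketch below only illustrates the logic behind specr — the data and variable names are entirely hypothetical:

```python
# Hedged sketch of a hand-rolled specification curve (not the specr package):
# loop over control sets and sample restrictions, keep the coefficient of
# interest from each specification, and sort the results.
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "lifesat": rng.integers(1, 11, n).astype(float),   # outcome (1-10 scale)
    "war_exposure": rng.integers(0, 2, n),             # variable of interest
    "age": rng.integers(18, 90, n),
    "female": rng.integers(0, 2, n),
    "income": rng.normal(0, 1, n),
})

control_sets = [[], ["age"], ["age", "female"], ["age", "female", "income"]]
samples = {"full": df, "under_65": df[df["age"] < 65]}

rows = []
for controls, (label, sub) in itertools.product(control_sets, samples.items()):
    formula = "lifesat ~ war_exposure" + "".join(f" + {c}" for c in controls)
    fit = smf.ols(formula, data=sub).fit(cov_type="HC1")   # robust standard errors
    ci = fit.conf_int().loc["war_exposure"]
    rows.append({"sample": label,
                 "controls": "+".join(controls) or "none",
                 "estimate": fit.params["war_exposure"],
                 "ci_low": ci[0], "ci_high": ci[1]})

curve = pd.DataFrame(rows).sort_values("estimate").reset_index(drop=True)
print(curve)   # plotting each estimate against its rank gives the specification curve
```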
In the process, you will perform a great service to many researchers, showing that they are not the only one getting estimates that are all over the place; and help science, by providing a more accurate picture of how sure we can be about what we know and what we do not know. + +*Tom Coupé is a Professor of Economics at the University of Canterbury, New Zealand. He can be contacted at tom.coupe@canterbury.ac.nz*. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/05/09/coupe-why-you-should-add-a-specification-curve-analysis-to-your-replications-and-all-your-papers/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/05/09/coupe-why-you-should-add-a-specification-curve-analysis-to-your-replications-and-all-your-papers/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/coup-why-you-should-use-quarto-to-make-your-papers-more-replicable-and-your-life-easier.md b/content/replication-hub/blog/coup-why-you-should-use-quarto-to-make-your-papers-more-replicable-and-your-life-easier.md new file mode 100644 index 00000000000..e7ce28e1875 --- /dev/null +++ b/content/replication-hub/blog/coup-why-you-should-use-quarto-to-make-your-papers-more-replicable-and-your-life-easier.md @@ -0,0 +1,47 @@ +--- +title: "COUPÉ: Why You Should Use Quarto to Make Your Papers More Replicable (and Your Life Easier!)" +date: 2024-06-21 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Automatic updating" + - "coding" + - "Life Satisfaction" + - "Quarto" + - "R" + - "R Markdown" + - "Replicability" + - "War" + - "writing" +draft: false +type: blog +--- + +An important part of writing a paper is polishing the paper. You start with a first draft but then you find small mistakes, things to add or to remove. Which leads to redoing the analysis and a second, third and fourth draft of the paper.  And then you get comments at seminars and from referees, further increasing the number of re-analyses and re-writes. + +An annoying part of this process is that whenever you update your code and get new results, you also need to update the numbers and tables in your draft paper. Reformatting the same MS Word table for the 5th time is indeed frustrating. But it is also bad for the replicability of papers as it’s so easy to update the wrong column of a table or forget to update a number. + +When writing [***my latest paper***](https://dataisdifficult.github.io/PAPERLongTermImpactofWaronLifeSatisfaction.html) on the long term impact of war on life satisfaction, I discovered how [***Quarto***](https://quarto.org/) allows one to solve these problems by enabling one to create code and paper from one document. It’s like writing your whole paper, text and code in Stata; or it’s like writing your whole paper, text and code in MS Word. True, R Markdown allows this too, but Quarto makes the process easier as it has MS Word-like drop down menus so you need to know less coding, making the learning curve substantially easier! + +Whenever I now update the code for my latest paper, the text version gets updated automatically since every number and every table in the text is linked directly to the code! People who want to replicate the paper will also waste less time finding where in the code is the bit for table 5 from the paper, as the text is wrapped around the code so the code for table 5 is next to the text for table 5. + +And there’s more! 
The R folder that has your code and datasets can easily be linked to [***Github***](https://github.com/dataisdifficult/war) so no more need to upload replication files to OSF or Harvard Dataverse! + +And did I tell you documents in Quarto can be printed as pdf or word, and even html so you can publish your paper as a website: [***click here for an example***](https://dataisdifficult.github.io/PAPERLongTermImpactofWaronLifeSatisfaction.html). + +How cool is that! + +And did I tell you that you can use Quarto to create slides that are nicer than PPT and that can be linked directly to the code, so updating the code also means updating the numbers and tables in the slides? + +Now I realize that, for the older reader, the cost of investing in R and Quarto might be prohibitive. But for the younger generation, there can only be one advice: drop Stata, drop Word, go for the free software that will make your life easier and science more replicable! + +*Tom Coupé is a Professor of Economics at the University of Canterbury, New Zealand. He can be contacted at tom.coupe@canterbury.ac.nz*. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/06/21/coupe-why-you-should-use-quarto-to-make-your-papers-more-replicable-and-your-life-easier/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/06/21/coupe-why-you-should-use-quarto-to-make-your-papers-more-replicable-and-your-life-easier/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/coville-vivalt-should-we-trust-evidence-on-development-programs.md b/content/replication-hub/blog/coville-vivalt-should-we-trust-evidence-on-development-programs.md new file mode 100644 index 00000000000..f9a3f7fa53e --- /dev/null +++ b/content/replication-hub/blog/coville-vivalt-should-we-trust-evidence-on-development-programs.md @@ -0,0 +1,46 @@ +--- +title: "COVILLE & VIVALT: Should We Trust Evidence On Development Programs?" +date: 2018-01-31 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Development Economics" + - "False Negative Rate" + - "false positive rate" + - "Impact evaluation" + - "Power" + - "Research credibility" + - "Type M error" + - "Type S error" +draft: false +type: blog +--- + +###### *[From the working paper, “How Often Should We Believe Positive Results? Assessing the Credibility of Research Findings in Development Economics” by Aidan Coville and Eva Vivalt]* + +###### Over $140 billion is spent on donor assistance to developing countries annually to promote economic development. To improve the impact of these funds, aid agencies both produce and consume evidence about the effects of development interventions to inform policy recommendations. But how reliable is the evidence that development practitioners use?  Given the “replication crisis” in psychology, we may wonder how studies in international development stack up. + +###### There are several reasons that a study could fail to replicate. First, there may be changes in implementation or context between the original study and the replication, particularly in field settings, where most applied development economics research takes place. Second, publication bias can enter into the research process. Finally, studies may simply fail to replicate due to statistical reasons. Our analysis focuses on this last issue, especially as it relates to statistical power. 
+ +###### Ask a researcher what they think a reasonable power level is for a study and, inevitably, the answer will be “at least 80%”. The textbook suggestion of “reasonable” and the reality are, however, quite different. Reviews for the medical, economic and general social sciences literature estimate median power to be in the range of 8% – 24% (Button et al., 2013; Ioannidis et al., forthcoming; Smaldino & McElreath, 2016). This reduces the likelihood of identifying an effect when it is present. Importantly, however, this also increases the likelihood that a statistically significant result is spurious and exaggerated (Gelman & Carlin, 2014). In other words, the likelihood of false negatives *and* false positives depends critically on the power of the study. + +###### To explore this issue, we follow Wacholder et al. (2004)’s “false positive report probability” (FPRP), an application of Bayes’ rule that leverages estimates of a study’s power, the significance level, and the prior belief that an intervention is likely to have a meaningful impact to estimate the likelihood that a statistically significant effect is spurious. Using this approach, Ioannidis (2005) estimates that more than half of the significant published literature in biomedical sciences could be false. A recent paper by Ioannidis et al. (2017) finds 90% of the more general economic literature is under-powered. As further measures of study credibility, we explore Gelman & Tuerlinckx (2000)’s errors of sign (Type S errors) and magnitude (Type M errors), respectively the probability that a given significant result has the wrong sign and the degree to which it is likely exaggerated compared to the true effect. + +###### In order to calculate these statistics for a particular study, an informed estimate of the underlying “true” effect of the intervention being studied is needed. The standard approach in the literature is to use meta-analysis results as the benchmark. This is possible in settings where a critical mass of evidence is available, but that kind of evidence is not always available, and meta-analysis results may themselves be biased depending on the studies that are included. As an alternative approach to estimate the likely “true” effect sizes of each study intervention, we gathered up to five predictions from each of 125 experts covering 130 different results across typical interventions in development economics. This was used to estimate the power and consequently false positive or negative report probabilities for each study. To focus on those topics that were the most well-studied within development, we looked at the literature on cash transfers, deworming programs, financial literacy training, microfinance programs, and programs that provided insecticide-treated bed nets. + +###### Our findings in this subset of studies are less dramatic than estimates for other disciplines. The median power was estimated to be between 18% and 59%, largely driven by large-scale conditional cash transfer programs. Experts predict that interventions will have a meaningful impact approximately 60% of the time, across interventions. With these inputs, we calculate the median FPRP to be between 0.001 and 0.008, compared to the median significant p-value of 0.002. The likelihood of a significant effect having the wrong sign (Type S error) is close to 0 while the median exaggeration factor (Type M error) of significant results is estimated to be between 1.2 and 2.2. 
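###### To make the arithmetic concrete, a minimal sketch of the two ingredients behind these numbers — the power of a test for a given true effect, and Wacholder et al. (2004)'s false positive report probability — might look as follows. The inputs are purely illustrative, not the study's data, though values in the ranges reported above produce an FPRP of the same order of magnitude.

```python
# Hedged sketch: power of a two-sided test and the false positive report
# probability (FPRP) of Wacholder et al. (2004). Illustrative inputs only.
from scipy import stats

def power_two_sided(true_effect, se, alpha=0.05):
    """Power of a two-sided z-test when the true effect is `true_effect`
    and the estimate has standard error `se`."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    snr = true_effect / se
    return stats.norm.cdf(snr - z_crit) + stats.norm.cdf(-snr - z_crit)

def fprp(power, alpha, prior):
    """P(no true effect | significant result), given the test's power, the
    significance threshold used, and the prior probability that a
    meaningful effect exists."""
    return alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)

print(round(power_two_sided(true_effect=0.20, se=0.07), 2))   # roughly 0.8
# Power of 59%, a 60% prior that the intervention works, and a significance
# threshold at the median significant p-value of 0.002:
print(round(fprp(power=0.59, alpha=0.002, prior=0.60), 4))    # ~0.002
```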
+ +###### In short, the majority of studies reviewed fair exceptionally well, particularly when referenced against other disciplines that have performed similar exercises.  We must emphasize that other study topics in development economics not covered in this review may be less credible; conditional cash transfer programs, in particular, tend to have very large sample sizes and thus low p-values. The broader contribution of the paper is to highlight how analysis of study power and the systematic collection of priors can help assess the quality of research, and we hope to see more work in this vein in the future. + +###### To read the working paper, [***click here***](https://osf.io/preprints/bitss/5nsh3/). + +###### *Aidan Coville is an Economist in the Development Impact Evaluation Team (DIME) of the Development Research Group at the World Bank. Eva Vivalt is a Lecturer in the Research School of Economics at Australian National University. They can be contacted at* [*acoville@worldbank.org*](mailto:acoville@worldbank.org) *and eva.vivalt@anu.edu.au, respectively.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/01/31/5014/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/01/31/5014/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/cox-craig-tourish-publishers-cannot-be-coy.md b/content/replication-hub/blog/cox-craig-tourish-publishers-cannot-be-coy.md new file mode 100644 index 00000000000..3892f594130 --- /dev/null +++ b/content/replication-hub/blog/cox-craig-tourish-publishers-cannot-be-coy.md @@ -0,0 +1,53 @@ +--- +title: "COX, CRAIG, & TOURISH: Publishers Cannot Be Coy" +date: 2018-04-25 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "data fabrication" + - "economics" + - "Journal policies" + - "malpractice" + - "publishers" + - "retraction" + - "Times Higher Education" +draft: false +type: blog +--- + +###### *[This blog is a **[repost](https://www.timeshighereducation.com/opinion/publishers-cannot-afford-be-coy-about-ethical-breaches)** from the article “Publishers cannot afford to be coy about ethical breaches” published April 19th, 2018 in the Times Higher Education by Adam Cox, Russell Craig, and Dennis Tourish.]* + +###### There are rising concerns about the reliability of academic research, yet even when papers are retracted, the reasons are often left unexplained. + +###### We recently studied 734 peer-reviewed journals in economics and identified 55 papers retracted for reasons other than “accidental duplication” or “administrative error”. Of those, 28 gave no clear indication of whether any questionable research practice was involved. It appears likely that it was: the reasons given for retraction in the other 27 papers include fake peer review, plagiarism, flawed reasoning, and multiple submission. + +###### For 23 of the 28 “no reason” retractions, it is not even clear who instigated them: the editor alone, the author alone, or both in concert. + +###### This reticence means that other papers by the same authors may not be investigated – as they should be – and are left in circulation. The feelings of authors may be spared, but the disincentives for them and others to engage in malpractice are reduced. 
+ +###### Many publishers refer approvingly to the guidelines of the Committee on Publication Ethics and the International Committee of Medical Journal Editors, which require the disclosure of a clear reason for retraction. However, we found that publishers’ policy statements on retraction are often ambiguous and unclear about what action they will take in response to serious research-related offences. + +###### Perhaps the publishers are reluctant to embarrass themselves. Or perhaps they are intimidated by the possibility of legal action. But apart from their ethical obligations, they should recognise that the growing awareness of malpractice is diminishing public confidence in research integrity. + +###### Publishers will claim that they safeguard research quality by providing a level of editorial scrutiny that keeps poor scholarship out of journals. If that claim is diluted, so is much of that unique selling point. However, if publishers take robust action against malpractice, they will have a stronger claim that they add value to the publishing process when it comes to safeguarding standards. + +###### In our view, the publisher of a journal retracting a paper for research malpractice should be obliged to alert other journals that have published papers by the same authors. In egregious cases, such as those involving data fabrication, those journals’ editors should be required to audit the papers. + +###### Relatedly, publishers should require submitting authors to make their data available in a way that facilitates inspection, re-analysis and replication. This would act as a bulwark against data fraud and poor statistical analysis. Such a requirement is reasonably widespread in the physical and life sciences, but it still tends to be confined to the top echelon of journals in economics. This may help explain why we found no articles retracted because of data fabrication. + +###### Greater diligence is warranted. The Research Papers in Economics Plagiarism Committee is an international group of academic volunteers, mostly economists, who look into possible cases of plagiarism. They are well known in the economics community and have, to date, identified seven papers as involving malpractice. As we write, none of these have been retracted or corrected. + +###### Nor is the social science community particularly diligent at watermarking those papers that are retracted. An article retracted by the *American Economic Review* in 2007, for instance, is still not identified as retracted anywhere in the document. Failure to mark flawed papers runs the risk that defective work might continue to be cited and influence scholarly thinking. + +###### Journals must be more proactive. Failure to take serious actions against malpractice in scholarly publications is harming the integrity of research. Publishers and editors are critical gatekeepers. They cannot go on demanding full transparency from authors while being so non-transparent themselves. + +###### *Adam Cox is a senior lecturer in economics and finance and Russell Craig is professor of accounting and financial management, both at the University of Portsmouth. Dennis Tourish is professor of leadership and organisation studies at the University of Sussex. 
This is an abridged version of their paper, “Retraction statements and research malpractice in economics”, published in Research Policy.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/04/25/cox-craig-tourish-publishers-cannot-be-coy/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/04/25/cox-craig-tourish-publishers-cannot-be-coy/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/duan-reed-how-are-meta-analyses-different-across-disciplines.md b/content/replication-hub/blog/duan-reed-how-are-meta-analyses-different-across-disciplines.md new file mode 100644 index 00000000000..b927ca12553 --- /dev/null +++ b/content/replication-hub/blog/duan-reed-how-are-meta-analyses-different-across-disciplines.md @@ -0,0 +1,135 @@ +--- +title: "DUAN & REED: How Are Meta-Analyses Different Across Disciplines?" +date: 2021-05-18 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Disciplines" + - "Effect size" + - "Estimation" + - "Fixed Effects" + - "Journals" + - "Meta-analysis" + - "Random Effects" +draft: false +type: blog +--- + +**INTRODUCTION** + +Recently, one of us gave a workshop on how to conduct meta-analyses. The workshop was attended by participants from a number of different disciplines, including economics, finance, psychology, management, and health sciences. During the course of the workshop, it became apparent that different disciplines conduct meta-analyses differently. While there is a vague awareness that this is the case, we are unaware of any attempts to quantify those differences. That is the motivation for this blog. + +We collected recent meta-analyses across a number of different disciplines and recorded information on the following characteristics: + +– Size of meta-analysis sample, measured both by number of studies and number of estimated effects included in the meta-analysis + +– Type of effect size + +– Software package used + +– Procedure(s) used to estimate effect size + +– Type of tests for publication bias + +– Frequency that meta-analyses report (i) funnel plots, (ii) quantitative tests for publication bias, and (iii) meta-regressions. + +Unfortunately, given the large number of meta-analyses, and large number of disciplines that do meta-analyses, we were unable to do an exhaustive analysis. Instead, we chose to identify the disciplines that publish the most meta-analyses, and then analyse the 20 most recent meta-analyses published in those disciplines. + +**LITERATURE SEARCH** + +To conduct our search, we utilized the library search engine at our university, the University of Canterbury. This search engine, while proprietary to our university, allowed us to simultaneously search multiple databases by discipline (see below). + +[![](/replication-network-blog/trn120210518.webp)](https://replicationnetwork.com/wp-content/uploads/2021/05/trn120210518.webp) + +We conducted our search in January 2021. We used the keyword “meta-analysis”, filtering on “Peer-reviewed” and “Journal article”, and restricted our search depending on publication date. A total of 58 disciplines were individually searchable, including Agriculture, Biology, Business, Economics, Education, Engineering, Forestry, Medicine, Nursing, Physics, Political Science, Psychology, Public Health, Sociology, Social Welfare & Social Work, and Zoology. 
+ +Of the 58 disciplines we could search on, 18 stood out as publishing substantially more meta-analyses than others. These are listed below. For each discipline, we then searched for all meta-analyses/”Peer-reviewed”/”Journal article” that were published in January 2021, sorted by relevance. We read through the title and abstract until we found 20 meta-analyses. If January 2021 produced less than meta-analyses for a given discipline, we extended the search back to December 2020. In this manner, we constructed a final sample of 360 meta-analyses. The results are reported below. + +**NUMBER OF STUDIES** + +TABLE 1 below reports mean, median, and minimum number of studies for each sample of 20 meta-analyses corresponding to the 18 disciplines. Maximum values are indicated by green shading. Minimum values are indicated by blue. + +The numbers indicate wide differences across disciplines in the number of studies included in a “typical” meta-analysis. Business meta-analysis tend to have the largest number of studies with mean and median values of 87.6 and 88 studies, respectively. Ecology and Economics also typically include large numbers of studies. + +On the other side, disciplines in the health sciences (Dentistry, Diet & Clinical Nutrition, Medicine, Nursing, and Pharmacy, Therapeutics & Pharma) include relatively few studies. The mean and median number of studies included in meta-analyses in Diet & Clinical Nutrition are 13.9 and 11; and 14.8 and 10 for Nursing, respectively. We even found a meta-analysis in Dentistry that only included [***2 studies***](https://onlinelibrary.wiley.com/doi/full/10.1111/idh.12477). + +[![](/replication-network-blog/table120210518.webp)](https://replicationnetwork.com/wp-content/uploads/2021/05/table120210518.webp) + +**NUMBER OF EFFECTS** + +Meta-analyses differ not only in number of studies, but the total number of observations/estimated effects they include. In some fields, it is common to include a representative effect, or the average effect from that study. Other disciplines include extensive robustness checks, where the same effect is estimated multiple times using different estimation procedures, variable specifications, and subsamples. Similarly, there may be multiple measures of the same effect, sometimes included in the same equation, and these produce multiple estimates. + +Measured by number of estimated effects, Agriculture has the largest meta-analyses with mean and median sample sizes of 934 and 283. Not too far behind are Economics and Business. These three disciplines are characterized by substantially larger samples than other disciplines. As with number of studies, the disciplines with the smallest number of effects per study are health-related fields such as Dentistry, Diet & Clinical Nutrition, Medicine, Nursing, Pharmacy, Therapeutics & Pharma, and Public Health. + +[![](/replication-network-blog/table220210518.webp)](https://replicationnetwork.com/wp-content/uploads/2021/05/table220210518.webp) + +**MEASURES OF EFFECT SIZE** + +Disciplines also differ in the effects they measure. We identified four main types of effects: (i) Mean Differences, including standardized mean differences, Cohen’s d, and Hedge’s g; (ii) Odds-Ratios; (iii) Risk Ratios, including Relative Risk, Response Ratios, and Hazard Ratios; (iiia) Correlations, including Fisher’s z; (iiib) Partial Correlations, and (iv) Estimated Effects. + +We differentiate correlations from partial correlations because the latter primarily appear in Economics. 
Likewise, Economics is somewhat unique because the range of estimated effects varies widely across primary studies, with studies focusing on things like elasticities, various treatment effects, and other effects like fiscal multipliers or model parameters. The table below lists the most common and second most common effect sizes investigated by meta-analyses across the different disciplines. + +[![](/replication-network-blog/table320210518.webp)](https://replicationnetwork.com/wp-content/uploads/2021/05/table320210518.webp) + +Why does it matter that meta-analyses differ in their sizes and estimated effects? In a recent study, ***[Hong and Reed (2021)](https://onlinelibrary.wiley.com/doi/full/10.1002/jrsm.1467)*** present evidence that the performance of various estimators depends on the size of the meta-analyst’s sample. They provide an ***[interactive ShinyApp](https://hong-reed.shinyapps.io/HongReedInteractiveTables/)*** that allows one to filter performance measures by various study characteristics in order to identify the best estimator for the specific research situation. Performance may also depend on the type of effect being estimated (***[see here](https://ideas.repec.org/p/cbt/econwp/20-08.html)*** for some tentative experimental evidence on partial correlations). + +**ESTIMATION – Estimators** + +One way in which disciplines are very similar is in their reliance on the same estimators to estimate effect sizes. TABLE 4 reports the two most common estimators by discipline. Far and away the most common estimator is the Random Effects estimator, which allows for heterogeneous effects across studies. + +The second most common estimator is the Fixed Effects estimator, which is built on the assumption of a single population effect, whereby studies produce different estimated effects due only to sampling error. A close relative of the Fixed Effects estimator that is common in Economics is the ***[Weighted Least Squares estimator](https://onlinelibrary.wiley.com/doi/full/10.1002/sim.6481?casa_token=K9xDceRWgAUAAAAA%3ARlsUBjb13M-vT99SEw8MnHtgbc3_QJjIrQetu9xJbfiHbi2wz5TPGsSoK0R_uLgkiidZ5P4_RhumbeEf)*** of Stanley and Doucouliagos. This estimator produces coefficient estimates identical to the Fixed Effects estimator, but with different standard errors. Despite Random Effects being the most common estimator, ***[Hong and Reed (2021)](https://onlinelibrary.wiley.com/doi/full/10.1002/jrsm.1467)*** show that it frequently underperforms relative to other meta-analytic estimators. + +[![](/replication-network-blog/table420210518.webp)](https://replicationnetwork.com/wp-content/uploads/2021/05/table420210518.webp) + +**SOFTWARE PACKAGES** + +Another way in which disciplines differ is with respect to the software packages they use. These include a number of standalone packages such as ***[MetaWin](https://psycnet.apa.org/record/1997-09001-000)***, ***[RevMan](https://training.cochrane.org/online-learning/core-software-cochrane-reviews/revman)*** (for Review Manager), and ***[CMA](https://www.meta-analysis.com/)*** (for Comprehensive Meta-Analysis), as well as packages designed to be used in conjunction with comprehensive software programs such as R and Stata. + +A frequently used R package is ***[metafor](https://www.metafor-project.org/doku.php)***. Stata has a built-in meta-analysis suite called ***[meta](https://www.stata.com/manuals/meta.pdf)***. In addition to these packages, many researchers have customized their own programs to work with R or Stata. 
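For researchers who do customize their own routines, the two workhorse estimators described above take only a few lines. Below is a minimal, self-contained Python sketch of inverse-variance Fixed Effects pooling and DerSimonian–Laird Random Effects pooling (an illustration of the textbook formulas with invented inputs, not code drawn from any of the studies in our sample):

```python
import numpy as np

def fixed_effect(y, v):
    """Inverse-variance (fixed-effect) pooling of effect sizes y with sampling variances v."""
    w = 1.0 / v
    est = np.sum(w * y) / np.sum(w)
    return est, np.sqrt(1.0 / np.sum(w))

def random_effects_dl(y, v):
    """DerSimonian-Laird random-effects pooling: estimates the between-study
    variance tau^2 from Cochran's Q and adds it to each study's variance."""
    w = 1.0 / v
    est_fe, _ = fixed_effect(y, v)
    q = np.sum(w * (y - est_fe) ** 2)            # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)      # DL estimate of tau^2
    w_re = 1.0 / (v + tau2)
    est = np.sum(w_re * y) / np.sum(w_re)
    return est, np.sqrt(1.0 / np.sum(w_re)), tau2

# Hypothetical study-level effect sizes and sampling variances
y = np.array([0.25, 0.10, 0.42, 0.05, 0.31])
v = np.array([0.010, 0.020, 0.015, 0.008, 0.030])

print("Fixed effects:  %.3f (SE %.3f)" % fixed_effect(y, v))
est, se, tau2 = random_effects_dl(y, v)
print("Random effects: %.3f (SE %.3f), tau-squared = %.4f" % (est, se, tau2))
```

The Weighted Least Squares estimator mentioned above can be thought of as the fixed-effect calculation with its standard error rescaled for over-dispersion, which is why it reproduces the Fixed Effects point estimate while reporting different standard errors.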
As an example, in economics, Tomas Havránek has published a wide variety of meta-analyses using customized Stata programs. These can be viewed ***[here](http://meta-analysis.cz/)***. + +TABLE 5 reports the most common software packages used by the studies in our sample. It is clear that R and Stata are the packages of choice for most researchers when estimating effect sizes. + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2021/05/image.webp) + +**ESTIMATION – Tests for Publication Bias** + +Another area where there is much commonality among disciplines is statistical testing for publication bias. While disciplines differ in how frequently they report such tests (see below), when they do, they usually rely on some measure of the relationship between the estimated effect size and its standard error or variance. + +Egger’s test is the most common statistical test for publication bias. It consists of a regression of the effect size on the standard error of the effect size. Closely related is the FAT-PET (or its extension, FAT-PET-PEESE). FAT-PET stands for Funnel Asymmetry Test – Precision Effect Test. This is essentially the same as an Egger regression except that the regression is also used to obtain a publication-bias adjusted estimate of the effect size (“PET”, since this effect is commonly estimated in a specification where the mean effect size is measured by the coefficient on the effect size precision variable). + +The rank correlation test, also known as Begg’s test or the Begg and Mazumdar rank correlation test, works very similarly except rather than a regression, it rank correlates the estimated effect size with its variance. Other tests, such as Trim and fill, Fail-safe N, and tests based on selection models, are less common. + +[![](/replication-network-blog/table620210518.webp)](https://replicationnetwork.com/wp-content/uploads/2021/05/table620210518.webp) + +**OTHER META-ANALYSIS FEATURES** + +In addition to the characteristics identified above, disciplines also differ by how commonly they report information in addition to estimates of the effect size. Three common features are funnel plots, publication bias tests, and meta-regressions. + +Funnel plots can be thought of as a qualitative Egger’s test. Rather than a regression relating the estimated effect size to its standard error, a funnel plot plots the relationship, providing a visual impression of potential publication bias. As is apparent from TABLE 6, not all meta-analyses report funnel plots. They appear to be particularly scarce in Agriculture, where only 15% of our sampled meta-analyses reported a funnel plot. For most disciplines, roughly half of the meta-analyses reported funnel plots. Funnel plots were most frequent in Medicine, with approximately 4 out of 5 meta-analyses showing a funnel plot. + +TABLE 6 reports the most common statistical tests for publication bias conditional on such tests being carried out. While not all meta-analyses test for publication bias, most do. 15 of the 18 disciplines had a reporting rate of at least 50% when it comes to statistical tests of publication bias. Anatomy & Physiology and Diet & Clinical Nutrition had the highest rates, with 85% of meta-analyses reporting tests for publication bias. Agriculture had the lowest at 30%. + +The last feature we focus on is meta-regression. 
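Before turning to that, the Egger/FAT-PET regression just described can be illustrated in a few lines. The sketch below (simulated data, purely illustrative, not the authors' code) uses the standardized form of the regression: the intercept provides the funnel asymmetry test (FAT) and the coefficient on precision provides the bias-adjusted effect estimate (PET).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated meta-analysis sample: true effect 0.2, with smaller (less
# precise) studies reporting inflated effects to mimic publication bias.
k = 60
se = rng.uniform(0.02, 0.40, size=k)            # study standard errors
effect = 0.2 + rng.normal(0.0, se) + 0.8 * se   # the 0.8*se term is the "bias"

# FAT-PET: regress the t-value on precision (1/SE).
t_values = effect / se
precision = 1.0 / se
fatpet = sm.OLS(t_values, sm.add_constant(precision)).fit()

print("FAT (funnel asymmetry) intercept: %.2f, p = %.3f" % (fatpet.params[0], fatpet.pvalues[0]))
print("PET (bias-adjusted mean effect):  %.3f" % fatpet.params[1])
```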
A meta-regression is a regression where the dependent variable is the estimated effect size and the explanatory variables consist of various study, data, and estimation characteristics that the researcher believes may influence the estimated effect size. Technically speaking, an Egger regression is a meta-regression. However, here we restrict the term to studies that attempt to explain differences in estimated effects across studies by relating them to characteristics of those studies beyond the standard error of the effect size. + +Meta-regressions are very common in Economics, with almost 9 out of 10 meta-analyses including them. They are less common in other disciplines, with most disciplines having a reporting rate of less than 50%. None of the 20 Agriculture meta-analyses in our sample reported a meta-regression. + +Nevertheless, there are other ways that meta-analyses can explore systematic differences in effect sizes. Many studies perform subgroup analyses. For example, a study of the effect of a certain reading program may break out the full sample according to the predominant racial or ethnic characteristics of the school jurisdiction to determine whether these characteristics are related to the effectiveness of the program. + +[![](/replication-network-blog/table720210518.webp)](https://replicationnetwork.com/wp-content/uploads/2021/05/table720210518.webp) + +**CONCLUSION** + +While our results are based on a limited sampling of meta-analyses, they indicate that there are important differences in meta-analytic research practices across disciplines. Researchers can benefit from this knowledge by appropriately adapting their research if they are considering submitting their work to interdisciplinary journals. Likewise, being familiar with another discipline’s norms enables one to provide a fairer, more objective review when one is called to referee meta-analyses from journals outside one’s discipline. + +As noted above, estimator performance may also be impacted by study and data characteristics. While some research has explored this topic, it remains largely uncharted territory. Recognizing that meta-analyses from different disciplines have different characteristics should make one sensitive to the possibility that estimators and practices that are optimal in one field may not be well suited in others. We hope this study encourages more research in this area. + +*Jianhua (Jane) Duan is a post-doctoral fellow in the Department of Economics at the University of Canterbury. She is being supported by a grant from the Center for Open Science. Bob Reed is Professor of Economics and the Director of* ***[UCMeta](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/)*** *at the University of Canterbury. They can be contacted at* [*jianhua.duan@pg.canterbury.ac.nz*](mailto:jianhua.duan@pg.canterbury.ac.nz) *and* [*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*, respectively.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2021/05/18/duan-reed-how-are-meta-analyses-different-across-disciplines/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2021/05/18/duan-reed-how-are-meta-analyses-different-across-disciplines/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/eberhardt-revisiting-the-causal-effect-of-democracy-on-long-run-development.md b/content/replication-hub/blog/eberhardt-revisiting-the-causal-effect-of-democracy-on-long-run-development.md new file mode 100644 index 00000000000..0b2942982b0 --- /dev/null +++ b/content/replication-hub/blog/eberhardt-revisiting-the-causal-effect-of-democracy-on-long-run-development.md @@ -0,0 +1,157 @@ +--- +title: "EBERHARDT: Revisiting the Causal Effect of Democracy on Long-run Development" +date: 2019-05-03 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Democracy" + - "Development" + - "Economic Growth" + - "Heterogeneous effects" + - "replication" + - "Robustness checks" +draft: false +type: blog +--- + +###### *[This Guest Blog is a repost of a blog by Markus Eberhardt, published at **[Vox – CEPR Policy Portal](https://voxeu.org/article/revisiting-causal-effect-democracy-long-run-development)**]* + +###### *Recent evidence suggests that a country switching to democracy achieves about 20% higher per capita GDP over subsequent decades. This column demonstrates the sensitivity of these findings to sample selection and presents an implementation which generalises the empirical approach. If we assume that the democracy–growth nexus can differ across countries and may be distorted by common shocks or network effects, the average long-run effect of democracy falls to 10%.* + +###### In a recent paper, Acemoglu et al. (2019), henceforth “ANRR”, demonstrated a significant and large causal effect of democracy on long-run growth. By adopting a simple binary indicator for democracy, and accounting for the dynamics of development, these authors found that a shift to democracy leads to a 20% higher level of development in the long run.[1] + +###### The findings are remarkable in three ways: + +###### – Previous research often emphasised that a simple binary measure for democracy was perhaps “too blunt a concept” (Persson and Tabellini 2006) to provide robust empirical evidence. + +###### – Positive effects of democracy on growth were typically only a “short-run boost” (Rodrik and Wacziarg 2005). + +###### – The empirical findings are robust across a host of empirical estimators with different assumptions about the data generating process, including one adopting a novel instrumentation strategy (regional waves of democratisation). + +###### ANRR’s findings are important because, as they highlight in a [column on Vox](https://voxeu.org/article/democracy-and-growth-new-evidence), “a belief that democracy is bad for economic growth is common in both academic political economy as well as the popular press.” For example, Posner (2010) wrote that “[d]ictatorship will often be optimal for very poor countries”. + +###### The simplicity of ANRR’s empirical setup, the large sample of countries, the long time horizon (1960 to 2010), and the robust positive – and remarkably stable – results across the many empirical methods they employ send a very powerful message against such doubts: democracy does cause growth. + +###### I agree with their conclusion, but with qualifications. My investigation of democracy and growth (Eberhardt 2019) captures two important aspects that were assumed away in ANRR’s analysis: + +###### – **Different countries may experience different relationships between democracy and growth**. 
Existing work (including by ANRR) suggests that there may be thresholds related to democratic legacy, or level of development, or level of human capital, or whether the democratisation process was peaceful or violent. All may lead to differential growth trajectories.[2] + +###### – **The world is a network**. It is subject to common shocks that may affect countries differently. The Global Crisis is one example, as are spillovers across countries (Acemoglu et al. 2015, in the case of financial networks). + +###### **Robustness of ANRR’s findings** + +###### One way in which these features could manifest themselves in ANRR’s findings would be if their democracy coefficient differed substantially across different samples. I carried out two sample reduction exercises: + +###### – Since their panel is highly unbalanced, I drop countries by observation count, first those countries which possess merely five observations, then those with six, and so on. + +###### – I adopt a standard strategy from the time series literature, shifting the end year of the sample. I drop 2010, then 2009-2010, and so on. This strategy is also justified because the Global Crisis and its aftermath, the biggest global economic shock since the 1930s, occurs towards the end of ANRR’s sample. Clearly it may affect the data on the democracy-growth nexus.[3] + +###### Figures 1a and 1b present the findings from these exercises for four parametric models, using the preferred specification of ANRR. + +###### **Figure 1. Robustness of ANRR’s findings** + +###### (a) Sample reduction by minimum observation count + +###### TRN1(20190503) + +###### (b) Sample reduction by end year + +###### TRN2(20190503) + +###### *Notes*: All estimates are for the specification with four lags of GDP (and four lags of the instrument for 2SLS) preferred by ANRR. The left-most estimates in panel (a) replicate the results in ANRR’s Table 2, column (3) for 2FE, (7) for AB, and (11) for HHK (Hahn et al. 2001), and Table 6, column (2) Panel A for 2SLS (two-stage least squares). The left-most estimates in panel (b) replicate the results in ANRR’s Table 2, column (3) for 2FE, (7) for AB (Arellano and Bond 1991), and (11) for HHK, and Table 6, column (2) Panel A for 2SLS. + +###### Taking, for instance, the IV results in Figure 1a,[4] it can be seen that long-run democracy estimates are initially statistically significant (indicated by a filled circle), in excess of 30% in magnitude, and stable – note that the left-most estimate is the full-sample one, which replicates the result of ANRR. + +###### However, once I exclude any country with fewer than 21 time series observations, the long-run coefficient turns statistically insignificant (indicated by a hollow circle). Further sample reduction results in a substantial drop in the coefficient magnitude. The results for all other estimators, and those in Figure 1b, can be read in the same way, although in Figure 1b moving to the right means moving the sample end year forward in time. + +###### We can conclude that the fixed effects estimates are stable. But those of all other estimators vary substantially, typically dropping off towards (or even beyond) zero as the sample is constrained. Of course empirical results change when the sample changes, but the omitted observations are small relative to the overall sample size. 
For the IV results: + +###### – dropping either 3% (Figure 1b) or 7% (Figure 1a) of observations leads to an insignificant long-run coefficient;[5] + +###### – dropping either 18% (Figure 1a) or 27% (Figure 1b) of observations leads to a long-run coefficient on democracy below 5% in magnitude (the full sample coefficient is 31.5%). + +###### If we purposefully mine the sample for influential observations, and omit Turkmenistan (never a democracy), Ukraine (democratic in 17 out of 20 sample years), and Uzbekistan (never a democracy), which provide 60 observations or 0.95% of the full ANRR sample, this yields a statistically insignificant long-run democracy coefficient for the IV implementation. + +###### However, we can also substantially boost the IV estimate by adopting the balanced panel employed in a separate exercise by Chen et al. (2019). These authors study the FE and Arellano and Bond (1991) implementations by ANRR, and conclude that correcting for the known biases afflicting these estimators does not substantially change the long-run democracy coefficient. If I estimate ANRR’s IV estimator for the same balanced panel, the long-run democracy coefficient is almost 180%, roughly six times that of the full sample result. + +###### **New findings** + +###### These exercises highlight that ANRR’s results are sensitive to sample selection. I argue that spillovers between – or heterogeneous democracy-growth relationships across – countries are at the source of this fragility. This violates the basic assumptions of the set of methods used by ANRR, and so it calls for different empirical estimators. + +###### I therefore employ recently developed estimators from the impact evaluation literature (Chan and Kwok 2018) that study the effect of democratisation in the sub-sample of countries for which the democracy indicator changed during the sample period. Chan and Kwok’s approach accounts for endogenous selection into democracy, as well as uncommon and stochastic trends, by including cross-section averages of the subsample of countries that *never* experienced democracy in the estimation equation. + +###### Since ANRR’s results are all based on dynamic specifications (models including lags of the dependent variable), I adjust the methodology to incorporate this feature, and present long-run democracy estimates, as ANRR did. Subjecting this methodology to the same sample reduction exercises as above gives the results in Figure 2. + +###### **Figure 2. Employing heterogeneous parameter estimators** + +###### (a) Sample reduction by minimum observation count + +###### TRN3(20190503) + +###### (b) Sample reduction by end year + +###### TRN4(20190503) + +###### Comparing the results of the preferred specification from Chan and Kwok (incorporating covariates – the teal-coloured line and circles in the figures) and of ANRR’s IV estimation, my findings suggest that the full sample average long-run democracy effect across countries is more modest than that found in ANRR, at around 12% compared to 31.5%. Although Chan and Kwok’s estimates vary when the sample is reduced, the democracy coefficient remains statistically significant, the magnitude substantial, and, at least for the first exercise in Figure 2a, remarkably stable.[6] + +###### **The democratic dividend** + +###### The implicit conclusion from pooled empirical analysis as presented in ANRR is that the 20% democratic dividend applies to any country. 
This interpretation was perhaps not even intended by the authors but, as my empirical exercises demonstrate, their empirical implementations are compromised if the growth effect of democracy differs across countries. + +###### Once we shift to a heterogeneous model, the long-run democracy effect averaged across countries is more modest. The most important question for future research is what drives the differential magnitude of this effect across countries. My initial investigations suggest that democratic legacy is not a prerequisite for a positive democracy effect, but the relationship between the democratic dividend and initial levels of literacy (as a proxy for human capital) appears to follow a U-shape. + +###### *Markus Eberhardt is an Associate Professor in the School of Economics, University of Nottingham, and a Research Affiliate at the Centre for Economic Policy Research (CEPR).* + +###### **References** + +###### Acemoglu, D, S Naidu, P Restrepo, and J A Robinson (2019), “Democracy Does Cause Growth”, *Journal of Political Economy* 127(1): 47-100. + +###### Acemoglu, D, A Ozdaglar, and A Tahbaz-Salehi (2015), “Systemic risk and stability in financial networks”, *American Economic Review* 105(2): 564-608. + +###### Aghion, P, A Alesina, and F Trebbi (2008), “Democracy, Technology, and Growth”, in E Helpman (ed.) *Institutions and Economic Performance*, Harvard University Press. + +###### Arellano, M, and S R Bond (1991), “Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations”, *Review of Economic Studies* 58(2): 277-297. + +###### Cervellati, M, and U Sunde (2014), “Civil Conflict, Democratization, and Growth: Violent Democratization as Critical Juncture”, *Scandinavian Journal of Economics* 116(2): 482-505. + +###### Chan, M K, and S Kwok (2018), “***[Difference-in-Difference when Trends are Uncommon and Stochastic](http://dx.doi.org/10.2139/ssrn.3125890)***”, available at SSRN. + +###### Chen, S, V Chernozhukov, and I Fernandez-Val (2019), “***[Causal Impact of Democracy on Growth: An Econometrician’s Perspective](https://www.aeaweb.org/conference/2019/preliminary/paper/H3zAn6KA)***”, paper presented at the 2019 ASSA meetings in Atlanta, GA. + +###### Eberhardt, M (2019), “***[Democracy Does Cause Growth: Comment](https://cepr.org/active/publications/discussion_papers/dp.php?dpno=13659)***”, CEPR Discussion Paper 13659. + +###### Eberhardt, M, and F Teal (2011), “Econometrics for grumblers: a new look at the literature on cross-country growth empirics”, *Journal of Economic Surveys* 25(1): 109-155. + +###### Gerring, J, P Bond, W T Barndt, and C Moreno (2005), “Democracy and economic growth: A historical perspective”, *World Politics* 57(3): 323-364. + +###### Hahn, J, J A Hausman, and G Kuersteiner (2001), “Bias Corrected Instrumental Variables Estimation for Dynamic Panel Models with Fixed Effects”, MIT Department of Economics Working Paper 01-24. + +###### Madsen, J B, P A Raschky, and A Skali (2015), “Does democracy drive income in the world, 1500-2000?”, *European Economic Review* 78: 175-195. + +###### Papaioannou, E, and G Siourounis (2008), “Democratisation and growth”, *Economic Journal* 118(532): 1520-1551. + +###### Persson, T, and G Tabellini (2006), “Democracy and development: The devil in the details”, *American Economic Review, Papers & Proceedings* 96(2): 319-324. 
+ +###### Posner, R (2010), “***[Autocracy, Democracy, and Economic Welfare](https://www.becker-posner-blog.com/2010/10/autocracy-democracy-and-economic-welfareposner.html)***”, The Becker-Posner blog, 10 October. + +###### Rodrik, D, and R Wacziarg (2005), “Do democratic transitions produce bad economic outcomes?”, *American Economic Review, Papers & Proceedings* 95(2): 50-55. + +###### **Endnotes** + +###### [1] I follow the practice of ANRR in using ‘growth’ as a short-hand for economic development (the level of per capita GDP). The term ‘cross-country growth regression’ is a misnomer, given that the standard specification represents a dynamic model of the levels of per capita GDP; see Eberhardt and Teal (2011) for a more detailed discussion of growth empirics. + +###### [2] Gerring et al. (2005) for democratic legacy, Aghion et al. (2008) and Madsen et al. (2015) for thresholds, and finally Cervellati and Sunde (2014) for concerns related to democratisation scenarios. + +###### [3] Note that shifting the start year of the sample does not result in any substantial changes in ANRR’s result: it is the experience of the post-2000 period which drives their results. + +###### [4] The AB and HHK results are arguably less robust than the IV results to sample restrictions. + +###### [5] For comparison, in related work on democratisation and growth, Papaioannou and Siourounis (2008) show the robustness of their findings using a cut-off equivalent to 12% of their full sample. + +###### [6] One would expect that the temporal sample reduction as presented in Figure 4 would slowly chip away at the magnitude of the coefficient. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/05/03/eberhardt-revisiting-the-causal-effect-of-democracy-on-long-run-development/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/05/03/eberhardt-revisiting-the-causal-effect-of-democracy-on-long-run-development/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/economics-e-journal-is-looking-for-a-few-good-replicators.md b/content/replication-hub/blog/economics-e-journal-is-looking-for-a-few-good-replicators.md new file mode 100644 index 00000000000..d42ed3a365b --- /dev/null +++ b/content/replication-hub/blog/economics-e-journal-is-looking-for-a-few-good-replicators.md @@ -0,0 +1,39 @@ +--- +title: "Economics E-Journal is Looking for a Few Good Replicators" +date: 2017-04-25 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Economics E-Journal" + - "replication" + - "special issue" +draft: false +type: blog +--- + +###### The journal *Economics: The Open Access, Open Assessment E-Journal* is publishing a special issue on “The Practice of Replication.” This is how the journal describes it: + +###### “The last several years have seen increased interest in replications in economics.  
This was highlighted by the most recent meetings of the American Economic Association, which included three sessions on replications (see ***[here](https://www.aeaweb.org/conference/2017/preliminary/1530?page=5&per-page=50)***, ***[here](https://www.aeaweb.org/conference/2017/preliminary/2100?sessionType%5Bsession%5D=1&organization_name=&search_terms=replication&day=&time=))***, and ***[here](https://www.aeaweb.org/conference/2017/preliminary/1542?sessionType%5Bsession%5D=1&organization_name=&search_terms=miguel&searchLimits%5Bauthor_last%5D=1&day=&time=)***). Interestingly, there is still no generally acceptable procedure for how to do a replication.  This is related to the fact that there is no standard for determining whether a replication study “confirms” or “disconfirms” an original study. This special issue is designed to highlight alternative approaches to doing replications, while also identifying core principles to follow when carrying out a replication.” + +###### “Contributors to the special issue will each select an influential economics article that has not previously been replicated, with each contributor selecting a unique article.  Each paper will discuss how they would go about “replicating” their chosen article, and what criteria they would use to determine if the replication study “confirmed” or “disconfirmed” the original study.” + +###### “Note that papers submitted to this special issue will not actually do a replication.  They will select a study that they think would be a good candidate for replication; and then they would discuss, in some detail, how they would carry out the replication.  In other words, they would lay out a replication plan.” + +###### “Submitted papers will consist of four parts: (i) a general discussion of principles about how one should do a replication, (ii) an explanation of why the “candidate” paper was selected for replication, (iii) a replication plan that applies these principles to the “candidate” article, and (iv) a discussion of how to interpret the results of the replication (e.g., how does one know when the replication study “replicates” the original study).” + +###### “The contributions to the special issue are intended to be short papers, approximately Economics Letters-length (though there would not be a length limit placed on the papers).” + +###### “The goal is to get a fairly large number of short papers providing different approaches on how to replicate.  These would be published by the journal at the same time, so as to maintain independence across papers and approaches.  Once the final set of articles are published, a summary document will be produced, the intent of which is to provide something of a set of guidelines for future replication studies.” + +###### Despite all the attention that economics, and other disciplines, have devoted to research transparency, data sharing, open science, reproducibility, and the like, much remains to be done on best practice guidelines for doing replications.  Further, there is much confusion about how one should interpret the results from replications.  Perhaps this is not surprising.  There is still much controversy about how to interpret tests of hypotheses!  At the very least, it is helpful to have a better understanding of the current state of replication practice, and how replicators understand their own research.  It is hoped that this special issue will help to progress our understanding on these subjects. 
+ +###### To read more about the special issue, and how to contribute, ***[click here](http://www.economics-ejournal.org/special-areas/special-issues/the-practice-of-replication)***. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/04/25/economics-e-journal-is-looking-for-a-few-good-replicators/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/04/25/economics-e-journal-is-looking-for-a-few-good-replicators/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/edward-leamer-on-econometrics-in-the-basement-the-bayesian-approach-and-the-leamer-rosenthal-prize.md b/content/replication-hub/blog/edward-leamer-on-econometrics-in-the-basement-the-bayesian-approach-and-the-leamer-rosenthal-prize.md new file mode 100644 index 00000000000..3552df28966 --- /dev/null +++ b/content/replication-hub/blog/edward-leamer-on-econometrics-in-the-basement-the-bayesian-approach-and-the-leamer-rosenthal-prize.md @@ -0,0 +1,46 @@ +--- +title: "EDWARD LEAMER: On Econometrics in the Basement, the Bayesian Approach, and the Leamer-Rosenthal Prize" +date: 2015-09-04 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Bayesian approach" + - "BITSS" + - "Edward Leamer" + - "Leamer-Rosenthal Prize" +draft: false +type: blog +--- + +###### (THIS BLOG IS REPOSTED FROM [THE BITSS WEBSITE](http://www.bitss.org/2015/08/25/who-inspired-the-leamer-rosenthal-prizes-part-ii-ed-leamer/)) I became interested in methodological issues as a University of Michigan graduate student from 1967 to 1970, watching the economics faculty build an econometric macro model in the basement of the building (The Michigan Model), and comparing how these same faculty members described what they were doing when they taught econometric theory on the top floor of the building.  Though the faculty in the basement and on the top floor to outward appearances were the very same people, ascending or descending the stairs seemed to alter their inner intellectual selves completely. + +###### The words “specification search” in my 1978 book [*Specification Searches*](http://www.anderson.ucla.edu/faculty/edward.leamer/books/specification_searches/SpecificationSearches.pdf) refers to the search for a model to summarize the data in the basement where the dirty work is done, while the theory of pristine inference taught on the top floor presumes the existence of the model before the data are observed. This assumption of a known model may work in an experimental setting in which there are both experimental controls and randomized treatments, but for the non-experimental data that economists routinely study, much of the effort is an exploratory search for a model, not estimation with a known and given model. The very wide model search that usually occurs renders econometric theory suspect at best, and possibly irrelevant.  Things like unbiased estimators, standard errors and t-statistics lose their meaning well before you get to your 100th trial model. + +###### Looking at what was going on, it seemed to me essential to make theory and practice more compatible, by changing both practice and theory.   An essential but fortuitous accident in my intellectual life had me taking courses in Bayesian statistics in the Math Department. 
The Bayesian philosophy seemed to offer a logic that would explain the specification searches that were occurring in the basement and that were invalidating the econometric theory taught in the top floor, and also a way of bringing the two floors closer together. + +###### The fundamental message of the Bayesian approach is that, when the data are weak, the context matters, or more accurately the analyst’s views about the context matter.  The same data set can allow some to conclude legitimately that executions deter murder and also allow others to conclude that there is no deterrent effect, because they see the context differently.  While it’s not the only kind of specification search, per my book, an “interpretative search” combines the data information with the analyst’s ambiguous and ill-defined understanding of the context.  The Bayesian philosophy offers a perfect hypothetical solution to the problem of pooling the data information with the prior contextual information – one summarizes the contextual information in the form of a previous hypothetical data set. + +###### A HUGE hypothetical benefit of a Bayesian approach is real transparency both to oneself and to the audience of readers.  Some people think that transparency can be achieved by requiring researchers to record and to reveal all the model exploration steps they take, but if we don’t have any way to adjust or to discount conclusions from these specification searches, this is transparency without accountability, without consequence.   What is really appealing about the Bayesian approach is that the prior information of the analyst is explicitly introduced into the data analysis and “straightforwardly” revealed both to the analyst and to her audience.   This is transparency with consequence.  We can see why some think executions deter murders and others see no deterrent effect. + +###### The frustratingly naïve view that often meets this proposal is that “science doesn’t make up data.”   When I hear that comment, I just walk away.  It isn’t worth the energy to try to discuss how inferences from observational data are actually made, and for that matter how experiments are interpreted as well.   We all make up the equivalent of previous data sets, in the sense of allowing the context to matter in interpreting the evidence.   It’s a matter of how, not if.  Actually, I like to suggest that the two worst people to study data sets are a statistician who doesn’t understand the context, and a practitioner who doesn’t understand the statistical subtleties. + +###### However, we remain far from a practical solution, Bayesian or otherwise, and current practice is more or less the same as it was when punch cards were fed into computers back in the 1960s.  The difference is that with each advance in technology from counting on fingers to Monroe calculators to paper tapes to punch cards to mainframes to personal computers to personal digital assistants, we have made it less and less costly to compute new estimates from the same data set, and the supply of alternative estimated models has greatly increased, though almost all of these are hidden on personal hard drives or in [Rosenthal’s File Drawers](http://www.bitss.org/2015/08/05/who-inspired-the-leamer-rosenthal-prizes-part-i-robert-rosenthal/). 
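###### Leamer’s idea of summarizing contextual information “in the form of a previous hypothetical data set” can be made concrete with a toy normal–normal example: the posterior mean is a precision-weighted average of the sample estimate and the prior, so two analysts facing the same estimate can legitimately reach different conclusions. The Python sketch below is an illustration with invented numbers, not anything from the original post.

```python
def posterior_mean(data_mean, data_var, prior_mean, prior_var):
    """Normal-normal updating: the prior acts like a previously observed
    (hypothetical) data set, weighted by its precision."""
    w_data, w_prior = 1.0 / data_var, 1.0 / prior_var
    return (w_data * data_mean + w_prior * prior_mean) / (w_data + w_prior)

# Invented example: one estimated "deterrent effect" of executions,
# read through two different priors about the context.
estimate, variance = -0.8, 0.25

sceptic  = posterior_mean(estimate, variance, prior_mean=0.0,  prior_var=0.05)
believer = posterior_mean(estimate, variance, prior_mean=-1.0, prior_var=1.00)

print(f"Sceptical prior about deterrence  -> posterior effect {sceptic:+.2f}")
print(f"Permissive prior about deterrence -> posterior effect {believer:+.2f}")
```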
+ +###### The classical econometrics that is still taught to almost all economists has no hope of remedying this unfortunate situation, since the assumed knowledge inputs do not come close to approximating the contextual information that is available. But the Bayesian priests who presume the existence of a prior distribution that describes the context are not so different from the econometric theorists who presume the existence of a model.  Both are making assumptions about how the dirty work of data analysis in the basement is done or should be done, but few of either religious persuasion leave their offices and classrooms on the top floor and descend into the basement to analyze data sets.  Because of the impossibility of committing to any particular prior distribution, the Bayesian logic turns the search for a model into a search for a prior distribution. My solution to the prior-ambiguity problem has been to design tools for sensitivity analysis to reveal how much the conclusions change as the prior is altered, some local perturbations (point to point mapping) and some global ones (correspondences between sets of priors and sets of inferences). + +###### As I read what I have just written, I think this is hugely important and highly interesting.  But I am reminded of the philosophical question:  When Leamer speaks and no one listens, did he say anything?   None of the tenured faculty in Economics at Harvard took any interest in this enterprise, and they gave me the Donald Trump treatment: You’re fired.   My move to UCLA was to some extent a statement of approval for my book, *Specification Searches*, but my pursuit of useful sensitivity methods remained a lonely one.  The sincerest form of admiration is copying, but no one pursued my interest in these sensitivity results. I did gain notoriety if not admiration with the publication of a watered down version of my ideas in “[Let’s take the con out of econometrics](http://www.jstor.org/stable/1803924?seq=1#page_scan_tab_contents).” But not so long after that, finding that I was not much affecting the economists around me, and making less progress producing sensitivity results that I found amusing, I moved onto the study of International Economics, and later I took the professionally disreputable step of forecasting the US macro economy on a quarterly basis, back to my Michigan days.   I memorialized that effort with the book titled [*Macroeconomic Patterns and Stories*](http://www.amazon.com/Macroeconomic-Patterns-Stories-Edward-Leamer/dp/3540463887), which is an elliptical comment that we don’t do science, we do persuasion with patterns and stories.  And more recently, I have tried again to reach my friends by offering context-minimal measures of model ambiguity which I have called [s-values (s for sturdy)](http://www.anderson.ucla.edu/faculty/edward.leamer/documents/Leamer_on_Conventional_Measures_of_Model_Ambiguity.pdf) to go along with t-values and p-values.    This one-more attempt illustrates what is the fundamental problem – we don’t have the right tools. + +###### It is my hope that the [Leamer-Rosenthal prize](http://www.bitss.org/prizes/leamer-rosenthal-prizes/) will bring some added focus on these deep and persistent problems with our profession, stimulating innovations that can produce real transparency by which I mean ways of studying data and reporting the results that allow both the analyst and the audience to understand the meaning of the data being studied, and how that depends on the contextual assumptions. 
+ +###### This whole thing reminds me of the parable of the Emperor’s New Clothes.  Weavers (of econometric theory) offer the Emperor a new suit of clothes, which are said to be invisible to incompetent economists and visible only to competent ones.  No economist dares to comment until a simple-minded one hollers out “He isn’t wearing any clothes at all.”   The sad consequence is that everyone thinks the speaker both impolite and incompetent, and the Emperor continues to parade proudly in that new suit, which draws repeated compliments from the weavers:  Elegant, very elegant. + +###### OK, it’s delusional.  I know. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/09/04/edward-leamer-on-econometrics-in-the-basement-the-bayesian-approach-and-the-leamer-rosenthal-prize/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/09/04/edward-leamer-on-econometrics-in-the-basement-the-bayesian-approach-and-the-leamer-rosenthal-prize/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/etienne-lebel-introducing-curatescience-org.md b/content/replication-hub/blog/etienne-lebel-introducing-curatescience-org.md new file mode 100644 index 00000000000..793ef4b414e --- /dev/null +++ b/content/replication-hub/blog/etienne-lebel-introducing-curatescience-org.md @@ -0,0 +1,43 @@ +--- +title: "ETIENNE LEBEL: Introducing “CurateScience.Org”" +date: 2015-10-21 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "CurateScience.org" + - "Etienne LeBel" + - "Forest plot" + - "Meta-analysis" + - "Psychology" +draft: false +type: blog +--- + +###### It is my pleasure to introduce Curate Science () to The Replication Network. Curate Science is a web application that aims to facilitate and incentivize the curation and verification of empirical results in the social sciences (initial focus in Psychology). Science is the most successful approach to generating cumulative knowledge about how our world works. This success stems from a key activity, **independent verification,** which maximizes the likelihood of detecting errors, hence maximizing the reliability and validity of empirical results. The current academic incentive structure, however, does not reward verification and so verification rarely occurs and when it does, is highly difficult and inefficient. Curate Science aims to help change this by facilitating the *verification* of empirical results (pre- and post-publication) in terms of (1) replicability of findings in independent samples and (2) reproducibility of results from the underlying raw data. + +###### The platform facilitates **replicability** by enabling users to link replications directly to their original studies, with corresponding real-time updating of meta-analytic effect size estimates and forest plots of replications (see Figure below).[[1]](#_ftn1) The platform aims to incentivize verification in terms of replicability by easily allowing users to invite others to replicate one’s work and also by providing a professional credit system that explicitly acknowledges replicators’ hard work commensurate to the “expensiveness” of the executed replication. 
+ +###### [row-based-meta](https://replicationnetwork.com/wp-content/uploads/2015/10/row-based-meta.webp) + +###### The platform facilitates **reproducibility** by enabling researchers to check and endorse the analytic reproducibility of each other’s empirical results via data analyses executed within their web browser for studies with open data. The platform will visually acknowledge the endorser via a professional credit system to incentivize researchers to verify the reproducibility of each other’s results, when direct replications are not feasible or too expensive to execute. + +###### The platform allows curation of study information, which is required for independent verification in terms of replicability and reproducibility. However, the platform also features additional curation activities including “revised community abstracts” (crowd-sourced abstracts summarizing how follow-up research has qualified original findings, e.g., boundary conditions) and curation of organic and external post-publication peer-review commentaries. + +###### **Our vision** + +###### Curate Science’s vision for the future of academic science is one where verification is routinely and easily done in the cloud, and in which appropriate professional credit is given to researchers who engage in such verification activities (i.e., verifying replicability and reproducibility of empirical results, and post-publication peer review). We foresee a future where one can easily look up important articles in one’s field to see the current status of findings via revised community abstracts (a la Wikipedia). This will maximize the impact and value of research in terms of re-use by other researchers (e.g., help unearth new insights from different theoretical perspectives), and hence accelerate theoretical progress and innovation for the benefit of society. + +###### **Current activities** + +###### Our current activities include the curation of articles and replications in psychology, which includes identifying professors who will get PhD students to curate and link replications for seminal studies covered in their seminar classes. We’re also busy in terms of advocacy and canvassing: I’m currently on a 3-month USA-Europe tour presenting Curate Science and getting concrete feedback from over 10 university psychology departments. Finally, and crucially, we’re particularly busy with software development and refinement of the website’s user-interface to improve the usability and user experience of the website (e.g., fixing bugs, implementing refinements, and improvements). To check out the early beta version of our website, please go here: + +[[1]](#_ftnref1) In the future, users will also be able to create their own meta-analyses in the cloud for generalizability studies (a.k.a “conceptual replications”), which other users will easily be able to add to/update via crowd-sourcing. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/10/21/lebel-introducing-curatescience-org/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/10/21/lebel-introducing-curatescience-org/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/faff-international-society-of-pitching-research-for-responsible-science.md b/content/replication-hub/blog/faff-international-society-of-pitching-research-for-responsible-science.md new file mode 100644 index 00000000000..efc10ed012d --- /dev/null +++ b/content/replication-hub/blog/faff-international-society-of-pitching-research-for-responsible-science.md @@ -0,0 +1,104 @@ +--- +title: "FAFF: International Society of Pitching Research for Responsible Science" +date: 2021-05-11 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "InSPiR^2eS" + - "Pitching Research" + - "Research network" + - "Responsible science" +draft: false +type: blog +--- + +**What is the International Society of Pitching Research for Responsible Science (InSPiR2eS) research network?** + +InSPiR2eS is a globally-facing research network primarily aimed at research training and capacity building, resting on a foundation theme of responsible science. + +As measures of its success, the network succeeds if (beyond what we would have otherwise achieved) it inspires: + +– responsible research – i.e., research that produces new knowledge that is credible, useful & independent. + +– productive research collaboration & partnerships – locally, regionally & globally. + +– a collective sense of purpose and achievement towards the whole research process. + +Importantly, the network aims to inclusively embrace like-minded university researchers centred on the multi-faceted utility provided by the “[***Pitching Research***](http://ssrn.com/abstract=2462059)” framework, as a natural enabler of responsible science. + +The network alliance is a fully “opt in” organisational structure. Through the very act of joining the network, each member will abide by an appropriate Code of Conduct (under development), including: privacy, confidentiality, communication and notional IP relating to research ideas. + +**Why create InSPiR2eS?** + +The underlying premise for creating InSPiR2eS is to facilitate an efficient co-ordinated sharing of relevant research information and resources for the mutual benefit of all participants – whether this occurs through the inputs, processes or outputs linked to our research endeavours. More generally, while the enabling focus is on the Pitching Research framework, the network can offer its members a global outreach for their research efforts – in new and novel ways. + +For example, actively engaging the network could spawn new research teams and projects, or other international initiatives and alliances. While such positive outcomes could “just happen” anyway, absent creating a new network – which is hardly novel, the network could e.g., experiment with a “shark tank” type webinar event in which some members pitch for new project collaborators. Exploiting the power of a strong alliance, the network can deliver highly leveraged outcomes compared to what is possible when we act alone – as isolated “sole traders”. + +**Who are the members of InSPiR2eS?** + +Professor Robert Faff (Bond University), as the network initiator, is the network convenor and President of InSPiR2eS. 
Currently, the network has more than 500 founding Ambassadors, Members and Associate Members already signed up representing 73 countries/ jurisdictions: Australia, Pakistan, China, Canada, New Zealand, Vietnam, Brazil, Nigeria, Germany, Indonesia, the Netherlands, England, Kenya, Romania, Poland, Mauritius, Sri Lanka, Bangladesh, Italy, Spain, India, Scotland, Singapore, Japan; Norway; Ireland; the US; Malaysia; Chile; Turkey; Wales; Serbia; Belgium; Thailand; France; South Africa; Switzerland; Croatia; Czech Republic; Hong Kong; Taiwan; Macau; South Korea; Greece; Ukraine; Ghana; Slovenia; Austria; Cyprus; Uganda; Namibia; Portugal; Tanzania; Fiji; Saudi Arabia; Estonia; Iceland; Egypt; Mongolia; Lithuania; Slovakia; Finland; Sweden; Ecuador; Israel; Hungary; UAE; North Cyprus; Mozambique; Philippines; Nepal; Argentina; Malta. + +**How will InSPiR2eS operate?** + +***Phase 1:** Network setup and initial information exchange.* + +To begin with, we will rely (mostly) on email communication. We will establish an e-newsletter – to provide engaging and organised information exchange. Dr Searat Ali (University of Wollongong) has agreed to be the inaugural Editor of the InSPiR2eS e-Newsletter (in his role as VP – Communications). + +***Phase 2:** Establishing interactive network engagement.* + +Live webinar Zoom sessions will be offered on topics linked to the network Mission. These sessions would be recorded and freely accessible from an InSPiR2eS “Resource Library”. Initially, these sessions will be presented by the network leader, but over time others in the network would be welcome to offer sessions – especially, if the topics are of a general nature aiming for research training/capacity building (rather than a research seminar on their latest paper). These webinars would be open to all, irrespective of whether they are network members or not – including network members, as well as to their students, their research collaborators and any other invited associated researcher in their networks. + +The inaugural network webinar will broadly address the core theme of “responsible science”, and this material will serve as a beacon against which all network activities will be offered. Subsequent webinar topics might include the following modules: + +– A Basic Primer on Pitching Research. + +– Using Pitching Research as a Reverse Engineering Tool. + +– Advanced Guidelines on applying the Pitching Research Framework. + +– Pitching Research for Engagement & Impact. + +– Pitching Research as a Tool for Responsible Science. + +– Pitching Research as a Tool for Replications. + +– Pitching Research as a Tool for Pre-registration. + +– Pitching Research for Diagnostic use in Writing. + +– Roundtable Panels – e.g., discussing issues related to “responsible science”, etc. + +***Phase 3:** Longer-term, post-COVID network initiatives*. + +Downstream network initiatives will include the creation of a “one-stop shop” network website. And, once COVID is behind us, we will explore some in-person events like: + +– a conference or symposium. + +– “shark tank” event(s), either themed on “pitching research for finding collaborators” or “pitching research to journal editors”. + +– initiatives/ special projects/ network events suggested and/or co-ordinated by network members. + +**When will InSPiR2eS content activity begin?** + +Release of the inaugural edition of the e-Newsletter will be a signature activity  and we are aiming for this to be ready later in May, 2021. 
Zoom webinars, will also start soon – we are aiming for a network opening event in June 2021. Please keep an eye for publicity on this event soon. + +**How do I join the InSPiR2eS research network?** + +If you are interested in joining the InSPiR2eS research network and engaging in its upcoming rich program of webinars, workshops and research resources, then register at the following Google Docs link ([***click here***](https://tinyurl.com/4u6wen43)). In locations where Google is problematic, *[**click here**](https://www.wjx.top/vm/t19EkAc.aspx).* + +*Robert Faff is Professor of Finance at Bond University. He is Network Convenor & President of InSPiR2eS. Professor Faff can be contacted at rfaff@bond.edu.au.* + +—————————————————————- + +[[1]](#_ftnref1) In part, the idea of the network itself is inspired by the community for [***Responsible Research in Business and Management***](https://www.rrbm.network/) that released a position paper in 2017, in which they outline a vision for the year 2030 “… of a future in which business schools and scholars worldwide have successfully transformed their research toward responsible science, producing useful and credible knowledge that addresses problems important to business and society.” + +--- + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2021/05/11/faff-international-society-of-pitching-research-for-responsible-science/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2021/05/11/faff-international-society-of-pitching-research-for-responsible-science/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/fecher-wagner-fr-ssdorf-social-scientists-and-replications-tell-me-what-you-really-think.md b/content/replication-hub/blog/fecher-wagner-fr-ssdorf-social-scientists-and-replications-tell-me-what-you-really-think.md new file mode 100644 index 00000000000..45bf06f0f84 --- /dev/null +++ b/content/replication-hub/blog/fecher-wagner-fr-ssdorf-social-scientists-and-replications-tell-me-what-you-really-think.md @@ -0,0 +1,68 @@ +--- +title: "FECHER, WAGNER & FRÄSSDORF: Social Scientists and Replications: Tell Me What You Really Think!" +date: 2017-02-04 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "German Socio-Economic Panel Study" + - "replication" + - "Social scientists" + - "Survey" +draft: false +type: blog +--- + +###### *NOTE: This entry is based on the article, “**[Perceptions and Practices of Replication by Social and Behavioral Scientists: Making Replications a Mandatory Element of Curricula Would Be Useful](http://legacy.iza.org/en/webcontent/publications/papers/viewAbstract?dp_id=9896)**”* + +###### In times of increasing publication rates and specialization of disciplines, it is particularly important for academia to reflect upon measures to safeguard the integrity of research, beyond the classical peer review. Empirical economics especially faces this challenge due to its responsibility towards society, but also because an increasing number of studies have called the reproducibility of findings into question (*1*–*4*). A prominent example is Reinhart and Rogoff’s study “Growth in a Time of Debt” on the effectiveness of austerity-based fiscal policies for highly indebted economies (*5*). The results of the study clearly translated into politics although it was based on fundamental miscalculations, as demonstrated by a replication study by Herndon et al. 
(*6*).

###### Replication studies are important because they contribute to the self-correction abilities of the self-referential scientific ecosystem. Moreover, “low cost” replication studies that use the primary investigator’s original dataset seem increasingly feasible considering pressure by funding agencies and science policy makers to make research data available (*7*, *8*). Nonetheless, replication studies are rarely conducted (*9*).

###### To better understand researchers’ views towards replication, we surveyed the perceptions and replication practices of 300 social and behavioral scientists who use data from the German Socio-Economic Panel Study (SOEP), a widely analyzed multi-cohort study of the German population (*10*).

###### 84 per cent of the surveyed researchers agree that replications are necessary for improving scientific output, and 71 per cent disagree with the statement that replications are not worthwhile because major mistakes will be found at some point anyway.

###### 58 per cent of our respondents never attempted a replication, despite the fact that SOEP data is easily obtained, well-documented and frequently analyzed. Of those respondents who had conducted a replication study in the past, more than half had done so during regular coursework – either while teaching a class (13% of all respondents) or while being taught as a student (9%). 20% of the respondents used a replication of a SOEP article for their own research. Of those who never conducted a replication study, 76% never saw a need to do so, while the rest thought it would be too time-consuming (15%) or did not have enough information (9%) – about the data, the software, or the way the results in the original article were produced (i.e., the scripts were not available).

###### As for those who did replicate a SOEP article, 84% were able to reproduce the results of the original article (although the results were not always exactly identical to those found by the original authors), while only 16% were not able to do so. When asked why the results could not be completely replicated, 69% of the respondents stated that the information about details of the analysis in the original article was insufficient.

###### The situation regarding replications can be regarded as a “tragedy of the commons”: everybody knows that they are useful, but almost everybody counts on others to conduct them. A possible explanation for this is that conducting replication studies is not worthwhile in the context of the academic reward system, since they are time-consuming and rarely published (*9*). Previous research has shown that impact considerations are already steering replication efforts (*11*, *12*); for instance, researchers target high-impact studies. Nevertheless, the number of replication studies is still considerably low. Against this background, we argue that more replications would be conducted if they received more formal recognition (e.g., journals could adapt their policies and publish more replication studies (*13*)). Our results furthermore show that most of the replication studies are conducted in the context of teaching. In our view, this is a promising detail: in order to increase the number of replication studies, it seems useful to make replications a mandatory part of curricula and an optional chapter of (cumulative) doctoral theses.
+ +###### *Benedikt Fecher is a doctoral student at the German Institute of Economic Research and the Alexander von Humboldt Institute for Internet and Society. Mathis Fräßdorf is Head of the Department for Scientific Information at Wissenshaftszentrum Berlin für Sozialforschung. Gert Wagner is Professor of Economics at the Berlin University of Technology. Correspondence about this blog should be directed to Benedikt Fecher at* [*fecher@hiig.de*](mailto:fecher@hiig.de)*.* + +###### **References** + +###### (1) R. G. Anderson, A. Kichkha, Replication versus Meta-Analysis in Economics: Where Do We Stand 30 Years After Dewald, Thursby and Anderson? (2017), (available at ). + +###### (2) C. F. Camerer *et al.*, Evaluating replicability of laboratory experiments in economics. *Science* (2016), doi:10.1126/science.aaf0918. + +###### (3) W. G. Dewald, J. G. Thursby, R. G. Anderson, Replication in Empirical Economics: The Journal of Money, Credit and Banking Project. *The American Economic Review*. **76**, 587–603 (1986). + +###### (4) M. Duvendack, R. Jones, R. Reed, What is Meant by “Replication” and Why Does It Encounter Such Resistance in Economics? (2017), (available at ). + +###### (5) C. Reinhart, K. Rogoff, “Growth in a Time of Debt” (w15639, National Bureau of Economic Research, Cambridge, MA, 2010), (available at ). + +###### (6) T. Herndon, M. Ash, R. Pollin, Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. *Cambridge Journal of Economics*. **38**, 257–279 (2013). + +###### (7) M. McNutt, Reproducibility. *Science*. **343**, 229–229 (2014). + +###### (8) B. Fecher, G. G. Wagner, A research symbiont. *Science*. **351**, 1405–1406 (2016). + +###### (9) C. L. Park, What is the value of replicating other studies? *Research Evaluation*. **13**, 189–195 (2004). + +###### (10) DIW Berlin, Übersicht über das SOEP (2015), (available at ). + +###### (11) D. Hamermesh, What is Replication? The Possibly Exemplary Example of Labor Economics (2017), (available at ). + +###### (12) S. Sukhtankar, Replications in Development (2017), (available at ). + +###### (13) J. H. Hoeffler, Replication and Economics Journal Policies (2017), (available at ). + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/02/04/fecher-wagner-frassdorf-social-scientists-and-replications-tell-me-what-you-really-think/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/02/04/fecher-wagner-frassdorf-social-scientists-and-replications-tell-me-what-you-really-think/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/ferraro-shukla-is-a-replicability-crisis-on-the-horizon-for-environmental-and-resource-economics.md b/content/replication-hub/blog/ferraro-shukla-is-a-replicability-crisis-on-the-horizon-for-environmental-and-resource-economics.md new file mode 100644 index 00000000000..431a89f29fc --- /dev/null +++ b/content/replication-hub/blog/ferraro-shukla-is-a-replicability-crisis-on-the-horizon-for-environmental-and-resource-economics.md @@ -0,0 +1,60 @@ +--- +title: "FERRARO & SHUKLA: Is a Replicability Crisis on the Horizon for Environmental and Resource Economics?" +date: 2020-09-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" +draft: false +type: blog +--- + +Across scientific disciplines, researchers are increasingly questioning the credibility of empirical research. 
This research, they argue, is rife with unobserved decisions that aim to produce publishable results rather than accurate results. In fields where the results of empirical research are used to design policies and programs, these critiques are particularly concerning because they undermine the credibility of the science on which policies and programs are designed. In a ***[paper](https://academic.oup.com/reep/article/14/2/339/5894765)*** published in the *Review of Environmental Economics and Policy*, we assess the prevalence of empirical research practices that could lead to a credibility crisis in the field of environmental and resource economics.

We looked at empirical environmental economics papers published between 2015 and 2018 in four top journals: The American Economic Review (AER), Environmental and Resource Economics (ERE), The Journal of the Association of Environmental and Resource Economists (JAERE), and The Journal of Environmental Economics and Management (JEEM). From 307 publications, we collected more than 21,000 test statistics to construct our dataset. We reported four key findings:

**1. Underpowered Study Designs and Exaggerated Effect Sizes**

As has been observed in other fields, the empirical designs used by environmental and resource economists are statistically underpowered, which implies that the magnitude and sign of the effects reported in their publications are unreliable. The conventional target for adequate statistical power in many fields of science is 80%. We estimated that, in environmental and resource economics, the median power of study designs is 33%, with power below 80% for nearly two out of every three estimated parameters. When studies are underpowered, and when scientific journals are more likely to publish results that pass conventional tests of statistical significance – tests that can only be passed in underpowered designs when the estimated effect is much larger than the true effect size – these journals will tend to publish exaggerated effect sizes. ***We estimated that 56% of the reported effect sizes in the environmental and resource economics literature are exaggerated by a factor of two or more; 35% are exaggerated by a factor of four or more.***

**2. Selective Reporting of Statistical Significance or “p-hacking”**

Researchers face strong professional incentives to report statistically significant results, which may lead them to selectively report results from their analyses. One indicator of selective reporting is an unusual pattern in the distribution of test statistics; specifically, a double-humped distribution around conventionally accepted values of statistical significance. In the figure below, we present the distribution of test statistics for the estimates in our sample, where 1.96 is the conventional value for statistical significance (p<0.05). ***The unusual dip just before 1.96 is consistent with selective reporting of results*** that are above the conventionally accepted level of statistical significance.

![Distribution of test statistics in the sample](/replication-network-blog/capture-2.webp)

**3. Multiple Comparisons and False Discoveries**

Repeatedly testing the same data set in multiple ways increases the probability of making false (spurious) discoveries, a statistical issue that is often called the “multiple comparisons problem.” To mitigate the probability of false discoveries when testing more than one related hypothesis, researchers can adopt a range of approaches.
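One widely used correction, for example, is to cap the false discovery rate at a pre-specified level. As a minimal illustration – a Python sketch with made-up p-values, not code or data from the original paper – the Benjamini-Hochberg step-up procedure keeps only those hypotheses whose ordered p-values fall below a rising threshold:

```python
# Minimal Benjamini-Hochberg false-discovery-rate (FDR) correction.
# Illustrative only: the p-values below are invented, not from the study.

def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected while holding the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # positions sorted by p-value
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_passing_rank = rank                    # step-up rule keeps the largest passing rank
    return [order[i] for i in range(largest_passing_rank)]

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.360]
print(benjamini_hochberg(p_vals))   # only the two smallest p-values survive at q = 0.05
```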
Such corrections, however, are rare in the environmental and resource economics literature: ***63% of the studies in our sample conducted multiple hypothesis tests, but less than 2% of them used an accepted approach to mitigate the multiple comparisons problem.***

**4. Questionable Research Practices (QRPs)**

To better understand empirical research practices in the field of environmental and resource economics, we also conducted a survey of members of the Association of Environmental and Resource Economists (AERE) and the European Association of Environmental and Resource Economists (EAERE). In the survey, we asked respondents to self-report whether they had engaged in research practices that other scholars have labeled “questionable”. These QRPs include selectively reporting only a subset of the dependent variables or analyses conducted, hypothesizing after the results are known (also called HARKing), choosing regressors or re-categorizing data after looking at the results, and so on. Although one might assume that respondents would be unlikely to self-report engaging in such practices, ***92% admitted to engaging in at least one QRP.***

**Recommendations for Averting a Replication Crisis**

To help improve the credibility of the environmental and resource economics literature, we recommended changes to the current incentive structures for researchers.

– Editors, funders, and peer reviewers should emphasize the designs and research questions more than results, abolish conventional statistical significance cut-offs, and encourage the reporting of statistical power for different effect sizes.

– Authors should distinguish between exploratory and confirmatory analyses, and reviewers should avoid punishing authors for exploratory analyses that yield hypotheses that cannot be tested with the available data.

– Authors should be required to be transparent by uploading to publicly accessible online repositories the datasets and code files that reproduce the manuscript’s results, as well as results that may have been generated but not reported in the manuscript because of space constraints or other reasons. Authors should be encouraged to report everything, and reviewers should avoid punishing them for transparency.

– To ensure their discipline is self-correcting, environmental and resource economists should foster a culture of open, constructive criticism and commentary. For example, journals should encourage the publication of comments on recent papers. In a flagship field journal, *JAERE*, we could find no published comments in the last five years.

– Journals should encourage and reward pre-registration of hypotheses and methodology, not just for experiments, but also for observational studies, for which pre-registrations are rare. We acknowledge in our article that pre-registration is no panacea for eliminating QRPs, but we also note that, in other fields, it has been shown to greatly reduce the frequency of large, statistically significant effect estimates in the “predicted” direction.

– Journals should also encourage and reward replications of influential, innovative, or controversial empirical studies. To incentivize such replications, we recommend that editors agree to review a replication proposal as a pre-registered report and, if satisfactory, agree to publish the final article regardless of whether it confirms, qualifies, or contradicts the original study.
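To see concretely why the underpowered designs described in point 1 produce exaggerated estimates, here is a minimal simulation sketch (illustrative numbers chosen to give roughly 33% power, not the paper's data): the estimates that happen to clear the 5% significance threshold overstate the true effect substantially on average.

```python
# Sketch of exaggeration under low power: only unusually large estimates reach p < 0.05.
import random

random.seed(1)
true_effect, se = 1.0, 0.66            # roughly 33% power for a two-sided 5% test
threshold = 1.96 * se                  # size an estimate must reach to be "significant"
estimates = [random.gauss(true_effect, se) for _ in range(100_000)]

significant = [b for b in estimates if abs(b) > threshold]
power = len(significant) / len(estimates)
exaggeration = sum(abs(b) for b in significant) / len(significant) / true_effect
print(f"power ~ {power:.2f}; significant estimates overstate the true effect by ~{exaggeration:.1f}x on average")
```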
+ +Ultimately, however, we will continue to rely on researchers to self-monitor their decisions concerning data preparation, analysis, and reporting. To make that self-monitoring more effective, greater awareness of good and bad research practices is critical. We hope that our publication contributes to that greater awareness. + +*Paul J. Ferraro is the Bloomberg Distinguished Professor of Human Behavior and Public Policy at Johns Hopkins University. Pallavi Shukla is a Postdoctoral Research Fellow at the Department of Environmental Health and Engineering at Johns Hopkins University. Correspondence regarding this blog can be sent to Dr. Shukla at [**pshukla4@jhu.edu**](mailto:pshukla4@jhu.edu).* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2020/09/15/ferraro-shukla-is-a-replicability-crisis-on-the-horizon-for-environmental-and-resource-economics/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2020/09/15/ferraro-shukla-is-a-replicability-crisis-on-the-horizon-for-environmental-and-resource-economics/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/fiala-is-the-evidence-for-the-lack-of-impact-of-microfinance-just-a-design-problem.md b/content/replication-hub/blog/fiala-is-the-evidence-for-the-lack-of-impact-of-microfinance-just-a-design-problem.md new file mode 100644 index 00000000000..f1f12a6818c --- /dev/null +++ b/content/replication-hub/blog/fiala-is-the-evidence-for-the-lack-of-impact-of-microfinance-just-a-design-problem.md @@ -0,0 +1,48 @@ +--- +title: "FIALA: Is the Evidence for the Lack of Impact of Microfinance Just a Design Problem?" +date: 2020-01-02 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Economic development" + - "Ex-post power calculations" + - "Microcredit" + - "microfinance" + - "Randomized controlled trials (RCTs)" + - "Statistical power" +draft: false +type: blog +--- + +###### Microfinance is one of the most hotly debated interventions in developing countries over the last 20 years. These are generally small loans, often given to women with short repayment periods and high interest rates (though often much lower than local market rates). + +###### Proponents argue that the poor often are severely cash constrained and a little bit of money can help them to realize their economic potential. Researchers and policy makers have been worried about people being caught in debt traps, making their poverty worse. Prior to the rise of randomized control trials (RCTs) in economics the evidence either way was mostly anecdotal. + +###### Then some experimental evidence started to emerge, and the results were not encouraging. ***[Karlan and Zinman (2011)](https://science.sciencemag.org/content/332/6035/1278?casa_token=D4RlrknUTnwAAAAA:LGmcoxY-qeU8ap6mSQ2pLc9-Arg3ze_I3herd8FRAM0xRkUXEbpPrzqQpsOQZB8Swx2QBPR6E0n8c6k)*** and then ***[six other RCT studies](https://www.aeaweb.org/issues/360)*** published in a 2015 special issue of the *American Economic Journal: Applied Economics* found no statistically significant economic impacts from giving people small loans. In Fiala (2018) I found some positive short-run impacts, but only for men. + +###### However, not all evidence, even experimental, is created equal. 
In ***[Dahal and Fiala (2020)](https://www.sciencedirect.com/science/article/abs/pii/S0305750X1930422X)*** we closely re-analyze these eight papers to examine just how well these studies were designed to answer the questions they wanted to answer. We find that the lack of statistical significance is likely not due to a lack of impacts. Rather, the problem is that these studies are extremely underpowered.

###### Individual coefficients are actually quite large, but the standard errors are even bigger. Ex-post power calculations for each of the studies show the minimum detectable effect (MDE) size for main outcomes is up to 1,000%. The median (mean) MDE is 132% (201%). The authors find effects closer to 30%, a large impact but far from what is needed to be statistically significant.

###### Why are these studies so underpowered? One of the biggest reasons is that there is significant non-compliance. Take-up rates of loans in the treatment groups are generally low, and the difference in loan take-up between treatment and control groups is often tiny. Three of the *American Economic Journal: Applied Economics* papers have net compliance rates of less than 10 percentage points, making reasonable inference almost impossible.

###### While some of the authors of the original studies acknowledge potential issues of low power, they never quantify them. This lack of transparency has led many people, including the original authors, to describe the null results as precisely estimated.

###### Our analysis opens up a bigger problem within experimental methods in general: RCTs can be the gold standard of inference, but only when designed and implemented properly on questions they can be used to answer.

###### The design problems in microfinance research aren’t due to low-quality researchers running poor-quality studies. Two of the recent Nobel Prize winners in Economics are authors of these studies, and one of them was the editor of the *American Economic Journal: Applied Economics* when these studies were published. One of the papers had been a working paper for years. Even looking back at the standards of 2015, none of these studies met what was considered appropriate quality.

###### Although these papers do not show a “transformative” impact of microfinance on the lives of poor households, careful reading of the papers reveals that they also do not discredit the role of microcredit in poverty alleviation and improving the livelihoods of poor households.

###### The main conclusion of Dahal and Fiala (2020) is that we actually have no idea about the impact of microfinance on the lives of poor people, because there is not a single study looking at the impact of the traditional microfinance model that is designed well enough to answer this question.

###### *Nathan Fiala is an assistant professor at the University of Connecticut, honorary senior lecturer at Makerere University in Uganda, and a research fellow at RWI in Essen, Germany. He can be contacted at [nathan.fiala@uconn.edu](mailto:nathan.fiala@uconn.edu).*
\ No newline at end of file diff --git a/content/replication-hub/blog/fidler-mody-the-replicats-bus-where-it-s-been-where-it-s-going.md b/content/replication-hub/blog/fidler-mody-the-replicats-bus-where-it-s-been-where-it-s-going.md new file mode 100644 index 00000000000..4e863b69841 --- /dev/null +++ b/content/replication-hub/blog/fidler-mody-the-replicats-bus-where-it-s-been-where-it-s-going.md @@ -0,0 +1,100 @@ +--- +title: "FIDLER & MODY: The repliCATS Bus – Where It’s Been, Where It’s Going" +date: 2019-08-19 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "AIMOS conference" + - "DARPA" + - "IDEA Protocol" + - "replication" + - "RepliCATS" + - "Scientific reliability" + - "SCORE" + - "SIPS workshop" +draft: false +type: blog +--- + +###### For several years now scientists—in at least some disciplines—have been concerned about low rates of replicability. As scientists in those fields, we worry about the development of cumulative knowledge, and about wasted research effort. An additional challenge is to consider end-users (decision and policy makers) and other consumers of our work: what level of trust should they place in the published literature[1]? How might they judge the reliability of the evidence base? + +###### With the latter questions in mind, our research group recently launched ***[‘The repliCATS project’ (Collaborative Assessments for Trustworthy Science)](https://replicats.research.unimelb.edu.au/)***. In the first phase, we’re focussing on eliciting predictions about the likely outcomes of direct replications of 3,000 (empirical, quantitative) research claims[2] in the social and behavioural sciences[3]. A subset of these 3,000 research claims will be replicated by an independent team of researchers, to serve as an evaluation of elicited forecasts. + +###### The repliCATS project forms part of a broader program of replication research which will cover 8 disciplines: business, criminology, economics, education, political science, psychology, public administration, and sociology. The broader program—Systematizing Confidence in Open Research and Evidence, or SCORE—is funded by the US department of defense (DARPA). + +###### It is an example of end user investment in understanding and improving the reliability of our scientific evidence base. It is unique in its cross-disciplinary nature, in scale, ***[as well as its ultimate ambition: to explore the ability to apply machine learning to rapidly assess the reliability of a published study](https://www.natureindex.com/news-blog/adam-russell-the-search-for-automated-tools-to-rate-research-reproducibility)***. And as far as we know, it is the first investment (at least of its size) in replication studies and related research to come from end users of research. + +###### The repliCATS project uses a structured group discussion—rather than a prediction market or a survey—called the **IDEA** protocol to elicit predictions about replicability. + +###### Working in groups of 5-10, diverse groups of participants first **Investigate** a research claim, answering three questions: (i) about how comprehensible the claim is; (ii) whether the underlying effect described in the research claims seems real or robust; and (iii) then making a private estimate of likelihood of a successful direct replication. + +###### They then join their group, either in a face-to-face meeting or in remote, online groups, to **Discuss**. 
Discussions start with the sharing of private estimates as well as the information and reasoning that went into forming those estimates. The purpose of the discussion phase is for the group to share and cross-examine each other’s judgements; it is not to form consensus.

###### After discussion has run its course, researchers are then invited to update their original **Estimates**, if they wish, providing what we refer to as a ‘round 2 forecast’. These round 2 forecasts are made privately, and not shared with other group members.

###### Finally, we will mathematically **Aggregate** these forecasts. For this project, we are trialling a variety of aggregation methods, ranging from unweighted linear averages to aggregating log odds transformed estimates (see figure below).

###### *[Figure: the range of aggregation methods being trialled by the repliCATS project]*

###### Some previous replication projects have run prediction markets and/or surveys alongside. Over time, these have become more accurate, particularly in the case of the social science replication project (of *Science* and *Nature* papers). Our project departs from these previous efforts, not only by using a very different method of elicitation, but also in the qualitative information we gather about reasoning, information sharing, and the process of updating beliefs (following discussion).

###### **575 claims assessed in our first local IDEA workshop**

###### Earlier this month, we ran our first set of large face-to-face IDEA groups, prior to the Society for Improving Psychological Science (SIPS) conference. 156 researchers joined one of 30 groups, each with a dedicated group facilitator. Over two days, those groups evaluated 575 published research claims (20-25 per group) in business, economics and psychology, making a huge contribution to our understanding of:

###### – those published claims themselves,

###### – how participants reason about replication, what information cues and heuristics they use to make such predictions, including what counterpoints make them change their minds, and

###### – the research community’s overall beliefs about the state of our published evidence base.

###### We’ve also started to learn about how researchers evaluate claims within their direct field of expertise versus slightly outside that scope. We don’t yet know, or necessarily expect, that there will be differences in accuracy, but there do seem to be differences in approach and subjective levels of confidence.

###### **What happens to those predictions? How accurate were they?**

###### The short answer is that we wait. As discussed above, the repliCATS project is part of a larger program. What happens next is that a subset of those 3,000 claims will be fully replicated by an independent research team, serving as evaluation criteria for the accuracy of our elicited predictions.

###### In about a year’s time, we’ll know how accurate those predictions are. (We’re hoping for at least 80% accuracy.) Our 3,000 predictions will also be used to benchmark machine learning algorithms, developed by other (again, independently funded by DARPA) research teams.

###### Following our first workshop, our repliCATS team now has a few thousand probabilistic predictions, and associated qualitative reasoning data to get stuck into. It’s an overwhelming amount of information, and barely one fifth of what we’ll have this time next year!
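###### For readers curious about the **Aggregate** step, the difference between a plain linear average and averaging on the log-odds scale can be seen in a few lines of Python (a hypothetical sketch with invented forecasts, not the repliCATS code):

```python
# Two ways to pool one group's replication-probability forecasts (hypothetical numbers).
import math

def linear_pool(probs):
    return sum(probs) / len(probs)

def log_odds_pool(probs):
    logits = [math.log(p / (1 - p)) for p in probs]          # transform each forecast to log-odds
    return 1 / (1 + math.exp(-sum(logits) / len(logits)))    # average, then map back to a probability

group = [0.15, 0.35, 0.60, 0.70, 0.80]    # one group's private round-2 forecasts
print(round(linear_pool(group), 3), round(log_odds_pool(group), 3))
```

###### Which pooling rule performs best is exactly the kind of question the accuracy evaluation described above is designed to answer.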
###### **Feedback on the SIPS workshop**

###### As you’ve probably gathered, the success of our project relies heavily on attracting large numbers of participants interested in assessing research claims. So it was hugely heartening for us that the researchers who joined our SIPS workshop gave us very positive overall feedback about the experience.

###### They particularly enjoyed the core task of thinking about what factors or features of a study contribute to likely replicability (or not).

###### Early career researchers in particular also appreciated the chance to see and discuss others’ ratings and reasoning, and told us that the workshop has helped build their confidence about writing peer reviews in the future. (In fact, several of us came to the opinion that something like our IDEA protocol would make a good substitute for the current peer review process in some places!)

###### To hear what others thought, check out ***[@repliCATS on Twitter](https://twitter.com/replicats)***.

###### **What’s next**

###### Our next large-scale face-to-face workshop will take place at the ***[Association for Interdisciplinary Metaresearch and Open Science (AIMOS) conference in Melbourne, Australia, this November](https://www.aimos2019conference.com/)***.

###### In the meantime, we’re ready to deploy our “repliCATS bus” (we’ll come to you, or help you run smaller-scale workshops at your institution), offering you the opportunity to join ‘remote IDEA groups’ online.

###### **Get in touch**

###### Here is how:

###### [***repliCATS-project@unimelb.edu.au***](mailto:repliCATS-project@unimelb.edu.au) | [***https://replicats.research.unimelb.edu.au***](https://replicats.research.unimelb.edu.au) | ***@repliCATS***

###### *Fiona Fidler is an Associate Professor at the University of Melbourne with joint appointments in the School of BioSciences and the School of Historical and Philosophical Studies (SHAPS). Fallon Mody is a postdoctoral research fellow in the Department of History and Philosophy of Science at the University of Melbourne.*

###### [1] Note that here we are specifically concerned with trust in ‘the published literature’ and not trust in science more broadly, or in scientists themselves. The published literature is as much created by the publishing industry as it is by scientists and other scholars.

###### [2] In this project, a “research claim” has a very specific meaning: it is used to describe a single major finding from a published study – for example, a journal article – as well as details of the methods and results that support this finding.

###### [3] In subsequent phases, we’ll be thinking about conceptual replications, generalisability, and other factors that build confidence for end users of research.
\ No newline at end of file diff --git a/content/replication-hub/blog/findley-jensen-malesky-pepinsky-nothing-up-with-acai-berries-some-reflections-on-null-results-from-a.md b/content/replication-hub/blog/findley-jensen-malesky-pepinsky-nothing-up-with-acai-berries-some-reflections-on-null-results-from-a.md new file mode 100644 index 00000000000..32a8045bf0b --- /dev/null +++ b/content/replication-hub/blog/findley-jensen-malesky-pepinsky-nothing-up-with-acai-berries-some-reflections-on-null-results-from-a.md @@ -0,0 +1,61 @@ +--- +title: "FINDLEY, JENSEN, MALESKY, & PEPINSKY: Nothing Up with Acai Berries: Some Reflections On Null Results from a Results-Free Peer Review Process" +date: 2016-09-18 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Comparative Political Studies" + - "Edmund Malesky" + - "Michael Findley" + - "Nathan Jensen" + - "Results-Free" + - "Results-Free Peer Review process" + - "Thomas Pepinsky" +draft: false +type: blog +--- + +###### In the academy and well beyond, the problem of null results has become quite significant. Indeed, discussions of null results have made their way as far as TV commentator John Oliver’s ***[recent discussion of science](https://www.youtube.com/watch?v=0Rnq1NpHdmw)*** in which he poignantly notes that people generally do not like to hear about null results. And yet, maybe we would all be better off – with more money in the bank – if his headline “Nothing Up With Acai Berries” actually made it to the general public and we embraced it ***[(see NIH)](https://nccih.nih.gov/health/acai/ataglance.htm)***. + +###### This is not just a problem in health, the sciences in general struggle with how to engage null results. Social scientists are no exception. Our interest in the topic led to a special issue of *Comparative Political Studies,* a leading political science journal, in which we solicited results free submissions and conducted the entire review process with all decisions made in the absence of results. (See our original ***[call for papers](https://www.washingtonpost.com/blogs/monkey-cage/wp/2014/09/24/can-greater-transparency-lead-to-better-social-science/?tid=a_inl)***).  This meant that reviewers and editors could not condition their decisions on significant results, thereby allowing a greater opportunity for null results to end up in published manuscripts. This special issue has been accepted and three results free papers will be published along with our ***[introduction](http://cps.sagepub.com/content/early/2016/07/01/0010414016655539.abstract)***. + +###### The exercise demonstrated in practice that papers with developed theories and designs, but that ultimately ended up with null results, could make it through review and into print.  One of the three published papers documented statistically insignificant treatment effects in its main experimental interventions. This in itself was a huge success of the special issue—revealing that research may find null results, and allowing readers to learn from them. But the process was also quite instructive on the challenges of evaluating work where null results have a greater probability of being published. We found the authors’, and especially the reviewers’, comments on this process illuminating. + +###### We offer two interrelated suggestions from our pilot to help make null results more prominent in peer-reviewed publications. The first has to do with acclimating reviewers to a new way of thinking about null findings—that they may be meaningful theoretically. 
The second has to do with helping authors frame their prospective work, so that null results can be read as meaningful contributions. + +###### The first problem unveiled by our pilot was that reviewers have become conditioned to view null results as empirically suspect. It seemed especially difficult for referees to accept that potential null findings might mean that a theory fails to explain the phenomenon being investigated. Rather, it seemed that reviewers feel more comfortable when they can simply interpret null results as evidence of mechanical problems with how the hypothesis was tested (low power, poor measures, etc.). Tellingly, many reviewers described null results as “non-findings,” suggesting that one learns nothing from results that are not consistent with a directional hypothesis. Making this distinction, of course, is one of the main benefits of results free peer review. + +###### Perhaps the single most compelling argument in favor of results-free peer review is that it allows for findings that a hypothesized relationship is not found in the data. Yet, our reviewers pushed back against making such calls. They appeared reluctant to endorse manuscripts in which null findings were possible, or if so, to believe those potential null results might be interpreted as evidence against the existence of a hypothesized relationship. For some reviewers, this was a source of some consternation: Reviewing manuscripts without results made them aware of how they were making decisions based on the strength of findings, and also how much easier it was to feel “excited” by strong findings. + +###### The second problem that our pilot revealed has to do with how authors discuss their contributions. The main reason for rejection in our results free review process was that authors failed to explain their projects in ways that made a potential null result theoretically compelling. + +###### Again and again, reviewers posed some version of the question: If the tested hypotheses proved insignificant, would that move debates in this sub-literature forward in any way? In many of the rejected papers and even one of the accepted papers, the answer was no. There were three reasons that reviewers reached this conclusion. First, a null finding would not be interesting because the reviewer found the theory to be implausible in the first place. Proving that the implausible was in fact implausible is not a recipe for scintillating scholarship. + +###### The second was a variant of Occam’s razor. Reviewers did not believe that the author had adequately accounted for the simpler, alternative theory to explain the underlying puzzle that motivated their research. In this instance, a null result would only reinforce the notion that the more parsimonious theory was superior, or that a natural experiment was confounded by unobservable selection. + +###### Third, there was too much distance between the articulated theory and the abstract field, lab-in-field, or survey experiment articulated in the paper. The theory invoked a compelling concept, but the proposed research design failed to adequately capture it or stretched the meaning of the concept to the point of unrecognizability. In this case, a null result would only prove the empirical test was inadequate for the bigger question. + +###### None of these dismissals of proposed research plans are new problems or unique to results-free review. They are a standard part of the way scholars evaluate research. 
The interesting implications for results-free review manifest themselves in how strategic authors may alter their research agenda to survive the review process. Introducing a laundry list of hypotheses and potential heterogeneous effects will not suffice. Our reviewers were quick to spot and reject this type of “hypothesis trolling.”

###### Three author strategies would seem most plausible for articulating work that makes null results compelling. First, authors might place themselves between two competing theories with contrasting observable implications, posing their research design as the distinguishing test. For example, does fiscal decentralization decrease or increase corruption? Here, a null finding might rule out one of the competing hypotheses (although concerns about statistical power might still appear). Second, authors may offer their research design as the first or a better test of a prevailing theory or logic that has been inadequately tested in the literature. The theory of deliberative democracy, for instance, offers a number of very clear implications about how deliberation should affect the thinking and behavior of citizens, yet most of these have been subjected to only limited empirical testing. If designed properly, this would be interesting purely because the potential target would be well known. Again, however, reviewers reacted quite negatively to this type of approach. Most referees wanted authors to build on the existing literature in important ways or to thoroughly explain why the observational work of previous generations was flawed. Finally, authors might offer a test of a hypothesis that is the next logical step within a prevailing and well-traveled research paradigm.

###### There is a clear drawback to the way results-free review prioritizes this type of moderate theoretical progress and testing — what Kuhn referred to as normal science. Knowing that they have to convince a skeptical reviewer that a null finding is interesting, scholars may choose to abjure big questions and paradigm-shifting scholarship for incremental research designs. There is a reasonable debate to be had about the proper balance between normal science and big questions, and certainly outlets need to be available for the next big breakthrough.

###### Outside the academy, these problems may be magnified. Economists or political scientists working at the Bureau of Labor Statistics or the World Bank, for example, may face an explicit peer review process, or may simply face scrutiny from policymakers or funders who carry their own mental theoretical models within which they adjudicate the (likely) results from researchers. If such researchers see the possibility of null results, they may similarly attempt to strategize about the types of phenomena they track, the sorts of programs they endorse or evaluate, or, more basically, avoid potentially innovative policy approaches in favor of incremental programming and evaluation.

###### In our view, the more immediate problem, however, is that publication bias in the social sciences is impeding both normal science *and* the next big thing. We can’t even make incremental progress on the critical questions of our day without a clear documentation of all the research paths that did not prove fruitful. Knowing about the failed tests is just as valuable as learning about successes when we envision new research projects.
Worse, the proliferation of successful tests means it is hard to identify the truly path-breaking findings and, even more importantly, to trust that we can build upon them.

###### These observations point to an important conclusion from our exercise: scientists engaged in null hypothesis significance testing lack a coherent framework for thinking about what null results actually mean, and how to build them into a cumulative scientific enterprise. Bayesians have long criticized null hypothesis significance testing for this very reason. Our exercise proved useful in unexpected ways for bringing this problem to light, and points to the need for a more robust and honest discussion about why scientists are so eager to dismiss null findings as “non-results,” and the implications for collective efforts across the disciplines.

###### There is room for disagreement about the best approach. Indeed, there will be a variety of benefits and costs to admitting null results more equally into scholarly and policy debates. We endorse practices that will allow null results to become more central in and out of the academy, but we suspect that a robust discussion lies ahead regarding the complexities of author, reviewer, and editor incentives for producing and evaluating them.

###### *Michael Findley is associate professor in the Department of Government and the LBJ School of Public Affairs (courtesy) at the University of Texas at Austin. Nathan Jensen is professor in the Department of Government at the University of Texas at Austin. Edmund Malesky is professor in the Department of Political Science at Duke University. Thomas Pepinsky is associate professor in the Department of Government at Cornell University.*
\ No newline at end of file diff --git a/content/replication-hub/blog/gaarder-jimenez-replication-research-on-financial-inclusion-in-developing-countries.md b/content/replication-hub/blog/gaarder-jimenez-replication-research-on-financial-inclusion-in-developing-countries.md new file mode 100644 index 00000000000..4a8ee4d80d5 --- /dev/null +++ b/content/replication-hub/blog/gaarder-jimenez-replication-research-on-financial-inclusion-in-developing-countries.md @@ -0,0 +1,46 @@ +--- +title: "GAARDER & JIMENEZ: Replication Research on Financial Inclusion in Developing Countries" +date: 2020-04-16 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "3ie" + - "3ie replication programme" + - "Bill and Melinda Gates Foundation" + - "Financial inclusion" + - "Financial services" + - "Financial Services for the Poor (FSP)" + - "International Initiative for Impace Evaluation" + - "Journal of Development Effectiveness" + - "replication" + - "special issue" +draft: false +type: blog +--- + +###### Capture + +###### In support of recent efforts by social scientists to address the ‘***[reproducibility crisis](http://science.sciencemag.org/content/349/6251/aac4716.full?ijkey=1xgFoCnpLswpk&keytype=ref&siteid=sci)***’, the *Journal of Development Effectiveness* (JDEff) recently devoted a ***[special issue on replication research](https://www.tandfonline.com/toc/rjde20/current)*** studies in its last issue of 2019.  Most journals continue to favor new research rather than replication work and to publish research whose data and codes have not been tested. As editors of the journal, and as the past and current executive director of the International Initiative for Impact Evaluation (3ie) which hosts it, we felt that such an issue could increase awareness among funders and researchers of how replication strengthens the reliability, rigor and relevance of their investment.  It would also ensure that the replication research will be acknowledged and appreciated by the larger development community. + +###### The special issue was devoted to the topic of enhancing financial inclusion in developing countries.  ***[3ie, which has championed replication research](https://www.3ieimpact.org/our-expertise/replication)*** for many years, had worked closely with the Bill and Melinda Gates Foundation’s Financial Services for the Poor (FSP) program to identify the studies, screen the applicants, and quality assure the replication research in this important area.  FSP invests millions of dollars to broaden the reach of low-cost digital financial services for the poor by supporting the most catalytic approaches to financial inclusion, such as the development of digital payment systems, advancing gender equality, and supporting national and regional strategies.  In doing so, it relies heavily on research and evidence studies, many of which, although cited and referenced heavily, have not been replicated. + +###### We hope that these replications can be used appropriately by FSP and other stakeholders to inform future investments in an important part of the development toolkit – expanding financial services to the poor.  About ***[1.7 billion people worldwide are excluded](https://globalfindex.worldbank.org/)*** from formal financial services, such as savings, payments, insurance, and credit. In developing economies, nearly one billion are still left out of the formal financial system, and there is a 9 percent persistent gender gap in financial inclusion in developing economies. 
###### Most poor households, instead, operate almost entirely through a cash economy. They are cut off from potentially stabilizing and uplifting opportunities like building credit or getting a loan to start a business. And it is harder for them to weather common financial setbacks, such as serious illness, a poor harvest, or an economic downturn like the one the world is experiencing now due to the coronavirus epidemic. In fact, just last week, one of us, who had just moved from Delhi to Manila, was able to help his former Indian housekeeper with a cash transfer only because she had a bank account. It took all of 15 minutes to send her much-needed financial support, which would have been much more difficult otherwise. Millions of others have no access to such mechanisms.

###### The issue in JDEff replicates 6 important financial inclusion studies. These include a study on providing banking access to farmers; three studies that evaluated interventions to introduce innovative alternatives to traditional banking, such as using mobile phones or biometric smartcards as payment mechanisms for transfers; and two studies that examined the effects of different kinds of transfers (cash versus kind; conditional versus unconditional) that are distributed through financial institutions. Importantly, the replications were able to reproduce the principal results of all of the studies. It is as important to highlight this finding as it is to get notoriety through “gotcha” replications that appear to overturn results, such as in the [“***worm wars***”](https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0006940) of a few years ago.

###### The replications also make several useful findings about nuances that the original research may have missed, such as about heterogeneous effects.

###### Research must meet the higher bar of being good enough for decision-making that affects human lives (not merely good enough for publication). Organizations like 3ie, which consider replication an important tool for making research more rigorous, take the following lessons from this JDEff issue for doing replications in the future. One is to ensure that, beyond taking the original research at face value, enough attention is dedicated to results not reported (to avoid reporting bias), to the policy significance of the reported results, to reflections about possible rival explanations for the results, and to how the main variables were constructed. Another replication deficiency in current research relates to the replication of qualitative research. While there is increasing acceptance that replication of quantitative research is part of best practice by funders and journals, replication in the qualitative research field is nascent. In a new initiative, 3ie is partnering with the Qualitative Data Repository (Syracuse University) to get its assistance in archiving and sharing select qualitative data, to learn from the experience, and thereby to contribute to lessons and guidance on how to do this in the future. Finally, within the evidence architecture, it is worthwhile promoting ***[systematic reviews](https://developmentevidence.3ieimpact.org/)*** as a set of replications of studies in differentiated real-life settings.

###### *Marie Gaarder is the current Executive Director of the International Initiative for Impact Evaluation (3ie). Emmanuel Jimenez is a Senior Fellow at, and former Executive Director of, 3ie, and Editor-in-Chief of the Journal of Development Effectiveness.
They can be contacted, respectively, at **[mgaarder@3ieimpact.org](mailto:mgaarder@3ieimpact.org)** and **[ejimenez@3ieimpact.org](mailto:ejimenez@3ieimpact.org)**.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2020/04/16/jiminez-gaarder-replication-research-on-financial-inclusion-in-developing-countries/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2020/04/16/jiminez-gaarder-replication-research-on-financial-inclusion-in-developing-countries/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/garret-christensen-an-introduction-to-bitss.md b/content/replication-hub/blog/garret-christensen-an-introduction-to-bitss.md new file mode 100644 index 00000000000..df01f462a23 --- /dev/null +++ b/content/replication-hub/blog/garret-christensen-an-introduction-to-bitss.md @@ -0,0 +1,51 @@ +--- +title: "GARRET CHRISTENSEN: An Introduction to BITSS" +date: 2015-11-12 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "BITSS" + - "Garret Christensen" + - "Transparency" +draft: false +type: blog +--- + +###### The Berkeley Initiative for Transparency in the Social Sciences ([BITSS](http://www.bitss.org/)) was formed in late 2012 after a [meeting](http://www.bitss.org/annual-meeting/2012-2/) in Berkeley that led to the publication of an [article in *Science*](http://www.sciencemag.org/content/343/6166/30.long) on ways to increase transparency and improve reproducibility in research across the social sciences. BITSS is part of Berkeley’s Center for Effective Global Action ([CEGA](http://cega.berkeley.edu/)), and is led by development economist [Edward Miguel](http://emiguel.econ.berkeley.edu/) and advised by a group of [leaders in transparent research](http://www.bitss.org/about/leadership/) from economics, psychology, political science, and public health. + +###### Since our founding, we’ve worked to build a network of like-minded researchers, and focused on the following aspects of research transparency, which hopefully covers the entire lifecycle of a research project: + +###### – Registering Studies: Whether it is [clinicialtrials.gov](http://www.clinicaltrials.gov/), the [AEA’s registry](http://www.socialscienceregistry.org/), [EGAP’s registry](http://www.egap.org/design-registration/), or [3ie’s registry](http://ridie.3ieimpact.org/), creating a database of the universe of studies helps combat publication bias. + +###### – Writing Pre-Analysis Plans: Tying your hands a bit by pre-specifying the analysis you plan to run can reduce your ability to consciously or unconsciously mine the data for spurious results. + +###### – Replication and Meta-Analysis: We encourage researchers to conduct and publish replications and meta-analyses so we can build on existing work more systematically. + +###### – Reproducible Workflow: Organizing your research in a way that others (or just your future self) will be able to understand your code and re-run it to get the same results. + +###### – Sharing Data and Code: Put data, code, and adequate documentation in a trusted public repository so that others can more easily build off your work. 
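###### As a toy illustration of the “Reproducible Workflow” point above (our own sketch, not BITSS material), the core habit is a single script that fixes its random seed, derives every reported number from the raw inputs, and writes its outputs to a file that anyone can regenerate:

```python
# Minimal reproducible-workflow sketch: pinned seed, one script, regenerable output.
import json
import random
import statistics

random.seed(20151112)                                  # pin anything that uses randomness
raw = [random.gauss(0, 1) for _ in range(500)]         # stand-in for loading the raw data

results = {
    "n": len(raw),
    "mean": round(statistics.mean(raw), 4),
    "sd": round(statistics.stdev(raw), 4),
}

with open("results.json", "w") as f:                   # every reported number comes from this file
    json.dump(results, f, indent=2)
print(results)
```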
+ +###### To help spread the methods of more transparent and reproducible research, we’ve engaged in the following activities: + +###### – Manual of Best Practices: a how-to guide and [reference manual](https://github.com/garretchristensen/BestPracticesManual) for researchers interested in conducting reproducible research. + +###### – Semester-Long Research Transparency Course: taught by Edward Miguel as Econ 270D, the course is available on [YouTube](https://www.youtube.com/playlist?list=PL-XXv-cvA_iBN9JZND3CF91aouSHH9ksB) and we are working to make an interactive MOOC. + +###### – Summer Institute and Workshops: an annual [training for graduate students](http://www.bitss.org/training/training-2015/) and young researchers held each June in Berkeley, featuring lectures from eminent scholars in transparency plus hands-on training in dynamic documents, version control, and data sharing. + +###### – Annual Meeting and Conference Sessions: we host a conference in Berkeley with an open call for papers ([coming up December 10 and 11!](http://www.bitss.org/2015-bitss-annual-meeting/)) and have also organized sessions at past AEA/ASSA meetings and other conferences. This year we’re co-hosting a workshop on replication and transparency in San Francisco January 6-7, right after the AEA meeting. [Registration is open now!](http://ineteconomics.org/community/events/replication-and-transparency-in-economic-research) + +###### – Grants: We had a call for our Social Science Meta-Analysis and Research Transparency ([SSMART](http://www.bitss.org/ssmart/)) grants. Announcements of winners will be made soon, and we plan to have an additional call next year. + +###### – Prizes: We will soon be announcing the first winners of the [Leamer-Rosenthal Prizes for Open Social Science](http://www.bitss.org/prizes/leamer-rosenthal-prizes/)—young researchers who have been incorporating transparency in their work as well as established faculty who have been teaching transparency. + +###### If you’re interested in getting involved, we’d love to hear from you. (You can e-mail me: [garret@berkeley.edu](mailto:garret@berkeley.edu), or our Program Director Jen Sturdy [jennifer.sturdy@berkeley.edu](mailto:jennifer.sturdy@berkeley.edu).) We’re working on formalizing a Catalyst program where you could be an ambassador for transparency at your own university or institution and receive BITSS funding for workshops and trainings. Follow us on [our blog](http://www.bitss.org/) or on [Twitter (@UCBITSS)](http://www.twitter.com/UCBITSS) to hear the latest updates. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/11/12/garret-christensen-an-introduction-to-bitss/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/11/12/garret-christensen-an-introduction-to-bitss/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/gelman-some-natural-solutions-to-the-p-value-communication-problem-and-why-they-won-t-work.md b/content/replication-hub/blog/gelman-some-natural-solutions-to-the-p-value-communication-problem-and-why-they-won-t-work.md new file mode 100644 index 00000000000..3200b880f89 --- /dev/null +++ b/content/replication-hub/blog/gelman-some-natural-solutions-to-the-p-value-communication-problem-and-why-they-won-t-work.md @@ -0,0 +1,55 @@ +--- +title: "GELMAN: Some Natural Solutions to the p-Value Communication Problem—And Why They Won’t Work" +date: 2017-03-23 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Bayes factors" + - "Bayesian approach" + - "confidence intervals" + - "p-value" + - "significance testing" +draft: false +type: blog +--- + +###### [*NOTE: This is a repost of a blog that Andrew Gelman wrote for the blogsite **[Statistical Modeling, Causal Inference, and Social Science](http://andrewgelman.com/)***]. + +###### Blake McShane and David Gal recently wrote two articles (“***[Blinding us to the obvious? The effect of statistical training on the evaluation of evidence](http://www.blakemcshane.com/Papers/mgmtsci_pvalue.pdf)***” and “Statistical significance and the dichotomization of evidence”) on the misunderstandings of p-values that are common even among supposed experts in statistics and applied social research. + +###### The key misconception has nothing to do with tail-area probabilities or likelihoods or anything technical at all, but rather with the use of significance testing to finesse real uncertainty. + +###### As John Carlin and I write in ***[our discussion of McShane and Gal’s second paper](http://www.stat.columbia.edu/~gelman/research/published/jasa_signif_2.pdf)*** (to appear in the Journal of the American Statistical Association): + +###### Even authors of published articles in a top statistics journal are often confused about the meaning of p-values, especially by treating 0.05, or the range 0.05–0.15, as the location of a threshold. The underlying problem seems to be deterministic thinking. To put it another way, applied researchers and also statisticians are in the habit of demanding more certainty than their data can legitimately supply. The problem is not just that 0.05 is an arbitrary convention; rather, even a seemingly wide range of p-values such as 0.01–0.10 cannot serve to classify evidence in the desired way. + +###### In our article, John and I discuss some natural solutions that won’t, on their own, work: + +###### – Listen to the statisticians, or clarity in exposition + +###### – Confidence intervals instead of hypothesis tests + +###### – Bayesian interpretation of one-sided p-values + +###### – Focusing on “practical significance” instead of “statistical significance” + +###### – Bayes factors + +###### You can read ***[our article](http://www.stat.columbia.edu/~gelman/research/published/jasa_signif_2.pdf)*** for the reasons why we think the above proposed solutions won’t work. + +###### From our summary: + +###### We recommend saying No to binary conclusions . . . resist giving clean answers when that is not warranted by the data. . . . It will be difficult to resolve the many problems with p-values and “statistical significance” without addressing the mistaken goal of certainty which such methods have been used to pursue. 
+ +###### **P.S.** Along similar lines, Stephen Jenkins sends along the similarly-themed ***[article](https://academic.oup.com/esr/article/33/1/1/2739015/Sing-Me-a-Song-with-Social-Significance-The-MisUse)***, “‘Sing Me a Song with Social Significance’: The (Mis)Use of Statistical Significance Testing in European Sociological Research,” by Fabrizio Bernardi, Lela Chakhaia, and Liliya Leopold. + +###### *Andrew Gelman is a professor of statistics and political science and director of the Applied Statistics Center at Columbia University. He blogs at **[Statistical Modeling, Causal Inference, and Social Science](http://andrewgelman.com/)**.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/03/23/gelman-some-natural-solutions-to-the-p-value-communication-problem-and-why-they-wont-work/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/03/23/gelman-some-natural-solutions-to-the-p-value-communication-problem-and-why-they-wont-work/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/goldstein-more-replication-in-economics.md b/content/replication-hub/blog/goldstein-more-replication-in-economics.md new file mode 100644 index 00000000000..4ed98c61abe --- /dev/null +++ b/content/replication-hub/blog/goldstein-more-replication-in-economics.md @@ -0,0 +1,46 @@ +--- +title: "GOLDSTEIN: More Replication in Economics?" +date: 2016-10-27 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Development" + - "Econ Journal Watch" + - "Goldstein" + - "maren duvendack" + - "replication" + - "World Bank" +draft: false +type: blog +--- + +###### [*This blog originally appeared at the blogsite* ***[Development Impact](http://blogs.worldbank.org/impactevaluations/more-replication-economics)***] About a year ago, I ***[blogged](https://blogs.worldbank.org/impactevaluations/infinite-loop-failure-replication-economics)*** on a paper that had tried to replicate results on 61 papers in economics and found that in 51% of the cases, they couldn’t get the same result.   In the meantime, someone brought to my attention a paper that takes a wider sample and also makes us think about what “replication” is, so I thought it would be worth looking at those results.     The paper in question is by Maren Duvendack, Richard Palmer-Jones and Robert Reed and appeared last year in *Econ Journal Watch* (***[see here](https://econjwatch.org/articles/replications-in-economics-a-progress-report)***). The paper starts with an interesting history of replication in economics.   It turns out that replication goes pretty far back.   Duvendack and co. cite the introductory editorial to *Econometrica,*where Frisch wrote “In statistical and other numerical work presented in *Econometrica* the original raw data will, as a rule, be published, unless their volume is excessive.   This is important to stimulate criticism, control and further studies.”   That was in 1933.     Various journals have made similar affirmations of the need for replication over the years.   The *Journal of Human Resources* put it in its policy statement in 1990 – explicitly saying that it welcomed the submission of studies that replicated studies that had appeared in the *JHR* in the last five years.   But this is missing from the current policy, which focuses more on making data and code available with published papers.  
The *Journal of Political Economy* took a different approach, and had a “confirmations and contradictions” section from 1976-1999.   These explicit publication opportunities may have declined in recent times, but there has been a sharp surge in a different path to replication – the requirement that authors submit their code and dataset for a given paper.   Duvendack and co. find 27 journals that regularly publish data and code – and many of these are top journals.  The only development field journal that makes this list is the *World Bank Economic Review*.  In addition, many funders now require that, after a decent interval, the data they funded be made publicly available in its entirety.     Before we look at Duvendack and co.’s review of replication trends, it’s worth taking a short detour as to what exactly replication means.   Unfortunately, as it’s used in many conversations, it’s imprecise.  Michael Clemens has a very nice (and very precise)***[paper](http://www.cgdev.org/sites/default/files/CGD-Working-Paper-399-Clemens-Meaning-Failed-Replications.pdf)*** where he lays out a number of distinctions.   In this case, precision requires some verbosity, so hang on.  Clemens lays out four different types (in two groups): + +###### *Replication* (both sub-types use the same sampling distribution of parameter estimates and are looking for discrepancies that come from random chance, error, or fraud): + +###### – *Verification* – uses the same specification, same population and same sample + +###### – *Reproduction* – uses the same specification, same population but not the same sample + +###### *Robustness* (uses different sampling distribution for parameter estimates and is looking for discrepancies that come from changes in the sampling distribution – as Clemens notes they need not give identical results in expectation): + +###### – *Reanalysis* – uses a different specification, the same population and not necessarily the same sample + +###### – *Extension* – uses the same specification, different population and a different sample. + +###### Duvendack and co. are using a broader definition of replication (especially when compared to the paper I blogged on last year):  they’re including what Clemens calls robustness.   They go out, casting a wide net to look for replication studies (they include not only Google Scholar and the Web of Science, and the [Replication in Economics wiki](http://replication.uni-goettingen.de/wiki/index.php/Main_Page), but also suggestions from journal editors, their own collections and a systematic search of the top 50 economics journals).   This search gives them 162 published studies.   The time trend is interesting, as the figure below (reproduced from their paper) shows what could be an upwards trend: + +![trend](/replication-network-blog/trend.webp) + +###### One development journal that contributed significantly to these 162 studies was the *Journal of Development Studies.*We can’t tell how many other development papers are in the 162 since the rest of the main contributors are more general interest journals.   The *Journal of Applied Econometrics* (JAE) is the main overall contributor to this body of work – they clock in with 31 replications – in large part because they have a dedicated replication section which can consist of pretty short summaries.   Duvendack and co. then look at the characteristics of these replications.   I am going to focus on the non-JAE, non-experimental (as in experimental economics, not field experiments) studies which number 119. 
  About 61 percent of these studies are an “exact” replication or, to use Clemens’ taxonomy, a verification study.   55 percent of studies extend the original findings.  As might be expected, the majority of the 119 studies (73 percent) find a significant difference with the original result.   About 17 percent confirm the previous study and 10 percent are mixed.   And 26 percent of studies have a reply by the original study authors.      As Duvendack and co. point out, we shouldn’t think of the published studies as a random sample.   Which brings us to incentives.    Clearly getting confirmatory studies published in major journals is going to be hard, and particularly hard for simple verification studies.    Turning to results that disagree with the original study, Duvendack and co. speculate that some journals might be reluctant to publish contradictions of influential authors.  I can also imagine that younger researchers may be averse to taking on this particular challenge, given that influential senior researchers may show up in their career futures.   Returning to the journal side of the equation, it also doesn’t look particularly good for the journal to take down one of their own papers.  On another level, for journals the citation per page count is likely to be significantly lower for a replication than for an original paper (although Duvendack and co. suggest this could be alleviated by publishing very short replication papers with the longer paper as an online appendix).  Finally, the incentives for authors of the original study to make replication easy are pretty weak – if someone confirms your study it’s not really a big deal, but it’s a big deal if they don’t.     So all of these factors point towards the lower likelihood of replications.   There are a couple of factors that might make replications more likely.   The first is there are somewhat more communities to support this than there used to be.   Beyond the wiki mentioned above, Duvendack and co. have a list inside economics (e.g. 3ie’s work), but also for other disciplines.   In addition, the growth in online storage, the growth in computer processing power, plus the increasing number of journals requiring the posting of code and data lower the costs of replication dramatically.     And the trend is positive, so maybe there is some social support.   It will be interesting to see what the future brings. + +###### *Markus Goldstein is a development economist with experience working in Sub-Saharan Africa, East Asia, and South Asia. He is currently a Lead Economist in the Office of the Chief Economist for Africa at the World Bank.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/10/27/goldstein-more-replication-in-economics/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/10/27/goldstein-more-replication-in-economics/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/goodman-hold-the-bus.md b/content/replication-hub/blog/goodman-hold-the-bus.md new file mode 100644 index 00000000000..522f6214684 --- /dev/null +++ b/content/replication-hub/blog/goodman-hold-the-bus.md @@ -0,0 +1,53 @@ +--- +title: "GOODMAN: Hold the Bus!" 
+date: 2019-01-01 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Journal policies" + - "Nature journal" + - "null results" + - "Open Science" + - "Pre-registration" +draft: false +type: blog +--- + +###### A ***[recent news piece in Nature](https://www.nature.com/articles/d41586-018-07118-1)*** reported in glowing terms on the “first analysis of ‘pre-registered’ studies”, stating that “[pre-registration] seems to work as intended: to reduce publication bias for positive results.” There are reasons to be somewhat dubious about this claim. + +###### The analysis in question appears in a preprint, “***[Open Science challenges, benefits and tips in early career and beyond](https://psyarxiv.com/3czyt/)***”. The analysis is a small part of the paper, occupying about half a page of an 11-page document. The paper draws no strong claims from the data; the Nature story goes well beyond what the paper says, though I can easily believe the authors waxed proudly about their results when interviewed by the journalist. + +###### The preprint is really an essay on preregistration and true to its title discusses challenges and risks, esp. for early stage investigators, as well as potential benefits. The authors are proponents of preregistration and reach the expected conclusion that the benefits outweigh the risks. + +###### The angle of the Nature story is that preregistration cuts publication bias by increasing the proportion of null results that are published. The reporter drives the point home with a graphic (reproduced below) that vividly shows the increase in null findings in bright red: 55-66% with pre-registration vs. 5-20% without. Sounds convincing. + +![trn(goodman, 20190119)](/replication-network-blog/trngoodman-20190119.webp) + +###### + +###### But hold the bus! There are several problems. + +###### 1. The data comes from a biased sample. The first wave of pre-registrants are presumably people committed to the ideal who want to do it right. There is no rigorous way to extrapolate from this (or any) biased sample to the population as a whole. + +###### 2. The study confounds two factors: preregistration and journals’ willingness to publish null results. There’s no way to allocate the treatment effect without a study that separates the factors. Perhaps the results would be just as good if journals were eager to publish null findings that weren’t preregistered. + +###### 3. The output variable is, at best, a surrogate for what we actually want. Is our goal really to increase the number of null results in the literature? If so, there are many trivial ways to accomplish this. No. I suspect the true goal is to improve the quality of science. The proportion of nulls is somehow thought to be an indicator of quality, although I’m not aware of any evidence to support this claim. + +###### The research enterprise is a complex dynamic system driven by economic forces. Before mucking about with something as central as the criteria for publication, one needs to consider long term unintended consequences. Unless we repeal publish-or-perish, everyone in the field will continue publishing in order to keep their jobs. Unless we improve the quality of people in the field or give them more money or time to do research, we’ll be stuck with the same researchers publishing the best papers they can with limited resources. + +###### Preregistration will make it harder and more costly to do research and the likely first-order effect will be to reduce the amount of research. 
If the change preferentially makes it harder to do good research, the outcome will be a classic unintended consequence: we will worsen, not improve, overall research quality. + +###### Are you willing to risk these unintended consequences without a proper controlled study of the proposed change? I hope not. + +###### This begs the question of how to do such a study as there is no obvious way to blind participants as to whether they’re in the treatment (preregistration) or control group. An alternative is to look at the experience in other fields where preregistration has been in effect for a long time, for example, medical research. To date, this has not been subject to rigorous study. However, it is this author’s assessment that the outcome is not as positive as the Nature report suggests. + +###### *Nat Goodman is a retired computer scientist living in Seattle Washington. His working years were split between mainstream CS and bioinformatics and orthogonally between academia and industry. As a retiree, he’s working on whatever interests him, stopping from time-to-time to write papers and posts on aspects that might interest others. He can be contacted at natg@shore.net.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/01/01/goodman-hold-the-bus/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/01/01/goodman-hold-the-bus/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/goodman-ladies-and-gentlemen-i-introduce-to-you-plausibility-limits.md b/content/replication-hub/blog/goodman-ladies-and-gentlemen-i-introduce-to-you-plausibility-limits.md new file mode 100644 index 00000000000..55c8b14ce93 --- /dev/null +++ b/content/replication-hub/blog/goodman-ladies-and-gentlemen-i-introduce-to-you-plausibility-limits.md @@ -0,0 +1,66 @@ +--- +title: "GOODMAN: Ladies and Gentlemen, I Introduce to You, “Plausibility Limits”" +date: 2019-12-02 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "confidence intervals" + - "p-values" + - "Plausibility limits" + - "Sampling distributions" + - "significance testing" + - "Uncertainty" +draft: false +type: blog +--- + +###### *Confidence intervals get top billing as the alternative to significance. But beware: confidence intervals rely on the same math as significance and share the same shortcominings. Confidence intervals don’t tell where the true effect lies even probabilistically. What they do is delimit a range of true effects that are broadly consistent with the observed effect.* + +###### Confidence intervals, like p-values and power, imagine we’re repeating a study an infinite number of times, drawing a different sample each time from the same population. Though unnatural for basic, exploratory research, it’s a useful mathematical trick that let’s us define the concept of *sampling distribution* – the distribution of expected results – which in turn is the basis for many common stats. The math is the same across the board; I’ll start with a pedantic explanation of p-values, then generalize the terminology a bit, and use the new terminology to explain confidence intervals. + +###### Recall that the (two-sided) p-value for an observed effect *dobs* is the probability of getting a result as or more extreme than *dobs* *under the null*. “Under the null” means we assume the population effect size *dpop=0*. 
In math terms, the p-value for *dobs* is the tail probability of the sampling distribution – the area under the curve beyond *dobs* – times *2* to account for the two sides. Recall further that we declare a result to be *significant* and *reject the null* when the tail probability is so low that we deem it implausible that *dobs* came from the null sampling distribution. + +###### Figure 1a shows a histogram of simulated data overlaid with the sampling distribution for sample size *n=40* and *dpop=0*. I color the sampling distribution by p-value, switching from blue to red at the conventional significance cutoff of *p=0.05*. The studies are simple two group difference-of-mean studies with equal sample size and standard deviation, and the effect size statistic is standardized difference (aka *Cohen’s d*). The line at *dobs=0.5* falls in the red indicating that we deem the null hypothesis implausible and reject it. + +###### TRN1(20191202) + +![TRN2(20191202)](/replication-network-blog/trn220191202-1.webp) + +###### Figure 1b shows sampling distributions for *n=40* and several values of *dpop*. The coloring, analogous to Figure 1a, indicates how plausible we deem *dobs* given *n* and *dpop*. The definition of “plausibility value” (*plaus*) is the same as p-value but for arbitrary *dpop*. *dobs=0.5* is in the red for the outer values of *dpop* but in the blue for the inner ones. This means we deem *dobs=0.5* implausible for the outer values of *dpop* but plausible for the inner ones. The transition from implausible to plausible happens somewhere between *dpop=0* and *0.1*; the transition back happens between *0.9* and *1*. + +###### Plausibility is a function of three parameters: *n*, *dpop*, and *dobs*. We can compute and plot plaus-values for any combination of the three parameters. + +###### Figure 2a plots *plaus* vs. *dpop* for *dobs=0.5* and several values of *n*. The figure also shows the limits of plausible *dpop*s at points of interest. As with significance, we can declare “plausibility” at thresholds other than *0.05*. From the figure, we see that the limits gets tighter as we increase *n* or the plausibility cutoff. For *n=40* and the usual *0.05* cutoff, the limits are wide: [0.5, 0.94]. For *n=400* and a stringent cutoff of *0.5*, the limits are narrow: [0.45,0.55]. This makes perfect sense: (1) all things being equal, bigger samples have greater certainty; (2) a higher cutoff means we demand more certainty before deeming a result plausible, ie, we require that *dpop* be closer to *dobs*. + +###### TRN3(20191202) + +![TRN4(20191202)](/replication-network-blog/trn420191202-1.webp) + +###### The plausibility limits in Figure 2a are consistent with Figure 1b. Both figures say that with *n=40* and *dobs=0.5*, *dpop* must be a little more than *0* and a little less than *1* to deem the solution plausible. + +###### Now I’ll translate back to standard confidence interval terminology: *confidence level* is *1 – plaus* usually stated as a percentage; *confidence intervals* are plausibility limits expressed in terms of confidence levels. Figure 2b restates 2a using the standard terminology. It’s the same but upside down. This type of plot is called a *consonance curve*. + +###### A further property of confidence intervals, called the *coverage* property, states that if we repeatedly sample from a fixed *dpop* and compute the C% confidence intervals, C% of the intervals will contain *dpop*. Figure 3 illustrates the property for 95% confidence intervals, *n=40*, and *dpop=0.05*. 
The figure shows the sampling distribution colored by plaus-value, and confidence intervals as solid blue or dashed red lines depending on whether the interval covers *dpop*. I arrange the intervals along the sampling distribution for visual separation. + +###### TRN3(20191202) + +###### Many texts use the coverage property as the definition of confidence interval and plausibility limits as a derived property. This points the reader in the wrong direction: since C% of intervals cover *dpop*, it’s natural to believe there’s a C% chance that the interval computed from an observed effect size contains *dpop*. This inference is invalid: the interval delimits the range of *dpop*s that are close enough to *dobs* to be deemed plausible. This says nothing about probability. + +###### Stats likes strong words: “significance”, “power”, “confidence”. Words matter: “significant” suggests important; “power” suggests the ability to get the right answer; “95% confidence” suggests we’re pretty darn sure; “95% confidence interval” and “95% of intervals cover *dpop*” suggest a 95% chance that the true effect falls in the interval. None of these inferences are valid. + +###### Statistics is fundamentally about uncertainty. Why hide this behind the smog of strong words? There’s no surefire way to prevent the misuse of statistics, but better language can only help. Words that scream “weak and uncertain” – words like “plausible” – are a step in the right direction. + +###### COMMENTS PLEASE! + +###### Please post comments on ***[Twitter](https://twitter.com/gnatgoodman)*** or ***[Facebook](https://www.facebook.com/nathan.goodman.3367)***, or contact me by email [natg@shore.net](mailto:natg@shore.net). + +###### *Nat Goodman is a retired computer scientist living in Seattle Washington. His working years were split between mainstream CS and bioinformatics and orthogonally between academia and industry. As a retiree, he’s working on whatever interests him, stopping from time-to-time to write papers and posts on aspects that might interest others.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/12/02/goodman-ladies-and-gentlemen-i-introduce-to-you-plausibility-limits/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/12/02/goodman-ladies-and-gentlemen-i-introduce-to-you-plausibility-limits/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/goodman-reed-a-friendly-debate-about-pre-registration.md b/content/replication-hub/blog/goodman-reed-a-friendly-debate-about-pre-registration.md new file mode 100644 index 00000000000..30cd8ef1d32 --- /dev/null +++ b/content/replication-hub/blog/goodman-reed-a-friendly-debate-about-pre-registration.md @@ -0,0 +1,81 @@ +--- +title: "GOODMAN & REED: A Friendly Debate about Pre-Registration" +date: 2019-06-19 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Confirmation procedures" + - "Confirmatory data analysis" + - "Exploratory data analysis" + - "HARKing" + - "p-hacking" + - "Pre-registration" + - "publication bias" +draft: false +type: blog +--- + +###### *Background**: Nat Goodman is generally pessimistic about the benefits of pre-registration. Bob Reed is generally optimistic about pre-registration. 
What follows is a back-and-forth dialogue about what each likes and dislikes about pre-registration.* + +###### [**GOODMAN, Opening Statement**] We need to remember that science is a contradictory gimish of activities: creative and mundane; fuelled by curiosity and dogged by methodological minutia; fascinating and boring; rigorous and intuitive; exploratory, iterative, incremental, and definitive. + +###### Instead of trying to rein in the galloping horse we call science, we should be trying to spur it on. We should be looking for new methods that will make scientists more productive, able to produce results more quickly — and yes, this means producing more bad results as well as more good. + +###### [**REED, Opening Statement**] In a recent interview, Nicole Lazak, co-author of the ***[editorial](https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913)*** accompanying *The American Statistician’s* special issue on statistical significance, identified pre-registration as one of the “best ways” forward for science (see ***[here](https://retractionwatch.com/2019/03/21/time-to-say-goodbye-to-statistically-significant-and-embrace-uncertainty-say-statisticians/)***). + +###### The hope is that pre-registration will provide some discipline on researchers’ tendencies to “graze” through various research outputs in search of something interesting. It is precisely that kind of “grazing” that encourages the discovery of spurious relationships. As spurious relationships are the root cause of the “replication crisis”, pre-registration provides direct medicine for the sickness. + +###### **[GOODMAN]** Pre-registration is important for confirmatory research but irrelevant for exploratory work. The purpose of pre-registration is to eliminate *post hoc* investigator bias. To accomplish this, preregistered protocols must fully specify the study (including data analysis) with sufficient detail that a completely different team could carry out the work. This may sound over-the-top but is the norm in clinical trials of new drugs and focused social science replication projects. + +###### Many people support a soft form of pre-registration in which the preregistered protocol is simply a statement of intent and suggest that this form of pre-registration can be used for exploratory research. I don’t see the point. In my experience, exploratory research never goes as expected; we learn how to do the study by doing the study. Comparing the final method with the original plan is humbling to say the least. + +###### **[REED]** Effective pre-registration should be more than a statement of intent. It should clearly identify the goals of the research, the set of observations to be used, variables to be included in the analysis, and principles for modifying the analysis (e.g., criteria for eliminating outliers). The goal is to prevent HARKing and (possibly unconscious) p-hacking. + +###### Let me explain why I believe pre-registration can be effective in preventing HARKing and reducing p-hacking. + +###### A lot of research consists of looking for patterns in data. In other words, exploratory research. However, too often the patterns one observes are the results of random chance. This itself wouldn’t be so bad if there was a feasible way to adjust the statistical analysis to account for all the paths one had taken through the garden. Instead, researchers report the results of their exploratory analysis as if it were the one-shot, statistical experiment that significance testing presumes. 
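###### (A small simulation, added here for illustration rather than taken from the original exchange, makes the point concrete: regress a pure-noise outcome on ten pure-noise candidate predictors, keep whichever looks best, and a “significant” relationship turns up in a large share of datasets even though nothing is there.)

```r
# Illustration (not from the original post): unrestricted foraging over ten
# noise predictors yields a "significant" result far more often than 5%.
set.seed(1)
forage <- function(n = 100, k = 10) {
  y <- rnorm(n)                         # pure-noise outcome
  X <- matrix(rnorm(n * k), n, k)       # ten pure-noise candidate predictors
  # try each predictor separately and keep the smallest p-value
  min(sapply(1:k, function(j) summary(lm(y ~ X[, j]))$coefficients[2, 4]))
}
best_p <- replicate(2000, forage())
mean(best_p < 0.05)                     # roughly 1 - 0.95^10, i.e. about 40%
```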
+ +###### Pre-registration limits the number of paths one explores, making it less likely that one stumbles upon a random-induced pattern. Where one discovers something after departing from the pre-registration plan, it helps readers to properly categorize the finding as exploratory, rather than confirmatory. + +###### It is important to note that pre-registration does not preclude researchers from exploring data to look for interesting relationships. Rather, the goal of pre-registration is to get researchers to distinguish between confirmatory and exploratory findings when reporting their empirical results. In the former case, statistical inference is valid, assuming the researcher makes Bonferroni-type adjustments when conducting multiple tests. In the latter case, statistical inference is meaningless. + +###### There is some evidence that it works! Recent studies report that effect sizes are smaller when studies have been pre-registered, and that there are fewer significant findings (see [***here***](https://replicationnetwork.com/2019/06/12/not-only-that-effect-sizes-from-registered-reports-are-also-much-lower/) and ***[here](https://replicationnetwork.com/2019/06/11/positive-findings-are-drastically-lower-in-registered-reports/)***). + +###### **[GOODMAN]** There is also evidence that pre-registration has not worked. Early results from studies that have been pre-registered indicate that researchers have not been careful to distinguish exploratory from confirmatory results (see ***[here](https://replicationnetwork.com/2019/05/25/pre-registration-the-doctor-is-still-out/)***). There is good reason to believe that these early returns are not aberrations. + +###### According to your model, exploratory results should be viewed with greater scepticism than results from confirmatory analysis. But researchers who want to see their work published and have impact will always have an incentive to blur that distinction. + +###### I am not alone in my pessimism about pre-registration. Others have also expressed concern that pre-registration does not address the problem of publication bias (see ***[here](https://replicationnetwork.com/2019/06/05/another-economics-journal-pilots-pre-results-review/)***). + +###### Pre-registration is a non-cure for a misdiagnosed disease. Current scientific culture prizes hypothesis-driven research over exploratory work. But good hypotheses don’t emerge full-blown from the imagination but rather derive from results of previous work, hopefully extended through imaginative speculation. + +###### The reality is that the literature is filled with papers claiming to be “hypothesis-driven” but which are actually a composite of exploration, *post hoc* hypothesis generation, and weak confirmation. This is how science works. We should stop pretending otherwise. + +###### Let me get back to what I think is a fundamental contradiction in pre-registration. As I understand it, economics research often involves analysis of pre-existing data. Since the data exists before the study begins, the only way to avoid post-hoc pattern discovery is to prevent the investigator from peeking at the data before pre-registering his research plan. This seems infeasible: how can someone design a study using a dataset without pretty deep knowledge of what’s there? + +###### **[REED]** It’s not the peeking at the data which is the problem, it is estimating relationships. Suppose my data has one dependent variable, Y, and 10 possible explanatory variables, X1 to X10. 
Pre-registration is designed to reduce unrestricted foraging across all data combinations of X variables to find significant relationships with Y. It does this by requiring me to say in advance which relationships I will estimate. Yes, I must look at the data to see which variables are available and how many usable observations I have. No, this does not eliminate the value of a pre-registration plan. + +###### **[GOODMAN]** Pre-registration puts the emphasis on the wrong thing. Instead, greater emphasis should be placed on developing confirmation procedures. Devising good confirmation procedures is an important area of methodological research. For example, in machine learning the standard practice is to construct a model using one dataset and test the model on another dataset (if you have enough data) or through bootstrapping. This might just do the trick in fields like economics that depend on analysis of large preexisting databases. + +###### Further, as others have noted, the “fast spread of pre-registration might in the end block” other approaches to solving problems of scientific reliability because “it might make people believe we have done enough” (see ***[here](https://replicationnetwork.com/2019/06/05/another-economics-journal-pilots-pre-results-review/)***). + +###### **[REED]** I’m only somewhat familiar with the uses of bootstrapping, but I don’t think this can solve all problems related to p-hacking and HARKing. For example, if there is an omitted variable that is highly correlated with both an included variable and the dependent variable, the included variable will remain significant even if one bootstraps the sample. Thus, while these can be useful tools in the researcher’s toolbox, I don’t believe they are sufficiently powerful to preclude the use of other tools, like pre-registration. + +###### With regard to pre-registration potentially crowding out more effective solutions, I agree this is a possibility, but I’d like to think that researchers could do the scientific equivalent of chewing gum and walking at the same time by adopting pre-registration + other things. + +###### **[REED, Conclusion]** I think our “debate” has played out to the point of diminishing returns, so let me give my final spin on things. I think we both agree that pre-registration is not a silver bullet. First, we don’t want to tie researchers’ hands so they are prevented from exploring data. Second, pre-registration can be ignored and, worse, manipulated. These weaken its ameliorative potential. On these two points we both agree. + +###### Where we disagree is that Nat thinks there is only a negligible benefit to pre-registration for exploratory research, while I think the benefit can be substantial. In my opinion, the benefit accrues mostly to well-intentioned researchers who might accidentally wander around the garden of forking paths without appreciating how it diminishes the significance of their findings (both statistical and practical). While this won’t eliminate the problem of p-hacking and HARKing, I think requiring researchers to complete a pre-analysis plan will make well-intentioned researchers less likely to fall into this trap. And if you believe that most researchers are well-intentioned, as I do, that can lead to a significant improvement in scientific practice, and reliability. + +###### *Nat Goodman is a retired computer scientist living in Seattle Washington. His working years were split between mainstream CS and bioinformatics and orthogonally between academia and industry. 
He can be contacted at [natg@shore.net](mailto:natg@shore.net).* + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at [bob.reed@canterbury.ac.nz](mailto:bob.reed@canterbury.ac.nz).* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/06/19/goodman-reed-a-friendly-debate-about-pre-registration/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/06/19/goodman-reed-a-friendly-debate-about-pre-registration/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/goodman-systematic-replication-may-make-many-mistakes.md b/content/replication-hub/blog/goodman-systematic-replication-may-make-many-mistakes.md new file mode 100644 index 00000000000..dccc3bd41be --- /dev/null +++ b/content/replication-hub/blog/goodman-systematic-replication-may-make-many-mistakes.md @@ -0,0 +1,146 @@ +--- +title: "GOODMAN: Systematic Replication May Make Many Mistakes" +date: 2018-09-28 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Experimental Economics Replication Project" + - "Many Labs project" + - "Near exact replications" + - "replication" + - "Reproducibility Project" + - "Social Sciences Replication Project" +draft: false +type: blog +--- + +###### *Replication seems a sensible way to assess whether a scientific result is right. The intuition is clear: if a result is right, you should get a significant result when repeating the work; if it it’s wrong, the result should be non-significant. I test this intuition across a range of conditions using simulation. For exact replications, the intuition is dead on, but when replicas diverge from the original studies, error rates increase rapidly. Even for the exact case, false negative rates are high for small effects unless the samples are large. These results bode ill for large, systematic replication efforts, which typically prioritize uniformity over fidelity and limit sample sizes to run lots of studies at reasonable cost.* + +###### **INTRODUCTION** + +###### The basic replication rationale goes something like this: (1) many published papers are wrong; (2) this is a serious problem the community must fix; and (3) systematic replication is an effective solution. (In recent months, I’ve seen an uptick in pre-registration as another solution. That’s a topic for another day.) In this post, I focus on the third point and ask: viewed as a statistical test, how well does systematic replication work; how well does it tell the difference between valid and invalid results? + +###### By “systematic replication” I mean projects like ***[Many Lab](https://osf.io/89vqh/)***, ***[Reproducibility Project: Psychology (RPP)](https://osf.io/ezcuj/wiki/home)***, ***[Experimental Economics Replication Project (EERP)](https://experimentaleconreplications.com)***, and ***[Social Sciences Replication Project (SSRP)](http://www.socialsciencesreplicationproject.com)*** that systematically select studies in a particular field and repeat them in a uniform fashion. 
The main publications for these projects are ***[Many Lab](https://econtent.hogrefe.com/doi/full/10.1027/1864-9335/a000178)***, ***[RPP](http://science.sciencemag.org/content/349/6251/aac4716)***, ***[EERP](http://science.sciencemag.org/content/early/2016/03/02/science.aaf0918.full)***, ***[SSRP](https://www.nature.com/articles/s41562-018-0399-z)***. + +###### I consider a basic replication scheme in which each original study is repeated once. This is like ***[RPP](https://osf.io/ezcuj/wiki/home)*** and ***[EERP](https://experimentaleconreplications.com)***, but unlike ***[Many Lab](https://osf.io/89vqh/)*** as published which repeated each study 36 times and ***[SSRP](http://www.socialsciencesreplicationproject.com)*** which used a two-stage replication strategy. I imagine that the replicators are trying to closely match the original study (*direct replication*) while doing the replications in a uniform fashion for cost and logistical reasons. + +###### My test for replication success is the same as SSRP (what they call the *statistical significance criterion*): a replication succeeds if the replica has a significant effect in the same direction as the original. + +###### A replication is *exact* if the two studies are sampling the same population. This is an obvious replication scenario. You have a study you think may be wrong; to check it out, you repeat the study, taking care to ensure that the replica closely matches the original. Think ***[cold fusion](https://en.wikipedia.org/wiki/Cold_fusion)***. A replication is *near-exact* if the populations differ slightly. This is probably what systematic replication achieves, since the need for uniformity reduces precision. + +###### Significance testing of the replica (more precisely, the statistical significance criterion) works as expected for exact replications, but error rates increase rapidly as the populations diverge. This isn’t surprising when you think about it: we’re using the replica to draw inferences about the original study; it stands to reason this will only work if the two studies are very similar. + +###### Under conditions that may be typical in systematic replication projects, the rate of false positive mistakes calculated in this post ranges from 1-71% and false negative mistakes from 0-85%. This enormous range results from the cumulative effect of multiple unknown, hard-to-estimate parameters. + +###### My results suggest that we should adjust our expectations for systematic replication projects. These projects may make a lot of mistakes; we should take their replication failure rates with a grain of salt. + +###### The software supporting this post is open source and freely available in ***[GitHub](https://github.com/natgoodman/repwr)***. + +###### **SCENARIO** + +###### The software simulates *studies* across a range of conditions, combines pairs of studies into *pairwise replications*, calculates which replications pass the test, and finally computes false positive and false negative rates for conditions of interest. + +###### The studies are simple two group comparisons parameterized by sample size  and population effect size *dpop* (*dpop* ≥ *0*). For each study, I generate two groups of *n*random numbers. One group comes from a standard normal distribution with *mean = 0*; the other is standard normal with *mean = dpop*. I then calculate the p-value from a t-test. When I need to be pedantic, I use the term *study set* for the ensemble of studies for a given combination of *n*and *dpop*. 
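###### As a concrete sketch of the setup just described (an illustrative paraphrase, not the author’s repwr code on GitHub), one simulated study might look like this:

```r
# Sketch of one simulated study (paraphrasing the setup described above):
# two groups of n draws, observed Cohen's d, and a t-test p-value.
sim_study <- function(n, d_pop) {
  g1 <- rnorm(n, mean = 0)
  g2 <- rnorm(n, mean = d_pop)
  sd_pooled <- sqrt((var(g1) + var(g2)) / 2)   # equal n, so a simple average
  d_obs <- (mean(g2) - mean(g1)) / sd_pooled
  p     <- t.test(g1, g2, var.equal = TRUE)$p.value
  c(n = n, d_pop = d_pop, d_obs = d_obs, p = p)
}
set.seed(1)
sim_study(n = 40, d_pop = 0.5)
```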
+ +###### The program varies *n*  from 20 to 500 and *dpop* from 0 to 1 with 11 discrete values each (a total of 112 = 121 combinations). It simulates 104 studies for each combination yielding about 1.2 million simulated studies. An important limitation is that all population effect sizes are equally likely within the range studied. I don’t consider *publication bias* which may make smaller effect sizes more likely, or any prior knowledge of expected effect sizes. + +###### To generate pairwise replications, I consider all (ordered) pairs of study sets. For each pair, the software permutes the studies of each set, then combines the studies row-by-row. This multiplies out to 1212 = 14,641 pairs of study sets and almost 150 million simulated replications. The first study of the pair is the *original* and the second the *replica*. I consistently use the suffixes 1 and 2 to denote the original and replica respectively. + +###### Four variables parameterize each pairwise replication: *n1*, *n2*, *d1pop,* and *d2pop*. These are the sample and population effect sizes for the two studies. + +###### After forming the pairwise replications, the program discards replications for which the original study isn’t significant. This reflects the standard practice that non-significant findings aren’t published and thus aren’t candidates for systematic replication. + +###### Next the program determines which replications should pass the replication test and which do pass the test. The ones that *should pass* are ones where the original study is a true positive, i.e., *d1pop* ≠ 0. The ones that *do pass* are ones where the replica has a significant p-value and effect size in the same direction as the original. + +###### A *false positive replication* is one where the original study is a false positive (*d1pop* = 0) yet the replication passes the test. A *false negative replication* is one where the original study is a true positive (*d1pop* ≠ 0), yet the replication fails the test. The program calculates *false positive* and *false negative rates* (abbr. *FPR* and *FNR*) relative to the number of replications in which the original study is significant. + +###### My definition of which replications *should pass* depends only on the original study. A replication in which the original study is a false positive and the replica study a true positive counts as a false positive replication. This makes sense if the overarching goal is to validate the original *study*. If the goal were to test the *result* of the original study rather than the study itself, it would make sense to count this case as correct. + +###### To get “mistake rates” I need one more parameter: , the proportion of replications that are true. This is the issue raised in Ioannidis’s famous paper, ***[“Why most published research findings are false”](http://dx.plos.org/10.1371/journal.pmed.0020124)*** and many other papers and blog posts including ***[one by me](http://daniellakens.blogspot.nl/2017/10/science-wise-false-discovery-rate-does.html)***. The terminology for “mistake rates” varies by author. I use terminology adapted from ***[Jager and Leek](http://doi.org/10.1093/biostatistics/kxt007)***. The *replication-wise false positive rate* (*RWFPR*) is the fraction of positive results that are false positives; the *replication-wise false negative rate* (*RWFNR*) is the fraction of negative results that are false negatives. 
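###### The pass/fail logic and the two error rates can be sketched the same way (again an illustrative paraphrase of the description above, not the repwr code itself; the particular sample and effect sizes in the example calls are assumed values):

```r
# Sketch of the replication test and error rates described above. A replication
# passes if the replica is significant with the same sign as the original; rates
# are computed only over pairs whose original study was significant.
set.seed(1)
one_study <- function(n, d_pop) {                 # same setup as the sketch above
  g1 <- rnorm(n); g2 <- rnorm(n, mean = d_pop)
  d_obs <- (mean(g2) - mean(g1)) / sqrt((var(g1) + var(g2)) / 2)
  c(d_obs, t.test(g1, g2, var.equal = TRUE)$p.value)   # returns c(d, p)
}
rep_rate <- function(n1, n2, d1_pop, d2_pop, n_sim = 20000) {
  res <- replicate(n_sim, {
    o <- one_study(n1, d1_pop); r <- one_study(n2, d2_pop)
    c(orig_sig = o[2] < 0.05,
      pass     = r[2] < 0.05 && sign(r[1]) == sign(o[1]))
  })
  pass <- res["pass", res["orig_sig", ]]          # condition on significant originals
  if (d1_pop == 0) c(FPR = mean(pass)) else c(FNR = mean(!pass))
}
rep_rate(20, 150, 0.0, 0.0)   # exact replication of a null: FPR near 0.05/2 = 0.025
rep_rate(20, 150, 0.5, 0.5)   # exact replication of d = 0.5: FNR is small
rep_rate(20, 150, 0.0, 0.2)   # replica population shifted away from the null: FPR grows
```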
+ +###### **RESULTS** + +###### **Exact replications** + +###### A replication is *exact* if the two studies are sampling the same population; this means *d1pop* = *d*2*pop.* + +###### Figure 1 shows FPR for *n1* = 20  and *n2* varying from 50 to 500. The x-axis shows all four parameters using *d1*, *d2* as shorthand for *d1pop*, *d2pop*. *d1pop* = *d2pop* = 0  throughout because this is the only way to get false positives with exact replications. Figure 2 shows FNR for the same values of *n1* and *n2* but with *d1pop* = *d2pop* ranging from 0.1 to 1. + +###### I mark the conventionally accepted thresholds for false positive and negative error rates (0.05 and 0.2, resp.) as known landmarks to help interpret the results. I do **not** claim these are the right thresholds for replications. + +###### Fig1 + +![Fig2](/replication-network-blog/fig21.webp) + +###### For this ideal case, replication works exactly as intuition predicts. FPR is the significance level divided by 2 (the factor of 2 because the effect sizes must have the same direction). Theory tell us that *FNR* = 1 – *power* and though not obvious from the graph, the simulated data agrees well. + +###### As one would expect, if the population effect size is small, *n2*must be large to reliably yield a positive result. For *d =* 0.2, *n2* must be almost 400 in theory and 442 in the simulation to achieve *FNR* = 0.2; to hit *FNR* = 0.05, *n2* must be more than 650 (in theory). These seem like big numbers for a systematic replication project that needs to run many studies. + +###### **Near exact replications** + +###### A replication is *near-exact* if the populations differ slightly, which means *d1pop* and *d2pop*  differ by a small amount, *near*; technically, *abs*(*d1pop* – *d2pop*) ≤ *near*. + +###### I don’t know what value of *near* is reasonable for a systematic replication project. I imagine it varies by research area depending on the technical difficulty of the experiments and the variability of the phenomena. The range 0.1-0.3 feels reasonable. I extend the range by 0.1 on each end just to be safe. + +###### Figure 3 uses the same values of *n1*, *n2*, and *d1pop* as Figure 1, namely *n1* = 20, *n2* varies from 50 to 500, and *d1pop* = 0. Figure 4 uses the same values of *n1* and *n2* as Figure 2 but fixes *d1pop* = 0.5, a medium effect size. In both figures, *d2pop* ranges from *d1pop* – *near* to *d1pop* + *near* with values less than 0 or greater than 1 discarded. I restrict values to the interval [0,1] because that’s the range of *d* in the simulation. + +###### Fig3 + +###### Fig4 + +###### FPR is fine when *n2* is small, esp. when *near* is also small, but gets worse as *n2* (and *near*) increase. It may seem odd that the error rate increases as the sample size increases. What’s going on is a consequence of power. More power is usually good, but in this setting every positive is a false positive, so more power is bad. This odd result is a consequence of how I define correctness. When the original study is a false positive (*d1pop =* 0) and the replica a true positive (*d2pop* ≠ 0), I consider the replication to be a false positive. This makes sense if we’re trying to validate the original *study*. If instead we’re testing the *result* of the original study, it would make sense to count this case as correct. + +###### FNR behaves in the opposite direction: bad when *n2* is small and better as *n2* increases. 
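###### (A quick check of the exact-case arithmetic above, using base R rather than the simulation: since FNR = 1 − power for exact replications, the sample sizes quoted for *d* = 0.2 drop straight out of a standard power calculation.)

```r
# Exact replications: FNR = 1 - power, so the n2 needed for a target FNR is just
# a power calculation. Values are per-group sample sizes for a two-sided t-test.
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)$n  # ~394, i.e. FNR = 0.2
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.95)$n  # ~650, i.e. FNR = 0.05
```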
+ +###### To show the tradeoff between FPR and FNR, Figure 5 plots both error rates for *near* = 0.1 and *near* = 0.3. + +###### Fig5 + +###### For *near* =0.1, *n2* = 150 is a sweet spot with both error rates about 0.05. For *near* = 0.3, the crossover point is *n2* = 137 with error rates of about 0.15. + +###### FNR also depends on *d1pop* for “true” cases, i.e., when the original study is a true positive, getting worse when *d1pop* is smaller and better when *d1pop* is bigger. The table below shows the error rates for a few values of *n2*, *near*, and *d1pop*. Note that FPR only depends on *n2* and *near*, while FNR depends on all three parameters. The FNR columns are for different values of *d1pop* in true cases. + +###### Tab1 + +###### FNR is great for *d1pop* = 0.8, mostly fine for *d1pop* = 0.5, and bad for *d1pop* = 0.2. Pushing up *n2* helps but even when *n2* = 450, FNR is probably unacceptable for *d1pop* = 0.2. Increasing *n2* worsens FPR. It seems the crossover point above, *n2* = 137, is about right. Rounding up to 150 seems a reasonable rule-of-thumb. + +###### **Replication-wise error rates** + +###### The error rates reported so far depend on whether the original study is a false or true positive: FPR assumes the original study is a false positive, FNR assumes it’s a true positive. The next step is to convert these into replication-wise error rates: RWFPR and RWFNR. To do so, we need one more parameter: *prop.true*, the proportion of replications that are true. + +###### Of course, we don’t know the value of *prop.true*; arguably it’s the most important parameter that systematic replication is trying to estimate. Like *near* , it probably varies by research field and may also depend on the quality of the investigator. Some authors assume *prop.true* = 0.5, but I see little evidence to support any particular value. It’s easy enough to run a range of values and see how *prop.true* affects the error rates. + +###### The table below shows the results for *near* = 0.1, 0.3 as above, and *prop.true* ranging from 0.1 to 0.9. The RWFPR and RWFNR columns are for different values of *d1pop* in “true” cases, i.e., when the original study is a true positive. + +###### Tab2 + +###### Check out the top and bottom rows. The top row depicts a scenario where most replications are false (*prop.true* = 0.1) and the replicas closely match the original studies (*near*  = 0.1); for this case, most positives are mistakes and most negatives are accurate. The bottom row is a case where most replications are true (*prop.true* = 0.9) and the replicas diverge from the originals (*near* = 0.3); here most positives are correct and, unless *d1pop* is large, most negatives are mistakes. + +###### Which scenario is realistic? There are plenty of opinions but scant evidence. Your guess is as good as mine. + +###### **DISCUSSION** + +###### Systematic replication is a poor statistical test when used to validate published studies. Replication works well when care is taken to ensure the replica closely matches the original study. This is the norm in focused, one-off replication studies aiming to confirm or refute a single finding. It seems unrealistic in systematic replication projects, which typically prioritize uniformity over fidelity to run lots of studies at reasonable cost. If the studies differ, as they almost certainly must in systematic projects, mistake rates grow and may be unacceptably high under many conditions. 
+ +###### My conclusions depend on the definition of replication correctness, i.e., which replications *should pass*. The definition I use in this post depends only on the original study: a replication should pass if the original study is a true positive; the replica study is just a proxy for the original one. This makes sense if the goal is to validate the original *study*. If the goal were to test the *result* of the original study rather than the study itself, it would make sense to let true positive replicas count as true positive replications. That would greatly reduce the false positive rates I report. + +###### My conclusions also depend on details of the simulation. An important caveat is that population effect sizes are uniformly distributed across the range studied. I don’t consider *publication bias* which may make smaller effect sizes more likely, or any prior knowledge of expected effect sizes. Also, in the near exact case, I assume that replica effect sizes can be smaller or larger than the original effect sizes; many investigators believe that replica effect sizes are usually smaller. + +###### My results suggest that systematic replication is unsuitable for validating existing studies. An alternative is to switch gears and focus on generalizability. This would change the mindset of replication researchers more than the actual work. Instead of trying to refute a study, you would assume the study is correct within the limited setting of the original investigation and try to extend it to other settings. The scientific challenge would become defining good “other settings” – presumably there are many sensible choices — and selecting studies that are a good fit for each. This seems a worthy problem in its own right that would move the field forward no matter how many original studies successfully generalize. + +###### I’ve seen plenty of bad science up close and personal, but in my experience statistics isn’t the main culprit. The big problem I see is faulty research methods. Every scientific field has accepted standard research methods. If the methods are bad, even “good” results are likely to be wrong; the results may be highly replicable but wrong nonetheless. + +###### The quest to root out bad science is noble but ultimately futile. “Quixotic” comes to mind. Powerful economic forces shape the size and make-up of research areas. Inevitably some scientists are better researchers than others. But “Publish or Perish” demands that all scientists publish research papers. Those who can, publish good science; those who can’t, do the best they can. + +###### We will do more good by helping good scientists do good science than by trying to slow down the bad ones. The truly noble quest is to develop tools and techniques that make good scientists more productive. That’s the best way to get more good science into the literature. + +###### *Nat Goodman is a retired computer scientist living in Seattle Washington. His working years were split between mainstream CS and bioinformatics and orthogonally between academia and industry. As a retiree, he’s working on whatever interests him, stopping from time-to-time to write papers and posts on aspects that might interest others. 
He can be contacted at natg@shore.net.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/09/28/goodman-systematic-replication-may-make-many-mistakes/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/09/28/goodman-systematic-replication-may-make-many-mistakes/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/goodman-what-s-the-true-effect-size-it-depends-what-you-think.md b/content/replication-hub/blog/goodman-what-s-the-true-effect-size-it-depends-what-you-think.md new file mode 100644 index 00000000000..cfafe7a1d11 --- /dev/null +++ b/content/replication-hub/blog/goodman-what-s-the-true-effect-size-it-depends-what-you-think.md @@ -0,0 +1,62 @@ +--- +title: "GOODMAN: What’s the True Effect Size? It Depends What You Think" +date: 2019-10-05 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Bayesian data analysis" + - "Nat Goodman" + - "Priors" + - "True Effect Size" +draft: false +type: blog +--- + +###### What’s the true effect size? That’s my bottom line question when doing a study or reading a paper. I don’t expect an exact answer, of course. What I want is a probability distribution telling where the true effect size probably lies. I used to think confidence intervals answered this question, but they don’t except under artificial conditions. A better answer comes from Bayes’s formula. But beware of the devil in the priors. + +###### Confidence intervals, like other standard methods such as the t-test, imagine we’re repeating a study an infinite number of times, drawing a different sample each time from the same population. That seems unnatural for basic, exploratory research, where the usual practice is to run a study once (or maybe twice for confirmation). + +###### As I looked for a general way to estimate true effect size from studies done once, I fell into Bayesian analysis. Much to my surprise, this proved to be simple and intuitive. The code for the core Bayesian analysis (available [***here***](https://natgoodman.github.io/bayez/baysx.stable.html)) is simple, too: just a few lines of R. + +###### The main drawback is the answer depends on your prior expectation. Upon reflection, this drawback may really be a strength, because it forces you to articulate key assumptions. + +###### Being a programmer, I always start with simulation when learning a new statistical method. I model the scenario as a two stage random process. The first stage selects a population (aka “true”) effect size, *dpop*, from a distribution; the second carries out a study with that population effect size yielding an observed effect size, *dobs*. The studies are simple two group difference-of-mean studies with equal sample size and standard deviation, and the effect size statistic is standardized difference (aka *Cohen’s d*). + +###### I record *dpop* and *dobs* from each simulation producing a table showing which *dpop*s give rise to which *dobs*s. Then I pick a target value for *dobs*, say *0.5*, and limit the table to rows where *dobs* is near *0.5*. The distribution of *dpop* from this subset is the answer to my question. In Bayesian-speak, the first-stage distribution is the *prior*, and the final distribution is the *posterior*. + +###### Now for the cool bit. The Bayesian approach lets us pick a prior that represents our assumptions about the distribution of effect sizes in our research field. 
From what I read in the blogosphere, the typical population effect size in social science research is *0.3*. I model this as a normal distribution with *mean=0.3* and small standard deviation, *0.1*. I also do simulations with a bigger prior, *mean=0.7*, to illustrate the impact of the choice.

###### Figures 1a-d show the results for small and large samples (*n=10* or *200*) and small and big priors for *dobs=0.5*. Each figure shows a histogram of simulated data, the prior and posterior distributions (blue and red curves), the medians of the two distributions (blue and red dashed vertical lines), and *dobs* (gray dashed vertical line).

###### TRN1(20191005)

###### The posteriors and histograms match pretty well, indicating my software works. For *n=10* (left column), the posterior is almost identical to the prior, while for *n=200* (right column), it’s shifted toward the observed. It’s a tug-of-war: for small samples, the prior wins, while for large samples, the data is stronger and keeps the posterior closer to the observation. The small prior (top row) pulls the posterior down; the big one (bottom row) pushes it up. Completely intuitive.

###### But wait. I forgot an important detail: some of the problems we study are “false” (“satisfy the null”). No worries. I model the null effect sizes as normal with *mean=0* and very small standard deviation, *0.05*, and the totality of effect sizes as a *mixture* of this distribution and the “true” ones as above. To complete the model, I have to specify the proportion of true vs. null problems. To illustrate the impact, I use 25% true for the small prior and 75% for the big one.

###### Figures 2a-d show the results.

###### TRN2(20191005)

###### The priors have two peaks, reflecting the two classes. With 25% true (top row), the false peak is tall and the true peak short; with 75% true (bottom row), it’s the opposite though not as extreme. For small samples (left column), the posterior also has two peaks, indicating that the data does a poor job of distinguishing true from null cases. For big samples (right column), the posterior has a single peak, which is clearly in true territory. As in Figure 1, the small prior (top row) pulls the result down, while the bigger one (bottom row) pushes it up. Again completely intuitive.

###### What’s not to like?

###### The devil is in the priors. Small priors yield small answers; big priors yield bigger ones. Assumptions about the proportion of true vs. null problems amplify the impact. Reasonable scientists might choose different priors, making it hard to compare results across studies. Unscrupulous scientists may choose priors that make their answers look better, akin to p-hacking.

###### For this elegant method to become the norm, something has to be done about the priors. Perhaps research communities could adopt standard priors for specific types of studies. Maybe we can use data from reproducibility projects to inform these choices. It seems technically feasible. No doubt I’ve missed some important details, but this seems a promising way to move beyond p-values.

###### COMMENTS PLEASE!

###### Please post comments on ***[Twitter](https://twitter.com/gnatgoodman)*** or ***[Facebook](https://www.facebook.com/nathan.goodman.3367)***, or contact me by email at [natg@shore.net](mailto:natg@shore.net).

###### *Nat Goodman is a retired computer scientist living in Seattle Washington.
His working years were split between mainstream CS and bioinformatics and orthogonally between academia and industry. As a retiree, he’s working on whatever interests him, stopping from time-to-time to write papers and posts on aspects that might interest others.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/10/05/goodman-whats-the-true-effect-size-it-depends-what-you-think/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/10/05/goodman-whats-the-true-effect-size-it-depends-what-you-think/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/goodman-when-you-re-selecting-significant-findings-you-re-selecting-inflated-estimates.md b/content/replication-hub/blog/goodman-when-you-re-selecting-significant-findings-you-re-selecting-inflated-estimates.md new file mode 100644 index 00000000000..b8f366d82d6 --- /dev/null +++ b/content/replication-hub/blog/goodman-when-you-re-selecting-significant-findings-you-re-selecting-inflated-estimates.md @@ -0,0 +1,57 @@ +--- +title: "GOODMAN: When You’re Selecting Significant Findings, You’re Selecting Inflated Estimates" +date: 2019-02-16 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "effect size bias" + - "Effect sizes" + - "null hypothesis significance testing" + - "publication bias" + - "Sample selection" +draft: false +type: blog +--- + +###### *Replication researchers cite inflated effect sizes as a major cause of replication failure. It turns out this is an inevitable consequence of significance testing. The reason is simple. The p-value you get from a study depends on the observed effect size, with more extreme observed effect sizes giving better p-values; the true effect size plays no role. Significance testing selects studies with good p-values, hence extreme observed effect sizes. This selection bias guarantees that on average, the observed effect size will inflate the true effect size[1]. The overestimate is large, 2-3x, under conditions typical in social science research. Possible solutions are to increase sample size or effect size or abandon significance testing.* + +###### [1] By “inflate” I mean increase the absolute value. + +###### Figure 1 illustrates the issue using simulated data colored by p-value. The simulation randomly selects true effect sizes, then simulates a two group difference-of-mean study with sample size *n=20* for each true effect size. The effect size statistic is standardized difference, aka *Cohen’s d*, and p-values are from the t-test. The figure shows a scatter plot of true vs. observed effect size with blue and red dots depicting nonsignificant and significant studies. P-values are nonsignifiant (blue) for observed effect sizes between about -0.64 and 0.64 and improve as the observed effect size grows. The transition from blue to red at ± 0.64 is a *critical value* that sharply separates nonsignificant from significant results. This value depends only on *n* and is the least extreme significant effect size for a given *n*. + +###### Goodman1(20190216) + +###### Technical note: The sharpness of the boundary is due to the use of Cohen’s d in conjunction with the t-test. This pairing is mathematically natural because both are *standardized*, meaning both are relative to the sample standard deviation. 
In fact, Cohen’s d and the t-statistic are essentially the same statistic, related by the identities *d = t∙sqrt(2/n)* and *t = d∙sqrt(n/2)* (for my simulation scenario).

###### The average significant effect size depends on both *d* and *n*. I explore this with a simulation that fixes *d* to a few values of interest, sets *n* to a range of values, and simulates many studies for each *d* and *n*.

###### From what I read in the blogosphere, the typical true effect size in social science research is *d=0.3*. Figure 2 shows a histogram of observed effect sizes for *d=0.3* and *n=20*. The significant results are way out on the tails, mostly on the right tail, which means the average will be large. Figure 3 shows the theoretical equivalent of the histogram (the *sampling distribution*) for the same parameters and two further cases: same *d* but larger *n*, and same *n* but larger *d*. Increasing *n* makes the curve sharper and reduces the critical effect size, causing much more of the area to be under the red (significant) part of the curve. Increasing *d* slides the curve over, again putting more of the area under the red. These changes reduce the average significant effect size, bringing it closer to the true value.

###### Goodman2(20190216)

###### Goodman3(20190216)

###### Figure 4 plots the average significant effect size for *d* between 0.3 and 0.7 and *n* ranging from 20 to 200. In computing the average, I only use the right tail, reasoning that investigators usually toss results with the wrong sign whether significant or not, as these contradict the authors’ scientific hypothesis. Let’s look first at *n=20*. For *d=0.3* the average is 0.81, an overestimate of 2.7x. A modest increase in effect size helps a lot. For *d=0.5* (still “medium” in Cohen’s d vernacular), the average is 0.86, an overestimate of 1.7x. For *d=0.7*, it’s 0.93, an overestimate of 1.3x. To reduce the overestimate to a reasonable level, say 1.25x, we need *n=122* for *d=0.3*, but only *n=47* for *d=0.5*, and *n=26* for *d=0.7*.

###### Goodman4(20190216)

###### Significance testing is a biased procedure that overestimates effect size. This is common knowledge among statisticians yet seems to be forgotten in the replication literature and is rarely explained to statistics users. I hope this post will give readers a visual understanding of the problem and under what conditions it may be worrisome. Shravan Vasishth offers another good explanation in ***[his excellent TRN post](https://replicationnetwork.com/2018/09/11/vasishth-the-statistical-significance-filter-leads-to-overoptimistic-expectations-of-replicability/)*** and ***[related paper](https://www.sciencedirect.com/science/article/pii/S0749596X18300640)***.

###### You can mitigate the bias by increasing sample size or true effect size. There are costs to each. Bigger studies are more expensive. They’re also harder to run and may require more study personnel and study days, which may increase variability and indirectly reduce the effect size. Increasing the effect size typically involves finding study conditions that amplify the phenomenon of interest. This may reduce the ability to generalize from lab to real world. All in all, it’s not clear that the net effect is positive.

###### A cheaper solution is to abandon significance testing. The entire problem is a consequence of this timeworn statistical method. Looking back at Figure 1, observed effect size tracks true effect size pretty well.
There’s uncertainty, of course, but that seems an acceptable tradeoff for gaining unbiased effect size estimates at reasonable cost. + +###### Comments Please! + +###### Please post comments on ***[Twitter](https://twitter.com/gnatgoodman)*** or ***[Facebook](https://www.facebook.com/nathan.goodman.3367)***, or contact me by email at [natg@shore.net](mailto:natg@shore.net). + +###### *Nat Goodman is a retired computer scientist living in Seattle Washington. His working years were split between mainstream CS and bioinformatics and orthogonally between academia and industry. As a retiree, he’s working on whatever interests him, stopping from time-to-time to write papers and posts on aspects that might interest others.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/02/16/goodman-when-youre-selecting-significant-findings-youre-selecting-inflated-estimates/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/02/16/goodman-when-youre-selecting-significant-findings-youre-selecting-inflated-estimates/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/goodman-your-p-values-are-too-small-and-so-are-your-confidence-intervals.md b/content/replication-hub/blog/goodman-your-p-values-are-too-small-and-so-are-your-confidence-intervals.md new file mode 100644 index 00000000000..35d1c9beb60 --- /dev/null +++ b/content/replication-hub/blog/goodman-your-p-values-are-too-small-and-so-are-your-confidence-intervals.md @@ -0,0 +1,63 @@ +--- +title: "GOODMAN: Your p-Values Are Too Small! And So Are Your Confidence Intervals!" +date: 2019-05-01 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "confidence intervals" + - "Heterogeneous effects" + - "Nat Goodman" + - "p-values" + - "Significance inflation" +draft: false +type: blog +--- + +###### *An oft-overlooked detail in the significance debate is the challenge of calculating correct p-values and confidence intervals, the favored statistics of the two sides. Standard methods rely on assumptions about how the data were generated and can be way off when the assumptions don’t hold. Papers on heterogenous effect sizes by **[Kenny and Judd](https://osf.io/qs9xw/)** and **[McShane and Böckenholt](https://doi.org/10.1177/1745691614548513)** present a compelling scenario where the standard calculations are highly optimistic. Even worse, the errors grow as the sample size increases, negating the usual heuristic that bigger samples are better.* + +###### Standard methods like the t-test imagine that we’re repeating a study an infinite number of times, drawing a different sample each time from a population with a fixed true effect size. A competing, arguably more realistic, model is the heterogeneous effect size model (*het*). This assumes that each time we do the study, we’re sampling from a different population with a different true effect size. ***[Kenny and Judd](https://osf.io/qs9xw/)*** suggest that the population differences may be due to “variations in experimenters, participant populations, history, location, and many other factors… we can never completely specify or control.” + +###### In the meta-analysis literature, the *het* model is called the “random effects model” and the standard model the “fixed effects model”. While the distinction is well-recognized, the practical implications may not be. 
The purpose of this blog is to illustrate the practical consequences of the *het* model for p-values and confidence intervals. + +###### I model the *het* scenario as a two stage random process. The first stage selects a population effect size, *dpop*, from a normal distribution with mean *dhet* and standard deviation *sdhet*. The second carries out a two group difference-of-mean study with that population effect size: it selects two random samples of size *n* from standard normal distributions, one with *mean=0* and the other with *mean=dpop*, and uses standardized difference, aka Cohen’s *d*, as the effect size statistic. The second stage is simply a conventional study with population effect size *dpop*. *dhet*, the first stage mean, plays the role of true effect size. + +###### Figure 1 shows a histogram of simulated *het* results under the null (*dhet=0*) with *sdhet=0.2* for *n=200*. Overlaid on the histogram is the sampling distribution for the conventional scenario colored by conventional p-value along with the 95% confidence interval. Note that the histogram is wider than the sampling distribution. + +###### TRN1(20190501) + +###### Recall that the p-value for an effect *d* is the probability of getting a result as or more extreme than *d* under the null. Since the histogram is wider than the sampling distribution, it has more data downstream of the point where *p=0.05* (where the color switches from blue to red) and so the correct p-value is more than 0.05. In fact the correct p-value is much more: 0.38. The confidence interval also depends on the width of the distribution and is wider than for the conventional case: -0.44 to 0.44 rather than -0.20 to 0.20. + +###### Note that effect size heterogeneity “inflates” both the true p-value and true confidence interval. In this particular example, *p-value inflation* is 7.6 ( 0.38/0.05), and *confidence interval inflation* is 2.2 (0.44/0.20). In general, these inflation factors will change with *sdhet* and *n*. Figures 2 and 3 plot p-value and confidence interval inflation vs. *n* for several values of *sdhet*. The p-value results (Figure 2) show inflation when the conventional p-value is barely significant (*p=0.05*); the confidence interval results (Figure 3) are for *d=0* (same as Figure 1). + +###### TRN2(20190501) + +###### TRN3(20190501) + +###### Not surprisingly, the results get worse as heterogeneity increases. For *n=200*, p-value inflation grows from 1.59 when *sdhet=0.05* to 12.68 for *sdhet=0.4*; over the same range, confidence interval inflation grows from 1.12 to 4.12. + +###### More worrisome is that the problem also gets worse as the sample size increases. For *sdhet=0.05*, p-value inflation grows from a negligible 1.05 when *n=20* to 1.59 for *n=200* and 2.19 for *n=400*; the corresponding values for confidence interval inflation are 1.01, 1.12, and 1.22. For *sdhet=0.2*, p-value inflation grows from 1.90 for *n=20* to 10.26 for *n=400*, while confidence interval inflation increases from 1.18 to 3.00. + +###### What’s driving this sample size dependent inflation is that increasing *n* tightens up the second stage (where we select samples of size *n*) but not the first (where we select *dpop*). As *n* grows and the second stage becomes narrower, the unchanging width of the first stage becomes proportionally larger. + +###### Another way to see it is to compare the sampling distributions. 
Figure 4 shows sampling distributions for *n=20* and *n=200* for the conventional scenario (colored by p-value) and the *het* scenario (in grey) for *sdhet=0.2*. For *n=20*, the *het* (grey) curve is only slightly wider than the conventional one, while for *n=200* the difference is much greater. In both scenarios, the distributions are tighter for the larger *n*, but the conventional curve gets tighter faster. + +###### TRN4(20190501) + +###### If you believe that the heterogeneous effects model better depicts reality than the conventional model, it follows that p-values and confidence intervals computed by standard statistical packages are too small. Further, it’s impossible to know how much they should be adjusted. + +###### Is this another argument for “retiring statistical significance?” Maybe. But even if one wants to keep significance on the payroll, these results argue for giving less weight to p-values and confidence intervals when assessing the results of statistical tests. More holistic and, yes, subjective interpretations are warranted. + +###### Comments Please! + +###### Please post comments on ***[Twitter](https://twitter.com/gnatgoodman)*** or ***[Facebook](https://www.facebook.com/nathan.goodman.3367)***, or contact me by email at [natg@shore.net](mailto:natg@shore.net). + +###### *Nat Goodman is a retired computer scientist living in Seattle Washington. His working years were split between mainstream CS and bioinformatics and orthogonally between academia and industry. As a retiree, he’s working on whatever interests him, stopping from time-to-time to write papers and posts on aspects that might interest others.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/05/01/your-p-values-are-too-small-and-so-are-your-confidence-intervals/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/05/01/your-p-values-are-too-small-and-so-are-your-confidence-intervals/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/grunow-notes-from-a-workshop-replications-in-empirical-economics-ways-out-of-the-crisis.md b/content/replication-hub/blog/grunow-notes-from-a-workshop-replications-in-empirical-economics-ways-out-of-the-crisis.md new file mode 100644 index 00000000000..b1c52c36315 --- /dev/null +++ b/content/replication-hub/blog/grunow-notes-from-a-workshop-replications-in-empirical-economics-ways-out-of-the-crisis.md @@ -0,0 +1,46 @@ +--- +title: "GRUNOW: Notes from a Workshop: “Replications in Empirical Economics – Ways Out of the Crisis”" +date: 2017-09-21 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "German Research Foundation (DFG)" + - "Institute of Labor Economics (IZA)" + - "IREE" + - "Journal policies" + - "replication" + - "ZBW Leibniz Information Center for Economics" +draft: false +type: blog +--- + +###### *“Next year, this topic should not be discussed in a pre-conference workshop but in the opening plenum of the conference!”* This statement by a young researcher not only concluded the workshop but also gave bright prospects to replications in Economics. 
+

###### On September 8, 2017 the ZBW Leibniz Information Center for Economics hosted the workshop [***“Replications in Empirical Economics – Ways out of the Crisis”***](http://vfs2017.univie.ac.at/en/conference-information/conference-programme/sunday-9-3-2017/) at the ***[Annual Conference](https://www.socialpolitik.de/En/annual-conference-2017)*** of the ***[Verein für Socialpolitik](https://www.socialpolitik.de/En/annual-conference-2017)*** in Vienna, Austria. Thirty participants and four speakers engaged in lively and stimulating discussions about replications and the publication of replications in Economics.

###### Hilmar Schneider, Director of the ***[Institute of Labor Economics](https://www.iza.org/)*** (IZA, Germany), made an enthusiastic plea for replications in economics and sharply criticized the credibility of much economic research. He argued that without replication research, knowledge can only grow horizontally rather than vertically. According to Schneider, the core problem lies in the pressure for novelty and original research that is exerted by the academic culture and the scientific publication system. As a consequence, pseudo-innovations are preferred to research that may actually have an impact on society. Arbitrary findings that lack verification are insufficient to properly inform and support policy decisions.

###### Christiane Joerk, Program Director from the ***[German Research Foundation](http://www.dfg.de/en/index.jsp)*** (DFG), summarized the ***[DFG Statement on the Replicability of Research Results](http://www.dfg.de/en/service/press/press_releases/2017/press_release_no_13/index.html)***. The DFG already financially supports projects that strengthen the infrastructure for replications (e.g. IREE – see below) and will also fund replication studies in the future.

###### *“Is it possible to publish replication studies in the International Journal for Re-Views in Empirical Economics using a pseudonym?”* This question reflects the situation of young researchers, who are caught between good scientific practice and the current academic culture. Alarmingly, good scientific practice and academic culture apparently pose a conflict for researchers.

###### The wish for the anonymous publication of replication studies also reflects the bad reputation of replication studies in economics. On the one hand, most researchers state that replication studies are a critical part of scientific progress and indispensable for good scientific practice. On the other hand, as documented by Maren Duvendack from the ***[University of East Anglia](https://www.uea.ac.uk/)*** (UK), researchers who do replications are seen to be engaged in “bullying” and “persecution”, and referred to as “research parasites”. Together with her colleagues Richard Palmer-Jones and Bob Reed, she analyzed the publication market for replications in economics. The authors found that published replications in economics have been increasing since the mid-seventies. However, the absolute number is still alarmingly low (with a peak of 22 published replications in economics in 2012). Compared to other disciplines such as psychology and political science, economics lags behind with respect to replication efforts. According to Duvendack, a key obstacle could be the lack of publication outlets for replication studies in economics.
+ +###### As a partial solution to the replication crisis, Martina Grunow, Project Manager at the ***[ZBW Leibniz Information Center for Economics](http://www.zbw.eu/en/)*** (Germany), concluded the talks with her presentation of the [***International Journal for Re-Views in Empirical Economics*** (IREE)](http://www.iree.eu). She summarized the incentive problems for authors to conduct replication studies and as a result a lack of publication possibilities for this kind of research. IREE is a peer-reviewed and open-access e-journal which is dedicated to the publication of replication studies in empirical economics. Replication studies are published without regard to their results. IREE-papers are made citable with DOIs and are internationally disseminated via ***[EconStor](https://www.econstor.eu/?locale=en)*** and ***[RePEc](https://ideas.repec.org/search.html)***. As the project is funded by the German Research Foundation (DFG) and the ZBW Leibniz Information Center for Economics, IREE does not charge any publication fees. By providing a publication platform for replication studies, IREE aims to encourage the credibility of economics research based on robust and replicable findings. + +###### With the prospect of publishing replication studies, doctoral students participating in the workshop expressed the view that replications should become a mandatory part of their dissertation research. They argued that by doing so, replication research would receive deserved credibility and not be buried under the pressure for pseudo-innovations. This is especially important at the beginning of their academic careers. The participants also agreed on the importance of teaching the value of replications to students. + +###### *Dr. Martina Grunow is Managing Editor of the International Journal for Re-Views in Empirical Economics (IREE) and is an associate researcher at the Canadian Centre for Health Economics (CCHE). She can be contacted by email at m.grunow@zbw.eu.* + +###### REFERENCES: + +###### Duvendack, M., Palmer-Jones, R.W. & Reed, W.R., 2015. Replications in Economics: A Progress Report, *Econ Journal Watch*, 12(2): 164-191. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/09/21/grunow-notes-from-a-workshop-replications-in-empirical-economics-ways-out-of-the-crisis/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/09/21/grunow-notes-from-a-workshop-replications-in-empirical-economics-ways-out-of-the-crisis/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/grunow-say-hello-to-iree-a-new-economics-journal-dedicated-to-the-publishing-of-replication-studies.md b/content/replication-hub/blog/grunow-say-hello-to-iree-a-new-economics-journal-dedicated-to-the-publishing-of-replication-studies.md new file mode 100644 index 00000000000..9f453575f51 --- /dev/null +++ b/content/replication-hub/blog/grunow-say-hello-to-iree-a-new-economics-journal-dedicated-to-the-publishing-of-replication-studies.md @@ -0,0 +1,49 @@ +--- +title: "GRUNOW: Say Hello to IREE – A New Economics Journal Dedicated to the Publishing of Replication Studies" +date: 2017-10-06 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Economics Journals" + - "IREE" + - "Journal policies" + - "Martina Grunow" + - "replication" +draft: false +type: blog +--- + +###### Replications are pivotal for the credibility of empirical economics. 
Evidence-based policy requires findings that are robust and reproducible. Despite this, there has been a notable absence of serious effort to establish the reliability of empirical research in economics. As ***[Edward Leamer famously noted](https://www.jstor.org/stable/1803924?seq=1#page_scan_tab_contents)***, “Hardly anyone takes data analysis seriously. Or perhaps more accurately, hardly anyone takes anyone else’s data analysis seriously.” This is evidenced by the fact that replication studies ***[are rarely published](https://www.econstor.eu/bitstream/10419/162381/1/879455233.pdf)*** in ***[economic journals](https://econjwatch.org/file_download/866/DuvendackEtAlMay2015.pdf)***. + +###### However, the situation may be changing. Recently, the Deutsche Forschungsgemeinschaft (DFG) released a ***[Statement on the Replicability of Research Results](http://www.dfg.de/en/service/press/press_releases/2017/press_release_no_13/index.html)*** in which it emphasized the importance of replication to ensure the reliability of empirical research. Accordingly, DFG is funding a new scientific journal, the “*International Journal for Re-Views in Empirical Economics (IREE)*”. + +###### IREE is a joint project of Leuphana University of Lüneburg (Joachim Wagner), the German Institute for Economic Research (DIW Berlin) (Gert G. Wagner), the Institute of Labor Economics (Hilmar Schneider), and the ***[ZBW](http://www.zbw.eu)***. Nobel laureate Sir Angus Deaton (Princeton University), Jeffrey M. Wooldridge (Michigan State University), and Richard A. Easterlin (University of Southern California) are members of the advisory board of IREE. + +###### The *International Journal for Re-Views in Empirical Economics* (IREE) is the first journal dedicated to the publication of replication studies based on economic micro-data. Furthermore, IREE publishes synthesizing reviews, micro-data sets and descriptions thereof, as well as articles dealing with replication methods and the development of standards for replications. Up to now, authors of replication studies, data sets and descriptions have had a hard time gaining recognition for their work via citable publications. As a result, the incentives for conducting these important kinds of work were immensely reduced. Richard A. Easterlin notes the paradox when he states: “Replication, though a thankless task, is essential for the progress of social science.” + +###### To make replication a little less thankless, all publications in IREE are citable. Each article, data set, and computer program is assigned a DOI. In addition, data sets are stored in a permanent repository, the ***[ZBW Journal Data Archive](http://www.journaldata.zbw.eu/)***. This provides a platform for authors to gain credit for their replication-related research. + +###### Up to now, publication of replication studies has often been results-dependent, with publication being more likely if the replication study refutes the original research. This induces a severe publication bias. When this happens, replication, rather than improving things, can actually further undermine the reliability of economic research. Compounding this are submission and publication fees which discourage replication research that is unlikely to get published. + +###### IREE is committed to publishing research independent of the results of the study. Publication is based on technical and formal criteria without regard to results. To encourage open and transparent discourse, IREE is open access. 
There are no publication or submission fees, and the journal is committed to a speedy and efficient peer-review process. + +###### To learn more about IREE, including how to submit replication research for publication, ***[click here](http://www.iree.eu/)***. + +###### *Dr. Martina Grunow is Managing Editor of the International Journal for Re-Views in Empirical Economics (IREE) and is an associate researcher at the Canadian Centre for Health Economics (CCHE). She can be contacted by email at* **m.grunow@zbw.eu***.* + +###### REFERENCES: + +###### Duvendack, M., Palmer-Jones, R.W. & Reed, W.R., 2015. Replications in Economics: A Progress Report, Econ Journal Watch, 12(2): 164-191. + +###### Leamer, Edward E., 1983. Let’s Take the Con Out of Econometrics, The American Economic Review, 73(1): 31-43. + +###### Mueller-Langer, F.,  Fecher, B.,Harhoff, D. & Wagner, G. G., 2017. The Economics of Replication, IZA Discussion Papers 10533, Institute for the Study of Labor (IZA). + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/10/06/grunow-say-hello-to-iree-a-new-economics-journal-dedicated-to-the-publishing-of-replication-studies/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/10/06/grunow-say-hello-to-iree-a-new-economics-journal-dedicated-to-the-publishing-of-replication-studies/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/grunow-update-on-iree-the-first-and-only-journal-dedicated-to-replications-in-economics.md b/content/replication-hub/blog/grunow-update-on-iree-the-first-and-only-journal-dedicated-to-replications-in-economics.md new file mode 100644 index 00000000000..a87c87436af --- /dev/null +++ b/content/replication-hub/blog/grunow-update-on-iree-the-first-and-only-journal-dedicated-to-replications-in-economics.md @@ -0,0 +1,60 @@ +--- +title: "GRUNOW: Update on IREE – the First and Only Journal Dedicated to Replications in Economics" +date: 2019-05-11 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "economics" + - "Economics Journals" + - "IREE" + - "Journal policies" + - "Journals" + - "Open Science" + - "Plan S" + - "Project DEAL" + - "replications" + - "ZBW Leibniz Information Center for Economics" +draft: false +type: blog +--- + +###### IREE (***[the International Journal for Re-Views in Empirical Economics](https://www.iree.eu/)***) was launched in September 2017, supported by our prestigious board of academic advisors: Sir Angus Deaton, Richard Easterlin, and Jeffrey Wooldridge. It is the first, and, to date, only journal solely dedicated to publishing replication studies in Economics. As we take stock of where we currently are, I want to take this opportunity to provide an update, and to talk about four areas that will be crucial for IREE’s growth in the future: financing, publications, submissions, and the editorial board. + +###### **Financing of IREE** + +###### We currently live in a time of major upheaval in academic publishing. ***[Plan S](https://en.wikipedia.org/wiki/Plan_S)*** and ***[Project DEAL](https://en.wikipedia.org/wiki/Project_DEAL)*** represent major challenges to the “big publisher/closed access” model that has dominated academic publishing. IREE is part of a new movement in academic publishing that emphasizes open access. All of our content is available online, free of charge, and authors are not charged a submission fee. 
This is in keeping with the philosophy of “***[Open Science](https://en.wikipedia.org/wiki/Open_science)***”, which fits naturally with our focus on replication.

###### So who pays our bills? Funding of IREE is provided by the ***[Joachim Herz Foundation](https://www.joachim-herz-stiftung.de/en/)*** and the ***[ZBW – Leibniz Information Centre for Economics](http://www.zbw.eu/en/)***. While the funding has allowed us to get this far, we are currently seeking to strengthen our foundation of financial support by building a consortium of supporting institutions including universities, central banks, research institutes and libraries from all over the world. To do that, we need to establish a solid track record of publishing high quality and important replication studies. This brings us to our next category.

###### **Publications in IREE**

###### We have been pleased with the quality of the replication studies we have published to date. Our list of published studies includes replications of research that has appeared in the *American Economic Review*, the *Quarterly Journal of Economics,* the *American Economic Journal: Applied Economics,* the *Review of Economics and Statistics,* the *Oxford Bulletin of Economics and Statistics*, and other prestigious journals. Publications in IREE are distributed via ***[EconStor](https://www.econstor.eu/?&locale=en)***, ***[RePEc](http://repec.org/)***, and the ***[ReplicationWiki](http://replication.uni-goettingen.de/wiki/index.php/Main_Page)***. You can check out our publications ***[here](https://www.iree.eu/publications/publications-in-iree/)***.

###### **Submissions to IREE**

###### Quality submissions are the lifeblood of the journal, and the key to us securing future funding. We need quality replication studies to keep coming in. While we are always pleased to receive submissions from prominent, established researchers, we are also happy to receive submissions from post-graduate students and early career researchers.

###### Many PhD programs have students perform replications as part of their graduate coursework and empirical training. These students should consider IREE as a publication outlet. Our quick review times and online publishing mean that it is possible for students to have their work published and “in print” by the time they enter the job market, helping to establish their research record.

###### A distinctive feature of IREE is that we publish quality replications regardless of the outcome. There are other journals that publish replication studies, but oftentimes they will only publish a replication if it overturns the results of an original study. For example, the *American Economic Review* has published many replication studies, but all of their replications disconfirm the original studies. IREE will publish a replication study even if the results of that analysis support the findings of the original research.

###### IREE specializes in the following areas: Microeconomics, Macroeconomics, Experiments, and Finance/Management/Business Administration. We also accept replication studies from adjacent disciplines that are closely related to economics. Please check out our [“***Aims and Scope***”](https://www.iree.eu/aims-and-scope/).

###### **Editorial Board of IREE**

###### Along with our growth have come some changes in our editorial board. Hilmar Schneider and Gert G. Wagner, who both helped to found IREE, have since left. Many thanks to both!
Martina Grunow and Joachim Wagner are now supported by the new editors Maren Duvendack and Christian Pfeifer. Furthermore, we are very grateful for the active, critical and enthusiastic help from our great co-editors (see ***[here](https://www.iree.eu/who-we-are/editorial-board/)***). + +###### In conclusion, please consider supporting open science and replication in economics by submitting your replication research to IREE. Further, if you have colleagues and students who have done replication research, please encourage them to submit their work to IREE. + +###### Follow IREE on Twitter: [@IreeJournal](https://twitter.com/IREEJournal?lang=de). + +###### *Dr. Martina Grunow is Managing Editor of the International Journal for Re-Views in Empirical Economics (IREE) and is an associate researcher at the Canadian Centre for Health Economics (CCHE). She can be contacted by email at m.grunow@zbw.eu.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/05/11/grunow-update-on-iree-the-first-and-only-journal-dedicated-to-replications-in-economics/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/05/11/grunow-update-on-iree-the-first-and-only-journal-dedicated-to-replications-in-economics/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/hamermesh-replications-enough-already.md b/content/replication-hub/blog/hamermesh-replications-enough-already.md new file mode 100644 index 00000000000..2035626d8bd --- /dev/null +++ b/content/replication-hub/blog/hamermesh-replications-enough-already.md @@ -0,0 +1,47 @@ +--- +title: "HAMERMESH: Replications – Enough Already" +date: 2017-01-20 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Daniel Hamermesh" + - "Labor economics" + - "replication" +draft: false +type: blog +--- + +###### *NOTE: This entry is based on, “Replication in Labor Economics: Evidence from Data, and What It Suggests,” American Economic Review, 107 (May 2017)* + +###### In Hamermesh (2007) I bemoaned the paucity of “hard-science” style replication in applied economics. I shouldn’t have, as my examination of the citation histories of 10 leading articles in empirical labor economics published between 1990 and 1996 shows. Each selected article had to have been published in a so-called “Top 5” journal and to have accumulated at least 1000 Google Scholar (GS) citations. I examined every publication that the Web of Science had recorded as having cited the work, reading first the abstract and then, if necessary, skimming through the citing paper itself. I classified each citing article by whether it was: 1) Related to; 2) Inspired by; 3) Very similar to but using different data; or 4) A direct replication at least partly using the same data.[[1]](#_ftn1) + +###### The distribution of the over 3000 citing papers in the four categories was: Related, 92.9 percent; inspired, 5.0 percent; similar, 1.5 percent; replicated, 0.6 percent. Replication, even defined somewhat loosely, is fairly rare even of these most highly visible studies. However, 7 of the 10 articles were replicated (coded 3 or 4) at least 5 times, with the remaining 3 replicated 1, 2 and 4 times. Published replications of these most heavily-cited papers are performed, so that one might view the replication glass as 100 percent full. 
+ +###### Replications of most studies, even those appearing in Top 5 journals, are not published, nor should they be: The majority of articles in those journals are (Hamermesh, 2017), essentially ignored, so that the failure to replicate them is unimportant. But the most important studies (judged by market responses) are replicated as they should be—by taking the motivating economic idea and examining its implications using a set of data describing a different time and/or economy. The empirical validity of these ideas, after their relevance is first demonstrated for a particular time and place, can only be usefully replicated at other times and places: If they are general descriptions of behavior, they should hold up beyond their original testing ground. Simple laboratory-style replication is important in catching errors in influential work; but the more important replication goes beyond this and, as I’ve shown, is done. + +###### The evidence suggests that the system is not broken and does not need fixing. But what if one believes that more replication, using mostly the same data as in the original study, is necessary? A bit of history: During the 1960s the *American Economic Review* was replete with replication-like papers, in the form of Comments (often in the form of replications on the same or other data), Replies and even Rejoinders. For example, in the four regular issues of the 1966 volume, 16 percent of the space went to such contributions. In the first four regular issues of the 2013 volume only 4 percent did, reflecting a change that began by the 1980s. Editors shifted away from cluttering the *Review*’s pages with Comments, etc., perhaps reflecting their desire to maximize its impact on the profession in light of their realization that pages devoted to this type of exercise generate little attention from other authors (Whaples, 2006). + +###### We have had replications or approximations thereof in the past, but the market for scholarship—as indicated by their impact—has exhibited little interest in them. And we still publish replications, but, as I have shown, in the more appropriate and worthwhile form of examinations of data from other times and places. Overall the evidence suggests that the system is not broken and does not need fixing; and that the most obvious way of fixing this unbroken system has already been rejected by the market. + +###### *Daniel Hamermesh is Professor of Economics at Royal Holloway, University of London, Research Associate at the National Bureau of Economic Research, and Research Associate at the Institute for the Future of Labor (IZA).* + +###### **REFERENCES** + +###### Daniel S. Hamermesh, 2007. “Viewpoint: Replication in Economics.” *Canadian Journal of Economics* 40 (3): 715-33. + +###### \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_, 2017. “Citations in Economics: Measurement, Impacts and Uses.” *Journal of Economic Literature*, 55, forthcoming. + +###### Robert M. Whaples. 2006. “The Costs of Critical Commentary in Economics Journals.” *Econ Journal Watch* 3 (2): 275-82 + +###### ———————————— + +###### [[1]](#_ftnref1)To be classified as “inspired” the citing paper had to refer repeatedly to the original paper and/or had to make clear that it was inspired by the methodology of the original work. To be noted as “similar” the citing paper had to use the exact same methodology but on a different data set, while a study classified as a “replication” went further to include at least some of the data in the original study. 
Thus even a “replication” in many cases involved more than simply re-estimating models in the original article using the same data. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/01/20/hamermesh-replications-enough-already/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/01/20/hamermesh-replications-enough-already/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/hirschauer-et-al-twenty-steps-towards-an-adequate-inferential-interpretation-of-p-values-in-economet.md b/content/replication-hub/blog/hirschauer-et-al-twenty-steps-towards-an-adequate-inferential-interpretation-of-p-values-in-economet.md new file mode 100644 index 00000000000..339dcfcdee0 --- /dev/null +++ b/content/replication-hub/blog/hirschauer-et-al-twenty-steps-towards-an-adequate-inferential-interpretation-of-p-values-in-economet.md @@ -0,0 +1,76 @@ +--- +title: "HIRSCHAUER et al.: Twenty Steps Towards an Adequate Inferential Interpretation of p-Values in Econometrics" +date: 2019-03-22 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Inferential errors" + - "Inverse probability error" + - "Multiple testing" + - "null hypothesis significance testing" + - "p-values" + - "Statistical inference" +draft: false +type: blog +--- + +###### *This blog is based on the homonymous paper by Norbert Hirschauer, Sven Grüner, Oliver Mußhoff, and Claudia Becker in the **[Journal of Economics and Statistics](https://www.degruyter.com/printahead/j/jbnst)**. It is motivated by prevalent inferential errors and the intensifying debate on p-values – as expressed, for example in the activities of the American Statistical Association including its p-value* ***[symposium](http://ww2.amstat.org/meetings/ssi/2017/index.cfm)*** *in 2017 and the March 19 Special Issue on **[Statistical inference in the 21st century: A world beyond P < 0.05](https://www.tandfonline.com/toc/utas20/73/sup1)**. A related petition in **[Nature](https://www.nature.com/articles/d41586-019-00857-9)** arguing that it is time to retire statistical significance was supported by more than **[800 scientists](https://www.nature.com/magazine-assets/d41586-019-00857-9/data-and-list-of-co-signatories)**. While we provide more details and practical advice, our 20 suggestions are essentially in line with this petition.* + +###### Even if one is aware of the fundamental pitfalls of null hypothesis statistical testing (NHST), it is difficult to escape the categorical reasoning that is so entrancingly suggested by its dichotomous significance declarations. With a view to the *p*-value’s deep entrenchment in current research practices and the apparent need for a basic consensus on how to do things in the future, we suggest twenty immediately actionable steps to reduce widespread inferential errors. + +###### Our propositions aim at fostering the logical consistency of inferential arguments, which is the prerequisite for understanding what we can and what we cannot conclude from both original studies and replications. They are meant to serve as a discussion base or even tool kit for editors of economics journals who aim at revising their guidelines to increase the quality of published research. + +###### **Suggestion 1:** Refrain from using *p*-values if you have data of the *whole population of interest*. 
In this case, no generalization (inference) from the sample to the population is necessary. Do not use *p*-values either if you have a *non-random sample* that you chose for convenience reasons instead of using probability methods: *p*-values conceptually require a random process of data generation.

###### **Suggestion 2:** Distinguish the function of the *p*-value depending on the type of the data generating process. In the random sampling case, you are concerned with generalizing from the sample to the population (*external validity*). In the random assignment case, you are concerned with the causal treatment effects in an experiment with random assignment (*internal validity*).

###### **Suggestion 3:** When using *p*-values as an inferential aid in the random sampling case, provide convincing arguments that your sample represents at least approximately a random sample. To avoid misunderstandings, transparently state how and from which population the random sample was drawn and, consequently, *to which target population you want to generalize*.

###### **Suggestion 4:** Do use wordings that ensure that the *p*-value is understood as a *graded measure of the strength of evidence against the null*, and that *no* particular information is associated with a *p*-value being below or above some arbitrary threshold such as 0.05.

###### **Suggestion 5:** Do *not* insinuate that the *p*-value denotes an epistemic (posterior) probability of a scientific hypothesis given the evidence in your data. Stating that you found an effect with an “error probability” of *p* is misleading. It erroneously suggests that the *p*-value is the probability of the null – and therefore the probability of being “in error” when rejecting it.

###### **Suggestion 6:** Do *not* insinuate that a low *p*-value indicates a large or even practically relevant effect size. Use wordings such as “large” or “relevant” but refrain from using “significant” when discussing the effect size – at least as long as dichotomous interpretations of *p*-values linger on in the scientific community.

###### **Suggestion 7:** Do *not* suggest that high *p*-values can be interpreted as an indication of no effect (“evidence of absence”). Do *not* even suggest that high *p*-values can be interpreted as “absence of evidence.” Doing so would negate the evident findings from your data.

###### **Suggestion 8:** Do *not* suggest that *p*-values below 0.05 can be interpreted as evidence in favor of the just-estimated coefficient. Formulations saying that you found a “statistically significant effect of size z” should be avoided because they mix up *estimating* and *testing*. The strength of evidence against the null cannot be translated into evidence in favor of the estimate that you happened to find in your sample.

###### **Suggestion 9:** Avoid the terms “hypothesis *testing*” and “*confirmatory* analysis,” or at least put them into proper perspective and communicate that it is impossible to infer from the *p*-value whether the null hypothesis or an alternative hypothesis is true. In any ordinary sense of the terms, a *p*-value cannot “test” or “confirm” a hypothesis, but only describe data frequencies under a certain statistical model including the null.

###### **Suggestion 10:** Restrict the use of the word “evidence” to the concrete findings in your data and clearly distinguish this *evidence* from your *inferential conclusions*, i.e., the generalizations you make based on your study and all other available evidence.
+ +###### **Suggestion 11:** Do explicitly state whether your study is *exploratory* (i.e. aimed at generating new hypotheses) or whether you aim at producing new evidence for *pre-specified* (ex ante) hypotheses. + +###### **Suggestion 12:** In *exploratory* search for potentially interesting associations, do never use the term “hypotheses *testing*” because you have no testable ex ante hypotheses. + +###### **Suggestion 13:** If your study is (what would be traditionally called) “confirmatory” (i.e., aimed at producing evidence regarding *pre-specified* hypotheses), exactly report in your paper the hypotheses that you drafted as well as the model you specified *before* seeing the data. In the results section, clearly relate findings to these ex ante hypotheses. + +###### **Suggestion 14:** When studying pre-specified hypotheses, clearly distinguish two parts of the analysis: (i) the description of the *empirical* *evidence* that you found in your study (What is the evidence in the data?) and (ii) the *inferential reasoning* that you base on this evidence (What should one reasonably believe after seeing the data?). If applicable, a third part should outline the recommendations or *decisions* that you would make all things considered, including the weights attributed to type I and type II errors (What should one do after seeing the data?). + +###### **Suggestion 15:** If you fit your model to the data even though you are concerned with pre-specified hypotheses, explicitly demonstrate that your data-contingent model specification does *not* constitute “*hypothesizing after the results are known*.” When using *p*-values as an inferential aid, *explicitly* consider and comment on *multiple comparisons*. + +###### **Suggestion 16:** Explicitly distinguish statistical and scientific inference. *Statistical* *inference* is about generalizing from a random sample to its parent population. This is only the first step of *scientific* *inference*, which is the totality of reasoned judgments (inductive generalizations) that we make in the light of the total body of evidence. Be clear that a *p*-value, can do *nothing* to assess the generalizability of results beyond a random sample’s parent population. + +###### **Suggestion 17:** Provide information regarding the *size of your estimate*. In many regression models, a meaningful representation of magnitudes will require going beyond coefficient estimates and displaying marginal effects or other measures of effect size. + +###### **Suggestion 18:** Do *not* use asterisks (or the like) to denote different levels of “statistical significance.” Doing so could instigate erroneous categorical reasoning. + +###### **Suggestion 19:** Provide *p*-values if you use the graded strength of evidence against the null as an inferential aid (amongst others). However, do *not* classify results as being “statistically significant” or not. That said, avoid using the terms “statistically significant” and “statistically non-significant” altogether. + +###### **Suggestion 20:** Provide *standard errors* for all effect size estimates. Additionally, provide *confidence intervals* for the focal variables associated with your pre-specified hypotheses. + +###### While following these suggestions would prevent overconfident yes/no conclusions both in original and replication studies, we do not expect that all economists will endorse all of them at once. Some, such as providing effect size measures and displaying standard errors, are likely to cause little controversy. 
Others, such as renouncing dichotomous significance declarations and giving up the term “statistical significance” altogether, will possibly be questioned. + +###### Opposition against giving up conventional yes/no declarations is likely to be fueled by the fact that no joint understanding and consensus has yet been reached as to which formulations are appropriate to avoid cognitive biases and communicate the correct but per se limited informational content of frequentist concepts such as *p*-values and confidence intervals. Such joint understandings and consensus regarding best practice are in dire need. + +###### *Prof. **[Norbert Hirschauer](https://www.landw.uni-halle.de/prof/lu/?lang=en)**, Dr. **[Sven Grüner](https://www.landw.uni-halle.de/prof/lu/mitarbeiter___doktoranden/gruener/)**, and Prof. **[Oliver Mußhoff](https://www.uni-goettingen.de/en/66131.html)** are agricultural economists in Halle (Saale) and Göttingen, Germany. Prof. **[Claudia Becker](https://statistik.wiwi.uni-halle.de/personal/?lang=en)** is an economic statistician in Halle (Saale). The authors are interested in connecting with economists who have an interest to further concrete steps that help prevent inferential errors associated with conventional significance declarations in econometric studies.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/03/22/hirschauer-et-al-twenty-steps-towards-an-adequate-inferential-interpretation-of-p-values-in-econometrics/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/03/22/hirschauer-et-al-twenty-steps-towards-an-adequate-inferential-interpretation-of-p-values-in-econometrics/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/hirschauer-et-al-why-replication-is-a-nonsense-exercise-if-we-stick-to-dichotomous-significance-thin.md b/content/replication-hub/blog/hirschauer-et-al-why-replication-is-a-nonsense-exercise-if-we-stick-to-dichotomous-significance-thin.md new file mode 100644 index 00000000000..de65b31c717 --- /dev/null +++ b/content/replication-hub/blog/hirschauer-et-al-why-replication-is-a-nonsense-exercise-if-we-stick-to-dichotomous-significance-thin.md @@ -0,0 +1,67 @@ +--- +title: "HIRSCHAUER et al.: Why replication is a nonsense exercise if we stick to dichotomous significance thinking and neglect the p-value’s sample-to-sample variability" +date: 2018-10-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Expost Power" + - "p-values" + - "Post-hoc Power" + - "Power" + - "replications" + - "Reproducibility" + - "significance testing" +draft: false +type: blog +--- + +###### *[This blog is based on the paper “**[Pitfalls of significance testing and p-value variability: An econometrics perspective](https://projecteuclid.org/euclid.ssu/1538618436)**” by Norbert Hirschauer, Sven Grüner, Oliver Mußhoff, and Claudia Becker, Statistics Surveys 12(2018): 136-172.]* + +###### Replication studies are often regarded as *the* means to scrutinize scientific claims of prior studies. They are also at the origin of the scientific debate on what has been labeled “replication crisis.” The fact that the results of many studies cannot be “replicated” in subsequent investigations is seen as casting serious doubts on the quality of empirical research. 
Unfortunately, the interpretation of replication studies is itself plagued by two intimately linked problems: first, the conceptual background of different types of replication often remains unclear. Second, inductive inference often follows the rationale of conventional significance testing with its misleading dichotomization of results as being either “significant” (positive) or “not significant” (negative). A poor understanding of inductive inference, in general, and the *p*-value, in particular, will cause inferential errors in all studies, be they initial ones or replication studies. + +###### Amalgamating taxonomic proposals from various sources, we believe that it is useful to distinguish three types of replication studies: + +###### **1. Pure replication** is the most trivial of all replication exercises. It denotes a subsequent “study” that is limited to verifying computational correctness. It therefore uses the *same data (sample)* and the *same statistical model* as the initial study. + +###### **2. Statistical replication** (or reproduction) applies the *same statistical model* as used in the initial study to *another random sample* of the *same population*. It is concerned with the random sampling error and statistical inference (generalization from a random sample to its population). Statistical replication is the very concept upon which frequentist statistics and therefore the *p*-value are based. + +###### **3. Scientific replication** comprises two types of robustness checks: (i) The first one uses a *different statistical model* to reanalyze the *same sample* as the initial study (and sometimes also *another random sample* of the *same* *population*). (ii) The other one extends the perspective beyond the initial population and uses the *same statistical model* for analyzing a *sample* from a *different population*. + +###### **Statistical replication** is probably the most immediate and most frequent association evoked by the term “replication crisis.” It is also the focus of this blog in which we illustrate that re-finding or not re-finding “statistical significance” in statistical replication studies does not tell us whether we fail to replicate a prior scientific claim or not. + +###### In the wake of the 2016 ASA-statement on *p*-values, many economists realized that *p*-values and dichotomous significance declarations do not provide a clear rationale for statistical inference. Nonetheless, many economists seem still to be reluctant to renounce dichotomous yes/no interpretations; and even those who realize that the *p*-value is but a graded measure of the strength of evidence against the null are often not fully aware that an informed inferential interpretation of the *p*-value requires considering its sample-to-sample variability. + +###### We use two simulations to illustrate how misleading it is to neglect the *p*-value’s sample-to-sample variability and to evaluate replication results based on the positive/negative dichotomy. In each simulation, we generated 10,000 random samples (statistical replications) based on the linear “reality” *y =* 1 + *βx + e*, with *β =* 0.2. The two realities differ in their error terms: *e~N*(0;3), and *e~N*(0;5). Sample size is *n =* 50, with *x* varying from 0.5 to 25 in equal steps of 0.5. For both the *σ* = 3 and *σ* = 5 cases, we ran OLS-regressions for each of the 10,000 replications, which we then ordered from the smallest to the largest *p*-value. 
+ +###### Table 1 shows selected *p*-values and their cumulative distribution *F(p)* together with the associated coefficient estimates b and standard error estimates s.e. (and their corresponding *Z* scores under the null).The last column displays the power estimates based on the naïve assumption that the coefficient *b* and the standard error s.e. that we happened to estimate in the respective sample were true. + +###### Table 1: *p*-values and associated coefficients and power estimates for five out of 10,000 samples (*n =* 50 each)† + +###### Capture + +###### Our simulations illustrate one of the most essential features of statistical estimation procedures, namely that our best unbiased estimators estimate correctly *on average*. We would therefore need *all* estimates from frequent replications – *irrespective* of their *p*-values and their being large or small – to obtain a good idea of the population effect size. While this fact should be generally known, it seems that many researchers, cajoled by statistical significance language, have lost sight of it. Unfortunately, this cognitive blindness does not seem to stop short of those who, insinuating that replication implies a reproduction of statistical significance, lament that many scientific findings cannot be replicated. Rather, one should realize that each well-done replication adds an additional piece of knowledge. The very dichotomy of the question whether a finding *can be* *replicated* or *not*, is therefore grossly misleading. + +###### Contradicting many neat, plausible, and wrong conventional beliefs, the following messages can be learned from our simulation-based statistical replication exercise: + +###### 1. While conventional notation abstains from advertising that the *p*-value is but a summary statistic of a noisy random sample, the *p*-value’s variability over statistical replications can be of considerable magnitude. This is paralleled by the variability of estimated coefficients. We may easily find a large coefficient in one random sample and a small one in another. + +###### 2. Besides a single study’s *p*-value, its variability –and, in dichotomous significance testing, the statistical power (i.e., the zeroth order lower partial moment of the *p*-value distribution at 0.05) – determines the repeatability in statistical replication studies. One needs an assumption regarding the true effect size to assess the *p*-value’s variability. Unfortunately, economists often lack information regarding the effect size prior to their own study. + +###### 3. If we rashly claimed a coefficient estimated in a single study to be true, we would not have to be surprised at all if it cannot be “replicated” in terms of re-finding statistical significance. For example, if an effect size and standard error estimate associated with a *p*-value of 0.05 were real, we would *necessarily* have a mere 50% probability (statistical power) of finding a statistically significant effect in replications in a one-sided test. + +###### 4. Low *p*-values do not indicate results that are more trustworthy than others. Under reasonable sample sizes and population effect sizes, it is the *abnormally* large sample effect sizes that produce “highly significant” *p*-values. Consequently, even in the case of a highly significant result, we cannot make a direct inference regarding the true effect. 
And by averaging over “significant” replications only, we would necessarily overestimate the effect size because we would right-truncate the distribution of the *p*-value which, in turn, implies a left-truncation of the distribution of the coefficient over replications. + +###### 5. In a single study, we have no way of identifying the *p*-value below which (above which) we overestimate (underestimate) the effect size. In the *σ* = 3 case, a *p*-value of 0.001 was associated with a coefficient estimate of 0.174 (underestimation). In the *σ* = 5 case, it was linked to a coefficient estimate of 0.304 (overestimation). + +###### 6. Assessing the replicability (trustworthiness) of a finding by contrasting the tallies of “positive” and “negative” results in replication studies has long been deplored as a serious fallacy (“vote counting”) in meta-analysis. Proper meta-analysis shows that finding non-significant but same-sign effects in a large number of replication studies may represent overwhelming evidence for an effect. Immediate intuition for this is provided when looking at confidence intervals instead of *p*-values. Nonetheless, vote counting seems frequently to cause biased perceptions of what is a “replication failure.” + +###### *Prof. **[Norbert Hirschauer](https://www.landw.uni-halle.de/prof/lu/?lang=en)**, Dr. **[Sven Grüner](https://www.landw.uni-halle.de/prof/lu/mitarbeiter___doktoranden/gruener/)**, and Prof. **[Oliver Mußhoff](https://www.uni-goettingen.de/en/66131.html)** are agricultural economists in Halle (Saale) and Göttingen, Germany. Prof. **[Claudia Becker](https://statistik.wiwi.uni-halle.de/personal/?lang=en)** is an economic statistician in Halle (Saale). The authors are interested in connecting with economists who have an interest to further concrete steps that help prevent inferential errors associated with conventional significance declaration in econometric studies. Correspondence regarding this blog should be directed to Prof. Hischauer at norbert.hirschauer@landw.uni-halle.de.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/10/15/hirschauer-et-al-why-replication-is-a-nonsense-exercise-if-we-stick-to-dichotomous-significance-thinking-and-neglect-the-p-values-sample-to-sample-variability/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/10/15/hirschauer-et-al-why-replication-is-a-nonsense-exercise-if-we-stick-to-dichotomous-significance-thinking-and-neglect-the-p-values-sample-to-sample-variability/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/hirschauer-gr-ner-mu-hoff-fundamentals-of-statistical-inference-what-is-the-meaning-of-random-error.md b/content/replication-hub/blog/hirschauer-gr-ner-mu-hoff-fundamentals-of-statistical-inference-what-is-the-meaning-of-random-error.md new file mode 100644 index 00000000000..d328c202acc --- /dev/null +++ b/content/replication-hub/blog/hirschauer-gr-ner-mu-hoff-fundamentals-of-statistical-inference-what-is-the-meaning-of-random-error.md @@ -0,0 +1,110 @@ +--- +title: "HIRSCHAUER, GRÜNER, & MUßHOFF: Fundamentals of Statistical Inference: What is the Meaning of Random Error?" 
+date: 2022-07-30 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "economics" + - "Journal policies" + - "NHST" + - "p-values" + - "Random Error" + - "Randomized Controlled Trials" + - "Statistical inference" + - "Statistical reform" +draft: false +type: blog +--- + +*This blog is based on the book of the same name by Norbert Hirschauer, Sven Grüner, and Oliver Mußhoff that was published in **[SpringerBriefs in Applied Statistics and Econometrics](https://link.springer.com/book/10.1007/978-3-030-99091-6)** in August 2022. Starting from the premise that a lacking understanding of the probabilistic foundations of statistical inference is responsible for the inferential errors associated with the conventional routine of null-hypothesis-significance-testing (NHST), the book provides readers with an effective intuition and conceptual understanding of statistical inference. It is a resource for statistical practitioners who are confronted with the methodological debate about the drawbacks of “significance testing” but do not know what to do instead. It is also targeted at scientists who have a genuine methodological interest in the statistical reform debate.* + +**I. BACKGROUND** + +Data-based scientific propositions about the world are extremely important for sound decision-making in organizations and society as a whole. Think of climate change or the Covid-19 pandemic with questions such as of how face masks, vaccines or restrictions on people’s movements work. That said, it becomes clear that the debate on *p*-values and statistical significance tests addresses probably the most fundamental question of the data-based sciences: **How can we learn from data and come to the most reasonable belief (proposition) regarding a real-world state of interest given the available evidence (data) and the remaining uncertainty?** Answering that question and understanding when and how statistical measures (i.e., summary statistics of the given dataset) can help us evaluate the knowledge gain that can be obtained from a particular sample of data is extremely important in any field of science. + +In 2016, the *American Statistical Association* (ASA) issued an unprecedented [***methodological warning on p-values***](https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108) that set out what *p*-values are, and what they can and can’t tell us. It also contained a clear statement that, despite the delusive “hypothesis-testing” terminology of conventional statistical routines, *p*-values can neither be used to determine whether a hypothesis is true nor whether a finding is important. Against a background of persistent inferential errors associated with significance testing, the ASA felt compelled to pursue the issue further. In October 2017, it organized a symposium on the future of statistical inference whose major outcome was a [***special issue “Statistical Inference in the 21st Century: A World Beyond p < 0.05***](https://www.tandfonline.com/toc/utas20/73/sup1)[***”***](https://www.tandfonline.com/toc/utas20/73/sup1) in *The American Statistician.* Expressing their hope that this special issue would lead to a major rethink of statistical inference, the guest editors concluded that it is time to stop using the term “statistically significant” entirely. Almost simultaneously, a widely supported [***call to retire statistical significance***](https://www.nature.com/articles/d41586-019-00857-9) was published in *Nature*. 
+ +Empirically working economists might be perplexed by this fundamental debate. They are usually not trained statisticians but statistical practitioners. As such, they have a keen interest in their respective field of research but “only apply statistics” – usually by following the unquestioned routine of reporting *p*-values and asterisks. Due to the thousands of critical papers that have been written over the last decades, even most statistical practitioners will, by now, be aware of the severe criticism of NHST-practices that many used to follow much like an automatic routine. Nonetheless, all those without a methodological bent in statistics – and this is likely to be the majority of empirical researchers – are likely to be puzzled and ask the question: **What is going on here and what should I do now?** + +While the debate is highly visible now, many empirical researchers are likely to ignore that fundamental criticism of NHST have been voiced for decades – basically ever since significance testing became the standard routine in the 1950s (see Key References below). + +**KEY REFERENCES: Reforming statistical practices** + +**2022 – [Why and how we should join the shift from significance testing to estimation](https://onlinelibrary.wiley.com/doi/10.1111/jeb.14009)**: *Journal of Evolutionary Biology* (Berner and Amrhein) + +**2019 – [Embracing uncertainty: The days of statistical significance are numbered](https://onlinelibrary.wiley.com/doi/10.1111/pan.13721)**: *Pediatric Anesthesia* (Davidson) + +**2019 – [Call to retire statistical significance](https://www.nature.com/articles/d41586-019-00857-9)**: *Nature* (Amrhein et al.) + +**2019 –** [**Special issue editorial:** **“[I]t is time to stop using the term ‘statistically significant’ entirely. Nor should variants such as ‘significantly different,’ ‘*p* < 0.05,’ and ‘nonsignificant’ survive, […].”**](https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913) *The American Statistician* (Wasserstein et al.) + +**2018** – **[Statistical Rituals: The Replication Delusion and How We Got There](https://journals.sagepub.com/doi/full/10.1177/2515245918771329)**: *Advances in Methods and Practices in Psychological Science* (Gigerenzer) + +**2017 – [ASA symposium on “Statistical Inference in the 21st Century: A World Beyond *p* < 0.05”](https://imstat.org/meetings-calendar/asa-symposium-on-statistical-inference/)** + +**2016 – [ASA warning: “The *p*-value can neither be used to determine whether a scientific hypothesis is true nor whether a finding is important.”](https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108)** *The American Statistician* (Wasserstein and Lazar) + +**2016 – [Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations](https://pubmed.ncbi.nlm.nih.gov/27209009/)**: *European Journal of Epidemiology* (Greenland et al.) + +**2015 – [Editorial ban on using NHST](https://www.tandfonline.com/doi/full/10.1080/01973533.2015.1012991)**: *Basic and Applied Social Psychology* (Trafimow and Marks) + +**≈ 2015 –** [**Change of reporting standards in journal guidelines:** **“Do not use asterisks to denote significance of estimation results. 
Report the standard errors in parentheses.”**](https://www.aeaweb.org/journals/aer/styleguide#IVH) *American Economic Review* + +**2014 –** **[The Statistical Crisis in Science](https://www.americanscientist.org/article/the-statistical-crisis-in-science)**: *American Scientist* (Gelman and Loken) + +**2011 – [The Cult of Statistical Significance – What Economists Should and Should not Do to Make their Data Talk](https://elibrary.duncker-humblot.com/article/7433/the-cult-of-statistical-significance-what-economists-should-and-should-not-do-to-make-their-data-talk)**: *Schmollers Jahrbuch* (Krämer) + +**2008 – [The Cult of Statistical Significance. How the Standard Error Costs Us Jobs, Justice, and Lives](https://www.jstor.org/stable/10.3998/mpub.186351)**: *University of Michigan Press* (Ziliak and McCloskey). + +**2007 – [Statistical Significance and the Dichotomization of Evidence](https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1289846)**: *Journal of the American Statistical Association* (McShane and Gal) + +**2004 – [The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask](https://methods.sagepub.com/book/the-sage-handbook-of-quantitative-methodology-for-the-social-sciences/n21.xml)**: *SAGE handbook of quantitative methodology for the social sciences* (Gigerenzer et al.) + +**2000 – [Null hypothesis significance testing: A review of an old and continuing controversy](https://pubmed.ncbi.nlm.nih.gov/10937333/)**: *Psychological Methods* (Nickerson) + +**1996 – [A task force on statistical inference of the American Psychological Association dealt with calls for banning *p*-values but rejected the idea as too extreme](https://psycnet.apa.org/record/1999-03403-008)**: *American Psychologist* (Wilkinson and Taskforce on Statistical Inference) + +**1994 – [The earth is round (*p* < 0.05): “[A *p*-value] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!”](https://psycnet.apa.org/record/1995-12080-001)** *American Psychologist* (Cohen) + +… + +**1964 – [How should we reform the teaching of statistics? “[Significance tests] are popular with non-statisticians, who like to feel certainty where no certainty exists.”](https://www.jstor.org/stable/2344003?origin=crossref&seq=1)** *Journal of the Royal Statistical Society* (Yates and Healy) + +**1960 – [The fallacy of the null-hypothesis significance test](https://doi.apa.org/doiLanding?doi=10.1037%2Fh0042040)**: *Psychological Bulletin* (Rozeboom) + +**1959 – [Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance–Or Vice Versa](https://www.jstor.org/stable/2282137?origin=crossref&seq=1)**: *Journal of the American Statistical Association* (Sterling) + +**1951 – [The Influence of *Statistical Methods for Research Workers* on the Development of […] Statistics](https://www.tandfonline.com/doi/abs/10.1080/01621459.1951.10500764)**: *Journal of the American Statistical Association* (Yates) + +Already back in the 1950s, scientists started expressing severe criticisms of NHST and called for reforms that would shift reporting practices away from the dichotomy of significance tests to the estimation of effect sizes and uncertainty. It is safe to say that the core criticisms and reform suggestions remained largely unchanged over those last seven decades. This is because – unfortunately – the inferential errors that they address have remained the same. 
Nonetheless, after the intensified debate in the last decade and some institutional-level efforts, such as the revision of author guidelines in some journals, some see signs that a paradigm shift from testing to estimation is finally under way. Unfortunately, however, reforms seem to lag behind in economics compared to many other fields. + +**II. THE METHODOLOGICAL DEBATE IN A NUTSHELL** + +The present debate is concerned with the usefulness of *p*-values and statistical significance declarations for making inferences about a broader context based only on a limited sample of data. Simply put, two crucial questions can be discerned in this debate: + +**Question 1 – Transforming information:** What we can extract – at best – from a sample is an unbiased *point estimate* (“signal”) of an unknown population effect size (e.g., the relationship between education and income) and an unbiased estimation of the uncertainty (“noise”), caused by random error, of that point estimation (i.e., the *standard error*). We can, of course, go through various mathematical manipulations. **But why should we transform two intelligible and meaningful pieces of information – point estimate and standard error – into a *p*-value or even a dichotomous significance statement?** + +**Question 2 – Drawing inferences from non-random samples:** Statistical inference is based on probability theory and a formal chance model that links a randomly generated dataset to a broader target population. More pointedly, statistical assumptions are empirical commitments and acting as if one obtained data through random sampling does *not* create a random sample. **How should we then make inferences about a larger population in the many cases where there is only a convenience sample, such as a group of haphazardly recruited survey respondents, that researchers could get hold of in one way or the other?** + +It seems that, given the loss of information and the inferential errors that are associated with the NHST-routine, no convincing answer to the first question can be provided by its advocates. Even worse prospects arise when looking at the second question. Severe assumptions violations as regards data generation are quite common in empirical research and, particularly, in survey-based research. From a logical point of view, using inferential statistical procedures for non-random samples would have to be justified by running sample selection models that remedy selection bias. Alternatively, one would have to *postulate* that those samples are *approximately* random samples, which is often a heroic but deceptive assumption. This is evident from the mere fact that other probabilistic sampling designs (e.g., cluster sampling) can lead to standard errors that are several times larger than the default which presumes simple random sampling. Therefore, standard errors and *p*-values that are just based on a bold assumption of random sampling – contrary to how data were actually collected – are virtually worthless. Or more bluntly: **Proceeding with the conventional routine of displaying *p*-values and statistical significance even when the random sampling assumption is grossly violated is tantamount to pretending to have better evidence than one has. This is a breach of good scientific practice** that provokes overconfident generalizations beyond the confines of the sample. + +**III. 
WHAT SHOULD WE DO?** + +Peer-reviewed journals should be the gatekeepers of good scientific practice because they are key to what is publicly available as body of knowledge. Therefore, the most decisive statistical reform is to revise journal guidelines and include adequate inferential reporting standards. Around 2015, six prestigious economics journals (*Econometrica*, the *American Economic Review,* and the four *AE Journals*) adopted guidelines that require authors to refrain from using asterisks or other symbols to denote statistical significance. Instead, they are asked to report point estimates and standard errors. It seems that it would also make sense for other journals to reform their guidelines based on the understanding that the assumptions regarding random data generation must be met and that, if they are met, reporting point estimates and standard errors is a better summary of the evidence than *p*-values and statistical significance declarations. In particular, the Don’ts and Do’s listed in the box below should be communicated to authors *and* reviewers. + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2022/08/image.webp) + +Journal guidelines with inferential reporting standards similar to the ones in the box above would have several benefits: (1) They would effectively communicate necessary standards to authors. (2) They would help reviewers assess the credibility of inferential claims. (3) They would provide authors with an effective defense against unqualified reviewer requests. The latter is arguably be even the most important benefit because it would also mitigate publication bias that results from the fact that many reviewers still prefer statistically significant results and pressure researchers to report *p*-values and “significant novel discoveries” without even taking account of whether data were randomly generated or not. + +Despite the previous focus on random sampling, a last comment on inferences from randomized controlled trials (RCTs) seems appropriate: reporting standards similar to those in the box above should also be specified for RCTs. In addition, researchers should be required to communicate that, in RCTs, the standard error deals with the uncertainty caused by “randomization variation.” Therefore, in common settings where the experimental subjects are not randomly drawn from a larger population, the standard error only quantifies the uncertainty associated with the estimation of the ***sample* average treatment** effect, i.e., the effect that the treatment produces in the *given* group of experimental subjects. Only if the randomized experimental subjects have also been randomly sampled from a population, statistical inference can be used as auxiliary tool for making inferences about that population based on the sample. In this case, and only in this case, the adequately estimated standard error can be used to assess the uncertainty associated with the estimation of the ***population* average treatment** effect. + +*Prof. **[Norbert Hirschauer](https://www.landw.uni-halle.de/prof/lu/?lang=en)**, Dr. **[Sven Grüner](https://www.landw.uni-halle.de/prof/lu/mitarbeiter___doktoranden/gruener/)**, and Prof. **[Oliver Mußhoff](https://www.uni-goettingen.de/en/66131.html)** are agricultural economists in Halle (Saale) and Göttingen, Germany. 
The authors are interested in statistical reforms that help shift inferential reporting practices away from the dichotomy of significance tests to the estimation of effect sizes and uncertainty.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2022/07/30/hirschauer-gruner-mushoff-fundamentals-of-statistical-inference-what-is-the-meaning-of-random-error/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2022/07/30/hirschauer-gruner-mushoff-fundamentals-of-statistical-inference-what-is-the-meaning-of-random-error/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/hou-xue-zhang-replication-controversies-in-finance-accounting.md b/content/replication-hub/blog/hou-xue-zhang-replication-controversies-in-finance-accounting.md new file mode 100644 index 00000000000..e9b6095fada --- /dev/null +++ b/content/replication-hub/blog/hou-xue-zhang-replication-controversies-in-finance-accounting.md @@ -0,0 +1,117 @@ +--- +title: "HOU, XUE, & ZHANG: Replication Controversies in Finance & Accounting" +date: 2017-06-14 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Accounting" + - "Anomalies" + - "Finance" + - "Lu Zhang" + - "replication" +draft: false +type: blog +--- + +###### *[NOTE:**This entry is based on the article “Replicating Anomalies” (SSRN, updated in June 2017,* **[*https://papers.ssrn.com/sol3/papers.cfm?abstract\_id=2961979*](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2961979)*]*** + +###### Finance academics have started to take replication studies seriously. As hundreds of factors have been documented in recent decades, the concern over p-hacking has become especially acute. In a pioneering meta-study, Harvey, Liu, and Zhu (2016) introduce a multiple testing framework into empirical asset pricing. The threshold *t*-cutoff increases over time as more factors have been data-mined. A new factor today should have a *t*-value exceeding 3. + +###### Reevaluating 296 significant factors in published studies, Harvey et al. report that 80-158 (27%-53%) are false discoveries. Two publication biases are likely responsible for the high percentage of false positives. First, it is difficult to publish a negative result in top academic journals. Second, more subtly, it is difficult to publish replication studies in finance, while in many other disciplines replications routinely appear in top journals. As a result, finance and accounting academics tend to focus on publishing new factors rather than rigorously verifying the validity of published factors. + +###### Harvey (2017) elaborates the complex agency problem behind the publication biases. Journal editors compete for citation-based impact factors, and prefer to publish papers with the most significant results. In response to this incentive, authors often file away papers with results that are weak or negative, instead of submitting them for publication. More disconcertingly, authors often engage in, consciously or subconsciously, p-hacking, i.e., selecting sample criteria and test procedures until insignificant results become significant. The outcome is an embarrassingly large number of false positives that cannot be replicated in the future. + +###### We conduct a massive replication of the published factors by compiling a largest-to-date data library with 447 variables. 
The list includes 57, 68, 38, 79, 103, and 102 variables from the momentum, value-versus-growth, investment, profitability, intangibles, and trading frictions categories, respectively. We use a consistent set of replication procedures throughout. To control for microcaps (stocks that are smaller than the 20th percentile of market equity for New York Stock Exchange, or NYSE, stocks), we form testing deciles with NYSE breakpoints and value-weighted returns. We treat a variable as a replication success if its average return spread is significant at the 5% level. + +###### Our replication indicates rampant p-hacking in the published literature. Out of 447 factors, 286 (64%) are insignificant at the 5% level. Imposing the *t*– cutoff of 3 per Harvey, Liu, and Zhu (2016) raises the number of insignificance to 380 (85%). + +###### The biggest casualty is the liquidity literature. In the trading frictions category, 95 out of 102 variables (93%) are insignificant. Prominent variables that do not survive our replication include  Jegadeesh’s (1990) short-term reversal; Datar-Naik-Radcliffe’s (1998) share turnover; Chordia-Subrahmanyam-Anshuman’s (2001) coefficient of variation for dollar trading volume; Amihud’s (2002) absolute return-to-volume; Acharya-Pedersen’s (2005) liquidity betas; Ang-Hodrick-Xing-Zhang’s (2006) idiosyncratic volatility, total volatility, and systematic volatility; Liu’s (2006) number of zero daily trading volume; and Corwin-Schultz’s (2012) high-low bid-ask spread. Several recent friction variables that have received much attention are also insignificant, including Bali-Cakici-Whitelaw’s (2011) maximum daily return; Adrian-Etula-Muir’s (2014) intermediary leverage beta; and Kelly-Jiang’s (2014) tail risk. + +###### The much researched distress anomaly is virtually nonexistent. Campbell-Hilscher-Szilagyi’s (2008) failure probability, the O-score and Z-score in Dichev (1998), and Avramov-Chordia-Jostova-Philipov’s (2009) credit rating all produce insignificant average return spreads. + +###### Other influential but insignificant variables include Bhandari’s (1988) debt-to-market; Lakonishok-Shleifer-Vishny’s (1994) five-year sales growth; several of Abarbanell-Bushee’s (1998) fundamental signals; Diether-Malloy-Scherbina’s (2002) dispersion in analysts’ forecast; Gompers-Ishii-Metrick’s (2003) corporate governance index; Francis-LaFond-Olsson-Schipper’s (2004) earnings attributes, including persistence,  smoothness, value relevance, and conservatism; Francis et al.’s (2005) accruals quality; Richardson-Sloan-Soliman-Tuna’s (2005) total accruals; and Fama-French’s (2015) operating profitability, which is a key variable in their 5-factor model. + +###### Even for significant anomalies, their magnitudes are often much lower than originally reported. Famous examples include Jegadeesh-Titman’s (1993) price momentum; Lakonishok-Shleifer-Vishny’s (1994) cash flow-to-price; Sloan’s (1996) operating accruals; Chan-Jegadeesh-Lakonishok’s (1996) standardized unexpected earnings, abnormal returns around earnings announcements, and revisions in analysts’ earnings forecasts; Cohen-Frazzini’s (2008) customer momentum; and Cooper-Gulen-Schill’s (2008) asset growth. + +###### Why does our replication differ so much from the original studies? The key word is microcaps. Microcaps represent only 3% of the total market capitalization of the NYSE-Amex-NASDAQ universe, but account for 60% of the number of stocks. 
Microcaps not only have the highest equal-weighted returns, but also the largest cross-sectional standard deviations in returns and anomaly variables. Many studies overweight microcaps with equal-weighted returns, and often together with NYSE-Amex-NASDAQ breakpoints, in portfolio sorts. + +###### Hundreds of studies use cross-sectional regressions of returns on anomaly variables, assigning even higher weights to microcaps. The reason is that regressions impose a linear functional form, making them more susceptible to outliers, which most likely are microcaps. Alas, due to high costs in trading these stocks, anomalies in microcaps are more apparent than real. More important, with only 3% of the total market equity, the economic importance of microcaps is small, if not trivial. + +###### Our low replication rate of only 36% is not due to our extended sample relative to the original studies. Repeating our replication in the original samples, we find that 293 (66%) factors are insignificant at the 5% level, including 24, 44, 13, 38, 81, and 93 across the momentum, value-versus-growth, investment, profitability, intangibles, and trading frictions categories, respectively. Imposing the *t*-cutoff of three raises the number of insignificance to 387 (86.6%). The total number of insignificance at the 5% level, 293, is even higher than 286 in the extended sample. In all, the results from the original samples are close to those from the full sample. + +###### We also use the Hou, Xue, and Zhang (2015) *q*-factor model to explain the 161 significant anomalies in the full sample. Out of the 161, the *q*-factor model leaves 115 alphas insignificant (150 with *t*<3). In all, capital markets are more efficient than previously recognized. + +###### *Kewei Hou is Fisher College of Business Distinguished Professor of Finance at The Ohio State University. Chen Xue is Assistant Professor of Finance at University of Cincinnati. Lu Zhang is the John W. Galbreath Chair, Professor of Finance, at The Ohio State University. Correspondence about this blog should be addressed to Lu Zhang at* [*zhanglu@fisher.osu.edu*](mailto:zhanglu@fisher.osu.edu)*.* + +###### **REFERENCES** + +###### Abarbanell, J. S., & Bushee, B. J. (1998). Abnormal returns to a fundamental analysis strategy. *The Accounting Review*, 73, 19-45. + +###### Acharya, V. V., & Pedersen, L. H. (2005). Asset pricing with liquidity risk. *Journal  of Financial Economics*, 77, 375-410. + +###### Adrian, T., Etula, E., & Muir, T. (2014). Financial intermediaries and the cross-section of asset returns. *Journal of Finance*, 69, 2557-2596. + +###### Amihud, Y. (2002). Illiquidity and stock returns: Cross-section and time series evidence. *Journal of Financial Markets*, 5, 31-56. + +###### Ang, A., Hodrick, R. J., Xing, Y., & Zhang, X. (2006). The cross-section of volatility and expected returns. *Journal of Finance*, 61, 259-299. + +###### Avramov, D., Chordia, T., Jostova, G., & Philipov, A. (2009). Credit ratings and the cross-section of stock returns. *Journal of Financial Markets*, 12, 469-499. + +###### Bali, T. G., Cakici, N., & Whitelaw, R. F. (2011). Maxing out: Stocks as lotteries and the cross-section of expected returns. *Journal of Financial Economics*, 99, 427-446. + +###### Bhandari, L. C. (1988). Debt/equity ratio and expected common stock returns: Empirical evidence. *Journal of Finance*, 43, 507-528. + +###### Campbell, J. Y., Hilscher, J., & Szilagyi, J. (2008). In search of distress risk. *Journal of Finance*, 63, 2899-2939. 
+ +###### Chan, L. K. C., Jegadeesh, N., & Lakonishok, J. (1996). Momentum strategies, *Journal of Finance*, 51, 1681-1713. + +###### Chordia, T., Subrahmanyam, A., & Anshuman, V. R. (2001). Trading activity and expected stock returns. *Journal of Financial Economics*, 59, 3-32. + +###### Cohen, L., & Frazzini, A. (2008). Economic links and predictable returns, *Journal of Finance*, 63, 1977-2011. + +###### Cooper, M. J., Gulen, H., & Schill, M. J. (2008). Asset growth and the cross-section of stock returns, *Journal of Finance*, 63, 1609-1652. + +###### Corwin. S. A., & Schultz, P. (2012). A simple way to estimate bid-ask spreads from daily high and low prices. *Journal of Finance*, 67, 719-759. + +###### Datar, V. T., Naik, N. Y., & Radcliffe, R. (1998). Liquidity and stock returns: An alternative test. *Journal of Financial Markets*, 1, 203-219. + +###### Dichev, I. (1998). Is the risk of bankruptcy a systematic risk? *Journal of Finance*, 53, 1141-1148. + +###### Diether, K. B., Malloy, C. J., &Scherbina, A. (2002). Differences of opinion and the cross section of stock returns, *Journal of Finance*, 57, 2113-2141. + +###### Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model, *Journal of Financial Economics*, 116, 1-22. + +###### Francis, J., LaFond, R., Olsson, P. M., & Schipper, K. (2004). Cost of equity and earnings attributes, *The Accounting Review*, 79, 967-1010. + +###### Francis, J., LaFond, R., Olsson, P. M., & Schipper, K. (2005). The market price of accruals quality, *Journal of Accounting and Economics*, 39, 295-327. + +###### Gompers, P., Ishii, J., & Metrick, A. (2001). Corporate governance and equity prices, *Quarterly Journal of Economics,* 118, 107-155. + +###### Harvey, C. R. (2017). Presidential address: The scientific outlook in financial economics. *Journal of Finance*, forthcoming. + +###### Harvey, C. R., Liu, Y., & Zhu, H. (2016). …and the cross-section of expected returns. *Review of Financial Studies*, 29, 5-68. + +###### Hou, K., Xue, C., & Zhang, L. (2015). Digesting anomalies: An investment approach. *Review of Financial Studies*, 28, 650-705. + +###### Jegadeesh, N. (1990). Evidence of predictable behavior of security returns. *Journal of Finance*, 45, 881-898. + +###### Jegadeesh, N. & Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. *Journal of Finance*, 48, 65-91. + +###### Kelly, B., & Jiang, H. (2014). Tail risk and asset prices. *Review of Financial Studies*, 27, 2841-2871. + +###### Lakonishok, J., Shleifer, A., & Vishny, R. W. (1994). Contrarian investment, extrapolation, and risk, *Journal of Finance*, 49, 1541-1578. + +###### Liu, W. (2006). A liquidity-augmented capital asset pricing model. *Journal of Financial Economics*, 82, 631-671. + +###### Richardson, S. A., Sloan, R. G., Soliman, M. T., & Tuna, I. (2005). Accrual reliability, earnings persistence and stock prices, *Journal of Accounting and Economics*, 39, 437-485. + +###### Sloan, R. G. (1996). Do stock prices fully reflect information in accruals and cash flows about future earnings? *The Accounting Review,* 71, 289-315. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/06/14/hou-xue-zhang-replication-controversies-in-finance-accounting/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/06/14/hou-xue-zhang-replication-controversies-in-finance-accounting/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/hubbard-a-common-sense-typology-of-replications.md b/content/replication-hub/blog/hubbard-a-common-sense-typology-of-replications.md new file mode 100644 index 00000000000..9879bb9453d --- /dev/null +++ b/content/replication-hub/blog/hubbard-a-common-sense-typology-of-replications.md @@ -0,0 +1,91 @@ +--- +title: "HUBBARD: A Common-Sense Typology of Replications" +date: 2017-04-03 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Open Science Collaboration" + - "Raymond Hubbard" + - "replications" + - "Reproducibility crisis" + - "Typology" +draft: false +type: blog +--- + +###### *[NOTE: This entry is based on the book “Corrupt Research: The Case for Reconceptualizing Empirical Management and Social Science” by Raymond Hubbard]* + +###### Psychology’s “reproducibility crisis” (Open Science Collaboration, 2015) has drawn attention to the need for replication research. However, focusing on the reproducibility of findings, while clearly important, is a much too narrow interpretation of replication’s role in the scientific enterprise. This account outlines some additional roles. + +###### Based on the two dimensions of (1) data sources and (2) research methods, the table below lists six different kinds of replications, each with its own part to play. + +![Hubbard](/replication-network-blog/hubbard.webp) + +###### (1)   *Checking of Analysis: Determining the Accuracy of Results* + +###### Independent reexaminations of the original data, using the same methods of analysis. Are the results error-free? + +###### (2)   *Reanalysis of Data: Determining Whether Results Hold Up Using Different Analytical Methods* + +###### Independent reexaminations of the original data, using different methods of analysis. Are the results the “same”? + +###### Using the above approaches, many “landmark” results—e.g., the Hawthorne effect, J.B. Watson’s conditioning of Little Albert, Sir Cyril Burt’s “twins” research, and Durkheim’s theory of suicide—have been found to be invalid. + +###### I do not consider (1) and (2) to be authentic forms of replication. They clearly, however, play a vital role in protecting the integrity of the empirical literature. + +###### (3)   *Exact Replications: Determining Whether Results are Reproducible* + +###### An authentic form of replication, one which most people see as THE definition of replication. Here, we follow as closely as possible the same procedures used in the earlier study on a new sample drawn from the same population. This was the approach adopted by the Open Science Collaboration (2015) project. + +###### (4)   *Conceptual Extensions: Determining Whether Results Hold Up When Constructs and Their Interrelationships are Measured/Analyzed Differently* + +###### These differences lie in how theoretical constructs are measured, and how they interrelate with other constructs. Conceptual extensions address the issue of the *construct validity* of the entities involved. This can only be done by replications assessing a construct’s (a) Convergent, (b) Discriminant, and (c) Nomological validities. + +###### Otherwise expressed, replication research is crucial to *theory development.* First, it is replication research which is essential to the initial measurement, and further refinement, of the theoretical constructs themselves. Second, it is replication research which is responsible for monitoring the linkages (theoretical consistency) between these constructs. 
Third, it is replication research which judges the adequacy of this system of constructs for explaining some of what we see in the world around us. + +###### (5)   *Empirical Generalizations: Determining Whether Results Hold Up in New Domains* + +###### Here the focus is on the external validity, or generalizability, of results when changes in persons, settings, treatments, outcomes, and time periods are made (Shadish, Cook, and Campbell, 2002). For example, Helmig, et al.’s (2012) successful replication using Swiss data of Jacobs and Glass’s (2002) U.S. study on media publicity and nonprofit organizations. + +###### (6)   *Generalizations and Extensions: Determining Whether Results Hold Up in New Domains and With New Methods of Measurement and/or Analysis* + +###### Typically, these do not constitute authentic replications. Many of them are mainstream studies dealing with theory testing. That is, the emphasis is on *theory* extension, and not on extensions to previous *empirical* findings (Hubbard and Lindsay, 2002, p. 399). + +###### *Replication and Validity Generalization* + +###### Replication research underlies the validity generalization process. + +###### Exact Replications allow appraisal of the *internal* validity of a study. They also enable the establishment of facts and the causal theories underlying them. + +###### Conceptual Replications extend the development of causal theory by examining the validity of hypothetical constructs and their interrelationships. Specifically, they make possible the evaluation of a construct’s *convergent, discriminant,* and *nomological* validity. What could be more important than this? + +###### Empirical Generalizations permit investigations of whether the same (similar) findings hold up across (sub)populations so addressing the neglected topic of a study’s *external* validity. + +###### It is for good reason that replication research is said to be at the heart of scientific progress. + +###### *Raymond Hubbard is Thomas F. Sheehan Distinguished Professor of Marketing, Emeritus, at Drake University. Correspondence about this blog should be addressed to* [*drabbuhyar@aol.com*](mailto:drabbuhyar@aol.com)*.* + +###### REFERENCES + +###### Helmig, B., Spraul, K., & Tremp, K. (2012). Replication studies in nonprofit research: A generalization and extension of findings regarding the media publicity of nonprofit organizations. *Nonprofit and Voluntary Sector Quarterly,* 41, 360‑385. + +###### Hubbard, R. (2016). *Corrupt Research: The Case for Reconceptualizing Empirical Management and Social Science.* (2016). Sage Publications: Thousand Oaks, CA. + +###### Hubbard, R. & Lindsay, R.M. (2002). How the emphasis on “original” empirical marketing research impedes knowledge development. *Marketing Theory,* 2, 381‑402. + +###### Jacobs, R.N. & Glass, D.J. (2002). Media publicity and the voluntary sector: The case of nonprofit organizations in New York City. *Voluntas: International Journal of Voluntary and Nonprofit Organizations,* 13, 235‑252. + +###### Open Science Collaboration (2015). Estimating the reproducibility of psychological science. *Science,* 349, aac4716‑1‑8. + +###### Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). *Experimental and quasi-experimental designs for generalized causal inference.* Houghton Mifflin: Boston, MA. + +###### Tsang, E.W.K. & Kwan, K.-M. (1999). Replication and theory development in organizational science: A critical realist perspective. *Academy of Management Review,* 24, 759‑780. 
+ +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/04/03/hubbard-a-common-sense-typology-of-replications/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/04/03/hubbard-a-common-sense-typology-of-replications/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/iso-ahola-on-reproducibility-and-replication-in-psychological-and-economic-sciences.md b/content/replication-hub/blog/iso-ahola-on-reproducibility-and-replication-in-psychological-and-economic-sciences.md new file mode 100644 index 00000000000..ae20ffbca37 --- /dev/null +++ b/content/replication-hub/blog/iso-ahola-on-reproducibility-and-replication-in-psychological-and-economic-sciences.md @@ -0,0 +1,61 @@ +--- +title: "ISO-AHOLA: On Reproducibility and Replication in Psychological and Economic Sciences" +date: 2017-07-11 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "negative results" + - "Physics" + - "Psychology" + - "Replicability" + - "Reproducibility" +draft: false +type: blog +--- + +###### *[This blog is a summary of a longer treatment of the subject that was published in Frontiers in Psychology in June 2017.  To read that article, **[click here](http://journal.frontiersin.org/article/10.3389/fpsyg.2017.00879/full)**.]* + +###### Physicists have asked “why is there something rather than nothing?” They have theorized that it had to do with the formation of an asymmetry between matter and antimatter in the fractions of milliseconds after the Big Bang. Psychologists and economists could ask a similar question, “why do  psychological and economic phenomena exist?” + +###### A simple answer is because people exist as psychological and economic entities, therefore so do psychological and economic phenomena. Since “nothingness” is not a phenomenon, there are no logical or philosophical reasons to empirically cast some phenomena into the trash bin of nothingness. Moreover, empirical science does not have the tools to do so because a negative, such as “God does not exist”, cannot ever be proved. This state of affairs is the case because: + +###### (1) The presence of evidence for **X** does not necessarily mean the absence of evidence for **Y. X’s** existence is not a precondition for **Y’s** nonexistence unless they are mutually exclusive effects, which is rare. + +###### (2) Empiricists can never test all of the conditions and groups of people on earth; they cannot even think of all conditions that could give rise to a phenomenon. Further, “there is an infinite number of ideas and ways” to test phenomena, and consequently, “no idea ever achieves the status of final truth” (McFall). + +###### (3) Empiricists do not have perfectly reliable and valid measures. + +###### (4) Human events and behaviors are multi-causal in real life, even in lab experiments. This means that the manipulation of a focal independent variable affects other causal factors. Moreover, researchers cannot control for all the possible confounds or even think of all of them, since “everything is correlated with everything else, more or less”( Meehl); thus, a theoretically established phenomenon under study is never zero. + +###### (5) Humans are fickle and elusive, sometimes unbearably simple and at other times, irreducibly complex in their thinking, and sometimes both at the same time. Thus, the human mind is unreproducible from situation to situation. 
Unlike those in physics (e.g., speed of light), psychological and economic phenomena are not fixed constants in space and time. There are no cognitive dissonance particles or “Phillips curve” particles that could irrevocably be verified by empirical data and subsequently declared universal constants. + +###### In short, the nonexistence of phenomena is not logically viable. Since there are infinite ways of measuring and studying a phenomenon, logically, it should be possible to devise experiments both to demonstrate that a phenomenon exists and that it does not exist in specific conditions and during specific times when using specific methods and specific tasks. The former finding means that the phenomenon has been demonstrated to exist and it cannot be retracted. + +###### On the one hand, conceptual replications can establish a phenomenon’s boundary conditions (i.e., when it is more and less likely to occur). On the other hand, the finding of nonexistence would not invalidate the phenomenon — only that it is not strong enough to register in a specific condition. For example, psychologically, people do not “choke” under all stressful conditions because they have learned to deal with pressure. Similarly, if the Phillips curve does not explain the inverse relationship between unemployment and inflation in the present low-interest economic environment particularly well, it can do so under different economic circumstances. + +###### Does the conditional nature of these phenomena mean they should disappear into a “black hole” of nonexistent phenomena. Of course, not. The best that can be done is to empirically test and conceptually replicate well-developed theories (their tenets) with the best tools available, but never claim that the phenomena they describe and explain do not exist. + +###### The present emphasis on reproducibility in psychological and economic science, unfortunately, stems from the application of the physics model of independent verification of precise numeric values for phenomena, such as three recent independent confirmations of “gravity waves” predicted by Einstein’s theory. This type of replication is possible only if there are universal constants to be verified. However, there are none in psychology and economics, and not even in biology. + +###### Physics seeks to discover the laws of *nature*, whereas in other sciences, both *nature and nurture* have to be taken into account. This Person-Environment interaction in human cognition, performance and behavior makes direct and precise replications impossible. Human conditions vary, for one thing, because individuals (investors, policy-makers and politicians) are not invariant and rational in their decisions and judgments. “Behavioral economists” (e.g., Thaler and Kahneman) have shown that the rational-agent model is a poor explanation for financial judgments and decisions, or economic growth more generally. People may consider all the information provided but they are also influenced by their self-generated and environmentally-induced emotions in their judgments and behaviors. Individual investors (e.g., prospective homeowners) can also be led to make boneheaded decisions, resulting in “collective blindness” (Kahneman) that can in turn create national and international financial crises, as was seen in the 2008 financial calamity. + +###### All of this means that there will be deviations from overall patterns of individual financial behaviors, and therefore direct and precise replications are impossible. 
However, conceptual or constructive replications are helpful in elucidating the boundary conditions for the overall pattern, conditions under which a phenomenon is strong and weak. But if we insist on precise replications, then no psychological or economic phenomena exist because it impossible to have and create identical conditions to those of the original testing. + +###### There are no universal constants to be precisely replicated outside the laws of nature and physics. If the conditions are not the same at the individual level, they are not the same at the macro level either. History does not exactly repeat itself, it only rhymes (Twain). The conditions that led to a recession at one time will not be the same causes for the next recession. At the macro level, researchers can build theoretical models trying to predict the next recession, but they  conceivably cannot consider all relevant variables, especially exogenous ones, and thus precise predictions (replications) are not possible. Nevertheless, this does not prevent pundits from arguing that it is “different this time”, it is “a new normal”. + +###### A replication’s success is typically determined by statistical means (traditionally p-value and now Effect Size). But psychological and economic phenomena cannot be reduced to statistical phenomena, and theoretical and methodological deficiencies cannot be saved by statistical analyses. Science mainly advances by theory building and model construction, not by empirical testing and replication of the statistical null hypothesis. Psychological and economic phenomena are largely theoretical constructs, not unlike those in physics. Just think where physics would be today without Einstein’s theories. The Higgs boson particle was theorized to exist in 1964 but not verified until 2012. Did the particle not exist in the meantime? + +###### Thus, empirical studies are mainly evaluated for their theoretical relevance and importance, and less for their success or failure in exactly reproducing original findings. It is not empirical data but theory that has generally made scientific progress possible, which is as true of physics as it is of psychology and economics. Along the way, empirical data have complemented and contributed to the expansion of theoretical models, and theories have made data more useful. Of course, theories are eventually abandoned in Kuhnian-like paradigm shifts. In the meantime, there are only “temporary winners” as scientific knowledge is “provisional” and “propositional” in nature. + +###### *Seppo Iso-Ahola is Professor of Psychology in Kinesiology at the School of Public Health, University of Maryland.  He can be contacted at [isoahol@umd.edu](mailto:isoahol@umd.edu).* + +###### \*The author thanks Roger C. Mannell for his helpful comments and suggestions. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/07/11/iso-ahola-on-reproducibility-and-replication-in-psychological-and-economic-sciences/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/07/11/iso-ahola-on-reproducibility-and-replication-in-psychological-and-economic-sciences/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/john-cochrane-secret-data.md b/content/replication-hub/blog/john-cochrane-secret-data.md new file mode 100644 index 00000000000..6ae674f3b63 --- /dev/null +++ b/content/replication-hub/blog/john-cochrane-secret-data.md @@ -0,0 +1,107 @@ +--- +title: "JOHN COCHRANE: Secret Data" +date: 2015-12-31 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "AER" + - "John Cochrane" + - "replication" +draft: false +type: blog +--- + +###### (REPOST FROM JOHN COCHRANE’S BLOG, *[THE GRUMPY ECONOMIST](http://johnhcochrane.blogspot.co.nz/2015/12/secret-data.html)*) + +###### On replication in economics. Just in time for bar-room discussions at the annual meetings. + +###### **“I have a truly marvelous demonstration of this proposition which this margin is too narrow to contain.”   –*Fermat*** + +###### **“I have a truly marvelous regression result, but I can’t show you the data and won’t even show you the computer program that produced the result”  – *Typical paper in economics and finance*.** + +###### THE PROBLEM  Science demands transparency. Yet much research in economics and finance uses secret data. The journals publish results and conclusions, but the data and sometimes even the programs are not available for review or inspection.  Replication, even just checking what the author(s) did given their data, is getting harder. Quite often, when one digs in, empirical results are nowhere near as strong as the papers make them out to be. + +###### – Simple coding errors are not unknown. Reinhart and Rogoff are a famous example — which only came to light because they were honest and ethical and posted their data. + +###### – There are data errors. + +###### – Many results are driven by one or two observations, which at least tempers the interpretation of the results. Often a simple plot of the data, not provided in the paper, reveals that fact. + +###### – Standard error computation is a dark art, producing 2.11 t statistics and the requisite two or three stars suspiciously often. + +###### – Small changes in sample period or specification destroy many “facts.” + +###### – Many regressions involve a large set of extra right hand variables, with no strong reason for inclusion or exclusion, and the fact is often quite sensitive to those choices. Just which instruments you use and how to transform variables changes results. + +###### – Many large-data papers difference, difference differences, add dozens of controls and fixed effects, and so forth, throwing out most of the variation in the data in the admirable quest for cause-and-effect interpretability. Alas, that procedure can load the results up on measurement errors, or slightly different and equally plausible variations can produce very different results. + +###### – There is often a lot of ambiguity in how to define variables,  which proxies to use, which data series to use, and so forth, and equally plausible variations change the results. + +###### I have seen many examples of these problems, in papers published in top journals. Many facts that you think are facts are not facts. Yet as more and more papers use secret data, it’s getting harder and harder to know. The solution is pretty obvious: to be considered peer-reviewed “scientific” research, authors should post their programs and data. If the world cannot see your lab methods, you have an anecdote, an undocumented claim, you don’t have research. 
An empirical paper without data and programs is like a theoretical paper without proofs. + +###### RULES + +###### Faced with this problem, most economists jump to rules and censorship. They want journals to impose replicability rules, and refuse to publish papers that don’t meet those rules. The American Economic Review has followed this suggestion, and other journals, such as the Journal of Political Economy, are following. On reflection, that instinct is a bit of a paradox. Economists, when studying everyone else, by and large value free markets, demand as well as supply, emergent order, the marketplace of ideas, competition, entry, and so on, not tight rules and censorship. Yet in running our own affairs, the inner dirigiste quickly wins out. In my time at faculty meetings, there were few problems that many colleagues did not want to address by writing more rules. And with another moment’s reflection (much more below), you can see that the rule-and-censorship approach simply won’t work. There isn’t a set of rules we can write that assures replicability and transparency, without the rest of us having to do any work. And rule-based censorship invites its own type I errors. Replicability is a squishy concept — just like every other aspect of evaluating scholarly work. Why do we think we need referees, editors, recommendation letters, subcommittees, and so forth to evaluate method, novelty, statistical procedure, and importance, but replicability and transparency can be relegated to a set of mechanical rules? + +###### DEMAND + +###### So, rather than try to restrict supply and impose censorship, let’s work on demand. If you think that replicability matters, what can you do about it? A lot: + +###### – When a journal with a data policy asks you to referee a paper, check the data and program file. Part of your job is to see that this works correctly. + +###### – When you are asked to referee a paper, and data and programs are not provided, see if data and programs are on authors’ websites. If not, ask for the data and programs. If refused, refuse to referee the paper. You cannot properly peer-review empirical work without seeing the data and methods. + +###### – I don’t think it’s necessary for referees to actually do the replication for most papers, any more than we have to verify arithmetic. Nor, in my view, do we have to dot i’s and cross t’s on the journal’s policy, any more than we pay attention to their current list of referee instructions. Our job is to evaluate whether we think the authors have done an adequate and reasonable job, as standards are evolving, of making the data and programs available and documented. Run a regression or two to let them know you’re looking, and to verify that their posted data actually works. Unless of course you smell a rat, in which case, dig in and find the rat. + +###### – Do not cite unreplicable articles. If editors and referees ask you to cite such papers, write back “these papers are based on secret data, so should not be cited.” If editors insist, cite the paper as “On request of the editor, I note that Smith and Jones (2016) claim x. However, since they do not make programs / data available, that claim is not replicable.” + +###### – When asked to write a promotion or tenure letter, check the author’s website or journal websites of the important papers for programs and data. Point out secret data, and say such papers cannot be considered peer-reviewed for the purposes of promotion. (Do this the day you get the request for the letter.
You might prompt some fast disclosures!) + +###### – If asked to discuss a paper at a conference, look for programs and data on authors’ websites. If not available, ask for the data and programs. If they are not provided, refuse. If they are, make at least one slide in which you replicate a result, and offer one opinion about its robustness. By example, let’s make replication routinely accepted. + +###### – A general point: Authors often do not want to post data and programs for unpublished papers, which can be reasonable. However, such programs and data can be made available to referees, discussants, letter writers, and so forth, in confidence. + +###### – If organizing a conference, do not include papers that do not post data and programs. If you feel that’s too harsh, at least require that authors post data and programs for published papers and make programs and data available to discussants at your conference. + +###### – When discussing candidates for your institution to hire, insist that such candidates disclose their data and programs. Don’t hire secret data artists. Or at least make a fuss about it. + +###### – If asked to serve on a committee that awards best paper prizes, association presidencies, directorships, fellowships or other positions and honors, or when asked to vote on those, check the authors’ websites or journal websites. No data, no vote. The same goes for annual AEA and AFA elections. Do the candidates disclose their data and programs? + +###### – Obviously, lead by example. Put your data and programs on your website. + +###### – Value replication. One reason we have so little replication is that there is so little reward for doing it. So, if you think replication is important, value it. If you edit a journal, publish replication studies, positive and negative. (Especially if your journal has a replication policy!) When you evaluate candidates, write tenure letters, and so forth, value replication studies, positive and negative. If you run conferences, include a replication session. + +###### In all this, you’re not just looking for some mess on some website, put together to satisfy the letter of a journal’s policy. You’re evaluating whether the job the authors have done of documenting their procedures and data rises to the standards of what you’d call replicable science, within reason, just like every other part of your evaluation. Though this issue has bothered me a long time, I have not started doing all the above. I will start now. Here, some economists I have talked to jump to suggesting a call to coordinated action. That is not my view. I think this sort of thing can and should emerge gradually, as a social norm. If a few of us start doing this sort of thing, others might notice. They think “that’s a good idea,” and start doing it too. They also may feel empowered to start doing it. The first person to do it will seem like a bit of a jerk. But after you read three or four tenure letters that say “this seems like fine research, but without programs and data we won’t really know,” you’ll feel better about writing that yourself. Like “would you mind putting out that cigarette.” Also, the issues are hard, and I’m not sure exactly what is the right policy. Good social norms will evolve over time to reflect the costs and benefits of transparency in all the different kinds of work we do. If we all start doing this, journals won’t need to enforce long rules. Data disclosure will become as natural and self-enforced a part of writing a paper as is proving your theorems.
Conversely, if nobody feels like doing the above, then maybe replication isn’t such a problem at all, and journals are mistaken in adding policies. + +###### RULES WON’T WORK WITHOUT DEMAND + +###### Journals are treading lightly, and rightly so. Journals are competitive too. If the JPE refuses a paper because the author won’t disclose data, the QJE publishes it, and the paper goes on to great acclaim, winning its author the Clark medal and the Nobel Prize, then the JPE falls in stature and the QJE rises. New journals will spring up with more lax policies. Journals themselves are a curious relic of the print age. If readers value empirical work based on secret data, academics will just post their papers on websites, working paper series, SSRN, RePEc, blogs, and so forth. + +###### So if there is no demand, why restrict supply? If people are not taking the above steps on their own — and by and large they are not — why should journals try to shove it down authors’ throats? Replication is not an issue about which we really can write rules. It is an issue — like all the others involving evaluation of scientific work — for which norms have to evolve over time and users must apply some judgement. Perfect, permanent replicability is impossible. If replication is done with programs that access someone else’s database, those databases change and access routines change. Within a year, if the programs run at all, they give different numbers. New versions of software give different results. The best you can do is to freeze the data you actually use, hosted on a virtual machine that uses the same operating system, software version, and so on. Even that does not last forever. And no journal asks for it. Replication is a small part of a larger problem: data collection itself. Much data these days is collected by hand, or scraped by computer. We cannot and should not ask for a webcam or keystroke log of how data was collected, or hand-categorized. Documenting this step so it can be redone is vital, but it will always be a fuzzy process. In response to “post your data,” authors respond that they aren’t allowed to do so, and journal rules allow that response. You have only to post your programs, and then a would-be replicator must arrange for access to the underlying data. No surprise, very little replication that requires such extensive effort is occurring. And rules will never be enough. Regulation invites just-within-the-boundaries games. Provide the programs, but with poor documentation. Provide the data with no headers. Don’t write down what the procedures are. You can follow the letter and not the spirit of rules. Demand invites serious effort towards transparency. I post programs and data. Judging by emails when I make a mistake, these get looked at maybe once every 5 years. The incentive to do a really good job is not very strong right now. + +###### Poor documentation is already a big problem. My modal referee comment these days is “the authors did not write down what they did, so I can’t evaluate it.” Even without posting programs and data, the authors simply don’t write down the steps they took to produce the numbers. The demand for such documentation has to come from readers, referees, citers, and admirers, and posting the code is only a small part of that transparency. A hopeful thought: Currently, one way we address these problems is by endless referee requests for alternative procedures and robustness checks.
Perhaps these can be answered in the future by “the data and code are online, run them yourself if you’re worried!” I’m not arguing against rules, such as the AER has put in. I just think that they will not make a dent in the issue until we economists show by our actions some interest in the issue. + +###### PROPRIETARY DATA, COMMERCIAL DATA, GOVERNMENT DATA + +###### Many data sources explicitly prohibit public disclosure of the data. Disclosing such secret data remains beyond the current journal policies, or policies that anyone imagines asking journals to impose. Journals can require that you post code, but then a replicator has to arrange for access to the data. That can be very expensive, or require a coauthor who works at the government agency. No surprise, such replication doesn’t happen very often. However, this is mostly not an insoluble problem, as there is almost never a fundamental reason why the data needed for verification and robustness analysis cannot be disclosed. Rules and censorship are not strong enough to change things. Widespread demand for transparency might well be. To substantiate much research, and check its robustness to small variations in statistical method, you do not need full access to the underlying data. An extract is enough, and usually the nature of that extract makes it useless for other purposes. The extract needed to verify one paper is usually useless for writing other papers. The terms for using posted data could be that you cannot use the data to publish new original work, only for verification and comment on the posted paper. Abiding by this restriction is a lot easier to police than the current replication policies. Even if the slice of data needed to check a paper’s results cannot be public, it can be provided to referees or discussants, after signing a stack of non-use and non-disclosure agreements. (That is a less-than-optimal outcome of course, since in the end real verification won’t happen unless people can publish verification papers.) Academic papers take 3 to 5 years or more for publication. A 3 to 5 year old slice of data is useless for most purposes, especially the commercial ones that worry data providers. Commercial and proprietary (bank) data sets are designed for paying customers who want up-to-the-minute data. Even CRSP data, a month old, is not much used commercially, because traders need up-to-the-minute data for trading. Hedge fund and mutual fund data is used and paid for by people researching the histories of potential investments. Two-year-old data is useless to them — so much so that getting the providers to keep old slices of data to overcome survivor bias is a headache. In sum, the 3-5 year old, redacted, minimalist slice of data needed to substantiate the empirical work in an academic paper is in fact seldom a substantial threat to the commercial, proprietary, or genuine privacy interest of the data collectors. + +###### The problem is fundamentally about contracting costs. We are in most cases secondary or incidental users of data, not primary customers. Data providers’ legal departments don’t want to deal with the effort of writing contracts that allow disclosure of data that is 99% useless but might conceivably be of value or cause them trouble. Both private and government agency lawyers naturally adopt a CYA attitude by just saying no. + +###### But that can change.
If academics can’t get a paper conferenced, refereed, read and cited with secret data, if they can’t get tenure, citations, or a job on that basis, the academics will push harder. Our funding centers and agencies (NSF) will allocate resources to hire some lawyers. Government agencies respond to political pressure. If their data collection cannot be used in peer-reviewed research, that’s one less justification for their budget. If Congress hears loudly from angry researchers who want their data, there is a force for change. But so long as you can write famous research without pushing, the apparently immovable rock does not move. + +###### The contrary argument is that if we impose these costs on researchers, then less research will be done, and valuable insights will not benefit society. But here you have to decide whether research based on secret data is really research at all. My premise is that, really, it is not, so the social value of even apparently novel and important claims based on secret data is not that large. + +###### Clearly, nothing of this sort will happen if journals try to write rules, in a profession in which nobody is taking the above steps to demand replicability. Only if there is a strong, pervasive, professional demand for transparency and replicability will things change. + +###### AUTHOR’S INTEREST + +###### Authors often want to preserve their use of data until they’ve fully mined it. If they put in all the effort to produce the data, they want first crack at the results. This valid concern does not mean that they cannot create redacted slices of data needed to substantiate a given paper. They can also let referees and discussants access such slices, under the above strict non-disclosure and non-use agreements. In fact, it is usually in authors’ interest to make data available sooner rather than later. Everyone who uses your data is a citation. There are far more cases of authors who gained notoriety and long citation counts from making data public early than there are of authors who jealously guarded data so they would get credit for the magic regression that would appear 5 or more years after data collection. Yet this property right is up to the data collector to decide. Our job is to say “that’s nice, but we won’t really believe you until you make the data public, at least the data I need to see how you ran this regression.” If you want to wait 5 years to mine all the data before making it public, then you might not get the glory of “publishing” the preliminary results. That’s again why voluntary pressure will work, and rules from above will not work. + +###### SERVICE + +###### One empiricist who I talked to about these issues does not want to make programs public, because he doesn’t want to deal with the consequent wave of emails from people asking him to explain bits of code, or claiming to have found errors in 20-year-old programs. Fair enough. But this is another reason why a loose code of ethics is better than a set of rules for journals. You should make a best faith effort to document code and data when the paper is published. You are not required to answer every email from every confused graduate student for eternity after that point. Critiques and replication studies can be refereed in the usual way, and must rise to the usual standards of documentation and plausibility. + +###### WHY REPLICATION MATTERS FOR ECONOMICS + +###### Economics is unusual. In most experimental sciences, once you collect the data, the fact is there or not. If it’s in doubt, collect more data.
Economics features large and sophisticated statistical analysis of non-experimental data. Collecting more data is often not an option, and not really the crux of the problem anyway. You have to sort through the given data in a hundred or more different ways to understand that a cause-and-effect result is really robust. Individual authors can do some of that — and referees tend to demand exhausting extra checks. But there really is no substitute for the social process by which many different authors, with different priors, play with the data and methods. Economics is also unusual, in that the practice of redoing old experiments over and over, common in science, is rare in economics. When Ben Franklin stored lightning in a condenser, hundreds of other people went out to try it too, some discovering that it wasn’t the safest thing in the world. They did not just read about it and take it as truth. A big part of a physics education is to rerun classic experiments in the lab. Yet it is rare for anyone to redo — and question — classic empirical work in economics, even as a student. Of course everything comes down to costs. If a result is important enough, you can go get the data, program everything up again, and see if it’s true. Even then, the question comes: if you can’t get x’s number, why not? It’s really hard to answer that question without x’s programs and data. But the whole thing is a whole lot less expensive and time-consuming, and thus a whole lot more likely to happen, if you can use the author’s programs and data. + +###### WHERE WE ARE + +###### The American Economic Review has a strong [*data and programs disclosure policy*](https://www.aeaweb.org/aer/data.php). The *[JPE adopted the AER data policy](http://www.press.uchicago.edu/journals/jpe/datapolicy.html?journal=jpe)*. A [*good John Taylor blog post*](http://economicsone.com/2013/07/05/make-replication-easy-in-economics/) on replication and the history of the AER policy. The QJE has decided not to; I asked an editor about it and heard very sensible reasons. Here is a very good *[review article on data policies at journals](http://openeconomics.net/resources/data-policies-of-economic-journals/)* by Sven Vlaeminck. The AEA is running a survey about its journals, and asks some replication questions. If you’re an AEA member, you got it. Answer it. I added to mine, “if you care so much about replication, you should show you value it by routinely publishing replication articles.” How is it working? The *[Report on the American Economic Review Data Availability Compliance Project](https://www.aeaweb.org/aer/2011_Data_Compliance_Report.pdf)* concludes: + +###### *“All authors submitted something to the data archive. Roughly 80 percent of the submissions satisfied the spirit of the AER’s data availability policy, which is to make replication and robustness studies possible independently of the author(s). The replicated results generally agreed with the published results. There remains, however, room for improvement both in terms of compliance with the policy and the quality of the materials that authors submit.”* + +###### However, Andrew Chang and Phillip Li disagree, in the nicely titled “*[Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say ‘Usually Not’](http://www.federalreserve.gov/econresdata/feds/2015/files/2015083pap.pdf)*” + +###### *“We attempt to replicate 67 papers published in 13 well-regarded economics journals using author-provided replication files that include both data and code.
… Aside from 6 papers that use confidential data, we obtain data and code replication files for 29 of 35 papers (83%) that are required to provide such files as a condition of publication, compared to 11 of 26 papers (42%) that are not required to provide data and code replication files. We successfully replicate the key qualitative result of 22 of 67 papers (33%) without contacting the authors. Excluding the 6 papers that use confidential data and the 2 papers that use software we do not possess, we replicate 29 of 59 papers (49%) with assistance from the authors. Because we are able to replicate less than half of the papers in our sample even with help from the authors, we assert that economics research is usually not replicable.”* + +###### I read this as confirmation that replicability must come from a widespread social norm, demand, not journal policies. The quest for rules and censorship reflects a world-view that once we get procedures in place, then everything published in a journal will be correct. Of course, once stated, you know how silly that is. Most of what gets published is wrong. Journals are for communication. They should be invitations to replication, not carved-in-stone truths. Yes, peer-review sorts out a lot of complete garbage, but the balance of type 1 and type 2 errors will remain. A few touchstones: *[Mitch Petersen](http://rfs.oxfordjournals.org/cgi/content/full/22/1/435)* tallied up all papers in the top finance journals for 2001–2004. Out of 207 panel data papers, 42% made no correction at all for cross-sectional correlation of the errors. This is a fundamental error that typically cuts standard errors by as much as a factor of 5 or more. If firm i had an unusually good year, it’s pretty likely firm j had a good year as well. Clearly, the empirical refereeing process is far from perfect, despite the endless rounds of revisions they typically ask for. (Nowadays the magic wand “cluster” is waved over the issue. Whether it’s being done right is a ripe topic for a similar investigation.) “*[Why Most Published Research Findings are False](http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124)*” by John Ioannidis. Medicine, but relevant. A link on the *[controversy on replicability in psychology](http://www.slate.com/articles/health_and_science/science/2014/07/replication_controversy_in_psychology_bullying_file_drawer_effect_blog_posts.html)*. There will be a *[workshop on replication and transparency in economic research](http://ineteconomics.org/community/events/replication-and-transparency-in-economic-research)* following the ASSA meetings in San Francisco. I anticipate an interesting exchange in the comments. I especially welcome more links to and summaries of existing writing on the subject. + +###### UPDATE + +###### *[On the need for a replication journal](https://ideas.repec.org/p/fip/fedlwp/2015-016.html)* by Christian Zimmermann + +###### *“There is very little replication of research in economics, particularly compared with other sciences. This paper argues that there is a dire need for studies that replicate research, that their scarcity is due to poor or negative rewards for replicators, and that this could be improved with a journal that exclusively publishes replication studies.
I then discuss how such a journal could be organized, in particular in the face of some negative rewards some replication studies may elicit.”* + +###### But why is that better than a dedicated “replication” section of the AER, especially if the AEA wants to encourage replication? I didn’t see an answer, though it may be a second best proposal given that the AER isn’t doing it. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/12/31/john-cochrane-secret-data/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/12/31/john-cochrane-secret-data/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/joop-hartog-yes-the-tide-is-turning.md b/content/replication-hub/blog/joop-hartog-yes-the-tide-is-turning.md new file mode 100644 index 00000000000..92a05da4e11 --- /dev/null +++ b/content/replication-hub/blog/joop-hartog-yes-the-tide-is-turning.md @@ -0,0 +1,41 @@ +--- +title: "JOOP HARTOG: Yes, the tide is turning!" +date: 2015-10-03 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Editor" + - "Joop Hartog" + - "Journals" + - "Labour Economics" +draft: false +type: blog +--- + +###### In 1981 I was appointed as a professor of economics at the University of Amsterdam. One of my colleagues was Joop Klant, professor of economic methodology. When he retired, in 1986, at the farewell dinner, he reminded us of his opportunity cost: we received a copy of his novel *De fiets* (The bicycle), a booklet that had brought him literary fame in his youth. He signed my copy with the encouragement: “Test, test, test, Hartog: never stop”. Yes, we shared a belief in the Popperian assignment and I have been trying to test theory whenever I could. In fact, the paper I am now working on is an empirical test of the theory that employers safeguard young graduates from wage reduction when their productivity turns out below standard (the present verdict is: reject!). + +###### In 1994, my friend Jules Theeuwes and I launched the journal *Labour Economics*.  We wanted a balanced mix of theory and empirical work and we opened a separate section for Replications, with a separate editor. As replication submissions were barely forthcoming (one paper in 3 years), we decided to beat the drum. We were firm believers. “The basic premise of econometric research is the existence of stable parameter values in equations that relate economic variables. Yet, we do not have a great deal of information about parameter values and their empirical distributions”, we wrote in 1997 as introduction to a set of invited papers on replication (*Labour Economics*, 4(2), 99). We guaranteed publication of replication studies, provided they would meet some mild conditions (aim to replicate key findings of an original article in a leading journal, and contain no methodological flaws). + +###### To our regret, we never got a single submission. We understood the reason quite well. Replication does not lead to academic prestige. There are some famous cases of replications  that  failed to reproduce the original findings (Harberger’s tax on capital, the *Journal of  Money, Credit and Banking Project*, see our *Labour Economics* introduction), but basically, the profession and in particular journal editors, were not interested. But the times, they are a’changing. + +###### Interest in the reliability and credibility of empirical work has been mounting. 
For nine years, I have been a member of LOWI, a national board for research integrity founded by Dutch universities and The Netherlands Academy of Sciences. It’s a board of appeal on university rulings on research integrity. When we started, a decade ago, we got a few cases annually. In 2014 the board got 24 cases, generally much more serious than initially, with heavy impact on the accused. In The Netherlands we experienced some spectacular cases of data fraud (not in economics though) and awareness of the often very shaky basis of econometric results has strongly increased. The dangers of an emphasis on originality, on new methods, new models, new approaches, rather than on the painstaking patient search for reliable, reproducible results are now clearly appreciated. Data must be made easily available. I remember a phrase that struck me long ago: “Often, a researcher’s mind is more fruitful than his database”. Data transparency and a new attitude should change that. + +###### Some time ago, it occurred to me that professional organisations or journals should create a replication archive. I mentioned that, visiting Waikato University, to Jacques Poot, who told me: my dear friend, that exists! An excellent initiative. It proves again that the new day starts in New Zealand: that’s where the sun first rises. + +###### Here’s evidence on my sincerity: + +1. ###### Arulampalam, J. Hartog, T. MaCurdy and J. Theeuwes (1997), Replication and re-analysis, *Labour Economics*, 4(2), pp. 99-105 +2. ###### Cabral Vieira, L. Diaz Serrano, J. Hartog and E. Plug (2003), Risk compensation in wages: a replication, *Empirical Economics*, 28, pp. 639-647 +3. ###### Mazza, H. van Ophem and J. Hartog (2013), Unobserved heterogeneity and risk in wage variance: Does more schooling reduce earnings risk? *Labour Economics*, 24, 323-338 +4. ###### Budria, L. Diaz Serrano, A. Ferrer and J. Hartog (2013), Risk Attitude and Wage Growth, Replicating Shaw (1996), *Empirical Economics*, 44 (2), 981-1004 + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/10/03/joop-hartog-yes-the-tide-is-turning/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/10/03/joop-hartog-yes-the-tide-is-turning/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/karabag-berggren-misconduct-and-marginality-in-management-business-and-economics-journals.md new file mode 100644 index 00000000000..3c249b3437a --- /dev/null +++ b/content/replication-hub/blog/karabag-berggren-misconduct-and-marginality-in-management-business-and-economics-journals.md @@ -0,0 +1,62 @@ +--- +title: "KARABAG & BERGGREN: Misconduct and Marginality in Management, Business, and Economics Journals" +date: 2016-09-24 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Journals" + - "Misconduct" + - "retraction" + - "salami-slicing" +draft: false +type: blog +--- + +###### The problems of publication misconduct – manipulation, fabrication and plagiarism – and other dodgy practices such as salami-style publications are attracting increasing attention.
In the newly published paper “Misconduct, Marginality, and Editorial Practices in Management, Business, and Economics” (full text ***[available here](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0159492)***), we present findings on these problems in MBE-journals and the diffusion of editorial practices to combat them (Karabag and Berggren, 2016). + +###### The data were collected by bibliometric searches of retracted papers in the seven major databases that cover almost all MBE journals and from two surveys to editors of these journals; the first to 60 journals with at least one public retraction, then to all journals investigated in the bibliometric study. A total of 298 journals editors answered the second survey. + +###### The bibliometric study identified a strongly increasing trend of retraction, from 1 retraction in 2005 to 60 in 2015. Compared to the number of published papers, the figure is still very small, but since there are strong disincentives for editors to engage in retractions, the reported number is only the tip of the misconduct iceberg. As for the problem of marginality, more than half of the editors reported experiences of salami publications, which can be defined as the slicing of output into the smallest publishable unit. + +###### The survey also enquired about specific practices to deal with these problems. We found that 42% of the journals have started to use software to detect possible plagiarism, while 30% ask authors to provide a data file. Only 6% ask the author(s) to provide information on each author’s specific roles. Many editors stated that they rely on the reviewers to detect and prevent misconduct. Despite this importance of the reviewers, less than 40% of the journals have public rewards for good reviewers, and less than half add good reviewers to the advisory board. + +###### Only 10% of the journals publish replication studies, and the exact meaning of these answers may be doubted, in view of other studies which show a much lower percentage publishing replications.  According to Duvendack et al. 2015, for example, only 3% of the studied journals actually do publish replication papers. + +###### The discrepancy between the findings may be explained by social desirability. Editors might think or believe that it is socially desirable to publish replication studies and/or they may plan to publish them. Our survey had a free space for comments, but the editors did not comment on replication studies. Our interpretation is that editors are still not particularly interested in replication studies, even though many affirm the theoretical importance of replication for building theory and to disclose manipulation. + +###### The paper also presents positive ideas to combat marginality by means of supporting more creative contributions. 98 journal editors provided a rich menu of ideas. We classified them into 14 sub-themes, grouped under four major themes. The first major theme, “editorial vision and engaged board”, included suggestions that editors should open up the journal and take risks, be visionary, have engaged editorial board members and change the editorial teams regularly. The second theme, “curate papers and connect authors,” comprised three sub-themes: curating and developing manuscripts, connecting authors, and constructive screening.  The third major theme, labeled “open up the discussion”   contained suggestions such as publish criticism instead of rejecting the papers, invite comments and involve industry specialists. 
The fourth theme, “go beyond the mainstream,” involved ideas on mixing disciplines, limiting individual authors and avoiding perfection. + +###### Overall the findings indicate a broad editorial engagement with misconduct and marginality; however, we remain concerned with the amount of undetected misconduct. Several authors have argued that there is a knowledge gap in MBE due to the small number of replications and papers testing previously presented models (cf. Duvendack et al., 2015; Kacmar and Whitfield, 2000). + +###### Two recent cases show how valuable replication studies are. In 2010, Reinhart and Rogoff, based on a comparative sample of countries, alleged that public debt beyond a very specific level would stifle growth during periods of crisis. This conclusion was widely cited both within and outside the academic community, but when a graduate student at the University of Massachusetts Amherst tried to replicate the study, he uncovered serious data flaws in the original paper which undermined both its conclusions and the theoretical assumption. The critical study was published (Herndon et al., 2014), but not in the American Economic Review, which had carried the original paper. + +###### In the management field, Lepore (2014) re-studied the famous disk drive industry cases in Christensen’s (1997) theory on disruptive innovation. Extending the time period, the re-study found a very different pattern to the one suggested by Christensen. This finding calls for a re-assessment of the theoretical framework of disruptive innovation (Bergek et al., 2013). + +###### The crucial point here is that none of these important and critical replication papers were published in the journals which published the original papers, nor did these journals actively encourage submission of other replication studies. Public retractions of published papers will always be a means of last resort for a journal. Therefore it is so important to encourage close reading, reflection and replications! + +###### **REFERENCES** + +###### Bergek, A., Berggren, C., Magnusson, T. and Hobday, M., 2013. Technological discontinuities and the challenge for incumbent firms: Destruction, disruption or creative accumulation? *Research Policy*, 42(6), pp.1210-1224. + +###### Christensen, C. M., 1997. The innovator’s dilemma: When new technologies cause great firms to fail, Boston: Harvard Business School Press. + +###### Duvendack, M., Palmer-Jones, R.W. and Reed, W.R., 2015. Replications in Economics: A progress report. *Econ Journal Watch*, *12*(2), pp.164-191. + +###### Herndon, T., Ash, M. and Pollin, R., 2014. Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. *Cambridge Journal of Economics*, 38, pp. 257-279. + +###### Kacmar, K.M. and Whitfield, J.M., 2000. An additional rating method for journal articles in the field of management. *Organizational Research Methods*, *3*(4), pp.392-406. + +###### Karabag, S. F., & Berggren, C., 2016. Misconduct, Marginality and Editorial Practices in Management, Business and Economics Journals. *PLoS ONE*, 11(7), pp. 1-25, e0159492. DOI: 10.1371/journal.pone.0159492 + +###### Lepore, J., 2014. The disruption machine. *The New Yorker*, June 23. + +###### Reinhart, C. M. and Rogoff, K.S., 2010. Growth in a Time of Debt. *American Economic Review*, 100, pp. 573-578.
+ +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/09/24/karabag-berggren-misconduct-and-marginality-in-management-business-and-economics-journals/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/09/24/karabag-berggren-misconduct-and-marginality-in-management-business-and-economics-journals/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/kim-robinson-the-problem-isn-t-just-the-p-value-it-s-also-the-point-null-hypothesis.md b/content/replication-hub/blog/kim-robinson-the-problem-isn-t-just-the-p-value-it-s-also-the-point-null-hypothesis.md new file mode 100644 index 00000000000..c655515d4fe --- /dev/null +++ b/content/replication-hub/blog/kim-robinson-the-problem-isn-t-just-the-p-value-it-s-also-the-point-null-hypothesis.md @@ -0,0 +1,91 @@ +--- +title: "KIM & ROBINSON: The problem isn’t just the p-value, it’s also the point-null hypothesis!" +date: 2019-06-07 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Halloween effect" + - "Interval-based hypothesis testing" + - "Minimum effect tests" + - "non-central t distribution" + - "null hypothesis significance testing" + - "p-values" + - "Point null hypothesis" +draft: false +type: blog +--- + +###### In Frequentist statistical inference, the *p-*value is used as a measure of how incompatible the data are with the null hypothesis.  When the null hypothesis is fixed at a point, the test statistic reports a distance from the sample statistic to this point. A low (high) *p*-value means that this distance is large (small), relative to the sampling variability. Hence, the *p*-value also reports this distance, but in a different scale, namely the scale of probability. + +###### In a recent article published in *Econometrics (“****[Interval-Based Hypothesis Testing and Its Applications to Economics and Finance”)](https://www.mdpi.com/2225-1146/7/2/21)***, we highlight the problems with the *p-*value criterion in concert with point null hypotheses.  We propose that researchers move to interval-based hypothesis testing instead.  While such a proposal is not new (Hodges and Lehmann, 1954) and such tests are being used in biostatistics and psychology (Welleck, 2010), it is virtually unknown to economics and business disciplines. In what follows, we highlight problems with the *p*-value criterion and show how they can be overcome by adopting interval-based hypothesis testing. + +###### The first problem concerns economic significance; namely, whether the distance between the sample statistic and the point under the null has any economic importance. The *p*-value has nothing to say about this: it only reports whether the distance is large relative to the variability — not the relevance of the distance. It is certainly possible to have statistically significant results that are economically or operationally meaningless. + +###### The second problem is that the implied critical value of the test changes very little with sample size, while the test statistic generally increases with sample size. This is because the sampling distribution does not change, or does not change much. Stated differently, the distribution from which one obtains the critical value is (nearly) the same, regardless of how large or small sample size (or statistical power) is. 
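###### As a quick numerical illustration of this second point, the sketch below (Python with scipy; this is not code from the article, and the one-sided 5% level is an illustrative assumption) prints the implied critical value for several sample sizes. It barely moves, even though the degrees of freedom change by orders of magnitude.

```python
# Sketch: the 5% one-sided critical value from the central t-distribution
# is essentially flat in the sample size.
from scipy import stats

for n in (10, 1000, 5000):
    crit = stats.t.ppf(0.95, df=n - 1)  # cutoff implied by the point null
    print(f"n = {n:5d}  ->  5% critical value = {crit:.3f}")
# prints roughly 1.833, 1.646, 1.645
```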
###### And last but not least, since the population parameter can never be exactly equal to the null value under the point-null hypothesis, the sampling distribution under the null hypothesis is never realized or observed in reality. As a result, when a researcher calculates a *t*-statistic, it is almost certain that she obtains the value from a distribution under the alternative (a non-central *t*-distribution), and not from the distribution under the null (a central *t*-distribution). This can cause the critical value from the central distribution to be misleading, especially when the sample size is large. + +###### Consider a simple *t*-test for *H0: θ = 0*, where *θ* represents the population mean. Assuming a random sample from a normal distribution with unknown standard deviation *σ*, the test statistic is *t = √n X̄/s*, where *X̄* is the sample mean and *s* denotes the sample standard deviation. The test statistic follows a Student-*t* distribution with (*n-1*) degrees of freedom and non-centrality parameter *λ = √n θ/σ*, denoted as *t*(*n-1; λ*). Under *H0*, the *t*-statistic follows a central *t*-distribution with *λ = 0*. + +###### In reality, the value of *θ* cannot exactly and literally be *0*. As DeLong and Lang (1992) point out, “almost all tested null hypotheses in economics are false”. The consequence is that, with observational data, a *t*-statistic is (almost) always generated from a non-central *t*-distribution, not from a central one. + +###### Suppose the value of *θ* were in fact *0.1* and that this value implies no economic or practical importance. That is, *H0* practically holds. Let us assume for simplicity that *σ=1*. When *n = 10*, the *t*-statistic is in fact generated from *t*(*9; λ=0.316*), not from *t*(*9; λ=0*). When the sample size is small — for example, *10* — the two distributions are very close, so the *p*-value can be a reasonable indicator for the evidence against *H0*. When the sample size is larger, say *1000*, the *t*-statistic is generated from *t*(*999; λ=3.16*). When it is *5000*, it is generated from *t*(*4999; λ=7.07*). Under the point-null hypothesis, the distribution is fixed at *t*(*n-1; λ=0*). At this large sample size, every *t*-statistic is larger than the critical value at a conventional level; and every *p*-value is virtually equal to 0. Hence, although economically insignificant, a sample estimate of *θ=0.1* will be statistically significant with a large *t*-statistic and a near-zero *p*-value. + +###### This situation is illustrated in Figure 1, where the black curve plots the central *t*-distribution; and red and blue curves show non-central distributions respectively for *λ=0.316* and *λ=7.07*. The blue curve is essentially a normal distribution, but for the purpose of illustration, we maintain it as a *t*-distribution with *λ > 0*. The point-null hypothesis fixes the sampling distribution at the black curve which is *t*(*n-1; λ=0*), so the 5% critical value does not change (no more than 1.645) regardless of sample size. + +![TRN1(20190607)](/replication-network-blog/trn120190607.webp) + +###### The consequence is that, when the value of *λ* is as large as *7.07* with a large sample size, the null hypothesis is almost always rejected with the *p*-value virtually 0, even though *θ=0.1* is economically negligible. The problem may not be serious when the sample size is small, but it is when the sample size is large. We now show how adopting an interval hypothesis allows one to overcome this problem.
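###### Before turning to the interval approach, the short sketch below (Python with scipy’s non-central *t*; *θ* = 0.1 and *σ* = 1 as in the example above, with a one-sided 5% test assumed) reproduces these numbers: the non-centrality parameter, the near-zero *p*-value attached to the expected *t*-statistic, and the probability of rejecting the point null even though the effect is economically negligible.

```python
# Sketch of the point-null problem described above (theta = 0.1, sigma = 1).
# The observed t-statistic is effectively a draw from a non-central t with
# lambda = sqrt(n) * theta / sigma, while the cutoff stays at the central-t value.
import numpy as np
from scipy import stats

theta, sigma = 0.1, 1.0
for n in (10, 1000, 5000):
    lam = np.sqrt(n) * theta / sigma               # non-centrality: 0.316, 3.16, 7.07
    crit = stats.t.ppf(0.95, df=n - 1)             # fixed cutoff under the point null
    p_at_lam = stats.t.sf(lam, df=n - 1)           # p-value of a t-statistic equal to lambda
    reject = stats.nct.sf(crit, df=n - 1, nc=lam)  # chance the point null is rejected
    print(f"n={n:5d}  lambda={lam:5.2f}  p-value at expected t={p_at_lam:.2e}  "
          f"P(reject theta=0)={reject:.3f}")
```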
###### Consider *H0: 0 < θ ≤ 0.5* against *H1: θ > 0.5*. Let the value of *θ = 0.5* be the minimum value of economic importance. Under the null hypothesis, the mean is economically negligible or practically no different from 0, while it makes a difference economically under *H1*. This class of interval-based tests is called minimum-effect tests. + +###### The decision rule is to reject *H0* at the 5% level if the *t*-statistic is greater than the critical value from the *t*(*n-1, λ = 0.5√n*) distribution. That is, the critical value increases with the sample size. If this distribution were the blue curve in Figure 1, the corresponding 5% critical value would be 8.72, indicated by the cut-off value corresponding to the red-shaded area. + +###### In conducting an interval-based test, choosing the interval of economic significance is crucial for the credibility of the test. It should be set by the researcher, based on economic analysis or value judgment, desirably with a consensus from other researchers and ideally before she observes the data. + +###### With such an interval hypothesis, a clear statement is made on the economic significance of the parameter and it is taken into account for decision-making. In addition, the critical value and the sampling distribution of the test change with sample size; and the *p*-value is not necessarily a decreasing function of sample size. + +###### As a practical illustration, we analyze the Halloween effect (Bouman and Jacobsen, 2002), where it is claimed that stock returns are consistently higher during the period from November to April. They fit a simple regression model of the form + +###### *Rt = γ0 + γ1 Dt + ut*, + +###### where *Rt* is stock return in percentage and *Dt* is a dummy variable which takes *1* from November to April; and *0* otherwise. + +###### Using monthly data for a large number of stock markets around the world, Bouman and Jacobsen (2002; Table 1) report positive and statistically significant values of *γ1*. For the U.S. market using 344 monthly observations, they report that the estimated value of *γ1* is 0.96 with a *t*-statistic of 1.95. + +###### Using daily data from 1950 to 2016 (16,819 observations), we replicate their results with an estimated value of *γ1=0.05* and a *t*-statistic of 3.44. For the point-null hypothesis *H0: γ1 = 0*; *H1: γ1 > 0*, we reject *H0* at the 5% level of significance for both monthly and daily data, with *p*-values of 0.026 and 0.0003 respectively. The Halloween effect is statistically clear and strong, especially when the sample size is larger. + +###### Now we conduct minimum-effect tests. We assume that the stock return should be at least 1% higher per month during the period of November to April to be considered economically significant. This value is conservative considering trading costs and the volatility of the market. + +###### The corresponding null and alternative hypotheses are *H0: 0 < γ1 ≤ 1*; *H1: γ1 > 1*. The 5% critical value of this test is 3.66 obtained from *t*(*342, λ=2.06*). On a daily basis, it is equivalent to *H0: 0 < γ1 ≤ 0.05*; *H1: γ1 > 0.05*, assuming 20 trading days per month. The 5% critical value of this test is 5.28 obtained from *t*(*16,817, λ=3.60*). For both cases, the null hypothesis of no economic significance cannot be rejected at the 5% level of significance. That is, the Halloween effect is found to be economically negligible with interval-based hypothesis testing. + +###### In Figure 2, we present the Box-Whisker plot of the daily returns against *D*.
It appears that there are a lot more outliers during the non-Halloween period, but the median and the quartile values are nearly identical for the two periods. + +![TRN2(20190607)](/replication-network-blog/trn220190607.webp) + +###### This plot provides further evidence that the Halloween effect is negligible, apart from these outliers. It is likely that the effect size estimates of the above Halloween regressions are over-stated by ignoring these outliers. This application highlights the problems of the point-null hypothesis and demonstrates how interval hypothesis testing can overcome them. + +###### As Rao and Lovric (2016) argue, the paradigm of point-null hypothesis testing is no longer viable in the era of big data. Now is the time to adopt a new paradigm for statistical decision-making. In our article, we demonstrate that testing for an interval-null hypothesis can be a way forward. + +###### *Jae (Paul) Kim is a Professor of Finance in the Department of Finance at La Trobe University. Andrew Robinson is Director of the Centre of Excellence for Biosecurity Risk Analysis and Associate Professor in the School of Mathematics and Statistics, both at the University of Melbourne. Comments and/or questions about this blog can be directed to Professor Kim at J.Kim@latrobe.edu.au.* + +###### **References** + +###### Bouman, S. Jacobsen, B. 2002. The Halloween Indicator, “Sell in May and Go Away”: Another Puzzle,  *American Economic Review*, 92(5), 1618-1635. + +###### DeLong, J.B. and K. Lang, 1992, Are All Economic Hypotheses False? *Journal of Political Economy*, Vol. 100, No. 6, pp. 1257-72. + +###### Hodges, J. L. Jr. and E.L. Lehmann 1954, Testing the Approximate Validity of Statistical Hypotheses, *Journal of the Royal Statistical Society, Series B (Methodological)*, Vol. 16, No. 2, pp. 261–268. + +###### Rao, C. R. and Lovric, M. M., 2016, Testing Point Null Hypothesis of a Normal Mean and the Truth: 21st Century Perspective, *Journal of Modern Applied Statistical Methods*, 15 (2), 2–21. + +###### Wellek, S., 2010, *Testing Statistical Hypotheses of Equivalence and Noninferiority*, 2nd edition, CRC Press, New York. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/06/07/kim-robinson-the-problem-isnt-just-the-p-value-its-also-the-point-null-hypothesis/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/06/07/kim-robinson-the-problem-isnt-just-the-p-value-its-also-the-point-null-hypothesis/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/krawczyk-good-news-bad-news-on-who-gets-the-data.md b/content/replication-hub/blog/krawczyk-good-news-bad-news-on-who-gets-the-data.md new file mode 100644 index 00000000000..60e385c2990 --- /dev/null +++ b/content/replication-hub/blog/krawczyk-good-news-bad-news-on-who-gets-the-data.md @@ -0,0 +1,38 @@ +--- +title: "KRAWCZYK: Good News/Bad News on Who Gets the Data" +date: 2016-10-09 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "compliance rates" + - "economics" + - "experimental economics" + - "Sharing data" +draft: false +type: blog +--- + +###### Providing access to the data is a prerequisite for replication of empirical analysis. Unfortunately, this access is not always granted to everyone (see ***[here](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0021101)***, and ***[here](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4533904/)***). 
There is evidence that some of this may be due to concerns about requestors’ qualifications (see ***[here](https://books.google.ch/books?hl=en&lr=&id=ijUrAAAAYAAJ&oi=fnd&pg=PR1&dq=Fienberg+and+Martin+1985&ots=bG1rHqkTvk&sig=xxkl267n6UZbMBQc1X3o8ASc7PQ#v=onepage&q=Fienberg%20and%20Martin%201985&f=false)***). + +###### In two recent papers, we investigated how willingness to share depended on the identity of the requestor. In a paper entitled **[*“(Un)available upon request: field experiment on researchers’ willingness to share supplementary materials”*](https://www.ncbi.nlm.nih.gov/pubmed/22686633)**, Ernesto Reuben and I found that only 44% of economists were willing to share supplementary research materials of a published study they had promised to send “upon request”. This number was slightly higher if the request came from a highly prestigious university rather than one that was less prestigious. + +###### In “**[*Gender, beauty and support networks in academia: evidence from a field experiment*](http://econpapers.repec.org/paper/warwpaper/2015-43.htm)** (Study 1)”, Magdalena Smyk and I asked experimental economists to share data they used in their published paper. The focus in this experiment was the role of gender: half the requests were sent by a female student, the other half by a male. The overall compliance rate was 34.7%, with no gender effect. + +###### The fact that the requestor’s identity matters little is good news of course, as it suggests a roughly level playing field. However, while compliance rates in economics are higher than in many other disciplines (see ***[here](https://orgtheory.wordpress.com/2015/08/11/sociologists-need-to-be-better-at-replication-a-guest-post-by-cristobal-young/#comment-140734)*** and ***[here](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0007078)*** and ***[here](http://www.cell.com/current-biology/abstract/S0960-9822(13)01400-0)***), there is still much room for improvement. + +###### *Michal Krawczyk is an Assistant Professor of Economic Sciences at the University of Warsaw, Poland.* + +###### REFERENCES + +###### Krawczyk, M., & Reuben, E. (2012). (Un)Available upon Request: Field Experiment on Researchers’ Willingness to Share Supplementary Materials. Accountability in Research, 19(3), 175-186. + +###### Krawczyk, M., & Smyk, M. (2015). Gender, beauty and support networks in academia: evidence from a field experiment (No. 2015-43). + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/10/09/krawczyk-good-newsbad-news-on-who-gets-the-data/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/10/09/krawczyk-good-newsbad-news-on-who-gets-the-data/?share=facebook) + +Like Loading...
\ No newline at end of file diff --git a/content/replication-hub/blog/lakens-examining-the-lack-of-a-meaningful-effect-using-equivalence-tests.md b/content/replication-hub/blog/lakens-examining-the-lack-of-a-meaningful-effect-using-equivalence-tests.md new file mode 100644 index 00000000000..49c63a3d315 --- /dev/null +++ b/content/replication-hub/blog/lakens-examining-the-lack-of-a-meaningful-effect-using-equivalence-tests.md @@ -0,0 +1,50 @@ +--- +title: "LAKENS: Examining the Lack of a Meaningful Effect Using Equivalence Tests" +date: 2017-05-01 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Daniel Lakens" + - "equivalence tests" + - "replication" + - "TOST" +draft: false +type: blog +--- + +###### When we perform a study, we would like to conclude there is an effect, when there is an effect. But it is just as important to be able to conclude there is no effect, when there is no effect. So how can we conclude there is no effect? Traditional null-hypothesis significance tests won’t be of any help here. Concluding that there is no effect when you observe p > 0.05 is a common but erroneous interpretation of p-values. + +###### One solution is *equivalence testing*. In an equivalence test, you statistically test whether the observed effect is smaller than anything you care about. One commonly used approach is the two one-sided tests (TOST) procedure (Schuirmann, 1987). Instead of rejecting the null-hypothesis that the true effect size is zero, as we traditionally do in a statistical test, the null-hypothesis in the TOST procedure is that there *is* an effect. + +###### For example, when examining a correlation, we might want to reject an effect as large as, or larger than, a medium effect in either direction (r = 0.3 or r = -0.3). In the TOST approach, you would test whether the observed correlation is significantly smaller than r = 0.3, and test whether the observed correlation is significantly larger than r = -0.3. If both these tests are statistically significant (or, because these are one-sided tests, when the 90% confidence interval around our correlation does not include the equivalence bounds of -0.3 and 0.3) we can conclude the effect is ‘statistically equivalent’. Even if the effect is not exactly 0, we can reject the hypothesis that the true effect is large enough to care about. + +###### Setting the equivalence bounds requires that you take a moment to think about which effect size you expect, and which effect sizes you would still consider support for your theory, or which effects are large enough to matter in practice. Specifying the effect you expect, or the smallest effect size you are still interested in, is good scientific practice, as it makes your hypothesis *falsifiable*. If you don’t specify a smallest effect size that is still interesting, it is impossible to falsify your hypothesis (if only because there are not enough people in the world to examine effects of r = 0.0000001). + +###### Furthermore, when you specify which effects are too small to matter, it is possible to find an effect is both significantly different from zero, and significantly smaller than anything you care about. In other words, the finding lacks ‘practical significance’, solving another common problem with overreliance on traditional significance tests.
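###### For readers who want to see the mechanics, here is a minimal Python sketch of the TOST logic for the correlation example above. The inputs (an observed r of 0.06 from n = 250, bounds of -0.3 and 0.3) are hypothetical illustrations; this is a sketch of the general two one-sided tests idea, not the TOSTER package itself.

```python
# Minimal TOST sketch for a correlation, via the Fisher z transformation.
# The observed r, n, bounds, and alpha below are hypothetical illustrations.
import numpy as np
from scipy import stats

def tost_correlation(r_obs, n, bound=0.3, alpha=0.05):
    se = 1.0 / np.sqrt(n - 3)                      # SE of the Fisher-z correlation
    z_obs = np.arctanh(r_obs)                      # Fisher z of the observed r
    z_low, z_upp = np.arctanh(-bound), np.arctanh(bound)
    p_lower = 1 - stats.norm.cdf((z_obs - z_low) / se)   # one-sided test of H0: r <= -bound
    p_upper = stats.norm.cdf((z_obs - z_upp) / se)       # one-sided test of H0: r >= +bound
    p_tost = max(p_lower, p_upper)                 # both one-sided tests must reject
    z_crit = stats.norm.ppf(1 - alpha)             # gives the 90% CI when alpha = 0.05
    ci = np.tanh([z_obs - z_crit * se, z_obs + z_crit * se])
    return p_tost, ci

p_tost, ci90 = tost_correlation(r_obs=0.06, n=250)
print(f"TOST p = {p_tost:.3f}, 90% CI = [{ci90[0]:.2f}, {ci90[1]:.2f}]")
# If p_tost < .05 (equivalently, the 90% CI lies entirely between -0.3 and 0.3),
# the observed correlation is statistically equivalent to zero for these bounds.
```

###### The point of the sketch is that the test is nothing more than two familiar one-sided tests against the bounds; the spreadsheet and R package mentioned below wrap this logic in ready-made tools.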
You don’t have to determine the equivalence bounds for every other researcher – you can specify which effect sizes you would still find worthwhile to examine, perhaps based on the resources (e.g., the number of participants) you have available. + +###### You can use equivalence tests in addition to null-hypothesis significance tests. This means there are now four possible outcomes of your data analysis, and these four cases are illustrated in the figure below (adapted from Lakens, 2017). A mean difference of Cohen’s d = 0.5 (either positive or negative) is specified as a smallest effect size of interest in an independent t-test (see the vertical dashed lines at -0.5 and 0.5). Data is collected, and one of four possible outcomes is observed (squares are the observed effect size, thick lines the 90% CI, and thin lines the 95% CI). + +![lakens](/replication-network-blog/lakens.webp) + +###### We can conclude statistical equivalence if we find the pattern indicated by A: The *p*-value from the traditional NHST is not significant (p > 0.05), and the p-value for the equivalence test is significant (p ≤ 0.05). However, if the p-value for the equivalence test is also > 0.05, the outcome matches pattern D, and we can not reject an effect of 0, nor an effect that is large enough to care about. We thus remain undecided. Using equivalence tests, we can also observe pattern C: An effect is statistically significant, but also smaller than anything we care about, or equivalent to null (indicating the effect lacks practical significance). We can also conclude the effect is significant, and that the possibility that the effect is large enough to matter can not be rejected, under pattern B, which means we can reject the null, and the effect might be large enough to care about. + +###### Testing for equivalence is just as simple as performing the normal statistical tests you already use today. You don’t have to learn any new statistical theory. Given how easy it is to use equivalence tests, and how much they improve your statistical inferences, it is surprising how little they are used, but I’m confident that will change in the future. + +###### To make equivalence tests for *t*-tests (one-sample, independent, and dependent), correlations, and meta-analyses more accessible, I’ve created an easy to use [spreadsheet](https://osf.io/qzjaj/), and an R package (‘[***TOSTER***’](https://cran.r-project.org/web/packages/TOSTER/index.html), available from CRAN), and incorporated equivalence test as a module in the free software ***[jamovi](https://www.jamovi.org/)***. Using these spreadsheets, you can perform equivalence tests either by setting the equivalence bound to an effect size (e.g., d = 0.5, or r = 0.3) or to raw bounds (e.g., a mean difference of 200 seconds). Extending your statistical toolkit with equivalence tests is an easy way to improve your statistical and theoretical inferences. + +###### *Daniël Lakens is an Assistant Professor in Applied Cognitive Psychology at the Eindhoven University of Technology  in the Netherlands.  He blogs at **[The 20% Statistician](http://daniellakens.blogspot.co.nz/)** and can be contacted at D.Lakens@tue.nl.* + +###### **REFERENCES** + +###### Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science. DOI: 10.1177/1948550617697177 ****** + +###### Schuirmann, Donald J. (1987). 
A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. *Journal of Pharmacokinetics and Pharmacodynamics* 15(6): 657-680. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/05/01/lakens-replicators-dont-do-post-hoc-power-analyses-do-equivalence-testing/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/05/01/lakens-replicators-dont-do-post-hoc-power-analyses-do-equivalence-testing/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/lampach-morawetz-a-primer-on-how-to-replicate-propensity-score-matching-studies.md b/content/replication-hub/blog/lampach-morawetz-a-primer-on-how-to-replicate-propensity-score-matching-studies.md new file mode 100644 index 00000000000..dc295165465 --- /dev/null +++ b/content/replication-hub/blog/lampach-morawetz-a-primer-on-how-to-replicate-propensity-score-matching-studies.md @@ -0,0 +1,81 @@ +--- +title: "LAMPACH & MORAWETZ: A Primer on How to Replicate Propensity Score Matching Studies" +date: 2016-08-05 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Lampach" + - "Morawetz" + - "propensity score matching" + - "PSM" + - "replication" +draft: false +type: blog +--- + +###### Propensity Score Matching (PSM) approaches have become increasingly popular in empirical economics. These methods are intuitively appealing.  PSM procedures are available in well-known software packages such as R or Stata. + +###### The fundamental idea behind PSM is that treated observations are compared with un-treated control observations only if the two observations are otherwise identical. This frees the researcher from having to specify a functional form explicitly relating outcomes to control variables.  In its place, PSM requires a matching algorithm. The choice of the best matching algorithm is, however an ongoing debate. + +###### A quick tour through the literature is provided by a pair of articles from Dehejia and Wahba (2002, 1999), along with replications by Smith and Todd (2005) and Diamond and Sekhon (2013), which highlights different matching approaches.  As these studies use data from a randomized controlled trial (RCT) by LaLonde (1986), they provide an illuminating comparison between different variants of PSM. Other comparisons using data RCTs can be found in Peikes et al. (2008) and Wilde and Hollister (2007).  Huber et al. (2013) uses Monte Carlo experiments to compare the performance of different matching methods. + +###### Two studies that replicate PSM research are Duvendack and Palmer (2012) and our own recent study, Lampach and Morawetz (2016).  Both reach a similar conclusion: the key issue is identification. Without the appropriate research design, matching will be misleading. + +###### A good replication needs to do more than just check if the results are robust to an alternative matching algorithm. + +###### How can one determine whether a given research design is appropriate? We find Chapter 1.2 “Cochran’s Basic Advice” in the classic book by Paul Rosenbaum (2010) helpful. 
He distinguishes between “Better observational studies” and “Poorer observational studies” by stressing the importance of four main points: + +###### — Clearly defined treatments (including the starting point of a treatment), covariates and outcomes + +###### — The treatment should be close to random + +###### — Good comparability of treatment and control observations + +###### — Explicit testing of plausible alternatives explanations for the measured effect. + +###### Starting from here, researchers will also find helpful the guidelines by Caliendo and Kopeinig (2008) and Imbens (2015). + +###### Researchers interested in replicating PSM studies may also find helpful our recent paper in *Applied Economics* (Lampach and Morawetz, 2016).  We provide a step-by-step guide for how to undertake a PSM study in the context of a replication by following Caliendo and Kopeinig (2008).  PSM studies are particularly rewarding studies to replicate because they incorporate many decisions during the process of implementing the research (even given an appropriate research design).  A replication of PSM studies will be illuminating both because it allows one to better appreciate the many decisions that must be made, and because it allows one to determine the robustness of the results to alternative choices in research design. + +###### We learned a lot from our replication experience and are grateful to the authors of the original work to provide us with data and code, the authors who wrote the useful guidelines, the journal which made it possible to publish the article, and finally to the organizers of *The Replication Network* for inviting us to write this blog. + +###### REFERENCES: + +###### Caliendo, M., Kopeinig, S., 2008. Some Practical Guidance for the Implementation of Propensity Score Matching. J. Econ. Surv. 22, 31–72. doi:10.1111/j.1467-6419.2007.00527.x + +###### Chemin, M., 2008. The Benefits and Costs of Microfinance: Evidence from Bangladesh. J. Dev. Stud. 44, 463–484. doi:10.1080/00220380701846735 + +###### Dehejia, R.H., Wahba, S., 2002. Propensity Score-Matching Methods for Nonexperimental Causal Studies. Rev. Econ. Stat. 84, 151–161. doi:10.1162/003465302317331982 + +###### Dehejia, R.H., Wahba, S., 1999. Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. J. Am. Stat. Assoc. 94, 1053–1062. doi:10.1080/01621459.1999.10473858 + +###### Diamond, A., Sekhon, J.S., 2013. Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies. Rev. Econ. Stat. 95, 932–945. doi:10.1162/REST\_a\_00318 + +###### Duvendack, M., Palmer-Jones, R., 2012. High Noon for Microfinance Impact Evaluations: Re-investigating the Evidence from Bangladesh. J. Dev. Stud. 48, 1864–1880. doi:10.1080/00220388.2011.646989 + +###### Huber, M., Lechner, M., Wunsch, C., 2013. The performance of estimators based on the propensity score. J. Econom. 175, 1–21. doi:10.1016/j.jeconom.2012.11.006 + +###### Imbens, G.W., 2015. Matching Methods in Practice: Three Examples. J. Hum. Resour. 50, 373–419. doi:10.3368/jhr.50.2.373 + +###### Jena, P.R., Chichaibelu, B.B., Stellmacher, T., Grote, U., 2012. The impact of coffee certification on small-scale producers’ livelihoods: a case study from the Jimma Zone, Ethiopia. Agric. Econ. 43, 429–440. doi:10.1111/j.1574-0862.2012.00594.x + +###### LaLonde, R.J., 1986. Evaluating the Econometric Evaluations of Training Programs with Experimental Data. Am. Econ. Rev. 76, 604–620. 
+ +###### Lampach, N., Morawetz, U.B., 2016. Credibility of propensity score matching estimates. An example from Fair Trade certification of coffee producers. Appl. Econ. 48, 4227–4237. doi:10.1080/00036846.2016.1153795 + +###### Peikes, D.N., Moreno, L., Orzol, S.M., 2008. Propensity Score Matching. Am. Stat. 62, 222–231. doi:10.1198/000313008X332016 + +###### Rosenbaum, P.R., 2010. Design of observational studies. Springer, New York. + +###### Smith, J.A., Todd, P.E., 2005. Does matching overcome LaLonde’s critique of nonexperimental estimators? J. Econom., Experimental and non-experimental evaluation of economic policy and models 125, 305–353. doi:10.1016/j.jeconom.2004.04.011 + +###### Wilde, E.T., Hollister, R., 2007. How close is close enough? Evaluating propensity score matching using data from a class size reduction experiment. J. Policy Anal. Manage. 26, 455–477. doi:10.1002/pam.20262 + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/08/05/lampach-morawetz-a-primer-on-how-to-replicate-propensity-score-matching-studies/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/08/05/lampach-morawetz-a-primer-on-how-to-replicate-propensity-score-matching-studies/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/lebel-curate-science-2017-year-in-review-and-upcoming-plans-for-2018.md b/content/replication-hub/blog/lebel-curate-science-2017-year-in-review-and-upcoming-plans-for-2018.md new file mode 100644 index 00000000000..feaffe49f73 --- /dev/null +++ b/content/replication-hub/blog/lebel-curate-science-2017-year-in-review-and-upcoming-plans-for-2018.md @@ -0,0 +1,55 @@ +--- +title: "LEBEL: Curate Science – 2017 Year in Review and Upcoming Plans for 2018" +date: 2017-12-21 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "CurateScience.org" + - "Etienne LeBel" + - "Psychology" + - "replications" + - "WORD" +draft: false +type: blog +--- + +###### Curate Science (***[CurateScience.org](http://curatescience.org/)***) is an online platform to track, organize, and interpret replications of published findings in the social sciences, with a current focus on the psychology literature. + +###### We had a very productive year in 2017. Here are some highlights of our accomplishments: + +###### – With N=1,008 replications, we became (to our knowledge) the world’s largest ***[database of curated replications](http://curatescience.org/#replications-section)*** in the social sciences, covering all replications from the *Reproducibility Project: Psychology*, *Many Labs 1* and *3*, the *Social Psychology special issue*, and *Registered Replication Reports* 1 through 6). + +###### – Several new major features, most important one being a new searchable (and sortable) table of curated replications. One can search by topic, effect, keyword, method used and can sort by sample size and effect size (for both original and replication studies), and many more fields. Curated study characteristics include links to PDFs, open/public data, open/public materials, pre-registered/registered protocols, IVs, DVs, replication type, replication differences, replication active sample evidence, and links to a replication’s associated ***[evidence collection](http://curatescience.org/#evidence-collections-section)*** (when available). 
+ +###### – Several important feature improvements (e.g., an improved replication taxonomy and replication outcome categories, each with improved diagrams; improved articulation of our goals and value of curating and tracking replications; see our ***[about section](http://curatescience.org/#about-section)***) + +###### – New and expanded ***[curation framework outlined in a manuscript](https://osf.io/preprints/psyarxiv/uwmr8)*** submitted to *Advances in Methods and Practices in Psychological Science* (which received a “revise & resubmit” on November 13, 2017) + +###### – New partnerships with not-for-profit organizations ***[Meta-Lab](https://meta-lab.co/)*** and ***[IGDORE](https://igdore.org/)*** (see ***[announcement](https://us8.campaign-archive.com/?u=0833383918fc50773891d363a&id=b79663294f)***) + +###### – New collaboration with the ***[Psychological Science Accelerator](https://psysciacc.wordpress.com/)*** (see ***[announcement](https://twitter.com/PsySciAcc/status/940656885421748225)***) + +###### – Submitted/involved in two large grants (outcome to be known in January-February 2018) and we have initiated several new public and not-for-profit grant application opportunities. + +###### – And much more (see ***[here](https://us8.campaign-archive.com/home/?u=0833383918fc50773891d363a&id=aaad5734e3)*** for a list of all of our 2017 announcements; go ***[here](http://curatescience.org/#signup-section)*** to sign up to receive our newsletter)! + +###### Upcoming plans for 2018: + +###### – Continue seeking additional grants to expand our curation capacities (paid curators) and implement our next round of new major features (next point). + +###### – Finalize designs and implement next round of major features currently in development: (1) meta-analyze selected replications, (2) enhanced visualization of complex designs, (3) curating and visualizing multiple outcomes, and (4) public crowdsourcing and replication alerts (see our ***[current developments section](http://curatescience.org/#current-developments-section)*** for more details). + +###### – Continue development of our two main future directions: (1) analytic reproducibility endorsements and (2) curate and search open/public components for any study (not just replications; see our ***[future directions section](http://curatescience.org/#future-directions-section)*** for more details). + +###### It’s been a great year for Curate Science, here’s to an even better one in 2018! + +###### ***[Etienne LeBel](http://etiennelebel.com/)****is an independent meta-scientist affiliated with the University of Western Ontario. Dr. LeBel was awarded a 2015 Leamer-Rosenthal Prize in the Emerging Researcher category for his leadership founding and directing **[Curate Science](http://curatescience.org/)**, an online platform to curate, track, and interpret replications of published findings in the social sciences and **[PsychDisclosure.org](http://psychdisclosure.org/)**, a grassroots transparency initiative that contributed to raising reporting standards at leading journals in psychology.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/12/21/lebel-curate-science-2017-year-in-review-and-upcoming-plans-for-2018/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/12/21/lebel-curate-science-2017-year-in-review-and-upcoming-plans-for-2018/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/maren-duvendack-replications-in-economics-a-progress-report.md b/content/replication-hub/blog/maren-duvendack-replications-in-economics-a-progress-report.md new file mode 100644 index 00000000000..7f32c4c8432 --- /dev/null +++ b/content/replication-hub/blog/maren-duvendack-replications-in-economics-a-progress-report.md @@ -0,0 +1,33 @@ +--- +title: "MAREN DUVENDACK: Replications in Economics: A Progress Report" +date: 2014-12-09 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "economics" + - "maren duvendack" + - "replications" +draft: false +type: blog +--- + +###### Economists aspire to adopt practices of the natural sciences, where replication is seen as a crucial and necessary activity. In the natural sciences, results are seldom given much credibility unless they have been replicated. Replication contributes in a crucial way to quality. So it is rather strange that replication has traditionally not been seen as a rewarding activity among economists. + +###### Nevertheless, replication has been an ongoing concern among economists; manifest perhaps most prominently in the [American Economic Review’s](https://www.aeaweb.org/aer/data.php "AER data policy") requirement that datasets are made available to other researchers prior to accepting an article for publication. This practice has been praised by other branches of the social sciences and there have been various movements to encourage replication (see blogroll on this website). + +###### Given these developments, we felt it was time to revisit the progress that has been made in the area of replication in economics. Our paper reports the findings of a survey which was administered to the editors of 333 economics journals. It also provides an [analysis of 162 replication studies](http://replicationnetwork.wordpress.com/reed_et_al_2014/ "Duvendack, Palmer-Jones & Reed (2014)") that have been published in peer-reviewed economics journals from 1977-2014. + +###### We find that the publication of replication studies has been slow and that few journals publish them, though this is thankfully changing now. It is not unusual to find replication studies that do not confirm the results of the original studies. This might explain why journals are slow to publish these, as they are afraid of the potentially contentious nature of exchanges between replicator and replicatee that might follow. + +###### However, in times of transparency, accountability and easy accessibility of data sources, the economics profession cannot afford to continue to neglect the area of replication. Recent cases such as [Reinhart and Rogoff](http://www.theatlantic.com/business/archive/2013/04/forget-excel-this-was-reinhart-and-rogoffs-biggest-mistake/275088/ "Reinhart and Rogoff") or [Piketty](http://www.ft.com/cms/s/2/e1f343ca-e281-11e3-89fd-00144feabdc0.html#axzz39aqj3dMn "Piketty") have received a lot of media attention, naming and shaming leading academics as well as putting pressure on the profession as a whole to account for their work. + +###### We hope that our paper can lead to renewed discussions of replication in economics, especially among editors of economics journals. 
+ +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2014/12/09/replications-in-economics-a-progress-report/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2014/12/09/replications-in-economics-a-progress-report/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/maren-duvendack-what-are-registered-replication-reports.md b/content/replication-hub/blog/maren-duvendack-what-are-registered-replication-reports.md new file mode 100644 index 00000000000..ce287c84602 --- /dev/null +++ b/content/replication-hub/blog/maren-duvendack-what-are-registered-replication-reports.md @@ -0,0 +1,47 @@ +--- +title: "MAREN DUVENDACK: What are Registered Replication Reports?" +date: 2016-08-30 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "3ie" + - "BITSS" + - "Brian Nosek" + - "Comparative Political Studies" + - "Daniel Simons" + - "Gottingen University" + - "Institute for New Economic Thinking" + - "Perspectives on Psychological Science" + - "Registered Replication Reports" + - "Registering Protocols" + - "Replication in Economics" + - "Reproducibility Initiative" + - "Results-Free" +draft: false +type: blog +--- + +###### Academia has been abuzz in recent years with new initiatives focusing on research transparency, replication and reproducibility of research. Notable in this regard are the ***[Berkeley Initiative for Transparency in the Social Sciences](http://www.bitss.org/)***, and the ***[Reproducibility Initiative](http://validation.scienceexchange.com/#/)*** which PLOS and Science Exchange are involved, but there are many others. Psychology and political science have had a number of new initiatives that are shaking up the scientific research and publication process.  In economics, there are laudable endeavors by The Institute for New Economic Thinking, which funds the ***[“Replication in Economics” project](http://replication.uni-goettingen.de/wiki/index.php/Main_Page)*** at Gottingen University; and 3ie, which initiated a ***[replication initiative](http://www.3ieimpact.org/en/evaluation/impact-evaluation-replication-programme/)*** that includes funding for replication studies.  And, of course, there is ***[The Replication Network](https://replicationnetwork.com/)***, which started a little over a year ago. + +###### In this blog I would like to highlight a particular initiative that is concerned with the distorted incentive structure of the academic peer-review process. It is well known that the scientific literature rewards novel, ground-breaking findings that are sometimes at odds with how the scientific research process works. Novel findings are exciting, but we can only judge the true effects of something if we amass evidence from a variety of sources and these might not always be novel or exciting. + +###### This is where the idea of registered reports, and relatedly, registered replication reports. The way the registered report models works is very simple: Researchers submit a report setting out the research questions and proposed methodology before embarking on any data collection and analysis. This report is peer-reviewed to ensure certain quality criteria are met.  Once the submission is accepted, publication in the journal where it was accepted is almost guaranteed, assuming researchers have followed through with their registered methodology. 
+ +###### This initiative is the brain child of Alex Holcombe, Bobbie Spellman and Daniel Simons and was started in 2013 in collaboration with the journal **[*Perspectives on Psychological Science*](http://pps.sagepub.com/content/9/5/556.full)**.  The ***[first registered replication report](http://pps.sagepub.com/content/9/5/556.full)*** was published in 2014.  The ***[Center for Open Science](https://cos.io/pr/2014-05-19/)*** has actively promoted registered reports. According to Daniel Simons, Professor of Psychology at the University of Illinois, “Registered reports eliminate the bias against negative results in publishing because the results are not known at the time of review”.  Adds Chris Chambers, chair of the COS-associated Registered Reports Committee, “Because the study is accepted in advance, the incentives for authors change from producing the most beautiful story to producing the most accurate one.”  The idea of registered reports has quickly gained much traction.  Brian Nosek, Professor of Psychology at the University of Virginia, is now piloting registered reports with over ***[20 journals](https://osf.io/8mpji/wiki/home/)***. + +###### A related initiative is that of “results-free” reviewing (RFR) where studies are reviewed without reviewers knowing the results of the analysis. The journal *Comparative Political Studies* recently published a ***[special issue](https://replicationnetwork.com/2016/08/16/comparative-political-studies-tries-results-free-submissions/)*** that featured a pilot study of RFR. + +###### The move towards registered replication reports somewhat mirrors that of registered trials in the medical sciences where trials are registered before embarking on the study to minimise reporting biases, enhance transparency and accountability ([see here](https://clinicaltrials.gov/)). 3ie has established a similar registry with the aim to register international development impact evaluations ([see here](http://www.3ieimpact.org/en/evaluation/ridie/)). + +###### All these initiatives are important in the quest for more research transparency.  The medical sciences, psychology, and political science have been at the forefront of these efforts.  It would be good to see similar initiatives in economics. + +###### *Maren Duvendack is a lecturer in development economics at the University of East Anglia and co-organizer of The Replication Network.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/08/30/maren-duvendack-what-are-registered-replication-reports/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/08/30/maren-duvendack-what-are-registered-replication-reports/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/mcmillan-cogent-economics-finance-is-now-publishing-replications.md b/content/replication-hub/blog/mcmillan-cogent-economics-finance-is-now-publishing-replications.md new file mode 100644 index 00000000000..14a5217e3ce --- /dev/null +++ b/content/replication-hub/blog/mcmillan-cogent-economics-finance-is-now-publishing-replications.md @@ -0,0 +1,43 @@ +--- +title: "MCMILLAN: Cogent Economics & Finance is Now Publishing Replications" +date: 2018-02-13 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Cogent Economics and Finance" + - "Journal policies" + - "Journals" + - "Open access" + - "Replication section" +draft: false +type: blog +--- + +###### As of the start of 2018, the journal *Cogent Economics and Finance* is introducing a replication section. *Cogent Economics and Finance* is an open access journal publishing high-quality, peer-reviewed research. It is indexed in Scopus, Web of Science’s Emerging Sources Citation Index (ESCI), and has a B rating in the Australian Business Deans Council (ABDC) ranking. You can read more about the journal ***[here](https://www.cogentoa.com/journal/economics-and-finance/about)***. + +###### As an online journal, it has the advantage of no page restrictions. This makes it advantageous for publishing replication studies, as many traditional journals are reluctant to publish these given scarce hard-print journal space. + +###### But why introduce a replication section? Traditional journals in economics and finance are quick to dismiss replication studies. The question that every paper faces as it enters the review process is the ‘So what…?’ question – what is the contribution of this particular paper, and is that contribution sufficient to merit publication? Papers are rejected because the contribution is not sufficiently big, the paper is not novel, or the results are similar to those reported elsewhere. Replication papers, therefore, do not stand a chance in this environment. Many papers will be consigned to lie in a bottom drawer, perhaps only given air in a classroom to compare results from published work with those from updated data. + +###### Why are replication studies important? Of course, papers that introduce new ideas, new econometric methodologies and new data sets are important. But so too are replication papers! Replication studies refer to those that replicate a previous piece of research but generally under a different situation, e.g., with different data or over a different time period. These studies help determine if the key findings from the original study can indeed be applied to other situations. + +###### Replication studies are important as they essentially perform a check on work in order to verify the previous findings and to make sure, for example, they are not specific to one set of data or circumstance. Hence, replication ensures that reported results are valid and reliable, are generalisable and can provide a sound base for future research. Replication studies thus provide robustness to the findings of research work and the interactions that they report. This matters as research can form the foundation of public policy, of regulatory acts and of corporate behaviour. New ideas formed in research today end up in the textbooks of tomorrow and are taught to future generations. It is therefore important that such research and ideas are fully validated. + +###### Why aren’t traditional journals more open to publishing replications? 
Even a brief look at the aims and scope of a range of journals in economics and finance (and no doubt beyond) reveals the words ‘original’, ‘new’, ‘meaningful insights’, ‘impact’, ‘innovative’. All of these are, of course, laudable and desired but equally set a very high bar for replication studies, which may then encounter difficulty in finding an appropriate outlet. + +###### Journals in economics and finance are also caught up in a journal ranking race: lists compiled by the Australian Business Deans Council, the Chartered Association of Business Schools, and the FT determine the quality of journals, which affects the submission choices of authors as well as their promotion and job prospects. A journal that seeks to promote replication studies may find that such an approach does not help in these journal rankings, which are often determined by perception of quality and whether that journal publishes ground-breaking work. + +###### The view taken by *Cogent Economics and Finance* is that it recognises the importance of replication studies and now seeks research papers that focus on replication and whose ultimate acceptance depends on the accuracy and thoroughness of the work rather than seeking a ‘new’ result. We believe such replications should not merely repeat existing work but extend it, for example through application to updated data sets, and provide a comparison with the previously published work. It is not just pushing buttons on a computer software package, but involves a research-focused process, with all the academic rigour that entails. + +###### We hope this will foster a greater appreciation of replication studies and their importance, a stronger culture of verification, validity and robustness checking, and an encouragement to authors to engage with such work. + +###### *David McMillan is Professor of Finance at the University of Stirling and a Senior Editor of Cogent Economics and Finance. He can be contacted at [david.mcmillan@stir.ac.uk](mailto:david.mcmillan@stir.ac.uk).* + 

### Share this: 

* [Click to share on X (Opens in new window) 
  X](https://replicationnetwork.com/2018/02/13/mcmillan-cogent-economics-finance-is-now-publishing-replications/?share=twitter) 
* [Click to share on Facebook (Opens in new window) 
  Facebook](https://replicationnetwork.com/2018/02/13/mcmillan-cogent-economics-finance-is-now-publishing-replications/?share=facebook) 

Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/mcshane-gal-statistical-significance-and-dichotomous-thinking-among-economists.md b/content/replication-hub/blog/mcshane-gal-statistical-significance-and-dichotomous-thinking-among-economists.md new file mode 100644 index 00000000000..28b3178645d --- /dev/null +++ b/content/replication-hub/blog/mcshane-gal-statistical-significance-and-dichotomous-thinking-among-economists.md @@ -0,0 +1,84 @@ +--- +title: "MCSHANE & GAL: Statistical Significance and Dichotomous Thinking Among Economists" +date: 2017-11-06 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "American Statistical Association" + - "null hypothesis significance testing" + - "p-value" + - "Statistical practice" +draft: false +type: blog +--- + +###### *[Note: This blog is based on our articles **[“Blinding Us to the Obvious? 
The Effect of Statistical Training on the Evaluation of Evidence”](http://www.blakemcshane.com/Papers/mgmtsci_pvalue.pdf)** (Management Science, 2016) and **[“Statistical Significance and the Dichotomization of Evidence”](http://www.tandfonline.com/doi/full/10.1080/01621459.2017.1289846)** (Journal of the American Statistical Association, 2017).]* + +###### ***Introduction*** + +###### The null hypothesis significance testing (NHST) paradigm is the dominant statistical paradigm in the biomedical and social sciences. A key feature of the paradigm is the dichotomization of results into the different categories “statistically significant” and “not statistically significant” depending on whether the p-value is, respectively, below or above the size alpha of the test, where alpha is conventionally set to 0.05. Although prior research has oft criticized this dichotomization for, *inter alia*, having “no ontological basis” (Rosnow and Rosenthal, 1989) and the arbitrariness of the 0.05 cutoff value, the impact of this dichotomization on the judgments and decision making of academic researchers has received relatively little attention. + +###### Our articles examine this question. We find that the dichotomization intrinsic to the NHST paradigm leads expert researchers from a variety of fields (including medicine, epidemiology, cognitive science, psychology, business, economics, and even statistics) to make errors in reasoning. In particular, when presented with a hypothetical study summary with a p-value experimentally manipulated to be either above or below the 0.05 threshold for statistical significance, we show: + +###### [1] Academic researchers interpret evidence dichotomously primarily based on whether the p-value is below or above 0.05. + +###### [2] They fixate on whether a p-value reaches the threshold for statistical significance even when p-values are irrelevant (e.g., when asked about descriptive statistics). + +###### [3] These findings apply to likelihood judgments about what might happen to future subjects as well as to choices made based on the data. + +###### [4] Researchers’ judgments reflect a tendency to ignore effect size. + +###### We briefly review these findings with a focus, given the audience of this blog, on our results for economists. + +###### ***Study 1: Descriptive Statements*** + +###### In our first series of studies, the hypothetical study summary described a clinical trial of two treatments where the outcome of interest was the number of months lived by the patients (average of 8.2 and 7.5 months for treatments A and B respectively). Our subjects were asked a multiple choice question about whether the number of months lived by those who received treatment A was greater, less, or no different than the number of months lived by those who received treatment B or whether it could not be determined. + +![FIGURE1](/replication-network-blog/figure1.webp) + +###### The correct answer is, of course, that the average number of post-diagnosis months lived by the patients who received treatment A was greater than that lived by the patients who received treatment B (i.e., 8.2 > 7.5) regardless of the p-value. However, as illustrated in Figure 1, subjects were much more likely to answer the question correctly when the p-value in the question was set to 0.01 than to 0.27. Similar results held for researchers in psychology, business, and, to a lesser extent, statistics. 
+ +###### ***Study 2: Likelihood Judgments and Choices*** + +###### In our second series of studies, the hypothetical study summary described a clinical trial of two drugs where the outcome of interest was whether or not patients recovered from a disease (e.g., recovery rate of 52% and 44% for Drugs A and B respectively). Our subjects were asked two multiple choice questions: first, a likelihood judgment question about whether a hypothetical patient would be more likely, less likely, or equally likely to recover if given Drug A versus Drug B or whether it could not be determined, and, second, a choice question asking, if they were a patient, whether they would prefer to take Drug A, Drug B, or were indifferent. + +###### The issue at variance in both the likelihood judgment question and choice question is fundamentally a predictive one: they both ask about the relative likelihood of a new patient recovering if given Drug A rather than Drug B. This in turn clearly depends on whether or not Drug A is more effective than Drug B. The p-value is of course one measure of the strength of the evidence regarding the likelihood that it is. However, the level of the p-value does not alter the “correct” response option for either question: the correct answer is option A as Drug A is more likely to be more effective than Drug B (under the non-informative prior encouraged by the question wording this probability is one minus half the two-sided p-value). + +###### FIGURE2 + +###### As illustrated in Figure 2, the proportion of subjects who chose Drug A for either question dropped sharply once the p-value rose above 0.05 but it was relatively stable thereafter and the magnitude of the treatment difference had no substantial impact on the results. However, the effect of statistical significance was attenuated for the choice question, consistent with the notion that making matters more personally consequential shifts the focus away from concerns about statistical significance and towards whether an option is superior. Similar results held for researchers in cognitive science, psychology, and, to a lesser extent, statistics. + +###### FIGURE3 + +###### We repeated similar studies on economists. As illustrated in Figure 3, similar results held. However, as illustrated in Figure 3e, the effect is attenuated when the researchers were presented with not only a p-value but also with a posterior probability based on a non- informative prior. This is interesting because, objectively, the posterior probability is a redundant piece of information: as noted above, under a non-informative prior it is one minus half the two-sided p-value. + +###### ***Conclusion*** + +###### Researchers from a wide variety of fields, including both statistics and economics, interpret p-values dichotomously depending upon whether or not they fall below the hallowed 0.05 threshold. This is in direct contravention of the third principal of the recent American Statistical Association *Statement on Statistical Significance and p- values* (Wasserstein and Lazar, 2016)—“Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold”—as well as countless other similar warnings. + +###### What can be done? Our suggestions are not particularly new or original. We should emphasize that evidence, particularly that based on p-values and other purely statistical measures, lies on a continuum. 
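###### As a small numerical illustration of this point, the Python sketch below uses hypothetical recovery counts (chosen to roughly match the 52% versus 44% rates described in Study 2, at several assumed sample sizes) and reports both the two-sided p-value from a simple two-proportion z-test and the corresponding posterior probability that Drug A is more effective, using the one-minus-half-the-two-sided-p-value relation noted above. The evidence shifts smoothly with the sample size; nothing special happens at p = 0.05.

```python
# Illustration only: two-proportion z-test (normal approximation) and the
# corresponding posterior probability under a non-informative prior.
# The recovery counts and sample sizes below are hypothetical.
from math import sqrt
from scipy import stats

def evidence_summary(recover_a, n_a, recover_b, n_b):
    p_a, p_b = recover_a / n_a, recover_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_a - p_b) / se
    p_two_sided = 2 * stats.norm.sf(abs(z))    # two-sided p-value
    prob_a_better = 1 - p_two_sided / 2        # posterior P(A more effective)
    return p_two_sided, prob_a_better

for n in (100, 250, 1000):                     # hypothetical patients per arm
    p, post = evidence_summary(round(0.52 * n), n, round(0.44 * n), n)
    print(f"n per arm = {n:4d}:  p = {p:.3f},  P(A better) = {post:.3f}")
```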
We would go further and say that, in many cases, it does not make sense to calibrate scientific evidence as a function of the p-value, given that this statistic is defined relative to the generally uninteresting and implausible null hypothesis of zero effect and zero systematic error (McShane et al., 2017). + +###### We suggest looking beyond purely statistical considerations and taking a more holistic and integrative view of evidence that includes prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. Most importantly, perhaps, we should move away from dichotomous or categorical reasoning whether in the form of NHST or otherwise. + +###### *Blakeley B. McShane is an associate professor at the Kellogg School of Management, Northwestern University. David Gal is a professor at the University of Chicago at Illinois College of Business Administration. Correspondence regarding this blog post can be directed to either or both at* [*b-mcshane@kellogg.northwestern.edu*](mailto:b-mcshane@kellogg.northwestern.edu) *and* [*dgaluic@gmail.com*](mailto:dgaluic@gmail.com) *respectively.* + +###### ***References*** + +###### [1] McShane, B.B., and Gal, D. (2016), “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence.” *Management Science*,62(6), 1707-1718. + +###### [2] McShane, B.B. and Gal, D. (2017), “Statistical Significance and the Dichotomization of Evidence.” *Journal of the American Statistical Association*, 112(519), 885-895. + +###### [3] Rosnow RL, Rosenthal R (1989) Statistical procedures and the justification of knowledge in psychological science. Amer. Psychologist 44:1276–1284. + +###### [4] Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA’s statement on p-values: context, process, and purpose,” The American Statistician, 70(2), 129–133. + +###### [5] McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2017). Abandon statistical significance. arXiv preprint arXiv:1709.07588. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/11/06/mcshane-gal-statistical-significance-and-dichotomous-thinking-among-economists/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/11/06/mcshane-gal-statistical-significance-and-dichotomous-thinking-among-economists/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/menclova-is-it-time-for-a-journal-of-insignificant-results.md b/content/replication-hub/blog/menclova-is-it-time-for-a-journal-of-insignificant-results.md new file mode 100644 index 00000000000..d5fec239502 --- /dev/null +++ b/content/replication-hub/blog/menclova-is-it-time-for-a-journal-of-insignificant-results.md @@ -0,0 +1,77 @@ +--- +title: "MENCLOVA: Is it Time for a Journal of Insignificant Results?" +date: 2017-03-24 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Insignificant results" + - "Journals" + - "negative results" +draft: false +type: blog +--- + +###### It is well known that there is a bias towards publication of statistically significant results. In fact, we have known this for at least 25 years since the publication of De Long and Lang (JPE 1992): + +###### *“Economics articles are sprinkled with very low t-statistics – marginal significance levels very close to one – on nuisance coefficients. 
[…] Very low t-statistics appear to be systematically absent – and therefore null hypotheses are overwhelmingly false – only when the universe of null hypotheses considered is the central themes of published economics articles. This suggests, to us, a publication bias explanation of our findings.” (pp. 1269-1270)* + +###### While statistically insignificant results are less “sexy”, they are often not less important. Failure to reject the null hypothesis can be interesting in itself, is a valuable data point in meta-analyses, or can indicate to future researchers where they are unlikely to find an effect. As McCloskey (2002) famously puts it: + +###### *“[…] statistical significance is neither necessary nor sufficient for a result to be scientifically significant.” (p. 54)* + +###### This problem is not unique to Economics but several other disciplines have moved faster than us to try and address it. For example, the following disciplines already have journals dedicated to publishing “insignificant” results: + +###### Psychology: **[*Journal of Articles in Support of the Null Hypothesis*](http://www.jasnh.com/)** + +###### Biomedicine: ***[Journal of Negative Results in Biomedicine](http://www.jnrbm.com/))*** + +###### Ecology and Evolutionary Biology: ***[Journal of Negative Results](http://www.jnr-eeb.org/index.php/jnr/index)*** + +###### Is it time for Economics to catch up? I suggest it is and I know that I am not alone in this view. In fact, a number of prominent Economists have endorsed this idea (even if they are not ready to pioneer the initiative). So, imagine… a call for papers along the following lines: + +###### **Series of Unsurprising Results in Economics (SURE)** + +###### Is the topic of your paper interesting, your analysis carefully done, but your results are not “sexy”? If so, please consider submitting your paper to SURE. An e-journal of high-quality research with “unsurprising” findings. + +###### How does it work: + +###### — We accept papers from all fields of Economics… + +###### — Which have been rejected at a journal indexed in EconLit… + +###### — With the ONLY important reason being that their results are statistically insignificant or otherwise “unsurprising”. + +###### To document that your paper meets the above eligibility criteria, please send us all referee reports and letters from the editor from the journal where your paper has been rejected.  Two independent referees will read these reports along with your paper and evaluate whether they indicate that: 1. the paper is of high quality and 2. the only important reason for rejection was the insignificant/unsurprising nature of the results.  Submission implies that you (the authors) give permission to the SURE editor to contact the editor of the rejecting journal regarding your manuscript. + +###### SURE benefits writers by: + +###### — Providing an outlet for interesting, high-quality, but “risky” (in terms of uncertain results) research projects; + +###### — Decreasing incentives to data-mine, change theories and hypotheses ex post, exclusively focus on provocative topics. + +###### SURE benefits readers by: + +###### — Mitigating the publication bias and thus complementing other journals in an effort to provide a complete account of the state of affairs; + +###### — Serving as a repository of potential (and tentative) “dead ends” in Economics research. + +###### Feedback is definitely invited! 
Please submit your comments here or email me at *[andrea.menclova@canterbury.ac.nz](mailto:andrea.menclova@canterbury.ac.nz)*. + +###### *Andrea Menclova is a Senior Lecturer at the University of Canterbury in New Zealand.* + +###### + +###### REFERENCES: + +###### De Long J. Bradford and Kevin Lang. 1992. “Are all Economic Hypotheses False?” *Journal of Political Economy*, 100:6, pp.1257-1272 + +###### McCloskey, Deirdre. 2002. *The Secret Sins of Economics*. Prickly Paradigm Press, Chicago. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/03/24/menclova-is-it-time-for-a-journal-of-insignificant-results/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/03/24/menclova-is-it-time-for-a-journal-of-insignificant-results/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/menclova-sure-journal-is-now-open-for-submissions.md b/content/replication-hub/blog/menclova-sure-journal-is-now-open-for-submissions.md new file mode 100644 index 00000000000..e6a92df9242 --- /dev/null +++ b/content/replication-hub/blog/menclova-sure-journal-is-now-open-for-submissions.md @@ -0,0 +1,55 @@ +--- +title: "MENCLOVA: SURE Journal Is Now Open For Submissions!" +date: 2018-07-19 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Andrea Menclova" + - "Journal policies" + - "Series of Unsurprising Results in Economics" + - "Statistical insignificance" + - "SURE" +draft: false +type: blog +--- + +###### Is the topic of your paper interesting, your data appropriate and your analysis carefully done – but your results are not “sexy”? If so, please consider submitting your paper to the Series of Unsurprising Results in Economics. SURE is an e-journal of high-quality research with “unsurprising”/confirmatory findings. + +###### This is how it works: + +###### – We accept papers from all fields of Economics… + +###### – Which have been rejected at a journal indexed in EconLit… + +###### – With the ONLY important reason being that their results are statistically insignificant or otherwise “unsurprising”. + +###### SURE is an open-access journal and there are no submission charges. + +###### SURE benefits readers by: + +###### – Mitigating the publication bias and thus complementing other journals in an effort to provide a complete account of the state of affairs; + +###### – Serving as a repository of potential (and tentative) “dead ends” in Economics research. + +###### SURE benefits writers by: + +###### – Providing an outlet for interesting, high-quality, but “risky” (in terms of uncertain results) research projects; + +###### – Decreasing incentives to data-mine, change theories and hypotheses ex post or exclusively focus on provocative topics. + +###### We hope you will consider SURE as an outlet for your work and look forward to hearing from you! + +###### To learn more about SURE, ***[click here](http://surejournal.org/)***. + +###### SURE Editorial board + +###### Karen S. Conway (University of New Hampshire), Hope Corman (Rider University), John Gibson (University of Waikato), David Giles (University of Victoria), John Landon-Lane (Rutgers University), Nicholas Mangee (Georgia Southern University), Andrea K. Menclova (University of Canterbury), W. Robert Reed (University of Canterbury), Steven Stillman (Free University of Bozen-Bolzano), Edinaldo Tebaldi (World Bank), Robert S. 
Woodward (University of New Hampshire) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/07/19/menclova-sure-journal-is-now-open-for-submissions/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/07/19/menclova-sure-journal-is-now-open-for-submissions/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/michele-nuijten-the-replication-paradox.md b/content/replication-hub/blog/michele-nuijten-the-replication-paradox.md new file mode 100644 index 00000000000..689b3000871 --- /dev/null +++ b/content/replication-hub/blog/michele-nuijten-the-replication-paradox.md @@ -0,0 +1,65 @@ +--- +title: "MICHELE NUIJTEN: The Replication Paradox" +date: 2016-01-05 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Psychology" + - "publication bias" + - "replication" +draft: false +type: blog +--- + +###### Lately, there has been a lot of attention for the excess of false positive and exaggerated findings in the published scientific literature. In many different fields there are reports of an impossibly high rate of statistically significant findings, and studies of meta-analyses in various fields have shown overwhelming evidence for overestimated effect sizes. + +###### The suggested solution for this excess of false postive findings and exaggerated effect size estimates in the literature is replication. The idea is that if we just keep replicating published studies, the truth will come to light eventually. + +###### This intuition also showed in a small survey I conducted among psychology students, social scientists, and quantitative psychologists. I offered them different hypothetical combinations of large and small published studies that were identical except for the sample size – they could be considered replications of each other. I asked them how they would evaluate this information if their goal was to obtain the most accurate estimate of a certain effect. In almost all of the situations I offered, the answer was almost unanimously: combine the information of both studies. + +###### This makes a lot of sense: the more information the better, right? Unfortunately this is not necessarily the case. + +###### The problem is that the respondents forgot to take into account the influence of publication bias: statistically significant results have a higher probability of being published than non-significant results. And only publishing significant effects leads to overestimated effect sizes in the literature. + +###### But wasn’t this exactly the reason to take replication studies into account? To solve this problem and obtain more accurate effect sizes? + +###### Unfortunately, there is evidence from multi-study papers and meta-analyses that replication studies suffer from the same publication bias as original studies (see below for references). This means that *both* types of studies in the literature contain overestimated effect sizes. + +###### The implication of this is that combining the results of an original study with those of a replication study could actually *worsen* the effect size estimate. This works as follows. + +###### Bias in published effect size estimates depends on two factors: publication bias and power (the probability that you will reject the null hypothesis, given that it is false). 
Studies with low power (usually due to a small sample size) contain a lot of noise, and the effect size estimate will be all over the place, ranging from severe underestimations to severe overestimations. + +###### This in itself is not necessarily a problem; if you would take the average of all these estimates (e.g., in a meta-analysis) you would end up with an accurate estimate of the effect. However, if because of publication bias only the significant studies are published, only the severe overestimations of the effect will end up in the literature. If you would calculate an average effect size based on these estimates, you will end up with an overestimation. + +###### Studies with high power do not have this problem. Their effect size estimates are much more precise: they will be centered more closely on the true effect size. Even when there is publication bias, and only the significant (maybe slightly overestimated) effects are published, the distortion would not be as large as with underpowered, noisier studies. + +###### Now consider again a replication scenario such as the one mentioned above. In the literature you come across a large original study and a smaller replication study. Assuming that both studies are affected by publication bias, the original study will probably have a somewhat overestimated effect size. However, since the replication study is smaller and has lower power, it will contain an effect size that is even more overestimated. Combining the information of these two studies then basically comes down to *adding bias* to the effect size estimate of the original study. In this scenario it would render a more accurate estimation of the effect if you would only evaluate the original study, and ignored the replication study. + +###### In short: even though a replication will increase *precision* of the effect size estimate (a smaller confidence interval around the effect size estimate), it will add *bias* if the sample size is smaller than the original study, but only if there is publication bias and the power is not high enough. + +###### There are two main solutions to the problem of overestimated effect sizes. + +###### The first solution would be to eliminate publication bias; if there is no selective publishing of significant effects, the whole “replication paradox” would disappear. One way to eliminate publication bias is to *preregister* your research plan and hypotheses *before* collecting the data. Some journals will even review this preregistration, and can give you an “in principle acceptance” – completely independent of the results. In this case, studies with significant and non-significant findings have an equal probability of being published, and published effect sizes will not be systematically overestimated.  Another way is for journals to commit to publishing replication results independent of whether the results are significant.  Indeed, this is the stated replication policy of some journals already. + +###### The second solution is to only evaluate (and perform) studies with high power. If a study has high power, the effect size estimate will be estimated more precisely and less affected by publication bias. Roughly speaking: if you discard all studies with low power, your effect size estimate will be more accurate. 
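###### A small simulation can make the paradox concrete. The sketch below is my own illustration, not code from the paper: the true effect (a standardized mean difference of 0.3), the sample sizes (100 vs. 25 per group), and the p < .05 publication filter are all assumed values chosen only to show the mechanism. It mimics publication bias by keeping only significant, positive results, and then compares the bias of the published original estimate alone with a sample-size-weighted (fixed-effect style) combination of the original and a smaller replication.

```python
# Illustrative sketch of the "replication paradox": under publication bias,
# pooling a large original study with a smaller replication can INCREASE bias.
# All numbers here are assumptions for illustration, not values from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
TRUE_D, N_ORIG, N_REP, N_KEEP = 0.3, 100, 25, 5000

def published_effects(n_per_group):
    """Simulate two-group studies and keep only significant, positive results
    (mimicking publication bias); return the published Cohen's d estimates."""
    kept = []
    while len(kept) < N_KEEP:
        a = rng.normal(TRUE_D, 1.0, n_per_group)
        b = rng.normal(0.0, 1.0, n_per_group)
        t, p = stats.ttest_ind(a, b)
        if t > 0 and p < 0.05:  # only "significant" studies get published
            pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
            kept.append((a.mean() - b.mean()) / pooled_sd)
    return np.array(kept)

d_orig = published_effects(N_ORIG)   # large original studies
d_rep = published_effects(N_REP)     # smaller, lower-powered replications
d_pooled = (N_ORIG * d_orig + N_REP * d_rep) / (N_ORIG + N_REP)  # rough fixed-effect combination

print(f"true effect:             {TRUE_D:.2f}")
print(f"original alone:          {d_orig.mean():.2f}")   # overestimated
print(f"original + replication:  {d_pooled.mean():.2f}") # even more overestimated
```

###### With these assumed numbers, the pooled estimate lands further from the true effect than the original alone, which is exactly the paradox described above; with a high-powered replication, or without the significance filter, the problem disappears.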
+ +###### A good example of an initiative that implements both solutions is the *[recently published](http://www.sciencemag.org/content/349/6251/aac4716)* Reproducibility Project, in which 100 psychological effects were replicated in studies that were preregistered and high powered. Initiatives such as this one eliminates systematic bias in the literature and advances the scientific system immensely. + +###### However, before preregistered, highly powered replications are the new standard, researchers that want to play it safe should change their intuition from “the more information, the higher the accuracy,” to “the more power, the higher the accuracy.” + +###### This blog is based on the paper “The replication paradox: Combining studies can decrease accuracy of effect size estimate” (2015) by Nuijten**,** van Assen, Veldkamp, Wicherts (2015). *Review of General Psychology,* *19*(2), 172-182*.* + +###### LITERATURE ON HOW REPLICATIONS SUFFER FROM PUBLICATION BIAS: + +###### Francis, G. (2012). Publication bias and the failure of replication in experimental psychology. Psychonomic Bulletin & Review, 19(6), 975-991. + +###### Ferguson, C. J., & Brannick, M. T. (2012). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods, 17, 120-128. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/01/05/michele-nuijten-the-replication-paradox/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/01/05/michele-nuijten-the-replication-paradox/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/miller-the-statistical-fundamentals-of-non-replicability.md b/content/replication-hub/blog/miller-the-statistical-fundamentals-of-non-replicability.md new file mode 100644 index 00000000000..329fbfb6fdf --- /dev/null +++ b/content/replication-hub/blog/miller-the-statistical-fundamentals-of-non-replicability.md @@ -0,0 +1,139 @@ +--- +title: "MILLER: The Statistical Fundamentals of (Non-)Replicability" +date: 2019-01-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "false positive rate" + - "Jeff Miller" + - "negative results" + - "null hypothesis significance testing" + - "Replicability" + - "replication crisis" + - "Replication probability" + - "Replication success" + - "Statistical power" +draft: false +type: blog +--- + +###### *“Replicability of findings is at the heart of any empirical science” (Asendorpf, Conner, De Fruyt, et al., 2013, p. 108)* + +###### The idea that scientific results should be reliably demonstrable under controlled circumstances has a special status in science.  In contrast to our high expectations for replicability, unfortunately, recent reports suggest that only about 36% (Open Science Collaboration, 2015) to 62% (Camerer, Dreber, Holzmeister, et al., 2018) of the results reported in various areas of science are actually reproducible. This is disturbing because researchers and lay persons alike tend to accept published findings as rock solid truth.  As Mark Twain reportedly put it, + +###### *“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”* + +###### Dismay over poor replicability is widespread, with 90% of surveyed researchers reporting that there is at least some replicability crisis in their fields (Baker, 2016). 
+ +###### The theme of this article is that it is important to understand the fundamental statistical issues underlying replicability. When pressed, most researchers will concede that replicability cannot be completely guaranteed when random variability affects study outcomes, as it does in all areas where replicability is a concern.  Due to random variability, there is always some probability of getting unrepresentative results in either the original study or its replication, either of which could produce a replication failure. + +###### If a successful replication is only probabilistic, what is its probability? I will show how easily-understood statistical models can be used to answer this question (for a more mathematically in-depth treatment, see Miller & Schwarz, 2011).  The results are illuminating for at least two reasons: + +###### 1) Knowing what replication probability should be expected on purely statistical grounds helps us calibrate the severity of the replicability problem. If we should expect 99% replication success, then the reported values of 36% to 62% indicate that something has gone seriously wrong. If we should only expect 50% replication success, though, then perhaps low replicability is just another part of the challenge of science. + +###### 2) If something has gone wrong, then seeing what kinds of things cause poor replicability would almost certainly help us find ways of addressing the problems. + +###### **What is replicability?** + +###### To construct a statistical model of replicability, it is first essential to define that term precisely, and many reasonable definitions are possible. I will illustrate the essential issues concerning replicability within the context of the standard hypothesis testing framework shown in Table 1. Parallel issues always arise—though sometimes under different names—in alternative frameworks (see “Concluding comments”). + +###### capture1*TABLE 1: Standard classification of researchers’ decisions within the hypothesis testing framework. Researchers test a null hypothesis (Ho) which is either true (to an adequate approximation) or false. At the conclusion of the study, either they decide that Ho is false and reject it—a “positive” result, or else they decide that Ho may be true and fail to reject it—a “negative” result.  Researchers will sometimes make incorrect decisions (e.g., false positives and false negatives), partly because their results are influenced by random variability.* + +###### The great majority of published studies report positive results within the framework of Table 1 (Fanelli, 2012), so it is common to define a successful replication simply as a positive result in study 1 (the initial published finding) followed by a positive result in study 2 (the replication study). + +###### **A model for replication probability** + +###### To understand the probability of a successful replication as just defined, it is useful to consider what happens across a large number of studies, as is illustrated in Figure 1. + +###### *capture2FIGURE 1. A model for computing the probability of replication. The values in red represent parameters describing the research area for which replication probability is to be computed, and the model illustrates how the replication probability and other values can be computed from these parameters.* + +###### Proceeding from left to right across the figure, the model starts with a large set of studies conducted by the researchers in the area for which replication probability is to be determined. 
In medical research, for example, these might be studies testing different drugs as possible treatments for various diseases. As shown in the next column of the figure, the null hypothesis is approximately true in some studies (i.e., the drug being tested has little or no effect), whereas it is false in others (i.e., the drug works well). In the particular numerical example of this figure, 10% of the studies tested false null hypotheses. + +###### The “Study 1” column of the figure illustrates the results of the initial 1,000 studies. For the 900 studies in which the null was true, there should be about 45 positive results—that is, false positives—based on the standard α=.05 cut-off for statistical significance[1]. For the 100 studies in which the null was false, the number of positive results—that is, true positives—depends on the level of statistical power. Assuming the power level of 60% shown here, there should be about 60 true positives.[2] + +###### It is illuminating—perhaps “alarming” would be a better word—to consider the implications of this model for the veracity of published findings in this research area. Given these values of the three parameters (i.e., α level, power, and base rate of false null hypotheses), publication of all positive findings would mean that 45/105 = 43% of the published findings would be false. + +###### That value, called the “rate of false positives”, is obviously incompatible with the common presumption that published findings represent rock solid truth, but it emerges inevitably from these parameter values.  Moreover, these parameter values are not outlandish; α=.05 is absolutely standard, statistical power=.60 is reasonable for many research areas, and so is a base rate of 10% false null hypotheses (e.g., 10% of drugs tested are effective). + +###### Returning to the issue of replicability, the “Study 2” column of the figure shows what happens when researchers try to replicate the findings from the 45+60=105 positive outcomes from Study 1.  If the replications use the same levels of α and power, on average only 2.25+36=38.25 replications will be successful.  Thus, the expected overall replication probability[3] is 38.25/105=0.36.  Again, this value is disturbingly low relative to the expectation that scientific findings should be consistently replicable, despite the fact that the parameter values assumed for this example are not wildly atypical. + +###### It is also worth noting that the low replication probability obtained in Figure 1 results partly from the computation’s exclusive focus on positive results, as dictated by the standard definition of a “successful replication” stated earlier (i.e., a positive result in study 1 followed by a positive result in study 2). + +###### Suppose instead that replication success was defined as getting the same result in Study 2 as in Study 1 (i.e., both studies got positive results or both got negative). To evaluate the probability of replication success under that revised definition, it would also be necessary to repeat each Study 1 that had negative results. If negative results were obtained again in these replication attempts, they would count as a successful replications under the revised definition. The replication probability would now jump to 87% (computations left as an exercise for the reader), which obviously sounds a lot better than the 36% computed under the standard definition. 
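###### The arithmetic behind these numbers is easy to check. The short function below is a sketch of the model in Figure 1 (my own transcription of the computation described in the text, not code from the article): it takes α, power, and the base rate of false null hypotheses and returns the share of published positives that are false, plus the expected replication probability under both the standard and the revised definitions.

```python
# Sketch of the Figure 1 model: alpha, power, and the base rate of false null
# hypotheses determine the false positive rate among published findings and
# the expected probability of a "successful replication".
def replication_model(alpha=0.05, power=0.60, base_rate=0.10, n_studies=1000):
    false_nulls = n_studies * base_rate        # studies testing a real effect (100)
    true_nulls = n_studies - false_nulls       # studies where H0 is (approximately) true (900)

    false_pos = true_nulls * alpha             # Study 1 false positives (45)
    true_pos = false_nulls * power             # Study 1 true positives (60)
    positives = false_pos + true_pos           # all Study 1 positive results (105)

    false_positive_rate = false_pos / positives          # published findings that are wrong

    # Standard definition: positive in Study 1 AND positive in Study 2
    pr_rep_standard = (false_pos * alpha + true_pos * power) / positives

    # Revised definition: Study 2 gives the same result (positive or negative) as Study 1
    same_result = (true_nulls * ((1 - alpha) ** 2 + alpha ** 2)
                   + false_nulls * (power ** 2 + (1 - power) ** 2))
    pr_rep_revised = same_result / n_studies

    return false_positive_rate, pr_rep_standard, pr_rep_revised

fpr, pr_std, pr_rev = replication_model()
print(f"{fpr:.0%} of published positives are false")   # ~43%
print(f"Pr(rep), standard definition: {pr_std:.0%}")   # ~36%
print(f"Pr(rep), revised definition:  {pr_rev:.0%}")   # ~87%
```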
+ +###### Replication probability increases under the revised definition mostly because 95% of the 855 true negative Study 1 results would be replicated successfully in Study 2 (i.e., only 5% would produce non-replicating positive effects by chance). There is of course no inherently correct definition of “successful replication”, but it is worth keeping in mind that low replicabilities under the standard definition do not mean that few studies reach correct conclusions—only that many reports of positive findings may be wrong. + +###### **Expected replication probabilities with other parameter values** + +###### The 36% replication probability in Figure 1 is of course specific to the particular parameter values assumed for that example (i.e., the researchers’ α level, the power of their experiments, and the base rate of false null hypotheses). Using the same model, though, it is possible to compute the expected replication probability over a wide range of parameter values, and the results are illuminating. + +###### Specifically, Figure 2 shows how the expected replication probability depends on the values of these three parameters when Study 2 power is the same as Study 1 power, as would be true in the case of exact replications of the original study, as diagrammed in Figure 1. Figure 3 shows the expected replication probabilities for the slightly different situation in which Study 2 power is much higher than it was in the original study—the Study 2 power value of 0.95 was used in these computations. Replicability with these “high power” replications is of interest because systematic studies of replicability (e.g., the above-cited studies producing the replicability estimates of 36%—62%) often increase sample sizes to obtain much higher power than was present in the original study. + +###### capture3*FIGURE 2. Replication probability, Pr(rep), as a function of α, power, and the base rate of false null hypotheses, computed using the model illustrated in Figure 1.* + +###### capture4*FIGURE 3. High-power replication probability, Pr(rep), as a function of α, Study 1 power, and the base rate of false null hypotheses. These Pr(rep) values were computed under the assumption that Study 2 power is 0.95, regardless of Study 1 power.* + +###### The most striking finding in Figures 2 and 3 is simply that replication probabilities can be quite low. In each panel of Figure 2, for example, the replication probability is less than or equal to the individual power of each study. The panel’s power is the maximum possible replication probability, because it is the replication probability in the ideal case where there are no false positives. The maximum replication probabilities are much higher in Figure 3, because almost all true positives replicate when Study 2 power is 0.95.  Nonetheless, within both figures some of the replication probabilities are far lower than is suggested by the expectation that scientific findings should be completely replicable. The implication is that we must either modify the research practices embodied in Figure 1 or else lower our expectations about the replicability of reported effects (cf. Stanley & Spence, 2014). + +###### Another striking result in Figures 2 and 3 is that replication probability can drop to quite low rates when the base rate of false null hypotheses is low. The reason can be seen in the model of Figure 1. If most studies test true null hypotheses, then most of the positive results will be false positives, and these will be unlikely to replicate. 
As an extreme example, suppose that 999 of the 1,000 studies tested true null hypotheses. In that case at most one of the positive results could be a true positive. All the rest would necessarily be false positives, and the overall replication rate would necessarily be low. This would be true for any level of power and practically any level of α, so a low base rate of false null hypotheses will virtually always produce low replicability. + +###### The implication, of course, is that one of the best ways to improve replicability is for researchers to avoid looking for “long shot” effects, instead requiring strong theoretical motivation before looking for effects (which, I presume, would increase the base rate). In fact, there are good reasons to believe that the temptation to test for long shots differs across research areas and that the resulting between-area differences in base rates are responsible for some of the between-area differences in replicability that have been reported (Wilson & Wixted, 2018). + +###### Finally, Figures 2 and 3 also show that replication probability increases (a) as the α level decreases (e.g., from α=0.05 to α=0.005), and (b) as statistical power increases. These patterns reinforce calls for researchers to use lower α levels (e.g., Benjamin, Berger, Johannesson, et al., 2018) and to increase power (e.g., Button, Ioannidis, Mokrysz, et al., 2013). + +###### Unfortunately, implementing these changes would be costly. For example, more than three times as much data are needed to run a study with α=0.005 and power=0.8, as compared with a study having α=0.05 and power=0.6.  Assuming that data collection resources are limited, researchers would thus face the trade-off of choosing between, say, 10 of the larger studies (with higher replicability) or 30+ of the smaller studies (with lower replicability). This would clearly be a complicated choice that could be influenced by many factors.  Using a cost-benefit analysis to quantify these trade-offs, Miller and Ulrich (2016, 2019) examined how researchers could make optimal choices (e.g., of α) to maximize overall scientific payoff, and they found—perhaps surprisingly—that under some circumstances the optimal choices would lead to replication probabilities under 50%. + +###### **Concluding comments** + +###### My presentation here was based on the de facto standard “null hypothesis testing” framework of statistical analysis shown in Table 1. Since that framework has often been criticized (for a relatively balanced discussion, see Nickerson, 2000—especially the esteemed colleague’s comment in footnote 2), some might wonder, “Could replication probabilities be improved by switching to a different framework (e.g., Bayesian)?” + +###### Answering that question requires a precise formalization of the suggested alternative framework that is comparable to the framework shown in Figure 1, but my sense is that the answer is “no”. Regardless of which statistical criterion is used to decide whether an effect is present, Studies 1 and 2 will sometimes give conflicting results due to random variation. Moreover, there will be close analogs of the base rate, α level, and power parameters within any framework for making inferences about whether effects are present, and these new parameters will have pretty much the same influences on replication probability.  For example, if the base rate of true effects is low, more of the seemingly positive Study 1’s effects will be false positives, and these effects will therefore be less replicable. 
+ +###### Thus, random variability in the data is the main factor limiting replicability—not the statistical methods used in their analysis. To be sure, improving the replicability of research is a worthwhile goal, but efforts in this direction should take into account the fundamental statistical limits on what is attainable. + +###### *Jeff Miller is Professor of Psychology at Otago University in New Zealand. His email address is miller@psy.otago.ac.nz.* + +###### [1] I have assumed that researchers use one-tailed hypothesis tests to simplify the computations. The overall pattern of results is quite similar if two-tailed tests are assumed. + +###### [2] I have also assumed that all studies have the same statistical power to simplify the computations. The overall pattern of results is also quite similar if the indicated power level is the mean power of all studies, with the power levels of individual studies varying randomly—for example, with a beta distribution. + +###### [3] It is reasonable to call this overall value an “aggregate replication probability” (Miller, 2009), because it is the overall probability of replication aggregating across all 105 different studies with initially positive findings. In contrast one might also conceive of an “individual replication probability” that would likely be of more interest to an individual researcher.  This researcher, looking at the positive results of a specific Study 1, might ask, “If I repeat this study 100 times, about how often will I get significant results?” For the numerical example in the figure, the answer to that question is “either 5% or 60%, depending on whether the tested null hypothesis was actually false.”  Note that no single researcher in this area would expect a long-term replication rate of 36%; instead the 36% value is an average for some researchers whose individual replication rates are 5% and others whose rates are 60%. + +###### **References** + +###### Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J. A., et al. (2013). Recommendations for increasing replicability in psychology. *European Journal of Personality, 27*, 108–119. doi: 10.1002/per.1919 + +###### Baker, M. (2016). Is there a reproducibility crisis? *Nature, 533*, 452—454. + +###### Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek., B. A., Wagenmakers, E. J., Berk, R., et al. (2018). Redefine statistical significance. *Nature Human Behaviour, 2*, 6–10. doi: 10.1038/s41562-017-0189-z + +###### Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J. & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. *Nature Reviews Neuroscience, 14*(5), 365—376. doi: 10.1038/nrn3475 + +###### Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J. et al. (2018).  Evaluating the replicability of social science experiments in *Nature* and *Science* between 2010 and 2015. *Nature Human Behaviour, 2*, 637—644. doi: 10.1038/s41562-018-0399-z + +###### Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. *Scientometrics, 90*(3), 891—904. doi: 10.1007/s11192-011-0494-7 + +###### Miller, J. O. (2009). What is the probability of replicating a significant effect?  *Psychonomic Bulletin & Review, 16*(4), 617—640. doi: 10.3758/PBR.16.4.617 + +###### Miller, J. O. & Schwarz, W. (2011). Aggregate and individual replication probability within an explicit model of the research process.  
*Psychological Methods, 16*(3), 337—360. doi: 10.1037/a0023347 + +###### Miller, J. O. & Ulrich, R. (2016). Optimizing research payoff. *Perspectives on Psychological Science, 11 (5)*, 664—691.  doi: 10.1177/1745691616649170 + +###### Miller, J. O. & Ulrich, R.  (2019).  The quest for an optimal alpha. *PLOS ONE, 14 (1)*, 1—13.  doi: 10.1371/journal.pone.0208631 + +###### Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. *Psychological Methods, 5*, 241—301. doi: 10.1037/1082-989X.5.2.241 + +###### Open Science Collaboration (2015). Estimating the reproducibility of psychological science. *Science, 349*(6251), aac4716-1—aac4716-8. doi: 10.1126/science.aac4716 + +###### Stanley, D. J. & Spence, J. R. (2014).  Expectations for replications: Are yours realistic? *Perspectives on Psychological Science, 9*(3), 305—318. doi: 10.1177/1745691614528518 + +###### Wilson, B. M. & Wixted, J. T.  (2018). The prior odds of testing a true effect in cognitive and social psychology.  *Advances in Methods and Practices in Psychological Science, 1 (2)*, 186—197.  doi: 10.1177/2515245918767122 + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/01/15/miller-the-statistical-fundamentals-of-non-replicability/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/01/15/miller-the-statistical-fundamentals-of-non-replicability/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/mueller-langer-et-al-replication-in-economics.md b/content/replication-hub/blog/mueller-langer-et-al-replication-in-economics.md new file mode 100644 index 00000000000..d99e1af98ee --- /dev/null +++ b/content/replication-hub/blog/mueller-langer-et-al-replication-in-economics.md @@ -0,0 +1,48 @@ +--- +title: "MUELLER-LANGER et al.: Replication in Economics" +date: 2018-10-19 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Data sharing" + - "economics" + - "Journal policies" + - "Open Science Collaboration" + - "replications" + - "Transparency" +draft: false +type: blog +--- + +###### *[This blog is based on the article “ Replication studies in economics—How many and which papers are chosen for replication, and why?” by Frank Mueller-Langer, Benedikt Fecher, Dietmar Harhoff, and Gert Wagner, published in the journal **[Research Policy](https://www.sciencedirect.com/science/article/pii/S0048733318301847?dgcid=rss_sd_all)**]* + +###### Academia is facing a quality challenge: The global scientific output doubles every nine years while the number of retractions and instances of misconduct is increasing. In this regard, replication studies can be seen as important post-publication quality checks in addition to the established pre-publication peer review process. It is for this reason that replicability is considered a hallmark of good scientific practice. In our recent research paper, we explore how often replication studies are published in empirical economics and what types of journal articles are replicated. + +###### We find that between 1974 and 2014, 0.1% of publications in the top 50 economics journals were replication studies. We provide empirical support for the hypotheses that higher-impact articles and articles by authors from leading institutions are more likely to be replicated, whereas the replication probability is lower for articles that appeared in top 5 economics journals. 
Our analysis also suggests that mandatory data disclosure policies may have a positive effect on the incidence of replication. The article can be found ***[here](https://www.sciencedirect.com/science/article/pii/S0048733318301847?dgcid=rss_sd_all)*** (published under a Creative Commons license). + +###### Scientific research plays an important role in the advancement of technologies and the fostering of economic growth. Hence, the production of thorough and reliable scientific results is crucial from a social welfare and science policy perspective. However, in times of increasing retractions and frequent instances of inadvertent errors, misconduct or scientific fraud, scientific quality assurance mechanisms are subject to a high level of scrutiny. + +###### Issues regarding the replicability of scientific research have been reported in multiple scientific fields, most notably in psychology. A ***[report by the Open Science Collaboration](http://science.sciencemag.org/content/349/6251/aac4716)*** from 2015 estimated the reproducibility of 100 studies in psychological science from three high-ranking psychology journals. Overall, only 36% of the replications yielded statistically significant effects compared to 97% of the original studies that had statistically significant results. + +###### However, similar issues have been reported from other fields. For example, ***[Camerer and colleagues attempted to replicate 18 studies](http://science.sciencemag.org/content/early/2016/03/02/science.aaf0918)*** published in two top economic journals—the *American Economic Review* and the *Quarterly Journal of Economics*—between 2011 and 2014 and were able to find a significant effect in the same direction as proposed by the original research in only 11 out of 18 replications (61%). + +###### Considering the impact that economic research has on society, for example in a field like evidence-based policy making, there is a particular need to explore and understand the drivers of replication studies in economics in order to design favorable boundary conditions for replication practice. + +###### We explore formal, i.e., published, replication studies in economics by examining which and how many published papers are selected for replication and what factors drive replication in these instances. To this extent, we use metadata about all articles published in the top 50 economics journals between 1974 and 2014. While there are also informal replication studies that are not published in scientific journals (especially replications conducted in teaching or published as working papers) and an increasing number of other forms of post-publication review (e.g., discussions on websites such as *PubPeer*), these are not covered with our approach. + +###### We find that between 1974 and 2014 0.1% of publications in the top 50 economics journals were replication studies. We find evidence that replication is a matter of impact: higher-impact articles and articles by authors from leading institutions are more likely to be replicated, whereas the replication probability is lower for articles that appeared in top 5 economics journals. Our analysis also suggests that mandatory data disclosure policies may have a positive effect on the incidence of replication. + +###### Based on our findings, we argue that replication efforts could be incentivized by reducing the cost of replication, for example by promoting data disclosure. 
Our results further suggest that the decision to conduct a replication study is partly driven by the replicator’s reputation considerations. Arguably, the low number of replication studies being conducted could potentially increase if replication studies received more formal recognition (for instance, through publication in [high-impact] journals), specific funding, (for instance, for the replication of articles with a high impact on public policy), or awards. Since replication is, at least partly, driven by reputational rewards, it may be a viable strategy to document and reward formal as well as informal replication practices. + +###### \* *DISCLAIMER*: *The views expressed are purely those of the authors and may not in any circumstances be regarded as stating an official position of the European Commission.* + +###### *Dr. Frank Müller-Langer is Affiliated Researcher at the Max Planck Institute for Innovation and Competition (MPI-IC) and Research Fellow at European Commission, Joint Research Centre, Directorate Growth & Innovation, Digital Economy Unit. Dr. Benedikt Fecher is Head of the “Learning, Knowledge, Innovation” research programme at the Alexander von Humboldt Institute for Internet and Society and co-editor of the blog journal Elephant in the Lab. Prof. Dietmar Harhoff, Ph.D., is Director at MPI-IC and Head of the MPI-IC Innovation and Entrepreneurship Research Group. Prof. Dr. Gert G. Wagner is Research Associate at the Alexander von Humboldt Institute for Internet and Society, Max Planck Fellow at the MPI for Human Development (Berlin) and Senior Research Fellow at the German Socio-Economic Panel Study (SOEP). Correspondence regarding this blog can be directed to Dr. Müller-Langer at [frank.mueller-langer@ip.mpg.de](mailto:frank.mueller-langer@ip.mpg.de).* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/10/19/mueller-langer-et-al-replication-in-economics/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/10/19/mueller-langer-et-al-replication-in-economics/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/mueller-langer-fecher-harhoff-wagner-what-matters-for-replication.md b/content/replication-hub/blog/mueller-langer-fecher-harhoff-wagner-what-matters-for-replication.md new file mode 100644 index 00000000000..ee2c4785ee0 --- /dev/null +++ b/content/replication-hub/blog/mueller-langer-fecher-harhoff-wagner-what-matters-for-replication.md @@ -0,0 +1,57 @@ +--- +title: "MUELLER-LANGER, FECHER, HARHOFF & WAGNER: What Matters for Replication" +date: 2017-02-17 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "economics" + - "Impact" + - "Journal Data Policies" + - "replication" + - "reputation" +draft: false +type: blog +--- + +###### *NOTE: This entry is based on the paper, “**[The Economics of Replication](https://papers.ssrn.com/sol3/papers2.cfm?abstract_id=2908716)**”* + +###### Replication studies are considered a hallmark of good scientific practice (*1*). Yet they are treated among researchers as an ideal to be professed but not practised (*2*, *3*). For science policy makers, journal editors and external research funders to design favourable boundary conditions, it is therefore necessary to understand what drives replication. 
Using metadata from all articles published in the top-50 economics journals from 1974 to 2014, we investigated how often replication studies are published and which types of journal articles are replicated. + +###### We find that replication is a matter of impact: High-impact articles and articles by authors from leading institutions are more likely to be replicated. We could not find empirical evidence for the hypothesis that the lower replication cost associated with the availability of data and code has a significant effect on the incidence of replication. + +###### We argue that researchers behave highly rationally in terms of the academic reputation economy, as they tend to replicate high-impact research from renowned researchers and institutions, possibly because in this case replications are more likely to be published (*4*). Our results are in line with previous assumptions that relate replication to impact (*3*, *5*–*7*). In this regard, private incentives are well aligned with societal interests, since high-impact publications are also the studies that are most likely to influence political and economic decisions as well as the public discourse. + +###### However, the question remains whether sufficient replications are conducted to guarantee the correctness of published findings. While we have no analytical result that would indicate which rate of replication is optimal for a scientific discipline, having less than 0.1% of articles in the top-50 economics journals being replications strikes us as unreasonably low. In addition, there is no reason to believe that the share of published replication studies would be significantly higher among non-top-50 articles (*2*). We argue that such a low incidence of replication poses no real threat to researchers. + +###### We also have to note that we cannot detect any statistically strong impact of data disclosure policies. Moreover, for 37% of the empirical articles subject to mandatory data disclosure, the data or program code was not available even though the data were not proprietary. This raises concerns regarding the enforcement of mandatory data disclosure policies. + +###### Our results suggest that replication is—at least partly—driven by the replicator’s reputation considerations. Thus, the low number of replication studies being conducted would likely increase if replication received more formal recognition, e.g. through publication in (high-impact) journals or specific funding. The same holds true for replicated authors, who should receive formal recognition if their results are successfully replicated. This could additionally motivate authors to ensure the replicability of published results. Moreover, considering the costs of replication, a stronger commitment by publishers to the replicability of research, through establishing and enforcing data availability policies, would lower the barrier for replicators. + +###### *Frank Mueller-Langer is Senior Research Fellow at the Max Planck Institute for Innovation and Competition and the Joint Research Center, Seville. Benedikt Fecher is a doctoral student at the German Institute of Economic Research and the Alexander von Humboldt Institute for Internet and Society. Dietmar Harhoff is Director at the Max Planck Institute for Innovation and Competition. Gert G. Wagner is Board Member of the German Institute for Economic Research and Max Planck Fellow at the MPI for Human Development in Berlin. 
Correspondence about this blog should be directed to Benedikt Fecher at* [*fecher@hiig.de*](mailto:fecher@hiig.de)*.* + +###### **References** + +###### (1) B. R. Jasny, G. Chin, L. Chong, S. Vignieri, Again, and Again, and Again … *Science*. 334, 1225–1225 (2011). + +###### (2) M. Duvendack, R. W. Palmer-Jones, W. R. Reed, Replications in Economics: A Progress Report. *Econ Journal Watch*. 12, 164–191 (2015). + +###### (3) D. S. Hamermesh, Viewpoint: Replication in economics. *Canadian Journal of Economics*. 40, 715–733 (2007). + +###### (4) B. Fecher, S. Friesike, M. Hebing, S. Linek, A. Sauermann, A Reputation Economy: Results from an Empirical Survey on Academic Data Sharing. *DIW Berlin Discussion Paper*. 1454 (2015) (available at ). + +###### (5) D. Hamermesh, What is Replication? The Possibly Exemplary Example of Labor Economics (2017), (available at ). + +###### (6) J. L. Furman, K. Jensen, F. Murray, Governing Knowledge in the Scientific Community: Exploring the Role of Retractions in Biomedicine. *Research Policy*. 41, 276–290 (2012). + +###### (7) W. G. Dewald, J. G. Thursby, R. G. Anderson, Replication in Empirical Economics: The Journal of Money, Credit and Banking Project. *The American Economic Review*. 76, 587–603 (1986). \ No newline at end of file diff --git a/content/replication-hub/blog/murphy-quantifying-the-role-of-research-misconduct-in-the-failure-to-replicate.md b/content/replication-hub/blog/murphy-quantifying-the-role-of-research-misconduct-in-the-failure-to-replicate.md new file mode 100644 index 00000000000..f40379398cc --- /dev/null +++ b/content/replication-hub/blog/murphy-quantifying-the-role-of-research-misconduct-in-the-failure-to-replicate.md @@ -0,0 +1,72 @@ +--- +title: "MURPHY: Quantifying the Role of Research Misconduct in the Failure to Replicate" +date: 2018-01-04 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Cherry-Picking" + - "HARKing" + - "p-hacking" + - "Question Trolling" + - "replication" + - "Reproducibility crisis" +draft: false +type: blog +--- + +###### *[NOTE: This blog is based on the article “HARKing: How Badly Can Cherry-Picking and Question Trolling Produce Bias in Published Results?” by Kevin Murphy and Herman Aguinis, recently published in the Journal of Business and Psychology.]* + +###### The track record for replications in the social sciences is discouraging. There have been several recent papers documenting and commenting on the failure to replicate studies in economics and psychology (Chang & Li, 2015; Open Science Collaboration, 2015; Ortmann, 2015; Pashler & Wagenmakers, 2012). This “reproducibility crisis” has stimulated a number of excellent methodological papers documenting the many reasons for the failure to replicate (Braver, Thoemmes & Rosenthal, 2014; Maxwell, 2004). 
In general, this literature has shown that a combination of low levels of statistical power and a continuing reliance on null hypothesis testing has contributed substantially to the apparent failure of many studies to replicate, but there is a lingering suspicion that research misconduct plays a role in the widespread failure to replicate. + +###### Out-and-out fraud in research has been reported in a number of fields; Ben-Yehuda and Oliver-Lumerman (2017) have chronicled nearly 750 cases of research fraud between 1880 and 2010 involving fabrication and falsification of data, misrepresentation of research methods and results, and plagiarism. Their work has helped to identify the roles of institutional factors in research fraud (e.g., a large percentage of the cases examined involved externally funded research at elite institutions) as well as identifying ways of detecting and responding to fraud. This type of fraud appears to represent only a small proportion of the studies that are published, and since many of the known frauds have been perpetrated by the same individuals, the proportion of genuinely fraudulent researchers may be even smaller. + +###### A more worrisome possibility is that researcher behaviors that fall short of outright fraud may nevertheless bias the outcomes of published research in ways that will make replication less likely. In particular, there is a good deal of evidence that a significant proportion of researchers engage in behaviors such as HARKing (posing “hypotheses” after the results of a study are known) or *p-hacking* (combing through or accumulating results until you find statistical significance) (Bedeian, Taylor & Miller, 2010; Head, Holman, Lanfear, Kahn & Jennions, 2015; John, Loewenstein & Prelec, 2012). These practices have the potential to bias results because they involve a systematic effort to find and report only the strongest results, which will of course make it less likely that subsequent studies in these same areas will replicate well. + +###### Although it is widely recognized that author misconduct, such as HARKing, can bias the results of published studies (and therefore make replication more difficult), it has proved surprisingly difficult to determine *how badly* HARKing actually influences research results. + +###### There are two reasons for this. First, HARKing might include a wide range of behaviors, from post-hoc analyses that are clearly labelled as such to unrestricted data mining in search of something significant to publish, and different types of HARKing might have quite different effects. Second, authors usually do not disclose that the results they are submitting for publication are the result of HARKing, and there is rarely a definitive test for HARKing [O’Boyle, Banks & Gonzalez-Mulé (2017) were able to evaluate HARKing on an individual basis by comparing the hypotheses posed in dissertations with those reported in published articles based on the same work, and they suggested that in the majority of the cases they examined, there was considerably more alignment between results and hypotheses in published papers than in dissertations, presumably as a result of post-hoc editing of hypotheses]. + +###### In a recent paper Herman Aguinis and I published in the *Journal of Business and Psychology* (***[see here](http://hermanaguinis.com/JBPharking.pdf)***), we suggested that simulation methods could be useful for assessing the likely impact of HARKing on the cumulative findings of a body of research. 
In particular, we used simulation methods to try to capture what it is authors actually *do* when they HARK. Our review of research on HARKing suggested that two particular types of behavior are both widespread and potentially worrisome. First, some authors decide on a research question, then scan results from several samples, statistical tests, or operationalizations of their key variables, selecting the strongest effects for publication. This type of *cherry picking* does not invent new hypotheses after the data have been collected, but rather samples the results that have been obtained to present the best case for a particular hypothesis. Other authors scan results from different studies, samples, analyses, etc. that involve some range of variables, decide after looking at the data which relationships look strongest, and then write up their research as if they had hypothesized this relationship all along. This form of *question trolling* is potentially more worrisome than cherry picking because these researchers allow the data to tell them what their research question should be rather than using the research question to determine what sort of data should be collected and examined. + +###### We wrote simulations that mimicked these two types of author behaviors to determine how much bias these behaviors might introduce. Because both cherry picking and question trolling represent choosing the strongest results for publication, they are both likely to introduce some biases (and thus make the likelihood of subsequent replications lower). Our results suggest that cherry picking introduces relatively small biases, but because the effects reported in the behavioral and social sciences are often quite small (Bosco, Aguinis, Singh, Field & Pierce, 2015), cherry picking can create a substantial boost in the relative size of effect size estimates. Question trolling has the potential to create biases that are sizable in both an absolute and a relative sense. + +###### Our simulations suggest that the effects of HARKing on a cumulative literature can be surprisingly complex. They depend on the prevalence of HARKing, the type of HARKing involved, and the size and homogeneity of the pool of results the researcher consults before deciding what his or her “hypothesis” actually is. + +###### *Professor Kevin Murphy holds the Kemmy Chair of Work and Employment Studies at the University of Limerick. He can be contacted at Kevin.R.Murphy@ul.ie.* + +###### REFERENCES + +###### Bedeian, A. G., Taylor, S. G., & Miller, A. N. (2010). Management science on the credibility bubble: Cardinal sins and various misdemeanors. *Academy of Management Learning & Education, 9*, 715-725. + +###### Ben-Yehuda, N. & Oliver-Lumerman, A. (2017). *Fraud and Misconduct in Research: Detection, Investigation and Organizational Response*. University of Michigan Press. + +###### Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. *Journal of Applied Psychology, 100*, 431–449. + +###### Braver, S. L., Thoemmes, F. J., & Rosenthal, R. (2014). Continuously cumulating meta-analysis and replicability. *Perspectives on Psychological Science, 9*, 333–342. doi:10.1177/1745691614529796 + +###### Chang, A. C., & Li, P. (2015). Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say “Usually Not.” Finance and Economics Discussion Series 2015-083. 
+ +###### Washington: Board of Governors of the Federal Reserve System, doi:10.17016/FEDS.2015.083 + +###### Head, M.L., Holman, L., Lanfear, R., Kahn, A.T. & Jennions, M.D. (2015). The Extent and Consequences of P-Hacking in Science. *PLOS Biology*, + +###### John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. *Psychological Science*, 23, 524-532. + +###### Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. *Psychological Methods, 9*, 147–163. doi:10.1037/1082- 989X.9.2.147 + +###### O’Boyle, E. H., Banks, G. C., & Gonzalez-Mulé, E. (2017). The chrysalis effect: How ugly initial results metamorphosize into beautiful articles. *Journal of Management, 43*, NPi. 0149206314527133. + +###### Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. doi:10.1126/science.aac4716 + +###### Ortmann, A. (2015, November 2). The replication crisis has engulfed economics. Retrieved from [***http://theconversation.com/the-replication-crisis-has-engulfed-economics-49202***](http://theconversation.com/the-replication-crisis-has-engulfed-economics-49202) + +###### Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? *Perspectives on Psychological Science, 7,* 528–530. doi:10.1177/1745691612465253 + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/01/04/murphy-quantifying-the-role-of-research-misconduct-in-the-failure-to-replicate/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/01/04/murphy-quantifying-the-role-of-research-misconduct-in-the-failure-to-replicate/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/parasurama-why-overlapping-confidence-intervals-mean-nothing-about-statistical-significance.md b/content/replication-hub/blog/parasurama-why-overlapping-confidence-intervals-mean-nothing-about-statistical-significance.md new file mode 100644 index 00000000000..00e36215fd7 --- /dev/null +++ b/content/replication-hub/blog/parasurama-why-overlapping-confidence-intervals-mean-nothing-about-statistical-significance.md @@ -0,0 +1,63 @@ +--- +title: "PARASURAMA: Why Overlapping Confidence Intervals Mean Nothing About Statistical Significance" +date: 2017-11-11 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "null hypothesis significance testing" + - "quadrature" + - "Science" + - "Statistical practice" + - "Testing for differences in estimates" +draft: false +type: blog +--- + +###### *[NOTE: This is a repost of a blog that Prasanna Parasurama published at the blogsite Towards Data Science].* + +###### Prasanna1 + +###### ***“The confidence intervals of the two groups overlap, hence the difference is not statistically significant”*** + +###### The statement above is wrong. Overlapping confidence intervals/error bars say nothing about statistical significance. Yet, a lot of people make the mistake of inferring lack of statistical significance. Likely because the inverse — non-overlapping confidence intervals — means statistical significance. I’ve made this mistake. 
I think part of the reason it is so pervasive is that it is often not explained why you cannot compare overlapping confidence intervals. I’ll take a stab at explaining this in this post in an intuitive way. HINT: It has to do with how we keep track of error. + +###### **The Setup** + +###### – We have 2 groups: Group Blue and Group Green. + +###### – We are trying to see if there is a difference in age between these two groups. + +###### – We sample the groups to find the mean μ and standard deviation σ (aka error) and build a distribution for each group. + +###### – Group Blue’s average age is 9 years with an error of 2.5 years. Group Green’s average age is 17, also with an error of 2.5 years. + +###### – The shaded regions show the 95% confidence intervals (CI). + +###### From this setup, many will erroneously infer that there is no statistically significant difference between the groups, which may or may not be correct. + +###### **The Correct Setup** + +###### – Instead of building a distribution for each group, we build one distribution for the difference in mean age between the groups. + +###### – If the 95% CI of the difference contains 0, then there is no statistically significant difference in age between the groups. If it doesn’t contain 0, then there is a statistically significant difference between the groups. As it turns out, the difference is statistically significant, since the 95% CI (shaded region) doesn’t contain 0. + +###### **Why?** + +###### In the first setup we draw the distributions, then find the difference. In the second setup, we find the difference, then draw the distribution. Both setups seem so similar that it seems counter-intuitive that we get completely different outcomes. The root cause of the difference lies in error propagation — a fancy way of saying how we keep track of error. + +###### **Error Propagation** + +###### Imagine you are trying to measure the area A of a rectangle with sides L, W. You measure the sides with a ruler and you estimate that there is an error of 0.1 associated with measuring a side. To estimate the error of the area, intuitively you’d think it is 0.1 + 0.1 = 0.2, because errors add up. It is almost correct; errors add, but they add in quadrature (squaring, summing, then taking the square root of the sum). That is, imagine these errors as 2 orthogonal vectors in space. The resulting error is the magnitude of the sum of these vectors. + +###### **Circling Back** + +###### The reason we get different results from the 2 setups is how we propagate the error for the difference in age. In the first setup, we simply added the errors of each group. In the second setup, we added the errors in quadrature. Since adding in quadrature yields a smaller value than adding normally, we overestimated the error in the first setup, and incorrectly inferred no statistical significance. + +###### *Prasanna Parasurama is a data scientist at Atipica. He can be contacted at prasanna@atipicainc.com.*
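###### A quick numerical check of the example above makes the point tangible. The sketch below treats the stated error of 2.5 years as the standard error of each group mean (an assumption on my part, since the post uses “error” loosely): the two 95% intervals overlap, yet the interval for the difference excludes zero.

```python
# Numerical check of the example above, assuming the quoted "error" of 2.5
# years is the standard error of each group's mean age.
import math

z = 1.96                                   # two-sided 95% critical value
mean_blue, mean_green, se = 9.0, 17.0, 2.5

ci_blue = (mean_blue - z * se, mean_blue + z * se)     # (4.1, 13.9)
ci_green = (mean_green - z * se, mean_green + z * se)  # (12.1, 21.9)
print("intervals overlap:", ci_blue[1] > ci_green[0])  # True

# Correct setup: a CI for the difference, with the errors added in quadrature
diff = mean_green - mean_blue
se_diff = math.sqrt(se**2 + se**2)                     # ~3.54, not 2.5 + 2.5 = 5.0
ci_diff = (diff - z * se_diff, diff + z * se_diff)     # (1.07, 14.93)
print("difference CI excludes 0:", ci_diff[0] > 0)     # True -> statistically significant
```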
\ No newline at end of file diff --git a/content/replication-hub/blog/pfeiffer-taking-replication-markets-to-a-whole-new-level.md b/content/replication-hub/blog/pfeiffer-taking-replication-markets-to-a-whole-new-level.md new file mode 100644 index 00000000000..c22b320a41e --- /dev/null +++ b/content/replication-hub/blog/pfeiffer-taking-replication-markets-to-a-whole-new-level.md @@ -0,0 +1,70 @@ +--- +title: "PFEIFFER: Taking Replication Markets To a Whole New Level" +date: 2019-09-23 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "3000 studies" + - "DARPA" + - "Prediction Markets" + - "replication" + - "Replication Markets" + - "SCORE" +draft: false +type: blog +--- + +###### Replication markets are prediction markets run in conjunction with systematic replication projects. We conducted such markets for the Replication Project: Psychology (RPP), Experimental Economics Replication Project (EERP), Social Science Replication Project (SSRP) and the Many Labs 2 Project (ML2). The participants in these markets trade ‘bets’ on the outcome of replications. Through the pricing of these bets they generate and negotiate quantitative forecasts for the replication results. + +###### This post has three objectives: 1 – Advertise a new replication market project; 2 – Explain why it is useful to run prediction markets on replications; 3 – Discuss caveats with relying on binary interpretations of replication results. + +###### **Advertisement** + +###### As part of DARPA SCORE, we are currently recruiting participants for a new replication market project. As in our past projects, there is a bunch of studies that are going to be replicated, and we’d love to know how well our participants can forecast the outcome of these replications. There are some important differences to our past markets. + +###### The first major difference is scale: in past replication markets, the number of studies we elicited forecasts for was in the order of 20. For SCORE, our forecasted studies will total 3,000. + +###### Of course, nobody is going to replicate 3,000 studies. Rather, about 100 are selected for replication. We are not informed which ones. We designed our markets to generate forecasts for all 3,000 studies, but only bets for those 100 forecasts that are validated will be paid out. + +###### Second, given the scale, we do the forecasting in monthly rounds over about one year. We will have 10 rounds, each on 300 studies, and 2 types of incentives for our participants. + +###### Each round, prizes are distributed for survey responses to those who are estimated (using a peer assessment method) to be most accurate. In addition, bets in the market are paid out once the 100 replications have been conducted, and the replication outcomes are released. The total prize pool is USD 100,000+. Currently, we have more than 600 active participants. If interested, have a look / sign up at ***[predict.replicationmarkets.com](https://predict.replicationmarkets.com/main/#!/users/register?referral_id=TPblog)***! + +###### **Why are we doing this?** + +###### We started with prediction markets to see if researchers have an idea about which findings are replicable, and which ones are not; and if prediction markets can aggregate this information into accurate forecasts. + +###### We acknowledge that “an idea about which findings are reliable” is incredibly vague. 
A more accurate description could involve Bayesian subjective priors on the replicated hypotheses, beliefs on distributions of effect sizes, and considerations about the appropriateness of the instrumentalization. But we’re not there yet.

###### Our results are encouraging: in our past replication markets, the forecasted probabilities fit the observed outcomes very well. In these projects, we observed the outcomes for (nearly) all forecasted replications. The value of forecasting replications may therefore not be so much the forecast itself, but the proof-of-principle that the outcome of replications can be forecasted.

###### In the new SCORE project, this will be different. Rather than just providing a proof-of-principle, there are approximately 2,900 studies that are not selected for replication. For those studies, our forecasts will provide valuable pieces of information on the studies’ credibility.

###### **Binary interpretations**

###### In our replication markets, we typically use a binary criterion to settle the bets: whether the replication result is statistically significant and the effect is in the same direction as in the original study.

###### The use of such dichotomies has been criticized – have a look at the “dichotomania” thread in Andrew Gelman’s blog – not only for prediction markets, but also for the summaries of large-scale replications (“X out of Y studies replicated”) and for the interpretation of research findings in general (p<0.05 = evidence; p>0.05 = lack of evidence).

###### Dichotomies are simplifications, and as such entail a loss of information. Publications on large-scale replication projects therefore offer a wealth of additional information, starting from additional binary criteria down to every single replication effect size and how it relates to the original effect.

###### For prediction markets, the reason for using the above binary criterion is that elicitation for continuous outcomes appears to be much harder than for binary outcomes. We would love to elicit and aggregate the beliefs of our forecasters in all detail and richness, but we haven’t yet figured out how to do this best.

###### So far, the results we are getting for binary forecasts tend to be more reliable, and therefore we stick to this approach. Among the various options for binary outcomes, we believe that same direction + statistical significance is the best criterion to elicit forecasts for, because this is the most common way in which researchers judge replications.

###### Prediction markets might provide a stepping stone out of “dichotomania” in that they encourage dealing with uncertainty in a quantitative way. Rather than providing us with a simple Yes/No, our forecasters use probabilities to quantify and negotiate uncertainty in replication outcomes.

###### Obviously, there is still a way to go – we are just beginning to explore how best to differentiate between, e.g., a forecaster who believes that an effect exists but is too small to be detected in a replication, and a forecaster who doubts the effect’s existence or the validity of the instrumentalization.

###### In the meantime, we believe our past prediction markets and the upcoming SCORE project are important as they show that the scientific community has valuable information on the credibility of claims made in scientific publications. 
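
###### For readers who prefer code to prose, here is a toy sketch (not our project code) of the binary settlement criterion described above, together with a Brier score for judging how well probabilistic forecasts line up with the settled outcomes. All numbers are invented for illustration.

```
# Toy sketch (not project code): settle a bet with the binary criterion
# "same direction as the original AND statistically significant", then
# score probabilistic forecasts with the Brier score. Numbers are invented.
settle_bet <- function(original_estimate, replication_estimate, replication_p, alpha = 0.05) {
  same_direction <- sign(original_estimate) == sign(replication_estimate)
  significant    <- replication_p < alpha
  as.integer(same_direction & significant)   # 1 = "replicated", 0 = "not replicated"
}

# Hypothetical market prices (read as probabilities) and replication results
forecasts <- c(0.80, 0.35, 0.60)
outcomes  <- c(settle_bet( 0.4,  0.3, 0.01),
               settle_bet( 0.5, -0.1, 0.40),
               settle_bet(-0.2, -0.3, 0.03))

# Brier score: mean squared gap between forecast and outcome (lower is better)
mean((forecasts - outcomes)^2)
```
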
+ +###### *Thomas Pfeiffer is Professor in Computational Biology/Biochemistry at Massey University, New Zealand; and a member of the Professoriate at the New Zealand Institute for Advanced Study. He can be contacted at [pfeiffer.massey@gmail.com](mailto:pfeiffer.massey@gmail.com).* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/09/23/taking-replication-markets-to-a-whole-new-level/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/09/23/taking-replication-markets-to-a-whole-new-level/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/r-seler-replication-research-symposium-and-journal.md b/content/replication-hub/blog/r-seler-replication-research-symposium-and-journal.md new file mode 100644 index 00000000000..048a9ba8c87 --- /dev/null +++ b/content/replication-hub/blog/r-seler-replication-research-symposium-and-journal.md @@ -0,0 +1,40 @@ +--- +title: "RÖSELER: Replication Research Symposium and Journal" +date: 2025-02-08 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Annotator" + - "Educational materials" + - "Explorer" + - "FORRT" + - "Framework for Open and Reproducible Research Training" + - "Journal policies" + - "Replication Research journal" + - "Replication Research Symposium" +draft: false +type: blog +--- + +Efforts to teach, collect, curate, and guide replication research are culminating in the new diamond open access journal *Replication Research,* which will launch in late 2025. The Framework for Open and Reproducible Research Training (FORRT; [***forrt.org***](http://forrt.org)) and the Münster Center for Open Science have spearheaded several initiatives to bolster replication research across various disciplines. From May 14-16, 2025, we are excited to invite researchers to join us in Münster, as well as online, for the ***Replication Research Symposium***. This event will mark a significant step toward the launch of our interdisciplinary journal dedicated to reproductions, replications, and discussions on the methodologies involved. But let’s start from the beginning: What is going on at FORRT? + +**Finding and exploring replications**: FORRT Replication Database (FReD) includes hundreds of replication studies and thousands of replication findings – which we define as tests of previously established claims using different data. Researchers can use the [***Annotator***](https://forrt.org/apps/fred_annotator.html) to have their reference lists auto-checked to see whether they cited original studies that have been replicated. With the [***Explorer***](https://forrt.org/apps/fred_explorer.html), they get an overview of all studies and can analyze replication rates across different success criteria or moderator variables. + +**Meta-analyzing replication outcomes**: To increase the accessibility of the database, we created the FReD [***R-package***](https://forrt.org/FReD/index.html) with which researchers can run their own analyses or run the ShinyApps locally. In a vignette, we outline different [***replication success criteria***](https://forrt.org/FReD/articles/success_criteria.html) and show how this choice can affect the overall replication success rate. + +**Teaching replications**: One of FORRT’s core ideas is to support researchers from all fields to learn about openness and reproducibility. 
Among numerous projects, we clarified terminology ([***Glossary of Open Science Terms***](https://forrt.org/glossary/)), produced educational materials such as an [***educationally-driven review paper***](https://www.nature.com/articles/s44271-023-00003-2) on the transformative impact of the replication crisis, [***syllabus and slides***](https://forrt.org/positive-changes-replication-crisis/) with lecture and pedagogical notes (see Educational toolkit), and [***curated resources***](https://forrt.org/resources/). We are also now working together with experts from economics, psychology, medicine, and other fields to create an interdisciplinary guide to carrying out replications and reproductions. + +**Publishing replication studies and discussing standards across fields**: We are currently developing the journal [***Replication Research***](https://lukasroeseler.github.io/replicationresearch_mockup/), a diamond open-access outlet for replication and reproduction studies and discussions about the respective methods. There will be reproducibility checks for all published studies and standardized machine-readable templates that authors are encouraged to use. We are currently building the journal with a network of 20 experts from different fields. From February until April, we are organizing the *Road to Replication Research* via Zoom. This online discussion series is centered around different aspects of open and responsible scientific publishing and is open to anybody who wants to join the conversation, so that the journal is maximally open from the start. Finally, at the *Replication Research Symposium*, participants and experts from diverse fields such as psychology, economics, biology, medicine, marketing, meta-science, library science, humanities, and others will convene to discuss the significance and methodology of conducting replication and reproduction studies over three days in May 2025. This symposium will further shape *Replication Research* and [***we invite researchers from all fields to present their replications, reproductions, or methodological discussions***](https://indico.uni-muenster.de/event/3176/abstracts/). The journal launch is then slated for late 2025. + +For more information about *Replication Research,* the upcoming symposium, and the online discussion series about the creation of the journal [***click here***](https://lukasroeseler.github.io/replicationresearch_mockup/). + +*Lukas Röseler is the managing director of the Münster Center for Open Science at the University of Münster, one of the project leads at FORRT’s Replication Hub, and will be the managing editor of Replication Research. He can be contacted at lukas.roeseler@uni-muenster.de.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/02/08/roseler-replication-research-symposium-and-journal/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/02/08/roseler-replication-research-symposium-and-journal/?share=facebook) + +Like Loading... 
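
As a generic illustration of the success criteria mentioned above (this is not the FReD package’s code; the function and numbers are invented for this post), the sketch below shows how three common criteria can disagree about a single replication finding:

```
# Generic illustration (not the FReD package's implementation) of how the
# choice of success criterion changes the verdict for one replication finding.
# Inputs: original and replication estimates with their standard errors.
success_criteria <- function(est_o, se_o, est_r, se_r, alpha = 0.05) {
  z   <- qnorm(1 - alpha / 2)
  p_r <- 2 * pnorm(-abs(est_r / se_r))
  c(
    significant_same_sign = (p_r < alpha) && (sign(est_r) == sign(est_o)),
    rep_in_original_ci    = (est_r > est_o - z * se_o) && (est_r < est_o + z * se_o),
    orig_in_rep_ci        = (est_o > est_r - z * se_r) && (est_o < est_r + z * se_r)
  )
}

# Invented example: a smaller but significant replication of a positive effect.
# It "succeeds" on the significance criterion but "fails" both CI criteria.
success_criteria(est_o = 0.50, se_o = 0.10, est_r = 0.20, se_r = 0.09)
```
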
\ No newline at end of file diff --git a/content/replication-hub/blog/randall-welser-on-the-irreproducibility-crisis-of-modern-science.md b/content/replication-hub/blog/randall-welser-on-the-irreproducibility-crisis-of-modern-science.md new file mode 100644 index 00000000000..498170af9d0 --- /dev/null +++ b/content/replication-hub/blog/randall-welser-on-the-irreproducibility-crisis-of-modern-science.md @@ -0,0 +1,59 @@ +--- +title: "RANDALL & WELSER: On the Irreproducibility Crisis of Modern Science" +date: 2018-04-20 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "governmental policy" + - "National Association of Scholars" + - "Policy recommendations" + - "Reproducibility crisis" + - "scientific credibility" +draft: false +type: blog +--- + +###### *[This post is based on the report,  “The Irreproducibility Crisis of Modern Science: Causes, Consequences and the Road to Reform”, recently published by the **[National Association of Scholars](https://www.nas.org/)**]* + +###### For more than a decade, and especially since the publication of a famous 2005 article by John Ioannidis, scientists in various fields have been concerned with the problems posed by the replication crisis. The importance of the crisis demands that it be understood by a larger audience of educators, policymakers, and ordinary citizens. To this end, our new report, ***[The Irreproducibility Crisis of Modern Science](https://www.nas.org/replication-network-blog/documents/irreproducibility_report/NAS_irreproducibilityReport.pdf)***, outlines the nature, causes, and significance of the crisis, and offers a series of proposals for confronting it. + +###### At its most basic level, the crisis arises from the widespread use of statistical methods that inevitably produce some false positives. Misuse of these methods easily increases the number of false positives, leading to the publication of many spurious findings of statistical significance. “P-hacking” (running repeated statistical tests until a finding of significance emerges) is probably the most common abuse of statistical methods, but inadequate specification of hypotheses and the tendentious construction of datasets are also serious problems. (Gelman and Loken 2014 provide several good examples of how easily these latter faults can vitiate research findings.) + +###### Methodological errors and abuses are enabled by too much researcher freedom and too little openness about data and procedures. Researchers’ unlimited freedom in specifying their research designs—and especially their freedom to change their research plans in mid-course—makes it possible to conjure statistical significance even for obviously nonsensical hypotheses (Simmons, Nelson, and Simonsohn 2011 provide a classic demonstration of this). At the same time, lack of outside access to researchers’ data and procedures prevents other experts from identifying problems in experimental design. + +###### Other factors in the irreproducibility crisis exist at the institutional level. Academia and the media create powerful incentives for researchers to advance their careers by publishing new and exciting positive results, while inevitable professional and political tendencies toward groupthink prevent challenges to an existing consensus. + +###### The consequences of all these problems are serious. Not only is a lot of money being wasted—in the United States, up to $28 billion annually on irreproducible preclinical research alone (Freedman et al. 
2015)—but individuals and policymakers end up making bad decisions on the basis of faulty science. Perhaps the worst casualty is public confidence in science, as people awaken to how many of the findings they hear about in the news can’t actually be trusted. + +###### Fixing the replication crisis will require energetic efforts to address its causes at every level. Many scientists have already taken up the challenge, and institutions like the ***[Center for Open Science](https://cos.io/)*** and the ***[Meta-Research Innovation Center at Stanford (METRICS)](https://metrics.stanford.edu/)***, both in the U.S., have been established to improve the reproducibility of research. Some academic journals have changed the ways in which they ask researchers to present their results, and other journals, such as the ***[International Journal for Re-Views in Empirical Economics](https://www.iree.eu/)***, have been created specifically to push back against publication bias by publishing negative results and replication studies. National and international organizations, including the World Health Organization, have begun delineating more stringent research standards. + +###### But much more remains to be done. In an effort to spark an urgently needed public conversation on how to solve the reproducibility crisis, our report ***[offers a series of forty recommendations](https://www.nas.org/replication-network-blog/documents/irreproducibility_report/NAS_irreproducibilityReport_executiveSummary.pdf)***. At the level of statistics, researchers should cease to regard p-values as dispositive measures of evidence for or against a particular hypothesis, and should try to present their data in ways that avoid a simple either/or determination of statistical significance. Researchers should also pre-register their research procedures and make their methods and data publicly available upon publication of their results. There should also be more experimentation with “born-open” data—data archived in an open-access repository at the moment of its creation, and automatically time-stamped. + +###### Given the importance of statistics in modern science, we need better education at all levels to ensure that everyone—future researchers, journalists, legal professionals, policymakers and ordinary citizens—is well-acquainted with the fundamentals of statistical thinking, including the limits to the certainty that statistical methods can provide. Courses in probability and statistics should be part of all secondary school and university curricula, and graduate programs in disciplines that rely heavily on statistics should take care to emphasize the ways in which researchers can misunderstand and misuse statistical concepts and techniques. + +###### Professional incentives have to change too. Universities judging applications for tenure and promotion should look beyond the number of scholars’ publications, giving due weight to the value of replication studies and expecting adherence to strict standards of reproducibility. Journals should make their peer review processes more transparent, and should experiment with guaranteeing publication for research with pre-registered, peer-reviewed hypotheses and procedures. To combat groupthink, scientific disciplines should ask committees of extradisciplinary professionals to evaluate the openness of their fields. + +###### Private philanthropy, government, and scientific industry should encourage all these efforts through appropriate funding and moral support. 
Governments also need to consider their role as consumers of science. Many government policies are now made on the basis of scientific findings, and the replication crisis means that those findings demand more careful scrutiny. Governments should take steps to ensure that new regulations which require scientific justification rely solely on research that meets strict standards for reproducibility and openness. They should also review existing regulations and policies to determine which ones may be based on spurious findings. + +###### Solving the replication crisis will require a concerted effort from all sectors of society. But this challenge also represents a great opportunity. As we fight to eliminate opportunities and incentives for bad science, we will be rededicating ourselves to good science and cultivating a deeper public awareness of what good science means. Our report is meant as a step in that direction. + +###### *David Randall is Director of Research at the National Association of Scholars (NAS).  Christopher Welser is an NAS Research Associate.* + +###### **References** + +###### Freedman, Leonard P., Iain M. Cockburn, and Timothy S. Simcoe (2015), “The Economics of Reproducibility in Preclinical Research.” *PLoS Biology*, 13(6), e1002165. doi:10.1371/journal.pbio.1002165 + +###### Gelman, Andrew and Eric Loken (2014), “The Statistical Crisis in Science.” *American Scientist*, 102(6), 460–465. + +###### Ioannidis, John P. A. (2005), “Why Most Published Research Findings Are False.” *PLoS Medicine*, 2(8), doi:10.1371/journal.pmed.0020124. + +###### Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” *Psychological Science*, 22(11), 1359–1366. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/04/20/randall-welser-on-the-irreproducibility-crisis-of-modern-science/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/04/20/randall-welser-on-the-irreproducibility-crisis-of-modern-science/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-an-open-invitation-to-give-your-perspective-on-the-practice-of-replication.md b/content/replication-hub/blog/reed-an-open-invitation-to-give-your-perspective-on-the-practice-of-replication.md new file mode 100644 index 00000000000..aee6a153869 --- /dev/null +++ b/content/replication-hub/blog/reed-an-open-invitation-to-give-your-perspective-on-the-practice-of-replication.md @@ -0,0 +1,59 @@ +--- +title: "REED: An Open Invitation to Give Your Perspective on “The Practice of Replication”" +date: 2017-11-21 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Economics E-Journal" + - "replications" + - "W. Robert Reed" +draft: false +type: blog +--- + +###### In September of this year, the journal *Economics: The Open Access, Open Assessment E-Journal* published a series of Discussion Papers for a special issue on “The Practice of Replication”. The motivation behind the special issue came from the following two facts: First, there has been increasing interest in replications in economics. Second, there is still no standard for how to do a replication, nor for determining whether a replication study “confirms” or “disconfirms” an original study. 
+ +###### Contributors to the special issue were each asked to select an influential economics article that had not previously been replicated. They were to discuss how they would go about “replicating” their chosen article, and what criteria they would use to determine if the replication study “confirmed” or “disconfirmed” the original study. They were not to do an actual replication, but rather present a replication plan. + +###### Papers were to consist of four parts: (i) a general discussion of principles about how one should do a replication, (ii) an explanation of why the “candidate” paper was selected for replication, (iii) a replication plan that applies these principles to the “candidate” article, and (iv) a discussion of how to interpret the results of the replication (e.g., how does one know when the replication study successfully “replicates” the original study). The contributions to the special issue were intended to be short papers, approximately *Economics Letters*-length (though there would not be a length limit placed on the papers). + +###### A total of ten papers were submitted to the special issue and have now been published online as Discussion Papers: nine replication plans and a general thought piece on how to do a replication. They are, respectively: + +###### – Richard G. Anderson, [***“Should you choose to do so… A replication paradigm”***](http://www.economics-ejournal.org/economics/discussionpapers/2017-79) + +###### – B. D. McCullough, [***“Quis custodiet ipsos custodes?: Despite evidence to the contrary, the American Economic Review concluded that all was well with its archive”***](http://www.economics-ejournal.org/economics/discussionpapers/2017-78) + +###### – Tom Coupé, [***“Replicating ‘Predicting the present with Google trends’ by Hyunyoung Choi and Hal Varian (The Economic Record, 2012)***”](http://www.economics-ejournal.org/economics/discussionpapers/2017-76) + +###### – Raymond Hubbard, ***[“A proposal for replicating Evanschitzky, Baumgarth, Hubbard, and Armstrong’s ‘Replication research’s disturbing trend’ (Journal of Business Research, 2007)”](http://www.economics-ejournal.org/economics/discussionpapers/2017-75)*** + +###### – Andrew C. Chang, ***[“A replication recipe: list your ingredients before you start cooking”](http://www.economics-ejournal.org/economics/discussionpapers/2017-74)*** + +###### – Dorian Owen, ***[“Replication to assess statistical adequacy”](http://www.economics-ejournal.org/economics/discussionpapers/2017-73)*** + +###### – Randall J. Hannum, ***[“A replication plan for ‘Does social media reduce corruption?’ (Information Economics and Policy, 2017)”](http://www.economics-ejournal.org/economics/discussionpapers/2017-72)*** + +###### – Benjamin Douglas Kuflick Wood and Maria Vasquez, [***“Microplots and food security: encouraging replication studies of policy relevant research”***](http://www.economics-ejournal.org/economics/discussionpapers/2017-71) + +###### – Gerald Eric Daniels Jr. and Venoo Kakar, ***[“Normalized CES supply-side system approach: how to replicate Klump, McAdam, and Willman (Review of Economics and Statistics, 2007)”](http://www.economics-ejournal.org/economics/discussionpapers/2017-70)*** + +###### – Annette N. 
Brown and Benjamin Douglas Kuflick Wood, ***[“Which tests not witch hunts: a diagnostic approach for conducting replication research”](http://www.economics-ejournal.org/economics/discussionpapers/2017-77)*** + +###### The papers have been sent out for review and the reviews, along with the authors’ responses, are now beginning to appear online with the papers (the journal is open assessment). Before a decision is made on whether to publish the papers as articles, the journal would like to receive further comments from researchers interested in replications. + +###### Online contributors can comment narrowly on whether a paper successfully carried out its fourfold task (see above). But they can also comment more generally on how they think a replication should be done, and/or how one should interpret the results from a replication. + +###### The ultimate goal is to combine the different papers, the reviewers’ assessments, and the comments from online contributors to develop a set of guidelines for doing and interpreting replications. It is hoped that this “crowd-sourced” approach will bring a large range of perspectives to “the practice of replications.” + +###### To contribute a comment on one or more of the papers, click on the papers’ links above and add a comment, which may require that you first register with the journal. Alternatively, you can email your comment to the special issue’s editor, W. Robert Reed, at [bob.reed@canterbury.ac.nz](mailto:bob.reed@canterbury.ac.nz). The deadline to submit comments is January 10, 2017. + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand and co-organizer of The Replication Network. He can be contacted at the email listed above.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/11/21/reed-an-open-invitation-to-give-your-perspective-on-the-practice-of-replication/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/11/21/reed-an-open-invitation-to-give-your-perspective-on-the-practice-of-replication/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-an-update-on-the-progress-of-replications-in-economics.md b/content/replication-hub/blog/reed-an-update-on-the-progress-of-replications-in-economics.md new file mode 100644 index 00000000000..107a0fcfa37 --- /dev/null +++ b/content/replication-hub/blog/reed-an-update-on-the-progress-of-replications-in-economics.md @@ -0,0 +1,87 @@ +--- +title: "REED: An Update on the Progress of Replications in Economics" +date: 2018-10-31 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Duvendack et al. (2015)" + - "economics" + - "Journal policies" + - "replication" + - "Reproducibility" +draft: false +type: blog +--- + +###### *[This post is based on a presentation by Bob Reed at the **[Workshop on Reproducibility and Integrity in Scientific Research](https://replicationnetwork.com/2018/09/21/all-invited-workshop-on-reproducibility-and-integrity-in-scientific-research/)**, held at the University of Canterbury, New Zealand, on October 26, 2018]* + +###### In 2015, Duvendack, Palmer-Jones, and Reed (DPJ&R) published a paper entitled ***[“Replications in Economics: A Progress Report”](https://econjwatch.org/articles/replications-in-economics-a-progress-report)***. In that paper, the authors gave a snapshot of the use of replications in economics. 
+

###### A little over three and a half years have passed since the research for that paper was completed. During that time, there has been much talk about the so-called “replication crisis”, including featured articles in the ***[2017 Papers and Proceedings issue of the American Economic Review](https://www.aeaweb.org/issues/465)***. That issue spotlighted 8 articles addressing various aspects of replications in economics. Which raises the question, has anything changed since DPJ&R published their article?

###### In this blog, I update DPJ&R’s research. I focus on four measures of the use of replications in economics:

###### – Total number of replications published in economics journals

###### – Which journals say they will publish replications

###### – Which journals actually publish replications

###### – Which journals require authors of empirical papers to supply their data and code. And, of those, which journals actually do it.

###### **Total Number of Replications Published in Economics Journals**

###### DPJ&R defined a replication as any study published in a peer-reviewed journal whose main purpose was to verify a previously published study. Based upon that definition, the figure below reports the total number of replications published in economics journals over time. The solid, vertical, black line delineates the time period included in DPJ&R’s study.

###### Total replications

###### At the time DPJ&R wrote their article, it looked like replications were “taking off”, with the publication of replications increasing exponentially. In the three and a half years since, that impression needs to be moderated. While the publication of replications has definitely increased since the early 2000s, it would appear that the rate of publication has leveled off. Increasing talk about replications in economics has not been matched by a corresponding increase in the number of published replications.

###### **Journals That Say They Publish Replications**

###### In order to gauge the receptivity of journals to publishing replications, DPJ&R went to the websites of all the journals listed by Web of Science as “Economics” journals. At the time of their study, there were 333 such journals. The websites of 10 of these explicitly mentioned that they published replications. These are listed on the left hand side of the table below.

###### Websites

###### In August of this year, I and a team of students rechecked the websites of “Economics” journals listed by Web of Science – now totalling 360 journals. A total of 14 journals now explicitly state on their websites that they publish replications. Interestingly, the net gain of 4 journals is the result of the addition of 6 journals that have newly stated they will publish replications, minus two journals that have removed mention of publishing replications: *Economic Development and Cultural Change* and the *International Journal of Forecasting* no longer explicitly state a policy of publishing replications.

###### **Journals That Publish Replications**

###### There are many journals that do not explicitly state they publish replications, but for which revealed preference shows they do. The left hand side of the table below reports published replications by journal for the time period covered by DPJ&R. Note that the total of 206 published replications exceeds the number reported by DPJ&R. This is because I updated their list of replication studies and found additional studies that they missed. 
Through 2014, I identify a total of 21 journals that published more than 1 replication over their history. 5 journals were responsible for publishing approximately half of all replication studies, with the *Journal of Applied Econometrics* and the *American Economic Review* leading the pack.

###### Published Replications

###### Approximately three and a half years later, 26 journals have published more than 1 replication study. 5 journals still account for half of all published replication studies, with the same two journals leading the list. Notably, over half of the 71 replication studies published since 2014 have appeared in three journals: the *Journal of Applied Econometrics*, *Econ Journal Watch*, and *Public Finance Review*. The increase in *Public Finance Review’s* replications can be attributed to the fact that they introduced a dedicated replication section.

###### **Journals That Require Authors to Provide Data and Code**

###### In their article, DPJ&R surveyed the list of Web of Science “Economics” journals that “regularly” provided data and code for their empirical articles. “Regularly” was defined as providing data and code for at least half of the journal’s empirical articles in recent issues. They reported that 27 journals met this criterion.

###### We updated the list of 360 Web of Science “Economics” journals and repeated the analysis. Only this time, we also kept track of which journals required authors to supply their data and code. A total of 41 journals explicitly stated that authors of empirical papers were required to provide data and code sufficient to replicate the results in the paper. Journals that said they “encouraged” authors to provide data and code, or that required authors to make their data and code available “upon request”, were not included in this list.

###### We then went through 15 of each journal’s most recently published empirical articles. If at least half (8) had both data and code, we classified the journal as satisfying its requirement. 23 of the 41 (56%) met this criterion. These are listed in the left hand side of the table below.

![Data and Code](/replication-network-blog/data-and-code.webp)

###### If less than half of the articles were accompanied by data and code, or if only data and not code were provided, or code and not data, we judged the journal to have not satisfied the criterion. In this manner, 18 of the 41 journals (44%) were determined to not be in compliance with their stated data and code policy. These are listed on the right hand side of the table.

###### Please note, however, an important caveat: most journals make an exception for confidential data. These could be data that were provided with strict confidentiality requirements, or subscription data where the vendor does not want to make the data public. We had no way of knowing if this was the reason an article did not provide data and code. Thus, some of the journals on the right hand side of the table may be in compliance with their data and code policy once one accounts for confidentiality restrictions.

###### **Conclusion**

###### Putting the above together, what can one make of the current status of replications in economics? Based upon what one sees in the journals, this is a case of water in the glass. For those who are optimists, the glass may be seen as half full. More journals are publishing more replication studies now than they were a decade ago. More journals are announcing that they publish replications. And more journals (one more!) 
are posting data and code along with their empirical articles. Replications and reproducibility are inching their way forward in the economics discipline. + +###### However, for those who think there is a replication crisis in economics, the glass is half empty, and arguably not even that. The situation is a far cry from what is happening in some other social sciences, particularly psychology. There, articles like ***[“Making Replication Mainstream”](https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/making-replication-mainstream/2E3D8805BF34927A76B963C7BBE36AC7)*** speak to a major culture change that seems to have been embraced by many, if not most, editors of leading journals in that discipline. + +###### Some would argue that the reason psychology has been so willing to embrace replication is because that discipline has been more prone to questionable research practices. While that may be the case, the fact is, nobody really knows. There is only way to find out just how good, or bad, things are in economics. And that is to do more replications. Based on the above, it will be sometime yet before enough replications are done so that we can have a better idea of the status of reproducibility in the economics discipline. + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at* [*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/10/31/reed-an-update-on-the-progress-of-replications-in-economics/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/10/31/reed-an-update-on-the-progress-of-replications-in-economics/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-calculating-power-after-estimation-everybody-should-do-this.md b/content/replication-hub/blog/reed-calculating-power-after-estimation-everybody-should-do-this.md new file mode 100644 index 00000000000..44f4e128062 --- /dev/null +++ b/content/replication-hub/blog/reed-calculating-power-after-estimation-everybody-should-do-this.md @@ -0,0 +1,199 @@ +--- +title: "REED: Calculating Power After Estimation – Everybody Should Do This!" +date: 2024-07-29 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "a priori power" + - "post hoc power analysis" + - "power curves" + - "program code" + - "R" +draft: false +type: blog +--- + +So your estimate is statistically insignificant and you’re wondering: Is it because the effect size is small, or does my study have too little power? In ***[Tian et al. (2024)](https://onlinelibrary.wiley.com/doi/10.1111/rode.13130)***, we propose a simple method for calculating statistical power after estimation (“post hoc power analysis”). While our application is targeted towards empirical studies in development economics, the method has many uses and is widely applicable across disciplines. + +It is common to calculate statistical power before estimation (“a priori power analysis”). This allows researchers to determine the minimum sample size to achieve a given level of power for a given effect size. In contrast, post hoc power analysis is rarely done, and often discouraged (for example, see ***[here](http://daniellakens.blogspot.com/2014/12/observed-power-and-what-to-do-if-your.html)***). 
+

There is a reason for this. The most common method for calculating post hoc power is flawed. In our paper, we explain the problem and demonstrate that our method successfully avoids this issue.

Before discussing our method, it is useful to review some of the reasons why one would want to calculate statistical power after estimation is completed.

**When estimates are insignificant**. All too commonly, statistical insignificance is taken as an indication that the true effect size is small, indistinguishable from zero. That may be the case. But it also may be the case that the statistical power of the study was insufficient to generate significance even if the true effect size were substantial. Knowing the statistical power of the regression equation that produced the estimated effect can help disentangle these two cases.

**When estimates are significant.** Post hoc power analysis can also be useful when estimates are statistically significant. In a recent paper, ***Ioannidis et al. (2017)*** (“The Power of Bias in Economics Research”) analyzed “159 empirical economics literatures that draw upon 64,076 estimates of economic parameters reported in more than 6,700 empirical studies”. They calculated that the “median statistical power is 18%, or less.” Yet the great majority of these estimates were statistically significant. How can that be?

One explanation is Type M error. As elaborated in ***[Gelman and Carlin (2014)](https://journals.sagepub.com/doi/full/10.1177/1745691614551642)***, Type M error is a phenomenon associated with random sampling. Estimated effects that are statistically significant will tend to be systematically larger than the population effect. If journals filter out insignificant estimates, then the estimates that get published are likely to overestimate the true effects.

Low statistical power is an indicator that Type M error may be present. Post hoc power analysis cannot definitively establish the presence of Type M error. The true effect may be substantially larger than the value assumed in the power analysis. But post hoc power analysis provides additional information that can help the researcher interpret the validity of estimates.

Our paper provides examples from actual randomized controlled trials that illustrate the cases above. We also demonstrate how post hoc power analysis can be useful to funding agencies to assess whether previously funded research met their stated power criteria.

**Calculating statistical power.** Mathematically, the calculation of statistical power (either a priori or post hoc) is straightforward. Let:

*ES* = an effect size

*s.d.(ES\_hat)* = the standard deviation of estimated effects, *ES\_hat*

*τ = ES / s.d.(ES\_hat)*

*t\_crit(df, 1 − α/2)* = the critical *t*-value for a two-tailed *t*-test having *df* degrees of freedom and an associated significance level of *α*; that is, the 1 − α/2 quantile of the *t*-distribution with *df* degrees of freedom.

Given the above, one can use the equation below to calculate the power associated with any effect size *ES*:

(1) *t\_value(df, 1 − Power)* = *t\_crit(df, 1 − α/2)* − *τ*

where *t\_value(df, q)* denotes the *q* quantile of the *t*-distribution with *df* degrees of freedom. Equation (1) identifies the area of the *t*-distribution with *df* degrees of freedom that lies to the right of (*t\_crit(df, 1 − α/2)* − *τ*); that area is the power (see FIGURE 1 in *Tian et al., 2024*). All that is required to calculate power is a given value for the effect size, *ES*; the standard deviation of estimated effect sizes, *s.d.(ES\_hat)*, which will depend on the estimator (e.g., OLS, OLS with clustered standard errors, etc.); the degrees of freedom *df*; and the significance level *α*. 
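
To see equation (1) in action, the sketch below (an illustration for this blog, not code from *Tian et al., 2024*) computes power from equation (1) for a simple linear model with a known effect size and compares it with the rejection rate in simulated samples. The two numbers should agree closely, since the shifted central *t* in equation (1) is a close approximation to the exact noncentral-*t* power.

```
# Illustrative check of equation (1); not code from Tian et al. (2024).
# Simple linear model y = beta*x + e with a known effect size beta.
set.seed(123)

n     <- 60          # sample size
beta  <- 0.5         # true effect size (ES)
sigma <- 2           # sd of the error term
alpha <- 0.05
x     <- rnorm(n)

# Theoretical sd of the OLS slope, playing the role of s.d.(ES_hat)
sd_es_hat <- sigma / sqrt(sum((x - mean(x))^2))
df        <- n - 2
tau       <- beta / sd_es_hat
t_crit    <- qt(1 - alpha / 2, df)

# Power from equation (1): area to the right of (t_crit - tau)
power_eq1 <- pt(t_crit - tau, df, lower.tail = FALSE)

# Rejection rate across simulated samples that share the same x
reps   <- 5000
reject <- replicate(reps, {
  y <- beta * x + rnorm(n, sd = sigma)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < alpha
})

c(equation_1 = power_eq1, simulated = mean(reject))
```
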
+ +Most software packages that calculate statistical power essentially consist of estimating *s.d.(ES\_hat)* based on inputs such as sample size, estimate of the standard deviation of the output variable, and other parameters of the estimation environment. This raises the question, why not directly estimate *s.d.(ES\_hat)* with the standard error of the associated regression coefficient? + +**The SE-ES Method**. We show that simply replacing *s.d.(ES\_hat)* with the standard error of the estimated effect from the regression equation, *s.e.(ES\_hat)*, produces a useful, post hoc estimator of power. We call our method “SE-ES”, for Standard Error-Effect Size. + +As long as *s.e.(ES\_hat)* provides a reliable estimate of the variation in estimated effect sizes, SE-ES estimates of statistical power will perform well. As ***[McKenzie and Ozier (2019)](https://blogs.worldbank.org/en/impactevaluations/why-ex-post-power-using-estimated-effect-sizes-bad-ex-post-mde-not?ct=4422)*** note, this condition generally appears to be the case. + +Our paper provides a variety of Monte Carlo experiments to demonstrate the performance of the SE-ES method when (i) errors are independent and homoskedastic, and (ii) when they are clustered. + +In the remainder of this blog, I present two simple R programs for calculating power after estimation. The first program produces a single-valued, post hoc estimate of statistical power. The user provides a given effect size, an alpha level, and the standard error of the estimated effect from the regression equation along with its degrees of freedom. This program is given below. + +``` +# Function to calculate power + +power_function <- function(effect_size, standard_error, df, alpha) { + +# This matches FIGURE 1 in Tian et al. (2024) +# "Power to the researchers: Calculating power after estimation" +# Review of Development Economics +# http://doi.org/10.1111/rode.13130 + + t_crit <- qt(alpha / 2, df, lower.tail = FALSE) + tau <- effect_size / standard_error + t_value = t_crit - tau + calculate_power <- pt(t_value, df, lower.tail = FALSE) + + return(calculate_power) +} +``` + +For example, if after running the power\_function above, one wanted to calculate post hoc power for an effect size = 4, given a regression equation with 50 degrees of freedom where the associated coefficient had a standard error of 1.5, one would then run the chunk below. + +``` +# Example +alpha <- 0.05 +df <- 50 +effect_size <- 4 +standard_error <- 1.5 + +power <- power_function(effect_size, standard_error, df, alpha) +print(power) +``` + +In this case, post hoc power is calculated to be 74.3% (see screen shot below). + +[![](/replication-network-blog/image-2.webp)](https://replicationnetwork.com/wp-content/uploads/2024/07/image-2.webp) + +Alternatively, rather than calculating a single power value, one might find it more useful to generate a power curve. To do that, you would first run the following program defining two functions: (i) the power\_function (same as above), and (ii) the power\_curve\_function. 
+

```
# Define the power function
power_function <- function(effect_size, standard_error, df, alpha) {
  # Calculate the critical t-value for the upper tail
  t_crit <- qt(alpha / 2, df, lower.tail = FALSE)
  tau <- effect_size / standard_error
  t_value <- t_crit - tau
  calculate_power <- pt(t_value, df, lower.tail = FALSE)

  return(calculate_power)
}

# Define the power_curve_function
# Note that this uses the power_function above
power_curve_function <- function(max_effect_size, standard_error, df, alpha) {

  # Initialize vector to store results
  powers <- numeric(51)

  # Calculate step size for incrementing effect sizes
  d <- max_effect_size / 50

  # Create a sequence of 51 effect sizes
  # Each incremented by step size d
  effect_sizes <- seq(0, max_effect_size, by = d)

  # Loop through each effect size to calculate power
  for (i in 1:51) {
    effect_size <- effect_sizes[i]
    power_calculation <- power_function(effect_size, standard_error, df, alpha)
    powers[i] <- power_calculation
  }

  return(data.frame(EffectSize = effect_sizes, Power = powers))
}
```

Now suppose one wanted to create a power curve that corresponded to the previous example. You would still have to set alpha and provide values for *df* and the standard error from the estimated regression equation. But to generate a curve, you would also have to specify a maximum effect size.

The code below sets a maximum effect size of 10, and then creates a sequence of effect sizes from 0 to 10 in 50 equal steps.

```
# Define global parameters
alpha <- 0.05
df <- 50
standard_error <- 1.5
max_effect_size <- 10
d <- max_effect_size / 50
effect_sizes <- seq(0, max_effect_size, by = d)
```

Running the chunk below generates the power curve.

```
# Load ggplot2 for the plot
library(ggplot2)

# Initialize vector to hold power values
powers <- numeric(51)

# Loop through each effect size to calculate power
# (this mirrors what power_curve_function does internally)
for (i in 1:51) {
  effect_size <- effect_sizes[i]
  power_calculation <- power_function(effect_size, standard_error, df, alpha)
  powers[i] <- power_calculation
}

# Generate power curve data
power_data <- power_curve_function(max_effect_size, standard_error, df, alpha)

# Plot power curve
ggplot(power_data, aes(x = EffectSize, y = Power)) +
  geom_line() +
  labs(title = "Power Curve", x = "Effect Size", y = "Power") +
  theme_minimal()

# This shows the table of power values for each effect size
View(power_data)
```

The power curve is given below.

[![](/replication-network-blog/image-4.webp)](https://replicationnetwork.com/wp-content/uploads/2024/07/image-4.webp)

The last line of the chunk produces a dataframe that lists all the effect size-power value pairs. From there one can see that, given a standard error of 1.5, the associated regression equation has an 80% probability of producing a statistically significant estimate when the effect size = 4.3.

The code above allows one to calculate power values and power curves for one’s own research. But perhaps its greatest value is that it allows one to conduct post hoc power analyses of estimated effects from other studies. All one needs to supply the programs is the standard error of the estimated effect and the associated degrees of freedom.

**Limitation**. The performance of the SE-ES method depends on the nature of the data and the type of estimator used for estimation. We found that it performed well when estimating linear models with clustered errors. 
However, one should be careful in applying the method to settings that are different from those investigated in our experiments. Accordingly, good practice would customize *Tian et al.’s (2024)* Monte Carlo simulations to see if the results carry over to data environments that represent the data at hand. To facilitate that, we have provided the respective codes and posted them at OSF ***[here](https://osf.io/frwx2/?view_only=5a0a8d2ecc2e4f6eb3be8097152f6712)***. + +*NOTE: Bob Reed is Professor of Economics and *the Director of*[***UCMeta***](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/)*at the University of Canterbury.*He can be reached at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +**REFERENCE** + +Tian, J., Coupé, T., Khatua, S., Reed, W. R., & Wood, B. D. K. (2024). Power to the researchers: Calculating power after estimation. *Review of Development Economics*, 1–35.  + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/07/29/reed-calculating-power-after-estimation-everybody-should-do-this/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/07/29/reed-calculating-power-after-estimation-everybody-should-do-this/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-do-a-dag-before-you-do-a-specification-curve-analysis.md b/content/replication-hub/blog/reed-do-a-dag-before-you-do-a-specification-curve-analysis.md new file mode 100644 index 00000000000..f79a0e5ce0e --- /dev/null +++ b/content/replication-hub/blog/reed-do-a-dag-before-you-do-a-specification-curve-analysis.md @@ -0,0 +1,149 @@ +--- +title: "REED: Do a DAG Before You Do a Specification Curve Analysis" +date: 2024-11-18 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "DAGs" + - "Del Giudice & Gangestad (2021)" + - "Directed Acyclic Graphs" + - "Multiverse" + - "Principled equivalence" + - "Principled nonequivalence" + - "SCA" + - "Specification curve analysis" + - "Type E" + - "Type N" + - "Type U" +draft: false +type: blog +--- + +My previous two blogs focused on how to do a specification curve analysis (SCA) using the R package “specr” (***[here](https://replicationnetwork.com/2024/11/05/reed-using-the-r-package-specr-to-do-specification-analysis/)***) and the Stata program “speccurve” ([***here***](https://replicationnetwork.com/2024/11/16/reed-using-the-stata-package-speccurve-to-do-specification-curve-analysis/)). Both blogs provided line-by-line explanations of code that allowed one to reproduce the specification curve analysis in Coupé (***[see here](https://replicationnetwork.com/2024/05/09/coupe-why-you-should-add-a-specification-curve-analysis-to-your-replications-and-all-your-papers/)***). + +Coupé included 320 model specifications for his specification curve analysis. But why 320? Why not 220? Or 420? And why the specific combinations of models, samples, variables, and dependent variables that he chose? + +To be fair, Coupé was a replication project, so it took as given the specification choices made by the respective studies it was replicating. But, in general, how DOES one decide which model specifications to include? 
+ +**Del Giudice & Gangestad (2021)** + +In their seminar article, “[***Mapping the Multiverse: A Framework for the Evaluation of Analytic Decisions***](https://journals.sagepub.com/doi/pdf/10.1177/2515245920954925)”, Del Giudice & Gangestad (2021), henceforth DG&G, provide an excellent framework for how to choose model specifications for SCA. They make a persuasive argument for using Directed Acyclic Graphs (DAGs). + +The rest of this blog summarizes the key points in their paper and uses it as a template for how to select specifications for one’s own SCA. + +**Using a DAG to Model the Effect of Inflammation on Depression** + +DG&G’s focus their analysis on a hypothetical example: a study of the effect of inflammation on depression. Accordingly, they present the DAG below to represent the causal model relating inflammation (the treatment variable) to depression (the outcome variable). + +[![](/replication-network-blog/image-57.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-57.webp) + +**The Variables** + +In this DAG, “Inflammation” is assumed to have a direct effect on “Depression”. It is also assumed to have an indirect effect on “Depression” via the pathway of “Pain”. + + “Inflammation” is measured in multiple ways. There are four, separate measures of inflammation, “BM1”-“BM4” that may or may not be correlated. + +“Age” is a confounder. It affects both “Inflammation” and “Depression”. + +“Fatigue” is a collider variable. It affects neither “Inflammation” nor “Depression”. Rather, it is affected by them. “Pain” and “Age” also affect “Fatigue”. + +Lastly, “Pro-inflammatory genotype” represents the effect that one’s DNA has on their predisposition to inflammation. + +One way to build model specifications is to combine all possible variables and samples. There are four different measures of the treatment variable “Inflammation” (“BM1”-“BM4”). + +There are four “control” variables: “Pro-inflammatory Genotype”, “Pain”, “Fatigue”, and “Age”. + +The authors also identify three different ways of handling outliers. + +In the end, DG&G arrive at 1,216 possible combinations of model characteristics. Should a SCA include all of these? + +**Three Types of Specifications** + +To answer this question, DG&G define three types of model specifications: (i) Principled Equivalence (“Type E”), (ii) Principled Non-equivalence (“Type N”), and (iii) Uncertainty (“Type U”). + +Here is how they describe the three categories: + +“In *Type E decisions* (principled equivalence), the alternative specifications can be expected to be practically equivalent, and choices among them can be regarded as effectively arbitrary.” + +“In *Type N decisions* (principled nonequivalence), the alternative specifications are nonequivalent according to one or more nonequivalence criteria. As a result, some of the alternatives can be regarded as objectively more reasonable or better justified than the others.” + +“…in *Type U decisions* (uncertainty), there are no compelling reasons to expect equivalence or nonequivalence, or there are reasons to suspect nonequivalence but not enough information to specify which alternatives are better justified.” + +**Using Type E and Type N to Select Specifications for the SCA** + +I next show how these three types guide the selection of specifications for SCA in the “Inflammation”-“Depression” study example. + +First, we need to identify the effect we are interested in estimating. Are we interested in the direct effect of “Inflammation” on “Depression”? 
If so, then we want to include the variable “Pain” and thus separate the direct from the indirect effect. + +Specifications with “Pain” and without “Pain” are not equivalent. They measure different things. Thus, equivalent (“Type E”) specifications of the direct effect of “Inflammation” on “Depression” should always include “Pain” in the model specification. + +Likewise, for the variable “Age”. “Age” is a confounder. One needs to control its independent effect on “Inflammation” and “Depression”. Accordingly, “Age” should always be included in the specification. Variable specifications that do not include “Age” will be biased. They are “Type N” specifications. Specifications that do not include “Age” should not be included in the SCA. + +How about “Fatigue”? “Fatigue” is a collider variable. “Inflammation” and “Depression” affect “Fatigue”, but “Fatigue” does not affect them. In fact, including “Fatigue” in the specification will bias estimates of “Inflammation’s” direct effect. To see this, suppose “Fatigue” = “Inflammation” + “Depression”. + +If “Inflammation” increases, and “Fatigue” is held constant by including it in the regression, “Depression” must necessarily decrease. Including “Fatigue” would induce a negative bias in estimates of the effect of “Inflammation” on “Depression”. Thus, specifications that include “Fatigue” are Type N specifications and should not be included in the SCA. + +What do DG&G’s categories have to say about “Pro-inflammatory Genotype” and the multiple measures of the treatment variable (“BM1”-“BM4”)? As “Pro-inflammatory Genotype” has no effect on “Depression”, the only effect of including it in the regression is to reduce the independent variance of “Inflammation”. + +While this leaves the associated estimates unbiased, it diminishes the precision of the estimated treatment effect. As a result, specifications that include “Pro-inflammatory Genotype” are inferior (“Type N”)  to those that omit this variable. + +Finally, there are multiple ways to use the four measures, “BM1” to “BM4”. One could include specifications with each one separately. One could create composite measures, that additively combine the individual biomarkers; such as “BM1+BM2”, “BM1+BM3”,…, “BM1+BM2+BM3+BM4”. + +Should all of the corresponding specifications be included in the SCA? DG&G state that “composite measures of a construct are usually more valid and reliable than individual indicators.” However, “if some of the indicators are known to be invalid, composites that exclude them will predictably yield higher validities.” + +Following some preliminary analysis that raised doubts about the validity of BM4, DG&G concluded that only the composites “BM1+BM2+BM3+BM4” and “BM1+BM2+BM3” were “principled equivalents” and should be included in the SCA. + +The last element in DG&G’s analysis was addressing the problem of outliers. They considered three alternative ways of handling outliers. All were considered to be “equivalent” and superior to using the full sample. + +In the end, DG&G cut down the number of specifications in the SCA from 1,216 to six, where the six consisted of specifications that allowed two alternative composite measures of the treatment variable, and three alternative samples corresponding to each of the three ways of handling outliers. All specifications included (and only included) the control variables, “Age” and “Pain”. + +**Type U** + +The observant reader might note that we did not make use of the third category of specifications, Type U for “Uncertainty”. 
DG&G give the following DAG as an example. + +[![](/replication-network-blog/image-58.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-58.webp) + +In Figure 2b, “Fatigue” is no longer a collider variable. Instead, it is a mediator variable like “Pain”. Accordingly, DG&G identify six specifications for this DAG, with two alternative composite measures of “Inflammation”, three ways of addressing outliers, and all specifications including (and only including), the control variables, “Pain”, “Fatigue”, and “Age”. + +DG&G argue that it would be wrong to include these six specifications with the previous six specifications. The associated estimates of the treatment effect are likely to be hugely different depending on whether “Fatigue” is a collider or a mediator. + +But how can we know which DAG is correct? DG&G argue that we can’t. The true model is “uncertain”, so DG&G recommend that we should conduct separate SCAs, one for the DAG in Figure 2a, and one for the DAG in Figure 2b. + +**DG&G Conclude** + +DG&G conclude their paper by addressing the discomfort that researchers may experience when doing a SCA that only has six, or a relatively small number of model specifications: + +“Some readers may feel that, no matter how well justified, a multiverse of six specifications is too small, and that a credible analysis requires many more models—perhaps a few dozen or hundreds at a minimum. We argue that this intuition should be actively resisted. If a smaller, homogeneous multiverse yields better inferences than a larger one that includes many nonequivalent specifications, it should clearly be preferred.” + +**Extensions** + +There are many scenarios that DG&G do not address in their paper, but in principle the framework is easily extended. Consider alternative estimation procedures. If the researcher does not have strong priors that one estimation procedure is superior to the other, then both could be assumed to be “equivalent” and should be included. + +Consider another scenario: Suppose the researcher suspects endogeneity and uses instrumental variables to correct for endogeneity bias. In this case, one might consider IV methods to be superior to OLS, so that the two methods were not “equivalent” and only IV estimates should be included in the SCA. + +Alternatively, suppose there is evidence of endogeneity, but the instruments are weak so that the researcher cannot know which is the “best” method. As IV and OLS are expected to produce different estimates, one might argue that this is conceptually similar to the “Fatigue” case above. Following DG&G, this would be considered a Type U scenario, and separate SCAs should be performed for each of the estimation procedures. + +**Final Thoughts** + +It’s not all black and white. In the previous example, since the researcher cannot determine whether OLS or IV is better, why not forget Type U and combine the specifications in one SCA? + +This highlights a tension between two goals of SCA. One goal is to narrow down the set of specifications to those most likely to identify the “true” effect, and then observe the range of estimates within this set. + +Another goal is less prescriptive about the “best” specifications. Rather, it is interested in identifying factors associated with the heterogeneity among estimates. A good example of this approach is TABLE 1 in the “specr” and “speccurve” blogs identified at the top of this post. 
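As a closing illustration of the “Fatigue” collider argument made earlier, here is a minimal simulation sketch (mine, not DG&G’s). It ignores the other variables in the DAG and simply sets “Fatigue” equal to “Inflammation” plus “Depression” plus noise; the variable names and the assumed direct effect of 0.5 are purely illustrative.

```stata
* Illustrative sketch only: conditioning on a collider distorts the "direct effect"
clear
set seed 20241118
set obs 10000
generate inflammation = rnormal()
generate depression   = 0.5*inflammation + rnormal()          // assumed direct effect = 0.5
generate fatigue      = inflammation + depression + rnormal() // collider: caused by both

regress depression inflammation           // recovers the direct effect, roughly 0.5
regress depression inflammation fatigue   // collider bias: the estimate flips to about -0.25
```

In this setup, conditioning on the collider does not merely attenuate the estimate; it flips its sign, which is exactly why DG&G treat specifications that include “Fatigue” as Type N.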
+ +Despite this subjectivity, DAGs provide a useful approach for identifying sensible specifications for a SCA. Everybody should do a DAG before doing a specification curve analysis! + +*NOTE: Bob Reed is Professor of Economics and the Director of*[***UCMeta***](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/)*at the University of Canterbury. He can be reached at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +**REFERENCES** + +[Del Giudice, M., & Gangestad, S. W. (2021). A traveler’s guide to the multiverse: Promises, pitfalls, and a framework for the evaluation of analytic decisions. *Advances in Methods and Practices in Psychological Science*, *4*(1), 2515245920954925](https://journals.sagepub.com/doi/pdf/10.1177/2515245920954925). + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/11/18/reed-do-a-dag-before-you-do-a-specification-curve-analysis/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/11/18/reed-do-a-dag-before-you-do-a-specification-curve-analysis/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-doing-meta-analyses-with-pccs-here-s-something-you-might-not-know.md b/content/replication-hub/blog/reed-doing-meta-analyses-with-pccs-here-s-something-you-might-not-know.md new file mode 100644 index 00000000000..11e3008b547 --- /dev/null +++ b/content/replication-hub/blog/reed-doing-meta-analyses-with-pccs-here-s-something-you-might-not-know.md @@ -0,0 +1,97 @@ +--- +title: "REED: Doing Meta-Analyses with PCCs? Here’s Something You Might Not Know" +date: 2023-12-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Bias" + - "Chris Doucouliagos" + - "Inverse variance weighting" + - "maer-net" + - "Meta-analysis" + - "Partial Correlation Coefficients" + - "PCC" + - "Robbie van Aert" + - "Tom Stanley" + - "Variance" + - "Weighted Least Squares" +draft: false +type: blog +--- + +[*This blog first appeared at the MAER-Net Blog under the title “Something I Recently Learned About PCCs That Maybe You Also Didn’t Know”,****[see here](https://www.maer-net.org/post/something-i-recently-learned-about-pccs-that-maybe-you-also-didn-t-know)****]* + +While TRN is primarily dedicated to replications in economics, I also do research on meta-analysis. As such, I try to attend the Meta-Analysis in Economics Research Network (MAER-Net) Colloquium every year. It is a great place to learn from the best and have my many questions answered. + +In 2022, the colloquium was held in Kyoto, Japan. That year I went with an especially large number of questions that I was hoping to have answered.  In fact, I used my presentation at the colloquium as an opportunity to take my questions to the MAER-Net “brain trust”. Below is a slide from the presentation I gave in Kyoto: + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2023/12/image.webp) + +Here is the background: My presentation was on “The Relationship Between Social Capital and Economic Growth: A Meta-Analysis.” Because measures of social capital and economic growth vary widely across studies, we transformed the estimates from the original studies into partial correlation coefficients (PCCs). 
+

As is standard in economics, we used the following expression for the sampling variance of PCC:

1) s.e.(PCC)^2 = (1-PCC^2) / df

In the course of our analysis, one of the co-authors on this project, Robbie van Aert (Tilburg University), said we were using the wrong expression for the sampling variance of PCC. He said the correct expression was:

2) s.e.(PCC)^2 = (1-PCC^2)^2 / df

Notice the difference in the numerators.

This was pretty shocking to me, as I had published several meta-analyses with PCCs using Equation (1). As had many other economists.

Indeed, Robbie was right. Economists were using the wrong sampling variance! As a result of this experience, he published a note in *Research Synthesis Methods* (see below).

[![](/replication-network-blog/image-1.webp)](https://replicationnetwork.com/wp-content/uploads/2023/12/image-1.webp)

Unfortunately, I wasn’t able to get much of a response from those attending MAER-Net in Kyoto, so I left confused about what I should do in my research.

However, the answer to my question was not long in coming. In March of this year I learned of an article by Tom Stanley and Chris Doucouliagos that addressed the issue of the “correct” sampling variance of PCC (see below).

[![](/replication-network-blog/image-2.webp)](https://replicationnetwork.com/wp-content/uploads/2023/12/image-2.webp)

To cut to the chase, the answer to my question of whether economists were using the wrong s.e.(PCC) is twofold:

1) Yes, economists are using the wrong sampling variance of PCC

2) Economists should continue using the wrong sampling variance because it produces better estimates

The reason why the “wrong” sampling variance is better than the “correct” sampling variance is enlightening. It raises issues about PCCs that I never appreciated. I thought that if I was unaware of these issues, maybe others were too. Hence the motivation for this blog.

First, a reminder about why meta-analysis uses inverse variance weighting. Given heteroskedasticity, it is well known that weighted least squares (WLS) will produce estimates with the least variance. Ceteris paribus, that argues in favor of using the “correct” sampling variance of PCC.

However, ceteris paribus doesn’t hold because the “correct” sampling variance of PCC is a function of PCC (see Equation 2).

In particular, as PCC increases, s.e.(PCC) decreases. As a result, inverse variance weighting favors larger values of PCC. This introduces bias in the estimation of the overall mean.

Doesn’t the “wrong” sampling variance also have a bias problem (cf. Equation 1)? Yes, it does. But the bias is not as bad.

Stanley and Doucouliagos demonstrate this in a series of simulations reported in their paper. In the table below, S1^2 is the “correct” sampling variance, and S2^2 is the “wrong” sampling variance commonly used by economists. In every case, bias is less using S2^2.

[![](/replication-network-blog/image-3.webp)](https://replicationnetwork.com/wp-content/uploads/2023/12/image-3.webp)

Does that mean that somehow WLS isn’t relevant for PCCs? Not at all. It is still the case that inverse variance weighting with the “correct” sampling variance produces estimates with smaller variance.

However, its advantage in variance is outweighed by its disadvantage in bias. As a result, inverse variance weighting using the “wrong” sampling variance is more efficient. This is demonstrated in the table, where the root mean squared error (RMSE) for S2^2 is smaller than for S1^2.
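To see what is at stake, here is a small sketch with made-up numbers (they are not from our social capital dataset). It converts a reported t-statistic and its degrees of freedom into a PCC in the usual way and then evaluates both expressions. The ratio in the last column, which equals 1/(1-PCC^2), shows how much more heavily Equation (2) weights an estimate relative to Equation (1); because it grows with the size of the PCC, inverse variance weighting based on Equation (2) tilts even more strongly toward large PCCs, which is the mechanism discussed above.

```stata
* Illustrative numbers only: a PCC and its two candidate sampling variances
clear
input t df
2  50
2 200
4  50
4 200
end
generate pcc = t/sqrt(t^2 + df)       // usual conversion of a regression t-statistic to a PCC
generate v1  = (1 - pcc^2)/df         // Equation (1): the expression commonly used in economics
generate v2  = (1 - pcc^2)^2/df       // Equation (2): the "correct" expression
generate wratio = v1/v2               // = 1/(1 - pcc^2); relative weight under Equation (2)
list, clean noobs
```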
+ +In summary, the “correct” sampling variance of PCC produces estimates with smaller variance. The “wrong” sampling variance produces estimates with smaller bias. + +As Stanley and Doucouliagos show, the “wrong” sampling variance makes a better bias-variance trade-off and is thus more efficient. Accordingly, they recommend that economists continue to use the “wrong” sampling variance of PCC in inverse variance weighting. + +Before reading Stanley and Doucouliagos’ article, I was unaware that inverse variance weighting with PCCs involved a bias-variance trade-off. Perhaps others were also unaware and will find this blog useful. + +**REFERENCES** + +Stanley, T. D., & Doucouliagos, H. (2023). Correct standard errors can bias meta‐analysis. *Research Synthesis Methods*, 14(3), 515-519. + +van Aert, R. C., & Goos, C. (2023). A critical reflection on computing the sampling variance of the partial correlation coefficient. *Research Synthesis Methods*, 14(3), 520-525. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2023/12/15/reed-doing-meta-analyses-with-pccs-heres-something-you-might-not-know/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2023/12/15/reed-doing-meta-analyses-with-pccs-heres-something-you-might-not-know/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-eir-heterogeneity-in-two-way-fixed-effects-models.md b/content/replication-hub/blog/reed-eir-heterogeneity-in-two-way-fixed-effects-models.md new file mode 100644 index 00000000000..e9963d5e5d0 --- /dev/null +++ b/content/replication-hub/blog/reed-eir-heterogeneity-in-two-way-fixed-effects-models.md @@ -0,0 +1,176 @@ +--- +title: "REED: EiR* – Heterogeneity in Two-Way Fixed Effects Models" +date: 2019-06-01 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "D'Haultfœuille" + - "de Chaisemartin" + - "fuzzydid" + - "replications" + - "Stata" + - "Treatment effects" + - "Two-way fixed effects" + - "twowayfeweights" +draft: false +type: blog +--- + +###### *[\* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research. The material for this blog is drawn from the recent working paper “**[Two-way fixed effects estimators with heterogeneous treatment effects](https://arxiv.org/abs/1803.08807)**” by Clément de Chaisemartin and Xavier D’Haultfoeuille, posted at ArXiv.org]* + +###### *NOTE #1: All the data and code (Stata) necessary to produce the results in the tables below are available at Harvard’s Dataverse: **[click here.](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FEGMLQG)*** + +###### *NOTE #2: Since this blog was written, the “breps” and “brepscluster” options have been removed from the twowayfeweights command (see below).* + +###### It is common to estimate treatment effects within a model incorporating both group and time fixed effects (think Differences-in-Differences). In a recent paper, Clément de Chaisemartin and Xavier D’Haultfoeuille (henceforth C&D) demonstrate how these models can produce unreliable estimates of average treatment effects when effects are heterogeneous across groups and time periods. + +###### Their paper both identifies the problem and provides a solution. The purpose of this blog is to enable others to use C&D’s procedures to re-analyze published research. 
+

###### In what follows, I highlight key points from their paper. Following their paper, I illustrate the problem, discuss the solution, and show how it makes a difference in replicating a key result from Gentzkow et al. (2011).

###### **The Problem**

###### Consider the following two-group, three-period data, where groups are designated by *g = 0,1* and time periods by *t = 0,1,2.*

###### TRN1(20190524)

###### *G1, T1,* and *T2* are dummy variables indicating whether an observation belongs to the first group, first time period, and second time period, respectively. *D* is a treatment indicator that takes the value *1* if the particular (*g,t*) cell received treatment. Note that treatment is applied to group *g=0* at time *t=2*, and to group *g=1* at times *t=1,2.*

###### *Δ* indicates the size of the treatment effect for treated cells (we ignore the size of the treatment effect for the untreated cells). Consider three regimes for *Δ.*

###### In the first regime (*Δ1*), the treatment effect is homogeneous across groups and time periods. In the second regime (*Δ2*), the treatment effects are heterogeneous, with the treatment effect equalling 1 for (*g,t*) = (*0,2*), and *2* and *0* for cells (*g,t*) = (*1,1*) and (*g,t*) = (*1,2*), respectively. The third regime (*Δ3*) is similar, except that the sizes of the treatment effects are reversed for group *g=1*, with treatment effects equal to *0* and *2* in time periods *1* and *2.* Note that the average treatment effect for the treated (ATT) equals *1* for all three treatment regimes.

###### Let the outcome for each observation be determined by the following equation:

###### *Ygt = Δgt∙Dgt + G1gt + T1gt + T2gt*,  *g=0,1; t=0,1,2.*

###### Suppose one estimates the following regression specification using OLS:

###### (1) *Ygt = β0 + βfe Dgt + βG1 G1gt + βT1 T1gt + βT2 T2gt* + error.

###### In this specification, the treatment effect is estimated by *βfe*, the coefficient on the treatment dummy variable. C&D prove that *βfe* can be expressed as a weighted average of the individual treatment effects:

###### *βfe* = *w02Δ02 + w11Δ11 + w12Δ12*,

###### where *w02* + *w11* + *w12* = 1; and *Δ02*, *Δ11*, and *Δ12* represent the treatment effects associated with the (*g,t*) cells (*0,2*), (*1,1*), and (*1,2*) for a given treatment regime.

###### What follows is quite surprising. C&D demonstrate that the weights need not all be positive. In fact, it can be shown in the current case that:

###### *βfe* = (½ *Δ02*) + *Δ11* + (-½ *Δ12*).

###### Where does the negative weight on the third treatment effect come from? *βfe* is the average of two difference-in-differences. The first is associated with the change in treatment for *g=0* over the time periods *t=1,2* (*DID0*). The second relates to the change in treatment for *g=1* over the time periods *t=0,1* (*DID1*). Specifically,

###### *βfe = (DID0 + DID1)/2*.

###### First consider *DID0* = [E(Y02) – E(Y01)] – [E(Y12) – E(Y11)] = *Δ02* – (*Δ12* – *Δ11*).

###### The first term in brackets in *DID0* measures the change in outcomes for the treatment observations associated with the first change in treatment. The second term in brackets represents the change in outcomes for the control observations over the same period. Note that both control observations receive treatment.
+

###### Ignoring time trends (as they cancel out under the common trend assumption), if treatment effects are homogeneous, the latter term, [E(Y12) – E(Y11)] = (*Δ12* – *Δ11*), will drop out. But if the treatment effect for group *g=1* is different for periods *t=1,2*, this term remains. Further, if the treatment effect in *t=2* is sufficiently large relative to *t=1,* this will dominate *Δ02*, and *DID0* will be negative.

###### Next consider *DID1* = [E(Y11) – E(Y10)] – [E(Y01) – E(Y00)] = *Δ11*.

###### In this case, heterogeneity in treatment effects is not an issue because the control observations, represented by the last term in brackets, consist of two untreated observations.

###### What is important to note here is that *βfe = (DID0 + DID1)/2* can be negative even if all the individual treatment effects are positive!

###### The table below reports the values for *βfe* for each of the three treatment regimes (*Δ1*, *Δ2*, *Δ3*). Also reported are the values of the first difference estimator, *βfd*, which, in this case of two groups and three time periods, is equal to the fixed effects estimator. (Note that in general, *βfe* ≠ *βfd*, a fact that we will exploit below.)

###### Of particular interest is the third treatment regime, where *βfe* = *βfd* < 0, even though none of the individual treatment effects (*Δ02*, *Δ11*, and *Δ12*) is negative.

###### TRN2(20190524)

###### The case above demonstrates how heterogeneity can cause estimates of the average treatment effect to be negative even though none of the individual treatment effects is negative. More generally, *βfe* will not be the same as *βfd*; and given heterogeneity, either one, or the other, or both can be a biased estimate of the average treatment effect on the treated (ATT).

###### **How Do You Know If You Have a Problem?**

###### Define *ΔTR* as the *ATT*, weighted by the number of individuals in each (*g,t*) cell:

###### TRN3(20190524)

###### (in the example above, *Ngt* = *1* and *N1* = *3*).

###### *βfe* and *βfd* are also weighted measures of the ATT, but they have an additional set of weights (*wgt* and *wfd,gt*, respectively).

###### TRN4(20190524)

###### TRN5(20190524)

###### Note that *βfe* and *βfd* employ different weights, *wg,t* and *wfd,g,t*. A necessary condition for both estimators to provide an unbiased estimate of *ΔTR* is that the weighting terms *wg,t* and *wfd,g,t* be uncorrelated with the respective treatment effects.

###### Thus, one diagnostic is to test for a significant difference between *βfe* and *βfd*. If the two estimates are significantly different, that is an indicator that at least one of the two estimators is a biased estimator of the overall treatment effect.

###### Another diagnostic is to regress the weights on a variable that is associated with the size of the treatment effect. If one finds a significant correlation, then that is an indicator that the respective estimator (*βfe* or *βfd*) is a biased estimator of *ΔTR*.

###### **The Solution**

###### C&D propose an estimator that focuses on treatment changes. The estimator compares treatment changes over consecutive time periods (either untreated to treated, or treated to untreated) with other observations during the same time period whose treatment did not change. They call this estimator the *WTC* estimator, for *Wald-Time Corrected*.
While the example above consisted of a very restricted case (binary treatment, only one observation per (*g,t*) cell), their estimator generalizes to cases where treatment is continuous, and where only a portion of individuals in a given (*g,t*) cell receive treatment. + +###### As a check on their estimator, they suggest a placebo estimator. The placebo estimator relates treatment changes to outcomes from the preceding period. Under the “common trends” assumption, the placebo estimator *WplTC* should equal zero. Failure to reject this hypothesis provides some evidence that the assumptions underlying the *WTC* estimator are valid. + +###### **A Replication Application** + +###### In their paper, C&D replicate results from the study, “The Effect of Newspaper Entry and Exit on Electoral Politics”, published in the *American Economic Review* in 2011 by Matthew Gentzkow, Jesse Shapiro, and Michael Sinkinson (GSS). GSS use county-level data from the US for the years 1868-1928 to estimate the relationship between Presidential turnout and the number of newspapers in a county. Following C&D, I explain how to implement their procedures and compare their results with those reported by GSS. + +###### In the notation of the leading example above, + +###### *Ygt* = Presidential turnout in county *g* at time *t,* + +###### *Dgt* = Number of newspapers in county *g* at time *t.* + +###### GSS use a first-difference estimator to estimate the effect of an additional newspaper on Presidential turnout. Their difference specification includes state-year fixed effects, and clusters on counties. They estimate that an additional newspaper in a county increased Presidential turnout by 0.26 percentage points (average Presidential turnout was approximately 65 percent during this period). Their estimate is reported below (cf. *βfd*). C&D use GSS’s data to also estimate a conventional fixed effects estimate and this is also reported in the table (cf. *βfe*). Note that the fixed effects estimator produces a negative estimate. + +###### TRN6(20190524) + +###### C&D first test *H0:* *βfd =* *βfe*and obtain a t-stat of 2.86, rejecting the null at conventional levels of significance. This indicates that at least one of these is a biased estimator of the overall treatment effect. + +###### As a further test, C&D estimate the relationship between the respective weights, *wfd,gt* and *wgt*, and the treatment effect. Of course, the treatment effect is unobserved. As a proxy for the size of the treatment effect, C&D use *year*. C&D hypothesize that the effect of newspapers might change over time as other sources of communication, such as radio towards the end of the period, became more important. + +###### To estimate this relationship, C&D employ a user-written Stata program called ***twowayfeweights***. An example command for *βfe* is given below. + +###### ***twowayfeweights prestout cnty90 year numdailies, type(feTR) controls(styr1-styr666) breps(100) brepscluster(cnty90) test\_random\_weights(year)*** + +###### The command is ***twowayfeweights***. The outcome variable is presidential turnout (*prestout*), the group and time variables are *cnty90* and *year,* respectively. The treatment variable is *numdailies*. The option *type* identifies whether one is estimating weights for *βfe* or *βfd* (the syntax for *βfd* is slightly different). *controls* and *breps* identify, respectively, the other variables in the equation (here, state and year fixed effects), and the number of bootstrap replications to run. 
*brepscluster* indicates that the bootstrapping should be blocked according to county. The last option, *test\_random\_weights* regresses the respective weights on the variable assumed to be related to the size of the treatment effect (*year*). Note that while the weights are not automatically saved, there is an option to save them so that one can observe how they vary across counties and years. The results are reported below. + +###### TRN7(20190524) + +###### The results suggest that *βfd* may be adversely affected by correlation between the weights and the treatment effect, causing it to be a biased estimator of the overall average treatment effect on the treated. Note, however, that the sign of the correlation does not, *per se*, indicate the sign of the associated bias. On the other hand, the fixed effects estimator does not demonstrate evidence that the corresponding weights are correlated with treatment effects. + +###### The last step consists of estimating the treatment effect using the *WTC* estimator. To do that, we use another user-written Stata program called ***fuzzydid****:* + +###### ***fuzzydid prestout G\_T G\_T\_for year numdailies, tc newcateg(0 1 2 1000) qualitative(st1-st48)*** + +###### The syntax is similar to ***twowayfeweights****,* except that following the outcome variable (*prestout*) are two indicator variables. These indicate whether the treatment variable (*numdailies*) increased (*G\_T=1*), decreased (*G\_T=-1*) or stayed the same (*G\_T=0*), compared to the preceding election period. *G\_T\_for* is the lead value of *G\_T* in the immediately succeeding election period. + +###### The options indicate that the *Wald-Time Corrected* statistic is to be calculated (*tc*), *newcateg* lumps the number of newspapers into 4 categories (0, 1, 2, and >2), and that state fixed effects should be included in the estimation (*qualitative*). + +###### The reason for combining numbers of newspapers greater than 2 into a single category is that control groups need to have the same number of “treatments” as the treatment group. From the histogram below, it is apparent that relatively few counties have more than two newspapers. + +###### TRN8(20190524) + +###### C&D estimates of the effect of newspapers on Presidential turnout are given below. They estimate an additional newspaper increases turnout by 0.43 percentage points (compared to 0.26 and -0.09 for the first-difference and fixed-effects estimators). Their placebo test produces an insignificant estimate, suggesting that the assumptions of the Wald-TC estimator are valid. Finally, as the placebo estimate uses a somewhat restricted sample, they reestimate the treatment effect on the restricted sample and obtain an estimate very close to what they obtain using the larger sample (0.0045 versus 0.0043). + +![TRN9(20190524)](/replication-network-blog/trn920190524.webp) + +###### **Conclusion** + +###### C&D show that conventional estimates of treatment effects in two-way fixed effects models consist of weighted averages of individual treatment effects. When treatment effects are heterogeneous, this can cause conventional estimates to be biased. C&D present both (i) tests to identify if heterogeneous treatment effects present a problem for conventional estimators, and (ii) an alternative estimator that allows unbiased estimation of average treatment effects on the treated. 
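###### As a quick numerical check on this logic, the two-group, three-period example from the beginning of this post can be reconstructed in a few lines of Stata (my own sketch, not part of C&D’s replication files). Under the third treatment regime the cell-level treatment effects are 1, 0, and 2, yet the coefficient on the treatment dummy comes out at -0.5.

```stata
* Sketch: two groups, three periods, treatment regime 3 (effects of 1, 0, and 2)
clear
input g t d delta
0 0 0 0
0 1 0 0
0 2 1 1
1 0 0 0
1 1 1 0
1 2 1 2
end
generate g1 = (g==1)
generate t1 = (t==1)
generate t2 = (t==2)
generate y  = delta*d + g1 + t1 + t2   // outcome equation from the text, no error term added

regress y d g1 t1 t2   // coefficient on d = -0.5, although no cell's treatment effect is negative
```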
+ +###### Replication researchers may find C&D’s procedures useful when re-analyzing original studies that estimate treatment effects within a two-way, fixed effects model. + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +###### **References** + +###### de Chaisemartin, C. and D’Haultfœuille, X., 2018. ***[Two-way fixed effects estimators with heterogeneous treatment effects](https://arxiv.org/abs/1803.08807)***. *arXiv preprint at arXiv:1803.08807.* + +###### de Chaisemartin, C., D’Haultfœuille, X. and Guyonvarch, Y., 2019. ***[Fuzzy Differences-in-Differences with Stata](http://www.crest.fr/ckfinder/userfiles/files/Pageperso/xdhaultfoeuille/fdid_stata.pdf)***. The Stata Journal (in press). + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/06/01/reed-eir-heterogeneity-in-two-way-fixed-effects-models/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/06/01/reed-eir-heterogeneity-in-two-way-fixed-effects-models/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-eir-how-to-measure-the-importance-of-variables-in-regression-equations.md b/content/replication-hub/blog/reed-eir-how-to-measure-the-importance-of-variables-in-regression-equations.md new file mode 100644 index 00000000000..11003cc2a59 --- /dev/null +++ b/content/replication-hub/blog/reed-eir-how-to-measure-the-importance-of-variables-in-regression-equations.md @@ -0,0 +1,117 @@ +--- +title: "REED: EiR* – How to Measure the Importance of Variables in Regression Equations" +date: 2019-07-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Ceteris Paribus Importance" + - "Economic Growth" + - "Economic importance" + - "Effect size" + - "Non-Ceteris Paribus Importance" + - "Olivier Sterck" + - "R-squared" + - "Regression" + - "Standardized Beta Coefficient" +draft: false +type: blog +--- + +###### *[\* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research. The material for this blog is drawn from a recent working paper, “*[***On the measurement of importance***](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3386218)*” by Olivier Sterck.]* + +###### *NOTE: The files (Stata) necessary to produce the results in the tables below are posted at Harvard’s Dataverse: **[click here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OHOWDU)**.* + +###### Researchers are often interested in assessing practical, or economic, importance when doing empirical analyses. Ideally, variables are scaled in such a way that interpreting a variable’s effect is straightforward. For example, a common variable in cross-country, economic growth equations is average annual temperature. Accordingly, one can gauge the effect of a 10-degree increase in temperature on economic growth, ceteris paribus. + +###### However, sometimes variables do not allow a straightforward interpretation. This is true, for example, for index variables. It is also true for variables that are otherwise difficult to relate to, such as measures of “terrain ruggedness” or “genetic diversity”, both of which have been employed in growth studies. 
+

###### Problems can still arise even when coefficients are straightforward. For example, a variable may have a large effect, but differ only slightly across observations, so that it explains very little of the variation in the dependent variable.

###### Further, a researcher may be interested in assessing the relative importance of variables. For example, the coefficients on average annual temperature and percentage of fertile soil may be straightforward to interpret, but a researcher may be interested in knowing which is “more important” for explaining differences in growth rates across countries.

###### The most common approach for measuring importance in these latter cases is to calculate *Standardized Beta Coefficients*. These are obtained by standardizing all the variables in an equation, including the dependent variable, and re-running the regression with the standardized variables. However, this measure has serious shortcomings.

###### As we shall show below, the nominal value of the *Standardized Beta Coefficient* does not lend itself to a straightforward interpretation. Further, it has difficulty handling nonlinear specifications of a variable, such as when both a variable and its square are included in an equation. And it is unable to assess groups of variables. For example, a researcher may include regional variables such as North America, Latin America, Middle East, etc., and wish to determine if the regional variables are collectively important.

###### A recent paper by Olivier Sterck identifies shortcomings of existing measures of importance (such as the *Standardized Beta Coefficient*) and proposes two new measures: *Ceteris Paribus Importance* and *Non-Ceteris Paribus Importance*. Both are measured in percentages and have straightforward interpretations. They can be implemented with a new Stata command (“importance”) that can be downloaded from [***the author’s website***](https://oliviersterck.wordpress.com/). The two measures address different aspects of importance and are intended as complements.

###### ***Ceteris Paribus Importance***

###### Let the relationship between a variable of interest, *y*, and a set of explanatory variables be given by

###### (1) *yi = b0 + ∑i bixi + εi*

###### It follows that the variance of *y* is

###### (2) *Var(y) = ∑i=1..n Var(bixi) + 2 ∑i=1..n-1 ∑j=i+1..n Cov(bixi, bjxj) + Var(ε)*.

###### A key challenge is how to allocate the *Cov(bixi, bjxj)* terms between *xi* and *xj* for all *i* ≠ *j.* The first measure, *Ceteris Paribus Importance*, denoted *qi-squared*, addresses this problem by ignoring these terms:

###### (3) *qi-squared = Var(bixi) / [∑i=1..n Var(bixi) + Var(ε)]*

###### *Ceteris Paribus Importance* takes values between 0 and 1 and can be understood as a percentage. Specifically, it is the percent of variation in *y* attributed to a given variable, holding the other variables constant. Note that the sum of the individual *q-squared* terms, including *q-squared* for the error term, equals one: *∑i=1..n qi-squared + q-squared(ε) = 1.* Further, in the special case when *∑i=1..n-1 ∑j=i+1..n Cov(bixi, bjxj) = 0,* as when all the explanatory variables are uncorrelated with each other, *∑i=1..n qi-squared = R2.*

###### **A comparison of *Ceteris Paribus Importance* with *Standardized Beta Coefficient***

###### Consider two data generating processes (DGPs).
+

###### DGP1: *y1i = x1i + ε1i*,  where *x1i*, *ε1i ~ N*(0,1);

###### DGP2: *y2i = x2i + ε2i*,  where *x2i ~ 2·N*(0,1) and *ε2i ~ N*(0,1).

###### In the first DGP, the coefficient on *x1i* is 1 and *x1i* and *ε1i* each contribute 50% to the variance of *y*. The second DGP is identical to the first, except that *x* now contributes 80% of the variance of *y*. The table below reports coefficient estimates for both models from simulated samples of 10,000 observations. It also reports the corresponding *Ceteris Paribus Importance* (*qi-squared*) and *Standardized Beta Coefficient* measures.

###### TRN1(20190713)

###### Recall that *Ceteris Paribus Importance* measures the contribution of the *x* variable to the variance of *y*. Since there is only one explanatory variable in both DGPs, the covariance terms in equation (3) drop out, and *qi-squared = R-squared.* Accordingly, in DGP1, where both *x* and *ε* contribute equally to the variance of *y*, *Ceteris Paribus Importance* equals 50%. When the variance of *x* increases fourfold, so that *x* contributes 4/5 of the variance of *y*, *Ceteris Paribus Importance* rises to 80%. In both cases, *Ceteris Paribus Importance* has a straightforward interpretation as a percent of the variance of *y* contributed by *x* (holding other variables constant).

###### The corresponding *Standardized Beta Coefficients* for the two models are 0.714 and 0.894. While the *Standardized Beta Coefficient* is larger in the second model, there is no straightforward interpretation of its numerical value.

###### ***Non-Ceteris Paribus Importance***

###### Sterck’s second measure, *Non-Ceteris Paribus Importance,* accommodates the fact that variables are likely to be correlated when working with observational data. It allocates the covariance terms in equation (2) equally across the two variables involved and is defined as follows:

###### (4a) *Ei = [Var(bixi) / Var(y)] + [∑j≠i Cov(bixi, bjxj) / Var(y)]*.

###### This can be expressed alternatively as

###### (4b) *Ei = [∂Var(y)/Var(y)] / [∂Var(bixi)/Var(bixi)]*.

###### Despite its seemingly nonintuitive appearance, *Non-Ceteris Paribus Importance* has two characteristics that make it appealing. First, *∑i=1..n Ei = R-squared.* Thus, it decomposes *R-squared* across the respective variables. As shown by equation (4b), the individual components can be expressed as elasticities. In particular, they measure how a marginal change in the variance of *bixi* affects the variance of *y.*

###### The second characteristic is that *Non-Ceteris Paribus Importance* can take both positive and negative values. The first term in equation (4a), *[Var(bixi) / Var(y)],* will always be positive, of course. However, the second term captures the association of *xi* with the other explanatory variables. In particular, if *xi* is strongly negatively correlated with other variables, the overall effect of increases in the variance of *bixi* can be to decrease the variance of *y.*

###### *Non-Ceteris Paribus Importance* serves as a complement to *Ceteris Paribus Importance.* The latter focuses on the direct effect of *x* on the variance of *y*. In contrast, *Non-Ceteris Paribus Importance* incorporates the covariance of *xi* with other variables. It provides a measure of the extent to which *xi* works to reinforce, or counteract, the effects of the other explanatory variables.
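###### Before turning to the empirical application, here is a minimal sketch of the DGP1/DGP2 comparison above (my own code, not Sterck’s *importance* command). With a single regressor, *qi-squared* reduces to the regression *R-squared*, and the *Standardized Beta Coefficient* can be obtained from the *beta* option of *regress*.

```stata
* Sketch: Ceteris Paribus Importance vs. Standardized Beta for DGP1 and DGP2
clear
set seed 20190715
set obs 10000
generate x1 = rnormal()
generate y1 = x1 + rnormal()      // DGP1: x and the error each contribute half of Var(y)
generate x2 = 2*rnormal()
generate y2 = x2 + rnormal()      // DGP2: x contributes 4/5 of Var(y)

regress y1 x1, beta               // standardized beta roughly 0.71
display "DGP1 q-squared (= R-squared with one regressor): " e(r2)   // roughly 0.50
regress y2 x2, beta               // standardized beta roughly 0.89
display "DGP2 q-squared (= R-squared with one regressor): " e(r2)   // roughly 0.80
```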
+

###### **An application to growth empirics**

###### Sterck provides an empirical example of the two importance measures using income data from 155 countries. He employs OLS to estimate a regression of the log of per capita GDP in 2000 on 18 variables that have been used by other researchers of economic growth.

###### Table 2 classifies the variables into five groups. Category 1 consists of stand-alone variables. Category 2 consists of a quadratic specification for the variable *Predicted genetic diversity*. Categories 3 through 5 consist of groupings of dummy variables (religion, legal foundations, regions).

###### Note that *Standardized Beta Coefficients* are unable to collectively evaluate Categories 2 through 5. For example, one can calculate a *Standardized Beta Coefficient* for the linear and quadratic forms of the *Predicted genetic diversity* variable, but not for their combined importance. Likewise, one can calculate *Standardized Beta Coefficients* for the individual religion dummies, but one cannot use this measure to obtain an overall measure of the importance of religion.

###### TRN2(20190713)

###### Table 3 reports importance measures for the five categories of variables (where the stand-alone variables of Category 1 are lumped together for convenience).

###### TRN3(20190713)

###### Column (1) displays *Ceteris Paribus Importance* (*qi-squared*). Note that the importance shares of the individual categories, plus that of the residual, sum to 100 percent of the variance of *y.* Collectively, the stand-alone variables account for approximately 45% of the variance of *y,* with the religion and regional variables next in importance at 13% and 10%, respectively.

###### Column (2) reports *Non-Ceteris Paribus Importance* (*Ei*). This measure accounts for the interactions of variables across categories. The individual shares sum to the *R-squared* of the respective OLS regression (84.4%).

###### Columns (3) and (4) divide *Non-Ceteris Paribus Importance* into two components. The *Variance* and *Covariance* components correspond to the first and second terms in equation (4a) above. In most cases, variables within a category reinforce the effects of variables in other categories. However, note that the *Covariance* term for the regional variables is negative. This indicates that the regional dummies counteract the effect of some of the variables in other categories, reducing the variation of *y* that would otherwise result. Nevertheless, overall, regional variables positively contribute to the variance of incomes across countries.

###### The above provides an example of how *Ceteris Paribus* and *Non-Ceteris Paribus Importance* can be calculated for OLS regressions. An option in the corresponding do file also allows one to calculate importance measures for 2SLS regressions. Additional details are provided in ***[Sterck’s paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3386218)***.

###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network.
He can be contacted at bob.reed@canterbury.ac.nz.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/07/15/reed-eir-how-to-measure-the-importance-of-variables-in-regression-equations/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/07/15/reed-eir-how-to-measure-the-importance-of-variables-in-regression-equations/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-eir-interval-testing.md b/content/replication-hub/blog/reed-eir-interval-testing.md new file mode 100644 index 00000000000..e8aeecca9b3 --- /dev/null +++ b/content/replication-hub/blog/reed-eir-interval-testing.md @@ -0,0 +1,105 @@ +--- +title: "REED: EIR* – Interval Testing" +date: 2019-06-28 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Economic significance" + - "Equivalence Test" + - "Interval testing" + - "Minimum Effects Test" + - "NHST" + - "null hypothesis significance testing" + - "Treatment effects" +draft: false +type: blog +--- + +###### *[\* EIR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research. The material for this blog is motivated by a recent blog at TRN, “**[The problem isn’t just the p-value, it’s also the point-null hypothesis!](https://replicationnetwork.com/2019/06/07/kim-robinson-the-problem-isnt-just-the-p-value-its-also-the-point-null-hypothesis/)**” by Jae Kim and Andrew Robinson]* + +###### In a recent blog, Jae Kim and Andrew Robinson highlight key points from their recent paper, “[***Interval-Based Hypothesis Testing and Its Applications to Economics and Finance***](https://www.mdpi.com/2225-1146/7/2/21)” (*Econometrics,* 2019). They identify three problems with conventional null hypothesis significance testing (NHST) based on *p*-values. + +###### First, the *p*-value does not convey any information about the economic significance of the estimated effect. + +###### Second, the *p*-value is decreasing in sample size for the same measured effect so that at a sufficiently large sample size, virtually everything is “statistically significant”. + +###### Third, the null hypothesis is almost always wrong, as it unlikely in the extreme that a particular effect is truly 0.000000000… + +###### As an alternative, they promote the use of interval-based hypothesis testing. In particular, they advance two types of interval tests: Minimum Effect Tests (MET) and Equivalence Tests (ET). + +###### The idea behind the two tests is similar. In both cases, the researcher posits limits for a given effect. Say, in the judgment of the researcher, any effect that lies between *value1* and *value 2* is too small to be economically important. Only values outside this range are economically meaningful. + +###### With Minimum Effect Tests, the aim is to determine if *value1* *≤ effect* *≤ value2.*Hypothesis testing consists of two, one-sided hypothesis tests (TOST). *H01: effect* *≥* *value1,* and *H02: effect* *≤ value2.* Rejection of either hypothesis leads to the conclusion that the effect is economically important. Otherwise one cannot reject the hypothesis that the effect is economically unimportant. The size of the MET test is the sum of the sizes of the two separate, one-sided *t*-tests. 
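###### To make the mechanics concrete, here is a rough sketch (mine, not the code behind the blog’s examples) of how the two one-sided tests of a MET might be computed on the *Cohen’s d* scale. It assumes the data contain an outcome *y* and a 0/1 treatment indicator *treat*, takes ±0.3 as the bounds of economic unimportance, and uses a standard large-sample approximation for the standard error of *d*. The same ingredients are reused for the Equivalence Test described next; only the null hypotheses and rejection rules change.

```stata
* Sketch of a Minimum Effect Test on the Cohen's d scale (assumes y and a 0/1 treat in memory)
summarize y if treat==1
scalar m1 = r(mean)
scalar s1 = r(sd)
scalar n1 = r(N)
summarize y if treat==0
scalar m0 = r(mean)
scalar s0 = r(sd)
scalar n0 = r(N)
scalar sp  = sqrt(((n1-1)*s1^2 + (n0-1)*s0^2)/(n1 + n0 - 2))   // pooled standard deviation
scalar d   = (m1 - m0)/sp                                      // Cohen's d
scalar sed = sqrt((n1 + n0)/(n1*n0) + d^2/(2*(n1 + n0)))       // approximate s.e. of d
scalar z1  = (d - (-0.3))/sed   // H01: effect >= -0.3; reject if z1 < -1.645 at the 5% level
scalar z2  = (d - 0.3)/sed      // H02: effect <=  0.3; reject if z2 >  1.645 at the 5% level
display "Cohen's d = " d "   z1 = " z1 "   z2 = " z2
```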
+ +###### With Equivalence Tests, the aim is to determine if *value1* *< effect* *< value2.*Hypothesis testing again consists of two, albeit different, one-sided hypothesis tests (TOST): *H01: effect* *≤* *value1* and *H02: effect* *≥ value2.* Rejection of both hypotheses leads to the conclusion that the effect is economically unimportant. Otherwise one cannot reject the hypothesis that the effect is economically important. The size of the ET is the same as the size of the individual one-sided tests (which typically are of equal size). + +###### This is summarized in the table below: + +###### TRN1(20190628) + +###### Given the above, it follows that the respective rejection criteria, expressed in terms of *t-*tests, are as reported in the table below. + +###### TRN2(20190628) + +###### While similar, the two tests are designed for different purposes. Minimum Effect Tests are designed to test for economic importance, while Equivalence Tests are designed to test for lack of economic importance. Rejection of the respective null hypotheses allows one to accept the economic status for which the researcher is seeking evidence. This also means that the tests can lead to seemingly conflicting conclusions. + +###### The remainder of this blog presents two examples to illustrate how to implement and interpret interval testing. In both examples, we envision an experiment with two groups, a treatment and a control group. The data generating process (DGP) used to produce the data is given by *y = β**·treat + error,* where *treat* is a binary treatment variable that takes the value 1 if the subject received the treatment and 0 otherwise. + +###### The examples are constructed so that in both cases the coefficient on the *treat* variable is statistically significant. Interval testing is used to determine whether the treatment effect is economically meaningful. To determine “economic importance”, we convert the estimated treatment effect to [***Cohen’s d***](http://staff.bath.ac.uk/pssiw/stats2/page2/page14/page14.html), a familiar metric for measuring effect sizes when comparing means between two groups. Following the example of [***Lakens (2017)***](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5502906/), we interpret values of *Cohen’s d* less than 0.3 in absolute value to be economically unimportant. + +###### **Example One** + +###### The regression below reports the results of regressing the outcome variable *y* on the treatment dummy variable. 500 subjects receive the treatment, with another 500 held out for the control group. The treatment effect is significant at the 5% level, with a *t*-value of 2.23 (see below). + +###### TRN3(20190628) + +###### To implement interval testing, we test for differences in the means of the two groups, calculate *Cohen’s d,* and then carry out the respective tests of hypotheses as presented in Tables 1 and 2. The first example is constructed so that both the Minimum Effect Test and the Equivalence Test produce a similar conclusion. Table 4A reports the results for the MET. + +###### TRN4(20190628) + +###### The estimated effect (*Cohen’s d*) is 0.14. The lower and upper critical values are ±0.42.  We fail to reject both *H01* and *H02* for the Minimum Effect Test. According to Table 1, this leads to the conclusion that we cannot reject the hypothesis that the treatment effect is economically unimportant. + +###### TRN5(20190628) + +###### TABLE 4B reports the results of an Equivalence Test applied to the same data. 
Now we compare the *Cohen’s d* value of 0.14 to the critical values ±0.20. Accordingly, we reject both null hypotheses. This allows us to accept the hypothesis, at the 5% significance level, that the treatment does not have an economically important effect. + +###### In this example, both tests lead to similar conclusions. However, there is an important difference. The Equivalence Test is the stronger result in that we accept the hypothesis of economic unimportance. The Minimum Effect Test is weaker, in that we only fail to reject the hypothesis that it is unimportant. + +###### **Example Two** + +###### In this example, we construct the DGP so that we still obtain a significant treatment effect. However, the associated tests will lead to seemingly conflicting conclusions. + +###### Table 5 gives the OLS regression estimate of the treatment effect. The treatment effect is highly significant, with a t-value of 4.67 (see below). + +###### TRN6(20190628) + +###### Tables 6A and 6B report the results of the Minimum Effect and Equivalence Tests. + +###### TRN7(20190628) + +###### The estimated effect (*Cohen’s d*) is now 0.295, very close to our threshold of economic significance (0.30). The lower and upper critical values remain at their values from the first example (±0.42). A comparison of the estimated effect with the respective critical values confirms that we fail to reject both *H01* and *H02* for the Minimum Effect Test. This leads to the conclusion that we cannot reject the hypothesis that the treatment effect is economically unimportant. + +###### TRN8(20190628) + +###### Table 6B performs an Equivalence Test on the same data. Now we compare the *Cohen’s d* value of 0.295 with the critical values ±0.20. We reject *H01* but not *H02.* Accordingly, we cannot reject the hypothesis that the treatment effect is economically important. + +###### In this example, the different tests lead to seemingly conflicting conclusions. The conflict derives from the fact that both tests produced weak conclusions. We could neither reject the hypothesis that the treatment effect was economically unimportant, nor reject the hypothesis that it was important. + +###### In conclusion, interval testing addresses a shortcoming of NHST in that it allows us to address issues of economic importance, something that NHST is ill-equipped to do. However, it does require the researcher to declare a range of values for the effect that are deemed “economically unimportant”. Not all researchers may agree with the researcher’s choice of values. + +###### Further, both Minimum Effect and Equivalence Tests share the weakness of all hypothesis testing in that conclusions of “Failure to reject” are weak results with respect to discriminating between null and alternative hypotheses. + +###### To learn more about interval testing, see [***Kim and Robinson (2019***)](https://www.mdpi.com/2225-1146/7/2/21). + +###### \*NOTE: The programming code (Stata) necessary to reproduce the results for the two examples in this blog are available at Harvard’s Dataverse: ***[click here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FG8AFOK)***. Feel free to check it out and play around with the simulation parameters to produce different examples.] + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. 
He can be contacted at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/06/28/reed-eir-interval-testing/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/06/28/reed-eir-interval-testing/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-eir-more-on-heterogeneity-in-two-way-fixed-effects-models.md b/content/replication-hub/blog/reed-eir-more-on-heterogeneity-in-two-way-fixed-effects-models.md new file mode 100644 index 00000000000..abe2af327a5 --- /dev/null +++ b/content/replication-hub/blog/reed-eir-more-on-heterogeneity-in-two-way-fixed-effects-models.md @@ -0,0 +1,128 @@ +--- +title: "REED: EiR* – More on Heterogeneity in Two-Way Fixed Effects Models" +date: 2019-10-18 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Andrew Goodman-Bacon" + - "bacondecomp" + - "Decomposition" + - "Heterogeneity" + - "Panel data" + - "Stata" + - "Treatment effects" + - "Two-way fixed effects" +draft: false +type: blog +--- + +###### *[\* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research. The material for this blog is primarily drawn from the recent working paper “**[Difference-in-differences with variation in treatment timing](https://cdn.vanderbilt.edu/vu-my/wp-content/uploads/sites/2318/2019/04/14141044/ddtiming_9_5_2018.pdf)**” by Andrew Goodman-Bacon, available from his webpage at Vanderbilt University. FIGURE 1 is modified from a [lecture slide](http://economics.ozier.com/econ626/lec/econ626-L03-slides-2019.pdf) by Pamela Jakiela and Owen Ozier.]* + +###### In a ***[recent blog](https://replicationnetwork.com/2019/06/01/reed-eir-heterogeneity-in-two-way-fixed-effects-models/)*** at TRN, I discussed research by Clément de Chaisemartin and Xavier D’Haultfoeuille (C&H) that pointed out how heterogeneity in treatment effects causes two-way fixed effects (2WFE) estimation to produce biased estimates of Average Treatment Effects on the Treated (ATT). + +###### This paper by Andrew Goodman-Bacon (GB) provides a nice complement to C&H. In particular, it decomposes the 2WFE estimate into mutually exclusive components. One of these can be used to identify the change in treatment effects over time. An accompanying Stata module (“bacondecomp”) allows researchers to apply GB’s procedure. + +###### In this blog, I summarize GB’s decomposition result and reproduce his example demonstrating how his Stata command can be applied. + +###### **Conventional difference-in-differences with homogeneous treatment effects** + +###### The canonical DD example consists of two groups, “Treatment” and “Control”, and two time periods, “Pre” and “Post”. The treatment is simultaneously applied to all members of the treatment group. The control group never receives treatment. The treatment effect is homogenous both across the treated individuals and “within” individuals over time. If there are time trends, we assume they are identical across both groups (“common trends assumption”). + +###### FIGURE 1 motivates the corresponding DD estimator. + +###### TRN1(20191018) + +###### Let *δ* be the ATT (which is the same for everybody and constant over time). 
Note that the ATT is given by the double difference DD, where:

###### TRN2(20191018)

###### The first difference sweeps out any unobserved fixed effects that characterize Treatment individuals. This leaves *δ* plus the time trend for the Treatment group.

###### The second difference (in parentheses) sweeps out unobserved effects associated with Control individuals. This leaves the time trend for the Control group.

###### The first difference minus the second difference then leaves *δ*, the ATT, assuming both groups have a common time trend. (Note how the “common trends” assumption is key to identifying *δ*.)

###### It is easily shown that, given the above assumptions, OLS estimation of the regression specification below produces an unbiased estimate of *δ.*

###### TRN3(20191018)

###### **A more realistic, three-period example**

###### Now consider a more realistic example, close in spirit to what researchers actually encounter in practice. Let there be three groups, “Early Treatment”, “Late Treatment” and “Never Treated”; and three time periods, “Pre”, “Mid”, and “Post”.

###### FIGURE 2 motivates the following discussion.

###### TRN4(20191018)

###### The Early Treatment group receives treatment at *t\*k* (GB uses the *k* subscript to indicate early treatees).

###### The Late Treatment group receives treatment at *t\*l* , *t\*l* > *t\*k* .

###### Suppose a researcher were to estimate the following 2WFE regression equation, where *Dit* is a dummy variable indicating whether individual *i* was treated at or before time *t.* For example, *Dit* = 0 and 1 for Late treatees at times “Mid” and “Post”, while *Dit* = 1 for Early treatees at times “Mid” and “Post”:

###### TRN5a(20191018)

###### GB shows that the OLS estimate of *βDD* is a weighted average of all possible *DD* paired differences. One of those paired differences (cf. Equation 6 in GB) is:

###### TRN5(20191018)

###### Note that in this case, the Early Treatment group (subscripted by *k*) can serve as a control group for the Late Treatment group because its treatment status does not change over the “Mid”/“Post” period. This particular paired difference ends up being important.

###### GB goes on to derive the following decomposition result: the probability limit of the OLS estimator of *βDD* consists of three components:

###### TRN6(20191018)

###### *VWATT* is the Variance Weighted ATT, *VWCT* is the Variance Weighted Common Trends, and *ΔATT* is the change in individuals’ treatment effects that occurs over time, where the weights come from sample size and treatment variance.

###### When the common trends assumption is valid (*VWCT*=0), and the treatment effect is homogeneous both across individuals and within individuals over time, then the probability limit equals *δ,* the homogeneous treatment effect.

###### However, if treatment effects are heterogeneous, then even if the common trends assumption holds, the estimate from the 2WFE specification will not equal the ATT. There are two sources of bias.

###### The first bias arises because OLS weights individual treatment effects differently depending on (i) the number of people who are treated and (ii) the timing of the treatment. This will introduce a bias if the size of the treatment effect is associated with either of these. However, this bias is not necessarily a bad thing. It is the byproduct of minimizing the variance of the estimator, so there are some efficiency gains that accompany this bias.
+ +###### The second bias is associated with changes in the treatment effect over time, *ΔATT*. This one is entirely a bad thing. + +###### Consider again the paired difference: + +###### TRN5(20191018) + +###### The second term is the difference in outcomes for the Early treatees between the Post and Mid periods. Because Early treatees are treated for both of these periods, this difference should sweep away everything but the time trend if the treatment effect stays constant over time. + +###### However, if treatment effects vary over time, say because the benefits depreciate (or, alternatively, accumulate), the treatment effect will not be swept out. The change in the treatment effect then carries through to the respective DD estimate, which will over- or under-estimate the true treatment effect accordingly. + +###### GB’s decomposition allows one to investigate this last type of bias. Towards that end, GB (along with Thomas Goldring and Austin Nichols) has written a Stata module called ***bacondecomp***. + +###### **Application: Replication of Stevenson and Wolfers (2006)** + +###### To demonstrate ***bacondecomp***, GB replicates a result from the paper “***[Bargaining in the Shadow of the Law: Divorce Laws and Family Distress](https://academic.oup.com/qje/article/121/1/267/1849020)***” by Betsey Stevenson and Justin Wolfers (S&W), published in *The Quarterly Journal of Economics* in 2006. + +###### Among other things, S&W estimate the effect of state-level, no-fault divorce laws on female suicide. Over their sample period of 1964–1996, 37 US states gradually adopted no-fault divorce. Eight states had already done so, and five never did. + +###### *[NOTE: The data and code to reproduce the results below are taken directly from the examples in the help file accompanying **bacondecomp**. They can be obtained by installing **bacondecomp** and then accessing the help documentation through Stata.]* + +###### GB does not exactly reproduce S&W’s result, but uses a similar specification and obtains a similar result. In particular, he estimates the following regression: + +###### TRN8(20191018) + +###### where *asmrs* is the suicide mortality rate per one million women; *post* is the treatment dummy variable (i.e., *Dit*); *pcinc*, *asmrh*, and *cases* are control variables for income, the homicide mortality rate, and the welfare case load; and *αi* and *αt* are individual and time fixed effects. + +###### The 2WFE estimate of *βDD* is -2.52. In other words, this specification estimates that no-fault divorce reform reduced female suicides by 2.52 fatalities per million women. + +###### The ***bacondecomp*** command decomposes the 2WFE estimate of -2.52 into three separate components by treatment (T) and control (C) groups. + +###### *Timing\_groups:* Early treatees (T) versus Late treatees (C) & Late treatees (T) versus Early treatees (C). + +###### *Always\_v\_timing:* Treatees (T) versus Always treated/Pre-reform states (C). + +###### *Never\_v\_timing:* Treatees (T) versus Never treated (C). + +###### ***bacondecomp*** produces the following table, where “Beta” is the DD estimate for the respective group and “TotalWeight” represents its share in the overall estimated effect (-2.52). Notice that the sum of the products of “Beta” × “TotalWeight” ≈ the 2WFE estimate. + +###### TRN10(20191018) + +###### Conspicuously, the first group (*Timing\_groups*) finds that no-fault divorce reform is associated with an increase in the female suicide rate (+2.60).
In contrast, the latter two groups find a decrease (-7.02 and -5.26). This is indicative that there may be changes in treatment effects over time. If so, this would invalidate the difference-in-differences estimation framework. + +###### Unfortunately, ***bacondecomp*** does not produce a corrected estimate of ATT. It is primarily useful for identifying a potential problem with time-varying treatment effects. As a result, it should be seen as complementing other approaches, such as the estimation procedures of de Chaisemartin and D’Haultfoeuille (***[see here](https://replicationnetwork.com/2019/06/01/reed-eir-heterogeneity-in-two-way-fixed-effects-models/)***), or an alternative approach such as an event study framework that includes dummies for each post-treatment period (***[see here](http://economics.mit.edu/files/14964)***). + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at [bob.reed@canterbury.ac.nz](mailto:bob.reed@canterbury.ac.nz).* + +###### **References** + +###### Goodman-Bacon, A. (2018). ***[Difference-in-differences with variation in treatment timing](https://cdn.vanderbilt.edu/vu-my/wp-content/uploads/sites/2318/2019/04/14141044/ddtiming_9_5_2018.pdf)***. National Bureau of Economic Research, No. w25018. + +###### Goodman-Bacon, A., Goldring, T., & Nichols, A. (2019).  bacondecomp: Stata module for decomposing difference-in-differences estimation with variation in treatment timing.  ****** + +###### Stevenson, B. & Wolfers, J. (2006). ***[Bargaining in the shadow of the law: Divorce laws and family distress](https://academic.oup.com/qje/article-abstract/121/1/267/1849020?redirectedFrom=fulltext)***. *The Quarterly Journal of Economics*, 121(1):267-288. + +###### + +###### + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/10/18/reed-eir-more-on-heterogeneity-in-two-way-fixed-effects-models/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/10/18/reed-eir-more-on-heterogeneity-in-two-way-fixed-effects-models/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-eir-replications-and-dags.md b/content/replication-hub/blog/reed-eir-replications-and-dags.md new file mode 100644 index 00000000000..2bc2447fa03 --- /dev/null +++ b/content/replication-hub/blog/reed-eir-replications-and-dags.md @@ -0,0 +1,72 @@ +--- +title: "REED: EiR* — Replications and DAGs" +date: 2020-03-10 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Causal effects" + - "Causal interpretation" + - "DAGitty" + - "DAGs" + - "Directed Acyclic Graphs" + - "Observational data" +draft: false +type: blog +--- + +###### *[\* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research.]* + +###### In recent years, DAGs (Directed Acyclic Graphs) have received increased attention in the medical and social sciences as a tool for determining whether causal effects can be estimated. A brief introduction can be found ***[here](https://cran.r-project.org/web/packages/ggdag/vignettes/intro-to-dags.html)***. While DAGs are commonly used to guide model specification, they can also be used in the post-publication assessment of studies. 
+ +###### Despite widespread recognition of the dangers of drawing causal inferences from observational studies, and the general, if often only nominal, acknowledgement that “correlation does not imply causation”, it is still standard practice for researchers to discuss estimated relationships from observational studies as if they represent causal effects. + +###### In this blog, we show how one can apply DAGs to previously published studies to assess whether implied claims of causal effects are justified. For our example, we use the [***Mincer earnings regression***](https://en.wikipedia.org/wiki/Mincer_earnings_function), which has appeared in hundreds, if not thousands, of economic studies. The associated wage equation relates individuals’ observed wages to a number of personal characteristics: + +###### *ln(wage) = b0 + b1 Educ + b2 Exp + b3 Black + b4 Female + error*, + +###### where *ln(wage)* is the natural log of wages, *Educ* is a measure of years of formal education, *Exp* is a measure of years of labor market experience, and *Black* and *Female* are dummy variables indicating an individual’s race (black) and sex. + +###### The parameters *b1* and *b2* are commonly interpreted as the rates of return to education and labor market experience, respectively. The coefficients on *Black* and *Female* are commonly interpreted as measuring labor market discrimination against blacks and women. + +###### Suppose one came across an estimated Mincer wage regression like the one above in a published study. Suppose further that the author of that study attached causal interpretations to the respective estimated parameters. One could use DAGs to determine whether those interpretations were justified. + +###### To do that, one would first hypothesize a DAG that summarized all the common cause relationships between the variables. By way of illustration, consider the DAG in the figure below, where *U* is an unobserved confounder.1 + +###### TRN (20200310) + +###### In this DAG, *Educ* affects *Wage* through a direct channel, *Educ -> Wage*, and an indirect channel, *Educ -> Exp -> Wage*. The Mincerian regression specification captures the first of these channels. However, it omits the second because the inclusion of *Exp* in the specification blocks the indirect channel. Assuming both channels carry positive associations, the estimated rate of return to education in the Mincerian wage regression will be downwardly biased. + +###### We can use the same DAG to assess the other estimated parameters. Consider the estimated rate of return to labor market experience. The DAG identifies both a direct causal path (*Exp -> Wage*) and a number of non-causal paths. *Exp <- Female -> Wage* is one non-causal path, as is *Exp <- Educ -> Wage*. Including the variables *Educ* and *Female* in the regression equation blocks these non-causal paths. As a result, the specification solely estimates the direct causal effect, and thus provides an unbiased estimate of the rate of return to labor market experience. + +###### In a similar fashion, one can show that, given the DAG above, one cannot interpret the estimated values of *b3* and *b4* as estimates of the causal effects of labor market discrimination against blacks and women. + +###### DAGs also have the benefit of suggesting tests that allow one to assess the validity of a given DAG.
In particular, the DAG above implies the following independences:2 + +###### 1) *Educ* ⊥ *Female* + +###### 2) *Exp* ⊥ *Black* | *Educ* + +###### 3) *Female* ⊥ *Black* + +###### Rejection of one or more of these would indicate that the DAG is not supported by the data. + +###### In practice, there are likely to be many possible DAGs for a given estimated equation. If a replicating researcher can obtain the data and code for an original study, he/she could then posit a variety of DAGs that seemed appropriate given current knowledge about the subject. + +###### For each DAG, one could determine whether the conditions exist such that the estimated specification allows for a causal interpretation of the key parameters. If so, one could then use the model implications to assess whether the DAG was “reasonable”, as evidenced by non-conflicting data. + +###### If no DAGs can be found that support a causal interpretation, or if adequacy tests cause one to eliminate all such DAGs, one could then request that the original author provide a DAG that would support their causal interpretations. In this fashion, existing studies could be assessed to determine if there is an evidentiary basis for causal interpretation of the estimated effects. + +###### 1 This DAG is taken from Felix Elwert’s course, ***[Directed Acyclic Graphs for Causal Inference](https://statisticalhorizons.com/seminars/public-seminars/directed-acyclic-graphs-for-causal-inference-fall19)***, taught through ***[Statistical Horizons](https://statisticalhorizons.com/)***. + +###### 2 A useful, free online tool for drawing and assessing DAGs, is DAGitty, which can be found ***[here](http://dagitty.net/)***. + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at bob.reed@canterbury.ac.nz.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2020/03/10/reed-eir-replications-and-dags/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2020/03/10/reed-eir-replications-and-dags/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-eir-what-s-supporting-that-fixed-effects-estimate.md b/content/replication-hub/blog/reed-eir-what-s-supporting-that-fixed-effects-estimate.md new file mode 100644 index 00000000000..7e7f03669f1 --- /dev/null +++ b/content/replication-hub/blog/reed-eir-what-s-supporting-that-fixed-effects-estimate.md @@ -0,0 +1,74 @@ +--- +title: "REED: EiR* – What’s Supporting that Fixed Effects Estimate?" +date: 2020-04-25 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Fixed Effects" + - "Panel data" + - "Stata" + - "Treatment effects" +draft: false +type: blog +--- + +###### *[\* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research.]* + +###### *NOTE: All the data and code  necessary to produce the results in the tables below are available at Harvard’s Dataverse: **[click here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/29IXQN).*** + +###### Fixed effects estimators are often used when researchers are concerned about omitted variable bias due to unobserved, time-invariant variables. These can prove insightful if there is much within-variation to support the fixed effects estimate. 
However, they can be misleading when there is not. + +###### Stata has several commands that can help the researcher gauge the extent of within-variation. In this example, we use the “wagepan” dataset that is bundled with Jeffrey Wooldridge’s text, “Introductory Econometrics: A Modern Approach, 6e”. The dataset consists of annual observations of 545 workers over the years 1980-1987. It is described ***[here](https://rdrr.io/cran/wooldridge/man/wagepan.html)***. + +###### In this example we use fixed effects to regress log(wage) on education, labor market experience, labor market experience squared, dummy variables for marital and union status, and annual time dummies. + +###### The table below reports the fixed effects (within) estimate for the “married” variable. For the sake of comparison, it also reports the between-estimate for “married”, calculated using the Mundlak version of the Random Effects Within-Between estimator (***[Bell, Fairbrother, and Jones, 2019](https://link.springer.com/article/10.1007/s11135-018-0802-x)***). + +###### TRN1(20200425) + +###### The within-estimate of the marriage premium is smaller than the between-estimate. This is consistent with marital status being positively associated with unobserved, time-invariant productivity characteristics of the worker. However, we want to know how much variation there is in marital status for the workers in our sample. If it is just a few workers who are changing marital status over time, then our estimate may not be representative of the effect of marriage in the population. + +###### Stata provides two commands that can be helpful in this regard. The command ***xttab*** reports, among other things, a measure of variable stability across time periods. In the table below, workers who ever reported being unmarried were unmarried for an average of 64.8% of the years in the sample. + +###### Workers who ever reported being married were married for an average of 62.5% of the years in the sample. In this case, changes in marital status are somewhat common. Note that a time-invariant variable would have a “Within Percent” value of 100%. + +###### TRN2(20200425) + +###### Stata provides another command, ***xttrans***, that gives detail about year-to-year variable transitions. + +###### TRN3(20200425) + +###### The rows represent the values in year *t*, with the columns representing the values in the following year. In this case, 86% of observations that were unmarried at time *t* were also unmarried at time *t+1*, while 14% of observations that were unmarried at time *t* changed status to “married” at time *t+1*. + +###### Among other things, the ***xttrans*** command provides a reminder that the fixed effects estimate of the marriage premium includes the effect of transitioning from married to unmarried: 5% of observations that were married at time *t* were unmarried at time *t+1*. The implied assumption is that the effect of marriage on wages is symmetric, something that could be further explored in the data. + +###### While these analyses are useful, they are based on observations, not workers. If there is concern about sample selection biasing the fixed effects estimates (so that “movers” are different from “stayers”), it would be useful to know how many of the 545 workers experienced a marital status change, since it is the changes that support the fixed effects estimate. + +###### The following set of commands calculates the minimum and maximum values of the explanatory variables for all the workers in the sample (see the sketch below).
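###### A minimal Stata sketch of the procedures described in this post, assuming the wagepan variable names *nr*, *year*, *lwage*, *educ*, *exper*, *expersq*, *married*, and *union* (the exact code in the Dataverse files may differ):

```stata
* Declare the panel structure: worker id (nr) and year
xtset nr year

* Fixed effects (within) regression of log(wage) on the variables listed above
xtreg lwage educ exper expersq married union i.year, fe

* Gauge within-worker variation in marital status
xttab married      // share of years spent in each marital status
xttrans married    // year-to-year transition probabilities

* Share of workers whose marital status ever changes over the sample period
preserve
bysort nr: egen min_married = min(married)
bysort nr: egen max_married = max(married)
gen change_married = (max_married != min_married)
collapse (first) change_married, by(nr)
summarize change_married
restore
```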
It then creates a dummy variable with the prefix “change” that takes the value 1 anytime the max and min values differ. Finally, it collapses the dataset so that there is one observation per worker, and then takes averages of the change variables. + +###### TRN4(20200425) + +###### The results below indicate that 56.9% of the workers changed their marital status during the sample period. Whether this is a sufficient number of “changers” to represent population “changers” is an open question. However, if the number were only 5 or 10% of workers, the argument for representativeness would be much weaker. + +###### TRN5(20200425) + +###### What does this have to do with replication? Oftentimes treatments are administered over time in panel datasets (say microcredit loans). Fixed effects estimates may be used to identify causal estimates of the treatment. Sample statistics, when they are reported, typically only report the percent of observations receiving treatment. Consider the two samples below. + +![TRN6(20200425)](/replication-network-blog/trn620200425-1.webp) + +###### In both samples, 30% of the observations are treatment observations. Thus a table of sample statistics would show identical means for the treatment variable in the two samples. + +###### However, in the first sample, 100% of the workers received treatment, and 75% of year-to-year transitions involved a change in treatment status. In the second sample, only 50% of the workers experienced treatment, and 25% of year-to-year transitions involved a change in treatment status. + +###### These are the kinds of differences that the procedures described above can be used to identify. + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at [bob.reed@canterbury.ac.nz](mailto:bob.reed@canterbury.ac.nz).* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2020/04/25/reed-eir-whats-supporting-that-fixed-effects-estimate/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2020/04/25/reed-eir-whats-supporting-that-fixed-effects-estimate/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-how-open-science-can-discourage-good-science-and-what-journals-can-do-about-it.md b/content/replication-hub/blog/reed-how-open-science-can-discourage-good-science-and-what-journals-can-do-about-it.md new file mode 100644 index 00000000000..83b66eff4a9 --- /dev/null +++ b/content/replication-hub/blog/reed-how-open-science-can-discourage-good-science-and-what-journals-can-do-about-it.md @@ -0,0 +1,49 @@ +--- +title: "REED: How “Open Science” Can Discourage Good Science, And What Journals Can Do About It" +date: 2018-07-25 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "data and code" + - "Data sharing" + - "Journal policies" + - "Open Science" + - "peer review" +draft: false +type: blog +--- + +###### In a recent tweet (or series of tweets) ***[Kaitlyn Werner](https://twitter.com/kaitlynmwerner)*** shares her experience of having a paper rejected after she posted all her data and code and submitted her paper to a journal. The journal rejected the paper because a reviewer looked over the data and had “a hunch” that there was a mistake. 
+ +###### Werner states that she was just about to change her stance on open science when, after several checks of her data and code, she realized the reviewer was right. There was a mistake in the coding of the data. + +###### The lesson the author learned from this experience?: + +###### *“Fortunately, I think this error will actually make my paper a lot stronger. And as upset that I am about the 3 months of review that are now lost, I am happy to know that you didn’t publish a misleading paper. And from now on, I will always share my data.”* + +###### To read her full set of tweets, ***[click here](https://twitter.com/kaitlynmwerner/status/1021047716355493889)***. + +###### But there is another lesson here. If papers with data and code are more likely to be rejected (because they have more things that reviewers can find fault with), then they face a higher standard of getting published. If one believes that making data and code public makes researchers more careful, and the associated research is higher quality and more likely to be “true”, then “open science” will enable discrimination against higher quality research, and tilt the playing field towards lower quality research. + +###### In this particular case, the journal’s actions were not compatible with good science. + +###### If journals don’t want to discourage good science, and if some papers submit data and code and others do not, then at the very least, the journal should create a level playing field. Papers with data and code should not face a higher threshold of acceptance than papers without data and code. + +###### One way they could do that is to inform their reviewers that they should never reject a paper based on the data and code. If a reviewer finds a mistake, but the rest of the paper seems publishable, the journal should allow the author to resubmit their research with corrected data and code. + +###### Further, if journals wanted to tilt the playing field in favor of good science, they could build in a higher probability of acceptance for papers that supplied data and code. This is a reasonable policy for a journal to follow if one believes that these papers will tend to be higher quality: Researchers who make their data and code transparent know that they run a higher risk of having their mistakes uncovered. As a result, they will go to extra lengths to make sure their research is mistake-free and “true”. + +###### Kaitlyn Werner is a noble scientist who cares about truth more than getting a publication in a prestigious journal. The lesson she drew from her experience made her more committed to open science. + +###### However, if open science is to lead to better science, journals are going to have to figure out how to avoid penalizing open science practices. + +###### *Bob Reed is Professor of Economics at the University of Canterbury in New Zealand and co-founder of The Replication Network.  He can be contacted at bob.reed@canterbury.ac.nz.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/07/25/reed-how-open-science-can-discourage-good-science-and-what-journals-can-do-about-it/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/07/25/reed-how-open-science-can-discourage-good-science-and-what-journals-can-do-about-it/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/reed-how-you-as-a-reviewer-can-encourage-journals-to-become-more-transparent.md b/content/replication-hub/blog/reed-how-you-as-a-reviewer-can-encourage-journals-to-become-more-transparent.md new file mode 100644 index 00000000000..00e40cf260d --- /dev/null +++ b/content/replication-hub/blog/reed-how-you-as-a-reviewer-can-encourage-journals-to-become-more-transparent.md @@ -0,0 +1,69 @@ +--- +title: "REED: How You, as a Reviewer, Can Encourage Journals to Become More Transparent" +date: 2019-05-31 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Data sharing" + - "Journal policies" + - "Manuscript review" + - "Peer Reviewers Openness initiative" + - "Transparency" +draft: false +type: blog +--- + +###### I am a member of the ***[Peer Reviewers Openness (PRO) Initiative](https://opennessinitiative.org/)***. The Pro Initiative is based on the idea that reviewers have the power to get journals to become more transparent. In particular, they encourage reviewers to request data and code from the journal when they are asked to review a manuscript. Here is the statement from their homepage: + +###### “We believe that openness and transparency are core values of science. … The promise of open research can finally be realized, but this will require a cultural change in science. The power to create that change lies in the peer-review process.” + +###### “We suggest that beginning January 1, 2017, **reviewers make open practices a pre-condition for more comprehensive review.**This is already in reviewers’ power; to drive the change, all that is needed is for reviewers to collectively agree that the time for change has come.” + +###### This can work! I was recently asked to review a manuscript for a journal that does not require authors to provide their data and code along with their manuscript at the time of submission. In other words, a typical journal. Here is what I wrote the editor: + +###### *Dear Professor XXX,* + +###### *Thank you for the invitation to review a manuscript for XXX.* + +###### *I am happy to do that conditional on the authors providing their data and code so I can double check their analysis.* + +###### *I am a member of the **[Peer Reviewers’ Openness Initiative](https://opennessinitiative.org/)**. I also co-founded and manage **[The Replication Network](https://replicationnetwork.com/)**. The bottom line is that I believe that many if not most of the research findings in empirical economics are not reliable. The only way to address this problem that I can see is to have researchers provide their data and code upon submission so that reviewers can do a satisfactory job of assessing the research.* + +###### *As somebody who also sits on the other side of the desk, I know how difficult it can be to secure reviewers. I don’t want to make your job more difficult than it already is. But I do think our discipline has a serious problem and I don’t know of any other way to fix it but to encourage journals to require authors to provide their data and code.* + +###### *I look forward to hearing your response.* + +###### *Sincerely,* + +###### *Bob Reed* + +###### Frankly, I did not expect to receive a positive response from the journal, but I am apparently a man of little faith. A few days later I received the following response: + +###### *Dear Prof. Reed,* + +###### *I have contacted the author and got the following reply:* + +###### *“No problem. We value transparency. 
The data is stored on Harvard Dataverse under the link XXX (I have attached relevant information to our application, and I attach the data itself to the message for convenience). As for the code (also attached) it is designed for the R environment.* + +###### *I hope this helps in assessing our findings.”* + +###### *I hope that you will now be able to accept this invitation to review:* + +###### *Thanks in advance.* + +###### *Best regards,* + +###### *XXX* + +###### As anybody knows who ever has tried to find reviewers, good reviewers are scarce. Anything that is scarce has value. And value translates to leverage. Reviewers have the leverage to get journals to become more transparent. So…why not give it a go the next time you are asked to review a manuscript? + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/05/31/reed-how-you-as-a-reviewer-can-encourage-journals-to-become-more-transparent/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/05/31/reed-how-you-as-a-reviewer-can-encourage-journals-to-become-more-transparent/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-is-science-self-correcting-evidence-from-5-recent-papers-on-the-effect-of-replications-on-citat.md b/content/replication-hub/blog/reed-is-science-self-correcting-evidence-from-5-recent-papers-on-the-effect-of-replications-on-citat.md new file mode 100644 index 00000000000..9bc36ab691b --- /dev/null +++ b/content/replication-hub/blog/reed-is-science-self-correcting-evidence-from-5-recent-papers-on-the-effect-of-replications-on-citat.md @@ -0,0 +1,195 @@ +--- +title: "REED: Is Science Self-Correcting? Evidence from 5 Recent Papers on the Effect of Replications on Citations" +date: 2023-04-05 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Citations" + - "economics" + - "Psychology" + - "replications" + - "Self-correcting Science" +draft: false +type: blog +--- + +NOTE: This is a long blog. TL;DR: I discuss 5 papers and the identification strategies each use in their effort to identify a causal effect of replications on citations. + +One of the defining features of science is its ability to self-correct. This means that when new evidence or better explanations emerge, scientific theories and models are modified or even discarded. However, the question remains whether science really works this way. In this blog I review 5 recent papers that attempt to empirically answer this question. All five investigated whether there was a citation penalty from an unsuccessful replication. Although each of the papers utilized multiple approaches, I only report one or a small subset of results as representative of their analyses. + +Three of the papers are published and from psychology: [***Serra-Garcia & Gneezy (2021)***](https://www.science.org/doi/pdf/10.1126/sciadv.abd1705), [***Schafmeister (2021)***](https://journals.sagepub.com/doi/pdf/10.1177/09567976211005767), and [***von Hippel (2022)***](https://journals.sagepub.com/doi/pdf/10.1177/17456916211072525?casa_token=YUDI8W6J9C4AAAAA:vmCbqg2LzQJoHS6T2ix2H_2I1BX2f11ZmF2s_mVmLy4h_dfE6ugXmGFMv25qn4S4spNxYdsMx-6OXQ). 
Two of the papers are from economics and are unpublished: [***Ankel-Peters, Fiala, & Neubauer (2023)***](https://www.rwi-essen.de/fileadmin/user_upload/RWI/Publikationen/Ruhr_Economic_Papers/REP_23_1005.pdf) and [***Coupé & Reed (2023)***](https://ideas.repec.org/p/cbt/econwp/22-16.html). + +All five find no evidence that psychology/economics are self-correcting. However, there are interesting things to learn from how they approached this question, and that is what I want to cover in this blog. + +**The Psychology Studies** + +The three psychology studies rely heavily on replications from the Reproducibility Project: Psychology (Open Science Collaboration, 2015; henceforth RP:P). In particular, they exploit a unique feature of RP:P: it “randomly” selected studies to replicate. + +Specifically, the RP:P team chose three leading journals in psychology. For each journal, they started with the first issue of 2008 and selected experiments to replicate that met certain feasibility requirements. They did not choose experiments based on their results. RP:P was only concerned with selecting experiments whose methods could be reproduced with reasonable effort. They continued reading through the journals until they found 100 studies to replicate. + +Because of RP:P’s procedure for selecting studies, one can view the outcomes of the replications as random events, since the decision to replicate an experiment was independent of expectations about whether the replication would be successful. + +It is this feature that allows the three psychology studies to model the treatments “successful replication” and “unsuccessful replication” as random assignments. All three studies investigate whether “unsuccessful” replications adversely affect the original studies’ citations. I discuss each of them below. + +Serra-Garcia & Gneezy (2021). Serra-Garcia & Gneezy draw replications from three sources, with the primary source being RP:P. The other two sources (“Economics”: Camerer et al., 2016; “Nature/Science”: Camerer et al., 2018) followed similar procedures in selecting experiments to replicate. Their main results are based on 80 replications and are presented in Figure 3 and Table 1 of their paper. + +The vertical lines in the three panels of their Figure 3 indicate the year the respective replication results were published. The height of the lines represents the yearly citations for original studies that were successfully replicated (blue) and unsuccessfully replicated (black). If science were self-correcting, one would hope to see citations for studies that failed to replicate take a hit and decrease after the failure to replicate became known. Nothing of that sort is obvious from the graphs. + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image.webp) + +To obtain a quantitative estimate of the treatment effect (= “successful replication”), Serra-Garcia & Gneezy estimate the following specification: + +*Yit = β0i + β1Successit + β2AfterReplicationit + β3Success×AfterReplicationit + Year Fixed Effects + Control Variables* + +where: + +Dependent variable = Google Scholar cites per year + +Number of original studies = 80 + +Time period = 2010-2019 + +Estimation Method = Poisson/Random Effects + +Control group = No + +Their Table 1 reports the results of a difference-in-differences (DID) analysis (see below). The treatment variable is “Replicated x After publication of replication”.
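As a rough illustration of how a specification of this general form can be estimated, here is a hypothetical Stata sketch. This is not the authors’ code; the dataset and variable names are invented.

```stata
* Hypothetical panel: one row per original study (study_id) and year
* cites   = Google Scholar citations received in that year
* success = 1 if the original study was successfully replicated
* post    = 1 for years after the replication result was published
xtset study_id year
xtpoisson cites i.success##i.post i.year, re
* The DID term of interest is the success#post interaction
```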
The estimated effect says that original studies that are successfully replicated receive approximately 1.2 more citations per year than those that are not. However, the effect is not statistically significant. + +[![](/replication-network-blog/image-1.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-1.webp) + +It is the random assignment of “successful replication” and “unsuccessful replication” that allow Serra-Garcia & Gneezy (2021) to claim they identify a causal effect. The other two psychology studies follow a similar identification strategy. + +Schafmeister (2021). Schafmeister focuses solely on replication studies from RP:P. In particular, he selects 95 experiments that had a single replication and whose original studies produced statistically significant results. + +He then constructs a control group of 329 articles taken from adjacent years (2007,2009) of the same three psychology journals used by RP:P. He uses the same criteria that RP:P used to select their replication studies except that these studies are used as controls. This allows him to define three treatments: “successful replication”, “unsuccessful replication”, and “no replication”. Because he uses the same criteria as RP:P in selecting his control group, he is able to claim that all three treatments are randomly assigned. + +To obtain a quantitative estimate of the two treatment effects (=”successful replication” and “unsuccessful replication), Schafmeister estimates the following DID specification: + +*Yit = β0 + β1Successfulit + β2Failedit+ Study Fixed Effects + Year Fixed Effects + Control Variables* + +where “Successful” and “Failed” are binary variables that indicate that that the replication was successful/failed and that t > 2015, the year the replication result was published; and + +Dependent variable = ln(Web of Science cites per year) + +Number of original studies = 429 (95 RP:P + 329 controls) + +Time period = 2010-2019 + +Estimation Method = OLS/Fixed Effects + +Control group = Yes + +Schafmeister’s Table 2 reports the results (see below). Focusing on the baseline results, studies that successfully replicate receive approximately 9% more citations per year (=0.037+0.051) than studies whose replications failed. Unfortunately, Schafmeister did not test whether this difference was statistically significant. + +[![](/replication-network-blog/image-2.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-2.webp) + +Von Hippel (2022). Similar to Schafmeister, von Hippel draws his replication entirely from RP:P, albeit with a slightly different sample of 98 studies. His Figure 2 provides a look at his main results. There is some evidence that successful replications gain citations relative to unsuccessful replications. 
+ +[![](/replication-network-blog/image-3.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-3.webp) + +To obtain quantitative estimates of the treatment effect (“unsuccessful replication), von Hippel estimates the following DID specification: + +*Yit = β0 + β1AfterFailureit + Study Fixed Effects + Year Fixed Effects + Control Variables* + +where “AfterFailure” takes the value 1 if the original study failed to replicate and the year > 2015; and + +Dependent variable = ln(Google Scholar cites per year) + +Number of original studies = 95 + +Time period = 2008-2020 + +Estimation Method = Negative Binomial/Fixed Effects + +Control group = No + +He concludes that “replication failure reduced citations of the replicated studies by approximately 9%”, though the effect was not statistically significant. + +In conclusion, all three psychology studies estimate that unsuccessful replication reduce citations, but the estimated effects are insignificant in two of the three studies and unreported in a third. + +**The Economics Studies** + +Once we get outside of psychology, the plot thickens. There is nothing of the scale of the RP:P to allow researchers to assume random assignment with respect to whether a replication is successful. Any investigation of whether economics is self-correcting must work with non-experimental, observational data. Two studies have attempted to do this. + +Ankel-Peters, Fiala, & Neubauer (2023). Ankel-Peters, Fiala, & Neubauer focus on the flagship journal of the American Economic Association. They study all replications published as “Comments” that appeared in the American Economic Review (AER) from 2010-2020. Their AER sample comes with one big advantage and one big disadvantage. + +The advantage lies in the fact that replications that appear in the AER are likely to be seen. A problem with replications that appear in lesser journals is that they may not have the visibility to affect citations. But that isn’t a problem for replications that appear in the AER. If ever one were to hope to see an adverse citation impact from an unsuccessful replication, one would expect to find it in the studies replicated in the AER. + +The big disadvantage is that virtually all of the replications published by the AER are unsuccessful replications. This makes it impossible to compare the citation impact of unsuccessful replications with successful ones. + +A second disadvantage is the relatively small number of studies in their sample. When Ankel-Peters, Fiala, & Neubauer try to examine citations for original studies that have at least 3 years of data before the replication was published and 3 years after, they are left with 38 studies. + +Their main finding is represented by their FIGURE 6 below. + +[![](/replication-network-blog/image-4.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-4.webp) + +Ankel-Peters, Fiala, & Neubauer do not attempt to estimate the causal effect of an unsuccessful replication. Recognizing the difficulty of using their data to do that, they state, “…*we do not strive for making a precise causal statement of how much a comment affects the [original paper’s] citation trend. The qualitative assessment of an absence of a strong effect is sufficient for our case*” (page 15). That leaves us with Coupé & Reed (2023). + +Coupé & Reed (2023). I note that one of the co-authors of this paper is Reed, who is also writing this blog. This raises the question of objectivity. Let the reader beware! 
+ +Unlike Ankel-Peters, Fiala, & Neubauer, Coupé & Reed attempt to produce a causal estimate of the effect of unsuccessful replications. Their approach relies on matching. + +They begin with a set of 204 original studies that were replicated for which they have 3 years of data before the replication was published and 3 years of data after the replication was published. Approximately half of the replicated studies had their results refuted by their replications, with the remaining half receiving either a confirmation or a mixed conclusion. + +They consider estimating the following DID specification: + +*Yit = β0 + β1Negativeit + Study Fixed Effects + Year Fixed Effects + Control Variables* + +Where *Yit* is Scopus citations per year and “Negative” takes the value 1 for an original study that failed to replicate and *t* > the year the replication was published. However, concern about the non-random assignment of treatment and the ability of control variables to adjust for this non-randomness causes them to reject this approach. + +Instead, they pursue a two-stage matching approach. First, they use Scopus’ database and identify potential controls from all studies that were published in the same years as the replicated studies, appeared in the same set of journals that published the replicated studies, and belonged in the same general Scopus subject categories. This produced a pool of 112,000 potential control studies. + +In the second stage, they matched these potential controls with the replicated studies on the basis of their year-by-year citation histories. Their matching strategy is illustrated in FIGURE 2. + +If the original study was published 3 years before the replication study was published, they match on the intervening two years (Panel A). If the original study was published 4 years before the replication study was published, they match on the intervening 3 years (Panel B). And so on. + +[![](/replication-network-blog/image-5.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-5.webp) + +They don’t just match on the total number of citations in the pre-treatment period, but on the year-by-year history. The logic is that non-random assignment is better captured by finding other articles with identical citation histories than by adjusting regressions with control variables. + +This gives them 3 sets of treateds and controls depending on the closeness of the match. For near perfect matches (“PCT=0%”), they have 74 replications and 7,044 controls. For two looser matching criteria (“PCT=10%” and “PCT=20%), they have 103 replications and 7,552 controls; and 142 replications and 11,202 controls, respectively. + +For all original studies with a positive replication in a given year *t*, they define *DiffPit = *Ypit* – Ypbarit*, where **Ypit** is the associated citations for study *i* and *YPbarit* is the average of all the controls matched with study *i*. For all original studies *i* with a negative replication in a given year *t*, they define *DiffNit = YNit – YNbarit*, where *YNit* and *YNbarit* are defined analogously as above. + +Coupé and Reed then pool these two sets of observations to get + +*Diffit = (**Ypit** – Ypbarit)×(1-Nit) + (YNit – YNbarit)×Nit = β0 + β1Negativeit* + +where *Nit* is a binary variable that takes the value 1 if the original study had a negative/failed replication. + +They then estimate separate regressions for each year t = -3, -2, -1, 0, 1, 2, 3, where time is measured from the year the replication study was published. 
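Written out for a single event-time year *t*, the outcome and regression above take roughly the following form (a restatement in generic notation, not the authors’ exact equations):

```latex
% Citation gap between each replicated study and the average of its matched controls:
Diff_{it} = \left( Y^{P}_{it} - \bar{Y}^{P}_{it} \right)(1 - N_{it})
          + \left( Y^{N}_{it} - \bar{Y}^{N}_{it} \right) N_{it}

% Estimated separately for each event-time year t = -3, ..., 3:
Diff_{it} = \beta_0 + \beta_1\, Negative_{it} + \varepsilon_{it}
```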
+ +β1 then provides an estimate of the difference in the citation effect from a negative replication compared to a positive or mixed replication. Their preferred results are based on quantile regression to address outliers and are reported in their Table 10 (see below). + +[![](/replication-network-blog/image-6.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-6.webp) + +They generally find a small, positive effect associated with negative replications, amounting to less than 2 citations per year. In all but one case (PCT=0%, t=2), the estimates are statistically insignificant. In no case do they find a negative and significant effect on citations, and thus they find no evidence of a citation penalty for failed replications. + +Unlike in psychology, any estimate of the causal effect of failed replications on citations in economics must deal with the problem of non-random treatment assignment. Coupé and Reed’s identification strategy relies on the fact that *Ypbarit* and *YNbarit* account for any unobserved characteristics associated with positive and negative replications, respectively. Since *Ypbarit* and *YNbarit* “predicted” the citation behaviour of the original studies before they were replicated, the assumption is that they represent an unbiased estimate of how many citations the respective original studies would have received if they had not been replicated. Under this assumption, β1 provides a causal estimate of the citation effect of a negative replication versus a positive one. + +For psychology, the results are pretty convincing: a failed replication has a relatively small and statistically insignificant impact on a study’s citations. + +In economics, the challenge is to find a way to address the problem that researchers do not randomly choose studies to replicate. Studies by Ankel-Peters, Fiala, & Neubauer (2023) and Coupé & Reed (2023) present two such approaches. Both studies fail to find any evidence of a citation penalty from unsuccessful replications. Whether one finds their results convincing depends on how well one thinks they address the problem of non-random assignment of treatment. + +*Bob Reed is Professor of Economics and the Director of* [***UCMeta***](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/) *at the University of Canterbury. He can be contacted at [bob.reed@canterbury.ac.nz](mailto:bob.reed@canterbury.ac.nz).* + +**REFERENCES** + +Ankel-Peters, J., Fiala, N., & Neubauer, F. (2023). Is economics self-correcting? Replications in the American Economic Review. Ruhr Economic Papers, #1005. + +Coupé, T., & Reed, W.R. (2023). Do replications play a self-correcting role in economics? Mimeo, University of Canterbury. + +Schafmeister, F. (2021). The effect of replications on citation patterns: Evidence from a large-scale reproducibility project. *Psychological Science*, 32(10), 1537-1548. + +Serra-Garcia, M., & Gneezy, U. (2021). Nonreplicable publications are cited more than replicable ones. *Science Advances*, 7(21), eabd1705. + +von Hippel, P. T. (2022). Is psychological science self-correcting? Citations before and after successful and failed replications. *Perspectives on Psychological Science*, 17(6), 1556-1565.
+ +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2023/04/05/reed-is-science-self-correcting-evidence-from-5-recent-papers-on-the-effect-of-replications-on-citations/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2023/04/05/reed-is-science-self-correcting-evidence-from-5-recent-papers-on-the-effect-of-replications-on-citations/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-logchies-calculating-power-after-estimation-no-programming-necessary.md b/content/replication-hub/blog/reed-logchies-calculating-power-after-estimation-no-programming-necessary.md new file mode 100644 index 00000000000..61cc8524c4c --- /dev/null +++ b/content/replication-hub/blog/reed-logchies-calculating-power-after-estimation-no-programming-necessary.md @@ -0,0 +1,110 @@ +--- +title: "REED & LOGCHIES: Calculating Power After Estimation – No Programming Necessary!" +date: 2024-08-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "American Economic Journal: Applied Economics" + - "Chris Doucouliagos" + - "economics" + - "Ioannidis et al. (2017)" + - "Partial Correlation Coefficients (PCCs)" + - "post hoc power analysis" + - "Power Curve" + - "ShinyApp" + - "Tian et al. (2024)" +draft: false +type: blog +--- + +**Introduction.** Your analysis produces a statistically insignificant estimate. Is it because the effect is negligibly different from zero? Or because your research design does not have sufficient power to achieve statistical significance? Alternatively, you read that “The median statistical power [in empirical economics] is 18%, or less” (***[Ioannidis et al., 2017](https://academic.oup.com/ej/article-abstract/127/605/F236/5069452?login=false)***) and you wonder if the article you are reading also has low statistical power. By the end of this blog, you will be able to easily answer both questions. Without doing any programming. + +**An Online App**. In this post, we show how to calculate statistical power post-estimation for those who are not familiar with R. To do that, we have created a Shiny App that does all the necessary calculating for the researcher ([***CLICK HERE***](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/)). We first demonstrate how to use the online app and then showcase its usefulness. + +**How it Works.** The input window requires researchers to enter four numbers: (i) The alpha value corresponding to a two-tailed test of significance; (ii) the degrees of freedom of the regression equation; (iii) the standard error of the estimated effect; and (iv) the effect size for which the researcher wants to know the corresponding statistical power. + +The input window comes pre-filled with four values to guide how researchers should enter their information. (The numbers in the table are taken from an example featured in the previous TRN blog). Once the respective information is entered, one presses the “Submit” button. + +[![](/replication-network-blog/image.webp)](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/) + +The app produces two outputs. The first is an estimate of statistical power that corresponds to the “Effect size” entered by the researcher. For example, for the numbers in the input window above, the app reports the following result: “Post hoc power corresponding to an effect size of 4, a standard error of 1.5, and df of 50 = 74.3%” (see below). 
+ +[![](/replication-network-blog/image-1.webp)](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/) + +In words, this regression had a 74.3% probability of producing a statistically significant (5%, two-tailed) coefficient estimate if the true effect size was 4. + +The second output is a power curve (see below). + +[![](/replication-network-blog/image-2.webp)](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/) + +The power curve illustrates how power changes with effect size. When the effect size is close to zero, it is unlikely that the regression will produce a statistically significant estimate. When the effect size becomes large, the probability increases, eventually asymptoting to 100%. + +The power curve plot also includes two vertical lines: “Effect size” and “80% power”. The former translates the “Calculation Result” from above and places it within the plot area. The latter plots the effect size that corresponds to 80% power as a reference point. + +**Useful when Estimates are Statistically Insignificant.** One application of post-hoc power is that it can help distinguish when statistical insignificance is due to a negligible effect size versus when it is the result of a poorly powered research design. + +[***Tian et al. (2024)***](https://onlinelibrary.wiley.com/doi/full/10.1111/rode.13130), on which this blog is based, give the example of a randomized controlled trial that was designed to have 80% power for an effect size of 0.060, where 0.060 was deemed sufficiently large to represent a meaningful economic effect. Despite estimating an effect of 0.077, the study found that the estimated effect was statistically insignificant (degrees of freedom = 62). In fact, the power of the research design *as it was actually implemented* was only 20.7%. + +We can illustrate this case in our online app. First, we enter the respective information in the input window: + +[![](/replication-network-blog/image-3.webp)](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/) + +This produces the following “Calculation Result”, + +[![](/replication-network-blog/image-4.webp)](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/) + +and associated Power Curve: + +[![](/replication-network-blog/image-5.webp)](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/) + +As it was implemented, the estimated regression model only had a 20.7% probability of producing a statistically significant estimate for an effect size (0.060) that was economically meaningful. Clearly, it would be wrong to interpret statistical insignificance in this case as indicating that the true effect was negligible. + +**How About When it is Difficult to Interpret Effect Sizes?**  The example above illustrates the case where it is straightforward to determine an economically meaningful effect size for calculating power. When it is not possible to do this, the online Post Hoc Power app can still be of use by converting estimates to “partial correlation coefficients” (*PCCs*). + +*PCCs* are commonly used in meta-analyses to convert regression coefficients to a common effect size. 
All one needs is the estimated *t*-statistic and the regression equation’s degrees of freedom (*df*): + +[![](/replication-network-blog/image-6.webp)](https://replicationnetwork.com/wp-content/uploads/2024/08/image-6.webp) +[![](/replication-network-blog/image-15.webp)](https://replicationnetwork.com/wp-content/uploads/2024/08/image-15.webp) + +The advantage of converting regression coefficients to *PCCs* is that there exist guidelines for interpreting the associated economic sizes of the effects. First, though, we demonstrate how converting the previous example to a *PCC* leads to a very similar result. + +To get the *t*-statistics for the previous example, we divide the estimated effect (0.077) by its standard error (0.051) to obtain *t* = 1.176. Given *df* = 62, we obtain *PCC* = 0.148 and *se(PCC)* = 0.124. We input these parameter values into the input window (see below). + +[![](/replication-network-blog/image-9.webp)](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/) + +The output follows below: + +[![](/replication-network-blog/image-10.webp)](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/) + +[![](/replication-network-blog/image-11.webp)](https://w87avq-bob-reed.shinyapps.io/post_hoc_power_app/) + +A comparison of FIGURES 3B and 2B and FIGURES 3C and 2C confirms that the conversion to *PCC* has produced very similar power calculations. + +**Calculating Power for the Most Recent issue of the *American Economic Journal: Applied Economics***. For our last demonstration of the usefulness of our online app, we investigate statistical power in the most recent issue of the ***[American Economic Journal: Applied Economics](https://www.aeaweb.org/issues/767)*** (July 2024, Vol. 16, No.3). + +There are a total of 16 articles in that issue. For each article, we selected one estimate that represented the main effect. Five of the articles’ did not provide sufficient information to calculate power for their main effects, usually because they clustered standard errors but did not report the number of clusters. That left 11 articles/estimated effects. + +To determine statistical power, we converted all the estimates to *PCC* values and calculated their associated *se(PCC)* values (see above). We then calculated power for three effect sizes. + +To select the effect sizes, we turned to a very useful paper by Chris Doucouliagos entitled “***[How Large is Large? Preliminary and relative guidelines for interpreting partial correlations in economics](https://www.deakin.edu.au/__data/assets/pdf_file/0003/408576/2011_5.pdf)”***, Doucouliagos collected 22,000 estimated effects from the economics literature and converted them to *PCCs*. He then rank-ordered them from smallest to largest. Reference points for “small”, “medium” and “large” were set at the 25th, 50th, and 75th percentile values. For the full dataset, the corresponding *PCC* values were 0.07, 0.17, and 0.33. + +Our power analysis will calculate statistical power for these three effect sizes. Specifically, we want to know how much statistical power each of the studies in the most recent issue of the *AEJ: Applied Economics* had to produce significant estimates for effect sizes corresponding to “small”, “medium”, and “large”. The results are reported in the table below. + +[![](/replication-network-blog/image-13.webp)](https://replicationnetwork.com/wp-content/uploads/2024/08/image-13.webp) + +We can use this table to answer the question: Are studies in the most recent issue of the  *AEJ: Applied Economics* underpowered? 
**Calculating Power for the Most Recent Issue of the *American Economic Journal: Applied Economics***. For our last demonstration of the usefulness of our online app, we investigate statistical power in the most recent issue of the ***[American Economic Journal: Applied Economics](https://www.aeaweb.org/issues/767)*** (July 2024, Vol. 16, No. 3). + +There are a total of 16 articles in that issue. For each article, we selected one estimate that represented the main effect. Five of the articles did not provide sufficient information to calculate power for their main effects, usually because they clustered standard errors but did not report the number of clusters. That left 11 articles/estimated effects. + +To determine statistical power, we converted all the estimates to *PCC* values and calculated their associated *se(PCC)* values (see above). We then calculated power for three effect sizes. + +To select the effect sizes, we turned to a very useful paper by Chris Doucouliagos entitled “***[How Large is Large? Preliminary and relative guidelines for interpreting partial correlations in economics](https://www.deakin.edu.au/__data/assets/pdf_file/0003/408576/2011_5.pdf)***”. Doucouliagos collected 22,000 estimated effects from the economics literature and converted them to *PCCs*. He then rank-ordered them from smallest to largest. Reference points for “small”, “medium” and “large” were set at the 25th, 50th, and 75th percentile values. For the full dataset, the corresponding *PCC* values were 0.07, 0.17, and 0.33. + +Our power analysis will calculate statistical power for these three effect sizes. Specifically, we want to know how much statistical power each of the studies in the most recent issue of the *AEJ: Applied Economics* had to produce significant estimates for effect sizes corresponding to “small”, “medium”, and “large”. The results are reported in the table below. + +[![](/replication-network-blog/image-13.webp)](https://replicationnetwork.com/wp-content/uploads/2024/08/image-13.webp) + +We can use this table to answer the question: Are studies in the most recent issue of the *AEJ: Applied Economics* underpowered? Based on a very limited sample, our answer would be some are, but most are not. The median power of the 11 studies we investigated was 81.8% for a “small” effect. These results differ substantially from what Ioannidis et al. (2017) found. Why the different conclusions? We have some ideas, but they will have to wait for a more comprehensive analysis. However, the point of this example was not to challenge Ioannidis et al.’s conclusion. It is merely to show how useful, and easy, calculating post hoc power can be. Everybody should do it! + +*NOTE: Bob Reed is Professor of Economics and the Director of **[UCMeta](https://www.canterbury.ac.nz/research/about-uc-research/research-groups-and-centres/ucmeta)** at the University of Canterbury. He can be reached at* [*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*. Thomas Logchies is a Master of Commerce (Economics) student at the University of Canterbury. He was responsible for creating the Shiny App for this blog. His email address is thomas.logchies@pg.canterbury.ac.nz.* + +**REFERENCE** + +[*Tian, J., Coupé, T., Khatua, S., Reed, W. R., & Wood, B. D. K. (2024). Power to the researchers: Calculating power after estimation. Review of Development Economics, 1–35. https://doi.org/10.1111/rode.13130*](https://onlinelibrary.wiley.com/doi/full/10.1111/rode.13130) + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2024/08/15/reed-logchies-calculating-power-after-estimation-no-programming-required/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2024/08/15/reed-logchies-calculating-power-after-estimation-no-programming-required/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-meta-analysis-and-univariate-regression-tests-for-publication-bias-seriously.md b/content/replication-hub/blog/reed-meta-analysis-and-univariate-regression-tests-for-publication-bias-seriously.md new file mode 100644 index 00000000000..1379305e13f --- /dev/null +++ b/content/replication-hub/blog/reed-meta-analysis-and-univariate-regression-tests-for-publication-bias-seriously.md @@ -0,0 +1,140 @@ +--- +title: "REED: Meta-Analysis and Univariate Regression Tests for Publication Bias – Seriously?" +date: 2023-10-10 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Effect beyond bias" + - "Egger's regression Test" + - "FAT-PET-PEESE" + - "Journal of Economic Surveys" + - "Meta-analysis" + - "publication bias" +draft: false +type: blog +--- + +[*This blog first appeared at the MAER-Net Blog under the title “Univariate Regression Tests for Publication Bias: Why Do We Do Them?”, **[see here](https://www.maer-net.org/post/univariate-regression-tests-for-publication-bias-why-do-we-do-them)**]* + +**The FAT-PET Framework**: A standard meta-analysis article goes something like this (see, for example, Knaisch and Pöschel, 2023): + +PART 1: Introduction +PART 2: Literature Review +PART 3: Description of Data +PART 4: Testing for Publication Bias +PART 5: Explaining Heterogeneity +PART 6: Best Practice Estimate +PART 7: Conclusion + +The publication bias section (PART 4) is typically built around a regression of the form: + +(1) Estimated Effect = β0 + β1 SE + ε, + +where SE is the standard error of the estimated effect. β1 is used to test for the existence of publication bias. Statistical significance is interpreted as evidence of publication bias. + +In economics, this is called the Funnel Asymmetry Test, or “FAT”. Elsewhere, it is more widely known as Egger’s regression test.
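To make Equation (1) concrete, here is a minimal sketch of how a FAT-type regression might be run on a toy dataset, using the precision weights (1/SE²) that are common in this literature. The data and variable names are simulated for illustration only; nothing here is taken from the studies discussed in this post.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy meta-analysis data: one row per collected estimate.
rng = np.random.default_rng(1)
n = 200
se = rng.uniform(0.02, 0.30, n)
effect = 0.10 + 0.8 * se + rng.normal(0, se)   # selection-like dependence on SE, for the demo

X = sm.add_constant(pd.DataFrame({"se": se}))

# Equation (1), weighted by precision (1/SE^2), as is common practice.
fat_pet = sm.WLS(effect, X, weights=1 / se**2).fit()

print(fat_pet.params)        # 'const' = intercept, 'se' = coefficient on the standard error
print(fat_pet.pvalues["se"]) # FAT: is the SE coefficient statistically significant?
```

The intercept from this regression is the quantity discussed next as the “Effect Beyond Bias”.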
+ +β0 provides an estimate of the overall mean effect after adjusting for publication bias. This is commonly referred to as “Effect Beyond Bias” and is nothing more than a prediction of the estimated effect when SE = 0. + +The test of β0=0 is known as the Precision Effect Test (“PET”). (Hence “FAT-PET”.) If β0 is significant in Equation (1), Stanley & Doucouliagos (2012) recommend that SE be replaced by SE^2 and the associated estimate of the constant term be taken as the preferred “Effect Beyond Bias”. This is what turns FAT-PET into FAT-PET-PEESE. + +**The Univariate FAT-PET**. To set the context, suppose a colleague of yours were to ask you to comment on a draft of a paper they had written estimating the effect of education on wages. They have access to a unique dataset with extensive information on worker and job/occupation characteristics. Yet their paper only reports a simple regression of wages on education. + +Surely you would tell your colleague that they will never get their paper published. They need to hold the influence of other variables constant. They need to do a more extensive regression analysis before concluding anything about the returns to education. + +Yet when it comes to testing for publication bias and estimating the overall mean effect, we give primacy to a simple regression of effect size on standard error — a practice we would typically regard as deficient in other applications. + +**Best practice is univariate + multivariate FAT-PET, right?** A standard response to this criticism is that “best practice” says you should never just estimate a univariate regression. Rather, you should also include the SE/SE^2 variable in a regression specification with other variables that are thought to affect estimated effects. The good news is, at least at first glance, this does indeed appear to be what most meta-analyses in economics do. + +I went through the *Journal of Economic Surveys* *(JOES)* and found the 20 most recently published meta-analyses. (The list of articles is given at the bottom of this blog.) + +Of these, 19 do a univariate, FAT-PET-type regression. Of these nineteen, 17 go on to include a standard error variable in a more fully specified meta-regression. So it looks like good practice is mostly being followed in meta-analyses recently published in *JOES*. Interestingly, Aiello and Bonannno (2019) skip the univariate FAT-PET and go directly to a meta-regression with multiple explanatory variables. + +**A problem with the univariate + multivariate FAT-PET approach**. One problem with the practice of doing both a univariate and a multivariate FAT-PET is that the multivariate MRA is rarely (never?) included in the section on publication bias. That is, when there is a separate section on publication bias, only the univariate version of the test is reported and used to draw a conclusion about the existence of publication bias. + +This can be misleading. Especially when the univariate and multivariate regressions lead to different conclusions. This can occur whenever the SE variable is highly correlated with other study characteristics. In my experience, I have found that this is often the case. + +For example, I presented a paper at last year’s MAER-Net on Social Capital and Economic Growth. There were 18 study characteristics in my meta-regression. A regression of SE on the 18 variables produced an R-squared of 53.8%. + +Things were no better when I substituted sample size for SE. The respective R-squared was even higher, at 68.2%. 
(As an aside, substantial correlation of sample size with study characteristics is a problem when researchers use sample size as an IV for the standard error variable.) We should not be surprised when the multivariate FAT-PET produces a different conclusion than the univariate FAT-PET in these cases. + +Two examples from my sample of 20 are Churchill et al. (2022) and Georgia et al. (2022). The respective FAT coefficients are reported in the table below, with standard errors in parentheses. In both cases, the univariate FAT estimates indicated the existence of publication bias, while the multivariate estimates did not. + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2023/10/image.webp) + +In fact, the record regarding good practice is not as good as it seems. It is true that most meta-analyses in my sample estimated a meta-regression including both the SE variable and other study characteristics. However, not all used the multivariate meta-regression to test for publication bias. There are studies in my sample that estimate a univariate FAT-PET, conclude there is evidence of publication bias, later report a multivariate meta-regression with an insignificant SE variable, but never acknowledge this as evidence against publication bias. + +The univariate FAT-PET can be misleading about the existence of publication bias. Why report it at all? Why not do like Aiello and Bonannno (2019) and go straight to the multivariate FAT-PET? + +It seems to me that estimating the influence of publication bias is conceptually no different than estimating the returns to education. In both cases, one needs to control for other factors. + +**Another problem: Effect beyond bias**. If omitted variable bias affects β1 in Equation (1), then it also affects estimates of β0. If the SE coefficient is positively biased, “Effect beyond bias” will be underestimated (assuming an overall positive effect). If the SE coefficient is negatively biased, it will be overestimated. Obtaining an unbiased estimate of the effect of publication bias is essential for estimating “Effect beyond bias”. + +If good practice calls for estimating a multivariate FAT-PET version of Equation (1), good practice should also include a corresponding estimate of the overall mean effect. That is, there should be a multivariate analogue to “Effect beyond bias” that corresponds to the univariate “Effect beyond bias”. + +This is straightforward to do when the multivariate FAT-PET is estimated using OLS. When using a weighted estimator such as FE or RE, there are some nuanced issues, though these are not difficult to address. Yet this is rarely, if ever, done. None of the 20, most recently published meta-analyses in *JOES* calculate a multivariate “Effect beyond bias”. + +To be fair, many meta-analyses report one or more “best practice” estimates. In my sample, 11 of the 20 meta-analyses predict the estimated effect size using “best study” characteristics. For example, best studies might include those based on randomized control studies, or that correct for endogeneity. Typically, they assume that SE = 0; i.e., no publication bias. + +“Best practice” estimates are good. But they are not the same thing as a multivariate analogue to the univariate FAT-PET regression. They predict the estimated effect size for a particular kind of study. They do not provide an estimate of the overall mean effect for all studies. 
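To show what a multivariate analogue could look like, here is a rough sketch that augments the toy FAT-PET regression above with two hypothetical study characteristics and then computes a multivariate “Effect Beyond Bias” by predicting the effect at SE = 0 with the moderators held at their sample means. The moderators and the at-the-means convention are illustrative assumptions for exposition, not a prescription drawn from the papers discussed here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with two hypothetical study characteristics ('rct', 'published').
rng = np.random.default_rng(2)
n = 300
d = pd.DataFrame({
    "se": rng.uniform(0.02, 0.30, n),
    "rct": rng.integers(0, 2, n),
    "published": rng.integers(0, 2, n),
})
d["effect"] = 0.05 + 0.9 * d["se"] + 0.04 * d["rct"] + rng.normal(0, d["se"])

# Multivariate FAT-PET: the SE coefficient is the publication bias test,
# holding the other study characteristics constant.
m = smf.wls("effect ~ se + rct + published", data=d, weights=1 / d["se"] ** 2).fit()
print(m.params["se"], m.pvalues["se"])

# Multivariate "Effect Beyond Bias": predicted effect at SE = 0,
# with the moderators held at their sample means.
at_means = pd.DataFrame({"se": [0.0],
                         "rct": [d["rct"].mean()],
                         "published": [d["published"].mean()]})
print(m.get_prediction(at_means).summary_frame())
```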
+ +In the absence of a multivariate “Effect beyond bias”, there is nothing to balance the “Effect beyond bias” estimates from the univariate regression. In that case, the univariate estimates will be given undue weight. + +In conclusion, I have two recommendations: + +***1) Meta-analysts should always include a multivariate FAT in the section of their paper that is devoted to testing for publication bias.*** + +***2) Meta-analysts should always include a multivariate “Effect beyond bias” alongside the univariate “Effect beyond bias” estimate.*** + +I am keen to hear what other meta-analysts think. + +*NOTE: Bob Reed is Professor of Economics and *the Director of*[***UCMeta***](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/)*at the University of Canterbury.* He can be reached at* [*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*. Special thanks go to Weilun Wu for his research assistance for this project.* + +**REFERENCES** + +Aiello, F., & Bonanno, G. (2019). Explaining differences in efficiency: A meta‐study on local government literature. *Journal of Economic Surveys*, 33(3), 999-1027. + +de Batz, L., & Kočenda, E. (2023). Financial crime and punishment: A meta-analysis. *Journal of Economic Surveys*, + +Brada, J. C., Drabek, Z., & Iwasaki, I. (2021). Does investor protection increase foreign direct investment? A meta‐analysis. *Journal of Economic Surveys*, 35(1), 34-70. + +Chletsos, M., & Sintos, A. (2023). Financial development and income inequality: A meta‐analysis. *Journal of Economic Surveys*, 37(4), 1090-1119. + +Churchill, S., Luong, H. M., & Ugur, M. (2022). Does intellectual property protection deliver economic benefits? A multi‐outcome meta‐regression analysis of the evidence. *Journal of Economic Surveys*, 36(5), 1477-1509. + +Donovan, S., de Graaff, T., de Groot, H. L., & Koopmans, C. C. (2022). Unraveling urban advantages—A meta‐analysis of agglomeration economies. *Journal of Economic Surveys*. + +Ferreira‐Lopes, A., Linhares, P., Martins, L. F., & Sequeira, T. N. (2022). Quantitative easing and economic growth in Japan: A meta‐analysis. *Journal of Economic Surveys*, 36(1), 235-268. + +Filomena, M., & Picchio, M. (2023). Retirement and health outcomes in a meta‐analytical framework. *Journal of Economic Surveys*. 37(4), 1120–1155 + +Giorgio, D. P., European Commission, & IZA. (2022). Studying abroad and earnings: A meta‐analysis. *Journal of Economic Surveys*, 36(4), 1096-1129. + +Gregor, J., Melecký, A., & Melecký, M. (2021). Interest rate pass‐through: A meta‐analysis of the literature. *Journal of Economic Surveys*, 35(1), 141-191. + +Hansen, C., Block, J., & Neuenkirch, M. (2020). Family firm performance over the business cycle: a meta‐analysis. *Journal of Economic Surveys*, 34(3), 476-511. + +Hirsch, S., Petersen, T., Koppenberg, M., & Hartmann, M. (2023). CSR and firm profitability: Evidence from a meta‐regression analysis. *Journal of Economic Surveys*, 37(3), 993-1032. + +Hubler, J., Louargant, C., Laroche, P., & Ory, J. N. (2019). How do rating agencies’decisions impact stock markets? A meta‐analysis. *Journal of Economic Surveys*, 33(4), 1173-1198. + +Knaisch, J., & Pöschel, C. (2023). Wage response to corporate income taxes: A meta-regression analysis. *Journal of Economic Surveys*, 00, 1–25. + +Kočenda, E., & Iwasaki, I. (2022). Bank survival around the World: A meta‐analytic review. *Journal of Economic Surveys*, 36(1), 108-156. + +Malovaná, S., Hodula, M., Bajzík, J., & Gric, Z. (2023). 
Bank capital, lending, and regulation: A meta-analysis. *Journal of Economic Surveys*, 00, 1–29. + +Polak, P. (2019). the euro’s trade effect: A meta‐analysis. *Journal of Economic Surveys*, 33(1), 101-124. + +Stanley, T. D., Doucouliagos, H., & Steel, P. (2018). Does ICT generate economic growth? A meta‐regression analysis. *Journal of Economic Surveys*, 32(3), 705-726. + +Vooren, M., Haelermans, C., Groot, W., & Maassen van den Brink, H. (2019). The effectiveness of active labor market policies: a meta‐analysis. *Journal of Economic Surveys*, 33(1), 125-149. + +Xue, X., Cheng, M., & Zhang, W. (2021). Does education really improve health? A meta‐analysis. *Journal of Economic Surveys*, 35(1), 71-105. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2023/10/10/reed-meta-analysis-and-univariate-regression-tests-for-publication-bias-seriously/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2023/10/10/reed-meta-analysis-and-univariate-regression-tests-for-publication-bias-seriously/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-more-on-self-correcting-science-and-replications-a-critical-review.md b/content/replication-hub/blog/reed-more-on-self-correcting-science-and-replications-a-critical-review.md new file mode 100644 index 00000000000..11d8c91949e --- /dev/null +++ b/content/replication-hub/blog/reed-more-on-self-correcting-science-and-replications-a-critical-review.md @@ -0,0 +1,171 @@ +--- +title: "REED: More on Self-Correcting Science and Replications: A Critical Review" +date: 2023-04-16 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Citations" + - "Difference-in-Differences" + - "economics" + - "Parallel Trends Assumption" + - "Psychology" + - "replications" + - "Self-correcting Science" +draft: false +type: blog +--- + +NOTE: This is a another long blog. Sorry about that! TL;DR: I provide a common framework for evaluating 5 recent papers and critically compare them. All of the papers have shortcomings. I argue that the view that the psychology papers represent a kind of “gold standard” is not justified. There is a lot left to learn on this subject. + +In a [***previous post***](https://replicationnetwork.com/2023/04/05/reed-is-science-self-correcting-evidence-from-5-recent-papers-on-the-effect-of-replications-on-citations/), I summarized 5 recent papers that attempt to estimate the causal effect of a negative replication on the original study’s citations. In this blog, I want to look a little more closely at how each of the papers attempted to do that. + +Because the three psychology papers utilized replications that were “randomly” selected (more on this below), there is the presumption that their estimates are more reliable. I want to challenge that view. In addition, I want to reiterate some concerns that have been raised by others that I think have not been fully appreciated. + +I also think it is insightful to provide a common framework for comparing and assessing the 5 papers. As I am a co-author on one of those papers, I will attempt to avoid letting my bias affect my judgment, but caveat lector! 
+ +**DID and the Importance of the Parallel Trends Assumption** + +The three psychology papers — [***Serra-Garcia & Gneezy (2021)***](https://www.science.org/doi/pdf/10.1126/sciadv.abd1705), [***Schafmeister (2021)***](https://journals.sagepub.com/doi/pdf/10.1177/09567976211005767), and [***von Hippel (2022)***](https://journals.sagepub.com/doi/pdf/10.1177/17456916211072525?casa_token=YUDI8W6J9C4AAAAA:vmCbqg2LzQJoHS6T2ix2H_2I1BX2f11ZmF2s_mVmLy4h_dfE6ugXmGFMv25qn4S4spNxYdsMx-6OXQ) – all employ a Difference-in-Difference (DID) identification strategy that relies on the assumption of parallel trends (PT). (Von Hippel also employs an alternative strategy that does not assume PT, but more on that below.) Before looking at the papers more closely, it is good to refresh ourselves on the importance of the PT assumption in DID estimation. + +FIGURE 1 below shows trends in citations for studies that had failed replications (black line) and studies that had successful replications (blue line). The treatment is revelation of the outcome of the respective replications (failed replication, successful replication), and the start time of the treatment is the date that the replication was published, T\*. + +[![](/replication-network-blog/image-9.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-9.webp) + +A researcher wants to estimate the citation effect of failed versus successful replications. The solid lines represent the observed citation trends of the original studies before and after the treatment. For the counterfactuals, the researcher assumes that the pre-treatment trends would have continued had the studies not been replicated. This is represented by the black and blue dotted lines, respectively. + +In the figure, failed replications result in fewer citations per year, represented by the flatter slope of the black line. The associated treatment effect of failed replications is the difference in slopes between the actual trend line and the counterfactual trend line, given by A, where A < 0. + +Successful replications result in more citations per year, represented by the steeper slope of the blue line. The treatment effect for positive replications is again the difference in slopes, given by B, where B > 0. + +The estimate of the total citation effect of a failed replication versus a successful replication is given by (A-B). + +The importance of the PT assumption is illustrated in FIGURE 2. Here, originals with failed replications have a steeper trend in the pre-treatment period than originals with successful replications. + +[![](/replication-network-blog/image-10.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-10.webp) + +If the researcher were to mistakenly assume that they had the same trend, say use the average of the pre-treatment trends, they would underestimate both A and B, and thus underestimate the effect of a failed replication versus a successful replication. + +This is illustrated below. The red line averages the pre-treatment trends of studies with failed replications and studies with successful replications. + +[![](/replication-network-blog/image-11.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-11.webp) + +When the averaged, common trend is used to establish the respective counterfactuals, both |A| and |B| are underestimated, so that the total citation effect of a failed replication versus a successful replication is underestimated. This is illustrated by the dotted red lines below. 
+ +[![](/replication-network-blog/image-12.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-12.webp) + +**How the PT Assumption is Incorporated in Regression Specifications** + +Now I show how the three studies incorporate the PT assumption in their regressions. + +Serra-Garcia & Gneezy: + +(1) Yit = β0i + β1Successit + β2AfterReplicationit+ β3Success×AfterReplicationit+ **Year Fixed Effects** + Control Variables + +Schafmeister: + +(2) Yit = β0 + β1Successfulit + β2Failedit+ Study Fixed Effects + **Year Fixed Effects** + Control Variables + +Von Hippel: + +(3) Yit = β0 + β1AfterFailureit + Study Fixed Effects + **Year Fixed Effects** + Control Variables + +In the equations above, Yit represents citations of study *i* in year *t*. The estimate of the citation penalty for failed versus successful replications is respectively given by β3 (Equation 1), β2 – β1 (Equation 2), and β1 (Equation 3). + +While the three specifications have some differences, all three equations include a common time trend for both failed and successful replications, represented by “Year Fixed Effects”. This imposes the PT assumption on the estimating equations. + +**What is the Basis for the PT Assumption?** + +My reading of the respective articles is that each of them depends, explicitly or implicitly, on the research design of the Reproducibility Project: Psychology (RPP) and the related Camerer et al. (2016, 2018) studies to support the assumption of PT. + +As discussed in my prior blog, RPP “randomly” selected which experiments to replicate, without regard to whether they thought the replications would be successful. As such, one could argue that there is no reason to expect pre-treatment citation trends to differ, since there was nothing about the original studies that affected the choice to replicate. + +However, random selection of experiments does not mean random assignment of outcomes. As was first pointed out to me by Paul von Hippel, just because the choice of articles to replicate was “random” does not mean that the assignment of treatments (failed/successful replications) to citation trends will be random. There could well be features of studies that affect both their likelihood of being successfully replicated and their likelihood of being cited. + +In fact, this is exactly the main point of Serra-Garcia & Gneezy’s article “Nonreplicable publications are cited more than replicable ones.” They show that nonreplicable papers were cited more frequently EVEN BEFORE it was demonstrated they were nonreplicable. + +Serra-Garcia & Gneezy have an explanation for this: “*Existing evidence … shows that experts predict well which papers will be replicated. Given this prediction, why are nonreplicable papers accepted for publication in the first place? A possible answer is that the review team faces a trade-off. When the results are more “interesting,” they apply lower standards regarding their reproducibility*.” In other words, “interesting-ness” is a confounder for both pre-treatment citations and replicability. + +FIGURE 3 from their paper (reproduced below) supports scepticism about the PT assumption. It shows pre-treatment citation trends for three sets of replicated studies. For two of them “Nature/Science” and “Psychology in rep. markets” (which corresponds to the RPP), the citation trends for original studies with failed replications show substantially higher rates of citation before treatment than those with successful replications. This is a direct violation of the PT assumption. 
+ +[![](/replication-network-blog/image-13.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-13.webp) + +In personal correspondence about my paper with Tom Coupé (discussed below), one researcher wrote me, “*I would be worried about selection bias—which papers were chosen (by others) for replication? [A] ‘trick’ to avoid selection [is] to base [your] study on papers that were replicated systematically (‘ALL experimental papers in journal X for the year Y’)*.” As should be clear from the above, studies that rely on replications from RPP and the Camerer et al. studies are not beyond criticism on this regard. + +**But wait, there’s more!** + +Tom Hardwicke, who has also examined the effect of failed replications on citations (Hardwicke et al., 2021), pointed out to me two other issues. Underlying the citation analyses of the RPP replications is the assumption that once RPP was published, readers were immediately made aware of the non-replicability of the respective studies. + +Not so fast. It is not easy to identify the studies that failed replication in RPP. They are not listed in the paper. And they are not listed in the supplementary documentation. To find them, you have to go back to the original RPP spreadsheet that contains their data. Even assuming one made it that far, identifying the studies that failed replication is not so easy. Don’t take my word for it. Check it out yourself [***here***](https://osf.io/fgjvw). + +Even assuming that one identified which studies failed replication, there is the question of whether the evidence was strong enough to change one’s views about the original study. Etz & Vandekerckhove (2021) concluded that it did not: “*Overall, 75% of studies gave qualitatively similar results in terms of the amount of evidence provided. However, the evidence was often weak (i.e., Bayes factor < 10). The majority of the studies (64%) did not provide strong evidence for either the null or the alternative hypothesis in either the original or the replication, and no replication attempts provided strong evidence in favor of the null*.” + +**Where does Ankel-Peters, Fiala, and Neubauer fit within the DID framework?** + +Before moving on to strategies that do not rely on the PT assumption, it is helpful to place [***Ankel-Peters, Fiala, and Neubauer (2023)***](https://www.rwi-essen.de/fileadmin/user_upload/RWI/Publikationen/Ruhr_Economic_Papers/REP_23_1005.pdf) in the context of the analysis above. Their main argument is represented by FIGURE 6 from their paper, reproduced below. The black line is the citations trend of original papers whose replications failed. + +[![](/replication-network-blog/image-14.webp)](https://replicationnetwork.com/wp-content/uploads/2023/04/image-14.webp) + +While they do not report this in their paper, my own analysis of AER “Comments” is that the AER rarely, very rarely, publishes successful replications. Given that almost all the replications in their dataset are failed replication, their paper can be understood as estimating the treatment effect exclusively from the solid black line in FIGURES 1 and 2; i.e., no dotted lines, no blue lines. + +**No PT Assumption: Approach #1** + +Of the 5 papers reviewed here, only two provide citation effect estimates without invoking the PT assumption. 
In addition to the model presented above, von Hippel estimates something he calls the “lagged model”: + +(4) ln(Yi,t>2015) = β0 + β1 ln(Yi,t<2015) + β2 Failurei + β3 ln(Yi,t<2015)×Failurei + +where Yi,t>2015 and Yi,t<2015 are the total citations received by original study *i* in the years after and before, respectively, the RPP replications were published in 2015. Despite its apparent similarity to a DID, the “treatment variable” in Equation (4) is NOT represented by the interaction term. The treatment effect is given by β2. The interaction term allows the citation trend for original studies with failed replications to have a different “slope” than those with successful replications. + +**No PT Assumption: Approach #2** + +Last but not least (there’s my bias slipping in!) is [***Coupé and Reed (2023)***](https://ideas.repec.org/p/cbt/econwp/22-16.html). As discussed in the previous blog, they use a matching strategy to join the original studies with studies that have not been replicated but have near-identical pre-treatment citation histories. This identification strategy is easily placed within FIGURE 2. + +Consider first original studies with failed replications. Since each of these is matched with control studies with near-identical pre-treatment citation histories, one can think of two citation trends that lie on top of each other in the pre-treatment period. Accordingly, the solid black line in FIGURE 2 in the period T < T\* now represents citation trends for both the original studies and their matched controls. Once we enter the post-treatment period, the two citation trends diverge. + +The solid black line in the period T > T\* represents the citation trend for the studies with failed replications after the results of the replication have been published. The dotted black line is the observed citation trend of the matched controls, which serve as the counterfactual for the original studies. The difference in slopes represents the treatment effect of a failed replication. + +The same story applies to the studies with successful replications, which are now represented by the blue line in FIGURE 2. Note that the black and blue lines are allowed to have different pre-treatment slopes. Thus Coupé & Reed’s matching strategy, like von Hippel’s lagged model, avoids imposing the PT assumption. + +Coupé and Reed’s approach is not entirely free from potential problems. Because the replications were not selected “randomly”, there is concern that their approach may suffer from sample selection. However, the sample selection is not the obvious one of replicators choosing highly-cited papers that they think will fail because that brings them the most attention. That sample selection is addressed by matching on the pre-treatment citation history. + +Rather, the concern is that even after controlling for identical pre-treatment citation histories, there remains some unobserved factor that (i) causes original studies and their matched controls to diverge after a replication has been published and (ii) is spuriously correlated with whether a paper has been successfully replicated. Having acknowledged that possibility, it’s not clear what that unobserved factor could be.
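As a purely stylized illustration of the matching idea, the sketch below pairs each replicated paper with the never-replicated papers whose pre-treatment citation histories are closest, and compares post-treatment citations against the matched-control average. The simulated data, the Euclidean distance metric, and the choice of five controls are stand-in assumptions for exposition; they are not Coupé and Reed’s actual procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: yearly citation counts, 5 pre- and 3 post-replication years.
# 'replicated' papers have a known replication outcome; 'pool' papers were never replicated.
rng = np.random.default_rng(0)
years_pre, years_post = 5, 3
replicated = pd.DataFrame(rng.poisson(20, size=(10, years_pre + years_post)))
pool = pd.DataFrame(rng.poisson(20, size=(200, years_pre + years_post)))

def match_controls(treated_pre, pool_pre, k=5):
    """Indices of the k pool papers whose pre-treatment citation histories
    are closest (Euclidean distance) to the treated paper's."""
    dist = np.sqrt(((pool_pre - treated_pre.values) ** 2).sum(axis=1))
    return dist.nsmallest(k).index

pool_pre = pool.iloc[:, :years_pre]
gaps = []
for _, row in replicated.iterrows():
    controls = match_controls(row.iloc[:years_pre], pool_pre)
    # Post-period citations of the original minus the matched-control average:
    # a crude stand-in for the treatment effect of being replicated.
    gaps.append(row.iloc[years_pre:].sum()
                - pool.loc[controls].iloc[:, years_pre:].sum(axis=1).mean())

print(pd.Series(gaps).describe())
```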
+ +**Summary** + +There is no silver bullet when it comes to identifying the citation effect of failed replications. Criticisms can be levelled against each of the 5 papers. This brings me to my conclusion about the current literature. + +First, while all the studies have shortcomings, they collectively provide some insight into the relationship between replications and citations. None are perfect, but I don’t think their flaws are so great as to render their analyses useless. As an aside, because of their flaws, I think there is room for more studies like Hardwicke et al. (2021) that take a case study approach. + +Second, while the evidence to date appears to indicate that neither psychology nor economics is self-correcting when it comes to failed replications, there is room for more work to be done. This is, after all, an important question. + +Comments welcome! + +**REFERENCES** + +Ankel-Peters, J., Fiala, N., & Neubauer, F. (2023). Is economics self-correcting? Replications in the American Economic Review. Ruhr Economic Papers, #1005. + +Camerer, C. F., Dreber, A., Forsell, E., Ho, T. H., Huber, J., Johannesson, M., … & Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. *Science*, 351(6280), 1433-1436. + +Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T. H., Huber, J., Johannesson, M., … & Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. *Nature Human Behaviour*, 2(9), 637-644. + +Coupé, T. & Reed, W.R. (2023). Do Replications Play a Self-Correcting Role in Economics? Mimeo, University of Canterbury. + +Etz, A., & Vandekerckhove, J. (2016). A Bayesian perspective on the reproducibility project: Psychology. *PloS One*, *11*(2), e0149794. + +Hardwicke, T. E., Szűcs, D., Thibault, R. T., Crüwell, S., van den Akker, O. R., Nuijten, M. B., & Ioannidis, J. P. (2021). Citation patterns following a strongly contradictory replication result: Four case studies from psychology. *Advances in Methods and Practices in Psychological Science*, *4*(3), 1-14. + +Open Science Collaboration (2015). Estimating the reproducibility of psychological science. *Science*, 349(6251), aac4716. + +Schafmeister, F. (2021). The effect of replications on citation patterns: Evidence from a large-scale reproducibility project. *Psychological Science*, 32(10), 1537-1548. + +Serra-Garcia, M., & Gneezy, U. (2021). Nonreplicable publications are cited more than replicable ones. *Science Advances*, 7(21), eabd1705. + +von Hippel, P. T. (2022). Is psychological science self-correcting? Citations before and after successful and failed replications. *Perspectives on Psychological Science*, 17(6), 1556-1565. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2023/04/16/reed-more-on-self-correcting-science-and-replications-a-critical-review/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2023/04/16/reed-more-on-self-correcting-science-and-replications-a-critical-review/?share=facebook) + +Like Loading...
\ No newline at end of file diff --git a/content/replication-hub/blog/reed-p-values-come-let-us-reason-together.md b/content/replication-hub/blog/reed-p-values-come-let-us-reason-together.md new file mode 100644 index 00000000000..20e5a42886f --- /dev/null +++ b/content/replication-hub/blog/reed-p-values-come-let-us-reason-together.md @@ -0,0 +1,73 @@ +--- +title: "REED: P-Values: Come, Let Us Reason Together" +date: 2019-05-14 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "American Statistical Association" + - "Binary thinking" + - "Dichotomous thinking" + - "Journal policies" + - "p-values" + - "Statistical inference" + - "The American Statistician" +draft: false +type: blog +--- + +###### Like many others, I was aware that there was controversy over null-hypothesis statistical testing. Nevertheless, I was shocked to learn that leading figures in the American Statistical Association (ASA) recently called for abolishing the term “statistical significance”. + +###### In an editorial in the ASA’s flagship journal, *The American Statistician*, ***[Ronald Wasserstein, Allen Schirm, and Nicole Lazar write](https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913)***: “*Based on our review of the articles in this special issue and the broader literature, we conclude that it is time to stop using the term ‘statistically significant’ entirely*.” + +###### The ASA advertises itself as “***[the world’s largest community of statisticians](https://www.amstat.org/ASA/about/home.aspx?hkey=6a706b5c-e60b-496b-b0c6-195c953ffdbc)***”. For many who have labored through an introductory statistics course, the heart of statistics consists of testing for statistical significance. The fact that leaders of “the world’s largest community of statisticans” are now calling for abolishing “statistical significance” is jarring. + +###### The fuel for this insurgency is an objection to dichotomous thinking: Categorizing results as either “*worthy*” or “*unworthy*”. P-values are viewed as complicit in this crime against scientific thinking because researchers use them to “*select which findings to discuss in their papers*.” + +###### Against this the authors argue: “*No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being improbable, absent, false, or unimportant. For the integrity of scientific publishing and research dissemination, therefore, whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight.*” + +###### Are *p*-values a gateway drug to dichotomous thinking? While the authors caution about the use of *p*-values, they stop short of calling for their elimination. In contrast, a number of prominent journals now ban their use (see ***[here](https://thenewstatistics.com/itns/2018/02/03/banning-p-values-the-journal-political-analysis-does-it/)*** and ***[here](https://www.sciencenews.org/blog/context/p-value-ban-small-step-journal-giant-leap-science)***). Like many controversies in statistics, the issue revolves around causality. Does the use of *p*-values cause dichotomous thinking, or does dichotomous thinking cause the use of *p*-values? + +###### Like it or not, we live in a dichotomous world. Roads have forks in them. 
Limited journal space forces researchers to decide which results to report. Limited attention spans force readers to decide which results to focus on. To suggest that eliminating *p*-values will change the dichotomous world we live in is to confuse correlation with causation. The relevant question is whether *p*-values are a suitable statistic for selecting among empirical findings. + +###### Are *p*-values “wrong”? Given all the bad press about *p*-values, one might think that there was something inherently flawed about *p*-values. As in, they mismeasure or misrepresent something. But nobody has ever accused *p*-values of being “wrong”. *P*-values measure exactly what they are supposed to. Assuming that (1) one has correctly modelled the data generating process (DGP) and associated sampling procedure, and (2) the null hypothesis is correct, *p*-values tell one how likely it is to have estimated a parameter value that is as far away, or farther, from the hypothesized value as the one observed. It is a statement about the likelihood of observing particular kinds of data conditional on the validity of given assumptions. That’s what *p*-values do, and to date nobody has accused *p*-values of doing that incorrectly. + +###### Do the assumptions underlying *p*-values render then useless? The use of single-valued hypotheses (such as the null hypothesis) and parametric assumptions about the DGP certainly vitiate the validity and robustness of statistical inference, and *p*-values. However, this can’t be the main reason why *p*-values are objectionable. The major competitors to frequentist statistics, the likelihood paradigm and Bayesian statistics, also rely on single-valued hypotheses and parametric assumptions of the DGP. Further, for those bothered by the parametric assumptions underlying the DGP, non-parametric methods are available. + +###### Do *p*-values answer the right question? Whether *p*-values are useful depends on the question one is trying to answer. Much, if not most, of estimation is concerned with estimating quantities, such as the size of the relationship between variables. In contrast, the most common use of *p*-values is to determine the existence of a relationship. However, the two are not unrelated. In measuring a quantity, it is natural to ask whether the relationship really exists or, alternatively, whether the observed relationship is the result of random chance. + +###### It is precisely on the question of existence where the controversy over *p*-values enters. *P*-values are not well-suited to determine existence. Wasserstein, Schirm, and Lazar state: “*No p-value can reveal the plausibility, presence, truth, or importance of an association or effect.”*  Technically, *p*-values report probabilities about observing certain kinds of data conditional on an underlying hypothesis being correct. They do not report the probability that the underlying hypothesis is correct. + +###### This is true. Kind of. And therein lies the rub. Consider the following thought experiment: Imagine you run two regressions. In the first regression, you regress Y on X1 and test whether the coefficient on X1 equals 0. You get a *p*-value of 0.02. In the second regression, you regress Y on X2 and test whether the coefficient on X2 equals 0. You get a *p*-value of 0.79. Which variable is more likely to have an effect on Y? X1 or X2? 
+ +###### If you respond by saying that you can’t answer that question because “*No p-value can reveal the plausibility, presence, truth, or importance of an association or effect”,* I am going to say that I don’t believe you really believe that. Yes – the *p*-value is a probability about the data, not the hypothesis. Yes – if the coefficient equals zero, you are just as likely to get a *p*-value of 0.02 as 0.79. Yes – the coefficient either equals zero or it doesn’t equal zero, and it almost certainly does not equal exactly zero, so both null hypotheses are wrong. But if you had to make a choice, even knowing all that, I contend that most researchers would choose X1. And not without reason. + +###### In the long run, performing many tests, they are more likely to be correct if they choose the variable with the lower *p*-value. Further, experience tells them that variables with low *p*-values generally have more substantial effects than variables with high *p*-values. So while it is difficult to know exactly *what* *p*-values have to say about the tested hypothesis, they say *something*. In other words, *p*-values contain information about the probability that the tested hypothesis is true. + +###### For purists who can’t bring themselves to admit this, consider some further arguments. In deciding between two competing hypotheses, Bayes factors are commonly used as evidence for/against the null hypothesis versus an alternative. But there is a one-to-one mapping between Bayes factors and *p*-values. Logic dictates that if Bayes factors contain evidentiary information about the null hypothesis, and *p*-values map one-to-one to Bayes factors, then *p*-values must also contain evidentiary information about the null hypothesis. + +###### But wait, there’s more! A recent simulation study published in the journal *Meta-Psychology* used “signal detection theory” to compare a variety of approaches for distinguishing “signals” from “no signals”. It concluded: “…*p*-values were effective, though not perfect, at discriminating between real and null effects” (***[Witt, 2019](https://open.lnu.se/index.php/metapsychology/article/view/871)***). Consistent with that, recent studies on reproducibility have found that a strong predictor of replication success is the *p*-value reported in the original study (***[Center for Open Science, 2015](https://www.researchgate.net/publication/281286234_Estimating_the_Reproducibility_of_Psychological_Science)***; ***[Altmejd et al., 2019](https://osf.io/preprints/metaarxiv/zamry/)***). + +###### Taken together, I believe the arguments above make a compelling case that *p*-values contain information in discriminating between real and spurious empirical findings, and that this information can be useful in selecting variables. + +###### *P*-values are useful, but how useful? Unfortunately, while *p*-values contain information about the existence of observed relationships, the specific content of that information is not well defined. Not only is the information ill-defined, but the measure itself is a noisy one: In a given application, the range of observed *p*-values can be quite large. For example, if the null hypothesis is true, the distribution of *p*-values will be uniform over [0,1]. Thus one is just as likely to obtain a *p*-value of 0.02 as 0.79 if there is no effect. + +###### Further complicating the interpretation of *p*-values is the fact that the computation of a *p*-value assumes a particular DGP, and this DGP is almost certainly never correct. 
For example, statistical inference typically assumes that the population effect is homogeneous. Specifically, it assumes the effect is the same for all the subjects in the sample, and the same for the subjects in the sample and the associated population. It is highly unlikely that this would ever be correct when human subjects are involved. If the underlying population effects are heterogeneous, ***[p-values will be “too small”, and so will the associated confidence intervals](https://replicationnetwork.com/2019/05/01/your-p-values-are-too-small-and-so-are-your-confidence-intervals/)***. + +###### Conclusion. We live in a dichotomous world. Banning statistical inference will not change that. Limited journal space and limited attention spans mean that researchers will always be making decisions about which results to report, and which results to pay attention to. P-values can help researchers make those decisions. + +###### That being said it, it should always be remembered that *p*-values are noisy indicators and should not be overly relied upon. The evidentiary value of a *p*-value = 0.04 is practically indistinguishable from a *p*-value = 0.06. In contrast, the evidentiary value against the null hypothesis is stronger for a *p*-value = 0.02 compared to a *p*-value = 0.79. How much stronger? That is not clear. Statistics can only take us so far. + +###### *P*-values should be one part, but only one part, of a larger suite of estimates and analyses that researchers use to learn from data. Statisticians and their ilk could do us a real service by providing greater guidance on how best to do that. The discussion about statistical inference and *p*-values would profit by veering more in this direction. + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +###### + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/05/14/reed-p-values-come-let-us-reason-together/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/05/14/reed-p-values-come-let-us-reason-together/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-post-hoc-power-analyses-good-for-nothing.md b/content/replication-hub/blog/reed-post-hoc-power-analyses-good-for-nothing.md new file mode 100644 index 00000000000..c57c73e4239 --- /dev/null +++ b/content/replication-hub/blog/reed-post-hoc-power-analyses-good-for-nothing.md @@ -0,0 +1,68 @@ +--- +title: "REED: Post-Hoc Power Analyses: Good for Nothing?" +date: 2017-05-23 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Daniel Lakens" + - "Observed Power" + - "Post-hoc Power" + - "Power" +draft: false +type: blog +--- + +###### *Observed power (or post-hoc power) is the statistical power of the test you have performed, based on the effect size estimate from your data. Statistical power is the probability of finding a statistical difference from 0 in your test (aka a ‘significant effect’), if there is a true difference to be found. Observed power differs from the true power of your test, because the true power depends on the true effect size you are examining. 
However, the true effect size is typically unknown, and therefore it is tempting to treat post-hoc power as if it is similar to the true power of your study. In this blog, I will explain why you should never calculate the observed power (except for blogs about why you should not use observed power). Observed power is a useless statistical concept.* –Daniël Lakens from ***[his blog](http://daniellakens.blogspot.co.nz/2014/12/observed-power-and-what-to-do-if-your.html)*** “Observed power, and what to do if your editor asks for post-hoc power analyses” at *The 20% Statistician* + +###### Is observed power a useless statistical concept?  Consider two researchers, each interested in estimating the effect of a treatment *T* on an outcome variable *Y*.  Each researcher assembles an independent sample of 100 observations.  Half the observations are randomly assigned the treatment, with the remaining half constituting the control group. The researchers estimate the equation *Y = a + bT + error*. + +###### The first researcher obtains the results: + +###### *[Equation1: regression results for the first researcher]* + +###### The estimated treatment effect is relatively small in size, statistically insignificant, and has a p-value of 0.72.  A colleague suggests that perhaps the researcher’s sample size is too small and, sure enough, the researcher calculates a post-hoc power value of 5.3%. + +###### The second researcher estimates the treatment effect for his sample, and obtains the following results: + +![Equation2.jpg](/replication-network-blog/equation2.webp) + +###### The estimated treatment effect is relatively large and statistically significant with a p-value below 1%.  Further, despite having the same number of observations as the first researcher, there is apparently no problem with power here, because the post-hoc power associated with these results is 91.8%. + +###### Would it surprise you to know that both samples were drawn from the same data generating process (DGP): *Y = 1.984×T + e*, where *e* ~ N(0, 5)?  The associated study has a true power of 50%. + +###### The fact that post-hoc power can differ so substantially from true power is a point that has been previously made by a number of researchers (e.g., Hoenig and Heisey, 2001), and highlighted in Lakens’ excellent blog above. + +###### The figure below presents a histogram of 10,000 simulations of the DGP, *Y = 1.984×T + e*, where *e* ~ N(0, 5), each with 100 observations, and each calculating post-hoc power following estimation of the equation.  The post-hoc power values are distributed uniformly between 0 and 100%. + +###### *[Distribution.jpg: histogram of post-hoc power values from the 10,000 simulations]*
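###### A rough re-creation of this simulation in code (not the program used to produce the figure above) might look as follows: simulate the DGP, estimate the regression, and plug the estimated effect into a noncentral *t* power calculation as if it were the true effect.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# DGP from above: Y = 1.984*T + e, e ~ N(0, 5), n = 100, half treated.
rng = np.random.default_rng(42)
n, beta, sigma, reps = 100, 1.984, 5.0, 10_000
T = np.repeat([0.0, 1.0], n // 2)
X = sm.add_constant(T)

observed_power = []
for _ in range(reps):
    y = beta * T + rng.normal(0, sigma, n)
    fit = sm.OLS(y, X).fit()
    b, se, df = fit.params[1], fit.bse[1], fit.df_resid
    tcrit = stats.t.ppf(0.975, df)
    ncp = b / se  # treat the estimate as if it were the true effect
    observed_power.append((1 - stats.nct.cdf(tcrit, df, ncp))
                          + stats.nct.cdf(-tcrit, df, ncp))

# The post-hoc power values are spread widely even though true power is ~50%.
print(np.quantile(observed_power, [0.1, 0.25, 0.5, 0.75, 0.9]))
```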
###### So are post-hoc power analyses good for nothing?  That would be the case if a finding that an estimated effect was “underpowered” told us nothing more about its true power than a finding that it had high post-hoc power.  But that is not the case.  In general, the expected value of a study’s true power will be lower for studies that are calculated to be “underpowered.” + +###### Define “underpowered” as having a post-hoc power less than 80%, with studies having post-hoc power greater than or equal to 80% deemed to be “sufficiently powered.”  The table below reports the results of a simulation exercise where “*Beta*” values are substituted into the DGP, *Y = Beta×T + e*, *e* ~ N(0, 5), such that true power values range from 10% to 90%. One thousand simulations were run for each *Beta* value, and the percentage of times that the estimated effects were calculated to be “underpowered” was recorded. + +![Table](/replication-network-blog/table.webp) + +###### If studies were uniformly distributed across power categories, the expected power for an estimated treatment effect that was calculated to be “underpowered” would be approximately 43%.  The expected power for an estimated treatment effect that was calculated to be “sufficiently powered” would be approximately 70%.  More generally, E(true power | “underpowered”) ≤ E(true power | “sufficiently powered”). + +###### At the extreme other end, if studies were massed at a given power level, say 30%, then E(true power | “underpowered”) = E(true power | “sufficiently powered”) = 30%, and there would be nothing learned from calculating post-hoc power. + +###### Assuming that studies do not all have the same power, it is safe to conclude that E(true power | “underpowered”) < E(true power | “sufficiently powered”):  Post-hoc “underpowered” studies will generally have lower true power than post-hoc “sufficiently powered” studies.  But that’s it.  Without knowing the distribution of studies across power values, we cannot calculate the expected value of true power from post-hoc power. + +###### In conclusion, it’s probably too harsh to say that post-hoc power analyses are good for nothing.  They’re just not of much practical value, since they cannot be used to calculate the expected value of the true power of a study. + +###### *Bob Reed is Professor of Economics at the University of Canterbury in New Zealand and co-founder of The Replication Network.  He can be contacted at bob.reed@canterbury.ac.nz.* + +###### REFERENCES + +###### Hoenig, John M., & Heisey, Dennis M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. *The American Statistician*, Vol. 55, No. 1, pp. 19-24. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/05/23/reed-post-hoc-power-analyses-good-for-nothing/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/05/23/reed-post-hoc-power-analyses-good-for-nothing/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-replications-in-economics-are-different-from-replications-in-psychology-and-other-thoughts.md b/content/replication-hub/blog/reed-replications-in-economics-are-different-from-replications-in-psychology-and-other-thoughts.md new file mode 100644 index 00000000000..9d3e183e5d5 --- /dev/null +++ b/content/replication-hub/blog/reed-replications-in-economics-are-different-from-replications-in-psychology-and-other-thoughts.md @@ -0,0 +1,88 @@ +--- +title: "REED: Replications in Economics are Different from Replications in Psychology, and Other Thoughts" +date: 2019-03-02 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "economics" + - "Economics E-Journal" + - "Pre-registration" + - "Psychology" + - "Replication success" + - "replications" +draft: false +type: blog +--- + +###### In July 2017, *Economics: The Open Access, Open Assessment E-Journal* issued a call for papers for a special issue on the practice of replication. The call stated, “This special issue is designed to highlight alternative approaches to doing replications, while also identifying core principles to follow when carrying out a replication.
Contributors to the special issue will each select an influential economics article that has not previously been replicated, with each contributor selecting a unique article.  Each paper will discuss how they would go about “replicating” their chosen article, and what criteria they would use to determine if the replication study “confirmed” or “disconfirmed” the original study.” + +###### The ***[special issue](http://www.economics-ejournal.org/special-areas/special-issues/the-practice-of-replication)*** was published late last year, with an accompanying ***[“Takeaways” commentary](http://www.economics-ejournal.org/economics/journalarticles/2019-13)*** appearing early this year. A total of eight articles were published in the special issue. The authors and paper titles are identified below. What follows are some thoughts from that exercise. + +###### *[Reed1(20190301): table of the eight special issue articles and their authors]* + +###### **Replications in economics are different from replications in psychology** + +###### The first takeaway from the special issue is that replications in economics are different from replications in psychology. It is common in psychology to categorize replications discretely into two categories: direct and conceptual. A good example of this is provided by the website *Curate Science*, which identifies a continuum of replications running from “direct” to “conceptual”. + +###### *[Reed2(20190301): figure showing a continuum of replication types, from “direct” to “conceptual”]* + +###### Psychology replications are more easily fitted onto a one-dimensional scale. Replications in psychology generally involve experiments. A typical concern is whether, and how closely, the replication matches the original study’s experimental design and implementation. + +###### In contrast, most empirical economic studies are based on observational, versus experimental, data (experimental/behavioral economics being a notable exception). Problems that consume economic studies, such as endogeneity or non-stationarity, are not major concerns in psychology. This cuts down on the need in psychology for a vast arsenal of econometric procedures and reduces the relative importance of alternative statistical methodologies. + +###### Another major difference is that the number of variables and observations that characterize observational studies is large relative to studies that use experimental data. Datasets in economics often have hundreds of potential variables and many thousands of observations. As a result, the garden of forking paths is bigger in economics. With more paths to explore, there is greater value in re-analyzing existing data to check for robustness. + +###### The bottom line is that economic replications are not easily compressed onto a one-dimensional scale. Consider the following two-dimensional taxonomy for replications in economics: + +###### *[Reed3(20190301): table presenting a two-dimensional taxonomy for replications in economics]* + +###### Here the dimension of measurement and analysis is distinguished from the dimension of target population. While I know of no data to support this next statement, I conjecture that a far greater share of replication studies in economics is concerned with the “vertical dimension” of empirical procedures. + +###### In fact, this is exactly what shows up in the eight studies of the special issue. The table below sorts the eight studies across the two dimensions of target population and methodology. Noteworthy is that most of the replications focus on re-analyzing the same data, either using the same or different empirical procedures. Only one study has interest in exploring the “boundaries” that determine the external validity of the original study.
+ +###### Reed4(20190301) + +###### Unless I am mistaken, this is also another difference with psychology. It seems to me that psychology has a greater interest in understanding effect heterogeneity. For example, an original study reports that men are more upset than women when their partner commits a sexual versus emotional infidelity. The original study found this result for a sample of young people (***[Buss et al., 1999](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1475-6811.1999.tb00215.x)***). A later replication was interested in exploring this result for older people (***[Shackleford et al., 2004](https://dx.doi.org/10.1007/s12110-004-1010-z)***). It is my sense, again stated without supporting evidence, that these kinds of replication studies are more common in psychology than in economics. In my opinion, this is a shortcoming of replications in economics. + +###### Compressing replications into the one-dimensional taxonomy common in psychology loses the distinction between replications focused on measurement and empirical procedures, and replications focused on establishing boundaries for external validity. Blurring this distinction may not be a great loss for psychology, but it is for economics, because it can hide “gaps” in the things that economic replications study (e.g., effect heterogeneity). + +###### **Whatever you call replications, you should call them replications** + +###### As represented in FIGURE 2, the special issue used a taxonomy that identified no less than six types of replications: Reproductions, Repetitions, Extensions, and three types of robustness checks. The number of such taxonomies is large and growing. In addition to Direct versus Indirect Replications, other classifications include (i) ***[Verification, Reproduction, Re-analysis, and Extension](http://ftp.iza.org/dp9000.pdf)***, (ii) ***[Replication, Reproduction and Re-analysis](https://pdfs.semanticscholar.org/db6b/69df320e1dd318f890234ec7f799d8597d74.pdf)***, (iii) ***[Reproducibility, Replicability, and Generalization](https://www.nsf.gov/sbe/SBE_Spring_2015_AC_Meeting_Presentations/Bollen_Report_on_Replicability_SubcommitteeMay_2015.pdf)***, and (iv) ***[Pure replication, Statistical replication, and Scientific replication](http://ftp.iza.org/dp2760.pdf)***. + +###### Does it make a difference? Yes, it makes a huge difference. But not for the reason most people give. Most commentators argue for a particular classification system in order to distinguish different types of replications. Much more important than distinguishing different shades of replications is that the literature be able to distinguish, and identify, replications from other types of empirical studies. + +###### The biggest problem with replications is being able to find them. The confusing tangle of alternative replication vocabularies is not helping. For replications to make a difference, researchers need to know of their existence. They need to be easily identifiable in search algorithms. If a study calls itself a “re-analysis” rather than a replication, a researcher who searches for replications may miss it. Who cares about the fine point of distinguishing one type of replication from another when the replication is never read? + +###### I don’t know which taxonomy is best. But I believe that all taxonomies should have the word “replication” in each of the categories so that they can be easily identified by search algorithms. 
Thus, I don’t care if somebody wants to use “Pure replication”/“Statistical replication”/ “Scientific replication”, or “Verification replication”/“Reproduction replication”/ “Re-analysis replication”/“Extension replication”, as long as the word “replication” appears in the text, ideally in the abstract. That makes it easy for search algorithms to find the paper, which is crucial if the paper is to be read. + +###### **There is no single standard for replication success** + +###### The eight papers in the special issue offered a variety of criteria for “replication success”. How one defines replication “success” depends on the goal of the replication. If the goal is to double check that the numbers in a published study are correct, then, as McCullough emphasizes, anything less than 100% reproduction is a failure: “For linear procedures with moderately-sized datasets, there should be ten digit agreement, for nonlinear procedures there may be as few as four or five digits of agreement” (McCullough, 2018, page 3). + +###### Things become complicated if, instead, the goal is to determine if the claim from an original study is “true.” This is illustrated by the variety of criteria for replication “success” offered by the studies of the special issue. For Hannum, success depends on the significance of the estimated coefficient of a key variable. Owen suggests a battery of tests based upon significance testing, but acknowledges “fallacies of acceptance and rejection” as challenges to interpreting test results. Coupé proposes counting all the parameters that are reproduced exactly and calculating a percentage correct index, perhaps weighted by the importance of the respective parameters. Daniels & Kakar identify success if the replicated parameters have “the same size and significance for all specifications”, though they do not define what constitutes “the same”. Wood & Vasquez shy away from even using the words “success” or “failure”. Instead, they see the purpose of replication as contributing to a “research dialogue”. They advocate a holistic approach, “looking for similar coefficient sizes, direction of coefficients, and statistical significance”. + +###### The nut of the problem is illustrated by ***[Reed (2018)](https://wol.iza.org/articles/replication-in-labor-economics)*** in the following example: “Suppose a study reports that a 10% increase in unemployment benefits is estimated to increase unemployment duration by 5%, with a 95% confidence interval of [4%, 6%]. Two subsequent replications are undertaken. Replication #1 finds a mean effect of 2% with corresponding confidence interval of [1%, 3%]. Replication #2 estimates a mean effect of 5%, but the effect is insignificant with a corresponding confidence interval of [0%, 10%]. In other words, consistent with the original study, Replication #1 finds that unemployment durations are positively and significantly associated with unemployment insurance benefits. However, the estimated effect falls significantly short of the effect reported by the original study. Replication #2 estimates a mean effect exactly the same as the original, but due to its imprecision, the effect is statistically insignificant. Did either of the two replications “successfully replicate” the original? Did both? Did none?” + +###### This problem is not unique to economics and observational studies. 
Despite the fact that many experimental studies define success as “a significant effect in the same direction as the original study” (***[Camerer et al., 2018](https://www.nature.com/articles/s41562-018-0399-z)***), there exist many definitions of “replication success” in the experimental literature. ***[Open Science Collaboration (2015)](http://science.sciencemag.org/content/349/6251/aac4716?ijkey=bf1072f0a1a07d1ff2ca2729a1ecf34e96cde311&keytype2=tf_ipsecsha)*** used five definitions of replication success. And *Curate Science* identifies six outcomes for categorizing replication outcomes (see below). + +###### Reed5(20190301) + +###### This has important implications for assessments of the “reproducibility” of science. For example, ***[the recently announced, DARPA-funded, SCORE Project](https://cos.io/about/news/can-machines-determine-credibility-research-claims-center-open-science-joins-new-darpa-program-find-out/)*** (“Systematizing Confidence in Open Research and Evidence”) intends to develop algorithms for assessing approximately 30,000 findings from the social-behavioral sciences. Towards that end, experts will “review and score about 3,000 of those claims in surveys, panels, or prediction markets for their likelihood of being reproducible findings.” The criteria used to define “replication success” will have a huge influence on the results of the project, and the interpretation of those results. + +###### **The value of pre-registration** + +###### Pre-registration has received ***[much attention](https://www.pnas.org/content/pnas/early/2018/03/08/1708274114.full.pdf)*** by the practitioners of open science. There is hope that pre-registration can help solve the “replication crisis.” As part of a series on pre-registration hosted by the Psychonomic Society, ***[Klaus Oberauer](https://featuredcontent.psychonomic.org/preregistration-of-a-forking-path-what-does-it-add-to-the-garden-of-evidence/)*** argues that our efforts should not be focused on pre-registration, but on making data and code available so other researchers can explore alternative forking paths: “If there are multiple equally justifiable analysis paths, we should run all of them, or a representative sample, to see whether our results are robust. … making the raw data publicly available enables other researchers … to run their own analyses … It seems to me that, once publication of the raw data becomes common practice, we have all we need to guard against bias in the choice of analysis paths without giving undue weight to the outcome of one analysis method that a research team happens to preregister.” + +###### I agree with Oberauer that the bigger issue is making data and code available. As is ensuring that there are outlets to publish the results of replications. However, even if data and code are ubiquitous and replications publishable, there will still be value in pre-registering replication studies. In assessing the results of a replication study, there is a difference in how one interprets “I did one thing that I thought was most important and the results did not replicate” and “I did 10 things looking for problems and found one thing that didn’t replicate.” Pre-registration can establish which of these applies. + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. 
He can be contacted at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2019/03/02/reed-replications-in-economics-are-different-from-replications-in-psychology-and-other-thoughts/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2019/03/02/reed-replications-in-economics-are-different-from-replications-in-psychology-and-other-thoughts/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-the-devil-the-deep-blue-sea-and-replication.md b/content/replication-hub/blog/reed-the-devil-the-deep-blue-sea-and-replication.md new file mode 100644 index 00000000000..f67419e92c7 --- /dev/null +++ b/content/replication-hub/blog/reed-the-devil-the-deep-blue-sea-and-replication.md @@ -0,0 +1,45 @@ +--- +title: "REED: The Devil, the Deep Blue Sea, and Replication" +date: 2018-12-01 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Danielle Navarro" + - "Model selection" + - "Overfitting" + - "Replication success" + - "Underfitting" +draft: false +type: blog +--- + +###### In a recent article (“***[Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection](https://link.springer.com/article/10.1007/s42113-018-0019-z)***” published in *Computational Brain & Behavior),*Danielle Navarro identifies blurry edges around the subject of model selection. The article is a tour de force in thinking largely about statistical model selection. She writes, + +###### “*What goal does model selection serve when all models are known to be systematically wrong? How might “toy problems” tell a misleading story? How does the scientific goal of explanation align with (or differ from) traditional statistical concerns? I do not offer answers to these questions, but hope to highlight the reasons why psychological researchers cannot avoid asking them.”* + +###### She goes on to say that researchers often see model selection as… + +###### “…*a perilous dilemma in which one is caught between two beasts from classical mythology, the Scylla of overfitting and the Charybdis of underfitting. I find myself often on the horns of a quite different dilemma, namely the tension between the devil of statistical decision making and the deep blue sea of addressing scientific questions. If I have any strong opinion at all on this topic, it is that much of the model selection literature places too much emphasis on the statistical issues of model choice and too little on the scientific questions to which they attach*.” + +###### The article never mentions nor alludes to replication, but it seems to me the issue of model selection is conceptually related to the issue of “replication success” in economics and other social sciences. Numerous attempts have been developed to quantitatively define “replication success” (for a recent effort, ***[see here](https://twitter.com/HeldLeonhard/status/1067339846434332673)***). But just as the issue of model selection demands more than a goodness-of-fit number can supply, so the issue of “replication success” requires more than constructing a confidence interval for “the true effect” or calculating a p-value for some hypothesis about “the true effect”. + +###### For starters, it’s not clear there is a single, “true effect.” Let’s suppose there is. 
Maybe the original study was content to demonstrate the existence of “an effect.” So replication success should be content with this as well. Alternatively, maybe the goal of the original study was to demonstrate that “the effect” was equal to a specific numerical value. This is a common situation in economics. For example, in the evaluation of public policies, it is not sufficient to show that a policy will have a desirable outcome, but rather that the benefit it produces is greater than the cost. The numbers matter, not just the sign of the effect. Accordingly, the definition of replication success will be different. + +###### This is exactly the conclusion from the ***[recent issue](http://www.economics-ejournal.org/special-areas/special-issues/the-practice-of-replication)*** in *Economics* on *The Practice of Replication* (***[see here for “takeaways” from that issue](https://ideas.repec.org/p/cbt/econwp/18-22.html)***). There is no single measure of replication success because scientific studies do not all have the same purpose. While the purposes of studies can perhaps be categorized, and replication success defined within specific categories — though this is yet to be demonstrated — it is certainly the case that there is no single scientific purpose, and thus no single measure of replication success. + +###### Speaking of her own field of human cognition, Navarro writes, + +###### *“To my way of thinking, understanding how the qualitative patterns in the empirical data emerge naturally from a computational model of a psychological process is often more scientifically useful than presenting a quantified measure of its performance, but it is the latter that we focus on in the “model selection” literature. Given how little psychologists understand about the varied ways in which human cognition works, and given the artificiality of most experimental studies, I often wonder what purpose is served by quantifying a model’s ability to make precise predictions about every detail in the data.”* + +###### People are complex and complicated. Aggregating them into markets and economies does not make them easier to understand. Thus the points that Navarro makes apply *a fortiori* to replication and economics. + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/12/01/reed-the-devil-the-deep-blue-sea-and-replication/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/12/01/reed-the-devil-the-deep-blue-sea-and-replication/?share=facebook) + +Like Loading...
\ No newline at end of file diff --git a/content/replication-hub/blog/reed-the-replication-crisis-a-single-replication-can-make-a-big-difference.md b/content/replication-hub/blog/reed-the-replication-crisis-a-single-replication-can-make-a-big-difference.md new file mode 100644 index 00000000000..48a88ac351a --- /dev/null +++ b/content/replication-hub/blog/reed-the-replication-crisis-a-single-replication-can-make-a-big-difference.md @@ -0,0 +1,84 @@ +--- +title: "REED: The Replication Crisis – A Single Replication Can Make a Big Difference" +date: 2018-01-05 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "false positive rate" + - "replication" + - "Reproducibility crisis" + - "RSS" +draft: false +type: blog +--- + +###### *[This post is based on the paper, **[“A Primer on the ‘Reproducibility Crisis’ and Ways to Fix It”](http://www.econ.canterbury.ac.nz/RePEc/cbt/econwp/1721.pdf)** by the author]* + +###### In a ***[previous post](https://replicationnetwork.com/2017/12/15/reed-why-lowering-alpha-to-0-005-is-unlikely-to-help/)***, I argued that lowering *α* from 0.05 to 0.005, as advocated by ***[Benjamin et al. (2017](https://www.nature.com/articles/s41562-017-0189-z))*** – henceforth B72 for the 72 coauthors on the paper, would do little to improve science’s reproducibility problem. Among other things, B72 argue that reducing *α* to 0.005 would reduce the “false positive rate” (*FPR*). A lower *FPR* would make it more likely that significant estimates in the literature represented real results. This, in turn, should result in a higher rate of reproducibility, directly addressing science’s reproducibility crisis. However, B72’s analysis ignores the role of publication bias; i.e., the preference of journals and researchers to report statistically significant results. As my previous post demonstrated, incorporating reasonable parameters for publication bias nullifies the *FPR* benefits of reducing *α.* + +###### What, then, can be done to improve reproducibility? In this post, I return to B72’s *FPR* framework to demonstrate that replications offer much promise. In fact, a single replication has a sizeable effect on the *FPR* over a wide variety of parameter values. + +###### Let *α* and *β* represent the rates of Type I and Type II error associated with a 5 percent significance level, with *Power* accordingly being given by (1-*β*).  Let *ϕ* be the prior probability that *H0* is true. Consider a large number of “similar” studies, all exploring possible relationships between different *x*’s and *y*’s. Some of these relationships will really exist in the population, and some will not. *ϕ* is the probability that a randomly chosen study estimates a relationship where none really exists. *ϕ* is usefully transformed to *Prior Odds*, defined as Pr(*H1*)/Pr(*H0*) = (1- *ϕ*)/*ϕ*, where *H1* and *H0* correspond to the hypotheses that a real relationship exists and does not exist, respectively. B72 posit the following range of *Prior Odds* values as plausible for real-life research scenarios: (i) 1:40, (ii) 1:10, and (iii) 1:5. + +###### We are now in position to define the *False Positive Rate*. Let *ϕα* be the probability that no relationship exists but Type I error nevertheless produces a significant finding. Let (1-*ϕ*)(1-*β*) be the probability that a relationship exists and the study has sufficient power to identify it. 
The percent of significant estimates in published studies for which there is no underlying, real relationship is thus given by + +###### (1) *False Positive Rate (FPR) = ϕα / [ϕα + (1-ϕ)(1-β)].* + +###### Table 1 reports *FPR* values for different *Prior Odds* and *Power* values when *α* = 0.05. The *FPR* values in the table range from 0.24 to 0.91. For example, given 1:10 odds that a studied effect is real, and assuming studies have *Power* equal to 0.50 – the same *Power* value that ***[Christensen and Miguel (2017)](https://escholarship.org/uc/item/52h6x1cq)*** assume in their analysis – the probability that a statistically significant finding is really a false positive is 50%. Alternatively, if we take a *Power* value of 0.20, which is about equal to the value that ***[Ioannidis et al. (2017)](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12461/full)*** report as the median value for empirical research in economics, the *FPR* rises to 71%. + +###### Table1 + +###### Table 1 illustrates the reproducibility problem highlighted by B72. The combination of (i) many thousands of researchers searching for significant relationships, (ii) relatively small odds that any given study is estimating a relationship that really exists, and (iii) a 5% Type I error rate, results in the published literature reporting a large number of false positives, even without adding in publication bias. In particular, for reasonable parameter values, it is very plausible that over half of all published, statistically significant estimates represent null effects. + +###### I use this framework to show what a difference a single replication can make. The *FPR* values in Table 1 present the updated probabilities (starting from *ϕ*) that an estimated relationship represents a true null effect after an original study is published that reports a significant finding. I call these “Initial FPR” values. Replication allows a further updating, with the new, updated probabilities depending on whether the replication is successful or unsuccessful. These new, updated probabilities are given below. + +###### (2a) *Updated FPR(Replication Successful) = InitialFPR∙α / [InitialFPR∙α + (1-InitialFPR)∙(1-β)].* + +###### (2b) *Updated FPR(Replication Unsuccessful) = InitialFPR∙(1-α) / [InitialFPR∙(1-α) + (1-InitialFPR)∙β].* + +###### Table 2 reports the *Updated FPR* values, depending on whether a replication is successful or unsuccessful, with *Initial FPR* values roughly based on the values in Table 1. Note that *Power* refers to the power of the replication studies. + +###### Table2 + +###### The *Updated FPR* values show what a difference a single replication can make. Suppose that the *Initial FPR* following the publication of a significant finding in the literature is 50%. A replication study is conducted using independent data drawn from the same population. If we assume the replication study has *Power* equal to 0.50, and if the replication fails to reproduce the significant finding of the original study, the *FPR* increases from 50% to 66%. However, if the replication study successfully replicates the original study, the *FPR* falls to 9%. In other words, following the replication, there is now a 91% probability that the finding represents a real effect in the population. + +###### Table 2 demonstrates that replications have a sizeable effect on *FPRs* across a wide range of *Power* and *Initial FPR* values. In some cases, the effect is dramatic. For example, consider the case (*Initial FPR* = 0.80, *Power* = 0.80).
In this case, a single, successful replication lowers the false positive rate from 80% to 20%.  As would be expected, the effects are largest for high-powered replication studies. But the effects are sizeable even when replication studies have relatively low power. For example, given (*Initial FPR* = 0.80, *Power* = 0.20), a successful replication lowers the *FPR* from 80% to 50%. + +###### Up to now, we have ignored the role of publication bias. As noted above, publication bias greatly affects the *FPR* analysis of B72.  One might similarly ask how publication bias affects the analysis above. If we assume that publication bias is, in the words of ***[Maniadis et al. (2017)](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12527/full)*** “adversarial” – that is, the journals are more likely to publish a replication study if it can be shown to refute an original study – then it turns out that publication bias has virtually no effect on the values in Table 2. + +###### This is most easily seen if we introduce publication bias to Equation (2a) above. Following ***[Maniadis et al. (2017)](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12527/full)***, let *ω* represent the decreased probability that a replication study reports a significant finding due to adversarial publication bias. Then if the probability of obtaining a significant finding given no real effect is *InitialFPR∙α* in the absence of publication bias, the associated probability with publication bias will be *InitialFPR∙α∙(1-ω)*. Likewise, if the probability of obtaining a significant finding when a real effect exists is *(1-InitialFPR)∙(1-β)* in the absence of publication bias, the associated probability with publication bias will be *(1-InitialFPR)∙(1-β)∙(1-ω)*. It follows that the *Updated FPR* from a successful replication given adversarial publication bias is given by + +###### (3) *Updated FPR(Replication Successful|Adversarial Publication Bias) = FPR∙α∙(1-ω) / [FPR∙α∙(1-ω) +(1-FPR)∙(1-β)∙(1-ω)] .* + +###### Note that the publication bias term in Equation (3), *(1-ω)*, cancels out from the numerator and denominator, so that the *Updated FPR* in the event of a successful replication is unaffected. The calculation for unsuccessful replications is not quite as straightforward, but the result is very similar: the *Updated FPR* is little changed by the introduction of adversarial publication bias. + +###### It needs to be pointed out that the analysis above refers to a special type of replication, one which reproduces the experimental conditions (data preparation, analytical procedures, etc.) of the original study, albeit using independent data drawn from an identical population. In fact, there are many types of replications. Figure 1 (see below) from ***[Reed (2017)](http://www.econ.canterbury.ac.nz/RePEc/cbt/econwp/1721.pdf)*** presents six different types of replications. The analysis above clearly does not apply to some of these. + +###### For example, *Power* is an irrelevant concept in a Type 1 replication study, since this type of replication (“Reproduction”) is nothing more than a checking exercise to ensure that numbers are correctly calculated and reported. The *FPR* calculations above are most appropriate for Type 3 replications, where identical procedures are applied to data drawn from the same population as the original study. The further replications deviate from a Type 3 model, the less applicable are the associated *FPR* values. 
Even so, the numbers in Table 2 are useful for illustrating the potential for replication to substantially alter the probability that a significant estimate represents a true relationship. + +###### Figure1 + +###### There is much debate about how to improve reproducibility in science. Pre-registration of research, publishing null findings, “badges” for data and code sharing, and results-free review have all received much attention in this debate. All of these deserve support. While replications have also received attention, this has not translated into a dramatic increase in the number of published replication studies (*[see here](https://replicationnetwork.com/replication-studies/)*). The analysis above suggests that maybe, when it comes to replications, we should take a lead from the title of that country-western classic: “***[A Little Less Talk And A Lot More Action](https://www.youtube.com/watch?v=XI7YzUKE_wI)***”. + +###### Of course, all of the above ignores the debate around whether null hypothesis significance testing is an appropriate procedure for determining “replication success.” But that is a topic for another day. + +###### **REFERENCES** + +###### ***[Benjamin, D.J., Berger, J.O., Johannesson, M. Nosek, B.A., Wagenmakers, E.-J., Berk, R., …, Johnson, V.E. (2017). Redefine statistical significance. Nature Human Behaviour, 1(0189).](https://www.nature.com/articles/s41562-017-0189-z)*** + +###### ***[Christensen, G.S. and Miguel, E. (2016). Transparency, reproducibility, and the credibility of economics research. CEGA Working Paper Series No. WPS-065. Center for Effective Global Action. University of California, Berkeley.](https://escholarship.org/uc/item/52h6x1cq)*** + +###### ***[Ioannidis, J.P., Doucouliagos, H. and Stanley, T. (2017). The power of bias in economics. Economic Journal 127(605): F236-65.](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12461/full)*** + +###### ***[Maniadis, Z., Tufano, F., and List, J.A. (2017). To replicate or not to replicate? Exploring reproducibility in economics through the lens of a model and a pilot study. Economic Journal, 127(605): F209-F235.](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12527/full)*** + +###### ***[Reed, W.R. (2017). A primer on the “reproducibility crisis” and ways to fix it. Working Paper No. 21/2017, Department of Economics and Finance, University of Canterbury, New Zealand.](http://www.econ.canterbury.ac.nz/RePEc/cbt/econwp/1721.pdf)*** + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/01/05/reed-a-single-replication-can-make-a-big-difference/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/01/05/reed-a-single-replication-can-make-a-big-difference/?share=facebook) + +Like Loading... 
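###### To make the arithmetic in Equations (1), (2a), and (2b) concrete, the short R sketch below codes them up and reproduces the 9% and 66% figures discussed around Table 2. It is a minimal illustration rather than the code behind the paper; the function names and default parameter values are illustrative only.

```r
# Minimal sketch of the false positive rate arithmetic in Equations (1), (2a), (2b).
# Function names and defaults are illustrative, not taken from the underlying paper.

# Equation (1): FPR implied by the prior probability that H0 is true (phi),
# the significance level (alpha), and power (1 - beta).
initial_fpr <- function(phi, alpha = 0.05, power = 0.5) {
  (phi * alpha) / (phi * alpha + (1 - phi) * power)
}

# Equations (2a) and (2b): updating the FPR after a single replication.
updated_fpr <- function(fpr, alpha = 0.05, power = 0.5, successful = TRUE) {
  if (successful) {
    (fpr * alpha) / (fpr * alpha + (1 - fpr) * power)                    # Eq. (2a)
  } else {
    (fpr * (1 - alpha)) / (fpr * (1 - alpha) + (1 - fpr) * (1 - power))  # Eq. (2b)
  }
}

# Worked example from the text: Initial FPR = 50%, replication Power = 0.50.
updated_fpr(0.5, power = 0.5, successful = TRUE)   # ~0.09 (9%)
updated_fpr(0.5, power = 0.5, successful = FALSE)  # ~0.66 (66%)

# Prior odds of 1:10 that an effect is real imply phi = 10/11; with Power = 0.50
# this gives the 50% Initial FPR cited from Table 1.
initial_fpr(10 / 11, power = 0.5)                  # 0.5

# As noted for Eq. (3): multiplying numerator and denominator of the "successful"
# branch by an adversarial publication-bias factor (1 - omega) cancels out,
# leaving the updated FPR unchanged.
```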
\ No newline at end of file diff --git a/content/replication-hub/blog/reed-the-state-of-replications-in-economics-a-2020-review-part-1.md b/content/replication-hub/blog/reed-the-state-of-replications-in-economics-a-2020-review-part-1.md new file mode 100644 index 00000000000..7efd69e7308 --- /dev/null +++ b/content/replication-hub/blog/reed-the-state-of-replications-in-economics-a-2020-review-part-1.md @@ -0,0 +1,64 @@ +--- +title: "REED: The State of Replications in Economics – A 2020 Review (Part 1)" +date: 2021-01-06 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Definition" + - "economics" + - "Economics Journals" + - "Growth" +draft: false +type: blog +--- + +This post is based on a keynote presentation I gave at the Editor’s Meeting of the *International Journal for Re-Views of Empirical Economics* in June 2020. It loosely follows up two previous attempts to summarize the state of replications in economics: (i) An initial paper by Maren Duvendack, Richard Palmer-Jones, and myself entitled “[***Replications in Economics: A Progress Report***](https://econjwatch.org/articles/replications-in-economics-a-progress-report)”, published in *Econ Journal Watch* in 2015; and (ii) a blog I wrote for The Replication Network (*TRN*) entitled “[***An Update on the Progress of Replications in Economics***](https://replicationnetwork.com/2018/10/31/reed-an-update-on-the-progress-of-replications-in-economics/)”, posted in October 2018. + +In this instalment, I address two issues: + +– Are there more replications in economics than there used to be? + +– Which journals publish replications? + +**Are there more replications in economics than there used to be?** + +Before we count replications, we need to know what we are counting. Researchers use different definitions of replications, which produce different numbers. For example, at the time of this writing, [***Replication Wiki***](http://replication.uni-goettingen.de/wiki/index.php/Main_Page) reports 670 replications at their website. In contrast, *TRN*, which relies heavily on Replication Wiki, [***lists 491 replications***](https://replicationnetwork.com/replication-studies/). + +Why the difference? *TRN* employs a narrower definition of a replication. Specifically, it defines a replication as “any study published in a peer-reviewed journal whose main purpose is to determine the validity of one or more empirical results from a previously published study.” + +Replications come in many sizes and shapes. For example, sometimes a researcher will develop a new estimator and want to see how it compares with another estimator. Accordingly, they replicate a previous study using the new estimator. An example is De Chaisemartin & d’Haultfoeuille’s “[***Fuzzy differences-in-differences***](https://academic.oup.com/restud/article/85/2/999/4096388?casa_token=kWBMX6-AkN4AAAAA:fFl5zN5USxeOFdY0pZc2nB2VygICrc24PtD67bwOls2_ifuzYBnwMmPuqg4qZINFnDwF0iOcmFrNbA)” (*Review of Economic Studies*, 2018). D&H develop a DID estimator that accounts for heterogeneous treatment effects when the rate of treatment changes over time. To see the difference it makes, they replicate [***Duflo (2001)***](https://www.aeaweb.org/articles?id=10.1257/aer.91.4.795) which uses a standard DID estimator. + +Replication Wiki counts D&H as a replication. *TRN* does not. The reason *TRN* does not count D&H as a replication is because the main purpose of D&H is not to determine whether Duflo (2001) is correct. 
The main purpose of D&H is to illustrate the difference their estimator makes. This highlights the grey area that separates replications from other studies. + +Reasonable people can disagree about the “best” definition of replication. I like *TRN’s* definition because it restricts attention to studies whose main goal is to determine “the truth” of a claim by a previous study. Studies that meet this criterion tend to be more intensive in their analysis of the original study and give it a more thorough empirical treatment. A further benefit is that *TRN* has consistently applied the same definition of replication over time, facilitating time series comparisons. + +FIGURE 1 shows the growth in replications in economics over time. The graph is somewhat misleading because 2019 was an exceptional year, driven by special replication issues at the *Journal of Development Studies*, the *Journal of Development Effectiveness*, and, especially, *Energy Economics*. In contrast, 2020 will likely end up having closer to 20 replications. Even ignoring the big blip in 2019, it is clear that there has been a general upwards creep in the number of replications published in economics over time. It is, however, a creep, and not a leap. Given that there are ***[approximately 40,000 articles published annually in Web of Science economics journals](https://www.aeaweb.org/articles?id=10.1257/jel.51.1.144)***, the increase over time does not indicate a major shift in how the economics discipline values replications. + +[![](/replication-network-blog/trn120210106.webp)](https://replicationnetwork.com/wp-content/uploads/2021/01/trn120210106.webp) + +**Which journals publish replications?** + +TABLE 1 reports the top 10 economics journals in terms of total number of replications published in their journal lifetimes. Over the years, a consistent leader in the publishing of replications has been the *Journal of Applied Econometrics*. In second place is the *American Economic Review*. However, an important distinction between these two journals is that *JAE* publishes both positive and negative replications; that is, replications that both confirm and refute the original studies. In contrast, the *AER* only very rarely publishes a positive replication. + +[![](/replication-network-blog/trn220210106.webp)](https://replicationnetwork.com/wp-content/uploads/2021/01/trn220210106.webp) + +There have been several new initiatives by journals to publish replications. Notably, the [***International Journal for Re-Views of Empirical Economics (IREE)***](https://www.iree.eu/) was started in 2017 and is solely dedicated to the publishing of replications. It is an open access journal with no author processing charges (APCs), supported by a consortium of private and public funders. As of January 2021, it had published 10 replication studies. + +To place the numbers in TABLE 1 in context, there are approximately 400 mainline economics journals. About one fourth (96) have ever published a replication. 2 journals account for approximately 25% of all replications that have ever been published. 9 journals account for over half of all replication studies. Only 25 journals (about 6% of all journals) have ever published more than 5 replications in their lifetimes. + +**Conclusion** + +While a little late to the party, economists have recently made noises about the importance of replication in their discipline. 
Notably, the ***[2017 Papers and Proceedings issue of the American Economic Review](https://www.aeaweb.org/issues/465)*** prominently featured 8 articles addressing various aspects of replications in economics. And indeed, there has been an increase in the number of replications over time. However, the growth in replications is best described as an upwards creep rather than a bold leap. + +Perhaps the reason replications have not really caught on is because fundamental questions about replications have not been addressed. Is there a replication crisis in economics? How should “replication success” be measured? What is the “success rate” of replications in economics? How should the results of replications be interpreted? Do replications have a unique role to play in contributing to our understanding of economic phenomena? I take these up in subsequent instalments of this blog (to read the next instalment, click ***[here](https://replicationnetwork.com/2021/01/07/reed-the-state-of-replications-in-economics-a-2020-review-part-2/)***). + +*Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network and Principal Investigator at [**UCMeta**](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/). He can be contacted at bob.reed@canterbury.ac.nz.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2021/01/06/reed-the-state-of-replications-in-economics-a-2020-review-part-1/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2021/01/06/reed-the-state-of-replications-in-economics-a-2020-review-part-1/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-the-state-of-replications-in-economics-a-2020-review-part-2.md b/content/replication-hub/blog/reed-the-state-of-replications-in-economics-a-2020-review-part-2.md new file mode 100644 index 00000000000..917b36f9921 --- /dev/null +++ b/content/replication-hub/blog/reed-the-state-of-replications-in-economics-a-2020-review-part-2.md @@ -0,0 +1,74 @@ +--- +title: "REED: The State of Replications in Economics – A 2020 Review (Part 2)" +date: 2021-01-07 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "economics" + - "Economics Journals" + - "Machine learning" + - "replication rate" + - "Replication success" + - "Statistical power" +draft: false +type: blog +--- + +This instalment follows on ***[yesterday’s post](https://replicationnetwork.com/2021/01/06/reed-the-state-of-replications-in-economics-a-2020-review-part-1/)*** where I addressed two questions: Are there more replications in economics than there used to be? And, Which journals publish replications? These questions deal with the descriptive aspect of replications. We saw that replications seemingly constitute a relatively small — arguably negligible – component of the empirical output of economists. And while that component appears to be growing, it is growing at a rate that is, for all practical purposes, inconsequential. I would like to move on to more prescriptive/normative subjects. + +Before I can get there, however, I need to acknowledge that the assessment above relies on a very specific definition of a replication, and that the sample of replications on which it is based is primarily drawn from one data source: ***[Replication Wiki](http://replication.uni-goettingen.de/wiki/index.php/Main_Page)***. 
Is it possible that there are a lot more replications “out there” that are not being counted? More generally, is it even feasible to know how many replications there are? + +**Is it possible to know how many replications there are?** + +One of the most comprehensive assessments of the number of replications in economics was done in a study by Frank Mueller-Langer, Benedikt Fecher, Dietmar Harhoff, and Gert Wagner, published in *Research Policy* in 2019 and blogged about ***[here](https://replicationnetwork.com/2018/10/19/mueller-langer-et-al-replication-in-economics/)***. ML et al. reviewed all articles published in the top 50 economics journals between 1974 and 2014. They calculated a “replication rate” of 0.1%. That is, 0.1% of all the articles in the top 50 economics journals during this time period were replication studies. + +0.1% is likely an understatement of the overall replication rate in economics, as replications are likely to be underrepresented in the top journals. With 400 mainline economics journals, each publishing an average of approximately 100 articles a year, it is a daunting task to assess the replication rate for the whole discipline. + +One possibility is to scrape the internet for economics articles and use machine learning algorithms to identify replications. In unpublished work, colleagues of mine at the University of Canterbury used “convolutional neural networks” to perform this task. They compared the texts of the ***[replication studies listed at The Replication Network (TRN)](https://replicationnetwork.com/replication-studies/)*** with a random sample of economics articles from ***[RePEc](https://econpapers.repec.org/)***. + +Their final analysis produced a false negative error rate (the rate at which replications are mistakenly classified as non-replications) of 17%. The false positive rate (the rate at which non-replications are mistakenly classified as replications) was 5%. + +To give a better feel for what these numbers mean, consider a scenario where the replication rate is 1%. Suppose we have a sample of 10,000 papers, of which 100 are replications. Applying the false negative and positive rates above produces the numbers in TABLE 1. + +[![](/replication-network-blog/trn120210107.webp)](https://replicationnetwork.com/wp-content/uploads/2021/01/trn120210107.webp) + +Given this sample, a researcher would identify 578 replications, of which 83 would be true replications, and 495 would be “false replications”, that is, non-replication studies falsely categorized as replication studies. One would have to get a false positive rate below 1% before even half of the identified “replications” were true replications. Given a relatively low replication rate (here 1%), it is highly unlikely that machine learning will ever be accurate enough to produce reliable estimates of the overall replication rate in the discipline. + +A final alternative is to follow the procedure of ML et al., but choose a set of 50 journals outside the top economics journals. However, as reported in yesterday’s blog, replications tend to be clustered in a relatively small number of journals. Results of replication rates would likely depend greatly on the particular sample of journals that was used. + +Putting the above together, the answer to the question “Is it possible to know how many replications there are?” appears to be no. + +I now move on to assessing what we have learned from the replications that have been done to date.
Specifically, have replications uncovered a reproducibility problem in economics? + +**Is there a replication crisis in economics?** + +The last decade has seen increasing concern that science has a ***[reproducibility problem](https://en.wikipedia.org/wiki/Replication_crisis)***. So it is fair to ask, is there a replication crisis in economics? Probably the most famous study of replication rates is the study by [***Brian Nosek and the Open Science Collaboration (Science, 2015)***](https://science.sciencemag.org/content/349/6251/aac4716) that assessed the replication rate of 100 experiments in psychology. They reported an overall “successful replication rate” of 39%. Similar studies focused more on economics report higher rates (see TABLE 2). + +[![](/replication-network-blog/trn220210107.webp)](https://replicationnetwork.com/wp-content/uploads/2021/01/trn220210107.webp) + +The next section will delve a little more into the meaning of “replication success”. For now, let’s first ask, what rate of success should we expect to see if science is performing as it is supposed to? In a blog for TRN (“***[The Statistical Fundamentals of (Non-)Replicability](https://replicationnetwork.com/2019/01/15/miller-the-statistical-fundamentals-of-non-replicability/)***”), Jeff Miller considers the case where a replication is defined to be “successful” when it reproduces a statistically significant estimate reported in a previous study (see FIGURE 1 below). + +[![](/replication-network-blog/trn320210107.webp)](https://replicationnetwork.com/wp-content/uploads/2021/01/trn320210107.webp) + +FIGURE 1 assumes 1000 studies each assess a different treatment. 10% of the treatments are effective. 90% have no effect. Statistical significance is set at 5% and all studies have statistical power of 60%. The latter implies that 60 of the 100 studies with effective treatments produce significant estimates.  The Type I error rate implies that 45 of the remaining 900 studies with ineffectual treatments also generate significant estimates. As a result, 105 significant estimates are produced from the initial set of 1000 studies. + +If these 105 studies are replicated, one would expect to see approximately 38 significant estimates, leading to a replication “success rate” of 36% (see bottom right of FIGURE 1). Note that there is no publication bias here. No “file drawer effect”. Even when science works as it is supposed to, we should not expect a replication “success rate” of 100%. “Success rates” far less than 100% are perfectly consistent with well-functioning science. + +**Conclusion** + +Replications come in many sizes, shapes, and flavors. Even if we could agree on a common definition of a replication, it would be very challenging to make discipline-level conclusions about the number of replications that get published. Given the limitations of machine learning algorithms, there is no substitute for personally assessing each article individually. With approximately 400 mainline economics journals, each publishing approximately 100 articles a year, that is a monumental, seemingly insurmountable, challenge. + +Beyond the problem of defining a replication, beyond the problem of defining “replication success”, there is the further problem of interpreting “success rates”. One might think that a 36% replication success rate was an indicator that science was failing miserably. Not necessarily so. + +The final instalment of this series will explore these topics further. 
The goal is to arrive at an overall assessment of the potential for replications to make a substantial contribution to our understanding of economic phenomena (to read the next instalment, ***[click here](https://replicationnetwork.com/2021/01/08/reed-the-state-of-replications-in-economics-a-2020-review-part-3/)***). + +*Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network and Principal Investigator at [**UCMeta**](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/). He can be contacted at bob.reed@canterbury.ac.nz.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2021/01/07/reed-the-state-of-replications-in-economics-a-2020-review-part-2/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2021/01/07/reed-the-state-of-replications-in-economics-a-2020-review-part-2/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-the-state-of-replications-in-economics-a-2020-review-part-3.md b/content/replication-hub/blog/reed-the-state-of-replications-in-economics-a-2020-review-part-3.md new file mode 100644 index 00000000000..df7577ebe46 --- /dev/null +++ b/content/replication-hub/blog/reed-the-state-of-replications-in-economics-a-2020-review-part-3.md @@ -0,0 +1,72 @@ +--- +title: "REED: The State of Replications in Economics – A 2020 Review (Part 3)" +date: 2021-01-08 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "economics" + - "false positive rate" + - "Ioannidis et al." + - "Open Science Collaboration" + - "Replication success" + - "Statistical power" + - "Statistical significance" + - "Type I error rate" +draft: false +type: blog +--- + +This final instalment on the state of replications in economics, 2020 version, continues the discussion of how to define “replication success” (see ***[here](https://replicationnetwork.com/2021/01/06/reed-the-state-of-replications-in-economics-a-2020-review-part-1/)*** and [***here***](https://replicationnetwork.com/2021/01/07/reed-the-state-of-replications-in-economics-a-2020-review-part-2/) for earlier instalments). It then delves further into interpreting the results of a replication. I conclude with an assessment of the potential for replications to contribute to our understanding of economic phenomena. + +**How should one define “replication success”?** + +In their seminal article assessing the rate of replication in psychology, ***[Open Science Collaboration (2015)](https://science.sciencemag.org/content/349/6251/aac4716)*** employed a variety of definitions of replication success. One of their measures has come to dominate all others: obtaining a statistically significant estimate with the same sign as the original study (“SS-SS”). For example, this is the definition of replication success employed by the massive [***SCORE project***](https://www.cos.io/score) currently being undertaken by the Center for Open Science. + +The reason for the “SS-SS” definition of replication success is obvious. It can easily be applied across a wide variety of circumstances, allowing a one-size, fits-all measure of success. It melds two aspects of parameter estimation – effect size and statistical significance – into a binary measure of success. However, studies differ in the nature of their contributions. 
For some studies, statistical significance may be all that matters, say when establishing the prediction of a given theory. For others, the size of the effect may be what’s important, say when one is concerned about the effect of a tax cut on government revenues. + +The following example illustrates the problem. Suppose a study reports that a 10% increase in unemployment benefits is estimated to increase unemployment duration by 5%, with a 95% confidence interval of [4%, 6%]. Consider two replication studies. Replication #1 estimates a mean effect of 2% with corresponding confidence interval of [1%, 3%]. Replication #2 estimates a mean effect of 5%, but the effect is insignificant with a corresponding confidence interval of [0%, 10%]. + +Did either of the two replications “successfully replicate” the original? Did both? Did none? The answer to this question largely depends on the motivation behind the original analysis. Was the main contribution of the original study to demonstrate that unemployment benefits affect unemployment durations? Or was the motivation primarily budgetary? So that the size of the effect was the important empirical contribution? + +There is no general right or wrong answer to these questions. It is study-specific. Maybe even researcher-specific. For this reason, while I understand the desire to develop one-size-fits-all measures of success, it is not clear how to interpret these “success rates”. This is especially true when one recognizes — and as I discussed in the previous instalment to this blog — that “success rates” below 100%, even well below 100%, are totally compatible with well-functioning science. + +**How should we interpret the results of a replication?** + +The preceding discussion might give the impression that replications are not very useful. While measures of the overall “success rate” of replications may not tell us much, they can be very insightful in individual cases. + +In a blog I wrote for *TRN* entitled “[***The Replication Crisis – A Single Replication Can Make a Big Difference***](https://replicationnetwork.com/2018/01/05/reed-a-single-replication-can-make-a-big-difference/)”, I showed how a single replication can substantially impact one’s assessment of a previously published study. + +Define “Prior Odds” as the Prob(*Treatment is effective*):Prob(*Treatment is ineffective*). Define the “False Positive Rate” (FPR) as the percent of statistically significant estimates in published studies for which the true underlying effect is zero; i.e, the treatment has no effect. If the prior odds of a treatment being effective are relatively low, Type I error will generate a large number of “false” significant estimates that can overwhelm the significant estimates associated with effective treatments, causing the FPR to be high. TABLE 1 below illustrates this. + +[![](/replication-network-blog/trn120210108.webp)](https://replicationnetwork.com/wp-content/uploads/2021/01/trn120210108.webp) + +The FPR values in the table range from 0.24 to 0.91. For example, given 1:10 odds that a randomly chosen treatment is effective, and assuming studies have Power equal to 0.50, the probability that a statistically significant estimate is a false positive is 50%. Alternatively, if we take a Power value of 0.20, which is approximately equal to the value that ***[Ioannidis et al. (2017)](https://onlinelibrary.wiley.com/doi/full/10.1111/ecoj.12461)*** report as the median value for empirical research in economics, the FPR rises to 71%. 
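As a concrete check on the numbers quoted from TABLE 1, the short R sketch below computes the FPR directly from the prior odds and power, using the formula FPR = ϕα / [ϕα + (1-ϕ)(1-β)] from the single-replication post linked above. It is not the code behind the table; the grid of Power values (0.20, 0.50, 0.80) is an assumption about how the table was constructed.

```r
# Illustrative sketch (not the code behind the post): FPR as a function of the
# prior odds that an effect is real and the power of the original studies.
fpr <- function(prior_odds, power, alpha = 0.05) {
  phi <- 1 / (1 + prior_odds)   # Pr(H0 true); e.g. odds of 1:10 imply phi = 10/11
  (phi * alpha) / (phi * alpha + (1 - phi) * power)
}

# Grid over the prior odds given in the text and an assumed set of power values.
grid <- expand.grid(prior_odds = c(1/40, 1/10, 1/5), power = c(0.2, 0.5, 0.8))
grid$FPR <- with(grid, fpr(prior_odds, power))
round(grid$FPR, 2)   # spans roughly 0.24 to 0.91

# The two figures quoted in the text:
fpr(1/10, 0.5)       # 0.50
fpr(1/10, 0.2)       # ~0.71
```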
+ +It needs to be emphasized that these high FPRs have nothing to do with publication bias or file drawer effects. They are the natural outcomes of a world of discovery in which Type I error is combined with a situation where most studied phenomena are non-existent or economically negligible. + +TABLE 2 reports what happens when a researcher in this environment replicates a randomly selected significant estimate. The left column reports the researcher’s initial assessment that the finding is a false positive (as per TABLE 1). The table shows how that probability changes as a result of a successful replication. + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2021/01/image.webp) + +For example, suppose the researcher thinks there is a 50% chance that a given empirical claim is a false positive (Initial FPR = 50%). The researcher then performs a replication and obtains a significant estimate. If the replication study had 50% Power, the updated FPR would fall from 50% to 9%. + +TABLE 2 demonstrates that successful replications produce substantial decreases in false positive rates across a wide range of initial FPRs and Power values. In other words, while discipline-wide measures of “success rates” may not be very informative, replications can have a powerful impact on the confidence that researchers attach to individual estimates in the literature. + +**Do replications have a unique role to play in contributing to our understanding of economic phenomena?** + +To date, replications have not had much of an effect on how economists do their business. The discipline has made great strides in encouraging transparency by ***[requiring authors to make their data and code available](https://www.aeaweb.org/journals/data/data-code-policy#:~:text=It%20is%20the%20policy%20of,non%2Dexclusive%20to%20the%20authors.)***. However, this greater transparency has not resulted in a meaningful increase in published replications. While there are no doubt many reasons for this, one reason may be that economists do not appreciate the unique role that replications can play in contributing to our understanding of economic phenomena. + +The potential for empirical analysis to inform our understanding of the world is conditioned on the confidence researchers have in the published literature. While economists may differ in their assessment of the severity of false positives, the message of TABLE 2 is that, for virtually all values of FPRs, replications substantially impact that assessment. A successful replication lowers, often dramatically lowers, the probability that a given empirical finding is a false positive. + +It is worth emphasizing that replications are uniquely positioned to make this contribution. New studies fall under the cloud of uncertainty that hangs over all original findings; namely, the rational suspicion that reported results are merely a statistical artefact. Replications, because of their focus on individual findings, are able to break through the fog. It is hoped that economists will start to recognize the unique role that replications can play in the process of scientific discovery. And that publishing opportunities for well-done replications; and appropriate professional rewards for the researchers who do them, follow. + +*Bob Reed is a professor of economics at the University of Canterbury in New Zealand. 
He is also co-organizer of the blogsite The Replication Network and Principal Investigator at* [***UCMeta***](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/)*. He can be contacted at bob.reed@canterbury.ac.nz.* \ No newline at end of file diff --git a/content/replication-hub/blog/reed-using-the-r-package-specr-to-do-specification-curve-analysis.md b/content/replication-hub/blog/reed-using-the-r-package-specr-to-do-specification-curve-analysis.md new file mode 100644 index 00000000000..c3e6f999dbe --- /dev/null +++ b/content/replication-hub/blog/reed-using-the-r-package-specr-to-do-specification-curve-analysis.md @@ -0,0 +1,191 @@ +--- +title: "REED: Using the R Package “specr” To Do Specification Curve Analysis" +date: 2024-11-05 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Happiness" + - "Specification curve analysis" + - "specr" + - "Tom Coupé" + - "War" +draft: false +type: blog +--- + +*NOTE: The data (“COUPE.Rdata”) and code (“specr\_code.R”) used for this blog can be found here: * + +**A Tutorial on “specr”** + +In a recent ***[post](https://replicationnetwork.com/2024/05/09/coupe-why-you-should-add-a-specification-curve-analysis-to-your-replications-and-all-your-papers/)***, Tom Coupé encouraged readers to create specification curves to represent the robustness of their results (or lack thereof). He illustrated how this could be done using the R package “**specr**” (Masur & Scharkow, 2020). + +In this blog, I provide instructions that allow one to reproduce Coupé’s results using a modified version of his code. By providing a line-by-line explanation of the code that reproduces Coupé’s results, I hope to give readers sufficient understanding to use the “**specr**” package for their own applications. Later, in a follow-up blog, I will do the same using Stata’s “**speccurve**” program. + +**Specification Curve Analysis** + +Specification curve analysis (Simonsohn, Simmons & Nelson, 2020), also known as multiverse analysis (Steegen et al., 2016), is used to investigate the “garden of forking paths” inherent in empirical analysis. As everyone knows, there is rarely a single best way to estimate the relationship between a dependent variable y and a causal or treatment variable x. Researchers can disagree about the estimation procedure, control variables, samples, and even the specific variables that best represent the treatment and the outcome of interest. + +When the list of equally plausible alternatives (Del Giudice & Gangestad, 2021) is relatively short, it is an easy task to estimate, and report, all alternatives. But suppose the list of equally plausible alternatives is long. What can one do then? + +Specification curve analysis provides a way of estimating all reasonable alternatives and presenting the results in a way that makes it easy to determine the robustness of one’s conclusions. + +**The Coupé Study** + +In his post, Coupé studied five published articles, all of which investigated the long-term impact of war on life satisfaction. 
Despite using  the same dataset (the “Life in Transition Survey” dataset), the five articles came to different conclusions about both the sign and statistical significance of the relationship between war and life satisfaction. + +Complicating their interpretation, the five studies used different estimation methods, different samples, different measures of life satisfaction, and different sets of control variables. Based on these five studies, Coupé identified 320 plausible alternatives. He then produced the following specification curve. + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image.webp) + +FIGURE 1 sorts the estimates from lowest to highest. Point estimates are indicated by dots, and around each dot is a 95% confidence interval. Red dots/intervals indicate the point estimate is statistically significant. Grey dots/ intervals indicate statistical insignificance. + +This specification curve vividly demonstrates the range of point estimates and statistical significances that are possible depending on the combination of specification characteristics. One clearly sees three sets of results: negative and significant, negative and insignificant, and positive and insignificant. + +“**specr**” also comes with a feature that allows one to connect results to specific combinations of specification characteristics. + +[![](/replication-network-blog/image-1.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-1.webp) + +FIGURE 2 identifies the individual specification characteristics: There is one “x” variable (“WW2injury”). There are two “y” variables (“lifesatisfaction15” and “lifesatisfaction110”). There are five general estimation models corresponding to the five studies (“NikolovaSanfey2016”, “Kjewski2020”, “Ivlevs2015”, “Djankoveta2016”, and “ChildsNikolova2020”). There are 8 possible combinations of the variable sets “War Controls”, “Income”, and “Other Controls”. And there are four possible samples. This yields 1x2x5x8x4 = 320 model specifications. + +The figure allows one to visually connect results to characteristics, with red (blue) markers indicating significant (insignificant) estimates, and estimates increasing in size as one moves from left to right in the figure. However, to investigate further, one needs to do a proper multivariate analysis. + +Coupé chose to do this by estimating a linear regression, establishing one specification as the reference category and representing the other model characteristics with dummy variables. The reference case was defined as Model = ChildsNikolova2020, Subset = All, Controls = Income controls, and Dependent Variable  = lifesatisfaction110.[[1]](#_ftn1) + +Because the model covariates are all dummy variables, the estimated coefficients can be directly compared. From TABLE 1 below (which is Table III in Coupé’s paper) we observe that specifications that include “Income Controls” are associated with larger (less negative, more positive) estimates of the effect of war injury on life satisfaction. + +[![](/replication-network-blog/image-2.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-2.webp) + +**Using “specr”** + +Now that we know what specification curve analysis can do, the next step is to go over the code that produced these results. The following provides line-by-line explanations of the code ([***provided here***](https://osf.io/e8mcf/)) used to reproduce the two figures and the table above. 
+ +First, create a folder and place the dataset COUPE.RData in it. COUPE.RData is a dataset that I created using datasets from Coupé’s GitHub site (***[see here](https://github.com/dataisdifficult/war/blob/main/README.md)***). + +Open R and start a new script by setting the working directory equal to that folder. + +[![](/replication-network-blog/image-3.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-3.webp) + +Use the library command to load all the packages you need to run the program. Install any packages that you do not currently have installed. + +[![](/replication-network-blog/image-4.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-4.webp) + +The following command discourages R from reporting results using scientific notation. + +[![](/replication-network-blog/image-5.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-5.webp) + +Now we read in the dataset, COUPE, an R dataset. + +[![](/replication-network-blog/image-6.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-6.webp) + +**The “setup” command** + +The heart of the “**specr**” package is the “**setup**” command. It lays out the individual components that will be combined to produce a given specification. The syntax for the “**setup**” command is given below ([***documentation here***](https://masurp.github.io/specr/reference/setup.html)): + +[![](/replication-network-blog/image-7.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-7.webp) + +It involves specifying the dataset, the treatment variable(s) (“x”), the dependent variable(s) (“y”), the sets of control variables (“controls”), and the different subsets (“subsets”). The “add\_to\_formula” option identifies a constant set of control variables that are included in every specification. + +You will have noticed that I skipped over “model”. This is where the different estimation procedures are identified. In his analysis, Coupé specified five different models, each representing one of the five papers in his study. This is the most complicated part of the “**setup**” command, which is why I saved it for last. + +Coupé named each of his models after the paper it represents. Here is how the model for Kijewski2020 is defined: + +[![](/replication-network-blog/image-8.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-8.webp) + +Kijewski2020 is a customized function that has three components. “formula” is defined by the combination of characteristics that define a particular specification. “data” is the dataset identified in the “**setup**” command, and the third line indicates that Kijewski2020 estimated a two-level, mixed-effects model. To estimate this model, Coupé uses the R function “**lmer**”. The only twist here is the addition of “+(1|country)”, which lets “**lmer**” know to include random effects at the country level. + +The other studies are represented similarly. Below is how the model for ChildsNikolova2020 is defined: + +[![](/replication-network-blog/image-9.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-9.webp) + +Rather than estimating a multi-level model, ChildsNikolova2020 estimates a fixed-effects OLS model. The component “|Region1” adds regional dummy variables to the variable specification. The “cluster” and “weights” options compute cluster-robust standard errors (CR1-type) and allow for weighting, designed to give each country in the sample an equal weight.
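To make the structure concrete, here is a compact sketch of how a custom model function and the rest of the “**specr**” workflow fit together. It is an illustration rather than a reproduction of Coupé’s script: the control-set names in the “**setup**” call are placeholders, and the full set of models, controls, subsets, and “add\_to\_formula” terms is in the OSF code linked at the top of this post. Each of the steps is explained in detail below.

```r
library(specr)
library(lme4)   # broom.mixed may also be needed so that specr can tidy lmer output

# A customized model function in the spirit of Kijewski2020: take the formula
# specr builds for each specification, append a country-level random intercept,
# and estimate the model with lmer().
Kijewski2020 <- function(formula, data, ...) {
  f <- update(as.formula(formula), . ~ . + (1 | country))
  lme4::lmer(f, data = data, ...)
}

# A stripped-down setup() call with placeholder control sets. Each element of
# "controls" can itself be a "+"-joined set of variables, which is how Coupé
# passes his three control sets; his subsets and add_to_formula terms are
# omitted here but appear in the OSF code.
specs <- setup(
  data     = COUPE,
  x        = "WW2injury",
  y        = c("lifesatisfaction15", "lifesatisfaction110"),
  model    = "Kijewski2020",          # plus the other four custom functions
  controls = c("war_controls_set", "income_controls_set"),
  simplify = FALSE
)

results <- specr(specs)   # estimate every specification
plot(results)             # draw the specification curve
```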
+ +The other models are defined similarly. + +[![](/replication-network-blog/image-10.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-10.webp) + +Having defined the individual models, we can now assemble the individual specification components in the “**setup**” command. + +[![](/replication-network-blog/image-11.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-11.webp) + +The first several lines should be easy to interpret, having been discussed above. “subsets” identifies two subsets: “sample15” = “Heavily affected countries only”; “under65” = “Under sixty-five”. “**specr**” turns this into four samples: (i) “sample15”, (ii) “under65”, (iii) “sample15” & “under65”, and (iv) the full sample (“all”). + +“controls” identifies three sets of controls. I have separated the three sets of controls to make them easier to identify. The first is what Coupé calls “Other controls”. The second is what he calls “War controls”; and the third, “Income Controls”. In combination with the option “simplify = FALSE”, “**specr**” turns this into 8 sets of control variables, which is equal to all possible combinations of the three sets of control variables (none, each set individually, each pair of sets, and all three sets). If “simplify = TRUE” had been selected, only five sets of control variables would have been included (no covariates, each set individually, and all covariates). + +The last element in the “**setup**” command is “add\_to\_formula”. This identifies variables that are common to every specification. This option allows one to list these variables once, rather than repeating the list for each set of control variables. + +**FIGURE 1** + +Together, the “**setup**” command creates an object called “**specs**”. In turn, “**specr**” takes this object and creates another object, “**results**”, which is then fed into subsequent “**specr**” functions. + +[![](/replication-network-blog/image-12.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-12.webp) + +And just like that, we can now use the command “**plot**” to produce our specification curve. + +[![](/replication-network-blog/image-13.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-13.webp) + +**FIGURE 2** + +Moving on to FIGURE 2, the next set of six commands renames the respective sets of control variables and sample subsets for more convenient representation in subsequent outputs. As this is not central to running “**specr**”, but is only done to make the output more legible, I will give only a cursory explanation. + +[![](/replication-network-blog/image-14.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-14.webp) + +The “**specr**” command produces an object called “results$data” (see below). It contains many columns, including “x”, “y”, “model”, “controls”, “subsets”, etc. It has 320 rows, with each row containing details about the respective model specification. + +[![](/replication-network-blog/image-15.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-15.webp) + +“results$data$controls” reports the variables used for each of the model specifications. Each of the six renaming commands replaces the contents of “controls” with shorter descriptions. + +For example, the command below creates a temporary dataset called “**tom**”, which is a copy of “results$data$controls”. 
+ +[![](/replication-network-blog/image-16.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-16.webp) + +It looks for the text in the 10th row of “**tom**”, tom[10]  = “recentwarmoved+recentwarinjury+recentwarHHinjury+ WW2moved”. Every time it sees that text in “**tom**” it replaces it with “War Controls”. It then substitutes “**tom**” back into “results$data$controls”. + +The other five renaming commands all do something similar. + +With the variable names cleaned up for legibility, we’re ready to produce FIGURE 2. + +[![](/replication-network-blog/image-17.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-17.webp) + +**TABLE 1** + +The last task is to produce TABLE 1. The estimates in TABLE 1 come from a standard OLS regression, which we again call “**tom**” (I don’t know where Coupé comes up with these names!). + +[![](/replication-network-blog/image-18.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-18.webp) + +The only thing unusual about this regression is that the dependent variable “estimate” and the explanatory variables (“model”, “subsets”, etc.) come from the dataset “results$data” described above. + +We could have printed out the regression results from “**tom**” using the familiar “**summary**” command, but the output is very messy and hard to read. + +Instead, we take the regression results from “**tom**” and use the “**stargazer**” package to create an attractive table. Note, however, we had to first look at the regression results from “**tom**” to get the correct order for naming the respective factor variables. + +[![](/replication-network-blog/image-19.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-19.webp) + +You can now go to the OSF site given at the top of this blog, download the data and code there, and produce the results in this blog (and Coupé’s paper). That should enable you to do your own specification curve analysis. + +To learn more about “specr”, and see some more examples, [***go here***](https://masurp.github.io/specr/index.html). + +*NOTE: Bob Reed is Professor of Economics and the Director of [**UCMeta**](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/) at the University of Canterbury. He can be reached at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +--- + +[[1]](#_ftnref1) Personally, I would have omitted the constant term and included all the model characteristics as dummy variables. + +**REFERENCES** + +Del Giudice, M., & Gangestad, S. W. (2021). A traveler’s guide to the multiverse: Promises, pitfalls, and a framework for the evaluation of analytic decisions. Advances in Methods and Practices in Psychological Science, 4(1), 1-15. + +Masur, Philipp K., and Michael Scharkow. 2020. “Specr: Conducting and Visualizing Specification Curve Analyses (Version 1.0.1).” [https://CRAN.R-project.org/package=specr](https://cran.r-project.org/package=specr). + +Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2020). Specification curve analysis. *Nature Human Behaviour*, 4(11), 1208-1214. + +Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. *Perspectives on Psychological Science*, 11(5), 702-712. 
\ No newline at end of file diff --git a/content/replication-hub/blog/reed-using-the-stata-package-speccurve-to-do-specification-curve-analysis.md b/content/replication-hub/blog/reed-using-the-stata-package-speccurve-to-do-specification-curve-analysis.md new file mode 100644 index 00000000000..137a93e9520 --- /dev/null +++ b/content/replication-hub/blog/reed-using-the-stata-package-speccurve-to-do-specification-curve-analysis.md @@ -0,0 +1,236 @@ +--- +title: "REED: Using the Stata Package “speccurve” to Do Specification Curve Analysis" +date: 2024-11-16 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Life Satisfaction" + - "Martin Andresen" + - "SCA" + - "speccurve" + - "Specification curve analysis" + - "Stata" + - "War" +draft: false +type: blog +--- + +*NOTE: The data (“COUPE.dta”) and code (“speccurve\_program.do”) used for this blog can be found here: * + +In a previous post (***[see here](https://replicationnetwork.com/2024/11/05/reed-using-the-r-package-specr-to-do-specification-analysis/)***), I provided a step-by-step procedure for using the R package “specr”. The specific application was reproducing the specification curve analysis from a paper by Tom Coupé and coauthors (***[see here](https://dataisdifficult.github.io/PAPERLongTermImpactofWaronLifeSatisfaction.html)***). + +**“speccurve”** + +In this post, I do the same, only this time using the Stata program “**speccurve**” (Andresen, 2020). The goal is to enable Stata users to produce their own specification curve analyses. In what follows, I presume the reader has read my previous post so that I can go straight into the code. + +The first step is to download and install the Stata program “**speccurve**” from GitHub: + +[![](/replication-network-blog/image-20.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-20.webp) + +There isn’t a lot of documentation for “**speccurve**.” The associated GitHub site includes an .ado file that provides examples (“[***speccurve\_gendata.ado***](https://github.com/martin-andresen/speccurve/blob/master/speccurve_gendata.ado)”). Another example is provided by ***[this independent site](https://hbs-rcs.github.io/post/specification-curve-analysis/)*** (though ignore the bit about factor variables, as that is now outdated). + +“**speccurve**” does not do everything that “**specr**” does, but it will allow us to produce versions of FIGURE 1 and TABLE 1. (Also, FIGURE 2, though for large numbers of specifications, the output isn’t useful.) + +**What “speccurve” Can Do** + +Here is the “**speccurve**” version of FIGURE 1 from my previous post: + +[![](/replication-network-blog/image-21.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-21.webp) + +And here are the results for TABLE 1: + +[![](/replication-network-blog/image-22.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-22.webp) + +How about FIGURE 2? Ummm, not so helpful (see below). I will explain later. 
+ +[![](/replication-network-blog/image-23.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-23.webp) + +Unlike “**specr**”, “**speccurve**” does not automatically create combinations of model features for you. You have to create your own. “**speccurve**” is primarily a function for plotting a specification curve after you have estimated your selected specifications. The latter is typically done by creating a series of for loops. + +**Downloading Data and Code, Setting the Working Directory, Reading in the Data** + +Once you have installed “**speccurve**”, the next step is to create a folder, download the dataset “COUPE.dta” and code “speccurve\_program.do” from the OSF website (***[here](https://osf.io/4yrxs/)***), and store them in the folder. COUPE.dta is a Stata version of the COUPE.RData dataset described in the previous post. + +The following section provides an explanation for each line of code in “speccurve\_program.do”. + +The command line below sets the working directory equal to the folder where the data are stored. + +[![](/replication-network-blog/image-24.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-24.webp) + +The next step clears the working memory. + +[![](/replication-network-blog/image-25.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-25.webp) + +“**speccurve**” stores results in a “.ster” file. If you have previously run the program and created a “.ster” file, it is good practice to remove it before you run the program again. In this example, I have called the .ster file “WarSatisfaction”. To remove this file, delete the comment markers from the second line of code below and run. + +[![](/replication-network-blog/image-26.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-26.webp) + +Now import the Stata dataset “COUPE”. + +[![](/replication-network-blog/image-27.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-27.webp) + +**Making Variables for the For Loops** + +Recall from the previous post that Coupé created 320 specifications by combining the following model characteristics: (i) two dependent variables, (ii) four samples, (iii) eight sets of control variables, and (iv) five models. Many of these had cumbersome names. Since I will reproduce Coupé’s specifications using for loops, I want to simplify the respective names. + +For example, rather than referring to “lifesatisfaction15” and “lifesatisfaction110”, I create duplicate variables and name them “y1” and “y2”. Likewise, I create and name four sample variables, “sample1” to “sample4”. + +[![](/replication-network-blog/image-28.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-28.webp) + +Having done this for “y” and “sample”, I now want to do the same thing for the different variable specifications. There were eight variable combinations. I create a “local macro” for each one, associating the names “var1” to “var8” with specific sets of variables. + +For example, “var1” refers to the set of variables that Coupé collectively referred to as “No Additional Controls”. + +[![](/replication-network-blog/image-29.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-29.webp) + +“var2” refers to the set of variables Coupé called “Other Controls”. + +[![](/replication-network-blog/image-30.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-30.webp) + +In this way, all eight of the variable sets receive “shorthand” names “var1” to “var8”. 
+ +[![](/replication-network-blog/image-31.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-31.webp) + +**A For Loop for All the Djankovetal2016 Specifications** + +We now get to the heart of the program. I create another local macro called “no” that will assign a number (= “no”) to each of the models I estimate (model1-model320). I initialize “no” at “1”. + +[![](/replication-network-blog/image-32.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-32.webp) + +I write separate for loops for each of the five models, starting with Djankovetal2016. + +[![](/replication-network-blog/image-33.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-33.webp) + +I set up three levels of for loops. Starting from the outside (top), there are four types of samples (sample1-sample4), two types of dependent variables (y1-y2) and eight variable sets (var1-var8). + +[![](/replication-network-blog/image-34.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-34.webp) + +Djankovetal2016 estimates a weighted, linear regression model without fixed effects so I reproduce that estimation approach here.  Note how “i”, “j”, and “k” allow different combinations of variables, dependent variables, and samples, respectively. + +[![](/replication-network-blog/image-35.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-35.webp) + +The next lines of commands identify which model we are estimating. The “estadd” command creates a variable that assigns either a “1” or a “0” depending on the model. Since the first model is Djankovetal2016, Model1 = 1 and all the other Model variables  (Model2-Model5) are set = 0. ”estadd” saves the model variables so that we can identify the estimation model that each specification uses. + +[![](/replication-network-blog/image-36.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-36.webp) + +I next do the same thing for variable sets. Recall that the “i” indicator loops through eight variable specifications. When “i” = 1, the variable “Vars1” takes the value 1. When “i” = 2, the variable “Vars2” takes the value 1. And so on. + +[![](/replication-network-blog/image-37.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-37.webp) + +Having set the “right” Vars variable to “1”, I then set all the other Vars variables to “0”. That’s what the next set of commands do. + +[![](/replication-network-blog/image-38.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-38.webp) + +If “i” is not equal to 1, then Vars1 = 0. If “i” is not equal to 2, then Vars2 = 0. At the end of this section, all the Vars variables have been set = 0 except for the one variable combination that is being used in that regression. + +I next do the exact same thing for the dependent variable, Y. “j” takes the values “1” or “2” depending on which dependent variable is being used. When “j” = 1, Y1 = 1. When “j” = 2, Y2 = 1. The remaining lines set Y2 = 0 or Y1 = 0 depending if “j” = 1 or 2, respectively. + +[![](/replication-network-blog/image-39.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-39.webp) + +And then the same thing for the sample variables. + +[![](/replication-network-blog/image-40.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-40.webp) + +The next two lines save the results in a (“.ster”) file called “WarSatisfaction” and assigns each set of results a unique model number (“model1”-“model320”). 
+ +[![](/replication-network-blog/image-41.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-41.webp) + +The last few lines of this for loop increase the “no” count by 1, and then the three closing brackets finish off the three levels of the for loop. + +[![](/replication-network-blog/image-42.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-42.webp) + +**For Loops for the Other 4 Estimation Models** + +Having estimated all the specifications associated with the Djankovetal2016 model, I proceed similarly for the other four estimation models. + +[![](/replication-network-blog/image-43.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-43.webp) +[![](/replication-network-blog/image-44.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-44.webp) +[![](/replication-network-blog/image-45.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-45.webp) +[![](/replication-network-blog/image-46.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-46.webp) + +**Producing the Specification Curve** + +After running all 320 model specifications and storing them in the file “WarSatisfaction”, the “**speccurve**” command below produces the sought-after specification curve. + +[![](/replication-network-blog/image-47.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-47.webp) + +The option “param” lets “**speccurve**” know which coefficient estimate it should use. “title” is the name given to the figure produced by “speccurve”. “panel” produces the box below the figure (see below), the analogue of FIGURE 2 in the “specr” post. + +[![](/replication-network-blog/image-48.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-48.webp) + +Ideally, the panel sits below the specification curve and identifies the specific model characteristics that are associated with each of the 320 estimates. However, with so many specifications, it is squished down to an illegible box. If the reader finds this annoying, they can safely omit the “panel” option. + +**Producing the Regression Results for TABLE 1** + +To produce TABLE 1, we need to access a table that is automatically produced by “**speccurve**” and stored in a matrix called “r(table)”. The following command prints out “r(table)” for inspection. + +[![](/replication-network-blog/image-49.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-49.webp) + +“r(table)” stores variables for each of the 320 model specifications estimated by “**speccurve**”. The first two stored variables are “modelno” and “specno”. We saw how “modelno” was created in the for loops above. “specno” simply sorts the models so that the model with the lowest estimate is “spec1” and the model with the largest is “spec320”. + +Then come “estimate”, “min95”, “max95”, “min90”, and “max90”. “estimate” is the estimated coefficient on the war “treatment variable”, and the others are the corresponding 95% and 90% confidence interval limits. + +After these comes a series of binary variables (“Model1”-“Model5”, “Vars1”-“Vars8”, “Y1”-“Y2”, and “Sample1”-“Sample4”) that identify the specific characteristics associated with each estimated model. + +With the goal of producing TABLE 1, I next turn this matrix into a Stata dataset. The following command takes the variables stored in the matrix “r(table)” and adds them to the existing COUPE.dta dataset, giving them the same names they have in “r(table)”. 
+ +[![](/replication-network-blog/image-50.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-50.webp) + +With these additions, the COUPE.dta dataset now consists of 38,843 observations and 90 variables, 26 more variables than before because of the addition of the “r(table)” variables. + +I want to isolate these “r(table)” variables in order to estimate the regression of TABLE 1. To do that, I run the following lines of code: + +[![](/replication-network-blog/image-51.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-51.webp) + +Now the only observations and variables in Stata’s memory are the data from “r(table)”. I could run the regression now, regressing “estimate” on the respective model characteristics. However, matching variable names like “Model 1” and “Vars7” to the variable names that Coupé uses in his TABLE 1 is inconvenient and potentially confusing. + +Instead, I want to produce a regression table that looks just like Coupé’s. To do that, I install a Stata package called “estout” (see below). + +[![](/replication-network-blog/image-52.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-52.webp) + +“estout” provides the option of reporting labels rather than variable names. To take advantage of that, I next assign a label to each of the model characteristic variable names. + +[![](/replication-network-blog/image-53.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-53.webp) + +Now I am finally in a position to estimate and report TABLE 1. + +First, I regress the variable “estimate” on the respective model characteristics. To match Coupé’s table, I set the reference category equal to Model5=1, Sample4 = 1, Vars8=1, and Y2=1 and omit these from the regression. + +[![](/replication-network-blog/image-54.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-54.webp) + +Next I call up the command “**esttab**”, part of “**estout**”, and take advantage of the “label” option that replaces variable names with their labels. + +[![](/replication-network-blog/image-55.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-55.webp) + + “varwidth” formats the output table to look like Coupé’s TABLE 1. Likewise, “order” puts the variables in the same order as Coupé’s table. “se” has the table report standard errors rather than t-statistics, and b(%9.3f) reports coefficient estimates and standard errors to three decimal places, again to match Coupé’s table. + +And voilá, we’re done! You can check that the resulting TABLE 1 exactly matches Coupé’s table by comparing it with TABLE 1 from the previous “**specr**” post. + +**Saving the Results** + +As a final action, you may want to save the table as a WORD doc for later editing. This can be done by inserting “using TABLE1.doc” (or whatever you want to call it) after “**esttab**” but before the comma in the command line above. + +One could also save the “**speccurve**” dataset for later analysis using the “**save**” command, as below, where I have called it “speccurve\_output”. This is saved as a standard .dta Stata dataset. + +[![](/replication-network-blog/image-56.webp)](https://replicationnetwork.com/wp-content/uploads/2024/11/image-56.webp) + +**Conclusion** + +You are now all set to use Stata to reproduce the results from Coupé’s paper. Go to the OSF site given at the top of this blog, download the data and code, and run the code to produce the results in this blog. It took about 7 minutes to run on my laptop. 
+ +Of course the goal is not just to reproduce Coupé’s results, but rather to prepare you to do your own specification curve analyses. As noted above, for more examples, go to the “**speccurve**” GitHub site and check out “***[speccurve\_gendata.ado](https://github.com/martin-andresen/speccurve/blob/master/speccurve_gendata.ado)***”. Good luck! + +*NOTE: Bob Reed is Professor of Economics and the Director of* [***UCMeta***](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/) *at the University of Canterbury. He can be reached at* [*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* *Special thanks go to Martin Andresen for his patient assistance in answering Bob’s many questions about “speccurve”.* + +**REFERENCES** + +Andresen, M. (2020). [martin-andresen](https://github.com/martin-andresen)/[speccurve](https://github.com/martin-andresen/speccurve) [Software]. GitHub. \ No newline at end of file diff --git a/content/replication-hub/blog/reed-why-lowering-alpha-to-0-005-is-unlikely-to-help.md b/content/replication-hub/blog/reed-why-lowering-alpha-to-0-005-is-unlikely-to-help.md new file mode 100644 index 00000000000..ab6dd9f6ef8 --- /dev/null +++ b/content/replication-hub/blog/reed-why-lowering-alpha-to-0-005-is-unlikely-to-help.md @@ -0,0 +1,93 @@ +--- +title: "REED: Why Lowering Alpha to 0.005 is Unlikely to Help" +date: 2017-12-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "alpha" + - "false positive rate" + - "Ioannidis" + - "null hypothesis significance testing" + - "publication bias" + - "Reproducibility crisis" + - "significance testing" +draft: false +type: blog +--- + +###### *[This blog is based on the paper, **[“A Primer on the ‘Reproducibility Crisis’ and Ways to Fix It”](http://www.econ.canterbury.ac.nz/RePEc/cbt/econwp/1721.pdf)** by the author]* + +###### A standard research scenario is the following: A researcher is interested in knowing whether there is a relationship between two variables, *x* and *y*. She estimates the model *y* = *μ₀* + *μ₁x* + *ε*, *ε* ~ *N(0, σ²)*. She then tests H0: *μ₁* = 0 and concludes that a relationship exists if the associated *p*-value is less than 0.05. + +###### Recently, a large number of prominent researchers have called for journals to lower the threshold level of statistical significance from 0.05 to 0.005 (***[Benjamin et al., 2017](https://www.nature.com/articles/s41562-017-0189-z)***; henceforth B72 – for its 72 authors!). They give two main arguments for doing so. First, an *α* value of 0.005 corresponds to Bayes Factor values that they judge to be more appropriate. Second, it would reduce the occurrence of false positives, making it more likely that significant estimates in the literature represent real results. Here is the argument in their own words: + +###### “The choice of any particular threshold is arbitrary and involves a trade-off between Type I and II errors. We propose 0.005 for two reasons. First, a two-sided P-value of 0.005 corresponds to Bayes factors between approximately 14 and 26 in favor of H1. 
This range represents “substantial” to “strong” evidence according to conventional Bayes factor classifications. Second, in many fields the 𝑃 < 0.005 standard would reduce the false positive rate to levels we judge to be reasonable” (B72, page 8). + +###### However, the model that these authors employ ignores two factors which work against the positive consequences of lowering *α*. First, it ignores the role of publication bias. Second, lowering *α* would also lower statistical power. So while lowering *α* would reduce the rate of false positives, it would also reduce the capability to identify real relationships. + +###### In the following numerical analysis, I show that once one accommodates these factors, the benefits of lowering *α* disappear, so that the world of academic publishing when *α* = 0.005 looks virtually identical to the world of *α* = 0.05, at least with respect to the signal value of statistically significant estimates. + +###### B72 demonstrate the benefit of lowering the level of significance as follows: Let *α* be the level of significance and *β* the rate of Type II error, so that *Power* is given by (1-*β*). Define a third parameter, *ϕ*, as the prior probability that *H0* is true. + +###### In any given study, *ϕ* is either 1 or 0; i.e., a relationship exists or it doesn’t. But consider a large number of “similar” studies, all exploring possible relationships between different *x*’s and *y*’s. Some of these relationships will really exist in the population, and some will not. *ϕ* is the probability that a randomly chosen study estimates a relationship where none really exists. + +###### B72 use these building blocks to develop two useful constructs. First is *Prior Odds*, defined as Pr(*H1*)/Pr(*H0*) = (1-*ϕ*)/*ϕ*. They posit the following range of values as plausible for real-life research scenarios: (i) 1:40, (ii) 1:10, and (iii) 1:5. + +###### Second is the *False Positive Rate*. Let *ϕα* be the probability that no relationship exists but Type I error produces a significant finding. Let (1-*ϕ*)(1-*β*) be the probability that a relationship exists and the study has sufficient power to identify it. The percent of significant estimates in published studies for which there is no underlying, real relationship is thus given by + +###### (1) *False Positive Rate (FPR) = ϕα / [ϕα + (1-ϕ)(1-β)]*. + +###### Table 1 reports *False Positive Rates* for different *Prior Odds* and *Power* values when *α* = 0.05. Taking a *Prior Odds* value of 1:10 as representative, the table shows that *FPR*s are distressingly large over a wide range of *Power* values. For example, given a *Power* value of 0.50 — the same value that ***[Christensen and Miguel (2016)](https://escholarship.org/uc/item/52h6x1cq)*** use in their calculations — there is only a 50% chance that a statistically significant, published estimate represents something real. With smaller *Power* values — such as those estimated by **[*Ioannidis et al. (2017)*](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12461/full)** — the probability that a significant estimate is a false positive is actually greater than the probability that it represents something real. + +###### Table1 + +###### Table 2 shows that lowering *α* to 0.005 substantially improves this state of affairs. *False Positive Rates* are everywhere much lower. For example, when *Prior Odds* is 1:10 and *Power* is 0.50, the *FPR* falls to 9%, compared to 50% when *α* = 0.05. Hence their advocacy for a lower *α* value. 
+ +###### Table2 + +###### Missing from the above analysis is any mention of publication bias. Publication bias is the well-known tendency of journals to favor significant findings over insignificant findings. This also has spillovers on the behavior of researchers, who may engage in p-hacking and other suspect practices in order to obtain significant results. Though measuring the prevalence of publication bias is challenging, a recent study estimates that significant findings are 30 times more likely to be published than insignificant findings (***[Andrews and Kasy, 2017](http://www.nber.org/papers/w23298)***). As a result, insignificant findings will be underrepresented, and significant findings, overrepresented, in the published literature. + +###### Following ***[Ioannidis (2005)](http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124)*** and others, I introduce a *Bias* term, defined as the decreased share of insignificant estimates that appear in the published literature as a result of publication bias. If *Pr(insignificant)* is the probability that a study reports an insignificant estimate in a world without publication bias, then the associated probability with bias is *Pr(insignificant)∙(1-Bias)*. Correspondingly, the probability of a significant finding increases by *Pr(insignificant)∙Bias*. It follows that the *FPR* adjusted for *Bias* is given by + +###### (2) *False Positive Rate (FPR) = [ϕα + ϕ(1-α)Bias] / [ϕα + ϕ(1-α)Bias + (1-ϕ)(1-β) + (1-ϕ)βBias]*. + +###### Table 3 shows the profound effect that *Bias* has on the *False Positive Rate*. The top panel recalculates the *FPR*s from Table 1 when *Bias* = 0.25. As points of comparison, ***[Ioannidis et al. (2017)](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12461/full)*** assume *Bias* values between 0.10 and 0.80, ***[Christensen and Miguel (2016)](https://escholarship.org/uc/item/52h6x1cq)*** assume a *Bias* value of 0.30, and ***[Maniadis et al. (2017)](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12527/full)*** assume *Bias* values of 0.30 and 0.40, though these are applied specifically to replications. + +###### Returning to the previous benchmark case of *Prior Odds* = 1:10 and *Power* = 0.50, we see that the *FPR* when *α* = 0.05 is a whopping 82%. In a world of *Bias*, lowering *α* to 0.005 has little effect, as the corresponding *FPR* is 80%. Why is that? Lowering *α* to 0.005 produces a lot more insignificant estimates, and publication bias converts a share of these into published “significant” results, which also means a lot more false positives. This counteracts the benefit of the higher significance standard. + +###### Table3 + +###### Advocates of lowering *α* might counter that decreasing *α* would also have the effect of decreasing *Bias*, since it would make it harder to p-hack one’s way to a significant result if no relationship really exists. However, lowering *α* would also diminish *Power*, since it would be harder for true relationships to achieve significance. Just how all these consequences of lowering *α* would play out in practice is unknown, but TABLE 4 presents a less than sanguine picture. + +###### Table4 + +###### Suppose that before the change in *α*, *Bias* = 0.25 and *Power* = 0.50. Lowering *α* from 0.05 to 0.005 decreases *Bias* and *Power*. Suppose that the new values are *Bias* = 0.15 and *Power* = 0.20. A comparison of these two panels shows that the ultimate effect of decreasing *α* on the *False Positive Rate* is approximately zero. 
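###### These calculations are easy to verify directly. The short R snippet below is a sketch that implements equation (2); the function name and the particular parameter values are mine, chosen for illustration, and setting *Bias* = 0 recovers equation (1).

```r
# Equation (2): the FPR adjusted for publication bias, where
#   phi   = prior probability that H0 is true,
#   alpha = significance threshold, power = 1 - beta, and
#   bias  = the Bias term defined above (the share of otherwise insignificant
#           estimates that end up reported as significant).
fpr_bias <- function(alpha, power, prior_odds, bias) {
  phi <- 1 / (1 + prior_odds)        # prior_odds = Pr(H1)/Pr(H0)
  num <- phi * alpha + phi * (1 - alpha) * bias
  num / (num + (1 - phi) * power + (1 - phi) * (1 - power) * bias)
}

# Benchmark case: Prior Odds = 1:10, Power = 0.50, Bias = 0.25
fpr_bias(alpha = 0.05,  power = 0.50, prior_odds = 1/10, bias = 0.25)  # ~0.82
fpr_bias(alpha = 0.005, power = 0.50, prior_odds = 1/10, bias = 0.25)  # ~0.80

# TABLE 4 scenario: lowering alpha also lowers Bias (0.15) and Power (0.20)
fpr_bias(alpha = 0.005, power = 0.20, prior_odds = 1/10, bias = 0.15)  # ~0.83

# Setting bias = 0 recovers equation (1), e.g. the 50% and 9% FPRs above
fpr_bias(alpha = 0.05,  power = 0.50, prior_odds = 1/10, bias = 0)     # 0.50
fpr_bias(alpha = 0.005, power = 0.50, prior_odds = 1/10, bias = 0)     # ~0.09
```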
+ +###### It is, of course, possible that lowering *α* would reduce *Bias* to near zero values and that the reduction in *Power* would not be so great as to counteract its benefit. However, it would not be enough for researchers to forswear practices such as p-hacking and HARKing. Journals would also have to discontinue their preference for significant results. If one thinks that it is unlikely that journals would ever do that, then it is hard to avoid the conclusion that it is also unlikely that lowering *α* to 0.005 would help with science’s credibility problem. + +###### *Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at* [*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +###### **REFERENCES** + +###### [***Andrews, I. and Kasy, M. (2017) Identification and correction for publication bias. Working paper 23298, National Bureau of Economic Research, November 2017***.](http://www.nber.org/papers/w23298) + +###### ***[Benjamin, D.J., Berger, J.O., Johannesson, M. Nosek, B.A., Wagenmakers, E.-J., Berk, R., …, Johnson, V.E. (2017). Redefine statistical significance. Nature Human Behaviour, 1(0189).](https://www.nature.com/articles/s41562-017-0189-z)*** + +###### ***[Christensen, G.S. and Miguel, E. (2016). Transparency, reproducibility, and the credibility of economics research. CEGA Working Paper Series No. WPS-065. Center for Effective Global Action. University of California, Berkeley.](https://escholarship.org/uc/item/52h6x1cq)*** + +###### [***Ioannidis, J.P. (2005). Why most published research findings are false. PloS Medicine, 2(8): 1418-1422.***](http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124) + +###### ***[Ioannidis, J.P., Doucouliagos, H. and Stanley, T. (2017). The power of bias in economics. Economic Journal 127(605): F236-65.](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12461/full)*** + +###### ***[Maniadis, Z., Tufano, F., and List, J.A. (2017). To replicate or not to replicate? Exploring reproducibility in economics through the lens of a model and a pilot study. Economic Journal, 127(605): F209-F235.](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12527/full)*** + +###### ***[Reed, W.R. (2017). A primer on the “reproducibility crisis” and ways to fix it. Working Paper No. 21/2017, Department of Economics and Finance, University of Canterbury, New Zealand.](http://www.econ.canterbury.ac.nz/RePEc/cbt/econwp/1721.pdf)*** 
\ No newline at end of file diff --git a/content/replication-hub/blog/reed-wu-eir-missing-data.md b/content/replication-hub/blog/reed-wu-eir-missing-data.md new file mode 100644 index 00000000000..ece7a1413fc --- /dev/null +++ b/content/replication-hub/blog/reed-wu-eir-missing-data.md @@ -0,0 +1,184 @@ +--- +title: "REED & WU: EiR* – Missing Data" +date: 2022-01-16 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "economics" + - "FIML" + - "health economics" + - "inequality" + - "Leigh and Jencks (2007)0" + - "Maximum Likelihood" + - "Missing Data" + - "Multiple Imputation" + - "replication" + - "Stata" +draft: false +type: blog +--- + +*[\* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research.]* + +*NOTE: This blog uses Stata for its estimation. All the data and code necessary to reproduce the results in the tables below are available at Harvard’s Dataverse: [**click here**](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FETWQP).* + +Missing data is ubiquitous in economics. Standard practice is to drop observations for which any variables have missing values. At best, this can result in diminished power to identify effects. At worst, it can generate biased estimates. Old-fashioned ways to address missing data assigned values using some form of interpolation or imputation. For example, time series data might fill in gaps in the record using linear interpolation. Cross-sectional data might use regression to replace missing values with their predicted values. These procedures are now known to be flawed (Allison, 2001; Enders, 2010). + +The preferred way to deal with missing data is to use maximum likelihood (ML) or multiple imputation (MI), assuming the data are “missing at random”. Missing at random (MAR) essentially means that the probability a variable is missing is independent of the value of that variable. For example, if a question about illicit drug use is more likely to go unanswered for respondents who use drugs, then those data would not be MAR. Assuming that the data are MAR, both ML and MI will produce estimates that are consistent and asymptotically efficient. + +ML is in principle the easiest to perform. In Stata, one can use the structural equation modelling command (“sem”) with the option “method(mlmv)”. That’s it! Unfortunately, the simplicity of ML is also its biggest disadvantage. For linear models, ML simultaneously estimates means, variances, and covariances while also accounting for the incomplete records associated with missing data. Not infrequently, this causes convergence problems. This is particularly a problem for panel data where one might have a large number of fixed effects. + +In this blog, we illustrate how to apply both ML and MI to a [***well-cited study***](https://www.sciencedirect.com/science/article/pii/S0167629606000750?casa_token=Qll1DeS9h-0AAAAA:vHnzVb1lTS-Yip6ea-Qfb50jOzuoBGqCeYseDS-tzumlkJmsYfOJV15WCwNH2ogyDVUkuW-idA) on mortality and inequality by Andrew Leigh and Christopher Jencks (Journal of Health Economics, 2007). Their analysis focused on the relationship between life expectancy and income inequality, measured by the share of pre-tax income going to the richest 10% of the population. Their data consisted of annual observations from 1960-2004 for Australia, Canada, France, Germany, Ireland, the Netherlands, New Zealand, Spain, Sweden, Switzerland, the UK, and the US. 
We use their study both because their data and code are publicly available, and because much of the original data were missing. + +The problem is highlighted in TABLE 1, which uses a reconstruction of L&J’s original dataset. The full dataset has 540 observations. The dependent variable, “Life expectancy”, has approximately 11 percent missing values. The focal variable, “Income share of the richest 10%”, has approximately 24 percent missing values. The remaining control variables vary widely in their missingness. Real GDP has no missing values. Education has the most missing values, with fully 80% of the variable’s values missing. This is driven by the fact that the Barro and Lee data used to measure education only report values at five-year intervals. + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2022/01/image.webp) + +In fact, the problem is more serious than TABLE 1 indicates. If we run the regression using L&J’s specification (cf. Column 7, Table 4 in their study), we obtain the results in Column (1) of TABLE 2. The estimates indicate that a one-percentage point increase in the income share of the richest 10% is associated with an increase in life expectancy of 0.003 years, a negligible effect in terms of economic significance, and statistically insignificant. Notably, this estimate is based on a mere 64 observations (out of 540). + +[![](/replication-network-blog/image-1.webp)](https://replicationnetwork.com/wp-content/uploads/2022/01/image-1.webp) + +However, these are not the results that L&J reported in their study. No doubt because of the small number of observations, they used linear interpolation on some (but not all) of their data to fill in missing values. Applying their approach to our data yields the results in Column (2) of Table 2 below. There are two problems with using their approach. + +First, for various reasons, L&J did not fill in values for all the missing values. They ended up using only 430 out of a possible 540 observations. As a result, their estimates did not exploit all the information that was available to them. Second, interpolation replaces missing values with their predicted values without accounting for the randomness that occurs in real data. This biases standard errors, usually downwards. ML and MI allow one to do better. + +ML is the easiest method to apply. Estimating the regression in Table 2 requires a one-line command: + +**sem (le <- ts10 gdp gdpsq edu phealth thealth id2-id12 year2-year45), method(mlmv) vce(cluster id)** + +The “sem” command calls up Stata’s structural equation modelling procedure. The option “method(mlmv)” tells Stata to use maximum likelihood to accommodate missing values. If this option is omitted from the above, then the command will produce results identical to those in Column (1) of Table 2, except that the standard errors will be slightly smaller. + +While the simplicity of ML is a big advantage, it also introduces complications. Specifically, ML estimates all the parameters simultaneously. The inclusion of 11 country fixed effects and 44 year dummies makes the number of elements in the variance-covariance matrix huge. This, in combination with the fact that ML simultaneously integrates over distributions of variables to account for missing values, creates computational challenges. The ML procedure called up by the command above did not converge after 12 hours. As a result, we next turn to MI. + +Unlike ML, MI fills in missing values with actual data. 
The imputed values are created to incorporate the randomness that occurs in real data. The most common MI procedure assumes that all of the variables are distributed multivariate normal. It turns out that this is a serviceable assumption even if the regression specification includes variables that are not normally distributed, like dummy variables (Horton et al., 2003; Allison, 2006). + +As the name suggests, MI creates multiple datasets using a process of Monte Carlo simulation. Each of the datasets produces a separate set of estimates. These are then combined to produce one overall set of estimation results. Because each data set is created via a simulation process that depends on randomness, each dataset will be different. Furthermore, unless a random seed is set, different attempts will produce different results. This is one disadvantage of MI versus ML. + +A second disadvantage is that MI requires a number of subjective assessments to set key parameters. The key parameters are (i) the “burnin”, the number of datasets that are initially discarded in the simulation process; (ii) the “burnbetween”, the number of intervening datasets that are discarded between retained datasets to maintain dataset independence; and (iii) the total number of imputed datasets that are used for analysis. + +The first two parameters are related to the properties of “stationarity” and “independence”. The analogue to convergence in estimated parameters in ML is convergence in distributions in MI. To assess these two properties we first do a trial run of imputations. + +The command “mi impute mvn” identifies the variables with missing values to the left of the “=” sign, while the variables to the right are identified as being complete. + +**mi impute mvn le ts10 edu phealth thealth im  = gdp gdpsq id2-id12 year2-year45, prior(jeffreys)  mcmconly rseed(123) savewlf(wlf, replace)** + +The option “mcmconly” lets Stata know that we are not retaining the datasets for subsequent analysis, but only using them to assess their characteristics. + +The option “rseed(123)” ensures that we will obtain the same data every time we run this command. + +The option “prior(jeffreys)” sets the posterior prediction distribution used to generate the imputed datasets as “noninformative”. This makes the distribution used to impute the missing values solely determined by the estimates from the last regression. + +Lastly, the option “savewlf(wlf, replace)” creates an aggregate variable called the “worst linear function” that allows one to investigate whether the imputed datasets are stationary and independent. + +Note that Stata sets the default values for “burnin” and “burnbetween” at 100 and 100. + +The next set of key commands are given below. + +**use wlf, clear** + +**tsset iter** + +**tsline wlf, ytitle(Worst linear function) xtitle(Burn-in period) name(stable2,replace)** + +**ac wlf, title(Worst linear function) ytitle(Autocorrelations) note(“”) name(ac2,replace)** + +The “tsline” command produces a “time series” graph of the “worst linear function” where “time” is measured by number of simulated datasets. We are looking for trends in the data. That is, do the estimated parameters (which includes elements in the variance-covariance matrix) tend to systematically depart from the overall mean. + +[![](/replication-network-blog/image-2.webp)](https://replicationnetwork.com/wp-content/uploads/2022/01/image-2.webp) + +The graph above is somewhat concerning because it appears to first trend up and then trend down. 
As a result, we increase the “burnin” value to 500 from its default value of 100 with the following command. Why 500? We somewhat arbitrarily choose a number that is substantially larger than the previous “burnin” value. + +**mi impute mvn le ts10 edu phealth thealth im  = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) mcmconly burnin(500) rseed(123) savewlf(wlf, replace)** + +… + +**use wlf, clear** + +**tsset iter** + +**tsline wlf, ytitle(Worst linear function) xtitle(Burn-in period) name(stable2,replace)** + +**ac wlf, title(Worst linear function) ytitle(Autocorrelations) note(“”) name(ac2,replace)** + +[![](/replication-network-blog/image-3.webp)](https://replicationnetwork.com/wp-content/uploads/2022/01/image-3.webp) + +This looks a lot better. The trending that is apparent in the first half of the graph is greatly reduced in the second half. We subjectively determine that this demonstrates sufficient “stationarity” to proceed. Note that there is no formal test to determine stationarity. + +The next thing is to check for independence. The posterior distributions used to impute the missing values rely on Bayesian updating. While our use of the Jeffrys prior reduces the degree to which contiguous imputed datasets are related, there is still the opportunity for correlations across datasets. The “ac” command produces a correlogram of the “worst linear function” that allows us to assess independence. This is produced below. + +[![](/replication-network-blog/image-4.webp)](https://replicationnetwork.com/wp-content/uploads/2022/01/image-4.webp) + +This correlogram indicates that as long as we retain imputed datasets that are at least “10 datasets apart”, we should be fine. The default value of 100 for “burnbetween” is thus more than sufficient. + +The remaining parameter to be set is the total number of imputed datasets to use for analysis. For this we use a handy, user-written Stata (and SAS) command from von Hippel (2020) called “how\_many\_imputations”. + +The problem with random data is that it produces different results each time. “how\_many\_imputations” allows us to set the number of imputations so that the variation in estimates will remain within some pre-determined threshold value. The default value is to set the number of imputations so that the coefficient of variation of the standard error of the “worst linear function” is equal to 5%. + +It works like this. First we create a small initial set of imputed datasets. The command below imputes 10 datasets (“add(10)”). + +**mi impute mvn le ts10 edu phealth thealth im  = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) burnin(500) add(10) rseed(123)** + +We then estimate a fixed effects regression for each of the 10 datasets. Note that we use the standard Stata command for “xtreg, fe” after “mi estimate:” + +**mi xtset id year** + +**mi estimate: xtreg le ts10 gdp gdpsq edu phealth thealth year2-year45, fe vce(cluster id)** + +**how\_many\_imputations** + +The command “how\_many\_imputations” determines the number of imputed datasets calculated to produce standard errors with a coefficient of variation for the standard errors equal to 5%. In this particular case, the output is given by: + +[![](/replication-network-blog/image-5.webp)](https://replicationnetwork.com/wp-content/uploads/2022/01/image-5.webp) + +The output says to create 182 more imputed datasets. 
+ +We can feed this number directly into the “mi impute” command using the “add(`r(add\_M)’)” option: + +**mi impute mvn le ts10 edu phealth thealth im  = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) burnin(500) add(`r(add\_M)’)** + +After running the command above, our stored data now consists of 104,220 observations: The initial set of 540 observations plus 192 imputed datasets × 540 observations/dataset. To combine the individual estimates from each dataset to get an overall estimate, we use the following command: + +**mi xtset id year** + +**mi estimate: xtreg le ts10 gdp gdpsq edu phealth thealth year2-year45, fe vce(cluster id)** + +The results are reported in Column (3) of Table 2. A further check re-runs the program using different seed numbers. These show little variation, confirming the robustness of the results. + +In calculating standard errors, Table 2 follows L&J’s original procedure of estimating standard errors clustered on countries. One might want to improve on this given that their sample only included 12 countries. + +An alternative to the “mi estimate” command above is to use a user-written program that does wild cluster bootstrapping. One such package is “[wcbregress](http://fmwww.bc.edu/RePEc/bocode/w)”.  While not all user-written programs can be accommodated by Stata’s “mi estimate”, one can use wcb by modifying the “mi estimate” command as follows: + +**mi estimate, cmdok: wcbregress le ts10 gdp gdpsq edu phealth thealth year2-year45, group(id**) + +A comparison of Columns (3) and (1) reveals what we have to show for all our work. Increasing the number of observations substantially reduced the sizes of the standard errors. The standard error of the focal variable, “Income share of the richest 10%”, decreased from 0.051 to 0.035. + +While the estimated coefficient remained statistically insignificant for this variable, the smaller standard errors boosted two other variables into significance: “Real GDP per capital squared” and “Log public health spending pc”. Furthermore, the larger sample provides greater confidence that the estimated coefficients are representative of the population from which we are sampling. + +Overall, the results provide further support for Leigh & Jencks (2007)’s claim that the relationship between inequality and mortality is small and statistically insignificant. + +Given that ML and MI estimation procedures are now widely available in standard statistical packages, they should be part of the replicator’s standard toolkit for robustness checking of previously published research. + +*Weilun (Allen) Wu is a PhD student in economics at the University of Canterbury. This blog covers some of the material that he has researched for his thesis. Bob Reed is Professor of Economics and the Director of*[***UCMeta***](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/)*at the University of Canterbury. They can be contacted at [weilun.wu@pg.canterbury.ac.nz](mailto:weilun.wu@pg.canterbury.ac.nz) and [bob.reed@canterbury.ac.nz](mailto:bob.reed@canterbury.ac.nz), respectively.* + +**REFERENCES** + +Allison, P. D. (2001). *Missing data*. Sage publications. + +Enders, C. K. (2010). *Applied missing data analysis*. Guilford press. + +Leigh, A., & Jencks, C. (2007). Inequality and mortality: Long-run evidence from a panel of countries. *Journal of Health Economics*, 26(1), 1-24. + +Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. *The American Statistician*, 57(4), 229-232. 
+ +Allison, P. (2006, August). Multiple imputation of categorical variables under the multivariate normal model. In *Annual Meeting of the American Sociological Association*, Montreal. + +Von Hippel, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. *Sociological Methods & Research*, 49(3), 699-718. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2022/01/16/reed-wu-eir-missing-data/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2022/01/16/reed-wu-eir-missing-data/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/reed-you-can-calculate-power-retrospectively-just-don-t-use-observed-power.md b/content/replication-hub/blog/reed-you-can-calculate-power-retrospectively-just-don-t-use-observed-power.md new file mode 100644 index 00000000000..337d89c9949 --- /dev/null +++ b/content/replication-hub/blog/reed-you-can-calculate-power-retrospectively-just-don-t-use-observed-power.md @@ -0,0 +1,103 @@ +--- +title: "REED: You Can Calculate Power Retrospectively — Just Don’t Use Observed Power" +date: 2025-08-29 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Observed Power" + - "Post-hoc Power" + - "Retrospective Power" + - "SE-ES" +draft: false +type: blog +--- + +*In this blog, I highlight a valid approach for calculating power after estimation—often called retrospective power. I provide a Shiny App that lets readers explore how the method works and how it avoids the pitfalls of “observed power” — try it out for yourself! I also link to a webpage where readers can enter any estimate, along with its standard error and degrees of freedom, to calculate the corresponding power.* + +**A. Why retrospective power can be useful** +-------------------------------------------- + +Most researchers calculate power before estimation, generally to plan sample sizes: given a hypothesized effect, a significance level, and degrees of freedom, power analysis asks how large a study must be to achieve a desired probability of detection. + +That’s good practice, but key inputs—variance, number of clusters, intraclass correlation coefficient (ICC), attrition, covariate performance—are guessed before the data exist, so realized (ex post) values often differ from what was planned. As ***[Doyle & Feeney (2021)](https://www.povertyactionlab.org/resource/quick-guide-power-calculations)*** note in their guide to power calculations, “the exact ex post value of inputs to power will necessarily vary from ex ante estimates.” This is why it can be useful—even preferable—to also calculate power after estimation. + +Ex-post power can be helpful in at least three situations. + +1) **It can provide a check on whether ex-ante power assessments were realized.** Because actual implementation rarely matches the original plan—fewer participants recruited, geographic constraints on clusters, or greater dependency within clusters than anticipated—realized power often departs from planned power. Calculating ex-post power highlights these gaps and helps diagnose why they occurred. 
+ +2) **It can help distinguish whether a statistically insignificant estimate reflects a negligible effect size or an imprecise estimate.** In other words, it can separate “insignificant because small” from “insignificant because underpowered.” + +3) **It can flag potential Type M (magnitude) risk when results are significant but measured power is low.** In this way, it can warn of possible overestimation and prompt more cautious interpretation ([***Gelman & Carlin, 2014***](https://sites.stat.columbia.edu/gelman/research/published/retropower_final.pdf)). + +In short, while ex-ante power is essential for planning, ex-post power is a practical complement for evaluation and interpretation. It connects power claims with realized outcomes, enables the diagnosis of deviations from plan, and provides additional insights when interpreting both null and significant findings. + +**B. Why the usual way (“Observed Power”) is a bad idea** +--------------------------------------------------------- + +Most statisticians advise against computing observed power, which plugs the observed effect and its estimated standard error into a power formula ([***McKenzie & Ozier, 2019***](https://blogs.worldbank.org/en/impactevaluations/why-ex-post-power-using-estimated-effect-sizes-bad-ex-post-mde-not)). Because observed power is a one-to-one (monotone) transformation of the test statistic—and hence of the *p*-value—it adds no information and encourages tautological explanations (e.g., “the result was non-significant because power was low”). + +Worse, as an estimator of a study’s design power, observed power is both biased and high variance, precisely because it treats a noisy point estimate as the true effect. These problems are well documented ([***Hoenig & Heisey, 2001***](https://doi.org/10.1198/000313001300339897); [***Goodman & Berlin, 1994***](https://doi.org/10.7326/0003-4819-121-3-199408010-00008); [***Cumming, 2014***](https://doi.org/10.1177/0956797613504966); [***Maxwell, Kelley, & Rausch, 2008***](https://doi.org/10.1146/annurev.psych.59.103006.093735)). These concerns are not just theoretical: I demonstrate below how minor sampling variation translates into dramatic changes in observed power. + +**C. A better retrospective approach: SE–ES** +--------------------------------------------- + +In a recent paper ([***Tian et al., 2024***](https://doi.org/10.1111/rode.13130)), I and my coauthors propose a practical alternative that we call: SE–ES (Standard Error–Effect Size). The idea is simple. The researcher specifies a hypothesized effect size (what would be substantively important), uses the estimated standard error from the fitted regression, and combines those with the relevant degrees of freedom to compute power for a two‑sided t‑test. + +Because SE–ES fixes the effect size externally—rather than using the noisy point estimate—it yields a serviceable retrospective power number: approximately unbiased for the true design power with a reasonably tight 95% estimation interval, provided samples are not too small. + +To make this concrete, suppose the data-generating process is *Y=a+bX+ε* , with *ε* a classical error term and *b* estimated by OLS. If the true design power is 80%, simulations at sample sizes *n* = 30, 50, 100 show that the SE–ES estimator is approximately unbiased, with 95% estimation intervals that tighten as *n* grows: (i) *n* = 30 yields (60%, 96%); (ii) *n* = 50 yields (65%, 94%); and (iii) *n* = 100 yields (70%, 90%). + +**D. 
Try it yourself: A Shiny app that compares SE–ES with Observed Power** + +To visualize the contrast, I have created a companion Shiny app. It lets you vary sample size (*n*), target/true power, and *α*, then: (1) runs Monte Carlo replications of *Y ~ 1 + βX*; (2) plots side‑by‑side histograms of retrospective power for SE–ES and Observed Power; and (3) reports the Mean and the 95% simulation interval (the central 2.5%–97.5% range of simulated power values) for each method. Power is calculated under two‑tailed testing. + +What you should see: the Observed Power histogram tracks the significance test—mass near 0 when results are null, near 1 when they are significant—because it is just a re‑expression of the t statistic. Further, the wide range of estimates makes it unusable even if its biasedness did not. The SE–ES histogram, in contrast, concentrates near the design’s target power and tightens as sample size grows. + +To use the app, ***[click here](https://w87avq-bob-reed.shinyapps.io/retrospective_power_app/)***. Input the respective values in the Shiny app’s sidebar panel. The panel below provides an example with sample size set equal to 100; true power equal to 80% (for two-sided significance), alpha equal to 5%, and sets the number of simulations = 1000 and the random seed equal to 123. + +[![](/replication-network-blog/image-1.webp)](https://replicationnetwork.com/wp-content/uploads/2025/08/image-1.webp) + +Once you have entered your input values, click “Run simulation”. Two histograms will appear. The histogram to the left reports the distribution of estimated power values using the SE-ES method. The histogram to the right reports the same using Observed Power. The vertical dotted line indicates true power. + +[![](/replication-network-blog/image-2.webp)](https://replicationnetwork.com/wp-content/uploads/2025/08/image-2.webp) + +Immediately below this figure, the Shiny app produces a table that reports the mean and 95% estimation interval of estimated powers for the SE-ES and Observed Power methods. For this example, with the true power = 80%, the Observed Power distribution is left skewed, biased downwards (mean = 73.4%) with a 95% estimation interval of (14.5%, 99.8%). In contrast, the SE-ES distribution is approximately symmetric, approximately centered around the true of 80%, with a 95% estimation interval of (68.5%, 89.9%). + +[![](/replication-network-blog/image-3.webp)](https://replicationnetwork.com/wp-content/uploads/2025/08/image-3.webp) + +The reader is encouraged to try out different target power values and, most importantly, sample sizes. What you should see is that the SE-ES method works well at every true power value, but, in this context, it becomes less serviceable for sample sizes below 30. + +**E. Bottom line—and an easy calculator you can use now** +--------------------------------------------------------- + +Power estimation is useful for before estimation, for planning. But it is also useful after estimation, as an interpretative tool. Furthermore, it is easy to calculate. For readers interested in calculating retrospective power for their own research, Thomas Logchies and I have created an online calculator that is easy to use: ***[click here](https://replicationnetwork.com/2024/08/15/reed-logchies-calculating-power-after-estimation-no-programming-required/)***. There you can enter α, degrees of freedom, an estimated standard error, and a hypothesized effect size to obtain SE–ES retrospective power for your estimate. Give it a go! 
+ +*NOTE: Bob Reed is Professor of Economics and the Director of*[***UCMeta***](https://www.canterbury.ac.nz/business-and-law/research/ucmeta/)*at the University of Canterbury. He can be reached at*[*bob.reed@canterbury.ac.nz*](mailto:bob.reed@canterbury.ac.nz)*.* + +**References** +-------------- + +Cumming, G. (2014). The new statistics: Why and how. *Psychological Science*, 25(1), 7–29. + +Doyle, M.-A., & Feeney, L. (2021). Quick guide to power calculations. [https://www.povertyactionlab.org/](https://www.povertyactionlab.org/resource/quick-guide-power-calculations)resource/quick-guide-power-calculations + +Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. *Perspectives on Psychological Science,* 9(6), 641–651. + +Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. *Annals of Internal Medicine*, 121(3), 200–206. + +Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. *The American Statistician*, 55(1), 19–24. + +Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. *Annual Review of Psychology*, 59(1), 537–563. + +McKenzie, D., & Ozier, O. (2019, May 16). *Why ex-post power using estimated effect sizes is bad, but an ex-post MDE is not*. *Development Impact* (World Bank Blog). [https://blogs.worldbank.org/en/impactevaluations/why-ex-post-power-using-estimated-effect-sizes-bad-ex-post-mde-not](https://blogs.worldbank.org/en/impactevaluations/why-ex-post-power-using-estimated-effect-sizes-bad-ex-post-mde-not?utm_source=chatgpt.com) + +Tian, J., Coupé, T., Khatua, S., Reed, W. R., & Wood, B. D. (2025). Power to the researchers: Calculating power after estimation. *Review of Development Economics*, *29*(1), 324-358. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/08/29/you-can-calculate-power-retrospectively-just-dont-use-observed-power/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/08/29/you-can-calculate-power-retrospectively-just-dont-use-observed-power/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/richard-anderson-replication-and-the-zen-of-home-repair.md b/content/replication-hub/blog/richard-anderson-replication-and-the-zen-of-home-repair.md new file mode 100644 index 00000000000..cf9089d947e --- /dev/null +++ b/content/replication-hub/blog/richard-anderson-replication-and-the-zen-of-home-repair.md @@ -0,0 +1,33 @@ +--- +title: "RICHARD ANDERSON: Replication and the Zen of Home Repair" +date: 2015-07-30 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "data and code" + - "home repair" + - "Zen" +draft: false +type: blog +--- + +###### This summer is the first since my retirement from government that I find myself without academic obligations here or abroad.  Instead, I am focused on starting to rehab a tattered house that I recently purchased jointly with one of my children. + +###### Surprising parallels exist between repairing a house and pursuing scientific research, at least for persons with an active imagination. First, it is important to understand the basic structure of the problem: removing an incorrect wall might lead to collapse of the house.  
Pursuit of an uninteresting hypothesis might doom many months of research to becoming a permanent resident in your file cabinet. Both are tragedies. + +###### There also is the issue of “what was done before.” Is it important to discern the architect’s original location for a window or a door? Is it important to discern precisely how the investigator in a previous study specified his regression in Eviews? Surprisingly, the answers are both yes. Approximate guesses are not adequate. Cutting through framing that supports a hidden beam can lead to poor results, as can guessing what exact specification was used by a previous investigator. + +###### Opening the door on an older house and opening a new academic study are quite similar in a challenging way: neither typically comes with adequate documentation. In a house, you open the door to adventure: no document reveals the modifications and flaws, there is candy and danger for you to discover. Your mind’s vision of the completed project is its advertisement. Similarly, as Bruce McCullough phrases it, opening a new published article is but an advertisement for the underlying research. What data, precisely, were used? If a regression was used, how was it specified and what options (or defaults) were used in its estimation? What statistical package was used?  If a hypothesis test was used, precisely how was the test statistic calculated? + +###### Danger stalks both restored houses and scientific research.  An incorrectly modified house can risk human life (or at least the value of the property). An incorrect scientific study risks a poorly designed public policy, or creating a “bandwagon” that leads others in pursuit of flawed results. + +###### Fortunately, the answer in economic research  (both empirical and DSGE-style simulation studies) is easier than in old houses: the profession should expect authors to furnish code and data as part of the output of their research. It is an enduring mystery that professional economists – and the persons who pay their salaries – see no value in such transparency. An old house is unable to reveal clearly its history and current flaws; most are sold “as is” for that reason. How much longer will published economic research similarly be sold “as is” to its consumers? + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/07/30/richard-anderson-replication-and-the-zen-of-home-repair/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/07/30/richard-anderson-replication-and-the-zen-of-home-repair/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/richard-ball-introducing-project-tier.md b/content/replication-hub/blog/richard-ball-introducing-project-tier.md new file mode 100644 index 00000000000..fe60123b4b4 --- /dev/null +++ b/content/replication-hub/blog/richard-ball-introducing-project-tier.md @@ -0,0 +1,60 @@ +--- +title: "RICHARD BALL: Introducing Project TIER" +date: 2016-01-20 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Haverford College" + - "Project TIER" + - "research transparency" + - "Richard Ball" +draft: false +type: blog +--- + +###### Project TIER (*Teaching Integrity in Empirical Research*) is one of the many initiatives launched within the last several years—a number of which have been featured in previous TRN guest blogs—that seek to strengthen standards of research transparency in the social sciences.  
Its mission statement reads: + +###### ***Project TIER’s mission is to promote a systemic change in the professional norms related to the transparency and reproducibility of empirical research in the social sciences.  It is guided by the principle that providing comprehensive replication documentation for research involving statistical data should be as ubiquitous and routine as it is to provide a list of references cited.  Authors should view this documentation as an essential component of how they communicate their research to other scholars, and readers should not consider a study to be credible unless such documentation is available.*** + +###### ***We will know this mission has been accomplished when failing to provide replication documentation for an empirical study is considered as aberrant as writing a theoretical paper that does not contain proofs of the propositions, an experimental paper that does not describe the treatment conditions, or a law review article that does not cite legal statutes or judicial precedents.*** + +###### Project TIER’s approach to promoting research transparency is distinctive in two ways:  it focuses on the education of social scientists early in their training, and it emphasizes the things that authors of research papers can do ensure that interested readers are able to replicate their empirical results without undue difficulty.  Both of these features reflect the circumstances that led to the conception of Project TIER and in which the initiative has evolved. + +###### The ideas that eventually grew into Project TIER began taking shape in an introductory course on statistical methods for undergraduates majoring in economics at Haverford College.  Richard Ball, an economics faculty member, was the instructor for the course, and Norm Medeiros, a librarian, collaborated closely in the advising of students conducting research projects required for the class.  For those projects, students chose the topics they investigated, found statistical data that could shed light on the questions they were interested in investigating, examined and analyzed the data in simple ways, and then wrote complete papers in which they presented and interpreted their results. + +###### When this research project was introduced as a requirement for the course in 2001, the initial results were not encouraging.  The papers students turned in were, to put it mildly, less than completely transparent.  Their descriptions of the original data they had used and the sources from which those data had been obtained, of how the original data were cleaned and processed to create the final data sets used for the analyses, and of how the figures and tables presented in the papers were generated from the final data sets, were incoherent.  In most cases it was impossible to understand the empirical work underlying the papers or to evaluate it in a constructive way. + +###### To address this problem, we began requiring students to turn in not only printed copies of their research papers, but also to submit electronic documentation consisting of their data, code and some supporting information.  
We found, however, that developing a workable set of guidelines for the required documentation presented some challenges:  they needed to be detailed and explicit enough that students would know unambiguously what was expected of them, and they needed to be general enough that they would be applicable across the varied types of data and analyses encountered in these projects; but they also needed to be short, simple and clear enough that it would be realistic to expect students to understand and implement them.  It took a number of iterations to formulate guidelines that met these challenges, but over the course of several semesters we arrived at a set of written instructions that proved to be adequate.  In the past ten years or so it has become routine for our students to follow those instructions for constructing replication documentation to accompany their research papers. + +###### Requiring students to turn in comprehensive replication documentation with their research papers has solved the problem that led us to introduce the requirement: if any aspect of the data processing and analysis is not explained adequately in text of a paper, it is possible to discover exactly what the students did simply by reading and running their code.  But a number of other benefits have followed as well.  When students know that they have to document their statistical work, and are given some guidance for doing so, they themselves understand better what they are doing.  And when they understand what they are doing, the analyses they choose to conduct tend to make more sense, and the explanations of what they did that they give in their papers tend to be much more coherent.  Moreover, throughout the entire course of the semester in which students work on a project, their instructors are able to advise them much more effectively than would be possible if students did not keep their data organized and systematically record their work in command files. + +###### Most fundamentally, placing upon students the responsibility to ensure that their work is reproducible (and teaching them some tools for achieving this goal) reinforces the principle that one should not make claims that cannot be verified or whose validity is in doubt; allowing students to turn in a paper based on work that they cannot reproduce undermines this principle.  This principle applies broadly across academic disciplines, but it is particularly important to convey to beginning students of statistics, many of whom hold the prior belief that manipulation and obfuscation are inherent in statistical analysis. + +###### After developing a simple and effective way of teaching students to document statistical research and observing the benefits that follow from doing so, we decided it would be worthwhile to share our experiences with others.  We began in 2011 by presenting a paper at the first annual Conference on Teaching and Research in Economic Education (CTREE), organized by the American Economic Association Committee on Education, and that paper later appeared in the *Journal of Economic Education*.[[1]](#_ftn1) + +###### Positive responses to these and other early outreach efforts led us to launch Project TIER in 2013.  The activities we have undertaken since then include + +###### – A series of workshops for social science faculty interested in introducing principles and methods of transparent research in classes they teach on quantitative methods.  
The next Faculty Development Workshop will take place April 1-2, 2016, on the Haverford College campus.  These workshops are offered free of charge; ***[information](https://www.haverford.edu/sites/default/files/Office/TIER/Spring-16-TIER-Workshop-Announcement_0.pdf)*** and ***[applications](https://forms.haverford.edu/view.php?id=169326)*** are now available. + +###### – A program of year-long fellowships, in which faculty who have already made significant contributions collaborate with us in the development and dissemination of new curriculum and approaches to teaching transparent research methods.  We are currently working with ***[five Fellows nominated for 2015-16](https://www.haverford.edu/project-tier/people/tier-faculty-fellows)***, and have begun recruiting for the 2016-17 cohort of TIER Faculty Fellows, for which ***[information](https://www.haverford.edu/sites/default/files/Office/TIER/2016-17-TIER-Fellowship-Announcement.pdf)*** and ***[applications](https://forms.haverford.edu/view.php?id=169189)*** are also available. + +###### – Workshops offered to doctoral students in doctoral programs in the social sciences, offering practical guidance on research documentation and workflow management in the course of writing an empirical dissertation.  We will be conducting a workshop at Duke University, for economics graduate students, on February 12, 2016.  Thanks to a generous grant, we can offer these workshops free of charge, and we would be happy to consider requests from other graduate departments interested in hosting a workshop. + +###### **TO LEARN MORE** + +###### ***Visit our website:***  [www.haverford.edu/TIER](http://www.haverford.edu/TIER) + +###### Please note that we are working on a completely redesigned website, which will have a new URL:  [www.projecttier.org](http://www.projecttier.org).  At the time this blog is being posted, this URL is not yet active, but it will launch in the spring of 2016. + +###### ***Follow us on Twitter:*** @Project\_TIER + +###### [[1]](#_ftnref1) Ball, R. and N. Medeiros (2012).  Teaching Integrity in Empirical Research: A Protocol for Documenting Data Analysis and Management.  *Journal of Economic Education*, *43*(2), 182–189. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/01/20/richard-ball-introducing-project-tier/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/01/20/richard-ball-introducing-project-tier/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/richard-palmer-jones-replication-talk-costs-lives-why-are-economists-so-concerned-about-the-reputati.md b/content/replication-hub/blog/richard-palmer-jones-replication-talk-costs-lives-why-are-economists-so-concerned-about-the-reputati.md new file mode 100644 index 00000000000..214d0e5d004 --- /dev/null +++ b/content/replication-hub/blog/richard-palmer-jones-replication-talk-costs-lives-why-are-economists-so-concerned-about-the-reputati.md @@ -0,0 +1,98 @@ +--- +title: "RICHARD PALMER-JONES: Replication Talk Costs Lives: Why are economists so concerned about the reputational effects of replications?" 
+date: 2015-05-14 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Michael Clemens" + - "replication" + - "Replication versus Robustness" + - "reputation" +draft: false +type: blog +--- + +###### Michael Clemens’ recent working paper [“The Meaning of Failed Replications: A Review and Proposal”](http://www.cgdev.org/publication/meaning-failed-replications-review-and-proposal-working-paper-399) echoes concerns expressed by some replicatees and economists more generally Ozler, 2014, for example, about the potentially damaging effects of a claim of failed replication on reputations (or similar concerns in [social psychology](http://www.psychol.cam.ac.uk/cece/blog)). Some of these concerns have been expressed in relation to 3ie’s program of funding for replications of prominent works in development economics (Jensen and Oster, 2013; Dercon et al., 2015). The responses of some, not necessarily these, replicatees can be characterised as “belligerent” (Camfield and Palmer-Jones, 2015), adequately matching any giant slaying potential of the replicators. + +###### Further attention (Cremer does refers to some of the main contributions in economics and social science more widely to this them, but not other) to the desirability of and need for more replications in economics is welcome (Camfield and Palmer-Jones, 2013; Duvendack and Palmer-Jones, 2013; Duvendack, Palmer-Jones and Read, 2015 forthcoming; see also the earlier work of Bruce McCullough and associates). + +###### While there is much in Clemens’ paper that could be debated, I will address here what I consider to be the core argument. Clemens argues that replication should be distinguished from robustness testing in part because a failed replication bears implications of error or wrong doing or some other deficiency of the replicatees, while robustness testing (and extension) are practices that reflect legitimate disagreements among competent professionals. He attributes the new concerns with replication in large part to the computational turn in economics, wrongly I think (concern with the provision of data sets to allow replication is expressed by Ragnar Frisch in the editorial to the first issue of Econometrica, for example (Frisch, 1933). In doing so he refers to Collins, 1991, and to Camfield and Palmer-Jones, 2013, to support the claim that a failed replication implies a moral failure among the replicatees. Neither reference is correct. Clemens quotes the latter: “replication speaks to ethical professional practice” (1612), and the former that “replication is a matter of establishing *ought*, not *Is* [emphasis in original]”. I know the latter were not suggesting that replicatees’ behaviour was morally reprehensible and I don’t believe Collins was either. Rather they refer to the ethics of the profession. Both sources use replication to cover both checking, or pure (Hammermesh, 2007) or exact replication (Duvendack, et al., 2015), and what Clemens terms robustness testing. + +###### What Camfield and Palmer-Jones intend the reader to understand is that replication should be a quotidian practice of economics, promoted by professional institutions (teaching, appointment, promotion, journals, conferences, and so on), and that claims to economic knowledge should be based on replicable and replicated studies. 
In this context I mean the more extensive understanding of replication which means that the findings are not only free from the sorts of error that can be identified by checking, but also stand up to comparison with the results of using alternative relevant datasets, estimation methods, variable constructions, and models, and, as far as is possible, to alternative explanations of the same phenomena drawn from different theoretical frameworks. Collins writes; “[H]ere we see the difference between *checking* and replication. Checking often involves *repetition* as a technique, but it is not *replication. …. Replication is the establishment of a new and contested result by agreement over what counts as a correctly performed series of experiments.*” (Collins, 1991: 131-2). + +###### And I think this is consistent with understandings of replication in laboratory and other sciences, and with computational science (Peng, 2011). Thus natural science replicates an experiment with what are supposed to be identical materials, methods and conditions (and hopefully different actors trained in putatively the same methods but at different locations, with different perspectives, or priors if you like). These qualifications on the training of the actors stem from the role of tacit knowledge or practices, and different interests, that characterise the decentralised practices of science and are apparently often crucial. A failure to replicate at the checking stage can indicate unknown, or unreported, differences in any of these factors. Detective work to find out where the unrecognised differences leading to different outcomes lie is part of the work of scientists. These differences need not lie in moral failures but in the realities of life – that complete description is almost always impossible, that to err is human, and that some choices are margin or perhaps judgement calls. + +###### This is where things elide seamlessly into robustness testing. For a simple example, consider the choice as to what constitutes a comparable sample for an experiment or a survey, or even from an existing secondary data source, is generally validated by comparing descriptive statistics. These will never be identical between the original and subsequent sample. (Value) judgements are required as to whether the samples are sufficiently similar. + +###### What is interesting is that Cremer sees merit in flogging this dead horse. Hardly any replication practitioner sees merit in restricting the practice to “checking” (and Clemens provides citations in support of this), and consider that checking hardly provides sufficient motivation for the work involved, given the low status and poor publication prospects (Dewald et al., 1986; Duvendack and Palmer-Jones, 2013). But I disagree with the view that checking has little value (I am reminded of Oscar Wilde’s aphorism about knowing the price of everything but the value of nothing). “Checking” helps understand what the authors have done as a basis for extension, and so on, and has revealed plenty of problems in economics, from Dewald et al., 1986, through McCullough and Vinod, 2003. The anodyne results of the work reported by Glandon, 2011, apart from lacking details, might perhaps have revealed more if the checking had extended beyond reproducing tables and graphs from estimation data sets with provided code, to included data, variable and sample preparation, and re-writing estimation code from scratch, perhaps in a different computer language. 
And perhaps been undertaken by people with more status and authority than that of a “graduate student”. + +###### However, there clearly is support for the view that replication should be restricted to checking and perhaps minimal pre-specified robustness testing, in parts of the economics establishment (e.g. Ozler, 2014). I do not here expand on the issue of pre-defined replication plans (see also Kahneman, 2014). Since replicated authors have been able to vary data sets, samples, variable constructions, estimation models and methods and so on, so, within reasonable, similar limits, should replicators be allowed to test the robustness of results. Sauce for the goose, should be sauce for the gander. + +###### Is potential loss of reputation due to unanswerable botched or malign replication, or just the mere mention of replication? If so, why should reputation hang on such a fragile thread, especially nowadays when social media provide ample opportunity for prompt and extensive response from replicatees who consider themselves maligned? + +###### What is it about the replication word that gets to these economists? Elsewhere, co-authors and I have suggested the answer may lie in the nature of professional economics as a policy science (Duvendack and Palmer-Jones, 2013; Camfield et al., 2014). Following Ioannidis (2005) and related work suggesting the fragility of statistical analyses (see Manniadis et al., 2014), we argue that emphasis on “originality” and statistical significance is associated with data mining, model polishing (p-hacking), and HARKing (hypothesising after results are known), in often underpowered studies. These result in far too many false positives, which proper replication will unveil in a process of both checking and robustness testing. We have also suggested that ideological and career interests are involved in the race to produce significant and surprising policy relevant results, behaviour promoted by the institutions of economics, even when producing these results requires practices which may contravene well known principles of statistical estimation and testing. This may result in in cognitive dissonance. I don’t elaborate these arguments here; they are touched on in some of the references already mentioned. The response to cognitive dissonance is not usually to change behaviour producing the dissonance but to ignore the problem, or to justify the practices. We can see these behaviours in the texts protesting against the extension of replication to include robustness testing (or what I prefer to term statistical and scientific testing). + +###### Clemens’s proposal to restrict replication to checking would amount to a “public lie” (Geisser, 2012), a wilful blindness (Heffernan, 2012), or contrived ignorance (Luban, 2007, chapter 6), evidence of a state of denial (Cohen, 1992). Reject this proposal; support proper replication. + +###### References + +###### Camfield, L., Palmer-Jones, R., 2013. Three “Rs” of Econometrics: Repetition, Reproduction and Replication. Journal of Development Studies 49, 1607–1614. + +###### Camfield, L., Palmer-Jones, R.W., 2015. Ethics of analysis and publication in international development, in “Social Science Research Ethics for a Globalizing World: Interdisciplinary and Cross-Cultural Perspectives, in: Nakray, N., Alston, M., Wittenbury, K. (Eds.), Social Science Research Ethics for a Globalizing World: Interdisciplinary and Cross-Cultural Perspectives. Routledge, New York. + +###### Camfield, L., Duvendack, M., Palmer-Jones, R., 2014. 
Things you Wanted to Know about Bias in Evaluations but Never Dared to Think. IDS Bulletin 45, 49–64. + +###### Cohen, S., 2001. States of Denial: Knowing About Atrocities and Suffering. Polity Press, Cambridge. + +###### Collins, H.M., 1991. The Meaning of Replication and the Science of Economics. History of Political Economy 23, 123–142. + +###### Collins, H.M., 1985. Changing Order: Replication and Induction in Scientific Practice. University of Chicago Press, Chicago. + +###### Clemens, M., 2015. The Meaning of Failed Replications: A Review and Proposal. Center for Global Development, Washington, Working Paper 399, [http://www.cgdev.org/sites/default/files/CGD-Working-Paper-399-Clemens-Meaning-Failed-Replications.pdf accessed 14/4/2015](http://www.cgdev.org/sites/default/files/CGD-Working-Paper-399-Clemens-Meaning-Failed-Replications.pdf%20accessed%2014/4/2015). + +###### Dercon, S., Gilligan, D.O., Hoddinott, J., Woldehanna, T., 2014. The Impact of Agricultural Extension and Roads on Poverty and Consumption Growth in Fifteen Ethiopian Villages: Response to William Bowser, 3ie, New Delhi, available , accessed 14/4/2015 + +###### Dewald, W. G., Thursby, J. G., Anderson, R. G. 1986. Replication in Empirical Economics: the Journal of Money, Credit and Banking Project. American Economic Review, 76, 587-603. + +###### Duvendack, M., Palmer-Jones, R.W., 2013. Replication of Quantitative work in development studies: Experiences and suggestions. Progress in Development Studies 13, 307–322. + +###### Duvendack, M., Palmer-Jones, R.W., 2014. Replication of quantitative work in development studies: experiences and suggestions, in: Camfield, L., Palmer-Jones, R.W. (Eds.), As Well as the Subject: Additional Dimensions in Development Research Ethics. Palgrave, London. + +###### Duvendack, M., Palmer-Jones, R.W., Reed, W.R., 2015. Replications in Economics: a Progress Report. Econ Watch Journal, forthcoming. + +###### Geissler, P.W., 2013. Public secrets in public health: Knowing not to know while making scientific knowledge. American Ethnologist 40, 13–34. + +###### Glandon, P., 2011. Report on the American economic review data availability compliance project. American Economic Review 101, 695–699. + +###### Hamermesh, D.S., 2007. Viewpoint: Replication in Economics. Canadian Journal of Economics 40, 715–733. + +###### Hamermesh, D.S., 1997. Some Thoughts on Replications and Reviews. Labour Economics 4, 107–109. + +###### Heffernan, M., 2012. Willful Blindness: Why We Ignore the Obvious at Our Peril, Reprint edition. Walker & Company, New York. + +###### Ioannidis, J.P.A., 2005. Why Most Published Research Findings Are False. PLoS Med 2, e124. + +###### Jensen, R., Oster, E., 2009. The Power of TV: Cable Television and Women’s Status in India. The Quarterly Journal of Economics 124 (3), 1057-1094. + +###### Jensen, R. and Oster, E. 2014. TV, Female Empowerment and Fertility Decline in Rural India: Response to Iversen and Palmer-Jones. *3ie, New Delhi. Available at:* [*http://www.3ieimpact.org/publications/3ie-replication-paper-series/3ie-replication-paper-2/*](http://www.3ieimpact.org/publications/3ie-replication-paper-series/3ie-replication-paper-2/)*,* accessed 29 August + +###### Kahneman, D., 2014. A New Etiquette for Replication. , accessed 28/08/2014. + +###### Luban, D., Legal Ethics and Human Dignity, Cambridge University Press, Cambridge. + +###### McCullough, B.D., Vinod, H.D., 2003. Verifying the Solution from a Nonlinear Solver: A Case Study. The American Economic Review 93, 873–892. 
+ +###### Maniadis, Z., Tufano, F., List, J.A., 2014. One Swallow Doesn’t Make a Summer: New Evidence on Anchoring Effects – Online appendix. American Economic Review 104, 277–290. + +###### Ozler, B., 2014. , accessed 17/10/2014. + +###### Peng, R.D., 2011. Reproducible Research in Computational Science. Science, 334, 1226–1227. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/05/14/replication-talk-costs-lives-why-are-economists-so-concerned-about-the-reputational-effects-of-replications-richard-palmer-jones/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/05/14/replication-talk-costs-lives-why-are-economists-so-concerned-about-the-reputational-effects-of-replications-richard-palmer-jones/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/robert-gelfond-and-ryan-murphy-out-of-sample-tests-and-macroeconomics.md b/content/replication-hub/blog/robert-gelfond-and-ryan-murphy-out-of-sample-tests-and-macroeconomics.md new file mode 100644 index 00000000000..6331b75580a --- /dev/null +++ b/content/replication-hub/blog/robert-gelfond-and-ryan-murphy-out-of-sample-tests-and-macroeconomics.md @@ -0,0 +1,37 @@ +--- +title: "ROBERT GELFOND and RYAN MURPHY: Out-of-Sample Tests and Macroeconomics" +date: 2016-05-14 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Government spending multiplier" + - "Macroeconomics" + - "Out-of-sample tests" +draft: false +type: blog +--- + +###### The replication crisis has elicited a number of recommendations, from ***[betting on beliefs](http://www.tandfonline.com/doi/abs/10.1080/02691729508578768#.VzOWtuSo6io)***, to ***[open data](http://pps.sagepub.com/content/7/6/615.full)***, to ***[improved norms in academic journals regarding replication studies](http://pfr.sagepub.com/content/43/2/139)***. In our recent working paper, “A Call for Out-of-Sample Testing in Macroeconomics” (***[available at SSRN](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2732099)***), we argue that a renewed focus on out-of-sample tests will significantly mitigate the issues with replication, and we document the fact that out-of-sample tests are absent from entire literatures in economics. + +###### Our starting point is the observation that the literature regarding the government spending multiplier lacked almost any result supported by an out-of-sample test. For this we use a fairly forgiving definition of “out-of-sample test;” so long as a model is parameterized in one period and then applied to new data, we count it. We review 87 empirical papers estimating the multiplier. Out-of-sample tests do not make an appearance, with only a few exceptions. Given that this question is perhaps the most important in macroeconomics, with quite literally trillions of dollars on the line, this result is jarring. + +###### It was in 1953 that Milton Friedman published ***[The Methodology of Positive Economics](http://www.sfu.ca/~dandolfa/friedman-1966.pdf)***, urging economists to use prediction as their criterion for comparing the worthiness of competing theories. Clearly, philosophy of science and practical econometrics have moved beyond this simplistic dictum, but does it make sense to cast aside out-of-sample predictions altogether when comparing theories and models? 
Are we all that confident that the results found using the methods which claim the throne of the “***[credibility revolution in empirical economics](https://www.aeaweb.org/articles?id=10.1257/jep.24.2.3)***” will withstand the scrutiny of truly out-of-sample tests? + +###### The primary exception to our result is a ***[2007 paper by Frank Smets and Rafael Wouters](https://www.aeaweb.org/articles?id=10.1257/aer.97.3.586)***, who ably perform an out-of-sample test against a series of baseline models. However, in the absence of other papers performing such tests, it is difficult to say how strong of a result it is. Even more laudable is ***[the lengthy attempt in 2012 by Volker Wieland and his colleagues](http://www.sciencedirect.com/science/article/pii/S0167268112000157)*** in comparing the performance of a number of macroeconomic models, although it is difficult to parse this study to answer the narrower question regarding the government spending multiplier. Another example is a ***[2016 paper by Jorda and Taylor](http://onlinelibrary.wiley.com/doi/10.1111/ecoj.12332/abstract)***, which creates a “counterfactual forecast,” which is similar to, but not quite, an out-of-sample test. + +###### The other two examples we were able to identify were published in ***[1964](http://www.jstor.org/stable/2525559)*** and ***[1967](http://www.jstor.org/stable/2525379)***. + +###### There are clearly additional criteria that economists can and should use for evaluating theories. Nonetheless, the paucity of examples of these tests points to p-hacking, specification searches, and the whole slew of problems associated with the replication crisis of social science. And perhaps macroeconomics is “hard” and things like recessions cannot be reasonably forecasted. Fine. Meteorologists cannot forecast more than a week or so ahead, but they still forecast what they can forecast. What models work the best is still an extremely pertinent question, even if all models fail miserably when a recession hits. + +###### Rather, doing away with out-of-sample tests and other similar tests does away with the scientific ideal of Conjectures and Refutations, with scientific knowledge evolving as bold ideas starkly stated compete for the title of least wrong. + +###### *Bob Gelfond is the CEO of MQS Management LLC and the chairman and founder of MagiQ Technologies. Ryan Murphy is a research assistant professor at the O’Neil Center for Global Markets and Freedom at SMU Cox School of Business.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/05/14/robert-gelfond-and-ryan-murphy-out-of-sample-tests-and-macroeconomics/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/05/14/robert-gelfond-and-ryan-murphy-out-of-sample-tests-and-macroeconomics/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/roodman-appeal-to-me-first-trial-of-a-replication-opinion.md b/content/replication-hub/blog/roodman-appeal-to-me-first-trial-of-a-replication-opinion.md new file mode 100644 index 00000000000..b4b7c7048b4 --- /dev/null +++ b/content/replication-hub/blog/roodman-appeal-to-me-first-trial-of-a-replication-opinion.md @@ -0,0 +1,160 @@ +--- +title: "ROODMAN: Appeal to Me – First Trial of a “Replication Opinion”" +date: 2025-05-31 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Academic incentives" + - "Comments" + - "economics" + - "Evidence-based policy" + - "Journal policy" + - "Meta-Science" + - "Open Philanthropy" + - "peer review" + - "replications" + - "Truth-seeking" +draft: false +type: blog +--- + +*[This blog is a repost of a blog that first appeared at [**davidroodman.com**](https://davidroodman.com/blog/2025/05/09/appeal-to-me-first-trial-of-a-replication-opinion/). It is republished here with permission from the author.]* + +My employer, Open Philanthropy, strives to make grants in light of evidence. Of course, many uncertainties in our decision-making are irreducible. No amount of thumbing through peer-reviewed journals will tell us how great a threat AI will pose decades hence, or whether a group we fund will get a vaccine to market or a bill to the governor’s desk. But we have checked journals for insight into many topics, such as the [***odds of a grid-destabilizing geomagnetic storm***](https://www.openphilanthropy.org/research/geomagnetic-storms-an-introduction-to-the-risk/), and how much [***building new schools boosts what kids earn when they grow up***](https://www.openphilanthropy.org/research/does-putting-kids-in-school-now-put-money-in-their-pockets-later-revisiting-a-natural-experiment-in-indonesia/). + +When we draw on research, we vet it in rare depth (as does GiveWell, from which we spun off). I have sometimes spent months replicating and reanalyzing a key study—checking for bugs in the computer code, thinking about how I would run the numbers differently and how I would interpret the results. This interface between research and practice might seem like a picture of harmony, since researchers want their work to guide decision-making for the public good and decision-makers like Open Philanthropy want to receive such guidance. + +Yet I have come to see how cultural misunderstandings prevail at this interface. From my side, there are two problems. First, about half the time I reanalyze a study, I find that there are important bugs in the code, or that adding more data makes the mathematical finding go away, or that there’s a compelling alternative explanation for the results. (Caveat: most of my experience is with non-randomized studies.) + +Second, when I send my critical findings to the journal that peer-reviewed and published the original research, the editors usually don’t seem interested ([***recent exception***](https://www.journals.uchicago.edu/doi/10.1086/732254)). + +Seeing the ivory tower as a bastion of truth-seeking, I used to be surprised. I understand now that, because of how the academy works, in particular, because of how the individuals within academia respond to incentives beyond their control, we consumers of research are sometimes more truth-seeking than the producers. + +Last fall I read a tiny illustration of the second problem, and it inspired me to try something new. 
Dartmouth economist Paul Novosad tweeted his pique with economics journals over how they handle challenges to published papers: + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2025/05/image.webp) + +As you might glean from the truncated screenshots, the starting point for debate is a [***paper published in 2019***](https://doi.org/10.1257/app.20170223). It finds that U.S. immigration judges were less likely to grant asylum on warmer days. For each 10°F the temperature went up, the chance of winning asylum went down 1 percentage point. + +The [***critique***](https://doi.org/10.1257/app.20200118) was written by another academic. It fixes errors in the original paper, expands the data set, and finds no such link from heat to grace. In the [***rejoinder***](https://doi.org/10.1257/app.20200068), the original authors acknowledge errors but say their conclusion stands. “AEJ” (*American Economic Journal: Applied Economic*s) published all three articles in the debate. As you can see, the dueling abstracts confused even an expert. + +So I appointed myself *judge* in the case. Which I’ve never seen anyone do before, at least not so formally. I did my best to hear out both sides (though the “hearing” was reading), then identify and probe key points of disagreement. I figured my take would be more independent and credible than anything either party to the debate could write. I hoped to demonstrate and think about how academia sometimes struggles to serve the cause of truth-seeking. And I could experiment with this new form as one way to improve matters. + +I just filed my opinion, which is to say, the Institute for Replication has [***posted it***](https://www.econstor.eu/handle/10419/316399). (Open Philanthropy [***partly funds***](https://www.openphilanthropy.org/grants/university-of-ottawa-institute-for-replication/) them.) My colleague [***Matt Clancy***](https://www.openphilanthropy.org/about/team/matt-clancy/) has pioneered [***living literature reviews***](https://www.newthingsunderthesun.com/about); he suggested that I make this opinion a living document as well. If either party to the debate, or anyone else, changes my mind about anything in the opinion, I will [***revise it***](https://github.com/droodman/RO-Heyes-Saberian-2019/releases) while preserving the history. + +**Verdict** + +My conclusion was more one-sided than I had expected. I came down in favor of the commenter. The authors of the original paper defend their finding by arguing that in retrospect they should have excluded the quarter of their sample consisting of asylum applications filed by people from *China*. Yes, they concede, correcting the errors mostly erases their original finding. But it reappears after Chinese are excluded. + +This argument did not persuade me. True, during the period of this study, 2000–04, most Chinese asylum-seekers applied under a [***special U.S. law***](https://www.govinfo.gov/content/pkg/PLAW-104publ208/pdf/PLAW-104publ208.pdf#page=690) meant to give safe harbor to women fearing forced sterilization and abortion in their home country. + +The authors seem to argue that because grounds for asylum were more demonstrable in these cases—anyone [***could read***](https://www.theguardian.com/books/2013/may/06/chinas-barbaric-one-child-policy) about the draconian enforcement of China’s one-child policy—immigration judges effectively lacked much discretion. 
And if outdoor temperature couldn’t meaningfully affect their decisions, the cases were best dropped from a study of precisely that connection. + +But this premise is flatly contradicted by a [***study the authors cite***](https://stanfordlawreview.org/wp-content/uploads/sites/3/2010/04/RefugeeRoulette.pdf) called “Refugee Roulette.” In the study, Figure 6 shows that judges differed widely in how often they granted asylum to Chinese applicants. One did so less than 5% of the time, another more than 90%, and the rest were spread evenly between. (For a more thorough discussion, read sections 4.4 and 6.1 of my [***opinion***](https://www.econstor.eu/handle/10419/316399).) + +Thus while I do not dispute that there is a correlation between temperature and asylum grants in a particular subset of the data, I think it is best explained by [***p-hackin***](https://doi.org/10.1037/a0033242)[g](https://doi.org/10.1037/a0033242) or some other form of “filtration,” in which, [***consciously or not***](https://doi.org/10.1511/2014.111.460), researchers gravitate toward results that happen to look statistically significant. (In fairness, they know that peer reviewers, editors, and readers gravitate to the same sorts of results, and getting a paper into a good journal can make a career.) + +The nature of the defense raises a question about how the journal handled the dispute. It published the original authors’ rejoinder [***as a Correction***](https://www.aeaweb.org/articles?id=10.1257/app.20200068)[.](https://www.aeaweb.org/articles?id=10.1257/app.20200068) Yet, while one might agree that it is *better* to exclude Chinese from the analysis, I think their inclusion in the original was not an *error*, and therefore their exclusion is not a *correction*. Thus, one way the journal might have headed off Novosad’s befuddlement would have been by insisting that Corrections only make corrections. + +**What’s wrong with this picture?** + +To recap: + +– *Two economists performed a quantitative analysis of a clever, novel question.* + +– *It underwent peer review.* + +– *It was published in one of the*[***top journals in economics***](https://www.pjip.org/Economics-journal-rankings.html)*. Its data and computer code were [**posted online**](https://www.openicpsr.org/openicpsr/project/113722/version/V1/view), per the journal’s [**policy**](https://www.aeaweb.org/journals/data)* + +– *Another researcher [**promptly responded**](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3645463) that the analysis contains errors (such as computing average daytime temperature with respect to Greenwich time rather than local time), and that it could have been done on a much larger data set (for 1990 to ~2019 instead of 2000–04). These changes make the headline findings go away.* + +– *After behind-the-scenes back and forth among the disputants and editors, the journal published the comment and rejoinder.* + +– *These new articles confused even an expert.* + +– *An outsider (me) delved into the debate and found that it’s actually a pretty easy call.* + +If you score the journal on whether it successfully illuminated its readership as to the truth, then I think it is kind of 0 for 2. + +[Update: I submitted the opinion to the journal, which promptly rejected it. I understand that the submission was an odd duck. But if I’m being harsh I can raise the count to 0 for 3.] + +That said, *AEJ Applied* did support dialogue between economists that eventually brought the truth out. 
In particular, by requiring public posting of data and code (an area where this journal and its siblings have been pioneers), it facilitated rapid scrutiny. + +Still, it bears emphasizing:*For quality assurance, the data sharing was much more valuable than the peer review*. And, whether for lack of time or reluctance to take sides, the journal’s handling of the dispute obscured the truth. + +My purpose in examining this example is not to call down a thunderbolt on anyone, from the Olympian heights of a funding body. It is rather to use a concrete story to illustrate the larger patterns I mentioned earlier. + +Despite having undergone peer review, many published studies in the social sciences and epidemiology do not withstand close scrutiny. When they are challenged, journal editors have a hard time managing the debate in a way that produces more light than heat. + +I have critiqued papers about the impact of [***foreign aid***](https://www.jstor.org/stable/3592954), [***microcredit***](https://www.tandfonline.com/doi/abs/10.1080/00220388.2013.858122), [***foreign aid***](https://retractionwatch.com/2012/06/29/authors-retract-plos-medicine-foreign-health-aid-paper-that-had-criticized-earlier-lancet-study/), [***deworming***](http://blog.givewell.org/2017/12/07/questioning-evidence-hookworm-eradication-american-south/), [***malaria eradication***](https://blog.givewell.org/2017/12/29/revisiting-evidence-malaria-eradication-americas/), [***foreign aid***](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(12)61529-3/fulltext), [***geomagnetic storm risk***](https://www.openphilanthropy.org/research/geomagnetic-storms-historys-surprising-if-tentative-reassurance/), [***incarceration***](https://www.openphilanthropy.org/research/reasonable-doubt-a-new-look-at-whether-prison-growth-cuts-crime/), [***schooling***](https://www.openphilanthropy.org/research/does-putting-kids-in-school-now-put-money-in-their-pockets-later-revisiting-a-natural-experiment-in-indonesia/), [***more schooling***](https://arxiv.org/abs/2303.11956), [***broadband***](https://arxiv.org/abs/2401.13694), [***foreign aid***](https://doi.org/10.1177/1091142114537895), [***malnutrition***](https://papers.ssrn.com/abstract=4294284), …. + +Many of those critiques I have submitted to journals, typically only to receive polite rejections. I obviously lack objectivity. But it has struck me as strange that, in these instances, we on the outside of academia seem more concerned about getting to the truth than those on the inside. Sometimes I’ve wished I could appeal to an independent authority to review a case and either validate my take or put me in my place. + +*That* yearning is what primed me to respond to Novosad’s tweet by donning the robe of a judge myself. (I passed on the wig.) + +I’ve never edited a journal, but I’ve talked to people who have, and I have some idea of what is going on. Editors juggle many considerations besides squeezing maximum truth juice out of any particular study. Fully grasping a replication debate takes work—imagine the parties lobbing 25-page missives at each other, dense with equations, tables, and graphs—and editors are busy. + +Published comments don’t get [***cited***](https://econjwatch.org/articles/decline-in-critical-commentary-1963-2004) [***much***](https://doi.org/10.1111/ecin.13222) anyway, and editors keep an eye on [***how much their journals get cited***](https://www.pjip.org/Economics-journal-rankings.html). 
They may also weigh the personal costs for the people whose reputations are at stake. Many journals, especially those published by professional associations, want to be open to all comers—to be the moderator, not the panelist, the platform, not the content provider.

The job they set for themselves is not quite to assess the reliability of any given study (a tall order) but to certify that each article meets a minimum standard, to support the collective dialogue through which humanity seeks scientific truth.

Then, too, I think journal editors often care a lot about whether a paper makes a “contribution” such as a novel question, data source, or analytical method. Closer to home, junior editors may think twice before welcoming criticism that could harm the reputation of their journal or ruffle the feathers of more powerful members of their flock. Senior editors may have gotten where they are by thinking in the same, savvy way.

Modern science is the best system ever developed for pursuing truth. But it is still populated by human beings ([***for how much longer?***](https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/)) whose cognitive wiring makes the process uncomfortable and imperfect. Humans are tribal creatures—not as wired for selflessness as your average ant, but more apt to go to war than an eel or an eagle.

Among the bits of psychological glue that bind us are shared ideas about “is” and “ought.” Imperialists and evangelists have long influenced shared ideas in order to expand and solidify the groups over which they hold sway. The links between belief, belonging, and power are why the notion that evidence trumps belief was so revolutionary when the Roman church condemned Galileo, and why the idea, despite making modernity possible, remains discomfiting to this day.

The inefficiency in pursuing truth has real costs for society. Some social science research influences decisions by private philanthropies and public agencies, decisions whose stakes can be measured in human lives, or in the millions, billions, even trillions of dollars. Yet individual studies receive perhaps hundreds of dollars’ worth of time in peer review, and that within a system in which getting each paper as close as possible to the truth is one of several competing priorities.

Making science work better is the stuff of *metascience*, an area in which Open Philanthropy [***makes grants***](https://www.openphilanthropy.org/grants/?focus-area=innovation-policy). It’s a big topic. Here, I’ll merely toss out the idea that if these new-fangled replication opinions were regularly produced, they could somewhat mitigate the structural deprecation of truth-seeking.

On the demand side—among decision-makers using research—replication opinions could improve the vetting of disputed studies, while efficiently targeting the ones that matter most. (Related idea [***here***](https://www.chronicle.com/article/social-science-is-broken-heres-how-to-fix-it).)

On the supply side, a heightened awareness that an “appeals court” could upstage journals in a role laypeople and policymakers expect them to fill—performing quality assurance on what they publish—could stimulate the journals to handle replication debates in a way that better serves their readers and society.

**Reflections on writing the replication opinion**

Writing a novel piece led me to novel questions.
To prepare for writing my opinion, I read about how judges [***write***](https://www.fjc.gov/content/judicial-writing-manual-pocket-guide-judges-second-edition) [***theirs***](https://scholarship.law.umn.edu/mlr/1677). Judicial opinions usually have a few standard sections. They review the history of the case (what happened to bring it about, what motions were filed); list agreed facts; frame the question to be decided; enunciate some standard that a party has to meet, perhaps handed down by the Supreme Court; and then bring the facts to the standard to reach a decision. + +Could I follow that outline? Reviewing the case history was easy enough. I had the papers and could inventory their technical points. The data and computer code behind the papers are [***on***](https://www.openicpsr.org/openicpsr/project/113722/version/V1/view) [***the***](https://doi.org/10.7910/DVN/3LOR3R) [***web***](https://www.openicpsr.org/openicpsr/project/127263/version/V1/view), so I could rerun the code and stipulate facts such as that a particular statistical procedure applied to a particular data set generates a particular output. + +Figuring out what I was trying to *judge* was harder. Surely it was not whether, for all people, places, and times, heat makes us less gracious. Nor should I try to decide that question even in the study’s context, which was U.S. asylum cases decided between 2000 and 2004. + +Truth in the social sciences is rarely absolute. We use statistics precisely because we know that there is noise in every measurement, uncertainty in every finding. In addition, by [***Bayes’ Rule***](https://www.youtube.com/watch?v=BrK7X_XlGB8), the conclusions we draw from any one piece of evidence depend on the prior knowledge we bring to it, which is shaped by other evidence. + +Someone who has read 10 ironclad articles on how temperature affects asylum decisions should hardly be moved by one more. Yet I think those 10 other studies, if they existed, would lie beyond the scope of this case. + +That means that my replication opinion is *not* about the effects of temperature on behavior in any setting. It’s more meta than that. It’s about how much this new paper should *shift or strengthen* one’s views on the question. + +After reflecting on these complications, here is what I decided to decide: *to the extent that a reasonable observer updated their priors after reading the original paper, how much should the subsequent debate reverse or strengthen that update?* + +My judgment need not have been binary. Unlike a jury burdened with deciding guilt or innocence, a replication opinion can come down in the middle, again by Bayes’ Rule. Sometimes there is more than one reasonable way to run the numbers and more than one reasonable way to interpret the results. + +I sought rubrics through which to organize my discussion—both to discipline my own reasoning and to set precedents, should I or anyone else do this again. I borrowed a[***typology developed by former colleague Michael Clemens***](https://onlinelibrary.wiley.com/doi/10.1111/joes.12139) of the varieties of replication and robustness testing, as well as a typology of statistical issues from [***Shadish, Cook, and Campbell***](https://www.amazon.com/Experimental-Quasi-Experimental-Designs-Generalized-Inference/dp/0395615569). + +And I made a list of study traits that we can expect to be associated, on average, with publication bias and other kinds of result filtration. 
For example, there is [***evidence***](https://doi.org/10.1257/app.20150044) that in top journals, statistical results from junior economists, who are running the publish-or-perish gauntlet toward tenure, are more likely to report results that *just* clear conventional thresholds for statistical significance. That is consistent with the theory that the researchers on whom the system’s perverse incentives impinge most strongly are most apt to run the numbers several ways and emphasize the “significant” runs in their write-ups. + +One tricky issue was how much I should analyze the data myself. The upside could be more insight. The downside could be a loss of (perceived) objectivity if the self-appointed referee starts playing the game. Wisely or not, I gave myself *some* leeway here. Surely real judges also rely on their knowledge about the world, not just what the parties submit as evidence. + +For example, in addition to its analysis of asylum decisions, the original paper checks whether the California parole board was less likely to grant parole on warmer days in 2012–15. Partly because the critical comment did not engage with this side-analysis, I revisited it myself. I transferred it to the next quadrennium, 2016–19, while changing the original computer code as little as possible. (Here, too, the apparent impact of temperature went away.) + +**Closing statement** + +The stakes in this case are probably low. While the question of how temperature affects human decision-making links broadly to climate change, and the arbitrariness of the American immigration system is a serious concern, I would be surprised if any important policy decision in the next few years turns on this research. + +But the case illustrates a much larger problem. Some studies do influence important decisions. That they have been peer-reviewed should hardly reassure. Judicious [***post-publication review***](https://statmodeling.stat.columbia.edu/2016/12/16/an-efficiency-argument-for-post-publication-review/) of important studies, perhaps including “replication opinions,” can give decision-makers with real dollars and real livelihoods on the line a clearer sense of what the data do and do not tell us. + +Unfortunately, powerful incentives within academia, rooted in human nature, have generally discouraged such Socratic inquiry. + +I like to think of myself as judicious. As to whether I’ve lived up to my self-image [***in this case***](https://www.econstor.eu/handle/10419/316399), I will let you be the judge. At any rate, I figure that in the face of hard problems, it is good to try new things. We will see if this experiment is replicated, and if that does much good. + +*David Roodman is Senior Advisor at Open Philanthropy. He can be contacted at [**david@davidroodman.com**](mailto:david@davidroodman.com)* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2025/05/31/roodman-appeal-to-me-first-trial-of-a-replication-opinion/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2025/05/31/roodman-appeal-to-me-first-trial-of-a-replication-opinion/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/roodman-hookworms-and-malaria-and-replications-oh-my.md b/content/replication-hub/blog/roodman-hookworms-and-malaria-and-replications-oh-my.md new file mode 100644 index 00000000000..7cf62f22d52 --- /dev/null +++ b/content/replication-hub/blog/roodman-hookworms-and-malaria-and-replications-oh-my.md @@ -0,0 +1,87 @@ +--- +title: "ROODMAN: Hookworms and Malaria and Replications, Oh My!" +date: 2018-12-21 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "American Economic Association" + - "David Roodman" + - "GiveWell" + - "Hookworm" + - "Hoyt Bleakley" + - "IREE" + - "Malaria" + - "Re-analysis" + - "replication crisis" +draft: false +type: blog +--- + +###### About 10 years ago, the economist Hoyt Bleakley published two important papers on the impact of health on wealth—more precisely, on the long-term economic impacts of large-scale disease eradication campaigns. In the *Quarterly Journal of Economics*, “*[Disease and Development: Evidence from Hookworm Eradication in the American South](https://doi.org/10.1162/qjec.121.1.73)*” found that a hookworm eradication campaign in the American South in the 1910s was followed by a substantial gain in adult earnings. In *AEJ: Applied Economics*, “*[Malaria Eradication in the Americas: A Retrospective Analysis of Childhood Exposure](https://doi.org/10.1257/app.2.2.1)*” reported similar benefits from 20th-century *malaria* eradication efforts in Brazil, Colombia, Mexico, and the U.S. + +###### With my colleagues at [GiveWell](https://givewell.org) providing inspiration and assistance, I replicated and reanalyzed both studies. The resulting pair of papers has just appeared in the *[International Journal for Re-Views in Empirical](https://www.iree.eu) [Economics](https://www.iree.eu)* (*[hookworm](https://doi.org/10.18718/81781.7)*, *[malaria](https://doi.org/10.18718/81781.8)*). I’ve blogged my findings on givewell.org (*[hookworm](https://blog.givewell.org/2017/12/07/questioning-evidence-hookworm-eradication-american-south/)*, *[malaria](https://blog.givewell.org/2017/12/29/revisiting-evidence-malaria-eradication-americas/)*). Short version: I can buy the Bleakley findings for malaria, but not for hookworm. + +###### Here I will share some thoughts sparked by my experience about *process*—about how we generate, review, publish, and revisit research in the social sciences. + +###### **To win trust, studies need to be reanalyzed, not just replicated** + +###### Psychology is now in the throes of a *[replication crisis](https://en.wikipedia.org/wiki/Replication_crisis#In_psychology)*: when published lab experiments are repeated, about half the time the original results (presumably statistically different from zero) disappear (*[this](https://doi.org/10.1126/science.aac4716)*, *[this](https://cos.io/about/news/28-classic-and-contemporary-psychology-findings-replicated-more-60-laboratories-each-across-three-dozen-nations-and-territories/)*). *[Some see](http://theconversation.com/the-replication-crisis-has-engulfed-economics-49202)* a replication crisis in economics too. I do not. 
In my experience (*[this](https://doi.org/10.1257/0002828041464560)*, *[this](https://elibrary.worldbank.org/doi/abs/10.1093/wber/lhm004?journalCode=wber)*, *[this](https://doi.org/10.1016/S0140-6736(12)61529-3)*, *[this](https://doi.org/10.1080/00220388.2013.858122)*, *[this](https://doi.org/10.1177%2F1091142114537895)*, *[this](https://davidroodman.com/david/The%20impacts%20of%20alcohol%20taxes%206.pdf)*, *[this](https://www.openphilanthropy.org/blog/reasonable-doubt-new-look-whether-prison-growth-cuts-crime)*, *[this](https://blog.givewell.org/2016/12/06/why-i-mostly-believe-in-worms/)*, …), most empirical research in economics *does* replicate, in the sense that original results can be matched when applying the reported methods to the reported data. The matches are perfect when original data and code are available and approximate otherwise. A *[paper by Federal Reserve economists](https://www.federalreserve.gov/econresdata/feds/2015/files/2015083pap.pdf)* reaches the opposite conclusion only by counting as non-replicable any study whose authors did not respond to a request for data. + +###### I would say, rather, that economics is in a *reanalysis* crisis. Or perhaps a “robustness crisis.” When I turn from replicating a study to revising it, introducing arguable improvements to data and code, the original findings often slip away like sand through my fingers. About half the time, in fact. The split decision on the two Bleakley papers is a case in point. Another is my [“](https://www.openphilanthropy.org/blog/reasonable-doubt-new-look-whether-prison-growth-cuts-crime)[*reanalysis review” of the impact of incarceration on crime*](https://www.openphilanthropy.org/blog/reasonable-doubt-new-look-whether-prison-growth-cuts-crime): of the eight studies for which data availability permitted replication, I found what I deemed to be significant methodological concerns in seven, and that ultimately led to me to reverse my reading of four. (Caveat: Essentially all my experience is with observational studies rather than field experiments, which may be more robust.) + +###### This is why I say that half of economics studies are reliable—I’m just not sure which half. Seriously, as a partial answer to “which half?”, I conjecture that young, tenure-track researchers are more apt to produce fragile work, because they are under the most intense pressure to generate significant, non-zero results. + +###### **Review of research is under-supplied** + +###### Many studies in economics aspire to influence policy decisions that have stakes measured in billions or trillions of dollars. Yet society invests only hundreds or thousands of dollars in assessing the quality of economics research, mainly in the form of peer review. And if only half the studies that survive peer review withstand closer scrutiny, then we evidently have not reached the point of diminishing returns to investment in review. + +###### There is something wrong with this picture. Serious assessment of published research is a public good and so is under-supplied. Who will fill the gap? + +###### **Reanalysis, like original analysis, cannot be mechanized** + +###### I like to think that in reanalyzing research, I strike a judicious balance. Ideally, I introduce appropriately tough robustness tests; yet I avoid “gotcha” specification mining, trying lots of things until I break a regression. Ultimately, it is for readers to assess my success. 
One might take the discretionary character of reanalysis as a fatal flaw: replication, by contrast, can be fully pre-specified and is in this sense more objective. But by the same argument, one ought not to perform original research. A better approach is to *[marshal the toolkit](https://dx.doi.org/10.1126%2Fscience.1245317)* that has gradually been assembled to improve the objectivity and reliability of original research, and bring it to reanalysis—for example, posting data and code along with finished analysis, and preregistering one’s analytical plan of attack. In revisiting the Bleakley studies, I did both. + +###### **Preregistering reanalysis is a good thing** + +###### In fact, this was my first-time preregistering. I’ve heard of preregistered analysis plans that run to hundreds of pages. My plans (*[hookworm](https://osf.io/yb537/)*, *[malaria](https://osf.io/h98yf/)*) just run a page or so. The *[Open Science Framework](https://osf.io/)* of the Center for Open Science serves as the perfect, public home for the documents, as an independent party that credibly time-stamped them and makes them public. + +###### I tried to use the plans to signal my strategy, recognizing that tactics would need to be refined after encountering the data. But I did not take the plans as *binding*. I allowed myself to stray outside a plan, while working to inform the reader when I had done so. After all, reanalysis is a creative act too, which I think should be allowed to take unexpected turns. It’s also a social act: helpful or even peremptory comments from the original authors, as well as reviewers and editors, are bound to motivate changes late in a project. + +###### That said, I think I have room to mature as preregisterer. I could have written my hookworm plan with more care, making it more predictive of what I ultimately did, thus adding to its credibility. + +###### **Original authors should be included in the review of replications and reanalyses, in the right way** + +###### I always send draft write-ups of replications and reanalyses to the original authors. Some don’t respond (much). Others do, and I always learn from them (*[Pitt](https://web.archive.org/web/20120314132029/http://www.pstc.brown.edu/~mp/papers/Pitt_response_to_RM.pdf)*, *[Bleakley](https://web.archive.org/web/20180619194738/http://www-personal.umich.edu/~hoytb/Bleakley_Hookworm_Corrigenda.pdf)*). Clearly original authors should be heard from. But should journals give them the full powers of a referee? Maybe not. This creates an incentive for them to withhold comment on drafts sent to them before submission to a journal and then, when invited to referee, to roll out all their criticisms before the editor. Presumably some of the criticisms will be valid, and ought to be incorporated before involving other referees. Managing editor *[Martina Grunow](https://replicationnetwork.com/tag/martina-grunow/)* explained to me how *IREE* threads this needle: + +###### *“We…decided that contacting the original author must be done by the replicator and before submitting the replication study to IREE (with 4 weeks waiting time whether the author responds) and that the contact (attempt or dialog) must be documented in the paper. This mainly protects the replicator against the killer argument [that the replicator failed to perform the due diligence of sharing the text with the original author]. In the case that an original author wants to comment on the replication, we offer to publish this comment along with the replication study. 
Up to now this did not happen. As we read in the submissions to IREE, most original authors do not reply when they are contacted by the replicators.”* + +###### I like this solution: do not use original authors as ordinary referees, but require replicators to make reasonable efforts to include original authors in the process before journal submission. + +###### **The American Economic Association’s archiving policy has holes** + +###### In 2003, the *American Economic Review* published a study by *[McCullough and Vinod](https://doi.org/10.1257/000282804322970896)*, which tried—and failed—to replicate ten empirical papers in an issue of the *AER*. At the time, the journal merely required publishing authors to provide data and code to interested researchers upon request: + +###### *“Though the policy of the AER requires that ‘Details of computations sufficient to permit replication must be provided,’ we found that fully half of the authors would not honor the replication policy….Two authors provided neither data nor code: in one case the author said he had already lost all the files; in another case, the author initially said it would be ‘next semester’ before he would have time to honor our request, after which he ceased replying to our phone calls, e-mails, and letters. A third author, after several months and numerous requests, finally supplied us with six diskettes containing over 400 files—and no README file. Reminiscent of the attorney who responds to a subpoena with truckloads of documents, we count this author as completely noncompliant. A fourth author provided us with numerous datafiles that would not run with his code. We exchanged several e-mails with the author as we attempted to ascertain how to use the data with the code. Initially, the author replied promptly, but soon the amount of time between our question and his response grew. Finally, the author informed us that we were taking up too much of his time—we had not even managed to organize a useable data set, let alone run his data with his code, let alone determine whether his data and code would replicate his published results.”* + +###### I’ve had similar experiences. (As well as plenty of better ones, with cooperative replicatees.) + +###### In response, *AER* editor Ben Bernanke *[announced an overhaul](https://web.archive.org/web/20160322212917/http://www.climateaudit.info/pdf/aereditorial.pdf)*: henceforth, the journal would require submission of data and code to a central archive at the time of publication. The policy now applies to all American Economic Association journals, including the one that published the Bleakley study of malaria eradication. + +###### Kudos to Bernanke and the *AER*, for that policy reform put the journal many years ahead of the *QJE*, which became the periodical of record for the Bleakley hookworm study. But in taking advantage of the [Bleakley malaria data and code archive](https://www.aeaweb.org/aej/app/data/2008-0126_data.zip), I also ran into two serious gaps in the AEA’s policy, or at least its implementation. These leave substantial scope for original authors to impede replication. As I *[write](https://doi.org/10.18718/81781.8)*: + +###### *“First, [the AEA journals] provide no access to the primary data, or at least to the code that transforms the primary data into the analysis data. The American Economic Review’s own assessment of compliance with its data availability policy highlighted this omission in 2011. 
‘Simply requiring authors to submit their data prior to publication may not be sufficient to improve accuracy….The broken link in the replication process usually lies in the procedures used to transform raw data into estimation data and to perform the statistical analysis, rather than in the data themselves’ ([Glandon 2011](https://pubs.aeaweb.org/doi/pdfplus/10.1257/aer.101.3.684#page=12)). Second, code is provided for tables only, not figures. Yet figures can play a central role in a study’s conclusions and impact. Like tables, figures distill large amounts of data to inform inference. They ought to be fully replicable, but only can be if their code is public too.”* + +###### I think much of the power of the Bleakley studies lay in figures that seemed to show kinks in long-term earnings trends with timing explicable by the eradication campaigns. In the hookworm study, those kinks pretty substantially faded in the attempted replication—and it is impossible to be sure why, for lack of access to much of the original data and code. Potential causes include discrepancies between original and replication in primary database construction, in the transformation code, or in the figure-generating code. + +###### The AEA and other publishers can and should head-off such mysteries, with more complete archiving. + +###### *David Roodman is a Senior Advisor at [GiveWell](https://www.givewell.org). He has replicated and reanalyzed research on foreign aid effectiveness, geomagnetic storms, alcohol taxes, immigration, microcredit, and other subjects.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/12/21/hookworms-and-malaria-and-replications-oh-my/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/12/21/hookworms-and-malaria-and-replications-oh-my/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/royne-building-and-enhancing-the-advertising-discipline-through-replication.md b/content/replication-hub/blog/royne-building-and-enhancing-the-advertising-discipline-through-replication.md new file mode 100644 index 00000000000..2fdce75b9e1 --- /dev/null +++ b/content/replication-hub/blog/royne-building-and-enhancing-the-advertising-discipline-through-replication.md @@ -0,0 +1,60 @@ +--- +title: "ROYNE: Building and Enhancing the Advertising Discipline Through Replication" +date: 2018-04-06 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Advertising" + - "Journal of Advertising" + - "Journal policies" + - "replication" +draft: false +type: blog +--- + +###### *[This blog is taken from a recent editorial that appeared in the Journal of Advertising Research entitled “Why We Need More Replication Studies to Keep Empirical Knowledge in Check” by Marla B. Royne. The full-length editorial can be found **[here](http://www.journalofadvertisingresearch.com/content/58/1/3)**]* + +###### Advertising research in top advertising journals goes through a rigorous peer-review process to ensure that published articles are of the highest quality.  Yet such research is generally published only in these top journals when something novel is reported because previously reported results are believed to be less interesting and unimportant. 
Despite the rarity of replication research in advertising, Nosek et al. (2012) note that replications can help keep existing knowledge in check, yet they argue that academia rewards only novel, positive results. However, controversy about conducting and publishing advertising replications remains.

###### Replications might be best viewed as a process of conducting similar but consecutive studies that increasingly consider alternative explanations, critical contingencies, and real-world relevance. This belief is in line with my own work (Royne 2016) and supports the role of replications as a way to reach ultimate understanding of a particular theory or construct.

###### The *Journal of Advertising*’s 2016 special issue on “re-inquiries” in advertising research reinforced this notion, publishing a range of articles that reinvestigated advertising questions. The issue included articles that replicated existing studies either empirically or conceptually; in some cases these articles offered support for the original work, and in others they provided different results.

###### For example, Kwon, Ratneshwar, and Kim (2016) attempted an empirical replication of Gwinner and Eaton (1999), which had shown effects of brand sponsorship on image congruence between sponsoring brands and sponsored sporting events. Attempting to address potential methodological flaws, the authors enhanced their statistical analyses. The findings supported some of the original results, including that brand sponsorship increases image congruence between sponsoring brands and sponsored sporting events. However, only mild support for the matchup hypothesis was found, and there was no support for a moderating influence of image-based similarity on the extent of image congruence.

###### Another article in the same issue, by Bellman, Wooley, and Varan (2016), applied facial-tracking technology in a conceptual replication of Kamins, Marks, and Skinner’s (1991) program–ad matching study. Although this research examined the original study’s program–ad matching effect for informational advertisements on cognitive recall, the replication differed in that it utilized a mixed experimental design, different genres of television shows, and a biometric process measure (computer-detected smiling). The replication both corroborated and extended the original study, demonstrating how replications need not be limited to just the original results.

###### A third study was a hybrid attempt to empirically replicate the findings of Kees et al. (2006), who originally reported that more graphic pictorial cigarette warnings positively affect smoking cessation intentions and that evoked fear underlies this relationship (Davis and Burton, 2016). This replication also differed from the original. Specifically, the 2016 study used cigarette advertisements (not packaging and warning statements), FDA-mandated pictures (not self-selected pictures), and different samples. Partial corroboration of Kees et al. (2006) was found, including additional support that more graphic pictorials positively influence warning effectiveness perceptions and smoking cessation intentions, and confirmation of evoked fear as the primary underlying mediating mechanism.

###### These are just three examples of studies that added knowledge and understanding to the advertising literature through “replications” that were not 100% pure repeats of what had been done previously.
In short, replication is about much more than just redoing a study that had been done before; rather, it has the vast potential of building and enhancing the advertising discipline. + +###### *Marla B. Royne (Stafford) is the Great Oaks Foundation Professor of Marketing at the University of Memphis, USA. She is past President of the American Academy of Advertising and past Editor-in-Chief of the Journal of Advertising, the leading journal in the advertising discipline. Professor Royne Stafford can be contacted at mstaffrd@memphis.edu.* + +###### **REFERENCES** + +###### Bellman, S., B. Wooley, and D. Varan. “Program–Ad Matching and Television Ad Effectiveness: A Reinquiry Using Facial Tracking Software.” *Journal of Advertising* 45, 1 (2016): 72–77. + +###### Davis, C., and S. Burton. “Understanding Graphic Pictorial Warnings in Advertising: A Replication and Extension.” *Journal of Advertising* 45, 1 (2016): 33–42. + +###### Gwinner, K. P., and J. Eaton. “Building Brand Image through Event Sponsorship: The Role of Image Transfer.” *Journal of Advertising* 28, 4, (1999): 47–57. + +###### Kamins, M. A., L. J. Marks, and D. Skinner. “Television Commercial Evaluation in the Context of Program-Induced Mood: Congruency versus Consistency Effects.” *Journal of Advertising* 20, 2 (1991): 1–14. + +###### Kees, J., S. Burton, J. C. Andrews, and J. Kozup. “Tests of Graphic Visuals and Cigarette Package Warning Combinations: Implications for the Framework Convention on Tobacco Control.” *Journal of Public Policy and Marketing* 25, 2 (2006): 212–23. + +###### Kwon, E., S. Ratneshwar, and E. Kim. “Brand Image Congruence Through Sponsorship of Sporting Events: A Reinquiry of Gwinner and Eaton (1999).” *Journal of Advertising* 45, 1 (2016): 130–38. + +###### Nosek, B. A., J. R. Spies, and M. Motyl. “Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth over Publishability.” *Perspectives on Psychological Science* 7, 6 (2012): 615–31. + +###### Park, J. H., O. Venger, D. Y. Park, and L. N. Reid. “Replication in Advertising Research, 1980-2012: A Longitudinal Analysis of Leading Advertising Journals.” *Journal of Current Issues and Research in Advertising* 36, (2015): 115–35. + +###### Royne (Stafford), M. “Research and Publishing in the Journal of Advertising: Making Theory Relevant.” *Journal of Advertising* 45, 2 (2016): 269–73. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/04/06/royne-building-and-enhancing-the-advertising-discipline-through-replication/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/04/06/royne-building-and-enhancing-the-advertising-discipline-through-replication/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/sch-nbrodt-learn-to-p-hack-like-the-pros.md b/content/replication-hub/blog/sch-nbrodt-learn-to-p-hack-like-the-pros.md new file mode 100644 index 00000000000..67940af09ce --- /dev/null +++ b/content/replication-hub/blog/sch-nbrodt-learn-to-p-hack-like-the-pros.md @@ -0,0 +1,93 @@ +--- +title: "SCHÖNBRODT: Learn to p-Hack Like the Pros!" 
+
date: 2016-10-19
author: "The Replication Network"
tags:
  - "GUEST BLOGS"
  - "experiments"
  - "Felix Schoenbrodt"
  - "Humor"
  - "p-hacking"
  - "Shiny app"
draft: false
type: blog
---

###### (NOTE: This ironic blog post was originally published on ******)

##### **My Dear Fellow Scientists!**

###### *“If you torture the data long enough, it will confess.”*

###### This aphorism, attributed to ***[Ronald Coase](http://en.wikipedia.org/wiki/Ronald_Coase)***, has sometimes been used in a disrespectful manner, as if it were wrong to do creative data analysis.

###### In fact, the art of creative data analysis has suffered despicable attacks over the last few years. A small but annoyingly persistent group of second-stringers tries to denigrate our scientific achievements. They drag science through the mire.

###### These people propagate stupid method repetitions (also known as “direct replications”); and what was once one of the supreme disciplines of scientific investigation – a creative data analysis of a data set – has been reduced to conducting an empty-headed, step-by-step pre-registered analysis plan. (Come on: If I lay out the full analysis plan in a pre-registration, even an *undergrad* student can do the final analysis, right? Is that really the high-level scientific work we were trained so hard for?)

###### They broadcast at an annoying frequency that *p*-hacking leads to more significant results, and that researchers who use *p*-hacking have higher chances of getting things published.

###### What are the consequences of these findings? The answer is clear. Everybody should be equipped with these powerful tools of research enhancement!

##### **The Art of Creative Data Analysis**

###### Some researchers describe a performance-oriented data analysis as “data-dependent analysis”. We go one step further and call this technique ***data-optimal analysis (***[***DOA***](http://www.urbandictionary.com/define.php?term=DOA)***)***, as our goal is to produce the optimal, most significant outcome from a data set.

###### I developed an **online app that lets you practice creative data analysis and polish your *p*-values**. It’s primarily aimed at young researchers who do not have our level of expertise yet, but I guess even old hands might learn one or two new tricks! It’s called “The p-hacker” (please note that ‘hacker’ is meant in a very positive way here. You should think of the cool hackers who fight for world peace). You can use the app in teaching, or to practice *p*-hacking yourself.

###### Please test the app, and give me feedback! You can also send it to colleagues: ******

###### The full R code for this Shiny app is on ***[Github](https://github.com/nicebread/p-hacker)***.

##### **Train Your *p*-Hacking Skills: Introducing the *p*-Hacker App**

###### Here’s a quick walk-through of the app. Please also see the quick manual at the top of the app for more details.

###### First, you have to run an initial study in the “New study” tab:

###### *[Screenshot: the “New study” tab]*

###### Once you have run your first study, inspect the results in the middle pane. Let’s take a look at our results, which are quite promising:

###### *[Screenshot: the results in the middle pane]*

###### **After excluding this obvious outlier, your first study is already a success!** Click on “Save” next to your significant result to save the study to your study stack on the right panel:

###### *[Screenshot: the study stack on the right panel]*

###### Sometimes outlier exclusion is not enough to improve your result.

###### Now comes the magic.
Click on the “Now: p-hack!” tab – **this gives you all the great tools to improve your current study**. Here you can fully utilize your data-analytic skills and creativity.

###### In the following example, we could not get a significant result by outlier exclusion alone. But after adding 10 participants (in two batches of 5), controlling for age and gender, and focusing on the variable that worked best – voilà!

###### *[Screenshot: a significant result after p-hacking]*

###### Do you see how easy it is to craft a significant study?

###### Now it is important to **show even more productivity**: Go for the next conceptual replication (i.e., go back to Step 1 and collect a new sample, with a new manipulation and a new DV). Whenever your study reaches significance, click on the *Save* button next to each DV and the study is saved to your stack, awaiting some additional conceptual replications that show the robustness of the effect.

###### **Many journals require multiple studies**. Four to six studies should make a compelling case for your subtle, counterintuitive, and shocking effects:

###### *[Screenshot: the study stack with several saved studies]*

###### Honor to whom honor is due: Find the best outlet for your achievements!

###### My friends, let’s stand together and Make Science Great Again! I really hope that the *p*-hacker app can play its part in bringing science back to its old days of glory.

###### (A quick side note: Some so-called “data detectives” have developed several methods for detecting *p*-hacking. This is a cowardly attack on creative data analyses. If you want to take a look at these detection tools, check out the ***[p-checker app](http://www.shinyapps.org/apps/p-checker/)***. You can also transfer your *p*-hacked results to the p-checker app with a single click on the button ‘Send to p-checker’ below your study stack on the right side.)

###### Best regards,

###### *Ned Bicare, PhD*

###### PS: A similar app can be found on FiveThirtyEight: ***[Hack Your Way To Scientific Glory](http://projects.fivethirtyeight.com/p-hacking/)***

### Share this:

* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2016/10/19/schonbrodt-p-hacking-for-pros/?share=twitter)
* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2016/10/19/schonbrodt-p-hacking-for-pros/?share=facebook)

Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/scheel-when-null-results-beat-significant-results-or-why-nothing-may-be-truer-than-something.md b/content/replication-hub/blog/scheel-when-null-results-beat-significant-results-or-why-nothing-may-be-truer-than-something.md new file mode 100644 index 00000000000..6d026c80a81 --- /dev/null +++ b/content/replication-hub/blog/scheel-when-null-results-beat-significant-results-or-why-nothing-may-be-truer-than-something.md @@ -0,0 +1,75 @@ +---
+title: "SCHEEL: When Null Results Beat Significant Results OR Why Nothing May Be Truer Than Something"
+date: 2017-06-27
+author: "The Replication Network"
+tags:
+  - "GUEST BLOGS"
+  - "Anne Scheel"
+  - "Felix Schoenbrodt"
+  - "John Ioannidis"
+  - "null hypothesis significance testing"
+  - "null results"
+  - "PPV"
+  - "Shiny app"
+draft: false
+type: blog
+---

*[The following is an adaptation of (and in large parts identical to) a* **[*recent blog post*](http://www.the100.ci/2017/06/01/why-we-should-love-null-results/)** *by Anne Scheel that appeared on* **[*The 100% CI*](http://www.the100.ci/)**.]
+
###### Many, probably most empirical scientists use frequentist statistics to decide if a hypothesis should be rejected or accepted, in particular *null hypothesis significance testing* (NHST).

###### NHST works when we have access to all statistical tests that are being conducted. That way, we should at least in theory be able to see the 19 null results accompanying every statistical fluke (assuming an alpha level of 5%) and decide that effect X probably does not exist. But publication bias throws this off-kilter: When only or mainly significant results end up being published, whereas null results get p-hacked, file-drawered, or rejected, it becomes very difficult to tell false positive from true positive findings.

###### The number of true findings in the published literature depends on something significance tests can’t tell us: The base rate of true hypotheses we’re testing. If only a very small fraction of our hypotheses are true, we could always end up with more false positives than true positives (this is one of the main points of ***[Ioannidis’ seminal 2005 paper](http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124)***).

###### When ***[Felix Schönbrodt](https://twitter.com/nicebread303)*** and ***[Michael Zehetleitner](http://www.psy.lmu.de/exp/people/former/zehetleitner/index.html)*** released ***[this great Shiny app](http://shinyapps.org/apps/PPV/)*** a while ago, I remember having some vivid discussions with Felix about what the rate of true hypotheses in psychology may be. In his very nice ***[accompanying blog post](http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effec)***, Felix included a flowchart assuming that 30% of all tested hypotheses are true. At the time I found this grossly pessimistic: Surely our ability to develop hypotheses can’t be worse than a coin flip? We spent years studying our subject! We have theories! We are really smart! I honestly believed that the rate of true hypotheses we study should be at least 60%.

###### A few months ago, ***[this interesting paper](http://amstat.tandfonline.com/doi/pdf/10.1080/01621459.2016.1240079?needAccess=true)*** by Johnson, Payne, Wang, Asher, & Mandal came out. They re-analysed 73 effects from the ***[Reproducibility Project: Psychology](https://en.wikipedia.org/wiki/Reproducibility_Project)*** and tried to model publication bias. I have to admit that I’m not maths-savvy enough to understand their model and judge their method, but they estimate that over 700 hypothesis tests were run to produce these 73 significant results. They assume that the statistical power for tests of true hypotheses was 75%, and that 7% of the tested hypotheses were true. *Seven percent.*

###### Er, ok, so not 60% then. To be fair to my naive 2015 self: this number refers to *all* hypothesis tests that were conducted, including p-hacking. That includes the one ANOVA main effect, the other main effect, the interaction effect, the same three tests without outliers, the same six tests with age as covariate, … and so on.

![Table1_PPV-NPV-FDR-FOR_table](/replication-network-blog/table1_ppv-npv-fdr-for_table.webp)

###### Let’s see what these numbers mean for the rates of true and false findings. For this we will need the *positive predictive value* (PPV) and the *negative predictive value* (NPV). I tend to forget what exactly they and their two siblings, FDR and FOR, stand for and how they are calculated, so I added the table above as a cheat sheet.
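###### If you prefer code to a table, the following minimal Python sketch (just an illustration, not the app’s R code) computes PPV, NPV, FDR, and FOR from an assumed base rate of true hypotheses, an alpha level, and statistical power; the example plugs in the estimates just mentioned (7% true hypotheses, 75% power, alpha = 5%).

```python
# A simple illustration (not the app's R code): PPV, NPV, FDR, and FOR
# from the base rate of true hypotheses, the alpha level, and power.
def predictive_values(base_rate, alpha=0.05, power=0.75):
    true_pos = base_rate * power              # true effects that test significant
    false_neg = base_rate * (1 - power)       # true effects that are missed
    false_pos = (1 - base_rate) * alpha       # null effects that test significant
    true_neg = (1 - base_rate) * (1 - alpha)  # null effects that test non-significant
    ppv = true_pos / (true_pos + false_pos)   # P(effect is real | significant)
    npv = true_neg / (true_neg + false_neg)   # P(effect is null | non-significant)
    return {"PPV": ppv, "FDR": 1 - ppv, "NPV": npv, "FOR": 1 - npv}

# The estimates discussed above: 7% true hypotheses, 75% power, alpha = .05
print(predictive_values(0.07))  # PPV comes out near 0.53, NPV near 0.98
```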
+
###### Ok, now that we’ve got that out of the way, let’s stick the numbers estimated by Johnson et al. into a flowchart. You see that the positive predictive value is shockingly low: Of all significant results, only 53% are true. Wow. I must admit that even after reading Ioannidis (2005) several times, this hadn’t quite sunk in. If the 7% estimate is anywhere near the true rate, that basically means that we can flip a coin any time we see a significant result to estimate if it reflects a true effect.

###### But I want to draw your attention to the *negative* predictive value. The chance that a non-significant finding is true is 98%! Isn’t that amazing and heartening? In this scenario, null results are vastly more informative than significant results.

![Figure1_PPV_NPV](/replication-network-blog/figure1_ppv_npv.webp)

###### I know what you’re thinking: 7% is ridiculously low. Who knows what those statisticians put into their Club Mate when they calculated this? For those of you who are more like 2015 me and think psychologists are really smart, I plotted the PPV and NPV for different levels of power across the whole range of the true hypothesis rate, so you can pick your favourite one. I chose five levels of power: 21% (estimate for neuroscience by ***[Button et al., 2013](http://www.nature.com/nrn/journal/v14/n5/full/nrn3475.html)***), 75% (Johnson et al. estimate), 80% and 95% (common conventions), and 99% (upper bound of what we can reach).

![Figure2_PVplot](/replication-network-blog/figure2_pvplot.webp)

###### The not very pretty but adaptive (you can choose different values for alpha and power) code is available ***[here](https://github.com/amscheel/PPV-NPV-FDR-FOR_plot/blob/master/PPV_NPV_FDR_FOR_looped.R)***.

###### The plot shows two vertical dashed lines: The left one marks 7% true hypotheses, as estimated by Johnson et al. The right one marks the intersection of PPV and NPV for 75% power: This is the point at which significant results become more informative than negative results. That happens when more than 33% of the studied hypotheses are true. So if Johnson et al. are right, we would need to up our game from 7% of true hypotheses to a whopping 33% to get to a point where significant results are as informative as null results!

###### There are a few things to keep in mind: First, 7% true hypotheses and 75% power are simply estimates, based on data from one replication project. I can certainly imagine that this isn’t far from the truth in psychology, but even if the estimates are accurate, they will vary at least slightly across different fields and probably across time.

###### Second, we have to be clear about what “hypothesis” means in this context: It refers to *any* statistical test that is conducted. A researcher could have one “hypothesis” in mind, yet perform twenty different *hypothesis tests* on their data to test this hypothesis, all of which would count towards the denominator when calculating the rate of true hypotheses. I personally believe that the estimate by Johnson et al. is so low because psychologists tend to heavily exploit so-called “researcher degrees of freedom” and test many more hypotheses than they themselves are aware of.

###### Third, statistical power will vary from study to study, and the plot above shows that this affects our conclusions. It is also important to bear in mind that power refers to a specific effect size: A specific study has different levels of power for large, medium, and small effects.
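###### (If you would like to check the crossover point yourself without running the linked R script, the short Python sketch below, again just an illustration rather than the original code, recomputes where PPV and NPV meet for an alpha level of 5% and 75% power.)

```python
import numpy as np

# An illustration (not the original R script): find the base rate of true
# hypotheses at which PPV equals NPV, i.e. where significant results become
# as informative as non-significant ones, for alpha = .05 and power = .75.
alpha, power = 0.05, 0.75
rates = np.linspace(0.001, 0.999, 100_000)  # candidate base rates of true hypotheses
ppv = rates * power / (rates * power + (1 - rates) * alpha)
npv = (1 - rates) * (1 - alpha) / ((1 - rates) * (1 - alpha) + rates * (1 - power))
crossover = rates[np.argmin(np.abs(ppv - npv))]
print(f"PPV = NPV at a true-hypothesis rate of about {crossover:.0%}")  # roughly 33%
```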
+
+###### We can be fairly certain that most of our hypotheses are false (otherwise we would waste a lot of money by researching trivial questions). The exact percentage of true hypotheses remains unknown, but if there is something to the estimate of Johnson et al., the fact that an effect is significant doesn’t tell us much about whether or not it is real. Non-significant findings, on the other hand, likely are correct most of the time in this scenario – maybe even 98% of the time! Perhaps we should start to take them more seriously.
+
+###### *Anne Scheel is a PhD student in psychology at Ludwig-Maximilians-Universität, Munich (LMU). She is co-moderator of the Twitter site **[@realsci\_DE](https://twitter.com/realsci_DE)** and co-blogger at **[The 100% CI](http://www.the100.ci/)**. She can be contacted at anne.scheel@psy.lmu.de.*
+
+###### **References**
+
+###### Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. *Nature Reviews Neuroscience, 14*, 365–376.
+
+###### Ioannidis, J. P. A. (2005). Why most published research findings are false. *PLOS Medicine, 2*(8), e124. doi: 10.1371/journal.pmed.0020124
+
+###### Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. (2017). On the reproducibility of psychological science. *Journal of the American Statistical Association, 112*(517), 1–10. doi: 10.1080/01621459.2016.1240079
+
+###### Open Science Collaboration (2015). Estimating the reproducibility of psychological science. *Science, 349*(6251), aac4716.
\ No newline at end of file diff --git a/content/replication-hub/blog/simchi-levi-behavioral-science-s-credibility-is-at-risk-replication-studies-can-help.md b/content/replication-hub/blog/simchi-levi-behavioral-science-s-credibility-is-at-risk-replication-studies-can-help.md new file mode 100644 index 00000000000..c2a368dc8dc --- /dev/null +++ b/content/replication-hub/blog/simchi-levi-behavioral-science-s-credibility-is-at-risk-replication-studies-can-help.md @@ -0,0 +1,266 @@ +--- +title: "SIMCHI-LEVI: Behavioral Science’s Credibility Is At Risk. Replication Studies Can Help" +date: 2023-11-19 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "COLIN CAMERER" + - "INFORMS" + - "Laboratory Experiments" + - "Management Science" + - "replications" + - "Research Validity" + - "Yan Chen" +draft: false +type: blog +--- + +*NOTE: This blog is a repost of one originally published at the INFORMS blog ([**click here**](https://www.informs.org/Blogs/ManSci-Blogs/From-the-Editor/Behavioral-science-s-credibility-is-at-risk.-Replication-studies-can-help?fbclid=IwAR3OTh9PUEWcZmcqkHy8XsKlNOOAxadXyoRqfV8_yZeHNS-Foy_FkvKY9b))*. *We thank David Simchi-Levi for permission to repost.* + +Several scientific disciplines have been conducting replication initiatives to investigate the reliability of published research results. 
Replication studies are particularly important in social sciences for creating and advancing the state of knowledge. Transparency plans such as data disclosure policies and pre-registration have undoubtedly improved the standards for future studies. Replication studies help improve the level of confidence in previously-published results. + +In “A Replication Study of Operations Management Experiments in Management Science” (Andrew M. Davis, Blair Flicker, Kyle Hyndman, Elena Katok, Samantha Keppler, Stephen Leider, Xiaoyang Long and Jordan D. Tong, Management Science), the authors investigate the replicability of a subset of laboratory experiments in operations management published in *Management Science* in 2020 and earlier to better understand the robustness, or limits, of the previously published results. + +The research team, which included eight scholars from five research universities, carefully selected ten influential experimental papers in operations management based on survey-collected preferences of the operations management community more broadly. These papers covered five key OM topics: inventory management, supply chain contracts, queuing, forecasting, and sourcing. To control for subject-pool effects, the team conducted independent replications for each paper at two different research laboratories. The goal was to establish robust and reliable results with a statistical power of 90% at a 5% significance level. + +The research team categorized replication outcomes based on p-values. When a paper’s primary hypothesis was replicated at both research sites, the team labeled the outcome as a “full” replication. A “partial” replication occurred when the primary hypothesis was replicated at one of the sites. And finally, when the primary hypothesis failed to replicate at either site, the authors labeled the outcome as a “no” replication. + +Among the ten papers included in this study, six achieved full replication, two achieved partial replication, and two did not replicate. While it is encouraging to observe full replications, there is value in understanding why papers did not fully replicate. Non-replications provided interesting insights that contributed to our understanding of the research topics as well as methods. + +An additional facet of the study was a survey conducted by the authors to collect predictions about the likelihood of replication of the results in the papers in the study, from operations management researchers in general and specifically from OM researchers engaged in behavioral work. Interestingly, for both respondent groups, higher predictions of replication were positively associated with replication success. The team found that researchers in the behavioral operations management field demonstrated more optimism about the likelihood of replication; however, overall prediction accuracy did not significantly differ between the behavioral and non-behavioral communities within operations. + +This study has at least three important implications for the operations management community. First, it establishes a level of certainty about the validity, and in some cases limitations, of some of the most prominent laboratory experimental results in our field. Behavioral researchers, and those who draw on behavioral insights in analytical or empirical work, can leverage the study findings as a source of confidence about the results tested. 
+
+Second, the study contributes interesting insights about the transferability of findings between in-person and online data collection methods. Due to COVID-19, the authors ran online versions of some of the original experiments and saw that, in some cases, these did indeed replicate despite the substantive differences. The COVID-19 pandemic more broadly caused an increase in online modes of data collection and remote interactions with participants, and therefore it became important to have findings that indicate when and why results can replicate across modalities and when they do not. + +Third, the study initiates what we hope becomes a new norm in behavioral operations management, for which this paper can serve as a foundation. We hope that similar operations management replication projects will become prevalent in the future. This should provide incentives for researchers to carefully document their methods in a way that will make it possible for others to replicate their results. Additionally, a stronger norm of independently conducted replications can help disincentivize researchers from engaging in fraud as well as identify existing fraudulent data. + +There are several trade-offs that need to be considered in replication studies. There have now been several replication projects in psychology, economics, and operations. Each project has taken a different approach in terms of the paper selection method, author team size and structure, the number of sites per paper, and replication type. + +Some projects adopt a mechanical approach, replicating papers from specific journals within a defined timeframe, while others select a list of papers from various journals based on an open nomination process. The replication study in the Management Science paper employed a hybrid approach, combining mechanical and selective inclusion. This allowed the research team to create a representative list of papers while also incorporating community feedback. + +Replicating more papers is desirable, but it comes with increased costs and coordination challenges. Projects that replicate a larger number of papers tend to have larger, decentralized teams. In contrast, the current project formed a centralized eight-author team to ensure process consistency and coordination. The team communicated regularly and made decisions jointly, leveraging the expertise of each team member. They also communicated with the authors of original papers throughout the replication process. For all selected papers, the original study authors shared their preferences about the hypotheses to study, shared and/or provided feedback on the materials used (software, instructions…) and the analysis, and were also given an opportunity to write responses to the replication reports. + +Replication studies can choose to replicate each paper at one site or multiple sites. Replicating at multiple sites helps mitigate idiosyncratic differences among labs and provides a clearer understanding of the generalizability of results. Therefore, this research team conducted a replication study for each paper at two sites. + +Replication studies can opt for either exact replications, following the original protocol, or close replications, introducing some material differences while documenting them explicitly. The choice of replication type depends on the goals of the study and the interpretation of the results. Although the initial intention of the research team was to conduct exact replications, this was not always possible. 
All deviations from original protocols, intentional as well as unintentional, are clearly documented. + +In conclusion, this replication study in operations management provides valuable insights into the validity and transferability of experimental results. It enhances the confidence of researchers in the field and contributes to the ongoing replication movement. The research findings shed light on the intricacies of designing and conducting replication studies, addressing the trade-offs and compromises inherent in such endeavors. + +We envision this study as a catalyst for future replication projects in operations management and related fields. By embracing replication and transparency, we can ensure the robustness and reliability of our research, ultimately advancing the knowledge and understanding of operations management principles and practices. + +**References** + +Andrew M. Davis, Blair Flicker, Kyle Hyndman, Elena Katok, Samantha Keppler, Stephen Leider, Xiaoyang Long and Jordan D. Tong, A Replication Study of Operations Management Experiments in Management Science. Management Science 2023 69:9, XX-XX. + +For more perspective on this research, the Editor-in-Chief, **David Simchi-Levi**, asked two experts, **Professor Colin F. Camerer** (California Institute of Technology) and **Professor Yan Chen** (University of Michigan). Their comments are given below. + +**Colin F. Camerer (California Institute of Technology)** + +**—————————-** + +**Title:** Comment on “A Replication Study of Operations Management Experiments in *Management Science*” +[**camerer@caltech.edu**](mailto:camerer@caltech.edu) + +This comment cheerleads and historicizes the superb, action-packed paper by Davis et al (2023) replicating a carefully chosen selection of seminal behavioral operations experiments. + +Almost twenty years ago, a series of seismic events began in experimental medical and social sciences, which led to a Replication Reboot in experimental social science that is still ongoing. It should continue until all the Rebooted open science practices are routine. + +The first event was that Ioannidis (2005) promised to explain, as advertised in the hyperbolic title of his paper “Why Most Published Research Findings Are False”, referring to practices in medical trials. + +In 2011 there were two more important events in experimental psychology: First, Bem (2011) published a paper in a highly-selective social psychology journal (JPSP) presenting evidence of “pre-cognitive” extrasensory perception. It was widely ridiculed—and the general claim was surely wrong (as some replication failures suggested). The JPSP defended their publication on the basis of being open-minded[**[1]**](https://www.informs.org/Blogs/ManSci-Blogs/From-the-Editor/Behavioral-science-s-credibility-is-at-risk.-Replication-studies-can-help?fbclid=IwAR3OTh9PUEWcZmcqkHy8XsKlNOOAxadXyoRqfV8_yZeHNS-Foy_FkvKY9b#_ftn1) and deferring to positive referee judgments. The Bem finding catalyzed concern that in some areas of experimental psychology, results that were “too cute to be true” were being published routinely. Many priming studies also failed to replicate around this time. + +The second 2011 event was that Simmons et al (2011) demonstrated the idea of p-hacking—choosing specifications and subsamples iteratively until a preferred hypothesis appeared significant at a professionally-endorsed level (rigidly p<.05). 
(Leamer, 1978, also warned about the dangers of undisciplined “specification searches” in econometrics; his warning was often cited but rarely followed.) + +Simmons, and colleagues Simonsohn and Nelson, followed up their p-hacking bombshell with a series of papers clarifying their ideas and presenting tools (such as “p-curve”) to provide bounds on how likely it is that a set of studies has been exaggerated by p-hacking. By 2019 the term “p-hacking” was part of a question on the game show “Jeopardy!”. + +A year later, in 2012, the research director at the Arnold Foundation, Stuart Buck, and the Arnolds themselves, became curious about whether policy evidence was really flimsy (based on highly-publicized replication failures of priming studies). Buck had seen writing from Brian Nosek about open science and he and the Arnolds communicated. The Arnold Foundation ended up investing $60 million in Open Science (see Buck 2013). + +Shortly thereafter, the first multiple-study “meta-replications” appeared (Klein, 2014, a/k/a ManyLabs1; Open Science Collaboration, 2015 a/k/a RPP). The latter was the first output of the Center for Open Science, created by Brian Nosek and Jeffrey Spies in 2013 and funded generously by the Arnold Foundation. + +Those studies provided a sturdy foundation for how to do a set of replications and what can be learned. Then came two large-scale team efforts in replicating experimental economics, and then replicating general social science experiments published in high-impact Science and Nature (Camerer et al 2016, 2018). Our efforts followed the pioneering early ManyLabs and RPP work closely, except that we added prediction markets and surveys (following Dreber et al 2015). All of us are thrilled that these meta-replications have continued in different areas of experimental social science. + +These historical facts help judge how quickly science—in this case, certain areas of experimental social science—is “self-correcting”. From the dramatic events of 2011 to the earliest meta-replications was only about four years. Generous private funding from the Arnold Foundation (people passionate about openness and quality of science) surely accelerated that timeline, perhaps by a factor of 2-3x. + +Table 5 in this Davis et al (2023) article summarizing meta-replications shows that the first seven were not only conducted, but were *published* in a four-year span (2014-2018). That is amazingly fast (considering how much work goes into carefully documenting all aspects of the details and supplemental reporting, and the editorial grind). And it was done with no special centralized help from any professional organization, university, or federal funding agency. The moving forces were personal scientific contributions, Nosek and Spies creating OSF, quick writing of checks with many zeros by the Arnold Foundation, and emergence of nudges and funding from the Sloan Foundation (for Camerer et al studies, including Swedish grants to Anna Dreber and Magnus Johannesson) and then Simchi-Levi’s (2019) editorial. + +About ten years after the 2011 events, Andrew M. Davis, Blair Flicker—yes, I am spelling out all their names, not abbreviating—Kyle Hyndman, Elena Katok, Samantha Keppler, Stephen Leider, Xiaoyang Long, and Jordan D. Tong (2023) took up Simchi-Levi’s call to action. They carefully replicated ten behavioral operations experiments from studies published in *Management Science*. + +Everyone involved deserves community thanks for putting in enormous amounts of work to produce evidence. 
This activity is less glamorous and career-promoting than regular kinds of innovative research. “Everyone” also includes peer scientists who voted on which studies to replicate, and made judgments and replicability predictions in surveys. The collective effort would not be as persuasive and insightful without those peer contributions. The authors’ institutions also contributed money and resources. All this collective effort is trailblazing in extending similar meta-replication methods to a new domain. This effort also helps solidify best practices in a pipeline protocol many others can follow. + +A cool feature of their approach is that peer scientists voted on which papers to replicate. The authors winnowed experimental papers down to 24, with around five in each of five substantive categories. Peer voting chose two from each category, for a total of 10 replications. + +**Basic results and one post-mortem** + +The basic results are simple to summarize: The quality of replicability varied but was generally high. A strong feature of their design is that each original finding was replicated twice, not once. Only two studies appeared to fail to replicate substantially in both replications. (Readers who aren’t interested in the nuanced details of this post-mortem should skip ahead to “Peer scientist replicability predictions are informative”.) + +Let’s dig into details, as in a post-mortem, of one of the two apparent failures. It was an older paper by Ho[**[2]**](https://www.informs.org/Blogs/ManSci-Blogs/From-the-Editor/Behavioral-science-s-credibility-is-at-risk.-Replication-studies-can-help?fbclid=IwAR3OTh9PUEWcZmcqkHy8XsKlNOOAxadXyoRqfV8_yZeHNS-Foy_FkvKY9b#_ftn2) and Zhang (2008) about manufacturer-retail channel bargaining. It is well-known that linear pricing is inefficient because it doesn’t maximize total profit. Charging a fixed fee can fully restore efficiency (a/k/a “two-part tariff”). Ho and Zhang did the first experimental test of this theory. They were also curious about the behavioral hypothesis that retailers might encode paying a fixed fee (TPT) as a loss—if it is “mentally accounted” for as being decoupled from revenue—and balk at paying. However, if the fee is equivalently framed as a “quantity discount (QD)”, they wondered, that might reduce the loss-aversion. Their behavioral hypothesis was therefore that QD framing might be more empirically efficient than TPT. + +As Ho and Zhang note in their Author Response (included in Davis et al’s supplementary material), four statistics go into the computation of overall efficiency. Three of the four statistics—wholesale and (conditional) retail prices, and the fixed fee—were strongly significantly different in the TPT and QD treatments in both the original study and the replication. Prices were lower and the fee was higher in QD. However, the difference in acceptance rates of the contracts between the treatments goes in opposite directions in the original study and the replication, and efficiency does too. + +A restriction which Davis et al used, as many others have (including our Camerer et al 2016, 2018 studies), is that only one hypothesis is tested. This is reasonable as otherwise it requires more complicated power calculations across multiple hypotheses. The efficiency prediction is what was singled out and it plainly did not replicate in magnitude or even direction. Davis et al and Ho and Zhang’s parallel discussions are in agreement about this. + +A difference in protocol is that Ho and Zhang’s experiment was done on paper. 
Subjects had to make calculations by hand and their calculations were not checked. In the replication protocol, software was used, calculations were checked, and errors had to be corrected before subjects could proceed. The replication team noted that some subjects found this frustrating. The team also noted (as quoted in the Author Response): + +*“This calculation involves X + Y/(10-P) which is often not an integer. To do this calculation correctly also requires understanding order of operations, which to my great surprise it turns out that many people don’t understand. So some tried to calculate it as (X+Y)/(10-P). Even after calculating Price A correctly, they often ended up with a fraction and then had to multiply a fraction to calculate profit.”* + +The protocol difference and this comment make an important point: There will always be some unplanned deviations…***always***. The question is how well they might be anticipated and what they tell us. + +This example reminds us that every experiment tests a *joint hypothesis* based on a long chain of behavioral elements. The chain is typically: Credibility of expected money or other reward; basic comprehension of the instructed rules; ability to make necessary calculations, whether explicitly (in the computer) or implicitly (on paper or in the brain); beliefs about what other subjects might do; a preference parameter such as the degree of loss-aversion (conditioned on a reference point); how feedback generates learning; and so on. + +A good experimental design tries hard to create strong internal validity about design assumptions (such as payment credibility) and to allow thoughtful behavioral variation to identify behavioral parameters or treatment effects. (For example, Ho and Zhang recover estimates of a loss-aversion parameter, λ = 1.37 and 1.27 in TPT and QD.) + +Sometimes “ability to make necessary calculations” is an element the experimenter wants to clamp and restrict to be true (e.g., providing a calculator) and other times it is an interesting source of behavioral variation. In this case, that “ability…” seems to have been sensitive to the experimental interface. This is a nuisance (unless, as in UI experiments, the nature of the interface is the subject of study—it wasn’t here). + +This example is also a good opportunity to think about what scientists should do next in the face of a partially-failed replication. In a very few cases, a failure to replicate puts an idea or a group of scientists into a probationary period. For example, I do not think priming experiments, many of which replicated so poorly, will ever recover. + +In most cases, however, a partially-failed replication just means that a phenomenon is likely to be empirically sensitive to a wider range of variables than might have been anticipated. This sensitivity should often *spur* more new investigation, not less. + +In the case of nonlinear channel pricing, the empirical question is so interesting that mixed results of the original experiment and the replication should certainly provoke more research. As noted above, the core behavioral difference is the retailer acceptance rate of fixed-fee contracts under the TPT and QD frames. What variables cause this difference? Given these results, this question is even more interesting than before the replication results. 
+
+**Why is behavioral operations replication so solid?** + +The general quality of replication effects is about as good as has been seen in any of the many meta-replications in somewhat different social sciences. In the replications of Davis et al and Ozer et al (a, b) the replication effect sizes are smack on the original ones (as plotted in Figure 2). Other replication effect sizes are smaller than original effects, but that is the modal result. The relative effect size of replication compared to original data seems to be reliably around .60. Replicated effects usually go in the direction shown originally, but are smaller. There is some great work to be done in figuring out exactly why this replication effect shrinkage is so regular. + +In my experience looking across social sciences, replication is most precarious when a dependent variable and independent variable are measured (or experimentally manipulated) with error variance, and especially when the focus of theory is an interaction. In experimental economics, theory usually produces predictions about “main effects”. + +Many of the behavioral operations results are also about main effects—there is a single treatment variable and a plausible intuition about the direction of its effect, often competing with another intuition that there should be no effect at all. The “bullwhip effect” in supply chains is a great illustrative example. There is a clear concept of why no such effect should occur (normatively) and why it might and seems to (behaviorally). (This example is a great gift to behavioral economics—thanks, operations researchers!—because it spotlights a kind of behavioral consequence which seems to be special to operations but might be evident in other cases once the operations example is in mind.) There is a strong intuition that providing more information could make a difference, but it is difficult to know whether that’s true without putting people in that situation and seeing what they do. + +This formula is what made most of the earliest behavioral economics demonstrations so effective. Typically we were pitting one conventionally-believed convenient assumption—e.g., people discount exponentially, players in games are in equilibrium, people are selfish, people see through framing differences—against a plausible alternative. + +Altmejd et al (2019) used LASSO methods to create a long list of features of both original studies and replication samples, and then to predict, from about 150 actual replications, which features of studies predicted replicability best (including prediction market forecasts). The LASSO “shrinks” features with weak predictive power to have zero coefficients (to limit false positives). + +One of the important findings is that replication is less likely when the hypothesized effect is an *interaction* term rather than a main effect. Interactions are hard to detect when there is measurement error in both of two variables which are hypothesized to interact. Happily, behavioral operations seems to be in a start-up period in which main effects are still quite interesting. Interactions are part of human nature too, but identifying them requires larger high-quality samples and science which can confidently point to some plausible interactions and ignore others. + +**Publishing replications: Many >> 1** + +People often say that pure replication of a single previous study is not professionally rewarding (it can even make enemies) and is hard to publish. I think that’s generally true. 
Replicating one study raises a question that is difficult for the replicator to answer: Why did you choose *this* *study* to spend your time replicating? A single-study replication also cannot answer the most fundamental questions the profession would like to know about—such as “Does replicability failure cast possible fault on an author team or on a chosen topic that is hard to replicate?” + +What we have learned in general science from the RPP and ManyLabs replications, and efforts like our own and this one in *Management Science*, is that meta-replicating a *group* of studies is a whole different enterprise. It is more useful and persuasive. It renders the question “Why replicate this particular study?” irrelevant and clearly answers the question “What does this tell us in general about studies on this topic?”. It is especially persuasive using the method the authors did here, which was to survey peer scientists about what they would most like to see replicated. + +Furthermore, in our modern experience (post-2010) some funders—not necessarily all, but enough to supply enough resources—and many journals, are actually receptive to publishing these meta-replications. All sensible editors are open to publishing replication—they realize how important it is—but they would rather not have to adjudicate how to decide which of many one-study replications to publish. Editor Simchi-Levi’s editorial (2019) is an important demonstration of an active call from an editor of a prestigious journal. + +**Peer scientist replicability predictions are informative** + +In Davis et al (2023) peer scientist predictions of replicability were substantially correlated with actual replicability. There is also a small ingroup-optimism bias, in which peers most expert in behavioral operations predicted higher replication rates than non-experts, by about .08. However, both groups were about equally accurate when predictions are matched to actual replication outcomes. This is an important finding because it implies that in gathering accurate peer predictions, the peer group can be defined rather broadly. It does not have to be scientists closely expert in the style or scope of the experiments. + +Peer accuracy in predicting replicability raises a fundamental question about the nature of peer review and may suggest a simple improvement. + +Suppose that referees are, to some extent, including guessed p(replication) as an input to their judgment of whether a paper should be published. And suppose there are observable features Xp of a refereed paper p—sample size, p-values, whether tested hypotheses are interactions, etc.—which referees can easily see, and which can also be seen in the post-refereeing process after papers are published. + +Simple principles of rational forecasting imply that even if the features Xp are associated with replicability in the unconditional sample of papers reviewed, those features should *not* be associated with actual replicability of papers that are accepted, because referee selection will largely erase their predictive power. + +Think of the simplest example of sample size N. Suppose a referee knows that there is a function p(replication|N) which is increasing in N. If she rejects papers with perceived low replicability, according to that function, then the set of accepted papers will not have replicability which is closely associated with N. The association was erased by referee selection. 
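+
+To make the selection logic concrete, here is a small illustrative Monte Carlo sketch (my own toy example, not from Davis et al or Camerer: the sample sizes, the logistic link, and the acceptance threshold are all made up). Papers vary in sample size N, replicability rises with N, and referees accept only papers whose implied replication probability clears a threshold; the N-replicability association among accepted papers then comes out much weaker than in the full pool. + +```python
+# Toy simulation: referee selection attenuates the association between
+# sample size N and replication outcomes among accepted papers.
+# Requires Python 3.10+ for statistics.correlation.
+import math
+import random
+import statistics
+
+random.seed(1)
+
+def p_replicate(n):
+    # Replication probability increasing in sample size N (arbitrary logistic form)
+    return 1 / (1 + math.exp(-(n - 100) / 40))
+
+pool = [random.randint(20, 300) for _ in range(20000)]    # submitted papers
+accepted = [n for n in pool if p_replicate(n) > 0.6]      # referee threshold rule
+
+def corr_with_outcome(sample_sizes):
+    outcomes = [1 if random.random() < p_replicate(n) else 0 for n in sample_sizes]
+    return statistics.correlation(sample_sizes, outcomes)
+
+print("corr(N, replicated), full pool:", round(corr_with_outcome(pool), 2))
+print("corr(N, replicated), accepted: ", round(corr_with_outcome(accepted), 2))
+```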
+
+However, it is a strong and reliable fact that actual replicability, based on observable characteristics of published papers, is somewhat predictable from peer judgments (surveys or more sophisticated prediction markets) and from simple observables like p-values and sample sizes. Altmejd et al (2019) found binary classification accuracy around 70% (where 50% is random and 100% is perfect). Most other studies, including Davis et al, also found a positive correlation between replication predictions and actual replication. In a working paper studying 36 PNAS experimental papers, Holzmeister et al (in prep) find a very strong association between prediction-market predictions and actual replication. Of the 36 experiments published recently in PNAS, ten of the bottom 12 (judged as least likely to replicate by prediction markets) did not replicate at all. + +There is no way around a hard truth here: In different fields, implicit referee and editorial decisions are choosing to publish studies for which replicability failures are substantially predictable, based on data the referees and editors had. The predictability increment is also not a small effect; it is substantial. + +What can we do that is better? A few years ago, I proposed to a private foundation a pilot test where some submitted papers would be randomly “audited” by gathering peer predictions and then actually replicating the paper while it was being reviewed. The foundation liked the idea but was not willing to pay for a pilot unless we figured out who would pay for replications and investigator time if such a system were made permanent. We did not do the pilot. It is still worth a try. + +**What can we do better moving forward?** + +These results show that the quality of behavioral operations experiments, in terms of choice of theories, methods and sampling designed to create replicability, is solid. No area of social science has shown uniformly better replicability and some areas of psychology are clearly worse. + +There is much to admire in Davis et al’s carefully-worded conclusion. I’ll mostly just quote and endorse some of their advice. + +Suppose you are an active researcher, perhaps newer and not too burdened by old pre-Open Science bad habits. What should you be sure to do, to ensure your results are *easy* to replicate and *likely* to replicate? + +**Design for a generic subject pool.** Whatever your starting subject pool is, somebody somewhere might want to replicate your experiment as exactly as possible, including written or oral instructions and user interface. It could be in a foreign country (so you want instructions to be faithfully translatable), with higher or lower literacy groups, with younger students or aging seniors, etc. You should want replication to be easy and to design for likely robustness. + +I was taught by the great experimental economist Charlie Plott. Charlie obsessed over writing clear, plain instructions. For example, he avoided the term “probability”; instead, he would always describe a relative frequency of chance events that subjects could actually see. In many early experiments we also used his trick—a bingo cage of numbered balls which generated relative frequencies. + +The simplicity of his instructions is now even more important as something to strive for, because online experiments, and the ability to do experiments in many countries and with people of many ages, require the most culturally- and age-robust instructions. 
+
+**Design for an online future**: This advice is, of course, closely related to the above. Davis et al were harshly confronted with this because, during the pandemic shutdown of many in-lab experiments, they had to do a lot more online, even when the original experiments being replicated had not originally been online. + +“Design for an open science future” is another desirable goal. In Camerer et al (2016) we had to entirely recreate software that was obsolete just to do one study. The Davis et al replication project also created some *de novo* software. + +Software degrades. If you create or use software for your experiments, the more customized, local, and poorly supported your software is, the more likely it will not be usable \_\_\_ years later (fill in the blank, 2, 3, 5). + +Let’s end on a positive note. The Reproducibility Reboot is well underway, has been successful, and has been embraced by funders, journal editors, and energetic scientists willing to spend their valuable time on this enterprise. Korbmacher et al (2023) is a superb and thorough recent review. There is much to celebrate. We are like a patient going to the doctor, hoping for the best, but hearing that we need to put in more walking steps and eat healthier for ourselves and our science to live longer. In five or twenty years, all of us will look back with some pride at the improvement in open science methods going on right now, of which this terrific behavioral operations project is an example. + +[**[1]**](https://www.informs.org/Blogs/ManSci-Blogs/From-the-Editor/Behavioral-science-s-credibility-is-at-risk.-Replication-studies-can-help?fbclid=IwAR3OTh9PUEWcZmcqkHy8XsKlNOOAxadXyoRqfV8_yZeHNS-Foy_FkvKY9b#_ftnref1) As the saying goes, **“Keep your minds open—but not so open that your brains fall out.”** + +[**[2]**](https://www.informs.org/Blogs/ManSci-Blogs/From-the-Editor/Behavioral-science-s-credibility-is-at-risk.-Replication-studies-can-help?fbclid=IwAR3OTh9PUEWcZmcqkHy8XsKlNOOAxadXyoRqfV8_yZeHNS-Foy_FkvKY9b#_ftnref2) Disclosure: Teck-Hua Ho was a PhD student at Wharton when I was on the faculty there and we later wrote many papers together. Juan-juan Zhang was his advisee at Berkeley. + +**References** + +Altmejd A, Dreber A, Forsell E, Huber J, Imai T, et al. (2019) Predicting the replicability of social science lab experiments. PLOS ONE 14(12): e0225826. **[*https://doi.org/10.1371/journal.pone.0225826*](https://doi.org/10.1371/journal.pone.0225826)** + +Buck, Stuart. Metascience since 2012: A personal history. [***https://forum.effectivealtruism.org/posts/zyrfPkX6apFoDCQsh/metascience-since-2012-a-personal-history***](https://forum.effectivealtruism.org/posts/zyrfPkX6apFoDCQsh/metascience-since-2012-a-personal-history) + +Camerer CF, Dreber A, Forsell E, Ho TH, Huber J, Johannesson M, Kirchler M, et al. (2016) Evaluating replicability of laboratory experiments in economics. Science 351(6280):1433–1436. + +Camerer CF, Dreber A, Holzmeister F, Ho T-H, Huber J, Johannesson M, Kirchler M, et al. (2018) Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Hum. Behav. 2:637–644. + +Dreber A, Pfeiffer T, Almenberg J, Isaksson S, Wilson B, Chen Y, Nosek BA, et al. (2015) Using prediction markets to estimate the reproducibility of scientific research. Proc. Natl. Acad. Sci. USA 112(50):15343–15347. + +Klein R, Ratliff K, Vianello M, Adams R Jr, Bahník S, Bernstein M, Bocian K, et al. 
(2014) Data from investigating variation in replicability: A “many labs” replication project. J. Open Psych. Data 2(1):e4. + +Korbmacher, M., Azevedo, F., Pennington, C.R. *et al.* The replication crisis has led to positive structural, procedural, and community changes. *Commun Psychol* **1**, 3 (2023). + +Leamer, Edward. 1978. *Specification Searches: Ad Hoc Inference with Nonexperimental Data*. Wiley. + +Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716. + +Simchi-Levi D (2019) Management science: From the editor, January 2020. Management Sci. 66(1):1–4. + +Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. + +**——————————————** + +**Yan Chen (University of Michigan)** + +**Title:** The Emerging Science of Replication + +[**yanchen@umich.edu**](mailto:yanchen@umich.edu) + +**Background.** Replications in the social and management sciences are essential for ensuring the reliability and progress of scientific knowledge. By replicating a study, researchers can assess whether the original results are consistent and trustworthy. This is crucial for building a robust body of knowledge. Furthermore, replications can help identify the conditions under which a phenomenon does or does not occur, thus honing theories and rendering them more precise and practical in real-world contexts. If a finding is consistently replicated, it becomes more likely to be accepted as a reliable part of the scientific understanding. + +Over the past decade, the issue of empirical replicability in social science research has received a great deal of attention. Prominent examples include the Reproducibility Project: Psychology (Open Science Collaboration 2015), the Experimental Economics Replication Project (Camerer et al. 2016), and the Social Sciences Replication Project (Camerer et al. 2018). In particular, this discussion has focused on the degree of success in replicating laboratory experiments, the interpretation when a study fails to be replicated (Gilbert et al. 2016), the development of recommendations on how to approach a replication study (Shrout and Rodgers 2018), and best practices in replication (Chen et al. 2021). + +This is an exciting time when the scientific community is converging on a set of principles for the emerging science of replication. Davis *et al.* (forthcoming) make significant contributions towards our shared understanding of these principles. + +**Summary.** Davis et al. (forthcoming) present the first large-scale replication study of experimental operations management papers published in Management Science prior to 2020. Using a two-stage paper selection process including inputs from the research community, the authors identify ten prominent experimental operations management papers published in this journal. For each paper, they conduct a high-powered (90% power to detect the original effect size at the 5% significance level) replication study of the main results across at least two locations using original materials. Due to lab closures during the COVID-19 pandemic, this study also tests replicability in multiple modalities (in-person and online). Of these ten papers, six achieve full replication (at both sites), two achieve partial replication (at one site), and two do not replicate. 
In the discussion section, the authors share their insights on the tradeoffs and compromises in designing and implementing an ambitious replication study, which contribute to the emerging science of replication. + +**Comments.** Compared to prior replication studies, this study has several distinct characteristics. First, this study contains independent replications across two locations, whereas the three prominent earlier replication studies contain one location for each original study (Open Science Collaboration 2015, Camerer et al. 2016, 2018). A multi-site replication helps determine whether a set of findings holds across diverse subject pools and cultural contexts, and enhances the robustness of the findings. Second, this study tests replicability in multiple modalities (in-person and online), whereas most previous replication projects contain one modality. While replicating individual choice experiments online was unexpected due to lab closures during the pandemic, it does teach us that it is important to “*design for an online future*.” + +For researchers conducting lab and field experiments, in addition to designing for an online future, we should “*design for a generic subject pool*.” How might researchers of original experiments design for a generic subject pool? In lab experiments, we could aim at collecting data from distinct subject pools. For example, in the classic centipede game experiment (McKelvey and Palfrey 1992), researchers recruited subjects from Caltech and Pasadena City College, two subject pools with different average analytical abilities. In a recent field experiment on a ride-sharing platform (Ye *et al.* 2022), researchers conducted the same field experiment across three cities of distinct size, competitiveness and culture. These multi-site designs help define the boundaries of each finding. + +While this and previous replication studies use classical null hypothesis significance testing to evaluate replication results, we note that several studies have proposed that replications take a Bayesian approach. Verhagen and Wagenmakers (2014) explore various Bayesian tests and demonstrate how previous studies and replications alter established knowledge through the Bayesian approach. When addressing how the field should consider failed replications, Earp and Trafimow (2015) consider the Bayesian approach and how failed replications affect the confidence in original findings. + +McShane and Gal (2016) propose “a more holistic and integrative view of evidence that includes consideration of prior and related evidence, the type of problem being evaluated, the quality of the data, the effect size, and other considerations.” They warn that null hypothesis significance testing leads to dichotomous rather than continuous interpretations of evidence. Future replications might consider both the classical and Bayesian approaches to evaluate replication results. + +**References** + +**Camerer, Colin F., Anna Dreber, Eskil Forsell, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Johan Almenberg, Adam Altmejd, Taizan Chan, Emma Heikensten, Felix Holzmeister, Taisuke Imai, Siri Isaksson, Gideon Nave, Thomas Pfeiffer, Michael Razen, and Hang Wu**, “Evaluating replicability of laboratory experiments in economics,” *Science*, 2016, *351*(6280), 1433–1436. + +**Camerer, Colin F., Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A. 
Nosek, Thomas Pfeiffer, Adam Altmejd, Nick Buttrick, Taizan Chan, Yiling Chen, Eskil Forsell, Anup Gampa, Emma Heikensten, Lily Hummer, Taisuke Imai, Siri Isaksson, Dylan Manfredi, Julia Rose, Eric-Jan Wagenmakers, and Hang Wu**, “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015,” *Nature Human Behaviour*, 2018, *2*, 637–644. + +**Chen, Roy, Yan Chen, and Yohanes E. Riyanto**, “Best practices in replication: a case study of common information in coordination games,” *Experimental Economics*, 2021, *24*, 2–30. + +**Davis, Andrew M., Blair Flicker, Kyle Hyndman, Elena Katok, Samantha Keppler, Stephen Leider, Xiaoyang Long, and Jordan D. Tong**, “A Replication Study of Operations Management Experiments in Management Science,” *Management Science*, forthcoming. + +**Earp, Brian D. and David Trafimow**, “Replication, falsification, and the crisis of confidence in social psychology,” *Frontiers in Psychology*, May 2015, *6*(621), 1–11. + +**Gilbert, Daniel T., Gary King, Stephen Pettigrew, and Timothy D. Wilson**, “Comment on “Estimating the reproducibility of psychological science”,” *Science*, 2016, *351*(6277), 1037–1037. + +**McKelvey, Richard D. and Thomas R. Palfrey**, “An Experimental Study of the Centipede Game,” *Econometrica*, 1992, *60*(4), 803–836. + +**McShane, Blakeley B. and David Gal**, “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence”, *Management Science*, June 2016, *62*(6), 1707–1718. + +**Open Science Collaboration**, “Estimating the reproducibility of psychological science,” *Science*, 2015, *349*(6251). + +**Shrout, Patrick E. and Joseph L. Rodgers**, “Psychology, Science, and Knowledge Construction: Broadening Perspectives from the Replication Crisis,” *Annual Review of Psychology*, 2018, *69*(1), 487–510. PMID: 29300688. + +**Verhagen, Josine and Eric-Jan Wagenmakers**, “Bayesian Tests to Quantify the Result of a Replication Attempt,” *Journal of Experimental Psychology: General*, August 2014, *143*(4), 1457–1475. + +**Ye, Teng, Wei Ai, Yan Chen, Qiaozhu Mei, Jieping Ye, and Lingyu Zhang**, “Virtual teams in a gig economy,” *Proceedings of the National Academy of Sciences*, 2022, *119*(51), e2206580119. 
\ No newline at end of file diff --git a/content/replication-hub/blog/stephanie-wykstra-on-data-re-use.md b/content/replication-hub/blog/stephanie-wykstra-on-data-re-use.md new file mode 100644 index 00000000000..b649251eb56 --- /dev/null +++ b/content/replication-hub/blog/stephanie-wykstra-on-data-re-use.md @@ -0,0 +1,77 @@ +--- +title: "STEPHANIE WYKSTRA: On Data Re-use" +date: 2016-05-19 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "BITSS" + - "Data Re-use" + - "Open data" + - "Research Transparency Initiative" + - "Stephanie Wykstra" +draft: false +type: blog +--- + +###### [*THIS BLOG ORIGINALLY APPEARED ON THE **[BITSS WEBSITE](http://www.bitss.org/2016/05/17/call-for-cases-of-data-reuse-still-seeking-answers/)***] As advocates for open data, my colleagues and I often point to re-use of data for further research as a major benefit of data-sharing. In fact there *are* many cases in which shared data were clearly very useful for further research. Take the Sloan Digital Sky Survey (***[SDSS](http://www.sdss.org/)***) data, which researchers have used for ***[nearly 6,000 papers](http://www.sloan.org/major-program-areas/stem-research/sloan-digital-sky-survey/?L=02525252525252F)***. Or take ***[Genbank](http://www.ncbi.nlm.nih.gov/genbank/)***, within bioinformatics, which is a widely used database of nucleotide and protein sequence data. Within social science, large-scale surveys such as the Demographic and Health Survey (***[DHS](http://dhsprogram.com/What-We-Do/Survey-Types/DHS.cfm)***) are used by many, many researchers as well as policy-makers. + +###### **Research data re-use: where are the cases?** + +###### In spite of the obviousness of the value of data-sharing *in general*, I realized that we didn’t have many cases of re-use of research data. By “research data” here, I have in mind data which were collected by an individual researcher or research team for their own project (e.g. from a field experiment), and then shared along with the publication. This differs from databases like SDSS, Genbank and DHS in a few different ways: + +###### — The data are often much smaller scale than DHS or SDSS; they are often studies of a few hundred to a few thousand subjects. + +###### — They are not part of a unified data-gathering effort using common measures (as are SDSS and DHS), but rather use their own instruments, often with their own non-standardized measures. + +###### — While it’s fairly clear that researchers can use SDSS data for their own research, and bioinformaticists can use Genbank data, it’s less clear how social scientists would re-use data that other researchers collected for the purpose of their own study. In general, they could use data for secondary analysis or meta-analysis; however, we haven’t seen numerous examples. + +###### **A call for case studies of data re-use** + +###### After a brainstorming session with ***[Stephanie Wright](https://www.mozillascience.org/u/stephwright)***, a colleague at Mozilla Science Lab, we decided to put out a call for cases of data re-use. For this project, we were particularly interested in cases of re-use within economics or political science. Since we support data-sharing among researchers and research staff, we want to be able to point to cases of real-world re-use, and to delve into what made the data particularly useful. 
We wrote a ***[post](https://www.mozillascience.org/share-your-story)*** on our project, along with a ***[survey](https://docs.google.com/forms/d/1z2vFXX9BK4mNc5oDOnVG7MwTrLfN5CJebKjmsqHbTXA/viewform)*** on data re-use, and shared them in venues such as IASSIST, Polmeth, the Berkeley Initiative for Transparency in the Social Sciences’ blog, the Open Science Collaboration’s discussion board, Mozilla Science Lab’s blog and various data librarian email lists. + +###### **What did we find through our call?** + +###### We received 14 responses to our call, including 10 responses to our [survey](https://docs.google.com/forms/d/1z2vFXX9BK4mNc5oDOnVG7MwTrLfN5CJebKjmsqHbTXA/edit) and 4 emailed responses. While the number and quality of responses aren’t sufficient for us to learn a great deal, we want to share what we found in any case, for two reasons: (1) this call and response could be informative to those who are considering putting out a similar survey and (2) we think our findings do provide some evidence confirming our initial feeling that this is an area which warrants further work and research. + +###### Our 10 survey respondents are in a variety of fields: one in political science, two in psychology, one in education and most of the rest in biochemistry. While all respondents did mention some data that were re-used, only three gave examples of the kind that we had requested, e.g. data that had been collected by other researchers for their study, and then re-used for further research. The three cases included: + +###### — Re-use of data from a collaboration of Psychology instructors, which collected data on emerging adulthood and politics. The data were not initially used for a publication, as intended, but were archived and were used for nine published articles later on. + +###### — A researcher in political science gave several of his own research re-use cases in which publicly available data were used a) for a replication to “illustrate the usefulness of a new fit assessment technique for binary DV [dependent variable] models,” b) for pedagogical purposes in a book on causal inference and c) to test a new theory. + +###### — Researchers in psychology used data from two large-scale studies on the benefits and transfer effects of a cognitive training program for older adults. The data were used to test whether a subset of one test (the Useful Field of View test) was able to predict scores on another test (the Instrumental Activities of Daily Living test). + +###### Beyond the cases above, we heard about re-use of protein sequence data and genomics data from databases such as [ArrayExpress](https://www.ebi.ac.uk/arrayexpress/) and ***[Protein Data Bank](http://www.rcsb.org/pdb/home/home.do)***, as well as government data from ***[Open Data Toronto](http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=9e56e03bb8d1e310VgnVCM10000071d60f89RCRD)*** and ***[Statistics Canada](http://www.statcan.gc.ca/start-debut-eng.html)***. See our ***[spreadsheet](https://docs.google.com/spreadsheets/d/1rZTRX8XHE2WovwCNKIsso-AbIkbbX7mdxhtSK8JhtTQ/pubhtml)*** for further details (we asked for permission to share responses). + +###### In addition to the cases gathered through our survey, we received four emails with tips about where to find additional cases. 
One of the suggestions mentioned the ***[Global Biodiversity Information Facility](http://www.gbif.org/)*** (GBIF), a database on global biodiversity, as well as ***[International Polar Year](http://www.arctic.noaa.gov/ipy.html)*** (IPY), a coordinated research effort on the polar regions. A second suggestion from a political scientist pointed to two sites, the ***[Uppsala Conflict Data Program](http://ucdp.uu.se/)*** and the ***[Correlates of War Program](http://www.correlatesofwar.org/)***. Both sites offer data which are widely used by scholars within international relations, and include variables which are constructed by scholars for their own research, and then submitted to the databases for others to re-use. + +###### Finally, we received several suggestions from fellow open data advocates about places to look for cases of re-use. The first source, ***[Dissemination Information Packages for Information Re-use](http://www.oclc.org/research/themes/user-studies/dipir.html)*** (DIPIR), is a study of data re-use in three communities (quantitative social scientists, archaeologists, and zoologists). The second is ICPSR’s ***[bibliography of data-related literature](https://www.icpsr.umich.edu/icpsrweb/ICPSR/citations/)***, which is a searchable database of “over 70,000 citations of published and unpublished works resulting from analyses of data held in the ICPSR archive.” The third is UK Data Archive’s list of ***[case studies of data re-use](https://www.ukdataservice.ac.uk/use-data/data-in-use)***. + +###### **Data re-use: key for rewarding data-sharing** + +###### The data-sharing movement is gaining steam. From funders requiring data-sharing to new guidelines for journals (***[TOP guidelines](https://cos.io/top/)***) and journal requirements, to the rise of many data repositories, there is plenty of effort going into requiring and supporting data-sharing. Yet there are huge issues to confront as we move forward. One of the biggest is how to change from a culture in which data-sharing is not a norm among researchers (as is still the case in many scientific fields) to one in which it is. + +###### Researchers are rewarded for publishing, not for sharing data, and many researchers cite barriers to sharing data such as lack of time and lack of support (***[Tenopir et al. 2011](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0021101)***). How will we shift to rewarding researchers for sharing their data, so that they have professional incentives to take the time to prepare and share data? One of the most-discussed ways is to develop good data-citation norms, and then to reward researchers (via tenure committee decisions) when others re-use and [cite their data](https://www.datacite.org/). + +###### So, the question of how to promote and encourage data re-use is of clear importance. Yet, as practitioners in the open science movement, we have many questions. When it comes to re-using data from colleagues’ studies, particularly in the social sciences, what factors make datasets particularly helpful to researchers? What challenges arise in re-using data? As data curators and open data advocates, what could we do better to facilitate re-use? Is there something we can do to encourage others to look at and re-use existing data when they are considering new research projects? How can we increase opportunities for re-using data and decrease barriers? 
+ +###### **Next steps** + +###### Particularly when it comes to data shared by researchers in the social sciences, we still need more examples of re-use. We also need much more investigation into what would make researchers more likely to re-use data from colleagues for their own research. We can think of a couple of interesting projects that we could undertake: + +###### — Delving into the archives from ICPSR and UK Data Archive, as well as others mentioned above, and attempting to glean lessons from specific cases of re-use found there. + +###### — Contacting researchers that have downloaded data from archives such as IPA’s data repository (we track data users and ask them permission to contact them, when they download data). We could gather more detailed information about what was or wasn’t helpful for re-use about the data and other materials as presented in the repository. We could also try to gather more information on whether data were re-used for further research (and if so, what made them particularly attractive for re-use). + +###### We’re certainly open to further suggestions, so please get in touch if you have ideas! + +###### *Stephanie Wykstra directs the **[Research Transparency Initiative](http://www.poverty-action.org/researchers/research-resources/research-transparency)** at Innovations for Poverty**Action, and also works as an independent research consultant. She may be contacted at [stephanie.wykstra@gmail.com](mailto:stephanie.wykstra@gmail.com).* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/05/19/stephanie-wykstra-on-data-reuse/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/05/19/stephanie-wykstra-on-data-reuse/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/sutch-lesson-learned-from-replicating-piketty-how-not-to-do-economic-history.md b/content/replication-hub/blog/sutch-lesson-learned-from-replicating-piketty-how-not-to-do-economic-history.md new file mode 100644 index 00000000000..94859a5ed23 --- /dev/null +++ b/content/replication-hub/blog/sutch-lesson-learned-from-replicating-piketty-how-not-to-do-economic-history.md @@ -0,0 +1,58 @@ +--- +title: "SUTCH: Lesson Learned from Replicating Piketty — How Not To Do Economic History" +date: 2017-10-26 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Capitalism" + - "Economic History" + - "inequality" + - "replication" + - "Richard Sutch" + - "Thomas Piketty" +draft: false +type: blog +--- + +###### *[NOTE: This post refers to the article “The One Percent across Two Centuries: A Replication of Thomas Piketty’s Data on the Concentration of Wealth in the United States” by Richard Sutch. It appears in the current issue of the journal***[*Social Science History*](https://www.cambridge.org/core/journals/social-science-history/article/one-percent-across-two-centuries-a-replication-of-thomas-pikettys-data-on-the-concentration-of-wealth-in-the-united-states/20F44C37D29070B205D5FF33B30131C1)***].* + +###### When Thomas Piketty’s blockbuster on economic inequality, **[*Capital in the Twenty-First Century*](https://en.wikipedia.org/wiki/Capital_in_the_Twenty-First_Century)**, appeared several years ago, economists quickly praised and then passed over the data he had packaged in graphic form in order to scrutinize, criticize, and debate his interpretation and analysis. 
Even those most skeptical of Piketty’s theories offered uncritical praise for his data. Yet, there is a danger lurking here. + +###### The British historian Herbert Butterfield warned that the “truth of history is no simple matter, all packed and parcelled ready for handling in the market-place. … The understanding of the past is not so easy as it is sometimes made to appear” [Butterfield 1931: 132]. Economists, perhaps more so than historians, are apt to take historical statistics as given, ready for interpretation and analysis. They forget that the ingenuity and the artistry that created the spreadsheet of numbers also produces an idiosyncratic picture of the past. + +###### In my article I retrace the steps Piketty took to come up with his estimates for the fraction of the total wealth of the United States owned by the wealthiest one-percent and the wealthiest ten-percent of U.S. households. These time series span two centuries beginning in 1810. Piketty displayed his estimates, conveniently packed and parceled for easy reference, in a single chart (Figure 10.5, p. 348). The book does not go behind the scenes to describe how he came by the numbers; but, to his credit, Piketty made that information available in an online technical appendix. + +###### I conclude that Piketty’s data for the wealth share of the top *ten percent* over the period 1870-1970 are unacceptable – they add nothing to the evidence base.  The values he reported are manufactured from the observations for the top *one percent* inflated by a constant 36 percentage points. He does not explain or defend this dubious procedure. + +###### Piketty’s data for the top one percent of the distribution for the nineteenth century (1810-1910) are also unhelpful. They are based on a single mid-century observation (for 1870) that provides no guidance about the antebellum trend in inequality and only very tenuous information about trends in inequality during the Gilded Age. + +###### The values for the top one percent that Piketty reported for the twentieth century (1910-2010) are based on more solid ground, but a smoothing procedure he applied to the noisy raw data muted the marked rise of inequality during the Roaring Twenties and the decline associated with the Great Depression.  The reversal of the sustained decline in inequality during the 1960s and 1970s and the subsequent sharp rise in the 1980s are hidden by a twenty-six-year interpolation. + +###### Ironically, Piketty underestimated the rise in inequality over the last decade. This neglect of the shorter-run changes is unfortunate because it makes it difficult to discern the impact of policy changes (income and estate tax rates) and shifts in the structure and performance of the economy (depression, inflation, executive compensation) on changes in wealth inequality. + +###### How serious are Piketty’s departures from good practice? On one level, you might say his major point of alarm is undisturbed. His sloppiness caused him to underestimate the seriousness of today’s problem by neglecting the increase in inequality produced by the Reagan-era tax cuts. + +###### But, Piketty goes beyond presenting the numbers in his chart. He makes conjectures about how America became so unequal. He makes predictions about the future. He makes policy suggestions based on those conjectures. His rhetoric implies confidence in his reading of history. His projections and policy solutions imply that the confidence is warranted by a solid under-girding of data. 
+ +###### As an economic historian I am unhappy with his historical narrative. As a citizen I am concerned that policies based uncritically on his theoretical model and predictions will not remedy the problem. + +###### The results I report inflict some damage on Piketty’s credibility. I fear that they also weaken confidence in the economics profession’s ability to derive insight from a scientific examination of historical data. That is a shame, since the increasing concentration of wealth is a serious problem that deserves to be studied by experts and addressed by policy makers. + +###### *Richard Sutch is Distinguished Professor of Economics at the University of California, Riverside (emeritus). He can be contacted via email at richard.sutch@ucr.edu.* + +###### REFERENCES + +###### Butterfield, Herbert (1931). *The Whig Interpretation of History*, W.W. Norton, 1965. + +###### Piketty, Thomas (2014). *Capital in the Twenty-First Century*, translated by Arthur Goldhammer, Harvard University Press, 2014. + +###### Sutch, Richard (2017). “The One Percent across Two Centuries: A Replication of Thomas Piketty’s Data on the Concentration of Wealth in the United States.” *Social Science History* 41 (4) Winter 2017: 587-613. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2017/10/26/sutch-lesson-learned-from-replicating-piketty-how-not-to-do-economic-history/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2017/10/26/sutch-lesson-learned-from-replicating-piketty-how-not-to-do-economic-history/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/sven-vlaeminck-data-policies-at-economics-journals-theory-and-practice.md b/content/replication-hub/blog/sven-vlaeminck-data-policies-at-economics-journals-theory-and-practice.md new file mode 100644 index 00000000000..ce6cf3803a7 --- /dev/null +++ b/content/replication-hub/blog/sven-vlaeminck-data-policies-at-economics-journals-theory-and-practice.md @@ -0,0 +1,54 @@ +--- +title: "SVEN VLAEMINCK: Data Policies at Economics Journals: Theory and Practice" +date: 2015-10-09 +author: "The Replication Network" +draft: false +type: blog +--- + +###### In economic sciences, empirically-based studies have become increasingly important: according to **[Hamermesh (2012)](http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04deaa6388b1/publication/566)**, the number of contributions to journals in which authors utilized self-collected or externally produced datasets for statistical analyses has increased massively over the last decades. + +###### With the growing relevance of publications based on empirical research, new questions and challenges emerge: issues such as integrating research data and the scripts that run a data model into the broader context of a published article, in order to foster replicable research and the validation of scientific results, are becoming increasingly important for both researchers and editors of scholarly journals. + +###### Especially for a scientific discipline like economics, flawed research can have a huge impact on society, as the **[prominent debate](http://www.nytimes.com/2013/04/19/opinion/krugman-the-excel-depression.html?_r=2)** over Reinhart and Rogoff’s “**[Growth in a Time of Debt](https://www.aeaweb.org/articles.php?doi=10.1257/aer.100.2.573)**” (2010) illustrated.
Their paper attracted much attention, and its results were cited by US vice presidential candidate [Paul Ryan](http://www.budget.house.gov/uploadedfiles/pathtoprosperity2013.pdf) and EU monetary affairs commissioner **[Olli Rehn](http://www.ec.europa.eu/archives/commission_2010-2014/rehn/documents/cab20130213_en.pdf)** to justify austerity policy. + +###### But when Rogoff and Reinhart provided the Excel sheet of their calculations to a student for teaching purposes in 2013, that student, Thomas Herndon, was not able to replicate the results of the paper. Furthermore, he **[discovered](http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04deaa6388b1/publication/566)** that the Excel sheet contained faulty calculations and selectively omitted data, which cast serious doubt on Reinhart and Rogoff’s findings. + +###### Although the paper was published in the American Economic Review (AER), a journal with a strict **[data availability policy](https://www.aeaweb.org/aer/data.php)**, it had been exempted from that policy. + +###### One could therefore ask how journals in economics and business studies generally handle the challenges associated with empirically-based research. At least in theory, journals should serve as quality assurance for economic research: peer review was established to ensure a high quality of published research. But apparently, peer review does not extend to the data appendices or other materials associated with empirically-based research. According to the US economist B.D. McCullough, journals often fail to ensure the robustness of published results. After investigating the data archives of some economics journals, he concluded: *“Despite claims that economics is a science, no applied economics journal can demonstrate that the results published in its pages are replicable, i.e., that there exist data and code that can reproduce the published results. No theory journal would dream of publishing a result (theorem) without a demonstration (proof) that the reader can trust the result.”* (McCullough, 2009) + +###### To analyse how journals in economics and business studies deal with the challenge of reproducible research, in 2010 we applied for funding from the German Research Foundation (DFG) for a project called “European Data Watch Extended – **[EDaWaX](http://www.edawax.de)**”. EDaWaX has several goals: the main objective is to develop a software application for editors of social science journals that facilitates the management of publication-related research data. To gather some of the functional requirements for the application, we analysed the number and specifications of existing data policies of economics journals for the first time in 2011. In 2014 we expanded our study: in our recent paper, we analysed the data policies of a sample of 346 scholarly journals, many of them among the top journals of the profession. In contrast to our 2011 study, we also included a large share of journals in business studies, in order to compare the two branches of economic research. + +###### For economics journals especially, we are able to state that things are changing slowly but steadily: more than a fourth of the economics journals in our sample are equipped with more or less functional data policies.
While some journals pay lip service to reproducible research, others effectively enforce their data policies. + +###### In ***[our paper](http://ebooks.iospress.nl/publication/40893)*** we summarise the findings of this empirical study. We examine both the extent and the quality of journals’ data policies, which should facilitate replications of published empirical research. The paper presents some characteristics of journals equipped with data policies and gives some recommendations for suitable data policies in economics and business studies journals. In addition, we evaluate the journals’ data archives to roughly estimate whether these journals enforce data availability. + +###### References: + +###### Hamermesh, D.S. (2012), Six Decades of Top Economics Publishing: Who and How?, National Bureau of Economic Research, Working Paper 18635. + +###### Herndon, T., Ash, M. & Pollin, R. (2013), Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff, Political Economy Research Institute. + +###### Krugman, P. (2013), The Excel Depression, The New York Times, April 19 2013, p. A31. + +###### McCullough, B.D. (2009), Open Access Economics Journals and the Market for Reproducible Economic Research, Economic Analysis and Policy 39, 1, pp. 117-126. + +###### Rehn, O. (2013), Letter to ECOFIN Ministers – ARES (2013)185796, February 13 2013. + +###### Reinhart, C.M. & Rogoff, K.S. (2010), Growth in a Time of Debt, American Economic Review 100, 2, 573–578. + +###### Ryan, P. (2013), The Path to Prosperity: A Blueprint for American Renewal. Fiscal Year 2013 Budget Resolution, House Budget Committee. + +###### The American Economic Review (2005), Data Availability Policy. + +###### Vlaeminck, S. & Herrmann, L.K. (2015), Data policies and data archives: A new paradigm for academic publishing in economic sciences? In: Schmidt, B. & Dobreva, M. (Eds.), New Avenues for Electronic Publishing in the Age of Infinite Collections and Citizen Science: Scale, Openness and Trust. Proceedings of the 19th International Conference on Electronic Publishing, September 2015. Retrieved from [doi:10.3233/978-1-61499-562-3-145](http://www.dx.doi.org/10.3233/978-1-61499-562-3-145). + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2015/10/09/vlaeminck-data-policies-at-economics-journals-theory-and-practice/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2015/10/09/vlaeminck-data-policies-at-economics-journals-theory-and-practice/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/ter-schure-accumulation-bias-how-to-handle-it-all-in.md b/content/replication-hub/blog/ter-schure-accumulation-bias-how-to-handle-it-all-in.md new file mode 100644 index 00000000000..cce57f22f06 --- /dev/null +++ b/content/replication-hub/blog/ter-schure-accumulation-bias-how-to-handle-it-all-in.md @@ -0,0 +1,214 @@ +--- +title: "TER SCHURE: Accumulation Bias – How to handle it ALL-IN" +date: 2020-12-04 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Uncategorised" + - "Accumulation bias" + - "Meta-analysis" +draft: false +type: blog +--- + +An estimated 85% of global health research investment is wasted (Chalmers and Glasziou, 2009); a total of one hundred billion US dollars in the year 2009 when it was estimated. The movement to reduce this research waste recommends that previous study results be taken into account when prioritising, designing and interpreting new research (Chalmers et al., 2014; Lund et al., 2016). Yet any recommendation to increase efficiency this way requires that researchers evaluate whether the studies already available are sufficient to complete the research effort; whether a new study is necessary or wasteful. These decisions are essentially stopping rules – or rather noisy accumulation processes, when no rules are enforced – and unaccounted for in standard meta-analysis. Hence reducing waste invalidates the assumptions underlying many typical statistical procedures. + +Ter Schure and Grünwald (2019) detail all the possible ways in which the size of a study series up for meta- analysis, or the timing of the meta-analysis, might be driven by the results within those studies. Any such dependency introduces *accumulation bias*. Unfortunately, it is often impossible to fully characterize the processes at play in retrospective meta-analysis. The bias cannot be accounted for. In this blog we revisit an example accumulation bias process, that can be one of many influencing a single meta-analysis, and use it to illustrate the following key points: + +– Standard meta-analysis does not take into account that researchers decide on new studies based on other study results already available. These decisions introduce accumulation bias because the analysis assumes that the size of the study series is unrelated to the studies within; it essentially conditions on the number of studies available. + +– Accumulation bias does not result from questionable research practices, such as publication bias from file-drawering a selection of results. The decision to replicate only some studies instead of all of them biases the sampling distribution of study series, but can be a very efficient approach to set priorities in research and reduce research waste. + +– ALL-IN meta-analysis stands for *Anytime*, *Live* and *Leading INterim* meta-analysis. It can handle accumulation bias because it does not require a set number of studies, but performs analysis on a growing series – starting from a single study and accumulating as many studies as needed. + +– ALL-IN meta-analysis also allows for continuous monitoring of the evidence as new studies arrive, even as new interim results arrive. Any decision to start, stop or expand studies is possible, while keeping valid inference and type-I error control intact. Such decisions can be strategic: increasing the value of new studies, and reducing research waste. 
+ +**Our example: extreme *Gold Rush* accumulation bias** + +We imagine a world in which a series of studies is meta-analyzed as soon as three studies become available. Many topics deserve an initial study, but the research field is very selective with its replications. Nevertheless, for significant results in the right direction, a replication is warranted. We call this the *Gold Rush* scenario, because after each finding of a positive significant result – the gold in science – some research group rushes into a replication, but as soon as a study disappoints, the research effort is terminated and no-one bothers to ever try again. This scenario was first proposed by Ellis and Stewart (2009) and formulated in detail and under this name by Ter Schure and Grünwald (2019). Here we consider the most extreme version of the *Gold Rush*, where finding a significant positive result not only makes a replication more probable, but even inevitable: the dependency of occurring replications on their predecessor’s result is deterministic. + +**Biased *Gold Rush* sampling** + +We denote the number of studies available on a certain topic by *t*. This number *t* can also indicate the *timing* of a meta-analysis, such that a meta-analysis can possibly occur at number of studies *t* = 1, 2, 3, . . . up to some maximum number of studies *T*. This notation follows Ter Schure and Grünwald (2019); the Technical Details at the end of this blog make the notation more explicit. + +We summarize the results of individual studies into a single per-study *Z*-score (*z*1 for the first study, *z*2 for the second, etc.), such that we have the following information on a series of size *t*: *z*1, *z*2, . . . , *zt*. We distinguish between *Z*-scores that are significant and in the right direction, and *Z*-scores that are not. A first significant positive study is indicated by *z*1 = *z*1\* (*z*1 > *zα*, with *zα* = 1.96 for *α* = 2.5%). A first nonsignificant or negative study is indicated by *z*1 = *z*1– (*z*1 ≤ *zα*). We use the same notation for the second and third study and limit our world to three studies (our maximum *T* = 3). After all, we meta-analyze studies on any topic, but only once that topic has spurred a series of three studies. Our *Gold Rush* world consists of the following possible study series: + +***Gold Rush world*** + +[![](/replication-network-blog/image-17.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-17.webp) + +Here *A*(*t*) denotes whether we accumulate *and* analyze the *t* studies: it can be that *A*(2) = 0 and *A*(3) = 0 because we are stuck at one study, but also *A*(1) = 0 because we don’t “meta-analyze” that single study. It can only be that *A*(2) = 1 if we accumulate *and* meta-analyze a two-study series, and *A*(3) = 1 if we accumulate *and* meta-analyze a three-study series. In our *Gold Rush* world a very specific subset of studies accumulates into three-study series that are meta-analyzed (*A*(3) = 1). + +*z*(3) denotes the *Z*-score of a fixed-effects meta-analysis. This meta-analysis *Z*-score is simply a re-normalized average and can, assuming equal sample sizes and variances in all studies, be obtained from the individual study *Z*-scores as *z*(3) = (*z*1 + *z*2 + *z*3)/√3. The effects of accumulation bias are not limited to fixed-effects meta-analysis (see for example Kulinskaya et al. (2016)), but fixed-effects meta-analysis does provide us with a simple illustration for the purposes of this blog.
+ +We observe in our *Gold Rush* world above that the study series that are eventually meta-analyzed into a *Z*-score *z*(3) are a very biased subset of all possible study series. So we expect these *z*(3) scores to be biased as well. In the next section, we simulate the sampling distribution of these *z*(3) scores to illustrate this bias. + +**The conditional sampling distribution under extreme *Gold Rush* accumulation bias** + +Assume that only true null effects are studied in our *Gold Rush* world, such that any new study builds on a false-positive result. How large would the bias be if the three-study series are simply analyzed by standard meta-analysis? We illustrate this by simulating this *Gold Rush* world using the R code below. + +[![](/replication-network-blog/trn220201204-1.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/trn220201204-1.webp) + +**Theoretical sampling process:** A fixed-effects meta-analysis assumes that if three studies *z*1, *z*2, *z*3 are each sampled under the null hypothesis, each has a standard normal distribution with mean zero, and the standard normal sampling distribution also applies to the combined *z*(3) score. The R code in Figure 1 illustrates this sampling process: first, a large population of possible first (Z1), second (Z2) and third (Z3) studies is simulated from a standard normal distribution. Then, in Zmeta3, each index i represents a possible study series, such that c(Z1[i], Z2[i], Z3[i]) samples an unbiased study series and calcZmeta calculates its fixed-effects meta-analysis *Z*-score *z*(3). So the large number of *Z*-scores in Zmeta3 captures the unbiased sampling distribution that is assumed for fixed-effects meta-analysis *z*(3)-scores. + +***Gold Rush* sampling process:** In contrast, the code resulting in A3 selects only those study series for which *A*(3) = 1 under extreme *Gold Rush* accumulation bias. So the large number of *Z*-scores in Zmeta3.A3 captures a biased sampling distribution for the fixed-effects meta-analysis *z*(3)-scores. + +**Meta-analysis under *Gold Rush* accumulation bias:** The final lines of code in Figure 1 plot two histograms of *z*(3) samples, one with and one without the *Gold Rush* *A*(*t*) accumulation bias process, based on Zmeta3.A3 and Zmeta3 respectively. Figure 2 gives the result. + +[![](/replication-network-blog/image.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image.webp) + +We observe in Figure 2 that the theoretical sampling process, resulting in the pink histogram, gives a distribution for the three-study meta-analysis *z*(3)-scores that is centered around zero. Under the *Gold Rush* sampling process, however, our three-study *z*(3)-scores do not behave like this theoretical distribution at all. The blue histogram has a smaller variance and is shifted to the right – representing the bias. + +We conclude that we should not use conventional meta-analysis techniques to analyze our study series under *Gold Rush* accumulation bias: conventional fixed-effects meta-analysis assumes that any three-study summary statistic *Z*(3) is sampled from the pink distribution in Figure 2 under the null hypothesis, such that the meta-analysis is significant for *Z*(3)-scores larger than *zα* = 1.96 for a right-sided test with type-I error control *α* = 2.5%. Yet the actual blue sampling distribution under this accumulation bias process shows that a much larger fraction of series that accumulate three studies will have *Z*(3)-scores larger than 1.96 than is assumed by the theory of random sampling. This (extremely) inflated proportion of type-I errors is 88% instead of 2.5% in our extreme *Gold Rush*, and can be obtained from our simulation by the code in Figure 3. + +[![](/replication-network-blog/trn320201204.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/trn320201204.webp)
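Since the Figure 1 and Figure 3 code is reproduced above only as images, here is a minimal R sketch of the simulation just described. It follows the variable names used in the prose (Z1, Z2, Z3, Zmeta3, A3, Zmeta3.A3) but is a reconstruction under those assumptions, not the original code:

```r
# Sketch of the extreme Gold Rush simulation (reconstruction, not the original code).
# All studies are true nulls, and a series only grows after a significant positive result.
set.seed(1)
numSim <- 1e6                       # number of simulated (potential) study series
alpha  <- 0.025
zalpha <- qnorm(1 - alpha)          # approximately 1.96

Z1 <- rnorm(numSim)                 # per-study Z-scores under the null: standard normal
Z2 <- rnorm(numSim)
Z3 <- rnorm(numSim)

Zmeta3 <- (Z1 + Z2 + Z3) / sqrt(3)  # fixed-effects meta-analysis Z-score of three studies

# Extreme Gold Rush: the second study only happens after a significant positive first
# study, and the third only after two significant positive studies, so A(3) = 1 here:
A3        <- (Z1 > zalpha) & (Z2 > zalpha)
Zmeta3.A3 <- Zmeta3[A3]

mean(Zmeta3 > zalpha)      # unbiased sampling: close to alpha = 2.5%
mean(Zmeta3.A3 > zalpha)   # under Gold Rush accumulation bias: roughly 88%
```

Histograms of Zmeta3 and Zmeta3.A3 give the contrast between the pink and blue distributions of Figure 2, and the last line reproduces the inflated type-I error computed by the code in Figure 3.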
**Accumulation bias can be efficient** + +The steps in the code from Figure 1 that arrive at the biased distribution in Figure 2 illustrate that accumulation bias is in fact a selection bias. Nevertheless, accumulation bias does not result from questionable research practices, such as publication bias from file-drawering a selection of results. The selection to replicate only some studies instead of all of them biases the sampling distribution of study series, but can be a very efficient approach to set priorities in research and reduce research waste. + +By inspecting our *Gold Rush* world a bit closer, we observe that a fixed-effects meta-analysis of three studies actually *conditions* on this number of studies (*A*(*t*) needs to be *A*(3) = 1), and that this conditional nature is what is driving the accumulation bias; in subsection A.3 of the technical details we show this explicitly. In the next section we take the unconditional view. + +**The unconditional sampling distribution under extreme *Gold Rush* accumulation bias** + +We first adapt our *Gold Rush* accumulation bias world a bit: we now meta-analyze not only three-study series but also one-study “series” and two-study series. All possible scenarios for study series in this “all-series-size” *Gold Rush* world are illustrated below. We assume that we only meta-analyze series in a terminated state, and therefore first await a replication for significant studies before performing the meta-analysis. So a single-study “meta-analysis” can only consist of a negative or nonsignificant initial study (*z*1–); only in that case are we in a terminated state with *A*(1) = 1, and the series does not grow to two (*A*(2) = 0). In a two-study meta-analysis the series starts with a significant positive initial study and is replicated by a nonsignificant or negative one; only in that case *A*(2) = 1, and the series does not grow to three, so *A*(3) = 0. And only three-study series that start with two significant positive studies are meta-analyzed in a three-study synthesis; only in that case *A*(3) = 1. + +**Gold Rush world; all-series-size** + +[![](/replication-network-blog/image-18.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-18.webp) + +The R code in Figure 4 calculates the fixed-effects meta-analysis *z*(1), *z*(2) and *z*(3) scores, conditional on meta-analyzing a one-study, two-study, or three-study series in this adjusted *Gold Rush* accumulation bias scenario. The histograms of these conditional *z*(*t*) scores are shown in Figure 5, including the theoretical unbiased *z*(3) histogram that was also shown in Figure 2, which largely overlaps with the “*A*(1) = 1, *A*(2) = 0” scenario. The difference between these two sampling distributions is only visible in their right tail, with the green histogram excluding values larger than *zα* = 1.96 and redistributing their mass over other values.
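A sketch of that calculation, continuing the code above (again a reconstruction under the same assumptions, not the original Figure 4 code):

```r
# All-series-size Gold Rush world: a series is meta-analyzed once it has terminated,
# i.e. after its first nonsignificant or negative result, or after three studies.
A1.only <- Z1 <= zalpha                     # one-study "series": A(1) = 1, A(2) = 0
A2.only <- (Z1 > zalpha) & (Z2 <= zalpha)   # two-study series:   A(2) = 1, A(3) = 0
A3.full <- (Z1 > zalpha) & (Z2 > zalpha)    # three-study series: A(3) = 1

z.t1 <- Z1[A1.only]                         # conditional z(1) scores
z.t2 <- ((Z1 + Z2) / sqrt(2))[A2.only]      # conditional z(2) scores
z.t3 <- ((Z1 + Z2 + Z3) / sqrt(3))[A3.full] # conditional z(3) scores

# Relative frequencies of one-, two- and three-study meta-analyses under the null:
# roughly (1 - alpha), alpha * (1 - alpha) and alpha^2
c(length(z.t1), length(z.t2), length(z.t3)) / numSim
```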
Figure 5 clarifies that single studies are hardly biased in this extreme *Gold Rush* scenario, that the bias is problematic for two-study series, and that it is most extreme for three-study ones. + +However, what this plot does not show us is how often we are in the one-study, two-study and three-study case. + +[![](/replication-network-blog/image-1.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-1.webp) +[![](/replication-network-blog/image-2.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-2.webp) + +To illustrate the relative frequencies of one-study, two-study and three-study meta-analyses, the code in Figure 6 samples the series in their respective numbers, instead of in equal numbers (which happens in the size = numSim.3series statement in Figure 4, part of creating the data frame). Plotting the total number of sampled *Z*-scores is dangerous for the single-study *z*(1)-scores, however, since there are so many of them (it can crash your RStudio session). So before plotting the histogram, a smaller sample (of size = 3\*numSim.3series in total) is drawn that keeps the ratios between *z*(1)s, *z*(2)s and *z*(3)s intact. + +The histogram in Figure 7 illustrates an unconditional distribution by the raw counts of the *z*(*t*)-scores: many result from a single study, very few from a two-study series and almost none from a three-study series. In fact, this unconditional sampling distribution is hardly biased, as we will illustrate with our table further below. + +[![](/replication-network-blog/image-3.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-3.webp) +[![](/replication-network-blog/image-4.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-4.webp) + +We first introduce an example of an ALL-IN meta-analysis to argue that such an unconditional approach can in fact be very efficient. + +**ALL-IN meta-analysis** + +Figure 8 shows an example of an ALL-IN meta-analysis. Each of the red/orange/yellow lines represents one of the ten separate studies in as many different countries. The blue line indicates the meta-analysis synthesis of the evidence: a live account of the evidence so far in the underlying studies. In fact, *ALL-IN* meta-analysis stands for *Anytime*, *Live* and *Leading INterim* meta-analysis, in which the *Anytime Live* property assures valid inference under continuous monitoring and the *Leading* property allows the meta-analysis results to inform whether individual studies should be stopped or expanded. It is important to note that such data-driven decisions would invalidate conventional meta-analysis by introducing accumulation bias. + +To interpret Figure 8, we observe that initially only the Dutch (NL) study contributes to the meta-analysis and the blue line completely overlaps with the light yellow one. Very quickly, the Australian (AU) study also starts contributing and the blue meta-analysis line captures a synthesis of the evidence in two studies. Later on, the studies in the US, France (FR) and Uruguay (UY) also start contributing and the meta-analysis becomes a three-study, four-study and five-study meta-analysis. How many studies contribute to the analysis, however, does not matter for its evidential value.
+ +[![](/replication-network-blog/image-5.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-5.webp) + +Some studies (like the Australian one) are much larger than others, such that under a lucky scenario this study could reach the evidential threshold even before other studies start observing data. This threshold (indicated at 400) controls type-I errors at a rate of *α* = 1/400 = 0.0025 (details in the final section). So in repeated sampling under the null, the combined studies have a probability smaller than 0.25% of ever crossing this threshold. In this repeated sampling the size of the study series is essentially random: we can be lucky and observe very convincing data in the early studies, making more studies superfluous, or we can be unlucky and in need of more studies. The threshold can be reached with a single study, with a two-study meta-analysis, with a three-study one, and so on, and the repeated sampling properties, like type-I error control, hold on average over all those sampling scenarios (so unconditional on the series size). + +ALL-IN meta-analysis allows for meta-analyses with type-I error control, while completely avoiding the effects of accumulation bias and multiple testing. This is possible for two reasons: (1) we do not just perform meta-analyses on study series that have reached a certain size, but continuously monitor study series irrespective of the current number of studies in the series; (2) we use likelihood ratios (and their cousins, e-values; Grünwald et al., 2019) instead of raw *Z*-scores and *p*-values; we say more on likelihood ratios further below. + +**Accumulation bias from ALL-IN meta-analysis vs *Gold Rush*** + +The ALL-IN meta-analysis in Figure 8 illustrates an improved efficiency achieved by not setting the number of studies in advance, but letting it rely on the data and be – just like the data itself – essentially random before the start of the research effort. This introduces dependencies between study results and series size that can be expressed in ways similar to *Gold Rush* accumulation bias. Yet this field of studies might make decisions differently from our *Gold Rush*: a positive nonsignificant result might not terminate the research effort, but encourage extra studies. And instead of always encouraging extra studies, a very convincing series of significant studies might conclude the research effort. If a series of studies is dependent on any such data-driven decisions, the use of conventional statistical methods is inappropriate. These dependencies do not actually have to be extreme at all: many fields of research might be a bit like the *Gold Rush* scenario in their response to finding significant negative results of harm. A widely known study result that indicated significant harm might make it very unlikely that the series will continue to grow. So large study series will very rarely have a completely symmetric sampling distribution, since initial studies that observe results of significant harm do not grow into large series. Hence this small aspect of accumulation bias alone already invalidates conventional meta-analysis, which assumes such symmetric distributions under the null hypothesis, with equal mass on significant effects of harm and benefit. + +**Properties averaged over time** + +Accumulation bias can already result from simply excluding results of significant harm from replication.
This exclusion also takes place under extreme *Gold Rush* accumulation bias, since results of significant harm as well as all nonsignificant results are not replicated. Fortunately, any such scenario can be handled by taking an unconditional approach to meta-analysis. We will now give an intuition for why this is true in the case of our extreme *Gold Rush* scenario: initial studies have a bias that balances the bias in larger study series when averaged over series size and analyzed in a certain way. + +Table 1 is inspired by Senn (2014) (a different question, but a similar answer) and represents our extreme *Gold Rush* world of study series. It takes the same approach as Figure 7 and indicates the probability of meta-analyzing a one-study, two-study or three-study series of each possible form under the null hypothesis. The three-study series are very biased, with two or even three out of three studies showing a positive significant effect. But the P0 column shows that the probability of being in this scenario is very small under the null hypothesis, as was also apparent from Figure 7. In fact, most analyses will be of the one-study kind, which hardly have any bias and are even slightly to the left of the theoretical standard normal null distribution. Exactly this phenomenon balances the biased samples of series of larger size. + +[![](/replication-network-blog/image-7.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-7.webp) + +A *Z*-score is marked by a \* and colored orange (e.g. *z*1\*) in case the individual study result is significant and positive (*z*1 ≥ *zα*, one-sided test), and by a – (e.g. *z*1–) otherwise. The column *t* indicates the number of studies and the \* column counts the number of significant studies. The fifth and sixth columns multiply P0 with the \* column and the *t* column to arrive at the expected values E0[\*] and E0[*t*] in the bottom row. + +The bottom row of Table 1 gives the expected value of the number of significant studies per series in the \*P0 column, and the expected value of the total number of studies per series in the *t*P0 column. If we use these expressions to obtain the ratio of the expected number of significant studies to the expected total number of studies, we get the following: + +[![](/replication-network-blog/image-8.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-8.webp) + +The proportion of expected significant effects to expected series size is still *α* in Table 1 under extreme *Gold Rush* accumulation bias, as it would also be without accumulation bias. + +This result is driven by the fact that there is a martingale process underlying this table. If a statistic is a martingale process and it has a certain value after *t* studies, the conditional expected value of the statistic after *t* + 1 studies, given all the past data, is equal to its value after *t* studies. So if our proportion of significant positive studies is exactly *α* for the first study (*t* = 1), we also expect to observe a proportion *α* if we grow our series by an additional study (*t* = 1 + 1 = 2). Accumulation bias does not affect such statistics when they are averaged over time if martingales are involved (Doob’s optional stopping theorem for martingales). You can verify this by deleting the last row for *z*1\*, *z*2\*, *z*3\* from our table and adding two rows for *t* = 4 in its place, with *z*1\*, *z*2\*, *z*3\* followed by either a fourth significant or a nonsignificant study. If you then calculate the ratio of expected significant effects to expected series size, you will again arrive at *α*.
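A quick numerical check of this claim (my own sketch based on the probabilities implied by Table 1, not code from the original post):

```r
# Under the null each study is significant and positive with probability alpha, so the
# four series types of Table 1 (z1-; z1*,z2-; z1*,z2*,z3-; z1*,z2*,z3*) occur with
# the probabilities P0 below.
alpha <- 0.025
P0    <- c(1 - alpha, alpha * (1 - alpha), alpha^2 * (1 - alpha), alpha^3)
nsig  <- c(0, 1, 2, 3)   # significant positive studies per series type
t     <- c(1, 2, 3, 3)   # series size per series type

sum(nsig * P0) / sum(t * P0)   # expected significant / expected size = 0.025 = alpha

# The same check with the table extended to series of size t = 4:
P0.4   <- c(1 - alpha, alpha * (1 - alpha), alpha^2 * (1 - alpha),
            alpha^3 * (1 - alpha), alpha^4)
nsig.4 <- c(0, 1, 2, 3, 4)
t.4    <- c(1, 2, 3, 4, 4)
sum(nsig.4 * P0.4) / sum(t.4 * P0.4)   # again alpha
```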
Martingale properties drive many approaches to sequential analysis, including the Sequential Probability Ratio Test (SPRT), group-sequential analysis and alpha spending. When applied to meta-analysis, any such inference essentially averages over series size, just like ALL-IN meta-analysis. + +**Multiple testing over time** + +Just having the expectation of some statistic unaffected by stopping rules is not enough to monitor data continuously, as in ALL-IN meta-analysis. We need to account for the multiple testing as well. In that respect, the approaches to sequential analysis differ by either restricting inference to a strict stopping rule (SPRT) or setting a maximum sample size (group-sequential analysis and alpha spending). + +ALL-IN meta-analysis takes an approach that is different from its predecessors and is part of an emerging field of sequential analysis for continuous monitoring with an unlimited horizon. These approaches are called *Safe* for optional stopping and/or continuation (Grünwald et al., 2019) or *anytime-valid* (Ramdas et al., 2020). The methods rely on nonnegative martingales (Ramdas et al., 2020), with the most well-known and useful martingale being the likelihood ratio. For a meta-analysis *Z*-score, a martingale process of likelihood ratios could look as follows: + +[![](/replication-network-blog/image-10.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-10.webp) + +The subscript 10 indicates that the denominator of the likelihood ratio is the likelihood of the *Z*-scores under the null hypothesis of mean zero, while the numerator is some alternative mean normal likelihood. The likelihood ratio becomes smaller when the data are more likely under the null hypothesis, but it can never become smaller than 0 (hence the “nonnegative” martingale). This is crucial, because a nonnegative martingale allows us to use Ville’s inequality (Ville, 1939), also called the universal bound by Royall (1997). For likelihood ratios, this means that we can set a threshold that guarantees type-I error control under any accumulation bias process and at any time, as follows: + +[![](/replication-network-blog/image-11.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-11.webp) + +The ALL-IN meta-analysis in Figure 8 is in fact based on likelihood ratios like this, and controls the type-I error at level 1/400 = 0.25% by the threshold of 400. + +The code below illustrates that likelihood ratios can also control type-I error rates under continuous monitoring when extreme *Gold Rush* accumulation bias is at play. Within our previous simulation, we again assume a *Gold Rush* world with only true null studies and very biased two-study and three-study series. The code in Figure 10 calculates likelihood ratios for the growing study series under accumulation bias, and Figure 11 illustrates that still very few likelihood ratios ever grow very large. + +[![](/replication-network-blog/image-12.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-12.webp) +[![](/replication-network-blog/image-13.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-13.webp) + +If we set our type-I error rate *α* to 5% and compare our likelihood ratios to 1/*α* = 20, we observe that fewer than 1/20 = 5% of the study series *ever* achieve a value of LR10 larger than 20 (Figure 12). The simulated type-I error is in fact much smaller than 5%, since in our *Gold Rush* world series stop growing at three studies; yet the procedure controls the type-I error even if none of these series stopped growing at three studies but all continued to grow forever. + +[![](/replication-network-blog/image-14.webp)](https://replicationnetwork.com/wp-content/uploads/2020/12/image-14.webp)
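A sketch of this check, continuing the simulation sketch above. The original calcLR code is only shown as an image; here a simple alternative is assumed in which each study's Z-score is normal with mean 1:

```r
# Per-study likelihood ratio: alternative N(1, 1) versus null N(0, 1)
lr.step <- function(z, mean.alt = 1) dnorm(z, mean = mean.alt) / dnorm(z, mean = 0)

LR1 <- lr.step(Z1)            # likelihood ratio after one study
LR2 <- LR1 * lr.step(Z2)      # after two studies
LR3 <- LR2 * lr.step(Z3)      # after three studies

# Under the Gold Rush process a series only reaches LR2 (LR3) if it actually grows
# to two (three) studies
grows.to2 <- Z1 > zalpha
grows.to3 <- grows.to2 & Z2 > zalpha

ever.large <- (LR1 >= 20) |
  (grows.to2 & LR2 >= 20) |
  (grows.to3 & LR3 >= 20)

mean(ever.large)   # stays (well) below 1/20 = 5%, as Ville's inequality guarantees
```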
The type-I error control is thus conservative, and we pay a small price in terms of power. That price is quite manageable, however, and can be tuned by setting the mean value of the alternative likelihood (arbitrarily set to mean = 1 in the code for calcLR of Figure 10). More on that in Grünwald et al. (2019) and the forthcoming preprint paper on ALL-IN meta-analysis. + +It is this small conservatism in controlling the type-I error that allows for full flexibility: there isn’t a single accumulation bias process that could invalidate the inference. Any data-driven decision is allowed. And data-driven decisions can increase the value of new studies and reduce research waste. + +**Conclusion** + +In our imaginary world of extreme *Gold Rush* accumulation bias, the sampling distribution of the meta-analysis *Z*-score behaves very differently from the sampling distribution assumed to calculate p-values and confidence intervals. A meta-analysis p-value conditions on the available sample size – on the sample size of the studies and on the number of studies available – and represents the tail area of this conditional sampling distribution under the null based on the observed *Z*-statistic. Analogously, a meta-analysis confidence interval provides coverage under repeated sampling from this conditional distribution. So if this sample size is driven by the data, as in any accumulation bias process, there is a mismatch between the assumed sampling distribution of the meta-analysis *Z*-statistic and the actual sampling distribution. + +We believe that some accumulation bias is at play in almost any retrospective meta-analysis, such that p-values and confidence intervals generally do not have their promised type-I error control and coverage. ALL-IN meta-analysis based on likelihood ratios can handle accumulation bias, even if the exact process is unknown. It also allows for continuous monitoring; multiple testing is no problem. Hence taking the ALL-IN perspective on meta-analysis will reduce research waste by allowing efficient data-driven decisions – not letting them invalidate the inference – and by incorporating single studies and small study series into meta-analysis inference. + +**Postscript** + +ALL-IN meta-analysis has been applied during the corona pandemic to analyze an accumulating series of studies while they were still ongoing. Each study investigated the ability of the BCG vaccine to prevent covid-19, but data on covid cases came in only slowly (fortunately). Meta-analyzing interim results, and making data-driven decisions, improved the possibility of finding efficacy earlier in the pandemic. A webinar on the methodology underlying this meta-analysis – the specific likelihood ratios – is available on [**https://projects.cwi.nl/safestats**/](https://projects.cwi.nl/safestats/) under the name ALL-IN-META-BCG-CORONA. + +*Judith ter Schure is a PhD student in the Department of Machine Learning at Centrum Wiskunde & Informatica in the Netherlands.
She can be contacted at Judith.ter.Schure@cwi.nl.* + +**Acknowledgements** + +My thanks go to Professor Bob Reed for inviting this contribution to his website and for his patience with its publication. I also want to acknowledge Professor Peter Grünwald for checking the details. Daniel Lakens provided me with great advice on writing this text in a more blog-like style. Muriel Pérez helped me with the details of the martingale underlying the table. + +**References** + +Iain Chalmers and Paul Glasziou. Avoidable waste in the production and reporting of research evidence. *The Lancet*, 114(6):1341–1345, 2009. + +Iain Chalmers, Michael B Bracken, Ben Djulbegovic, Silvio Garattini, Jonathan Grant, A Metin Gülmezoglu, David W Howells, John PA Ioannidis, and Sandy Oliver. How to increase value and reduce waste when research priorities are set. *The Lancet*, 383(9912):156–165, 2014. + +Hans Lund, Klara Brunnhuber, Carsten Juhl, Karen Robinson, Marlies Leenaars, Bertil F Dorch, Gro Jamtvedt, Monica W Nortvedt, Robin Christensen, and Iain Chalmers. Towards evidence based research. *BMJ*, 355:i5440, 2016. + +Judith ter Schure and Peter Grünwald. Accumulation Bias in meta-analysis: the need to consider time in error control [version 1; peer review: 2 approved]. *F1000Research*, 8:962, June 2019. ISSN 2046-1402. doi: 10.12688/f1000research.19375.1. + +Steven P Ellis and Jonathan W Stewart. Temporal dependence and bias in meta-analysis. *Communications in Statistics—Theory and Methods*, 38(15):2453–2462, 2009. + +Elena Kulinskaya, Richard Huggins, and Samson Henry Dogo. Sequential biases in accumulating evidence. *Research Synthesis Methods*, 7(3):294–305, 2016. + +Peter Grünwald, Rianne de Heide, and Wouter Koolen. Safe testing. *arXiv preprint arXiv:1906.07801*, 2019. + +Stephen Senn. A note regarding meta-analysis of sequential trials with stopping for efficacy. *Pharmaceutical Statistics*, 13(6):371–375, 2014. + +Aaditya Ramdas, Johannes Ruf, Martin Larsson, and Wouter Koolen. Admissible anytime-valid sequential inference must rely on nonnegative martingales. *arXiv preprint arXiv:2009.03167*, 2020. + +Jean Ville. Etude critique de la notion de collectif. *Bull. Amer. Math. Soc.*, 45(11):824, 1939. + +Richard Royall. *Statistical evidence: a likelihood paradigm*, volume 71. CRC Press, 1997. + +Judith ter Schure, Alexander Ly, Muriel F. Pérez-Ortiz, and Peter Grünwald. Safestats and all-in meta-analysis project page, 2020. + +This blog post discusses approaches to meta-analysis that control type-I error averaged over study series size. This is called error control *surviving over time* in Ter Schure and Grünwald (2019), as becomes clearer in the technical details. + +The R code used in this blog and a pdf with technical details can be found ***[here](https://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fosf.io%2Fp2rtw%2F&data=05%7C01%7Cj.a.terschure%40amsterdamumc.nl%7Ca91323af5e72460b021608da44d55147%7C68dfab1a11bb4cc6beb528d756984fb6%7C0%7C0%7C637897981106202574%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000%7C%7C%7C&sdata=LgCsqYjIDAa3hoRmOMSw6zGsi6eMq5YxVlB1lE2YLPw%3D&reserved=0)***.
+ +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2020/12/04/ter-schure-accumulation-bias-how-to-handle-it-all-in/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2020/12/04/ter-schure-accumulation-bias-how-to-handle-it-all-in/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/tol-de-weerd-wilson-special-issue-on-replication-in-energy-economics.md b/content/replication-hub/blog/tol-de-weerd-wilson-special-issue-on-replication-in-energy-economics.md new file mode 100644 index 00000000000..7dd0f2f46b4 --- /dev/null +++ b/content/replication-hub/blog/tol-de-weerd-wilson-special-issue-on-replication-in-energy-economics.md @@ -0,0 +1,43 @@ +--- +title: "TOL & DE WEERD-WILSON: Special Issue on Replication in Energy Economics" +date: 2016-12-22 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Energy Economics" + - "Journal Data Policies" + - "Journals" + - "replication" + - "Replication section" +draft: false +type: blog +--- + +###### Economics has become an empirical discipline. Applied econometrics has replaced mathematical economics in all but a few niche journals, and economists are collecting primary data again. But publication practices are lagging behind. Replication of a theoretical paper has never been an issue. You get out your pencil and paper and work through the proof of this or that theorem. Replication of empirical papers requires more consideration. + +###### Led by the American Economic Association, an increasing number of economics journals demand that empirical papers be replicable. Data and code are archived with the paper. Efforts are now underway, by ***[Mendeley](https://www.mendeley.com/datasets)*** and others, to make data and code searchable – papers, of course, have been searchable for a long time. + +###### But replicability is not replication. Economics papers are replicated all the time, but the results are the subject of classroom discussions, online gossip, and whispers at conferences – rather than published formally. The profession rewards original contributions and looks down at derivative research. The former sentiment is fine but the latter is not. Not every paper can break new ground. Not every economist can win a Nobel Prize. The numbers matter, particularly in policy advice, and checking someone else’s results is important for building confidence in the predictions we make about the impact of policy interventions. + +###### Energy economics is a subdiscipline of applied economics. Reliable, affordable and clean energy is fundamental to economic activity, social justice, and environmental quality. The results published in *Energy Economics* inform and shape energy policy. The results therefore had better be right. In this regard, energy economics is not different from health economics, labour economics, or education economics. *Energy Economics* takes replication sufficiently seriously to incentivise it. + +###### Inviting replication papers is dangerous. Anyone can download the .dta and .do files for a paper, click a few buttons, and claim success. At *Energy Economics*, we are still wondering how to report successful replications, how to reward replicators, and how to tell genuine claims of successful replication from trivial or fake ones. 
+ +###### For the ***[special issue of Energy Economics](https://www.journals.elsevier.com/energy-economics/call-for-papers/special-issue-on-replication-in-energy-economics)***, we therefore opted to stretch the notion of replication. We call for papers that replicate and update important but outdated papers. We call for papers that encompass and explain contrasting findings in previously published papers. Thus defined, replication is still derivative – there is no escape from that – but it is not intellectually barren. Replicators have to make an effort to get the reward: a publication in the top field journal. + +###### This is an experiment. We hope to attract interesting replications. While there certainly has been a lot of interest, we will have to carefully study whether the submitted papers meet our expectations. + +###### The experiment is not limited to *Energy Economics*. Elsevier wants to foster a new surge in reproducibility and reproduction. There are some perceived barriers to disseminating replication studies, such as the belief that they are only valuable if the results disagree with the original research, or that editors don’t want to publish these studies. We want to break these myths. Elsevier is now working on a range of initiatives that raise the bar on reproducibility and lower the barriers for researchers to publish replication studies, including a ***[series](https://www.elsevier.com/life-sciences/neuroscience/virtual-special-issue-neuroscience)*** of ***[Virtual Special Issues](https://www.elsevier.com/social-sciences/economics-and-finance/virtual-special-issue-on-replication-studies)***, a new article type especially for replication studies, and various calls for papers (the first one in *Energy Economics*) to encourage submissions. By empowering researchers to share their methods and data, championing rigorous and transparent reporting, and creating outlets for replication research, Elsevier is helping to make reproducibility and replication a reality. + +###### Making sure (published) research can be reproduced is a massive step towards making it trustworthy and showing peers, funders and the public that science can be trusted. Publishing replication studies contributes to building this trust, ultimately safeguarding science. + +###### *Richard Tol is the Editor-in-Chief of Energy Economics. He teaches at the University of Sussex and the Vrije Universiteit Amsterdam. Donna de Weerd-Wilson is the Executive Publisher of Energy Economics, and manages one of the Economics portfolios at Elsevier.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/12/22/tol-de-weerd-wilson-special-issue-on-replication-in-energy-economics/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/12/22/tol-de-weerd-wilson-special-issue-on-replication-in-energy-economics/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/tol-special-issue-on-replication-at-energy-economics.md b/content/replication-hub/blog/tol-special-issue-on-replication-at-energy-economics.md new file mode 100644 index 00000000000..80c939b6fc8 --- /dev/null +++ b/content/replication-hub/blog/tol-special-issue-on-replication-at-energy-economics.md @@ -0,0 +1,42 @@ +--- +title: "TOL: Special Issue on Replication at Energy Economics" +date: 2019-12-17 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "academic publishing" + - "Energy Economics" + - "Journal policies" + - "replication" + - "Reproducibility" + - "Richard Tol" +draft: false +type: blog +--- + +###### Replication is important. Many journals in economics, including *Energy Economics*, now insist on papers being published together with a replication package, and a few journals check that package prior to publication. This is a world apart from the common practice only a decade ago. However, the step change in *replicability* did not lead to a step change in *replication*. + +###### *Energy Economics* has therefore published a special issue on replication. We particularly invited replication of older but prominent research, that is, papers that are frequently cited or used in policy making. This type of paper asks whether the old results stand up if newer data are added and methods are brought up to date, and if not why. + +###### We also invited encompassing papers, taking a number of recent articles to check whether the results still hold if all the evidence is put together, comparing results across methods and data sets. No such papers were submitted to the special issue. *Energy Economics* now has “replication paper” as a new type of submission. + +###### Fifty-seven papers were submitted to the special issue, of which twenty-four were accepted. One author of a replicated paper submitted a comment. Most rejections were because the paper did not add much beyond a replication. The referees, unfamiliar with replication papers, to a person drew a clear distinction between a replication paper that confirms the technical competence of the original authors and a replication paper that adds value. + +###### Six of the twenty-three replications were unsuccessful. The relatively high success rate may be because energy economics is a mature field, and a modest one were few people chase headlines. + +###### Two papers stand out. Jeffrey Racine’s ***[paper](https://www.sciencedirect.com/science/article/pii/S0140988317302219)*** reviews software tools that integrate data, analysis, and writing, so as to minimize errors and ensure internal consistency. Bruns and Koenig wrote a pre-replication plan and invited the author of the replicated paper, Stern, to join in the replication. In the resulting ***[paper](https://www.sciencedirect.com/science/article/pii/S0140988318304031)***, they emphasize the importance of the pre-analysis plan to maximise objectivity and minimize conflict. + +###### The special issue demonstrates that there is a supply of replication papers. Serious scholars are prepared to make the time and effort to take a piece of previous research, check whether it withstands scrutiny, and report their findings in a constructive and respectful manner. The special issue also shows that referees are able to tell quality and worthwhile replications from ones that are less so. It is too early to say whether these replication papers are cited and count towards promotion. 
Finally, the special issue reveals that publishers too can be moved towards replication. + +###### **To check out the special issue on replication at *Energy Economics,* *[click here](https://www.sciencedirect.com/journal/energy-economics/vol/82/suppl/C)*.** + +###### *Richard Tol is a professor of economics at the University of Sussex and professor of the economics of climate change at the Vrije Universiteit Amsterdam. He is Editor-in-Chief at Energy Economics.* + +### Share this: + +* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2019/12/17/tol-special-issue-on-replication-at-energy-economics/?share=twitter)
* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2019/12/17/tol-special-issue-on-replication-at-energy-economics/?share=facebook)

Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/vallois-jullien-is-experimental-economics-really-doing-better-the-case-of-public-goods-experiments.md new file mode 100644 index 00000000000..5fd79974001 --- /dev/null +++ b/content/replication-hub/blog/vallois-jullien-is-experimental-economics-really-doing-better-the-case-of-public-goods-experiments.md @@ -0,0 +1,54 @@ +--- +title: "VALLOIS & JULLIEN: Is Experimental Economics Really Doing Better? The Case of Public Goods Experiments" +date: 2018-02-16 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "COLIN CAMERER" + - "experimental economics" + - "Public goods game" + - "replication" +draft: false +type: blog +--- + +###### *[From the working paper, “Replication in experimental economics: A historical and quantitative approach focused on public good game experiments” by Nicolas Vallois and Dorian Jullien]* + +###### The current “replication crisis” concerns the inability of scientists to “replicate”, *i.e.* to reproduce a large share of their empirical findings. Many disciplines are affected. Yet things appear to be better in experimental economics (EE): 61.1% of experimental results were successfully replicated in a large collaborative project recently published by eminent experimental economists in *Science* (Camerer et al., 2016). The authors suggest that EE’s results are more reproducible and robust than those in psychology, where a similar study found a replication rate of 38% (Open Science Collaboration, 2015). + +###### In our article, “Replication in experimental economics: A historical and quantitative approach focused on public good game experiments”, we provide a different perspective on the place of EE within the replication crisis. Our methodological innovation consists of looking at what we call “baseline replication”. The idea is straightforward. Experimental results are usually reported as a significant difference between a so-called “baseline condition” (or “control group”) and a treatment condition (usually similar to the baseline except for one detail). For a given type of experiment in economics, most studies will have a baseline condition that is similar or very close to another study’s baseline condition, so that, overall, it makes sense to check whether the observation in a given baseline condition is close to the average observation across all baseline conditions. “Baseline replication” refers to the fact that results in the baseline conditions of similar experiments converge toward the same level. In other words, while most studies investigate replications of “effects” between baseline and treatment conditions, we abstract from treatment conditions to look only at baseline replication.
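###### One simple way to make “baseline replication” operational (a hypothetical sketch, not necessarily the specification used in the working paper) is to ask two questions of the baseline results collected for such an exercise: do they drift over time, and does their dispersion around the overall mean shrink over time? A minimal illustration in Python, with purely invented placeholder data:

```python
# Hypothetical sketch (not the authors' code or data): two simple checks on
# baseline contribution rates collected from public goods game experiments.
import numpy as np
from scipy import stats

# Placeholder data: (publication year, mean baseline contribution rate in %)
studies = np.array([
    (1981, 55.0), (1984, 62.0), (1988, 48.0), (1992, 51.0), (1995, 44.0),
    (1999, 47.0), (2003, 38.0), (2007, 41.0), (2011, 33.0), (2014, 31.0),
])
years, rates = studies[:, 0], studies[:, 1]

# (1) Level: do baseline rates decline over time?
trend = stats.linregress(years, rates)
print(f"trend in level: {trend.slope:.2f} pp/year (p = {trend.pvalue:.3f})")

# (2) Convergence: does the spread around the overall mean narrow over time?
dispersion = np.abs(rates - rates.mean())
conv = stats.linregress(years, dispersion)
print(f"trend in dispersion: {conv.slope:.2f} pp/year (p = {conv.pvalue:.3f})")
```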
+ +###### Our observations are restricted to a specific type of economic experiment: public goods (PG) game experiments. We chose the PG game because the field is relatively homogeneous and representative of the whole discipline of EE. A typical PG game consists of a group of subjects, each of whom has money (experimental tokens) that can be used either to contribute to a PG (yielding returns to all the subjects in the group) or to invest in a private good (yielding returns only to the investor). + +###### Our data set consists of 66 published papers on PG game experiments. Sampling methods are described in the paper. We collected the baseline result of each study, i.e. the mean contribution rate in the baseline condition. Figure 1 (below) provides a graphical display of our data. + +###### *[Figure 1]* + +###### Our results are twofold: + +###### – First, there is a slight yet significant tendency for baseline results to converge over time. But the effect is very noisy and still allows for a substantial amount of between-study variation in the more recent time period. + +###### – Second, there is also a strongly significant decline of baseline results over time. The effect size is large: results in control conditions have decreased on average by 20% from 1979 to 2015. + +###### The first result (slight convergence) suggests that baseline replication over time plays the role of a “weak constraint” on experimental results. Very high contribution rates (above 60%) are less likely to be found in the 2000s-2010s. But the fluctuation range remains wide, and a 50% baseline contribution rate might still seem acceptable in the 2010s (when the average baseline is 32.9%). + +###### The second result (the decrease of baseline results over time) was unexpected and seemingly unrelated to our initial question, since we were investigating convergence between experimental results, not their decrease or increase over time. The 20% decrease in baseline contributions from 1979 to 2015 might suggest that early results were overestimated. A classical explanation for the overestimation of effect sizes in the empirical sciences is publication bias: impressive results are easier to publish at first; once the research domain becomes legitimized, publication practices favor more “normal” effect sizes. + +###### Hence, a first, optimistic interpretation of both results is that lab experiments are “self-correcting” over time. Exceptionally high control results are found less and less often in later time periods, meaning that the initial overestimation of effect sizes is eventually corrected. + +###### A second, less optimistic (though not pessimistic *per se*) interpretation is that both the convergence and the decrease in baseline results are the effect of a tendency toward standardization in experimental designs. Experimental protocols for baseline conditions in the 2000s-2010s do seem to be more and more similar. Similar experiments can be expected to yield similar results. But this does not necessarily constitute a scientific improvement. More homogeneous experimental methods might be the result of mimetic dynamics in research and might not measure the “real” contribution to PGs in the “real world”.
If we suppose that the real rate is somewhere around 70%, the initial high results of the 1980s would actually be closer to the real effect size than the 32.91% average contribution rate found in the 2010s. + +###### To test this hypothesis about “standardization”, we collected data on two important experimental parameters: the marginal per capita return (i.e., by how much each dollar contributed to the PG is multiplied before the whole PG is redistributed) and group size. We observe a clear tendency toward standardization from 1979 to 2015. After 2000, about two-thirds of PG games use the same basic experimental protocol, with groups of 4 or 5 persons and a linear PG payoff yielding an exact and fixed return of 0.3, 0.4, or 0.5; those values were found in only approximately one experiment out of four in the 1980s. + +###### To return to our initial question: EE is not immune to the replication crisis. We found that baseline replication provides a “weak constraint” on experimental results. This might explain why EE performed relatively better than experimental psychology in the recent replication survey mentioned above (Camerer et al., 2016). We therefore agree with Camerer et al. on the “*relatively good replication success*” of EE (compared to psychology). Yet we disagree on the interpretation of this result. According to many experimental economists, EE’s results are more robust because they are based on more reliable methods: paying subjects and transparency in editorial practices. We provided evidence suggesting that the better reproducibility in EE is not the effect of better methods but rather reflects a tendency to standardize experimental protocols. Such standardization does not necessarily imply that EE is more scientifically advanced than experimental psychology. In this regard, it might be interesting for further research to compare the state of standardization in EE and in experimental psychology. + +###### To read the working paper, [***click here***](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3116245). + +###### *Nicolas Vallois is an economist at the CRIISEA – Centre de recherche sur les institutions, l’industrie et les systèmes économiques d’Amiens, Université de Picardie Jules Verne. He can be contacted at nicolas.vallois@u-picardie.fr. Dorian Jullien is a postdoctoral fellow at CHOPE (Center for the History of Political Economy), Duke University and a research associate at the GREDEG (Groupe de Recherche en Droit, Economie et Gestion), Université Côte d’Azur. His email is dorian.jullien@gredeg.cnrs.fr.* + +### Share this: + +* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2018/02/16/vallois-jullien-is-experimental-economics-really-doing-better-the-case-of-public-goods-experiments/?share=twitter)
* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2018/02/16/vallois-jullien-is-experimental-economics-really-doing-better-the-case-of-public-goods-experiments/?share=facebook)

Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/van-bergeijk-what-a-difference-a-data-version-makes.md new file mode 100644 index 00000000000..58d4eda1353 --- /dev/null +++ b/content/replication-hub/blog/van-bergeijk-what-a-difference-a-data-version-makes.md @@ -0,0 +1,54 @@ +--- +title: "VAN BERGEIJK: What a Difference a Data Version Makes" +date: 2016-11-29 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Data versions" + - "Data vintage" + - "Gross Planet Product" + - "IMF World Economic Outlook" + - "measurement error" + - "Penn World Tables" + - "Peter van Bergeijk" + - "World Bank World Development Indicators" +draft: false +type: blog +--- + +###### *Data sources are regularly updated. Users typically assume that this means that new, more recent data are added and that errors are corrected. Newer data are better. But are they? And what are the implications for replication? This guest blog points out challenges and potential benefits of the existence of different data versions.* + +###### Often unnoticed, economic history is constantly being rewritten. This results in different vintages or versions of data. By way of illustration, Figure 1 reports the real rate of growth of GPP (Gross Planet Product; see van Bergeijk 2013) for the year 2003. The 11 data versions were reported in 2006-2016 alongside the IMF flagship publication *World Economic Outlook* (the so-called October versions). The lowest number reported for 2003 was published in 2009 (3.61%); the highest value for the 2003 growth rate (4.29%) was published in 2016. The reported growth rate for the year 2003 thus varies by 0.68 percentage points across the different data versions. This is an economically relevant difference of 16 to 19%, depending on whether the highest or the lowest growth rate is used as the base. + +###### *[Figure 1: Reported real GPP growth for 2003 across the 2006-2016 data versions]* + +###### **Revising without and with transparency** + +###### Figure 2 illustrates that this variation in historical data is a regular phenomenon in the *IMF World Economic Outlook* database. Using the same 11 data versions as above, the figure reports the minimum and maximum (bar and grey area) and the median GPP growth rate (dotted line) for the years 1986-2005. Consider the different GPP growth rates reported for the year 1991 across the different vintages. Despite the fact that all the data vintages were published 15 to 25 years after the event, the reported GPP growth rates for 1991 differ by as much as 1.1 percentage points (or 50% of the median value). While this is the largest variation in the figure, several of the ‘revisions’ for other years are also substantial. + +###### The IMF’s opaqueness is perhaps exceptional. Other leading data sources, such as the World Bank’s *World Development Indicators* or the *Penn World Tables*, do report changes in methodology, estimates and underlying series transparently and in detail. The point is that these data, which are used on a daily basis by many analysts and researchers, are likely to change after an analysis has been done and published. + +###### *[Figure 2: Minimum, maximum and median reported GPP growth rates, 1986-2005, across the 2006-2016 data versions]* + +###### **Challenges for replication research** + +###### Obviously, the constant rewriting of historical data is a challenge for replication. For an exact replication it is important to know which version of the data was used. Although many authors report the data source, the version and the date accessed, others may report only the source and, possibly, the year of publication. In order to undertake an exact replication, replicating authors may therefore need to contact the authors of the original study to obtain the identical vintage. For replication designs that test whether the reported findings continue to hold over longer time spans (and thus include more recent data), a decomposition may be necessary to find out what part of the estimated effect is due to the new vintage and what part to the more recent data.
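###### As a concrete illustration of the arithmetic behind the Figure 1 example, the spread across data versions can be summarised in a few lines. The sketch below (Python) uses only the minimum and maximum 2003 growth rates quoted in this post; the other nine vintages are omitted, so it is illustrative rather than a reproduction of the underlying data work:

```python
# Illustrative only: reported real GPP growth for 2003 in two WEO data versions
# (the lowest and highest values mentioned in the post); other vintages omitted.
reported_2003 = {2009: 3.61, 2016: 4.29}

lowest, highest = min(reported_2003.values()), max(reported_2003.values())
spread_pp = highest - lowest                # spread in percentage points
rel_vs_high = spread_pp / highest * 100     # relative to the highest estimate
rel_vs_low = spread_pp / lowest * 100       # relative to the lowest estimate

print(f"Spread: {spread_pp:.2f} percentage points")                    # 0.68
print(f"Relative difference: {rel_vs_high:.0f}% to {rel_vs_low:.0f}%")  # 16% to 19%
```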
+ +###### **Potential benefits for replication research** + +###### Variations between the different vintages of a data set are not necessarily problematic. Variations provide insight into the measurement error in the data source. A better understanding of measurement error may be helpful for establishing why a replication fails or succeeds. Moreover, performing replications over many different vintages can support the robustness of the original study’s findings. If all the data versions arrive at the same conclusion, this strengthens confidence in the replication’s verdict on the original study (be it positive or negative). It is not the difference between data versions that matters, but the similarity of findings across them. As a result, different data versions can be turned into an important asset for replication research. + +###### **Reference** + +###### Bergeijk, P.A.G. van, *Earth Economics: An Introduction to Demand Management, Long-Run Growth and Global Economic Governance*, Edward Elgar: Cheltenham, 2013. + +###### *Peter A.G. van Bergeijk is professor of international economics and macroeconomics at the Institute of Social Studies, Erasmus University.* + +### Share this: + +* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2016/11/29/van-bergeijk-what-a-difference-a-data-version-makes/?share=twitter)
* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2016/11/29/van-bergeijk-what-a-difference-a-data-version-makes/?share=facebook)

Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/vasishth-the-statistical-significance-filter-leads-to-overoptimistic-expectations-of-replicability.md new file mode 100644 index 00000000000..6f14fdddb76 --- /dev/null +++ b/content/replication-hub/blog/vasishth-the-statistical-significance-filter-leads-to-overoptimistic-expectations-of-replicability.md @@ -0,0 +1,88 @@ +--- +title: "VASISHTH: The Statistical Significance Filter Leads To Overoptimistic Expectations of Replicability" +date: 2018-09-11 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Andrew Gelman" + - "Replicability" + - "Reproducibility" + - "Statistical significance" + - "Type M error" + - "Underpowered" +draft: false +type: blog +--- + +###### *[This blog draws on the article “The statistical significance filter leads to overoptimistic expectations of replicability”, authored by Shravan Vasishth, Daniela Mertzen, Lena A. Jäger, and Andrew Gelman, published in the Journal of Memory and Language, 103, 151-175, 2018.
An open access version of the article is available **[here](https://osf.io/eyphj/)**.]* + +###### **The Problem** + +###### Statistics textbooks tell us that the sample mean is an unbiased estimate of the true mean. This is technically true. But the textbooks leave out a very important detail. + +###### When statistical power is low, any statistically significant effect that the researcher finds is guaranteed to be a mis-estimate: compared to the true value, the estimate will have a larger magnitude (a so-called Type M error), and it could even have the wrong sign (a so-called Type S error). This point can be illustrated through a quick simulation: + +###### Imagine that the true effect has the value 15, and the standard deviation is 100. Assuming that the data are generated from a normal distribution, an independent and identically distributed sample of size 20 will have power of about 10%. In this scenario, if you repeatedly sample from this distribution, in a few cases you will get a statistically significant effect. Each of those cases will have a sample mean that is very far away from the true value, and it might even have the wrong sign. The figure below reports results from 50 simulations, ordered from smallest to largest estimated mean. Now, in the same scenario, if you increase the sample size to 350, you will have power of about 80%. In this case, whenever you get a statistically significant effect, it will be close to the true value; you no longer have a Type M error problem. This is also shown in the figure below. + +###### *[Figure: estimates from 50 simulated studies at n = 20 and n = 350]* + +###### As the statistician Andrew Gelman put it, the maximum likelihood estimate can be “super-duper biased” (StanCon 2017, New York). + +###### Given that we publish papers based on whether an effect is statistically significant or not, in situations where power is low, every one of those estimates that we see in papers will be severely biased. Surprisingly, I have never seen this point discussed in a mathematical statistics textbook. When I did my MSc in Statistics at the University of Sheffield (2011 to 2015), this important detail was never mentioned at any point. It should be the first thing one learns when studying null hypothesis significance testing. + +###### Because this point is not yet widely appreciated, I decided to spend two years attempting to replicate a well-known result from the top-ranking journal in my field. The scientific details are irrelevant here; the important point is that there are several significant effects in the paper, and we attempted to obtain similar estimates by rerunning the study seven times. The effects reported in the paper that we tried to replicate are quite plausible given theory, so there is no a priori reason to believe that these results might be false positives. + +###### You see the results of our replication attempts in the three figures below. The three blue bars show the results from the original data; the bars shown here represent 95% Bayesian credible intervals, and the midpoint is the mean of the posterior distribution. The original analyses were done using null hypothesis significance testing, which means that all three published results were either significant or marginally significant. Now compare the blue bars with the black ones; the black bars represent our replication attempts. What is striking here is that all our estimates from the replication attempts are shrunk towards zero. This strongly implies that the original claims were driven by a classic Type M error. The published results are “super-duper biased”, as Gelman would say. + +###### *[Figures: original estimates (blue bars) versus replication estimates (black bars)]*
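###### To see the statistical significance filter at work, the low-power scenario described above (true effect 15, standard deviation 100, n = 20 versus n = 350) can be simulated directly. The following is a minimal sketch in Python, not the code used in the paper:

```python
# Minimal sketch (not the authors' code): Type M error under low vs. adequate power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, sd, n_sims = 15, 100, 5000

def significant_estimates(n):
    """Sample means of the simulated studies that reach p < .05 (one-sample t-test)."""
    samples = rng.normal(true_effect, sd, size=(n_sims, n))
    _, p = stats.ttest_1samp(samples, 0.0, axis=1)
    return samples.mean(axis=1)[p < 0.05]

for n in (20, 350):
    est = significant_estimates(n)
    print(f"n = {n}: power ~ {len(est) / n_sims:.2f}, "
          f"mean significant estimate ~ {est.mean():.1f} (true effect = {true_effect})")

# Typical output: at n = 20 only about 10% of the simulated studies are significant,
# and their average estimate is several times the true effect (a Type M error);
# at n = 350 about 80% are significant and the average estimate sits close to 15.
```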
+ +###### An important take-home point here is that the original data had significant or nearly significant effects, but the estimates also had extremely high uncertainty. We see that from the wide 95% credible intervals in the blue bars. We should pay attention to the uncertainty of our estimates, and not just to whether the effect is significant or not. If the uncertainty is high, the significant effect probably represents a biased estimate, as explained above. + +###### **A Solution** + +###### What is a better way to proceed when analysing data? The null hypothesis significance testing framework is dead on arrival if you have insufficient power: you’re doomed to publish overestimates of the effects you are interested in. + +###### A much more sensible way is to focus on quantifying uncertainty about your estimate. The Bayesian framework provides a straightforward methodology for achieving this goal. + +###### 1) Run your experiment until you achieve satisfactorily low uncertainty in your estimate; in each particular domain, what counts as satisfactorily low can be established by specifying the range of quantitative predictions made by theory. ***[In our paper](https://osf.io/eyphj/)***, we discuss the details of how we do this in our particular domain of interest. We also explain how we can interpret these results in the context of theoretical predictions; we use a method that is sometimes called the region of practical equivalence approach. + +###### 2) Conduct direct replications to establish the robustness of your estimates. Andrew Gelman has called this the “secret weapon”. There is only one way to determine whether one has found a replicable effect: actual replications. + +###### 3) Preregistration and open access to the published data and code are critical to the process of doing good science. Surprisingly, these important ideas have not yet been widely adopted. People continue to hoard their data and code, often refusing to release them even on request. This is true at least for psychology, linguistics, some areas of medicine, and, remarkably, even statistics. + +###### When I say all this to my colleagues in my field, a common reaction is that they can’t afford to run high-powered studies. I have two responses to that. First, you need to use the right tool for the job. When power is low, prior knowledge needs to be brought into the analysis; in other words, you need Bayes. Second, for a researcher to say that they want to study subtle questions but can’t be bothered to collect enough data to get an accurate answer is analogous to a medical researcher saying that he wants to cure cancer but all he has is duct tape, so let’s just go with that. + +###### **Is There Any Point In Discussing These Issues?** + +###### None of the points discussed here are new. Statisticians and psychologists have been pointing out these problems since at least the 1970s. Psychologists like Meehl and Cohen energetically tried to educate the world about the problems associated with low statistical power. Despite their efforts, not much has changed. Many scientists react extremely negatively to criticisms of the status quo. In fact, just three days ago, at an international conference, I received word from a prominent US lab that I am seen as a “stats fetishist”. Instead of stopping to consider what they might be doing wrong, they dismiss any criticism as fetishism.
+ +###### One fundamental problem for science seems to be the problem of statistical ignorance. I don’t know anybody in my field who would knowingly make such mistakes. Most people who use statistics to carry out their scientific goals treat it as something of secondary importance. What is needed is an attitude shift: every scientist using statistical methods needs to spend a serious amount of effort into acquiring sufficient knowledge to use statistics as intended. Curricula in graduate programs need to be expanded to include courses taught by professional statisticians who know what they’re doing. + +###### Another fundamental problem here is that scientists are usually unwilling to accept that they ever get anything wrong. It is completely normal to find people who publish paper after paper over a 30 or 40-year career that seemingly validates every claim that they ever made in the past. When one believes that one is always right, how can one ever question the methods and the reasoning that one uses on a day-to-day basis? + +###### A good example is the recently reported behavior of the Max Planck director Tania Singer. ***[Science](http://www.sciencemag.org/news/2018/08/she-s-world-s-top-empathy-researcher-colleagues-say-she-bullied-and-intimidated-them)*** reports: “Scientific discussions could also get overheated, lab members say. “It was very difficult to tell her if the data did not support her hypothesis,” says one researcher who worked with Singer.” + +###### Until senior scientists such as Singer start modelling good behaviour by openly accepting that they can be wrong, and until the time comes that people start taking statistical education seriously, my expectation is that nothing will change, and business will go on as usual: + +###### 1) Run a low-powered study + +###### 2) P-hack the data until statistical significance is reached and the desired outcome found + +###### 3) Publish result + +###### 4) Repeat + +###### It has been a successful formula. It has given many people tenure and prestigious awards, so there is very little motivation for changing anything. + +###### *Shravan Vasishth is Professor of Linguistics, University of Potsdam, Germany. His biographical and contact information can be found **[here](http://www.ling.uni-potsdam.de/~vasishth/)***. + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2018/09/11/vasishth-the-statistical-significance-filter-leads-to-overoptimistic-expectations-of-replicability/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2018/09/11/vasishth-the-statistical-significance-filter-leads-to-overoptimistic-expectations-of-replicability/?share=facebook) + +Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/vlaeminck-podkrajac-do-economics-journals-enforce-their-data-policies.md b/content/replication-hub/blog/vlaeminck-podkrajac-do-economics-journals-enforce-their-data-policies.md new file mode 100644 index 00000000000..867f53db3a2 --- /dev/null +++ b/content/replication-hub/blog/vlaeminck-podkrajac-do-economics-journals-enforce-their-data-policies.md @@ -0,0 +1,50 @@ +--- +title: "VLAEMINCK & PODKRAJAC: Do Economics Journals Enforce Their Data Policies?" 
+date: 2018-10-20 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "data availability policies" + - "data repositories" + - "economics" + - "Journal policies" + - "Sven Vlaeminck" + - "ZBW Journal Data Archive" +draft: false +type: blog +--- + +###### In the past, the findings of numerous replication studies in economics have raised serious concerns regarding the credibility and reliability of published applied economic research. The literature suggests several explanations for these findings: beyond the lack of incentives and rewards for the disclosure of data and code, only a small proportion of journals in the economic sciences have implemented functional data availability policies. Also, authors frequently do not comply with the demands of these policies, or they provide insufficient data and code. Our paper (“***[Journals in Economic Sciences: Paying Lip Service to Reproducible Research?](http://hdl.handle.net/10419/177391)***”) examines an additional aspect and asks to what degree editorial offices enforce the data policies of their journals. + +###### To date, only a minority of journals in economics possesses a policy on the disclosure of the data and code used to produce the results of an empirical research article. But the number is growing. Drivers of this shift may be the ongoing debates on replicable research as well as the growing demands of research funders and science policy. + +###### In our paper we ask how strictly journals with a data policy have enforced that policy in the past. To answer this question, we started our analyses with an evaluation of 599 articles published by 37 journals with a data availability policy. All articles were published in 2013 and 2014. + +###### First, our analysis determined the share of articles that fall under a data policy, because replication data are needed to verify the results of these articles. In total, we classified more than 75% of these articles as empirical – or, as we defined it, as ‘data-based’. + +###### Afterwards, we checked the journal data archives (if available) and the supplemental information section of each data-based article for the availability of replication files. We distinguished between articles using restricted data and those using non-restricted data. On average, only slightly more than a third of the data-based articles had accompanying data and code available. + +###### Subsequently, we compared in detail the demands of the journals’ data policies with the replication files available for a subsample of 245 articles published by 17 journals. This allowed us to determine, for each of the journals investigated, how strictly it enforces its data policy. + +###### For the years 2013 and 2014, our findings suggest a mixed picture: while one group of journals achieved high or very high compliance rates, a significant share of journals provided replication files only sporadically. + +###### Due to the limited sample size and our focus on two years of publication, our analysis only provides a snapshot of journals’ practices at that time. Therefore, our findings should not be seen as a general overview of journals’ willingness to enforce their data policies. Also, our findings make no statement regarding the replicability of the research findings. We only checked for the availability of the prerequisites for potential replication attempts. 
+ +###### Based on the outcome of our analyses, we recommend that editorial offices pay more attention to whether their journal’s data policy has been fulfilled by authors. Journals should be stricter in enforcing their data policies, because replicability of published research is a cornerstone of the scientific method. + +###### Editors are accountable in the first place for enforcing journals’ data policies, but reviewers, too, should feel responsible for a periodical’s data policy. Both editors and reviewers invest time in ensuring that authors comply with a journal’s style sheet. Investing similar effort in ensuring that replication files are available, as required by the journal’s data policy, would further strengthen a periodical’s scientific reputation. + +###### First and foremost, journals play a crucial role in scientific quality assurance. They are thereby also important for promoting a culture of research integrity, because published articles are the most visible output of research. + +###### In the spirit of replicable research, you can find the data, code and supplemental information of our analysis in the ***[ZBW Journal Data Archive](http://journaldata.zbw.eu/dataset/journals-in-economic-sciences-paying-lip-service-to-reproducible-research-replication-data)***. + +###### *Sven Vlaeminck is a research assistant at the ZBW – Leibniz Information Centre for Economics. He is also the product manager of the ZBW Journal Data Archive. His biographical and contact information can be found **[here](https://www.zbw.eu/en/about-us/key-activities/research-data-management/sven-vlaeminck/)**.* + +### Share this: + +* [Click to share on X (Opens in new window)
  X](https://replicationnetwork.com/2018/10/20/vlaeminck-podkrajac-do-economics-journals-enforce-their-data-policies/?share=twitter)
* [Click to share on Facebook (Opens in new window)
  Facebook](https://replicationnetwork.com/2018/10/20/vlaeminck-podkrajac-do-economics-journals-enforce-their-data-policies/?share=facebook)

Like Loading... \ No newline at end of file diff --git a/content/replication-hub/blog/vlaeminck-replication-requires-data-depositories-introducing-edawax.md new file mode 100644 index 00000000000..67907415574 --- /dev/null +++ b/content/replication-hub/blog/vlaeminck-replication-requires-data-depositories-introducing-edawax.md @@ -0,0 +1,57 @@ +--- +title: "VLAEMINCK: Replication Requires Data Depositories – Introducing EDaWaX" +date: 2016-11-04 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Dataverse" + - "EDAWAX" + - "Sven Vlaeminck" + - "ZBW" + - "ZBW Journal Data Archive" +draft: false +type: blog +--- + +###### EDaWaX stands for *European Data Watch Extended*. It recently introduced a new service, the “***[ZBW Journal Data Archive](http://journaldata.zbw.eu/)***”, to assist journals in storing and managing replication files for published economic research. This new service of the German National Library of Economics (ZBW) is free of charge to academic journals. Here is EDaWaX’s story. + +###### **AN INFRASTRUCTURAL NEED ARTICULATED BY THE ECONOMIC COMMUNITY…** + +###### The project *European Data Watch Extended* (***[EDaWaX](http://www.edawax.de/)***) started in fall 2011, after editors of a scholarly journal in economics discussed with ZBW the idea of jointly developing a suitable infrastructure for the replication files of economics journals. 
The partners received funding from the German Research Foundation (DFG). + +###### For several decades, researchers such as the economist B.D. McCullough have highlighted the poor record of replicability in applied economic research. Out of this came a call for journals to implement mandatory data availability policies with corresponding data archives. However, a suitable infrastructure did not exist.  A notable exception was ***[Dataverse](http://www.thedata.org)*** – developed by ***[IQSS](http://www.iq.harvard.edu/)*** at Harvard University.  Indeed, Dataverse, has evolved as a data depository for both individuals and journals (such as ***[Economics E-Journal](https://dataverse.harvard.edu/dataverse/economics)*** and ***[Review of Economics and Statistics](https://dataverse.harvard.edu/dataverse/restat)***). + +###### **…AND A NEW FIELD FOR RESEARCH LIBRARIES** + +###### For ZBW, EDaWaX was one of its first projects in the field of research data management. Together with partners from the ***[German Data Forum](http://www.ratswd.de/en/)***, the ***[Max Planck Institute for Innovation and Competition](http://www.ip.mpg.de/en.html)*** and – at a later stage – the Research Data Centre of the Socio-Economic Panel (***[FDZ-SOEP](http://www.diw.de/en/soep)***), EDaWaX started by evaluating the existing state of affairs. This analysis included ***[examinations of journal data policies](http://dx.doi.org/10.3233/978-1-61499-562-3-145)***, ***[data sharing behaviour of economists](http://dx.doi.org/10.1016/j.respol.2014.04.008)*** and surveys on perceptions of editorial offices towards data availability policies. EDaWaX then checked if ***[services for journals](http://hdl.handle.net/10419/88148)*** to manage replication data were available among scientific infrastructure service providers in Germany and elsewhere. + +###### From these beginnings, the project created a web-based application and developed a metadata scheme. The application uses ***[CKAN](http://www.ckan.org)*** – an open source software broadly used by open data portals of public sector entities around the globe. At the end of EDaWaX’s first funding term, a pilot application was available and, in late autumn 2013, the project ***[presented](http://www.edawax.de/2013/11/edawax-first-funding-period-terminates-with-evaluation-workshop/)*** its solution to a gathering of journal editors in the social sciences. + +###### **FROM PILOT APPLICATION TOWARDS A SERVICE** + +###### In the second funding period, between 2014 and 2016, the project worked on enhancing the application and making it fit for service. A key focus was developing the capability of the metadata component. Dealing with metadata is a balancing act: On the one hand, one would like to have as much metadata for the replication files as possible, because well described data and files are much more likely to be reused. In addition, the metadata can be used to discover the replication files in disciplinary portals, search engines and so on. But on the other hand, researchers are not willing to invest much of their time in creating metadata. Therefore, the project spent much effort to lower the burden of creating metadata.  The key was automatisation whenever possible.  Another feature was making the metadata field adaptable for different resources (e.g. datasets need more additional context information, while program code or documentation need less). 
+ +###### In addition, EDaWaX also implemented the possibility to mint ***[DOIs](https://www.doi.org/doi_handbook/1_Introduction.html#1.6.1)*** (Digital Object Identifiers) for replication files. The main idea behind using a ***[Persistent Identifier](http://www.persistent-identifier.de/english/204-examples.php)*** is to credit researchers for their investment of time and for sharing their data: The data can be cited appropriately (just like a traditional publication) and therefore authors gain an incentive to share and to document their data. + +###### Finally, an important task was to present the web service to the community. The project held several workshops at annual meetings of German learned societies in economics and management, but also presented its work at the ***[2016 annual meeting](https://www.aeaweb.org/conference/2016/preliminary.php?search_string=replication&search_type=session&association=&jel_class=&search=Search)*** of the American Economic Association (ASSA). In total the project gave more than 27 talks and workshops at national and international conferences. + +###### **WORKING WITH THE ZBW JOURNAL DATA ARCHIVE** + +###### Working with the web service is easy and time-saving for editorial offices: In a nutshell, editorial offices have to register their authors to the ZBW Journal Data Archive. The authors subsequently receive an email and create their personal accounts. Afterwards, authors may deposit the replication files of their articles at the journal data archive. In addition, they are asked to describe their files with metadata. For assistance, ***[manuals](http://www.edawax.de/2542-1/)*** in English and German are available. + +###### When an author has completed the deposit of his/her replication files, the editorial office receives a notification by the web service. In a next step, the editorial office approves the replication files and checks the metadata for consistency. The editors supplement the URL or DOI of the published article, the page numbers, issue and volume of the journal. Subsequently, the replication files are ready to be published and to receive a DOI. + +###### Currently, the ***[Journal of Economics and Statistics](http://www.jbnst.de/en/)*** (listed in the ‘Social Sciences Citation Index’) is utilising the ZBW Journal Data Archive. Other journals will follow in the next months. + +###### Journals and editorial offices interested in employing the ZBW Journal Data Archive are warmly invited to contact the author to learn more about the software and how it can be of service to your journal. + +###### *Sven Vlaeminck is Project Manager for Research Data at ZBW – German National Library of Economics and the Leibniz Information Centre for Economics. He can be contacted at s.vlaeminck@zbw.eu.* + +### Share this: + +* [Click to share on X (Opens in new window) + X](https://replicationnetwork.com/2016/11/04/vlaeminck-replication-requires-data-depositories-introducing-edawax/?share=twitter) +* [Click to share on Facebook (Opens in new window) + Facebook](https://replicationnetwork.com/2016/11/04/vlaeminck-replication-requires-data-depositories-introducing-edawax/?share=facebook) + +Like Loading... 
\ No newline at end of file diff --git a/content/replication-hub/blog/von-hippel-when-does-science-self-correct-lessons-from-a-replication-crisis-in-early-20th-century-ch.md b/content/replication-hub/blog/von-hippel-when-does-science-self-correct-lessons-from-a-replication-crisis-in-early-20th-century-ch.md new file mode 100644 index 00000000000..3e720c4df1a --- /dev/null +++ b/content/replication-hub/blog/von-hippel-when-does-science-self-correct-lessons-from-a-replication-crisis-in-early-20th-century-ch.md @@ -0,0 +1,132 @@ +--- +title: "VON HIPPEL: When Does Science Self-Correct? Lessons from a Replication Crisis in Early 20th Century Chemistry" +date: 2023-11-15 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Fertilizer" + - "replication" + - "Self-correcting Science" + - "The Good Science Project" +draft: false +type: blog +--- + +*NOTE: This blog is a repost of one originally published at The Good Science Project ([**click here**](https://goodscience.substack.com/p/when-does-science-self-correct-lessons?utm_campaign=email-half-post&r=28e60i&utm_source=substack&utm_medium=email))* + +Science is self-correcting—or so we are told. But in truth it can be very hard to expunge errors from the scientific record. In 2015, a [***massive effort***](https://www.science.org/doi/full/10.1126/science.aac4716) showed that 60 percent of the findings published in top psychology journals could not be replicated. This was distressing news, but it led to several healthy reforms in experimental psychology, where a growing number of journals now insist that investigators state their hypotheses in advance, ensure that their sample size is adequate, and publish their data and code. + +You might also imagine that the credibility of the non-replicable findings took a hit. But it didn’t—not really. In 2022—seven years after the poor replicability of many findings was revealed, the findings that had failed to replicate were still getting cited just as much as the findings that had replicated successfully. And when they were cited, the fact that they had failed to replicate was rarely mentioned. + +[![](/replication-network-blog/image-1.webp)](https://replicationnetwork.com/wp-content/uploads/2023/11/image-1.webp) + +This pattern has been [***demonstrated several times in psychology and economics***](https://replicationnetwork.com/2023/04/05/reed-is-science-self-correcting-evidence-from-5-recent-papers-on-the-effect-of-replications-on-citations/). Once a finding gains hinfluence, it continues to have influence even after it fails to replicate. The original finding keeps getting cited, and the replication study is all but ignored. The finding that replications don’t correct the record is unfortunately a finding that replicates pretty well. + +Social scientists facing systemic problems in their home fields sometimes look longingly at their neighbors in the natural sciences. Natural scientists make mistakes too, and have even been known to falsify results, but if the mistakes and falsifications are consequential, they are often exposed and corrected in fairly short order. + +One of the best-known examples is Fleischmann and Pons’ claim to have achieved “cold fusion” in 1989. Although the cold fusion episode is often viewed as an embarrassment in physics, it is actually [***a great example***](https://www.goodreads.com/book/show/1528050.Bad_Science) of a community correcting the scientific record quickly. 
Within 6 months of Fleischmann and Pons’ announcement, suspicious inconsistencies had been pointed out in their findings, and several other labs had reported failures to replicate. Finally, the Department of Energy announced that it would not support further work on cold fusion, and except for a handful of true believers the field turned its attention elsewhere. + +[![](/replication-network-blog/image-2.webp)](https://replicationnetwork.com/wp-content/uploads/2023/11/image-2.webp) + +I recently came across a similar episode from more than 100 years ago. The characters are of course different, but the first chapters are very much like those of the cold fusion episode. The later chapters, though, go in an unexpected direction, and show a Nobel laureate spending years pursuing nonreplicable results. + +I’ll draw some lessons after I tell the story, which relies heavily on several books and articles, especially Dan Charles’ 2005 book *Master Mind* and Thomas Hager’s 2008 book *The Alchemy of Air*. + +The Alchemy of Air +================== + +In an 1898 [***presidential address***](https://www.jstor.org/stable/1627238) to the British Association for the Advancement of Science, the physicist William Crookes announced that humanity was approaching the brink of starvation. By 1930, he estimated, the world’s population would exceed the food supply that could be extracted from all the world’s arable land. Famine would inevitably follow. This being turn-of-the-century England, Crookes also suggested that famine would threaten the dominance of “the great Caucasian race,” since for some reason he believed that the supply of wheat on which “civilized mankind” depended would be more impacted that the supply of rice, corn, or millet that were more important to “other races, vastly superior to us in numbers, but differing widely in material and intellectual progress.” + +Chemistry, Crookes argued, suggested a way to avoid that grim fate. Famine was only inevitable if the soil continued to produce crops at the rates that were typical at the end of the 19th century. But the soil could produce many more crops, and produce them more sustainably if it was enriched with massive amounts of fixed nitrogen. Natural nitrogen fertilizers, such as guano or saltpeter, were dwindling and might be gone by the 1930s. But 78 percent of air was nitrogen, even if it was not in a form that crops could use. + +Crookes charged chemists with figuring out a way to “fix” the nitrogen in air—to react the N2 in the atmosphere with hydrogen from water vapor and produce the ammonia NH3 which was already known to be the key component of natural fertilizers. Solving the problem of making synthetic fertilizer would be a scientific triumph and a boon to humanity. The chemist who solved it would write their name in the history books. They would have shown how to make “bread from air.” + +It did not escape notice that whatever chemist succeeded in making bread from air would become fabulously wealthy. And it did not escape notice that, in addition to making fertilizer, ammonia could be used to make explosives. The military applications of synthetic ammonia seemed especially important in Germany, which correctly anticipated that in the event of a war with Great Britain, the British navy could impose a blockade that would cut off Germany’s imports of both food and fertilizer. + +Although the basic chemical reaction seemed straightforward, making it actually happen was not easy. 
It would require pressures and temperatures that had never been achieved in a laboratory. And it would require just the right catalyst. + +In 1900, just two years after Crookes’ address, the German chemist Wilhelm Ostwald announced that he had synthesized ammonia. Ostwald was just the kind of person that other chemists imagined could make bread from air. He was nearly 50, well established, and widely regarded as one of the “fathers” of physical chemistry. He was already considered a strong candidate for a new prize funded by the estate of the recently deceased chemist Alfred Nobel. Synthesizing ammonia would make Ostwald a shoo-in. He applied for a patent and offered to sell his process for a million marks to the German company BASF. + +There was only one problem. BASF couldn’t replicate Ostwald’s results. They put a junior chemist named Carl Bosch on the problem, but when he tried Ostwald’s process Bosch couldn’t produce a meaningful amount of ammonia. At best, the process would produce a couple of drops of ammonia and then stop. Which made no sense because the supply of nitrogen in the air was practically unlimited. + +Bosch’s managers sent him back to the bench, and eventually he figured out what was going wrong. Ostwald had used iron as a catalyst, and the iron that Ostwald used, like the iron that Bosch used, was sometimes contaminated with a small amount of iron nitride. The nitrogen in the ammonia was coming from the iron nitride—not from the nitrogen in air. And since there was very little iron nitride, the process would never produce a meaningful amount of ammonia. + +BASF wrote to Ostwald that they could not license his process after all. Ostwald withdrew his patent application, but he wasn’t exactly a gentleman about it. “When you entrust a task to a newly hired, inexperienced, know-nothing chemist,” Ostwald wrote, disparaging Bosch in a note to BASF managers, “then naturally nothing will come of it.” + +But Bosch was right. No one at BASF could make Ostwald’s process work, and when they brought Ostwald in he couldn’t make it work either. He withdrew his patent application and the race to make bread from air continued. + +Nine years, the problem of synthesizing ammonia was actually solved, but it was solved by a less obvious person. The person to show how to make bread from air was Fritz Haber, a 40-year-old chemist who worked at a respectable but not terribly prestigious university. Haber was Jewish and on nobody’s short list for the Nobel Prize. + +Bosch was able to replicate Haber’s experiment, and spent the next few years scaling the process up. Bosch led a team of chemists and engineers who built large reaction chambers that could tolerate the required temperatures and pressures. + +Meanwhile Ostwald sued BASF in collaboration with a rival company called Hoechst. Their claim was not that Haber’s process was invalid, but that it was not novel and its patents were invalid. If Ostwald and Hoechst had won, they would have been able to get into the ammonia business, too. But BASF won the suit, in part by bribing one of the key witnesses, an erstwhile rival of Haber’s named Walther Nernst. In 1913 BASF started running the world’s first ammonia synthesis factory, which produced a promising amount of ammonia in 1913 and 1914. + +But in 1914 Germany entered World War One, Britain blockaded German ports, and all BASF’s ammonia was diverted from fertilizer to explosives, so that Germany could sustain its war effort even as its population starved. 
The repurposing of his life-giving work to deadly purposes upset Bosch so much that he went on a drinking binge. + +Nevertheless, after the War, the spread of the Haber-Bosch process and related development led to massive increases in food production, which in the developing world are called the “Green Revolution.” It has been estimated that more than half the people now on earth simply would not be here if it weren’t for synthetic nitrogen fertilizers. Many of us owe our lives to them. + +Why Did Chemistry Correct Chemists’ Errors? +=========================================== + +The story so far is very much to the credit of turn-of-the-century chemistry. A senior chemist overreached and made a mistake, as everyone eventually does, but a junior chemist corrected him, and the path was cleared for one of chemistry’s greatest contributions to human thriving (and human misery). + +But why did science self-correct so efficiently in this case? None of the characters in this story was a saint. Crookes was a racist (as were many English gentlemen at the time), and Ostwald could be a self-interested jerk. Haber and Bosch came off well in this story, but their later actions showed their dark side. Haber led Germany’s poison-gas program during and after World War One. Bosch eventually led IG Farben, the German chemical conglomerate that produced the Zyklon B gas used to kill Jews (including some of Haber’s relatives) in Nazi concentration camps. (Bosch’s personal feelings were anti-Nazi and anti-war, but that didn’t matter because he never took a stand.) + +So if early twentieth-century chemists weren’t better human beings than modern psychologists and economists, why did they do a better job at rooting out nonreplicable results? + +One reason, I think , was that chemistry had important practical uses. For modern social scientists, eminence is largely a social construction. It is measured by grants, publications, prizes, TED talks, appointments at prestigious universities. That was also true in early twentieth-century chemistry, but if it was done correctly, a breakthrough in chemistry would reach beyond one’s fellow chemists. It could lead to real, practical triumphs, like making bread from air. + +And that’s why mistakes had to be corrected. BASF fully recognized that Ostwald would be annoyed by criticism of his work. But they couldn’t tiptoe around it, because they were trying to make ammonia from water and air. If Ostwald’s work couldn’t help them do that, then they couldn’t get into the fertilizer and explosives business. They couldn’t make bread from air. And they couldn’t pay Ostwald royalties. If the work wasn’t right, it was useless to everyone, including Ostwald. + +That’s also the reason why the cold-fusion findings got corrected so quickly. If Fleischmann and Pons’ results were right, they could have led to an important new energy source. While it might have had unforeseen hazards, cold fusion would have made all other energy sources obsolete, solved global warming, and ended the need to buy oil from dangerous authoritarian countries. That’s why Fleischmann and Pons called a press conference, made the cover of the New York *Times*—and it’s why other scientists pounced on their result. + +My friends in the natural sciences tell me that their fields do have some non-replicable results, but they’re typically in backwaters where only one lab has the necessary equipment and other labs aren’t particularly interested. 
If a finding looks to have practical applications, then nonreplicable results get exposed much more quickly. + +Usually. But not always. + +Fritz Haber’s Quest for Fool’s Gold +=================================== + +Two years after World War One, Haber won the Nobel Prize (Bosch would receive his own in 1931), and the Entente powers decided not to try Haber as a war criminal for his work on poison gas. + +Now Fritz Haber began to look for other ways to serve his country. Germany’s new shortage was money. Under the Treaty of Versailles, Germany was charged with paying 132 billion gold marks in reparations—money it simply did not have. The burden brought hyperinflation and made it impossible for Germany to develop its economy and stabilize its experiment in democracy. + +Haber started working on a potential solution: a chemical process that would extract gold from the sea. Ocean water contained trace quantities of gold, and Haber thought he could come up with an economical way to precipitate it out. It was an audacious plan, and if anyone else had suggested it, they might have been laughed at. But other scientists willingly joined Haber’s effort. After all, Haber was the scientist who had made bread from air. + +Amazingly, Haber’s lab did come up with several ways to draw gold from sea water—and one of those methods would have been cost-effective if, as an 1878 article suggested, there were 65 micrograms of gold per liter of sea water. On the basis of that estimate, Haber’s financial backers built him a lab aboard an ocean liner, but once Haber got out to sea, in 1923, he found that the concentration of gold was more than 10,000 times lower than he expected. ([***Modern estimates***](https://www.sciencedirect.com/science/article/pii/0883292788901035?ref=cra_js_challenge&fr=RR-1) suggest that the concentration is lower still.) The result that Haber had been relying on was non-replicable. + +There are two lessons here. First, in early 20th-century chemistry, as in 21st-century psychology and economics, you could not trust everything you read in the scientific literature. Haber had believed published estimates of the gold in sea water, and he had paid dearly. Instead of spending years developing chemical processes that would only work if the published estimates were accurate, he should have started by confirming those estimates. Careful scientists take nothing for granted; they check everything themselves. + +Second, no scientist is so eminent that he cannot make mistakes. Ostwald was the most eminent chemist in Germany when he let contamination fool him into thinking he had synthesized ammonia. Haber had become the most eminent chemist in Germany when he failed to check basic measurements of the gold in sea water. + +Not only does eminence not guarantee accuracy, it can actually breed the hubris that leads to error. There are many examples of scientists who once did amazing work becoming sloppy, jumping to conclusions, or having so many people work for them that they can’t keep track of what’s going on in their own labs. + +What Will It Take for the Social Sciences to Self-Correct? +========================================================== + +What would it take for fields like economics and social psychology to self-correct as chemistry and physics (sometimes) do? + +A crucial feature that leads a field to become self-correcting is the potential for important technological applications. As long as scientists are only writing articles, allegations of error may be difficult to resolve.
But once a field has advanced to the point where it can claim to make bread from air, or draw gold from the sea, it more quickly becomes clear if the claim can hold water. + +Now, work in psychology and economics sometimes aspires to offer practical benefits as well. Economists make recommendations regarding fiscal, labor, and educational policy, and the economists on the Federal Reserve Board control certain interest rates. Psychologists, for their part, offer advice and test interventions designed to change attitudes, improve mental health and academic achievement, and effect social change. + +Of course, these economic and psychological interventions often fail to achieve their stated aims. But it’s always possible to explain those failures after the fact, often citing unpredictable contextual factors that worked against the intervention’s success. + +And that highlights the second reason why the natural sciences are more self-correcting. Experiments in the natural sciences are closely controlled. A chemist cannot say that the success of a reaction depends on dozens of unspecified contextual factors, many of which cannot be foreseen. Instead, a chemist is expected to specify the exact temperature, pressure, and catalyst required to make a reaction occur. If another chemist cannot replicate the reaction at the same temperature and pressure, with the same catalyst, then somebody has made a mistake. It might be the original researcher; it might be the researcher conducting the replication. But there is a mistake somewhere, and somebody ought to figure it out. In a mature science, things don’t just happen for no particular reason. + +In the social sciences, by contrast, scholars can be remarkably accepting of contradictory findings. In education research, for example, there are findings suggesting that achievement gaps between rich and poor children have been growing for 50 years—and other findings suggesting that they have barely changed. There are findings suggesting that achievement gaps grow faster during summer than during school—and other findings suggesting they don’t. Even results of randomized controlled trials often fail to replicate. This bothers me, but it doesn’t bother everyone, and many scholars either limit their attention to the results currently in front of them, or pick and choose from contradictory results according to which one best fits their general world view or policy preferences. + +Of course, this state of affairs doesn’t provide a very firm foundation for real scientific progress. Both psychology and economics are starting to adopt practices which, when followed, [***produce findings that can more often be replicated***](https://www.nature.com/articles/s41562-023-01749-9) across different labs and research teams. But a large number of questionable findings are still in the literature, and many of them are still getting cited at high rates. Many fields will continue to spin their wheels until they can tighten their focus to a small subset of replicable results, observed under controlled conditions, with major, practical applications in the real world. + +**REFERENCES** + +von Hippel, P. T. (2022). Is psychological science self-correcting? Citations before and after successful and failed replications. *Perspectives on Psychological Science*, *17*(6), 1556-1565. 
\ No newline at end of file diff --git a/content/replication-hub/blog/weichenrieder-finanzarchiv-public-finance-analysis-wants-your-insignificant-results.md b/content/replication-hub/blog/weichenrieder-finanzarchiv-public-finance-analysis-wants-your-insignificant-results.md new file mode 100644 index 00000000000..4100f4d3289 --- /dev/null +++ b/content/replication-hub/blog/weichenrieder-finanzarchiv-public-finance-analysis-wants-your-insignificant-results.md @@ -0,0 +1,36 @@ +--- +title: "WEICHENRIEDER: FinanzArchiv/Public Finance Analysis Wants Your Insignificant Results!" +date: 2017-08-30 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "FinanzArchiv / Public Finance Analysis" + - "Insignificant results" + - "publication bias" + - "special issue" +draft: false +type: blog +--- + +###### There is considerable concern among scholars that empirical papers face a drastically smaller chance of being published if results that were expected to confirm an established theory turn out to be statistically insignificant. Such publication bias can give a distorted picture of economic magnitudes and mechanisms. + +###### Against this background, the journal *FinanzArchiv/Public Finance Analysis* recently ***[posted a call](http://www.ingentaconnect.com/contentone/mohr/fa/2017/00000073/00000002/art00001)*** for papers for a special issue on “Insignificant Results in Public Finance”. The editors are inviting the submission of carefully executed empirical papers that – despite using state-of-the-art empirical methods – fail to find significant estimates for important economic effects that have widespread acceptance. + +###### It has been estimated that studies in behavioral economics and psychology are ten times more likely to be published if they present statistically significant effects. Because significant results can arise by chance, too much weight is attributed to them in the scientific literature. The associated publication bias can produce overestimates of the effectiveness of economic policy measures, psychological interventions, and even medical treatments. + +###### While several ways to address this issue exist, the correction most directly related to the problem concerns the attitude of journal editors. Publication bias and the negative incentives for researchers are tackled at the root when studies are assessed on the basis of the methodology used and the quality of the data — and not on the results obtained. This requires journals to commit themselves to publishing more so-called “non-significant” results. The editors of *FinanzArchiv/Public Finance Analysis* recently made such a commitment, and it is reflected in the call for papers. + +###### The deadline for submissions to the special issue is 15 September 2017. Papers can be uploaded [***here***](http://www.mohr.de/fa). Submitting authors should indicate that their paper is being submitted to the special issue “Insignificant Results in Public Finance”.
The editors would like to note that if any insignificant results become statistically significant as an outcome of the refereeing process, this will not be held as an argument against publication. In that case, the paper may simply be shifted to a different issue of the journal. + +###### *FinanzArchiv* was first published in 1884, which makes it one of the world’s oldest professional journals in economics and the oldest journal of public finance. The current editors are Katherine Cuff, Ronnie Schöb and Alfons Weichenrieder. Within public economics, a strong focus is on topics such as taxation, public debt, public goods, public choice, federalism, market failure, social policy, and the welfare state. + +###### *Alfons Weichenrieder is Professor of Economics and Public Finance at Goethe-University Frankfurt and a guest research professor at the Institute of International Taxation of Vienna University of Economics and Business. He can be contacted via email at* [*a.weichenrieder@em.uni-frankfurt.de*](mailto:a.weichenrieder@em.uni-frankfurt.de)*.* \ No newline at end of file diff --git a/content/replication-hub/blog/wesselbaum-jcre-an-outlet-for-your-replications.md b/content/replication-hub/blog/wesselbaum-jcre-an-outlet-for-your-replications.md new file mode 100644 index 00000000000..8278edfcc88 --- /dev/null +++ b/content/replication-hub/blog/wesselbaum-jcre-an-outlet-for-your-replications.md @@ -0,0 +1,46 @@ +--- +title: "WESSELBAUM: JCRE – An Outlet for Your Replications" +date: 2024-07-11 +author: "The Replication Network" +tags: + - "GUEST BLOGS" + - "Comments" + - "economics" + - "Journal" + - "Journal policies" + - "Publication" + - "replications" +draft: false +type: blog +--- + +Replication studies play a crucial role in economics by ensuring the reliability, validity, and robustness of research findings. In an era in which policy decisions and societal interventions rely heavily on economic research, the ability to replicate and validate research findings is important for making informed decisions and advancing knowledge. Replications in economics became more mainstream after [***the influential 2016 paper by Colin Camerer, Anna Dreber, and others published in Science***](https://www.science.org/doi/full/10.1126/science.aaf0918). This paper replicated 18 studies from the AER and QJE published between 2011 and 2014. In only 61% of these (11 out of 18) did the replication find a significant effect in the same direction as in the original study. + +Replication studies serve as a cornerstone of scientific integrity and transparency. Economics, like other empirical sciences, relies on data-driven analysis to draw conclusions about complex economic phenomena. However, the validity of these conclusions can often be challenged due to various factors such as data limitations, methodological choices, or even unintentional errors. By replicating papers, researchers can verify whether the original findings hold under different conditions, datasets, or methodologies. This process not only enhances the credibility of the research but also fosters trust.
+ +Replications contribute to the cumulative nature of knowledge. Many economic theories and empirical results build upon earlier findings. However, the reliability of these findings can sometimes be uncertain, especially when they are based on limited samples or specific contexts. Replication studies allow researchers to confirm whether the findings are generalizable across different populations, time periods, or geographical regions. This cumulative process helps in refining theories, identifying boundary conditions, and uncovering inconsistencies that may require further investigation. + +Additionally, replication studies encourage openness and collaboration within the economics community. They promote constructive dialogue among researchers about the strengths and limitations of different approaches and methodologies. By openly discussing replication efforts and their outcomes, we can collectively enhance research practices, improve data transparency, and foster a culture of quality assurance in economic research. + +While some journals (for example, the *Journal of Applied Econometrics*, JPE: Micro, JESA, or *Empirical Economics*) publish replication studies, the new journal “*[**Journal of Comments and Replications in Economics**](https://www.jcr-econ.org/)*” (*JCRE*) aims to be the premier outlet for articles that comment on or replicate previously published articles. + +Because many journals are reluctant to publish comments and replications, *JCRE* was founded to provide an outlet for research that explores whether published results are correct, robust, and/or generalizable, and to publish replications, defined as any study that directly addresses the reliability of a specific claim from a previously published study. + +The editorial board of *JCRE* consists of W. Robert Reed (University of Canterbury), Marianne Saam (University of Hamburg), and myself (University of Otago). As editors, we are supported by the Leibniz Information Centre for Economics (ZBW) and the German Research Foundation (DFG). + +Our advisory board consists of the following outstanding scholars: David Autor (MIT), Anna Dreber Almenberg (Stockholm School of Economics), Richard Easterlin (USC), Edward Leamer (UCLA), David Roodman (Open Philanthropy), and Jeffrey Wooldridge (MSU). + +In conclusion, replication studies are indispensable in economics for several compelling reasons: they uphold scientific rigor and transparency, contribute to the cumulative advancement of knowledge, help identify biases and flaws in research, and promote openness and collaboration among researchers. We believe that with the increasing availability of large data sets, advances in computing power, and the development of new econometric tools, the importance of replicating and validating research findings will only grow. To this end, we invite you to submit your replication studies to the *JCRE* and promote the journal in your networks and among your students. + +For more information about *JCRE*, [***click here***](https://www.jcr-econ.org/welcome/). + +*Dennis Wesselbaum is Associate Professor of Economics at the University of Otago in New Zealand, and Co-Editor of the Journal of Comments and Replications in Economics (JCRE).
He can be contacted at dennis.wesselbaum@otago.ac.nz.* \ No newline at end of file diff --git a/layouts/blog/list.html b/layouts/blog/list.html new file mode 100644 index 00000000000..2ace279eba1 --- /dev/null +++ b/layouts/blog/list.html @@ -0,0 +1,40 @@ +{{- define "main" -}} + +{{ partial "page_header.html" . }}
+ +{{/* Minimal wrapper markup; the class names are placeholders and may need to match the theme CSS. */}} +<div class="universal-wrapper"> +  {{ with .Content }} +  <div class="article-style">{{ . }}</div> +  {{ end }} + +  <div class="blog-list"> +    {{ $paginator := .Paginate (where .Pages "Type" "blog") }} +    {{ range $paginator.Pages }} +    <article class="blog-list-item"> +      <h2 class="blog-list-title"><a href="{{ .RelPermalink }}">{{ .Title }}</a></h2> + +      {{ with .Params.tags }} +      <div class="blog-list-tags"> +        Tags: +        {{ range . }} +        <span class="badge">{{ . }}</span> +        {{ end }} +      </div> +      {{ end }} +      {{ with .Summary }} +      <div class="blog-list-summary">{{ . }}</div> +      {{ end }} +    </article> +    {{ end }} +  </div> + +  {{ partial "pagination" . }} +</div> + +{{- end -}} diff --git a/layouts/blog/single.html b/layouts/blog/single.html new file mode 100644 index 00000000000..ceeb1d5839a --- /dev/null +++ b/layouts/blog/single.html @@ -0,0 +1,27 @@ +{{- define "main" -}} + +{{/* Minimal wrapper markup; the class names are placeholders and may need to match the theme CSS. */}} +<article class="article"> + +  {{ partial "page_header.html" . }} + +  <div class="article-container"> + +    <div class="article-style"> +      {{ .Content }} +    </div> + +    {{ partial "tags" . }} + +    {{ if site.Params.section_pager }} +    <div class="article-widget"> +      {{ partial "section_pager" . }} +    </div> +    {{ end }} + +    {{ partial "comments" . }} + +  </div> + +</article> + +{{- end -}} diff --git a/layouts/partials/navbar.html b/layouts/partials/navbar.html index ab7626b494e..8bf1b7d319f 100644 --- a/layouts/partials/navbar.html +++ b/layouts/partials/navbar.html @@ -58,7 +58,13 @@