Psychology's Replication Crisis Reveals That Essential Experimental Details Were Never Reported
The Open Science Collaboration's landmark attempt to replicate 100 psychology studies found that only 36% produced statistically significant results in the same direction as the original, and replication effect sizes were on average about half the original magnitude. The Many Labs 2 project (28 studies, 125 samples across 36 countries) found that some effects vary substantially across samples and settings, variation attributable not to random error but to contextual moderators that the original studies did not report. The core infrastructure problem is that psychology's standard method sections lack the detail needed for precise replication: stimulus presentation timing, experimenter demographics, participant recruitment channels, lab ambient conditions, and dozens of other procedural details that can moderate effects are systematically omitted from publications.
Psychology research directly informs clinical practice (cognitive behavioral therapy protocols), education policy (growth mindset interventions, stereotype threat), criminal justice (eyewitness identification procedures), and public health (behavioral nudges, anti-stigma campaigns). When foundational effects cannot be replicated, interventions built on those effects may be ineffective or harmful. The growth mindset intervention literature, for example, showed dramatic effects in original studies but near-zero effects in large-scale replications — yet growth mindset curricula had already been adopted by school districts serving millions of students. The replication crisis has generated substantial methodological reform, but the underlying infrastructure problem — insufficient procedural detail in published methods — persists because publication incentives reward novelty over methodological precision.
Pre-registration (committing to hypotheses and analysis plans before data collection) addresses p-hacking and HARKing (hypothesizing after the results are known) but does not address hidden-moderator problems, because researchers cannot pre-register moderators they don't know exist. Registered replication reports improve replication quality but are expensive (each costs roughly $50K–$100K in participant time and researcher effort) and cover only a handful of effects per year. Open materials policies requiring researchers to share stimuli and code improve reproducibility but cannot capture tacit procedural knowledge, the "lab lore" that experienced researchers transmit orally but do not write down. Multisite replication projects (Many Labs, the Psychological Science Accelerator) demonstrate the scale of the problem but cannot prevent it in new research.
Machine-readable experimental protocols that capture procedural details at sufficient granularity to enable exact replication, analogous to how chemical synthesis protocols specify temperatures, times, and concentrations to the decimal point. Automated experimental platforms (jsPsych, PsychoPy with standardized deployment) that enforce procedural consistency across sites by reducing experimenter-mediated variation. Systematic moderator-mapping projects that empirically test which procedural variables actually moderate established effects, separating consequential details from irrelevant ones.
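As a rough illustration of the first item, the sketch below shows one way a machine-readable protocol record could look. It is written in Python for concreteness; every field name, value, and level of granularity is an assumption made for illustration, not an existing standard or published schema.

```python
# Minimal sketch of a machine-readable protocol record for a behavioral
# experiment. All field names are hypothetical; a real standard would be
# negotiated with the labs that have to fill it in.
import json
from dataclasses import dataclass, asdict, field

@dataclass
class StimulusTiming:
    fixation_ms: int           # fixation cross duration
    stimulus_ms: int           # stimulus presentation duration
    isi_ms: int                # inter-stimulus interval
    response_deadline_ms: int  # maximum time allowed for a response

@dataclass
class SessionContext:
    recruitment_channel: str   # e.g. "university subject pool", "online panel"
    experimenter_present: bool # in-room experimenter vs. unattended/online
    experimenter_script: str   # e.g. "graduate RA reads instructions verbatim"
    ambient_conditions: str    # e.g. "sound-attenuated booth", "open lab"
    time_of_day_window: str    # e.g. "09:00-17:00 local"

@dataclass
class ExperimentProtocol:
    effect_id: str             # which effect this protocol instantiates
    software: str              # e.g. "PsychoPy 2024.1", "jsPsych 7.3"
    display_refresh_hz: int
    timing: StimulusTiming
    context: SessionContext
    deviations: list[str] = field(default_factory=list)  # logged departures from plan

    def to_json(self) -> str:
        """Serialize so other labs (and replication pipelines) can diff protocols."""
        return json.dumps(asdict(self), indent=2)

# Example instance for a hypothetical visual-cognition task.
protocol = ExperimentProtocol(
    effect_id="example-visual-search-asymmetry",
    software="PsychoPy 2024.1",
    display_refresh_hz=60,
    timing=StimulusTiming(fixation_ms=500, stimulus_ms=200, isi_ms=1000,
                          response_deadline_ms=2000),
    context=SessionContext(recruitment_channel="university subject pool",
                           experimenter_present=True,
                           experimenter_script="graduate RA reads instructions verbatim",
                           ambient_conditions="sound-attenuated booth",
                           time_of_day_window="09:00-17:00 local"),
)
print(protocol.to_json())
```

The point of serializing to a structured format is that replication pipelines can then diff protocols field by field instead of re-reading prose method sections.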
A team could select a well-known psychology effect with mixed replication results and systematically identify candidate hidden moderators by comparing successful and failed replication protocols at the procedural detail level, then design an experiment that tests the top moderator candidates. Alternatively, a methods team could prototype a structured experimental protocol format for one subfield of psychology (e.g., visual cognition) that captures the procedural details typically omitted from method sections, then pilot it with collaborating labs. Relevant disciplines: experimental psychology, cognitive science, research methodology, human-computer interaction.
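For the first project idea, a minimal screening step might look like the sketch below. It assumes protocols have already been encoded in a structured format (such as the hypothetical record above) and simply flags fields whose observed values never overlap between successful and failed replications; the function name and toy data are invented for illustration, and a flagged field is only a candidate to test experimentally, not evidence of moderation.

```python
# Hedged sketch of the moderator-candidate screening step: given protocol
# records (as dicts, e.g. parsed from a structured protocol format) for
# replications that did and did not find the effect, flag fields whose
# values separate the two groups.

def flag_candidate_moderators(succeeded: list[dict], failed: list[dict]) -> list[str]:
    """Return protocol fields whose observed values never overlap between
    successful and failed replications (a screening heuristic only)."""
    candidates = []
    all_fields = set().union(*succeeded, *failed)   # union of all protocol keys
    for field_name in sorted(all_fields):
        ok_values = {p.get(field_name) for p in succeeded}
        bad_values = {p.get(field_name) for p in failed}
        if ok_values and bad_values and ok_values.isdisjoint(bad_values):
            candidates.append(field_name)
    return candidates

# Toy example with hypothetical protocol fields.
succeeded = [
    {"experimenter_present": True,  "stimulus_ms": 200, "recruitment": "subject pool"},
    {"experimenter_present": True,  "stimulus_ms": 250, "recruitment": "subject pool"},
]
failed = [
    {"experimenter_present": False, "stimulus_ms": 200, "recruitment": "online panel"},
    {"experimenter_present": False, "stimulus_ms": 250, "recruitment": "online panel"},
]
print(flag_candidate_moderators(succeeded, failed))
# -> ['experimenter_present', 'recruitment']
```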
Targets the research infrastructure integrity almost-cluster. The structural pattern matches: foundational research infrastructure (experimental methods reporting) has documented quality problems, incentive structures (publication rewards novelty over methodological precision) discourage quality-checking, and the failure propagates downstream (clinical interventions and education policies built on unreplicated effects). Adds the education domain to the almost-cluster and reinforces health (the cluster currently covers health and materials). The `temporal:static` tag is used because the replication rate itself is not worsening; the problem is longstanding and now visible. Distinct from `health-preclinical-cancer-replication-failure` (which is about biological reagent quality in biomedical research, not procedural detail in behavioral science).
Open Science Collaboration, "Estimating the reproducibility of psychological science," Science, 349(6251), aac4716, 2015; Klein, R.A. et al., "Many Labs 2: Investigating Variation in Replicability Across Samples and Settings," Advances in Methods and Practices in Psychological Science, 1(4), 443–490, 2018; Nosek, B.A. et al., "Replicability, Robustness, and Reproducibility in Psychological Science," Annual Review of Psychology, 73, 719–748, 2022; accessed 2026-02-25