Computational Research Becomes Unreproducible Within Years as Software Dependencies Decay
Computational research — simulations, data analyses, machine learning experiments — becomes unreproducible within 2–5 years as software dependencies break. A systematic study of 601 computer science papers found that only 32% of computational results could be reproduced even when authors provided their code, primarily because required libraries, compilers, operating system interfaces, and hardware drivers had changed. The problem compounds: a typical Python data science project depends on 50–200 packages, each with its own version constraints and transitive dependencies, creating a fragile dependency graph that decays as any component updates. Even when code runs, numerical results may differ due to changes in floating-point handling, random number generators, or GPU parallelism between library versions.
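The scale of the transitive dependency graph described above can be inspected with a short, stdlib-only sketch. The requirement-string parsing here is deliberately crude (it skips environment-marker extras and does not handle every version-specifier form), so treat it as an illustration, not a tool:

```python
# Sketch: enumerate the declared transitive dependency closure of an
# installed Python package, using only the standard library.
from importlib.metadata import requires, PackageNotFoundError

def dependency_closure(package: str) -> set[str]:
    """Return the (lowercased) names of `package`'s transitive dependencies."""
    seen: set[str] = set()
    stack = [package]
    while stack:
        name = stack.pop()
        if name.lower() in seen:
            continue
        seen.add(name.lower())
        try:
            reqs = requires(name) or []
        except PackageNotFoundError:
            continue  # optional dependency not installed in this environment
        for req in reqs:
            if ";" in req:
                continue  # skip environment-marker extras (platform-specific)
            # Crude name extraction: strip extras and common version specifiers.
            dep = req.split(" ")[0].split("[")[0]
            for sep in ("==", ">=", "<=", ">", "<", "!="):
                dep = dep.split(sep)[0]
            stack.append(dep)
    seen.discard(package.lower())
    return seen

# Closure sizes grow quickly for typical data-science stacks.
print(len(dependency_closure("pip")))
```

Running this against a real analysis environment makes the 50–200-package figure concrete: each name in the closure is an independent point of decay.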
Computational methods now underpin the majority of published research across the sciences, from climate projections to drug discovery to materials simulation. A widely cited estimate puts the cost of irreproducible preclinical research to the US biomedical sector at $28 billion annually, and computational irreproducibility is a growing fraction of that total. When foundational results cannot be verified, downstream research builds on unconfirmed assumptions. The NeurIPS reproducibility program found that even with dedicated reproducibility checklists, only ~50% of machine learning papers could be independently reproduced within months of publication, a rate that drops further with time.
Container technologies (Docker, Singularity) can freeze a software environment, but containers themselves have version dependencies and may not run on future hardware or operating systems. Virtual machines provide deeper isolation but are too heavyweight for routine research use and cannot capture GPU-specific behaviors. Journal mandates requiring code and data availability have shown limited effectiveness — Stodden et al. found that only 14% of papers in journals with mandatory data-sharing policies actually provided usable data and code after the policy was adopted. Workflow management systems (Snakemake, Nextflow, CWL) capture the computational graph but not the full environment. Package managers with lockfiles (`pip freeze`, `conda lock`) capture exact versions but cannot guarantee those versions will remain installable as upstream repositories change or servers go offline.
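What a `pip freeze`-style lockfile actually records can be sketched in a few stdlib lines, which also makes the limitation visible: nothing in the snapshot guarantees that the listed versions remain downloadable later. The JSON layout here is a hypothetical choice:

```python
# Sketch: snapshot the exact installed environment, as a lockfile would.
import json
import platform
import sys
from importlib.metadata import distributions

def environment_snapshot() -> dict:
    """Record interpreter version, platform, and exact package versions."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": sorted(
            f"{d.metadata['Name']}=={d.version}"
            for d in distributions()
            if d.metadata["Name"]  # skip broken distributions
        ),
    }

snap = environment_snapshot()
print(json.dumps(snap, indent=2)[:300])
```

The snapshot names versions but not bytes: if `numpy==1.24.2` disappears from the index, the lockfile still parses and still fails to restore the environment.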
A research-specific reproducibility infrastructure that combines three elements currently handled separately: (1) deterministic build systems that produce bit-identical computational environments from declarative specifications (drawing on Nix/Guix approaches but simplified for researchers); (2) content-addressed artifact storage that permanently archives not just code and data but the exact binary dependencies used to produce published results; (3) automated reproducibility testing that periodically re-executes published analyses and flags when results diverge. The individual technologies exist but have not been integrated into a system accessible to non-expert researchers.
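Element (2) can be illustrated with a minimal content-addressed store: every artifact is addressed by the SHA-256 of its bytes, so a published result can name its exact binary inputs immutably. The two-level directory layout is a hypothetical choice, mirroring Git's object store:

```python
# Sketch of content-addressed artifact storage: bytes in, digest out,
# and the digest alone suffices to retrieve the identical bytes later.
import hashlib
import tempfile
from pathlib import Path

class ArtifactStore:
    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest[:2] / digest[2:]  # fan out by prefix
        path.parent.mkdir(exist_ok=True)
        if not path.exists():  # idempotent: same bytes, same address
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        return (self.root / digest[:2] / digest[2:]).read_bytes()

store = ArtifactStore(Path(tempfile.mkdtemp()))
addr = store.put(b"model-weights-v1")
assert store.get(addr) == b"model-weights-v1"
assert store.put(b"model-weights-v1") == addr  # deduplicates by content
```

Because the address is derived from content rather than from a mutable name like `numpy==1.24.2`, archived dependencies cannot silently change underneath a published result.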
A team could take 20–30 recently published computational papers from a specific field (e.g., NeurIPS 2023 papers with code releases) and systematically attempt to reproduce their main results using only the provided code and instructions, documenting failure modes and time-to-failure. Alternatively, a software engineering team could prototype a "reproducibility time machine" that uses content-addressed storage to reconstruct the exact dependency tree used to produce a specific published result. Relevant disciplines: software engineering, the target computational domain, systems engineering.
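The divergence-flagging side of such a prototype could start from a tolerance-aware result fingerprint: hash a canonicalized result after rounding floats, so benign last-bit floating-point drift between library versions does not count as divergence. The analysis values below are hypothetical stand-ins:

```python
# Sketch: fingerprint published results so periodic re-execution can
# flag divergence while tolerating tiny floating-point differences.
import hashlib
import json

def result_fingerprint(result: dict, float_digits: int = 6) -> str:
    """Hash a result dict after rounding floats and sorting keys."""
    def norm(x):
        if isinstance(x, float):
            return round(x, float_digits)
        if isinstance(x, dict):
            return {k: norm(v) for k, v in sorted(x.items())}
        if isinstance(x, list):
            return [norm(v) for v in x]
        return x
    blob = json.dumps(norm(result), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# Recorded at publication time:
published = result_fingerprint({"accuracy": 0.9213000001, "n": 601})
# Re-executed years later with slightly different floating-point behavior:
rerun = result_fingerprint({"accuracy": 0.9213000004, "n": 601})
assert published == rerun  # within tolerance: still reproducible
```

The rounding threshold is a policy decision the pilot would need to calibrate per field; too tight and GPU nondeterminism produces false alarms, too loose and genuine drift goes unflagged.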
This brief targets the research infrastructure integrity almost-cluster (currently 5 briefs in health and materials). It adds the first digital-domain member, with the shared structural pattern: foundational research infrastructure (software environments) has known quality problems, incentive structures discourage investing in reproducibility, and journal mandates have failed to fix the problem. Distinct from `digital-ml-safety-benchmark-dataset-gap` (which is about missing safety evaluation benchmarks, not about software dependency decay). The `temporal:worsening` tag passes the three-requirement test: (1) increasing dependency complexity as projects use more packages; (2) documented acceleration — the 32% reproducibility rate is worse than comparable studies from the 2000s; (3) the dependency ecosystem is genuinely more fragile, not just more visible.
Collberg, C. & Proebsting, T., "Repeatability in Computer Systems Research," Communications of the ACM, 59(3), 62–69, 2016; Stodden, V. et al., "An empirical analysis of journal policy effectiveness for computational reproducibility," PNAS, 115(11), 2584–2589, 2018; NeurIPS Reproducibility Program reports, 2019–2024; accessed 2026-02-25