Computational Research Becomes Unreproducible Within Years as Software Dependencies Decay
Computational research — simulations, data analyses, machine learning experiments — becomes unreproducible within 2–5 years as software dependencies break. A systematic study of 601 computer science papers found that only 32% of computational results could be reproduced even when authors provided their code, primarily because required libraries, compilers, operating system interfaces, and hardware drivers had changed. The problem compounds: a typical Python data science project depends on 50–200 packages, each with its own version constraints and transitive dependencies, creating a fragile dependency graph that decays as any component updates. Even when code runs, numerical results may differ due to changes in floating-point handling, random number generators, or GPU parallelism between library versions.
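The scale of the transitive dependency graph described above can be inspected with a short, stdlib-only sketch. The requirement-string parsing here is deliberately crude (it skips environment-marker extras and does not handle every version-specifier form), so treat it as an illustration, not a tool:

```python
# Sketch: enumerate the declared transitive dependency closure of an
# installed Python package, using only the standard library.
from importlib.metadata import requires, PackageNotFoundError

def dependency_closure(package: str) -> set[str]:
    """Return the (lowercased) names of `package`'s transitive dependencies."""
    seen: set[str] = set()
    stack = [package]
    while stack:
        name = stack.pop()
        if name.lower() in seen:
            continue
        seen.add(name.lower())
        try:
            reqs = requires(name) or []
        except PackageNotFoundError:
            continue  # optional dependency not installed in this environment
        for req in reqs:
            if ";" in req:
                continue  # skip environment-marker extras (platform-specific)
            # Crude name extraction: strip extras and common version specifiers.
            dep = req.split(" ")[0].split("[")[0]
            for sep in ("==", ">=", "<=", ">", "<", "!="):
                dep = dep.split(sep)[0]
            stack.append(dep)
    seen.discard(package.lower())
    return seen

# Closure sizes grow quickly for typical data-science stacks.
print(len(dependency_closure("pip")))
```

Running this against a real analysis environment makes the 50–200-package figure concrete: each name in the closure is an independent point of decay.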
Computational methods now underpin the majority of published research across the sciences, from climate projections to drug discovery to materials simulation. A widely cited estimate puts the cost of irreproducible preclinical research to the US biomedical sector at $28 billion annually, and computational irreproducibility is a growing fraction of that total. When foundational results cannot be verified, downstream research builds on unconfirmed assumptions. The NeurIPS reproducibility program found that even with dedicated reproducibility checklists, only ~50% of machine learning papers could be independently reproduced within months of publication, a rate that drops further with time.
Container technologies (Docker, Singularity) can freeze a software environment, but containers themselves have version dependencies and may not run on future hardware or operating systems. Virtual machines provide deeper isolation but are too heavyweight for routine research use and cannot capture GPU-specific behaviors. Journal mandates requiring code and data availability have shown limited effectiveness — Stodden et al. found that only 14% of papers in journals with mandatory data-sharing policies actually provided usable data and code after the policy was adopted. Workflow management systems (Snakemake, Nextflow, CWL) capture the computational graph but not the full environment. Package managers with lockfiles (`pip freeze`, `conda lock`) capture exact versions but cannot guarantee those versions will remain installable as upstream repositories change or servers go offline.
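What a `pip freeze`-style lockfile actually records can be sketched in a few stdlib lines, which also makes the limitation visible: nothing in the snapshot guarantees that the listed versions remain downloadable later. The JSON layout here is a hypothetical choice:

```python
# Sketch: snapshot the exact installed environment, as a lockfile would.
import json
import platform
import sys
from importlib.metadata import distributions

def environment_snapshot() -> dict:
    """Record interpreter version, platform, and exact package versions."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": sorted(
            f"{d.metadata['Name']}=={d.version}"
            for d in distributions()
            if d.metadata["Name"]  # skip broken distributions
        ),
    }

snap = environment_snapshot()
print(json.dumps(snap, indent=2)[:300])
```

The snapshot names versions but not bytes: if `numpy==1.24.2` disappears from the index, the lockfile still parses and still fails to restore the environment.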
A research-specific reproducibility infrastructure that combines three elements currently handled separately: (1) deterministic build systems that produce bit-identical computational environments from declarative specifications (drawing on Nix/Guix approaches but simplified for researchers); (2) content-addressed artifact storage that permanently archives not just code and data but the exact binary dependencies used to produce published results; (3) automated reproducibility testing that periodically re-executes published analyses and flags when results diverge. The individual technologies exist but have not been integrated into a system accessible to non-expert researchers.
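Element (2) can be illustrated with a minimal content-addressed store: every artifact is addressed by the SHA-256 of its bytes, so a published result can name its exact binary inputs immutably. The two-level directory layout is a hypothetical choice, mirroring Git's object store:

```python
# Sketch of content-addressed artifact storage: bytes in, digest out,
# and the digest alone suffices to retrieve the identical bytes later.
import hashlib
import tempfile
from pathlib import Path

class ArtifactStore:
    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest[:2] / digest[2:]  # fan out by prefix
        path.parent.mkdir(exist_ok=True)
        if not path.exists():  # idempotent: same bytes, same address
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        return (self.root / digest[:2] / digest[2:]).read_bytes()

store = ArtifactStore(Path(tempfile.mkdtemp()))
addr = store.put(b"model-weights-v1")
assert store.get(addr) == b"model-weights-v1"
assert store.put(b"model-weights-v1") == addr  # deduplicates by content
```

Because the address is derived from content rather than from a mutable name like `numpy==1.24.2`, archived dependencies cannot silently change underneath a published result.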
A team could take 20–30 recently published computational papers from a specific field (e.g., NeurIPS 2023 papers with code releases) and systematically attempt to reproduce their main results using only the provided code and instructions, documenting failure modes and time-to-failure. Alternatively, a software engineering team could prototype a "reproducibility time machine" that uses content-addressed storage to reconstruct the exact dependency tree used to produce a specific published result. Relevant disciplines: software engineering, the target computational domain, systems engineering.
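The divergence-flagging side of such a prototype could start from a tolerance-aware result fingerprint: hash a canonicalized result after rounding floats, so benign last-bit floating-point drift between library versions does not count as divergence. The analysis values below are hypothetical stand-ins:

```python
# Sketch: fingerprint published results so periodic re-execution can
# flag divergence while tolerating tiny floating-point differences.
import hashlib
import json

def result_fingerprint(result: dict, float_digits: int = 6) -> str:
    """Hash a result dict after rounding floats and sorting keys."""
    def norm(x):
        if isinstance(x, float):
            return round(x, float_digits)
        if isinstance(x, dict):
            return {k: norm(v) for k, v in sorted(x.items())}
        if isinstance(x, list):
            return [norm(v) for v in x]
        return x
    blob = json.dumps(norm(result), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# Recorded at publication time:
published = result_fingerprint({"accuracy": 0.9213000001, "n": 601})
# Re-executed years later with slightly different floating-point behavior:
rerun = result_fingerprint({"accuracy": 0.9213000004, "n": 601})
assert published == rerun  # within tolerance: still reproducible
```

The rounding threshold is a policy decision the pilot would need to calibrate per field; too tight and GPU nondeterminism produces false alarms, too loose and genuine drift goes unflagged.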
This brief targets the research infrastructure integrity almost-cluster (currently 5 briefs in health and materials). It adds the first digital-domain member, with the shared structural pattern: foundational research infrastructure (software environments) has known quality problems, incentive structures discourage investing in reproducibility, and journal mandates have failed to fix the problem. Distinct from `digital-ml-safety-benchmark-dataset-gap` (which is about missing safety evaluation benchmarks, not about software dependency decay). The `temporal:worsening` tag passes the three-requirement test: (1) increasing dependency complexity as projects use more packages; (2) documented acceleration — the 32% reproducibility rate is worse than comparable studies from the 2000s; (3) the dependency ecosystem is genuinely more fragile, not just more visible.
Collberg, C. & Proebsting, T., "Repeatability in Computer Systems Research," Communications of the ACM, 59(3), 62–69, 2016; Stodden, V. et al., "An empirical analysis of journal policy effectiveness for computational reproducibility," PNAS, 115(11), 2584–2589, 2018; NeurIPS Reproducibility Program reports, 2019–2024; accessed 2026-02-25