Missing Data That Isn't Random

Standard ML Imputation Methods Fail When Missing Data Has Systematic Structure Across Populations

digital, health

Problem Statement

When datasets from multiple sources are combined (electronic health records from different hospitals, multi-omics data from different assays, or survey data from different countries), the missing data patterns have systematic structure: entire variable blocks are absent for certain subpopulations or data sources. Standard imputation methods (MICE, mean imputation, deep learning imputers) assume data is "missing at random" (MAR), a condition that structured missingness systematically violates. As a result, ML models trained on combined heterogeneous data make systematically worse predictions for the populations whose data is least complete, which are typically the same populations that are already underserved.
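To make "block-structured" concrete, here is a minimal sketch that tabulates missingness patterns in a toy matrix. The data, the `source` labels, and the split into a "labs" block and an "imaging" block are all hypothetical, chosen only for illustration; `missingness_patterns` is an illustrative helper, not part of any library.

```python
import numpy as np

def missingness_patterns(X):
    """Count rows per missingness pattern (tuple of 0/1, where 1 = missing)."""
    mask = np.isnan(X).astype(int)
    patterns, counts = np.unique(mask, axis=0, return_counts=True)
    return {tuple(int(v) for v in p): int(c) for p, c in zip(patterns, counts)}

# Hypothetical toy data: 6 rows, 4 columns.
# Columns 0-1 stand in for a "labs" block, columns 2-3 for an "imaging" block.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
source = np.array([0, 0, 0, 1, 1, 1])  # hypothetical data source for each row
X[source == 1, 2:] = np.nan            # source 1 never recorded the imaging block

pattern_counts = missingness_patterns(X)
print(pattern_counts)
```

With scattered missingness one would expect many distinct patterns. Here there are only two, and they line up exactly with the data source, which is the signature of structured missingness.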

Why This Matters

The push to combine heterogeneous data sources is accelerating across medicine (federated health data networks), climate science (satellite + ground sensors), and social science (administrative records + surveys). The UK Biobank, for example, has imaging data for only ~25% of participants and genomic data for ~80%, creating block-structured missingness that biases any model using both modalities. Every large-scale precision medicine initiative (All of Us, Genomics England) faces this problem. If structured missingness is not properly handled, these datasets will produce models that appear accurate on average but systematically fail for underrepresented groups.

What’s Been Tried

Methods for non-random missing data exist in classical statistics (selection models, pattern-mixture models, and shared-parameter models), but they require strong parametric assumptions and do not scale beyond a few hundred variables. Modern deep learning imputers (VAEs, GANs) can handle high-dimensional data but implicitly assume MAR or MCAR. Recent extensions of MIWAE (the missing-data importance-weighted autoencoder), such as not-MIWAE, can handle some non-random patterns but cannot represent the block structure where entire data modalities are absent for subgroups. Graph-based approaches show promise for modeling missingness structure but have only been tested on synthetic data. No method has been validated on real-world datasets with verified structured missingness patterns at the scale of modern biobanks.
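For reference, the standard textbook factorizations behind these terms can be written as follows. The notation is an assumption introduced here, not taken from the brief: Y = (Y_obs, Y_mis) is the complete data, R is the vector of missingness indicators, b is a latent random effect, and θ, ψ, π are model parameters.

```latex
% Y = (Y_obs, Y_mis): complete data; R: binary missingness indicators
\begin{align*}
\text{MCAR:} \quad & p(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = p(R) \\
\text{MAR:} \quad & p(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = p(R \mid Y_{\text{obs}}) \\
\text{Selection model:} \quad & p(Y, R) = p(Y \mid \theta)\, p(R \mid Y, \psi) \\
\text{Pattern-mixture model:} \quad & p(Y, R) = p(R \mid \pi)\, p(Y \mid R, \theta_R) \\
\text{Shared-parameter model:} \quad & p(Y, R) = \int p(Y \mid b, \theta)\, p(R \mid b, \psi)\, p(b)\, db
\end{align*}
```

In the pattern-mixture form, the parameters θ_R for a pattern in which a whole block is never observed cannot be estimated from the data alone. This is one way to see why these models lean on strong assumptions.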

What Would Unlock Progress

A benchmark suite of real-world datasets with documented structured missingness patterns would enable systematic comparison of methods. Causal frameworks for missingness (treating missingness as a node in a directed acyclic graph) could provide principled approaches, but they require domain knowledge about why data is missing, which is itself rarely documented. Methods that combine representation learning with explicit missingness structure modeling (e.g., learning separate embeddings per missingness pattern) are a promising direction.
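As a rough illustration of the "separate embeddings per missingness pattern" idea, the sketch below assigns each row a pattern ID and appends a per-pattern vector to zero-filled features. This is a sketch under stated assumptions, not a published method: the function names are hypothetical, and the embedding table is randomly initialized as a stand-in for parameters that a real model would learn during training.

```python
import numpy as np

def pattern_ids(X):
    """Map each row to the index of its missingness pattern."""
    mask = np.isnan(X)
    patterns, ids = np.unique(mask, axis=0, return_inverse=True)
    return patterns, np.asarray(ids).ravel()

def encode_with_pattern_embedding(X, embed_dim=3, seed=0):
    """Zero-fill missing values and append one embedding vector per pattern.

    The embedding table is random here (an assumption for illustration);
    in a trained model these vectors would be learned parameters.
    """
    patterns, ids = pattern_ids(X)
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(len(patterns), embed_dim))
    filled = np.nan_to_num(X, nan=0.0)
    return np.hstack([filled, table[ids]]), ids

# Hypothetical toy input: the second column is absent for the last two rows.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, np.nan],
              [6.0, np.nan]])
features, ids = encode_with_pattern_embedding(X)
print(features.shape, ids)
```

Rows that share a missingness pattern receive the same extra vector, so a downstream model can condition on which block is absent instead of treating a zero-filled value as a real measurement.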

Entry Points for Student Teams

A student team could construct synthetic datasets with controlled block-missingness patterns, benchmark standard imputation methods (MICE, missForest, MIWAE) against them, and measure downstream prediction fairness across subgroups defined by missingness pattern. This requires only standard ML tooling and publicly available datasets. Alternatively, teams could analyze the UK Biobank's or MIMIC-III's missingness structure, document the patterns, and test whether imputation method choice changes clinical prediction model performance for different patient subgroups. Relevant disciplines: statistics, machine learning, data science, biomedical informatics.
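A minimal version of the synthetic benchmark might look like the sketch below. Everything in it is an assumption made for illustration: two equally sized groups, a feature `x_b` whose mean differs by group and which is entirely missing for group 1, mean imputation as the baseline, and an ordinary least-squares model as the downstream predictor. A real study would substitute MICE, missForest, or MIWAE for the imputation step and use held-out data for evaluation.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000  # rows per group (hypothetical size)

# Two groups; feature x_b has a different mean in group 1.
group = np.repeat([0, 1], n)
x_a = rng.normal(size=2 * n)
x_b = rng.normal(loc=2.0 * group, size=2 * n)
y = x_a + x_b + 0.1 * rng.normal(size=2 * n)

# Block missingness: x_b is never observed for group 1.
x_b_obs = x_b.copy()
x_b_obs[group == 1] = np.nan

# Baseline: mean imputation using observed values (all from group 0).
x_b_imp = np.where(np.isnan(x_b_obs), np.nanmean(x_b_obs), x_b_obs)

# Downstream model: least-squares fit on the imputed data.
design = np.column_stack([np.ones(2 * n), x_a, x_b_imp])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ coef

# Error broken out by subgroup, where subgroup = missingness pattern.
mse_by_group = {g: float(np.mean(resid[group == g] ** 2)) for g in (0, 1)}
print(mse_by_group)
```

Reporting error per missingness pattern, not only on average, is the point of the exercise: in this setup the group with the missing block ends up with the larger error, and an aggregate metric would hide that gap.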

Genome Tags

Constraint

data, equity, technical

Domain

digital, health

Scale

global

Failure

unrepresentative-data, ignored-context

Breakthrough

algorithm, data-integration

Stakeholders

institutional

Temporal

worsening

Tractability

research-contribution

Source Notes

Related briefs: `DIGITAL-scientific-ai-data-scarcity` (broader data scarcity, not specifically structured missingness); `mathematics-ai-uncertainty-quantification-science` (UQ methods, tangentially related); `health-pulse-oximeter-skin-tone-bias` (hardware bias, not data bias). This brief bridges the equity-technical divide: structured missingness is a technical data problem whose consequences are inequitable prediction quality across populations. Source-bias note: the Nature Machine Intelligence framing emphasizes theoretical/methodological gaps; the actual binding constraints are data documentation practices and institutional willingness to share missingness metadata.