DIGITAL-scientific-ai-data-scarcity
Tier 2
2026-02-12

AI for Scientific Discovery Fails on Small, Fragmented, Multi-Modal Datasets Typical of Most Sciences

digital, environment, health

Problem Statement

Modern AI — particularly large language models and generative AI — achieves its successes by training on massive, curated datasets. Most scientific domains cannot produce such data. Scientific processes involve complex systems with numerous interdependent observations across different modalities, scales, and quality levels that serve as proxies for underlying phenomena. The resulting datasets are small, fragmented across disparate sources (simulations, lab experiments, sensor measurements, figures/tables extracted from literature), often proprietary, and lack standardized formats. An NSF workshop convening 30 experts across computational biology, neuroscience, climate science, materials informatics, and physics concluded that AI's reliance on "big, labeled data corpora" is fundamentally incompatible with how most scientific data is generated, creating a structural bottleneck that limits AI's impact on scientific discovery.

Why This Matters

Federal agencies and universities are investing billions in "AI for science" initiatives on the premise that AI will accelerate discovery across all scientific domains. But the data gap means AI benefits concentrate in data-rich fields (genomics, astronomy, materials screening) while data-scarce fields (ecology, hydrology, many clinical sciences) are left behind. The workshop identified that the most scientifically important problems — rare phenomena, extreme events, novel systems — are exactly where data is scarcest. If AI for science is built only on big-data assumptions, it will systematically miss the most important discoveries while creating an illusion of broad applicability.

What’s Been Tried

- Transfer learning from data-rich to data-scarce domains has shown limited success: scientific domains have fundamentally different data structures, physical constraints, and noise characteristics that don't transfer well.
- Physics-informed neural networks (PINNs) embed prior knowledge to reduce data requirements (a minimal sketch follows this list), but they work only where the physics is well-characterized and fall apart for systems with unknown or partially known governing equations.
- Data augmentation and synthetic data generation can fill gaps for known distributions but cannot generate data for the rare events and out-of-distribution conditions where scientific discovery happens.
- Federated learning attempts to aggregate distributed datasets without centralizing them, but scientific data is heterogeneous in format, quality, and semantics, not just distributed in location.
- Foundation models for science are being attempted but face the fundamental problem that no single training corpus spans the diversity of scientific observation types, and pre-training on text (scientific papers) does not transfer to reasoning about experimental data.
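To make the PINN idea concrete, here is a minimal sketch assuming a toy logistic-growth ODE, du/dt = r*u*(1 - u/K), with known parameters and a handful of noisy observations. PyTorch, the network architecture, and all constants are illustrative choices, not details from the workshop report.

```python
# Minimal PINN sketch: fit a small network to sparse observations of
# logistic growth, regularized by the known ODE du/dt = r*u*(1 - u/K).
import torch

torch.manual_seed(0)
r, K = 1.0, 10.0  # assumed known rate and carrying capacity

net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

# A handful of noisy observations (the small-data regime).
t_data = torch.tensor([[0.0], [1.0], [2.0], [4.0]])
u_data = K / (1 + (K - 1) * torch.exp(-r * t_data))  # exact solution, u(0)=1
u_data = u_data + 0.1 * torch.randn_like(u_data)

# Collocation points where only the physics is enforced (no labels needed).
t_phys = torch.linspace(0, 6, 50).reshape(-1, 1).requires_grad_(True)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(5000):
    opt.zero_grad()
    data_loss = ((net(t_data) - u_data) ** 2).mean()
    u = net(t_phys)
    du_dt = torch.autograd.grad(u.sum(), t_phys, create_graph=True)[0]
    phys_loss = ((du_dt - r * u * (1 - u / K)) ** 2).mean()
    loss = data_loss + 1.0 * phys_loss  # weighting is a tunable assumption
    loss.backward()
    opt.step()
```

The physics residual lets the network interpolate sensibly from four data points, but only because the governing equation is known exactly; with a wrong or partial equation the same term pulls the fit toward the wrong mechanism, which is the failure mode noted above.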

What Would Unlock Progress

The workshop identified several directions:

- AI methods that can reason causally rather than just statistically, enabling extraction of mechanistic understanding from small datasets.
- General uncertainty quantification methods for generative models, so scientists know when to trust AI outputs (a baseline is sketched below).
- Hybrid approaches combining domain knowledge with data-driven methods in principled ways.
- Community-curated, AI-ready benchmark datasets for specific scientific challenge domains that can serve as shared evaluation standards.

The deeper shift needed is AI architectures designed from the ground up for small, heterogeneous, multi-modal scientific data rather than adapted from architectures optimized for internet-scale text and images.
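As a concrete, if generic, illustration of the uncertainty-quantification direction, the sketch below uses a deep ensemble: several independently initialized models are trained on the same small dataset, and their disagreement flags predictions that should not be trusted. This is a standard baseline chosen for illustration; the data and architecture are hypothetical, and the workshop's call is for methods that go beyond such heuristics, especially for generative models.

```python
# Ensemble-based uncertainty sketch: train several independently initialized
# models and treat their disagreement as a trust signal.
import torch

torch.manual_seed(0)
x_train = torch.linspace(-1, 1, 20).reshape(-1, 1)
y_train = torch.sin(3 * x_train) + 0.05 * torch.randn_like(x_train)

def make_model():
    return torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

ensemble = []
for _ in range(5):
    model = make_model()  # each member gets a different random init
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(2000):
        opt.zero_grad()
        loss = ((model(x_train) - y_train) ** 2).mean()
        loss.backward()
        opt.step()
    ensemble.append(model)

# Query one point inside and one far outside the training range.
x_test = torch.tensor([[0.5], [3.0]])
with torch.no_grad():
    preds = torch.stack([m(x_test) for m in ensemble])
mean, std = preds.mean(dim=0), preds.std(dim=0)
# High std (typically at x=3.0, outside the data) flags predictions
# a scientist should not trust without further validation.
```

Ensemble disagreement is only a heuristic; calibrated, general-purpose uncertainty quantification for generative models remains the open problem the workshop describes.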

Entry Points for Student Teams

A team could take a well-studied scientific system (e.g., a simple chemical reaction, ecological population dynamics, or material fatigue behavior) where both data and ground-truth models exist, then systematically benchmark how current AI methods (standard neural networks, PINNs, transfer learning, few-shot learning) degrade as dataset size shrinks and modality diversity increases. The deliverable would be a "data efficiency curve" showing where each method breaks down and what minimum data requirements look like for reliable prediction. This is feasible as a computational/data science project requiring programming skills and basic domain knowledge.
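A minimal scaffold for such a benchmark might look like the sketch below, where a logistic population model stands in for the well-studied system and test error is tracked as the training set shrinks. The system, noise level, and regressor are placeholder assumptions; a real project would substitute its chosen domain and methods (standard networks, PINNs, transfer learning, few-shot learning) and vary modality diversity as well.

```python
# Scaffold for a "data efficiency curve": generate ground-truth data from a
# known model, then measure how a standard regressor degrades as n shrinks.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
r, K, u0 = 1.0, 10.0, 1.0  # logistic growth parameters (placeholders)

def logistic(t):
    return K / (1 + (K / u0 - 1) * np.exp(-r * t))

t_test = np.linspace(0, 6, 200)
y_test = logistic(t_test)

for n in [200, 100, 50, 20, 10, 5]:
    errors = []
    for trial in range(10):  # repeat: small-sample results are high-variance
        t_train = rng.uniform(0, 6, n)
        y_train = logistic(t_train) + rng.normal(0, 0.2, n)  # sensor noise
        model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000,
                             random_state=trial)
        model.fit(t_train.reshape(-1, 1), y_train)
        pred = model.predict(t_test.reshape(-1, 1))
        errors.append(mean_squared_error(y_test, pred))
    print(f"n={n:4d}  MSE={np.mean(errors):.4f} ± {np.std(errors):.4f}")
```

Plotting mean error against n yields the data efficiency curve; the point where error inflates marks the minimum data requirement the deliverable calls for.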

Genome Tags

Constraint
data, technical
Domain
digital, environment, health
Scale
global
Failure
disciplinary-silo, ignored-context
Breakthrough
algorithm, knowledge-integration
Stakeholders
multi-institution
Temporal
worsening
Tractability
research-contribution

Source Notes

- The workshop brought together 30 experts from 10+ scientific disciplines, making it one of the broadest convenings on AI-for-science challenges.
- Relates to existing briefs involving data-constrained predictive models (Cluster 2 in cross-domain analysis) but is more fundamental: those briefs describe models trained on unrepresentative data, while this brief describes the structural inability to create representative scientific training data at all.
- The `failure:ignored-context` tag reflects that AI methods designed for internet-scale data are being applied to scientific settings with fundamentally different data characteristics, analogous to deploying lab-designed solutions in field conditions.
- The causal reasoning gap is a distinct sub-problem: commercial AI optimizes for prediction, but science requires understanding causation. This disconnect means AI improvements for commercial applications (better language models, image generators) don't automatically benefit science.
- Key adjacent communities: the machine learning theory community (NeurIPS, ICML), the scientific computing community (SIAM), and individual domain science communities are all working on pieces of this but lack a shared framework for small-data AI methods.

Source

"AI-enabled scientific revolution in the age of generative AI: second NSF workshop report," published in npj Artificial Intelligence (Nature), 2025. Workshop held August 2024, University of Minnesota. https://www.nature.com/articles/s44387-025-00018-6 (accessed 2026-02-12)