No Public Benchmark Datasets Exist for Evaluating Machine Learning in Safety-Critical Applications
Machine learning systems are being deployed in safety-critical applications — autonomous vehicles, medical diagnostics, aviation, nuclear operations, infrastructure monitoring — without benchmark datasets designed for safety evaluation. Conventional ML benchmarks (ImageNet, CIFAR-10) measure average-case accuracy; safety-critical applications instead require evaluation against worst-case scenarios partitioned by harm severity, and no public benchmark datasets with these properties exist. Verification methods that can formally prove safety properties of ML components operate at scales "orders of magnitude behind" modern production architectures, and no standardized post-deployment monitoring methods exist to detect when a deployed ML model is degrading before a safety-relevant failure occurs.
ML components in safety-critical systems affect millions of people daily. The fundamental evaluation gap is that ML research optimizes for average-case performance on research benchmarks, while safety engineering requires worst-case guarantees under realistic deployment conditions. A medical imaging model that achieves 98% accuracy on a balanced test set may still fail catastrophically on the 2% of rare but clinically critical cases that determine patient survival. Without safety-partitioned benchmarks, there is no way to systematically compare models on the scenarios that matter most, no way to certify ML components against safety standards (IEC 61508, DO-178C, ISO 26262), and no scientific basis for regulators to evaluate safety claims.
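To make the average-case versus worst-case contrast concrete, here is a minimal sketch of a severity-stratified evaluation in Python. The severity labels, the toy data, and the choice to report the worst-performing stratum are illustrative assumptions; no existing benchmark supplies such annotations.

```python
# Minimal sketch (illustrative): severity-stratified evaluation.
# Assumes each test case carries a harm-severity annotation alongside
# its ground-truth label; no public benchmark currently provides this.
from collections import defaultdict

def stratified_report(y_true, y_pred, severity):
    """Return overall accuracy and accuracy per harm-severity stratum."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p, s in zip(y_true, y_pred, severity):
        total[s] += 1
        correct[s] += int(t == p)
    per_stratum = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_stratum

# Toy data: 98 routine cases, all classified correctly, and 2 clinically
# critical cases, both missed.
y_true   = [0] * 98 + [1, 1]
y_pred   = [0] * 98 + [0, 0]
severity = ["routine"] * 98 + ["critical"] * 2

overall, per_stratum = stratified_report(y_true, y_pred, severity)
print(f"overall accuracy: {overall:.0%}")                      # 98%
print(f"worst-case stratum: {min(per_stratum.values()):.0%}")  # 0%
```

Leaderboard-style reporting surfaces only the first number; a safety-partitioned benchmark would make the second the headline metric.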
Domain-specific test suites exist in narrow areas — autonomous vehicle perception benchmarks (nuScenes, KITTI, Waymo Open Dataset) measure detection accuracy but do not partition test cases by harm severity or systematically cover edge cases. ISO/IEC TS 22440 (Functional Safety and AI) is still in draft after years of development. The ML community's emphasis on leaderboard competition rewards average-case optimization and actively discourages investment in worst-case characterization, because exposing failure modes reduces published accuracy numbers. Safety engineering communities and ML research communities operate with fundamentally different evaluation philosophies — worst-case vs. average-case — and lack shared vocabulary, metrics, or incentive structures.
Public benchmark datasets partitioned by harm severity level, covering realistic deployment conditions (edge cases, distribution shifts, adversarial inputs) for specific safety-critical application domains. Standardized post-deployment monitoring frameworks that can detect ML model degradation before safety-relevant failures occur — analogous to structural health monitoring for physical infrastructure. Bridging frameworks that translate between ML performance metrics and safety engineering requirements (ASIL levels in automotive, SIL in industrial, DAL in aviation).
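As a rough illustration of the third output, the sketch below checks observed per-stratum error rates against severity-dependent bounds. The bound values and severity class names are hypothetical placeholders chosen for the example; they are not derived from ISO 26262, IEC 61508, DO-178C, or any published mapping.

```python
# Hypothetical bridging check (illustrative only): per-severity error-rate
# bounds. The numbers below are placeholders, not taken from any standard.
MAX_ERROR_RATE = {
    "negligible":   0.05,
    "marginal":     0.01,
    "critical":     0.001,
    "catastrophic": 0.0001,
}

def acceptance_check(observed_error_rates):
    """Return pass/fail per severity class against the assumed bounds."""
    return {
        severity: rate <= MAX_ERROR_RATE[severity]
        for severity, rate in observed_error_rates.items()
    }

# A model that meets the marginal-severity bound but not the critical one.
print(acceptance_check({"marginal": 0.004, "critical": 0.02}))
# {'marginal': True, 'critical': False}
```

A real bridging framework would also have to account for statistical confidence in the observed rates and for how hazard analysis assigns cases to severity classes, both of which this sketch ignores.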
A student team could construct a harm-severity-partitioned evaluation dataset for one safety-critical ML application — for example, medical image classification where test cases are ranked by clinical urgency, or structural health monitoring where defect images are ranked by failure consequence severity. Teams could systematically catalog known failure modes from incident reports and synthesize corresponding test cases. Alternatively, teams could prototype a post-deployment drift detection system for a specific ML pipeline using statistical process control methods adapted for high-dimensional data (a minimal sketch follows below). Relevant disciplines: computer science, safety engineering, systems engineering, and domain expertise in the chosen application area.
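A minimal sketch of that drift-detection idea, assuming the pipeline exposes batches of fixed-dimension embeddings: per-dimension statistics are fitted on reference data collected at deployment start, and each later batch is scored with a chi-square-style statistic against a rough three-sigma control limit. The reference-data size, batch size, control limit, and choice of statistic are all assumptions made for the example.

```python
# Minimal drift-detection sketch (illustrative only): a control-chart-style
# check on batch means of high-dimensional embeddings.
import numpy as np

def fit_reference(reference):
    """Estimate per-dimension mean/std from in-control reference data."""
    return reference.mean(axis=0), reference.std(axis=0, ddof=1)

def drift_score(batch, mu, sigma):
    """n * sum of squared standardized deviations of the batch mean
    (approximately chi-square with d degrees of freedom when in control)."""
    z = (batch.mean(axis=0) - mu) / sigma
    return len(batch) * float(np.sum(z ** 2))

def is_drifting(batch, mu, sigma):
    d = batch.shape[1]
    limit = d + 3.0 * np.sqrt(2.0 * d)   # rough 3-sigma chi-square bound
    return drift_score(batch, mu, sigma) > limit

rng = np.random.default_rng(0)
reference = rng.normal(size=(5000, 128))       # deployment-start embeddings
mu, sigma = fit_reference(reference)

ok_batch      = rng.normal(size=(200, 128))             # no shift
shifted_batch = rng.normal(loc=0.1, size=(200, 128))    # small mean shift
print(is_drifting(ok_batch, mu, sigma))       # expected: False
print(is_drifting(shifted_batch, mu, sigma))  # expected: True
```

In practice the monitored statistic would be chosen per pipeline (embedding means, prediction-confidence distributions, input feature summaries), and alarms would feed an incident-review process rather than act automatically.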
Distinct from `digital-ml-component-formal-verification` (which covers formal verification scalability for neural networks) and `digital-safe-rl-exploration-guarantees` (which covers safe exploration in reinforcement learning). This brief addresses the missing evaluation infrastructure — benchmark datasets and monitoring frameworks — without which formal methods and safe RL cannot be validated against real-world safety requirements. Related to `digital-ai-trustworthiness-heterogeneous-verification` but focuses specifically on the dataset/evaluation gap rather than the verification methodology gap. Source-bias note: NASEM frames this as requiring "a sustained, coordinated research effort"; the binding constraint is genuinely a data and evaluation infrastructure gap, compounded by the cultural divide between ML research (average-case) and safety engineering (worst-case) communities.
National Academies of Sciences, Engineering, and Medicine, Computer Science and Telecommunications Board, "Machine Learning for Safety-Critical Applications: Opportunities, Challenges, and a Research Agenda," 2025, https://nap.nationalacademies.org/catalog/27970 (accessed 2026-02-20).