Companion repo for robustness specification in biomedical foundation models (BFMs)
Please cite the following preprint for reference:

```bibtex
@misc{xian_robustness_2024,
  address = {Rochester, NY},
  type = {{SSRN} {Scholarly} {Paper}},
  title = {Robustness tests for biomedical foundation models should tailor to specification},
  url = {https://papers.ssrn.com/abstract=5013799},
  doi = {10.2139/ssrn.5013799},
  language = {en},
  urldate = {2024},
  publisher = {Social Science Research Network},
  author = {Xian, Patrick and Baker, Noah R. and David, Tom and Cui, Qiming and Holmgren, A. Jay and Bauer, Stefan and Sushil, Madhumita and Abbasi-Asl, Reza},
  month = jan,
  year = {2024},
  keywords = {AI policy, foundation model, health AI, robustness},
}
```
We carried out the search for BFMs using a few existing GitHub repositories, review papers, and direct internet searches. We selected a total of about 50 representative BFMs (mostly published in 2023-2024) from publications and preprints, covering a broad range of biomedical domains. We then extracted the relevant information on the model name, developers, modality, domain, capabilities, and any robustness tests described for each model. The information is gathered here. In the following, we break down the claimed robustness tests conducted for the BFMs. While about a third of the models have no explicit robustness test, a small number have been subject to multiple ones, which is why the percentages below sum to slightly more than 100%.
- 32% None
- 32% Evaluation on multiple existing datasets (including public datasets used for finetuning)
- 16% Evaluation on artificially shifted datasets (including perturbed and synthetic datasets)
- 8% Evaluation on external datasets (datasets not used in development)
- 8% Ablation studies
- 6% Others
"None" indicates that no robustness test was specified. Most claimed robustness tests for the selected BFMs involve evaluating model performance on some set of datasets, which we divide into three types: existing datasets, artificially shifted datasets, and external datasets; a minimal sketch of the artificially-shifted type follows below.
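To make this test type concrete, here is a minimal sketch of an evaluation on an artificially shifted dataset: held-out inputs are perturbed with additive Gaussian noise at increasing severity, and accuracy is tracked across severities. The names `model`, `x_test`, and `y_test` are hypothetical placeholders for a classifier with a scikit-learn-style `predict` method and its evaluation data; they do not refer to anything in this repo or the paper.

```python
import numpy as np

def accuracy(model, x, y):
    """Fraction of correctly predicted labels."""
    return float(np.mean(model.predict(x) == y))

def noise_robustness_curve(model, x_test, y_test, sigmas=(0.0, 0.05, 0.1, 0.2)):
    """Accuracy under additive Gaussian input noise, one entry per severity level."""
    rng = np.random.default_rng(seed=0)  # fixed seed so the perturbation is reproducible
    curve = {}
    for sigma in sigmas:
        # sigma = 0.0 recovers the unperturbed baseline accuracy
        x_shifted = x_test + rng.normal(loc=0.0, scale=sigma, size=x_test.shape)
        curve[sigma] = accuracy(model, x_shifted, y_test)
    return curve

# Usage (hypothetical): curve = noise_robustness_curve(clf, x_test, y_test)
```

A sharp accuracy drop between adjacent severities flags brittleness to this particular shift; in practice, the perturbation should mimic corruptions plausible for the modality (e.g., scanner noise for imaging or token-level edits for clinical text).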
Below, a combination of theoretical and application-oriented resources on robustness is collected. The categorization of robustness follows that provided in the paper.
Robustness in the context of foundation models
- A.I. Robustness: a Human-Centered Perspective on Technological Challenges and Opportunities, ACM Comput. Surv. (2025)
- Robustness at Inference: Towards Explainability, Uncertainty, and Intervenability, CVPR Tutorial (2024)
- Machine Learning Robustness: A Primer, arXiv:2404.00897
- Spurious Correlations in Machine Learning: A Survey, arXiv:2402.12715
- AI Maintenance: A Robustness Perspective, IEEE Computer (2023)
- Foundational Robustness of Foundation Models, NeurIPS Tutorial (2022)
- A scoping review of robustness concepts for machine learning in healthcare, npj Digit. Med. (2025)
- Toward a framework for risk mitigation of potential misuse of artificial intelligence in biomedical research, Nat. Mach. Intell. (2024)
- SoK: Security and Privacy Risks of Medical AI, arXiv:2409.07415
- Ethical and regulatory challenges of large language models in medicine, Lancet Digit. Health (2024)
- Developing robust benchmarks for driving forward AI innovation in healthcare, Nat. Mach. Intell. (2022)
- Shifting machine learning for healthcare from development to deployment and from models to data, Nat. Biomed. Eng. (2022)
- Secure and Robust Machine Learning for Healthcare: A Survey, IEEE Rev. Biomed. Eng. (2020)
- Prompting is a Double-Edged Sword: Improving Worst-Group Robustness of Foundation Models, ICML (2024)
- Improving Group Robustness on Spurious Correlation Requires Preciser Group Inference, ICML (2024)
- Controllable Prompt Tuning For Balancing Group Distributional Robustness, ICML (2024)
- Multigroup Robustness, ICML (2024)
- Change is Hard: A Closer Look at Subpopulation Shift, ICML (2023)
- Improving Out-of-Distribution Robustness via Selective Augmentation, ICML (2022)
- Just Train Twice: Improving Group Robustness without Training Group Information, ICML (2021)
- No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems, NeurIPS (2020)
- Characterizing Data Point Vulnerability as Average-Case Robustness, UAI (2024)
- Characterizing the Impacts of Instances on Robustness, ACL (2023)
- Achievable distributional robustness when the robust risk is only partially identified, NeurIPS (2024)
- Causality-oriented robustness: exploiting general additive interventions, arXiv:2307.10299
- Certified Robustness Against Natural Language Attacks by Causal Intervention, ICML (2022)
- Towards Causal Representation Learning, Proc. IEEE (2021)
- Provable Guarantees on the Robustness of Decision Rules to Causal Interventions, IJCAI (2021)
- A causal view on robustness of neural networks, NeurIPS (2020)
- Invariance, Causality and Robustness, Statist. Sci. (2020)
- Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach, arXiv:2502.06832
- Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness, arXiv:2408.05446
- On the Adversarial Robustness of Mixture of Experts, NeurIPS (2022)
- Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness, ICLR (2025)
- Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?, ACL BioNLP Workshop (2024)
- Uncertainty-Aware Pre-Trained Foundation Models for Patient Risk Prediction via Gaussian Process, WWW (2024)
- Pathophysiological Features in Electronic Medical Records Sustain Model Performance under Temporal Dataset Shift, AMIA (2024)
- Temporal Robustness against Data Poisoning, NeurIPS (2023)
- Stable clinical risk prediction against distribution shift in electronic health records, Patterns (2023)
- EHR foundation models improve robustness in the presence of temporal distribution shift, Sci. Rep. (2023)
- DeepJoint: Robust Survival Modelling Under Clinical Presence Shift, NeurIPS Workshop (2022)
- Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine, Sci. Rep. (2022)
- Longitudinal Adversarial Attack on Electronic Health Records Data, WWW (2019)
- Current Pathology Foundation Models are unrobust to Medical Center Differences, arXiv:2501.18055
- Distilling foundation models for robust and efficient models in digital pathology, arXiv:2501.16239
- The Impact of Scanner Domain Shift on Deep Learning Performance in Medical Imaging: an Experimental Study, arXiv:2409.04368
- A multi-center study on the adaptability of a shared foundation model for electronic health records, npj Digit. Med. (2024)
- Enhancing Robustness of Foundation Model Representations under Provenance-related Distribution Shifts, NeurIPS DistShift Workshop (2023)
- Prompt injection attacks on vision language models in oncology, Nat. Commun. (2025)
- Medical large language models are susceptible to targeted misinformation attacks, npj Digit. Med. (2024)
- PromptSmooth: Certifying Robustness of Medical Vision-Language Models via Prompt Learning, MICCAI (2024)
- BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning, MICCAI (2024)
- Large Diverse Ensembles for Robust Clinical NLI, SemEval (2024)
- MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering, arXiv:2406.06573
- Adversarial Attacks on Large Language Models in Medicine, arXiv:2406.12259
- Poisoning medical knowledge using large language models, Nat. Mach. Intell. (2024)
- Backdoor Attack on Unpaired Medical Image-Text Foundation Models: A Pilot Study on MedCLIP, SaTML (2024)
- Assessing biomedical knowledge robustness in large language models by query-efficient sampling attacks, TMLR (2024)
- Demonstration of an Adversarial Attack Against a Multimodal Vision Language Model for Pathology Imaging, ISBI (2024)
- Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging, Nat. Biomed. Eng. (2023)
- An Auditing Test To Detect Behavioral Shift in Language Models, ICLR (2025)
- Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs, arXiv:2407.15549
- Robust Conversational Agents against Imperceptible Toxicity Triggers, NAACL (2022)
- Evaluating the Robustness of Adverse Drug Event Classification Models using Templates, ACL BioNLP Workshop (2024)
- Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness, SemEval (2024)
- Evaluating the Robustness of Biomedical Concept Normalization, NeurIPS TLNLP Workshop (2022)
- Improving the robustness and accuracy of biomedical language models through adversarial training, J. Biomed. Inform. (2022)
- Adversarial attacks in radiology – A systematic review, Eur. J. Radiol. (2023)
- Adversarial attacks and adversarial robustness in computational pathology, Nat. Commun. (2022)
- Advancing diagnostic performance and clinical usability of neural networks via adversarial training and dual batch normalization, Nat. Commun. (2021)
- Adversarial attacks on medical machine learning, Science (2019)
- SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks, arXiv:2411.19688
- Scalable Drift Monitoring in Medical Imaging AI, arXiv:2410.13174
- The Data Addition Dilemma, arXiv:2408.04154
- Empirical data drift detection experiments on real-world medical imaging data, Nat. Commun. (2024)
- Off-label use of artificial intelligence models in healthcare, Nat. Med. (2024)
- Understanding Liability Risk from Using Health Care Artificial Intelligence Tools, N. Engl. J. Med. (2024)
- Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data, NeurIPS (2023)
- External validation of AI models in health should be replaced with recurring local validation, Nat. Med. (2023)
- Diagnosing and remediating harmful data shifts for the responsible deployment of clinical AI models, medRxiv:2023.03.26.23286718
- Evaluating Robustness to Dataset Shift via Parametric Robustness Sets, NeurIPS (2022)
- A Fine-Grained Analysis on Distribution Shift, ICLR (2022)
- Mandoline: Model Evaluation under Distribution Shift, ICML (2021)