This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
CARDBiomedBench: a benchmark for evaluating the performance of large language models in biomedical research
Citations: 1
Authors: 24
Year: 2026
Abstract
Although large language models (LLMs) have the potential to transform biomedical research, their ability to reason accurately across complex, data-rich domains remains unproven. To address this research gap, we introduce CARDBiomedBench, a large-scale question-and-answer benchmark for evaluating LLMs in biomedical science. This pilot release focuses on neurodegenerative disease research, a field requiring the integration of genomics, pharmacology, and statistical reasoning. CARDBiomedBench includes more than 68 000 curated question-answer pairs generated through expert annotation and structured data augmentation. The questions span ten biological categories and nine reasoning types, based on publicly available resources, such as genome-wide association studies, summary-data-based Mendelian randomisation results, and regulatory drug databases. We assessed model responses using BioScore, a rubric-based evaluation system that measures response accuracy (response quality rate, RQR) and the ability to abstain from incorrect answers (safety rate). Testing 18 state-of-the-art LLMs revealed considerable gaps. Claude-3.5-Sonnet achieved high caution but low accuracy (safety rate 75%, RQR 24%), whereas GPT-4.1 showed the opposite trade-off (safety rate 7%, RQR 51%). No model showed a successful balance of both metrics. CARDBiomedBench provides a new standard for benchmarking biomedical LLMs, revealing key limitations in existing models and offering a scalable path towards safer, more effective artificial intelligence systems in scientific research.
Similar works
Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology
2015 · 31,001 citations
A global reference for human genetic variation
2015 · 19,486 citations
The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data
2012 · 18,083 citations
ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data
2010 · 15,279 citations
A method and server for predicting damaging missense mutations
2010 · 13,447 citations
Authors
- Owen Bianchi
- Maya Willey
- Owen Bianchi
- Benjamin Danek
- Marzieh Khani
- Nicole Kuznetsov
- Anant Dadu
- Syed Muhammad Tahir Shah
- Mathew J Koretsky
- Mary B. Makarious
- Cory A. Weller
- Kristin S Levine
- Sungwon Kim
- Paige Jarreau
- Dan Vitale
- Elise Marsan
- Hirotaka Iwaki
- Iftikhar J. Kullo
- Sara Bandres-Ciga
- Andrew B Singleton
- Mike A Nalls
- Shekoofeh Mokhtari
- Daniel Khashabi
- Faraz Faghri