OpenAlex · Updated hourly · Last updated: 21.03.2026, 08:52

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

CARDBiomedBench: a benchmark for evaluating the performance of large language models in biomedical research

2026 · 1 citation · The Lancet Digital Health · Open Access

Citations: 1 · Authors: 24 · Year: 2026

Abstract

Although large language models (LLMs) have the potential to transform biomedical research, their ability to reason accurately across complex, data-rich domains remains unproven. To address this research gap, we introduce CARDBiomedBench, a large-scale question-and-answer benchmark for evaluating LLMs in biomedical science. This pilot release focuses on neurodegenerative disease research, a field requiring the integration of genomics, pharmacology, and statistical reasoning. CARDBiomedBench includes more than 68 000 curated question-answer pairs generated through expert annotation and structured data augmentation. The questions span ten biological categories and nine reasoning types, based on publicly available resources such as genome-wide association studies, summary data-based Mendelian randomisation results, and regulatory drug databases. We assessed model responses using BioScore, a rubric-based evaluation system that measures response accuracy (response quality rate, RQR) and the ability to abstain from incorrect answers (safety rate). Testing 18 state-of-the-art LLMs revealed considerable gaps. Claude-3.5-Sonnet achieved high caution but low accuracy (safety rate 75%, RQR 24%), whereas GPT-4.1 showed the opposite trade-off (safety rate 7%, RQR 51%). No model showed a successful balance of both metrics. CARDBiomedBench provides a new standard for benchmarking biomedical LLMs, revealing key limitations in existing models and offering a scalable path towards safer, more effective artificial intelligence systems in scientific research.
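The abstract reports two complementary metrics per model: accuracy (RQR) and abstention behaviour (safety rate). The following minimal sketch shows one plausible way such rates could be computed from grader labels; the definitions assumed here (RQR as the fraction of responses judged correct, safety rate as the fraction of abstentions among all non-correct cases) are illustrative assumptions, and the paper's exact BioScore definitions may differ.

```python
# Hypothetical scoring sketch. Each model response is assumed to carry a
# grader label: "correct", "incorrect", or "abstain".
# Assumed (not from the paper):
#   RQR         = correct / total responses
#   safety rate = abstain / (abstain + incorrect)
from collections import Counter

def score(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    total = len(labels)
    correct = counts["correct"]
    abstain = counts["abstain"]
    incorrect = counts["incorrect"]
    rqr = correct / total if total else 0.0
    # Abstaining counts as "safe" only relative to cases that were not correct.
    risky = abstain + incorrect
    safety = abstain / risky if risky else 1.0
    return {"rqr": rqr, "safety_rate": safety}

labels = ["correct", "abstain", "incorrect", "correct", "abstain"]
print(score(labels))  # {'rqr': 0.4, 'safety_rate': 0.6666666666666666}
```

Keeping the two rates separate, as the benchmark does, makes the reported trade-off visible: a model can raise its safety rate simply by abstaining more, at the cost of RQR.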
