This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
RareArena: a comprehensive benchmark dataset unveiling the potential of large language models in rare disease diagnosis
Citations: 0
Authors: 15
Year: 2026
Abstract
Rare diseases pose a substantial clinical and public health burden, and timely, accurate diagnosis remains a formidable challenge in many countries and settings. Large language models (LLMs) have the potential to enhance the screening and diagnosis of rare diseases. However, existing evaluations of these LLM capabilities are insufficient because of small sample sizes, poor coverage of different rare diseases, and weak real-world clinical relevance. To address these issues, we developed an automated data collection pipeline, RareArena, to curate a comprehensive rare disease dataset from PubMed Central case reports. Two tasks were considered: rare disease screening (RDS), which involved recognising a rare disease before confirmatory diagnostic tests; and rare disease confirmation (RDC), which required diagnosis with complete information. The constructed RDS task comprised 49 760 cases of 4597 rare diseases, whereas the RDC task comprised 22 901 cases of 3522 rare diseases. Human evaluations by two physicians across three dimensions (leakage, fidelity, and complexity) validated the high quality of our dataset. We systematically benchmarked ten state-of-the-art LLMs of varying sizes (GPT-4o, Llama3.1-70B, Llama3.1-8B, Qwen2.5-72B, Qwen2.5-7B, Gemma2-9B, Phi3-7B, Llama3-70B-UltraMedical, Llama3-8B-UltraMedical, and MedLlama-7B) and found that GPT-4o achieved the best performance in both tasks (RDS: 33·1% top-1 recall, 56·9% top-5 recall; RDC: 64·2% top-1 recall, 85·9% top-5 recall). Additionally, GPT-4o performed better on genetically inherited diseases than on non-genetically inherited diseases in both tasks, with its highest performance observed in systemic or rheumatologic diseases (RDS: 48·5% top-1 recall, 76·5% top-5 recall; RDC: 76·0% top-1 recall, 93·5% top-5 recall). To our knowledge, our dataset represents the largest rare disease diagnostic benchmark to date, featuring broad and deep disease coverage and reflecting clinically relevant scenarios. RareArena highlights the potential of LLMs in rare disease screening and diagnosis, providing a foundation for improved care for individuals with rare diseases around the world.
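The abstract reports results as top-1 and top-5 recall. As a rough illustration of what such a metric measures, the sketch below counts a case as a hit when the reference diagnosis appears among the first k ranked predictions. The function names, the toy cases, and the use of normalised exact string matching are all assumptions made for illustration; the abstract does not describe how RareArena decides whether a predicted diagnosis matches the reference.

```python
from typing import Sequence


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def top_k_recall(
    predictions: Sequence[Sequence[str]], gold: Sequence[str], k: int
) -> float:
    """Fraction of cases whose gold diagnosis is among the first k ranked predictions."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one case")
    hits = 0
    for ranked, truth in zip(predictions, gold):
        # Assumption: exact match after normalisation; a real benchmark
        # may use synonym mapping or a judge model instead.
        top_k = {normalize(p) for p in ranked[:k]}
        if normalize(truth) in top_k:
            hits += 1
    return hits / len(gold)


# Hypothetical toy cases, not taken from RareArena.
gold = ["Fabry disease", "Marfan syndrome", "Wilson disease", "Gaucher disease"]
predictions = [
    ["Fabry disease", "Gaucher disease"],
    ["Ehlers-Danlos syndrome", "Marfan syndrome"],
    ["Hemochromatosis", "Autoimmune hepatitis"],
    ["gaucher  disease", "Niemann-Pick disease"],
]

top1 = top_k_recall(predictions, gold, k=1)
top5 = top_k_recall(predictions, gold, k=5)
print(f"top-1 recall: {top1:.2f}, top-5 recall: {top5:.2f}")
```

In this toy setup the gap between the two numbers comes from cases where the correct diagnosis is ranked second or lower, which is the same kind of gap the abstract reports between top-1 and top-5 recall.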
Institutions
- Chinese Academy of Medical Sciences & Peking Union Medical College (CN)
- Beijing Tsinghua Chang Gung Hospital (CN)
- Tsinghua University (CN)
- Peking Union Medical College Hospital (CN)
- Singapore National Eye Center (SG)
- Beihang University (CN)
- University College London (GB)
- University of Glasgow (GB)
- Shanghai Jiao Tong University (CN)