This is an overview page with metadata for this scientific work. The full article is available from the publisher.

RareArena: a comprehensive benchmark dataset unveiling the potential of large language models in rare disease diagnosis

2026 · 0 citations · The Lancet Digital Health · Open Access

Abstract

Rare diseases pose a substantial clinical and public health burden, and timely, accurate diagnosis remains a formidable challenge in many countries and settings. Large language models (LLMs) have the potential to enhance the screening and diagnosis of rare diseases. However, existing evaluations of these capabilities are insufficient owing to small sample sizes, poor coverage of different rare diseases, and weak real-world clinical relevance. To address these issues, we developed an automated data collection pipeline, RareArena, to curate a comprehensive rare disease dataset from PubMed Central case reports. Two tasks were considered: rare disease screening (RDS), which involved rare disease recognition before confirmatory diagnostic tests; and rare disease confirmation (RDC), which required diagnosis with complete information. The constructed RDS task comprised 49 760 cases of 4597 rare diseases, whereas the RDC task comprised 22 901 cases of 3522 rare diseases. Human evaluations by two physicians across three dimensions (leakage, fidelity, and complexity) validated the high quality of our dataset. We systematically benchmarked ten state-of-the-art LLMs of varying sizes (GPT-4o, Llama3.1-70B, Llama3.1-8B, Qwen2.5-72B, Qwen2.5-7B, Gemma2-9B, Phi3-7B, Llama3-70B-UltraMedical, Llama3-8B-UltraMedical, and MedLlama-7B) and found that GPT-4o achieved the best performance in both tasks (RDS: 33·1% top-1 recall, 56·9% top-5 recall; RDC: 64·2% top-1 recall, 85·9% top-5 recall). Additionally, GPT-4o performed better on genetically inherited diseases than on non-genetically inherited diseases in both tasks, with its highest performance observed in systemic or rheumatologic diseases (RDS: 48·5% top-1 recall, 76·5% top-5 recall; RDC: 76·0% top-1 recall, 93·5% top-5 recall). To our knowledge, our dataset represents the largest rare disease diagnostic benchmark to date, featuring broad and deep disease coverage and reflecting clinically relevant scenarios.
RareArena highlights the potential of LLMs in rare disease screening and diagnosis, providing a foundation for improved care for individuals with rare diseases around the world.
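
The top-1 and top-5 recall figures quoted in the abstract can be read as the share of cases in which the true diagnosis appears among a model's first k ranked candidates. The sketch below illustrates that reading only. The function name `top_k_recall`, the case-insensitive exact-match rule, and the toy data are all assumptions made for illustration; the abstract does not say how RareArena matches a predicted diagnosis to the reference diagnosis.

```python
from typing import Sequence


def top_k_recall(
    predictions: Sequence[Sequence[str]],
    gold: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose gold diagnosis is among the first k ranked predictions."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one case is required")

    def norm(name: str) -> str:
        # Assumption: case-insensitive exact match stands in for the
        # benchmark's real matching rule, which the abstract does not describe.
        return name.strip().lower()

    hits = sum(
        norm(truth) in {norm(p) for p in ranked[:k]}
        for ranked, truth in zip(predictions, gold)
    )
    return hits / len(gold)


# Toy data, invented for illustration only (not drawn from RareArena).
toy_predictions = [
    ["Fabry disease", "Gaucher disease", "Pompe disease"],
    ["Marfan syndrome", "Ehlers-Danlos syndrome", "Loeys-Dietz syndrome"],
    ["Wilson disease", "Hemochromatosis", "Alpha-1 antitrypsin deficiency"],
    ["Behcet disease", "Takayasu arteritis", "Sarcoidosis"],
]
toy_gold = ["Fabry disease", "Loeys-Dietz syndrome", "Hemochromatosis", "Still disease"]

top1 = top_k_recall(toy_predictions, toy_gold, k=1)  # 1 of 4 cases hit
top3 = top_k_recall(toy_predictions, toy_gold, k=3)  # 3 of 4 cases hit
```

Under this reading, top-k recall can only rise or stay equal as k grows, which is consistent with the top-5 figures in the abstract exceeding the corresponding top-1 figures.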

Topics

Genomics and Rare Diseases · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
