Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning
1
Zitationen
32
Autoren
2025
Jahr
Abstract
Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice-questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). The dataset underwent multiple rounds of expert checking. Duplicate and substandard questions were systematically removed. Ten ophthalmologists refined the explanations of each MCQ's correct answer. This was further adjudicated by three senior ophthalmologists. To illustrate BELO's utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further evaluation involving human experts, two ophthalmologists qualitatively reviewed 50 randomly selected outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, the BELO dataset will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.
Ähnliche Arbeiten
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8.214 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.071 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7.429 Zit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5.776 Zit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5.418 Zit.
Autoren
- Sahana Srinivasan
- X. C. Ai
- Thaddaeus Wai Soon Lo
- Aidan Gilson
- Minjie Zou
- Ke Zou
- Hyunjae Kim
- Ming Yang
- Krithi Pushpanathan
- Samantha Min Er Yew
- Wan Ting Loke
- Jocelyn Hui Lin Goh
- Yibing Chen
- Yiming Kong
- Evelyn X. Fu
- Michelle Hui
- Kristen Nwanyanwu
- Amisha Dave
- Kelvin Zhenghao Li
- Chen‐Hsin Sun
- Mark A. Chia
- Guang Yang
- Wendy Wong
- David Ziyou Chen
- Dianbo Liu
- Maxwell Singer
- Fares Antaki
- Lucian V. Del Priore
- Jost Jonas
- Ron A. Adelman
- Qingyu Chen
- Yih Chung Tham