This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Comparative performance of GPT-4, GPT-o3, GPT-5, Gemini-3-Flash, and DeepSeek-R1 in ophthalmology question answering
Citations: 0
Authors: 7
Year: 2026
Abstract
Background: The application of large language models (LLMs) in medicine is rapidly advancing, showing particular promise in specialized fields like ophthalmology. However, existing research has predominantly focused on validating individual models, with a notable scarcity of systematic comparisons between multiple state-of-the-art LLMs.

Objective: To systematically evaluate the performance of GPT-4, GPT-o3, GPT-5, Gemini-3-Flash, and DeepSeek-R1 on ophthalmology question-answering tasks, with a specific focus on response consistency and factual accuracy.

Methods: A total of 300 single-best-answer multiple-choice questions were sampled from the StatPearls ophthalmology question bank. The questions were categorized into four difficulty levels (Levels 1-4) based on the inherent difficulty ratings provided by the database. Each model provided independent answers three times under two distinct prompting strategies: a direct neutral prompt and a role-based prompt. Fleiss' kappa (κ) was used to assess inter-run response consistency, and overall accuracy was employed as the primary performance metric.

Results: Accuracy: Gemini-3-Flash achieved the highest overall accuracy (83.3%), followed by GPT-o3 (79.2%) and DeepSeek-R1 (74.4%). GPT-4 (69.9%) and GPT-5 (69.1%) demonstrated the lowest accuracies. Consistency: GPT-o3 demonstrated the highest decision stability (κ = 0.966), followed by DeepSeek-R1 (κ = 0.904) and Gemini-3-Flash (κ = 0.860). GPT-5 exhibited the lowest stability (κ = 0.668). Influencing factors: Prompting strategies did not significantly affect model accuracy. While Gemini-3-Flash remained stable across difficulty levels, DeepSeek-R1 and GPT-o3 showed enhanced relative performance on more complex tasks.

Conclusion: GPT-o3 and Gemini-3-Flash achieve superior stability and accuracy in ophthalmology question answering (QA), making them suitable for high-stakes clinical decision support. The open-source model DeepSeek-R1 shows competitive potential, especially on complex tasks. Notably, GPT-5 failed to surpass its predecessor in both accuracy and consistency in this specialized domain. Prompt engineering has a limited impact on performance for closed-ended medical questions. Future work should extend to multimodal integration and real-world clinical validation to enhance the practical utility and reliability of LLMs in medicine.
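The consistency metric described in the Methods, Fleiss' kappa across three independent runs per question, can be computed directly from the per-question answer labels. The sketch below is an illustrative implementation of the standard Fleiss' kappa formula, not the authors' actual analysis code, and the sample answer data is hypothetical:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-run agreement.

    ratings: list of per-question answer lists, one entry per question;
    each inner list holds the n answers (e.g. from n=3 runs) for that question.
    """
    n = len(ratings[0])            # raters (runs) per question
    N = len(ratings)               # number of questions
    cats = sorted({c for row in ratings for c in row})

    # Count table: n_ij = how many runs chose category j for question i.
    table = [[Counter(row)[c] for c in cats] for row in ratings]

    # Mean per-question agreement P_bar.
    p_bar = sum((sum(x * x for x in row) - n) / (n * (n - 1))
                for row in table) / N

    # Chance agreement P_e from marginal category proportions.
    totals = [sum(row[j] for row in table) for j in range(len(cats))]
    p_e = sum((t / (N * n)) ** 2 for t in totals)

    if p_e == 1.0:
        # Degenerate case: every run chose the same single category
        # for every question; by convention treat as perfect agreement.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical three-run answers for four questions:
answers = [["A", "A", "A"], ["B", "B", "B"], ["C", "C", "A"], ["D", "D", "D"]]
kappa = fleiss_kappa(answers)
```

A kappa near 1 (as reported for GPT-o3, κ = 0.966) indicates that the model's answer rarely changes between runs; values closer to 0 indicate agreement no better than chance.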