OpenAlex · Updated hourly · Last updated: May 1, 2026, 22:37

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative performance of GPT-4, GPT-o3, GPT-5, Gemini-3-Flash, and DeepSeek-R1 in ophthalmology question answering

2026 · 0 citations · 7 authors · Frontiers in Cell and Developmental Biology · Open Access


Abstract

Background: The application of large language models (LLMs) in medicine is advancing rapidly, with particular promise in specialized fields such as ophthalmology. However, existing research has predominantly validated individual models; systematic comparisons across multiple state-of-the-art LLMs remain scarce.

Objective: To systematically evaluate the performance of GPT-4, GPT-o3, GPT-5, Gemini-3-Flash, and DeepSeek-R1 on ophthalmology question-answering tasks, with a specific focus on response consistency and factual accuracy.

Methods: A total of 300 single-best-answer multiple-choice questions were sampled from the StatPearls ophthalmology question bank and categorized into four difficulty levels (Levels 1-4) based on the inherent difficulty ratings provided by the database. Each model answered every question independently three times under two distinct prompting strategies: a direct neutral prompt and a role-based prompt. Fleiss' kappa (κ) was used to assess inter-run response consistency, and overall accuracy served as the primary performance metric.

Results: Accuracy: Gemini-3-Flash achieved the highest overall accuracy (83.3%), followed by GPT-o3 (79.2%) and DeepSeek-R1 (74.4%); GPT-4 (69.9%) and GPT-5 (69.1%) demonstrated the lowest accuracies. Consistency: GPT-o3 demonstrated the highest decision stability (κ = 0.966), followed by DeepSeek-R1 (κ = 0.904) and Gemini-3-Flash (κ = 0.860); GPT-5 exhibited the lowest stability (κ = 0.668). Influencing factors: Prompting strategy did not significantly affect model accuracy. While Gemini-3-Flash remained stable across difficulty levels, DeepSeek-R1 and GPT-o3 showed enhanced relative performance on more complex tasks.

Conclusion: GPT-o3 and Gemini-3-Flash achieve superior stability and accuracy in ophthalmology question answering (QA), making them suitable candidates for high-stakes clinical decision support. The open-source model DeepSeek-R1 shows competitive potential, especially on complex tasks. Notably, GPT-5 failed to surpass its predecessor in either accuracy or consistency in this specialized domain. Prompt engineering had limited impact on performance for closed-ended medical questions. Future work should extend to multimodal integration and real-world clinical validation to enhance the practical utility and reliability of LLMs in medicine.
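For readers unfamiliar with the consistency metric, the sketch below shows how Fleiss' kappa can be computed from repeated model runs on the same question set. This is a minimal illustration, not the authors' evaluation pipeline: the three-run answer data, option labels (A-D), and function name are invented for demonstration.

# Minimal sketch of Fleiss' kappa for inter-run answer consistency.
# All data below are hypothetical; the paper's actual runs are not public in this abstract.

from collections import Counter

def fleiss_kappa(ratings, categories):
    """ratings: one list of answer labels per question (one label per run).
    categories: iterable of all possible answer labels (e.g. "ABCD")."""
    N = len(ratings)        # number of questions
    n = len(ratings[0])     # runs per question (3 in the study)
    counts = [Counter(r) for r in ratings]  # n_ij: runs assigning question i to category j
    # Per-question observed agreement P_i = (sum_j n_ij^2 - n) / (n(n-1))
    P = [(sum(c[j] ** 2 for j in categories) - n) / (n * (n - 1)) for c in counts]
    P_bar = sum(P) / N
    # Chance agreement from marginal category proportions p_j
    p = [sum(c[j] for c in counts) / (N * n) for j in categories]
    P_e = sum(pj ** 2 for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 5 questions, 3 independent runs each, options A-D
runs = [["A", "A", "A"], ["B", "B", "C"], ["D", "D", "D"], ["A", "C", "A"], ["B", "B", "B"]]
print(round(fleiss_kappa(runs, "ABCD"), 3))  # -> 0.63

On this toy input κ ≈ 0.63, i.e. substantial but imperfect agreement; values near the study's reported κ = 0.966 (GPT-o3) would require nearly identical answers across all three runs.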
