This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Performance of successive generative pretrained transformers (GPT) models in medical cases and board style questions
Citations: 1
Authors: 7
Year: 2026
Abstract
Large language models (LLMs) are evolving rapidly, yet their performance trajectory in specialized medical domains remains incompletely characterized. We evaluated the diagnostic and knowledge-based accuracy of six successive generative pre-trained transformer (GPT) models to test the hypothesis that performance gains are beginning to plateau. We conducted a comparative evaluation of GPT-3.5 Turbo, GPT-4-Turbo, GPT-4o, GPT-4.1, GPT-o3, and GPT-5 using two datasets: 78 sleep medicine case vignettes to assess diagnostic reasoning, and 897 sleep medicine board-style multiple choice questions (MCQs) to assess domain knowledge. Diagnostic accuracy improved across model generations on clinical vignettes, from 74.4% (58/78) for GPT-3.5 Turbo to 93.6% (73/78) for GPT-o3 and 91.0% (71/78) for GPT-5. A similar trend occurred for MCQs, increasing from 56.9% for GPT-3.5 Turbo to 93.0% for GPT-5. Pairwise comparisons confirmed significant improvements for advanced models over earlier iterations on both tasks (P < 0.05), and the most recent models demonstrated high levels of clinical competency. These results suggest that the latest LLMs may be approaching a high level of performance in medical tasks of sleep medicine diagnosis and knowledge retrieval. Future progress may require incorporation of curated medical datasets and domain-specific training to achieve clinical-grade reliability.
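The vignette accuracies quoted in the abstract follow directly from the reported counts out of 78 cases. The sketch below recomputes them and, as an illustration only, attaches a 95% Wilson score interval to each. The interval is an assumption added here to show the uncertainty that comes with a 78-case sample; the abstract does not report confidence intervals or name the statistical test behind its pairwise comparisons. The function names `accuracy_pct` and `wilson_interval` are hypothetical helpers, not anything taken from the paper.

```python
import math

# Correct-diagnosis counts out of 78 vignettes, as reported in the abstract.
VIGNETTE_TOTAL = 78
correct = {"GPT-3.5 Turbo": 58, "GPT-o3": 73, "GPT-5": 71}


def accuracy_pct(k: int, n: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * k / n, 1)


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n.

    Illustrative assumption: this interval is not reported in the abstract.
    """
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for model, k in correct.items():
    lo, hi = wilson_interval(k, VIGNETTE_TOTAL)
    print(
        f"{model}: {accuracy_pct(k, VIGNETTE_TOTAL)}% "
        f"(95% Wilson CI {100 * lo:.1f}-{100 * hi:.1f}%)"
    )
```

With only 78 vignettes, intervals of this kind are fairly wide, so a gap of two cases such as that between GPT-o3 and GPT-5 is not necessarily a meaningful difference.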
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,436 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,311 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,753 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,523 citations