OpenAlex · Updated hourly · Last updated: 12 April 2026, 09:09

This is an overview page with metadata for this scientific work. The full article is available from the publisher.

Performance of successive generative pretrained transformers (GPT) models in medical cases and board style questions

2026 · 1 citation · Scientific Reports · Open Access

Citations: 1 · Authors: 7 · Year: 2026

Abstract

Large language models (LLMs) are evolving rapidly, yet their performance trajectory in specialized medical domains remains incompletely characterized. We evaluated the diagnostic and knowledge-based accuracy of six successive generative pre-trained transformer (GPT) models to test the hypothesis that performance gains are beginning to plateau. We conducted a comparative evaluation of GPT-3.5 Turbo, GPT-4-Turbo, GPT-4o, GPT-4.1, GPT-o3, and GPT-5 using two datasets: 78 sleep medicine case vignettes to assess diagnostic reasoning, and 897 sleep medicine board-style multiple choice questions (MCQs) to assess domain knowledge. Diagnostic accuracy improved across model generations on clinical vignettes, from 74.4% (58/78) for GPT-3.5 Turbo to 93.6% (73/78) for GPT-o3 and 91.0% (71/78) for GPT-5. A similar trend occurred for MCQs, increasing from 56.9% for GPT-3.5 Turbo to 93.0% for GPT-5. Pairwise comparisons confirmed significant improvements for advanced models over earlier iterations on both tasks (P < 0.05), and the most recent models demonstrated high levels of clinical competency. These results suggest that the latest LLMs may be approaching a high level of performance in medical tasks of sleep medicine diagnosis and knowledge retrieval. Future progress may require incorporation of curated medical datasets and domain-specific training to achieve clinical-grade reliability.
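The vignette accuracies quoted in the abstract can be checked against the reported counts. The following is a minimal sketch that only recomputes those percentages; the counts come from the abstract, while the function and variable names are illustrative and not taken from the paper. It does not reproduce the pairwise significance tests, which the abstract does not specify.

```python
# Correct-answer counts on the 78 sleep medicine case vignettes,
# as reported in the abstract (only these three models have counts given).
VIGNETTE_TOTAL = 78
correct_counts = {
    "GPT-3.5 Turbo": 58,
    "GPT-o3": 73,
    "GPT-5": 71,
}


def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Hypothetical helper structure: model name -> recomputed accuracy in percent.
accuracies = {
    model: accuracy_percent(n, VIGNETTE_TOTAL)
    for model, n in correct_counts.items()
}

for model, pct in accuracies.items():
    print(f"{model}: {pct}% ({correct_counts[model]}/{VIGNETTE_TOTAL})")
```

The recomputed values agree with the figures stated in the abstract (74.4%, 93.6%, and 91.0%).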

Topics

Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Genomics and Rare Diseases