This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparative Performance of Large Language Models in Ophthalmology Referral Triage
Citations: 0
Authors: 5
Year: 2026
Abstract
Purpose: The aim of this study was to evaluate the classification accuracy and consistency of five advanced large language model (LLM)-based systems, ChatGPT 4o, ChatGPT 5.1, Perplexity Pro, Claude Sonnet 4.5, and Claude Opus 4.1, in classifying real-world Portuguese ophthalmology referral vignettes into symptom-based categories, and to assess the effect of supervised in-context learning on model performance.

Methods: A total of 3,831 real-world, anonymized ophthalmology referral vignettes written in Portuguese and collected between January and May 2023 were submitted to each system across three independent runs. In phase one, models classified referrals into one of 16 predefined symptom-based categories using a zero-shot prompting strategy. In phase two, each system was exposed to 957 labeled examples (~20% of the dataset) through in-context learning before repeating the task. Classification accuracy, consistency, and inter-run agreement (Fleiss' kappa) were calculated, with additional analysis by symptom category.

Results: Baseline classification accuracy averaged 68.7% across models, improving to 73.4% after exposure. ChatGPT 5.1 achieved the highest peak accuracy (79.5%), while ChatGPT 4o showed the largest consistency gain (from 66.8% to 93.8%) and a net improvement in 933 cases (p < 0.001). Performance exceeded 90% for common referral categories, such as diabetic screening and chronic visual loss, but was lower for rare or ambiguous complaints. Inter-run agreement, measured by Fleiss' kappa, ranged from moderate to substantial across models (κ = 0.462–0.801), with the highest agreement observed for ChatGPT 4o.

Conclusions: Advanced LLMs demonstrated strong performance in interpreting real-world Portuguese-language ophthalmology referrals, with meaningful gains in accuracy and consistency achieved through limited supervised in-context exposure. Performance was lower for rare or ambiguous referral categories. Despite this limitation, these findings support the potential role of LLMs as scalable, low-cost triage aids, provided that human oversight and further clinical validation are ensured prior to deployment.
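The abstract reports inter-run agreement as Fleiss' kappa and a separate consistency figure, but does not say how either was computed. The following is a minimal sketch of the standard Fleiss' kappa formula, treating each of the three runs as a "rater". The function names, the definition of consistency as the share of vignettes with identical labels in all runs, and the toy data are assumptions made for illustration, not the authors' method or results.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Standard Fleiss' kappa.

    ratings: list of items, each a list with one label per run (same length).
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_runs = len(ratings[0])

    # counts[i][j] = number of runs that assigned item i to category j
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally.get(cat, 0) for cat in categories])

    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(n * n for n in row) - n_runs) / (n_runs * (n_runs - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Agreement expected by chance from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_runs)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:  # every run used a single category for every item
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


def consistency(ratings):
    """Share of items labeled identically in every run (assumed definition)."""
    return sum(1 for item in ratings if len(set(item)) == 1) / len(ratings)


# Hypothetical toy data: 4 vignettes, 3 runs, 2 of the 16 categories.
DS, CVL = "diabetic screening", "chronic visual loss"
toy_runs = [
    [DS, DS, DS],
    [CVL, CVL, CVL],
    [DS, DS, CVL],
    [DS, CVL, CVL],
]

print(round(fleiss_kappa(toy_runs, [DS, CVL]), 3))  # 0.333
print(consistency(toy_runs))                        # 0.5
```

On this toy input, two of four vignettes are labeled identically in all runs, yet kappa is only about 0.33 because half of that agreement is expected by chance. This is why the two measures can diverge when reported side by side.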
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,644 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,550 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,061 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,850 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations