This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Comparative Performance of Large Language Models in Ophthalmology Referral Triage

2026 · 0 citations · Cureus · Open Access

Abstract

Purpose: The aim of this study was to evaluate the classification accuracy and consistency of five advanced language model-based systems (LLMs), ChatGPT 4o, ChatGPT 5.1, Perplexity Pro, Claude Sonnet 4.5, and Claude Opus 4.1, in classifying real-world Portuguese ophthalmology referral vignettes into symptom-based categories, and to assess the effect of supervised in-context learning on model performance.

Methods: A total of 3,831 real-world, anonymized ophthalmology referral vignettes written in Portuguese and collected between January and May 2023 were submitted to each system across three independent runs. In phase one, models classified referrals into one of 16 predefined symptom-based categories using a zero-shot prompting strategy. In phase two, each system was exposed to 957 labeled examples (~20% of the dataset) through in-context learning before repeating the task. Classification accuracy, consistency, and Fleiss' kappa agreement were calculated, with additional analysis by symptom category.

Results: Baseline classification accuracy averaged 68.7% across models, improving to 73.4% post exposure. ChatGPT 5.1 achieved the highest peak accuracy (79.5%), while ChatGPT 4o showed the largest consistency gain (from 66.8% to 93.8%) and a net improvement in 933 cases (p < 0.001). Performance exceeded 90% for common referral categories, such as diabetic screening and chronic visual loss, but was lower for rare or ambiguous complaints. Inter-run agreement, measured by Fleiss' kappa, ranged from moderate to substantial across models (κ = 0.462-0.801), with the highest agreement observed for ChatGPT 4o.

Conclusions: Advanced LLMs demonstrated strong performance in interpreting real-world Portuguese-language ophthalmology referrals, with meaningful gains in accuracy and consistency achieved through limited supervised in-context exposure. Performance was lower for rare or ambiguous referral categories. Despite this limitation, these findings support the potential role of LLMs as scalable, low-cost triage aids, provided that human oversight and further clinical validation are ensured prior to deployment.
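
The abstract reports inter-run agreement as Fleiss' kappa over three independent runs per model. As a rough illustration only, the following Python sketch shows how such a statistic could be computed from per-vignette labels. It is not the authors' code: the function names `fleiss_kappa` and `consistency` are hypothetical, and the definition of "consistency" as the share of vignettes receiving the same label in every run is an assumption, since the abstract does not define it.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of runs.

    ratings: list of lists, one inner list of labels per item (one label per run).
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_runs = len(ratings[0])

    # counts[i][j] = number of runs that assigned item i to category j
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally.get(cat, 0) for cat in categories])

    # Observed agreement: mean over items of the per-item agreement P_i
    per_item = [
        (sum(n * n for n in row) - n_runs) / (n_runs * (n_runs - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall share of each category
    shares = [
        sum(row[j] for row in counts) / (n_items * n_runs)
        for j in range(len(categories))
    ]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        # Every run used one single category for every item
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def consistency(ratings):
    """Assumed definition: share of items with an identical label in all runs."""
    return sum(1 for item in ratings if len(set(item)) == 1) / len(ratings)


if __name__ == "__main__":
    # Toy data: four vignettes, three runs each (illustrative, not study data)
    cats = ["diabetic screening", "chronic visual loss"]
    runs = [
        ["diabetic screening"] * 3,
        ["chronic visual loss"] * 3,
        ["diabetic screening", "diabetic screening", "chronic visual loss"],
        ["chronic visual loss", "chronic visual loss", "diabetic screening"],
    ]
    print(fleiss_kappa(runs, cats), consistency(runs))
```

In this formulation each run is treated as one "rater", so a model that returns the same category for a vignette in all three runs contributes full agreement for that vignette.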

Institutions

Unidade Local de Saúde de Entre Douro e Vouga (PT)

Topics

Artificial Intelligence in Healthcare and Education; Healthcare Systems and Technology; Retinal Diseases and Treatments
