Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Benchmarking the Performance of Large Language Models in Uveitis: A Comparative Analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3
0
Zitationen
10
Autoren
2024
Jahr
Abstract
<title>Abstract</title> BACKGROUND/OBJECTIVE This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by various Large Language Models (LLMs) (ChatGPT-3.5, Gemini, Claude 3, and GPT-4.0) in the clinical context of uveitis, utilizing a meticulous grading methodology. METHODS Twenty-seven clinical uveitis questions were presented individually to four Large Language Models (LLMs): ChatGPT (versions GPT-3.5 and GPT-4.0), Google Gemini, and Claude. Three experienced uveitis specialists independently assessed the responses for accuracy using a three-point scale across three rounds with a 48-hour wash-out interval. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined through a majority consensus approach. Comprehensiveness was evaluated using a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was determined using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to discern significant differences among LLMs, employing a significance threshold of p < 0.05. RESULTS Claude 3 and ChatGPT 4 demonstrated significantly higher accuracy compared to Gemini (p < 0.001). Claude 3 also showed the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT 4 (88.9%). ChatGPT 3.5, Claude 3, and ChatGPT 4 had no responses rated as 'Deficient', unlike Gemini (14.8%) (p = 0.014). ChatGPT 4 exhibited greater comprehensiveness compared to Gemini (p = 0.008), and Claude 3 showed higher comprehensiveness compared to Gemini (p = 0.042). Gemini showed significantly better readability compared to ChatGPT 3.5, Claude 3, and ChatGPT 4 (p < 0.001). Gemini also had fewer words, letter characters, and sentences compared to ChatGPT 3.5 and Claude 3. CONCLUSIONS Our study highlights the outstanding performance of Claude 3 and ChatGPT 4 in providing precise and thorough information regarding uveitis, surpassing Gemini. ChatGPT 4 and Claude 3 emerge as pivotal tools in improving patient understanding and involvement in their uveitis healthcare journey.
Ähnliche Arbeiten
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8.292 Zit.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8.143 Zit.
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7.539 Zit.
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5.776 Zit.
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5.452 Zit.