This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
From open-ended to multiple-choice: evaluating diagnostic performance and consistency of ChatGPT, Google Gemini and Claude AI
Citations: 4
Authors: 5
Year: 2024
Abstract
Objective: To determine the performance and response repeatability of freely available large language models (LLMs) in diagnosing diseases based on clinical case descriptions. Materials and Methods: 100 detailed clinical case descriptions were used to evaluate the diagnostic performance of ChatGPT 3.5, ChatGPT 4o, Google Gemini, and Claude AI 3.5 Sonnet. The analysis was conducted in two phases: Phase 1 with case descriptions only, and Phase 2 with case descriptions plus answer options. Each phase used specific prompts and was repeated twice to assess agreement. Response consistency was measured using agreement percentage and Cohen's kappa (k). 95% confidence intervals for proportions were calculated using Wilson's method. Statistical significance was set at p<0.05 using Fisher's exact test. Results: In Phase 1, the accuracy of ChatGPT 3.5, ChatGPT 4o, Google Gemini, and Claude AI 3.5 Sonnet was 69.00%, 64.00%, 44.00%, and 72.00%, respectively. All models showed high consistency: agreement percentages ranged from 93.00% to 97.00%, and k ranged from 0.86 to 0.94. In Phase 2, all models' accuracy increased significantly (90.00%, 95.00%, 65.00%, and 89.00% for ChatGPT 3.5, ChatGPT 4o, Google Gemini, and Claude AI 3.5 Sonnet, respectively). Agreement percentages ranged from 97.00% to 99.00%, while k values were between 0.85 and 0.93. Conclusions: Claude AI 3.5 Sonnet and both ChatGPT models can be used effectively for the differential diagnosis process, while using these models to diagnose from scratch should be done with caution. As Google Gemini's accuracy was low, its feasibility in real clinical practice is currently questionable.
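The abstract's consistency and confidence-interval statistics can be reproduced with standard formulas. The sketch below (function names are my own; the paper does not publish its analysis code) computes the observed agreement percentage and Cohen's kappa for two repeated runs of model answers, and a Wilson score interval for an accuracy proportion such as Claude's 72/100 in Phase 1.

```python
from collections import Counter
from math import sqrt

def cohens_kappa(run1, run2):
    """Cohen's kappa: agreement between two runs, corrected for chance."""
    assert len(run1) == len(run2) and run1
    n = len(run1)
    # observed agreement (the paper's "agreement percentage", as a fraction)
    p_o = sum(a == b for a, b in zip(run1, run2)) / n
    # expected chance agreement from the marginal label frequencies
    c1, c2 = Counter(run1), Counter(run2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n**2
    return (p_o - p_e) / (1 - p_e)

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half
```

For example, `wilson_ci(72, 100)` gives an interval of roughly 0.63 to 0.80 around Claude AI 3.5 Sonnet's Phase 1 accuracy of 72%; the Wilson interval is preferred over the simple normal approximation because it stays within [0, 1] and behaves better for proportions near the extremes.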
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,652 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,567 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,083 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,856 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations