This is an overview page with metadata for this scientific work. The full article is available from the publisher.
SUN-812 Are AI Models Accurate in Assisting Hypothyroidism Patients?
Citations: 0 · Authors: 7 · Year: 2025
Abstract
Abstract Disclosure: S. Sidhu: None. G. Wu: None. A. Sidhu: None. E. Pan: None. A. Madala: None. M.D. Buono: None. A. Ashok: None.

Background: Approximately 13 million people in the United States, representing 4.78% of the population, suffer from undiagnosed endocrine disorders. Hypothyroidism affects up to 5% of the general population, with another 5% estimated to be undiagnosed. Most cases in Americans older than 11 years of age present with mild or minimally obvious symptoms.

Purpose: To determine whether AI chatbots can give accurate information about hypothyroidism and correctly judge the accuracy of their own responses.

Methods: Questions were asked in English to Claude, Cohere, Gemini, GPT 4o Mini, and GPT 4o. Textual responses from each chatbot were recorded and scored on a scale of 1 to 5, with 1 indicating a highly inaccurate response and 5 indicating an accurate and advanced response. A series of paired t-tests was used to compare manual and AI scores, with score difference = (Manual - AI). P-values were adjusted using the Bonferroni correction. For the scatterplot of manual vs. AI-rated scores across languages and chatbots, random jitter was applied to better visualize data grouping and trends. To reduce bias, the raters for all languages jointly rated the English responses and were blinded to the chatbot they were evaluating.

Results: By chatbot, ChatGPT 4o outperformed Cohere (t = 3.209, df = 29, p_adj = 0.032), Claude outperformed Cohere (t = 3.914, df = 29, p_adj = 0.005), and Gemini also outperformed Cohere (t = 4.455, df = 29, p_adj = 0.001). The Pearson correlation coefficient between manual and AI scores was 0.417, suggesting a moderate positive correlation.

Discussion: While ChatGPT 4o performed best among the chatbots, it was also the only model to require a paid subscription, making accurate information less accessible to individuals in lower socioeconomic brackets who may not be able to afford paid chatbots. AI-predicted scores were generally not accurate compared to manual scores: there was only one instance in which an AI model self-scored a response below 3, while 16 responses were human-rated below 3. AI models are overconfident in their responses, and chatbots may repeatedly provide incorrect information when prompted. Cohere was consistently outperformed by Claude, Gemini, and ChatGPT 4o/4o Mini, highlighting its need for training on more diverse datasets.

Conclusion: Our data suggest a need for more accessible and affordable large language models trained on medically succinct datasets for patients.

Presentation: Sunday, July 13, 2025
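The Methods describe a standard statistical workflow: paired t-tests on per-question score differences, Bonferroni-adjusted p-values, and a Pearson correlation between manual ratings and the models' self-assessed scores. The sketch below illustrates that workflow in Python with hypothetical placeholder data; the score values, the assumption of 30 questions (matching the reported df = 29), and the assumed 10 pairwise chatbot comparisons are illustrative and not the study's actual dataset.

```python
# Minimal sketch (hypothetical data) of the analysis described in the
# Methods: paired t-test, Bonferroni correction, Pearson correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical 1-5 accuracy ratings for 30 questions (df = 29, as in
# the Results) answered by two chatbots; not the study's real scores.
scores_gpt4o = np.clip(rng.normal(4.2, 0.8, size=30), 1, 5)
scores_cohere = np.clip(rng.normal(3.4, 0.9, size=30), 1, 5)

# Paired t-test on the per-question score difference.
t_stat, p_raw = stats.ttest_rel(scores_gpt4o, scores_cohere)

# Bonferroni correction: multiply the raw p-value by the number of
# comparisons. With 5 chatbots there are C(5, 2) = 10 pairwise tests
# (an assumption; the abstract does not state the exact count).
n_comparisons = 10
p_adj = min(p_raw * n_comparisons, 1.0)
print(f"chatbot comparison: t = {t_stat:.3f}, p_adj = {p_adj:.4f}")

# Pearson correlation between manual ratings and each model's
# self-assessed score (again hypothetical placeholder data).
manual = np.clip(rng.normal(3.8, 1.0, size=30), 1, 5)
self_rated = np.clip(manual + rng.normal(0.6, 0.9, size=30), 1, 5)
r, _ = stats.pearsonr(manual, self_rated)
print(f"manual vs. AI self-score: Pearson r = {r:.3f}")
```

The random jitter mentioned for the scatterplot would amount to adding small uniform noise to the plotted coordinates before drawing, so that identical integer score pairs do not overlap into a single point.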
Similar works
New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)
2008 · 28,927 citations
TNM Classification of Malignant Tumours
1987 · 16,123 citations
A survey on deep learning in medical image analysis
2017 · 13,607 citations
Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening
2011 · 10,775 citations
The American Joint Committee on Cancer: the 7th Edition of the AJCC Cancer Staging Manual and the Future of TNM
2010 · 9,111 citations