This is an overview page with metadata for this scientific work. The full article is available from the publisher.
SUN-812 Are AI Models Accurate in Assisting Hypothyroidism Patients?
Citations: 0 · Authors: 7 · Year: 2025
Abstract
Abstract Disclosure: S. Sidhu: None. G. Wu: None. A. Sidhu: None. E. Pan: None. A. Madala: None. M.D. Buono: None. A. Ashok: None.

Background: Approximately 13 million people in the United States, representing 4.78% of the population, suffer from undiagnosed endocrine disorders. Hypothyroidism affects up to 5% of the general population, with another 5% estimated to be undiagnosed. Most cases in Americans older than 11 years of age present with mild or minimally obvious symptoms.

Purpose: To determine whether AI chatbots can give accurate information about hypothyroidism and correctly judge the accuracy of their own responses.

Methods: Questions were asked in English to Claude, Cohere, Gemini, GPT 4o Mini, and GPT 4o. Textual responses from each chatbot were recorded and scored on a scale of 1 to 5, with 1 indicating a highly inaccurate response and 5 indicating an accurate and advanced response. A series of paired t-tests was used to compare manual and AI scores, with score difference = (Manual - AI). P-values were adjusted using the Bonferroni correction. For the scatterplot of manual vs. AI-rated scores across languages and chatbots, random jitter was applied to better visualize data grouping and trends. To reduce bias, the raters for all languages jointly rated the English responses and were blinded to the chatbot they were evaluating.

Results: By chatbot, ChatGPT 4o outperformed Cohere (t = 3.209, df = 29, p_adj = 0.032), Claude outperformed Cohere (t = 3.914, df = 29, p_adj = 0.005), and Gemini also outperformed Cohere (t = 4.455, df = 29, p_adj = 0.001). The Pearson correlation coefficient between manual and AI scores was 0.417, suggesting a moderate positive correlation.

Discussion: While ChatGPT 4o performed best among the chatbots, it was also the only model to require a paid subscription, making accurate information less accessible to individuals in lower socioeconomic brackets who may not be able to afford paid chatbots. AI-predicted scores were generally not accurate compared to manual scores: there was only one instance in which an AI model self-scored a response below 3, while 16 responses were human-rated below 3. AI models are overconfident in their responses, and chatbots may repeatedly provide incorrect information when prompted. Cohere was consistently outperformed by Claude, Gemini, and ChatGPT 4o/4o Mini, highlighting its need for training on more diverse datasets.

Conclusion: Our data suggest a need for more accessible and affordable large language models trained on medically succinct datasets for patients.

Presentation: Sunday, July 13, 2025
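The Methods describe a standard statistical workflow: paired t-tests on per-question score differences, Bonferroni-adjusted p-values, and a Pearson correlation between manual ratings and the models' self-assessed scores. The sketch below illustrates that workflow in Python with hypothetical placeholder data; the score values, the assumption of 30 questions (matching the reported df = 29), and the assumed 10 pairwise chatbot comparisons are illustrative and not the study's actual dataset.

```python
# Minimal sketch (hypothetical data) of the analysis described in the
# Methods: paired t-test, Bonferroni correction, Pearson correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical 1-5 accuracy ratings for 30 questions (df = 29, as in
# the Results) answered by two chatbots; not the study's real scores.
scores_gpt4o = np.clip(rng.normal(4.2, 0.8, size=30), 1, 5)
scores_cohere = np.clip(rng.normal(3.4, 0.9, size=30), 1, 5)

# Paired t-test on the per-question score difference.
t_stat, p_raw = stats.ttest_rel(scores_gpt4o, scores_cohere)

# Bonferroni correction: multiply the raw p-value by the number of
# comparisons. With 5 chatbots there are C(5, 2) = 10 pairwise tests
# (an assumption; the abstract does not state the exact count).
n_comparisons = 10
p_adj = min(p_raw * n_comparisons, 1.0)
print(f"chatbot comparison: t = {t_stat:.3f}, p_adj = {p_adj:.4f}")

# Pearson correlation between manual ratings and each model's
# self-assessed score (again hypothetical placeholder data).
manual = np.clip(rng.normal(3.8, 1.0, size=30), 1, 5)
self_rated = np.clip(manual + rng.normal(0.6, 0.9, size=30), 1, 5)
r, _ = stats.pearsonr(manual, self_rated)
print(f"manual vs. AI self-score: Pearson r = {r:.3f}")
```

The random jitter mentioned for the scatterplot would amount to adding small uniform noise to the plotted coordinates before drawing, so that identical integer score pairs do not overlap into a single point.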
Similar works
New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1)
2008 · 28,927 citations
TNM Classification of Malignant Tumours
1987 · 16,123 citations
A survey on deep learning in medical image analysis
2017 · 13,607 citations
Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening
2011 · 10,775 citations
The American Joint Committee on Cancer: the 7th Edition of the AJCC Cancer Staging Manual and the Future of TNM
2010 · 9,111 citations