This is an overview page with metadata for this scientific work. The full article is available from the publisher.
SUN-805 Can AI Large Language Models Help our Hyperthyroid Patients?
Citations: 0
Authors: 7
Year: 2025
Abstract
Abstract Disclosure: P. Badala: None. S. Sidhu: None. G. Wu: None. A. Sidhu: None. E. Pan: None. A. Ashok: None. M. Del Buono: None. Background: 1.2% of people in the United States have hyperthyroidism. Hyperthyroidism can develop as a result of Graves’ Disease, a toxic adenoma, or a toxic multinodular goiter. Over 250 million people worldwide used AI software in 2023, and that number is forecast to increase significantly. Purpose: Determine whether different AI chatbots can provide correct information about hyperthyroidism across different languages. Methods: Questions were asked in English, Chinese, Hindi, Japanese, Korean, and Punjabi to five chatbots: Claude, Cohere, Gemini, GPT 4o Mini, and GPT 4o. Textual responses were recorded and translated with help from native speakers, then manually scored on a 1-5 scale. Results: Paired t-tests with Bonferroni corrections showed that, across chatbots, English responses had more words than Chinese (t=8.309, df=24, p adj <0.001), Punjabi (t=4.881, df=24, p<0.001), Hindi (t=8.385, df=24, p adj <0.001), Japanese (t=10.096, df=24, p adj <0.001), and Korean (t=4.581, df=24, p adj <0.001) (Table 1). Hindi had a higher word count than Japanese, as did Chinese, Korean, and Punjabi. A regression analysis was carried out to determine whether the length of an output was correlated with its accuracy. No correlation was observed: the regression equation (Manual Score) = 4.305 - 0.000325(Text Length) was significant for the intercept (p<0.001) but not the slope (p=0.168), indicating that text length does not significantly affect response accuracy. No significant difference between the different chatbots’ outputs (across all languages) was observed (lowest p adj = 0.428).
The clear hierarchy in language accuracy (English, Chinese > Hindi, Korean, Japanese > Punjabi) suggests a significant disparity in the quality of medical information provided across different languages. Conclusion: More training on linguistically and medically diverse datasets is needed to make responses more concise and readable. Presentation: Sunday, July 13, 2025
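The statistical approach described above (paired t-tests with Bonferroni correction on word counts, plus a linear regression of manual score on text length) can be sketched as follows. This is a minimal illustration with synthetic data, not the study's dataset; the sample size of 25 and the number of pairwise comparisons are assumptions for the example.

```python
# Sketch of the abstract's analysis pipeline using synthetic data.
# Assumptions (not from the study): 25 paired responses per language,
# 5 pairwise comparisons against English for the Bonferroni factor.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic word counts for 25 paired responses in two languages
english = rng.normal(300, 40, size=25)
chinese = rng.normal(200, 40, size=25)

# Paired t-test: same prompts answered in both languages
t, p = stats.ttest_rel(english, chinese)

# Bonferroni correction: multiply p by the number of comparisons, cap at 1
n_comparisons = 5
p_adj = min(p * n_comparisons, 1.0)

# Regression: does response length predict the manual 1-5 score?
lengths = rng.normal(250, 60, size=25)
scores = np.clip(rng.normal(4.3, 0.4, size=25), 1, 5)  # generated independently of length
res = stats.linregress(lengths, scores)

print(f"t={t:.3f}, df={len(english) - 1}, p_adj={p_adj:.4g}")
print(f"slope={res.slope:.5f}, slope p-value={res.pvalue:.3f}")
```

A non-significant slope p-value in the regression would mirror the abstract's finding that longer responses were not systematically more or less accurate.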
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations