This is an overview page with metadata for this scientific work. The full article is available from the publisher.
SUN-805 Can AI Large Language Models Help our Hyperthyroid Patients?
Citations: 0
Authors: 7
Year: 2025
Abstract
Abstract Disclosure: P. Badala: None. S. Sidhu: None. G. Wu: None. A. Sidhu: None. E. Pan: None. A. Ashok: None. M. Del Buono: None. Background: 1.2% of people in the United States have hyperthyroidism. Hyperthyroidism can develop as a result of Graves’ Disease, a toxic adenoma, or a toxic multinodular goiter. Over 250 million people worldwide used AI software in 2023, and that number is forecast to increase significantly. Purpose: Determine whether different AI chatbots can provide correct information about hyperthyroidism across different languages. Methods: Questions were asked in English, Chinese, Hindi, Japanese, Korean, and Punjabi to five chatbots: Claude, Cohere, Gemini, GPT 4o Mini, and GPT 4o. Textual responses were recorded and translated with help from native speakers, then manually scored on a 1-5 scale. Results: Paired t-tests with Bonferroni corrections showed that, across chatbots, English responses had more words than Chinese (t=8.309, df=24, p adj <0.001), Punjabi (t=4.881, df=24, p<0.001), Hindi (t=8.385, df=24, p adj <0.001), Japanese (t=10.096, df=24, p adj <0.001), and Korean (t=4.581, df=24, p adj <0.001) (Table 1). Hindi had a higher word count than Japanese, as did Chinese, Korean, and Punjabi. A regression analysis was carried out to determine whether the length of an output was correlated with its accuracy. No correlation was observed: the regression equation (Manual Score) = 4.305 - 0.000325(Text Length) was significant for the intercept (p<0.001) but not the slope (p=0.168), indicating that text length does not significantly affect response accuracy. No significant difference between the different chatbots’ outputs (across all languages) was observed (lowest p adj = 0.428).
The clear hierarchy in language accuracy (English, Chinese > Hindi, Korean, Japanese > Punjabi) suggests a significant disparity in the quality of medical information provided across different languages. Conclusion: More training on linguistically and medically diverse datasets is needed to make responses more concise and readable. Presentation: Sunday, July 13, 2025
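The statistical approach described above (paired t-tests with Bonferroni correction on word counts, plus a linear regression of manual score on text length) can be sketched as follows. This is a minimal illustration with synthetic data, not the study's dataset; the sample size of 25 and the number of pairwise comparisons are assumptions for the example.

```python
# Sketch of the abstract's analysis pipeline using synthetic data.
# Assumptions (not from the study): 25 paired responses per language,
# 5 pairwise comparisons against English for the Bonferroni factor.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic word counts for 25 paired responses in two languages
english = rng.normal(300, 40, size=25)
chinese = rng.normal(200, 40, size=25)

# Paired t-test: same prompts answered in both languages
t, p = stats.ttest_rel(english, chinese)

# Bonferroni correction: multiply p by the number of comparisons, cap at 1
n_comparisons = 5
p_adj = min(p * n_comparisons, 1.0)

# Regression: does response length predict the manual 1-5 score?
lengths = rng.normal(250, 60, size=25)
scores = np.clip(rng.normal(4.3, 0.4, size=25), 1, 5)  # generated independently of length
res = stats.linregress(lengths, scores)

print(f"t={t:.3f}, df={len(english) - 1}, p_adj={p_adj:.4g}")
print(f"slope={res.slope:.5f}, slope p-value={res.pvalue:.3f}")
```

A non-significant slope p-value in the regression would mirror the abstract's finding that longer responses were not systematically more or less accurate.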
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations