OpenAlex · Updated hourly · Last updated: 1 April 2026, 14:34

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

(277) What AI Performs Best When Answering Commonly Asked Questions About Testosterone Replacement Therapy?

2025 · 0 citations · The Journal of Sexual Medicine · Open Access
Open full text at the publisher

Citations: 0 · Authors: 5 · Year: 2025

Abstract

Introduction: Testosterone replacement therapy (TRT) has seen a marked rise in popularity over the past decade, but in popular culture TRT remains shrouded in confusion and misinformation. At the same time, artificial intelligence (AI) is increasingly used by patients and providers as a source of medical information. Despite the rapid adoption of AI tools in healthcare, however, there is limited research comparing the accuracy or quality of different AI chatbots when answering medical questions, especially in the context of TRT.

Objective: To investigate and compare the accuracy of AI models' responses to frequently asked questions about TRT.

Methods: Ten frequently asked questions regarding TRT were individually entered into five different AI sources (ChatGPT 4o, Grok 2, Claude 3.5, Copilot, and Gemini 1.5). The responses were blinded and randomized. Ten urologists then critically analyzed and graded the responses on a Likert scale of 1 to 4 for clarity, accuracy, and evidence-based information, with 1 being the best and 4 being the worst. Data were first analyzed with a one-way ANOVA, followed by post-hoc pairwise comparisons using Tukey's HSD.

Results: Grok was rated highest and outperformed all other models (p < 0.001 in each comparison). Gemini had the lowest overall rating and was significantly outperformed by all other AI models (all p < 0.001). Ranked by overall performance, Grok (mean = 1.92) did best, followed by Claude (mean = 2.13), ChatGPT (mean = 2.40), Copilot (mean = 2.93), and finally Gemini (mean = 3.29) (all p < 0.001).

Conclusions: Our findings reveal significant variation in the quality of AI responses to questions about TRT, as rated by 10 board-certified urologists. Grok outperformed every other AI in physician-reported response quality; with a mean of 1.92, most urologists reported that Grok's responses required zero to minimal clarification. Gemini's responses were rated significantly worse than those of every other chatbot in the study; with a mean of 3.29, Gemini's responses on average required moderate clarification or were unsatisfactory altogether. These differences highlight that not all AI chatbots provide equally reliable or clinically accurate information. Given the growing use of AI in patient education, selecting the most accurate AI platform is critical, especially in areas like TRT where misinformation is common. Grok is the only AI that users must pay to use, so future research should explore how the premium versions of other AI chatbots compare.

Disclosure: No.
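The analysis described in Methods (one-way ANOVA across the five chatbots' Likert ratings, then Tukey's HSD pairwise comparisons) can be sketched as follows. The ratings below are simulated around the reported group means for illustration only; the study's raw data are not public, and the rating standard deviation used here is an assumption.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
models = ["Grok", "Claude", "ChatGPT", "Copilot", "Gemini"]
means = [1.92, 2.13, 2.40, 2.93, 3.29]  # mean ratings reported in the abstract

# Simulate 100 ratings per model (10 questions x 10 urologists),
# clipped to the 1-4 Likert range; sd = 0.5 is a hypothetical choice.
ratings = {m: np.clip(rng.normal(mu, 0.5, 100), 1, 4)
           for m, mu in zip(models, means)}

# Step 1 -- one-way ANOVA: do mean ratings differ across models at all?
f_stat, p_value = f_oneway(*ratings.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3g}")

# Step 2 -- Tukey's HSD post hoc: which specific pairs of models differ?
scores = np.concatenate(list(ratings.values()))
groups = np.repeat(models, 100)
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```

Tukey's HSD is the natural follow-up here because it controls the family-wise error rate across all ten pairwise model comparisons, which a series of plain t-tests would not.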


Topics

Artificial Intelligence in Healthcare and Education · Mobile Health and mHealth Applications · Explainable Artificial Intelligence (XAI)