
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Validation of generative AI models’ performance on patient education quizzes: Hormone-related symptoms of cancer care.

2025 · 0 citations · Journal of Clinical Oncology

Citations: 0
Authors: 5
Year: 2025

Abstract

e13881 Background: To evaluate the clinical reliability and accuracy of generative AI models when answering patient-level educational quizzes on hormone-related symptoms of cancer care during survivorship. This assessment aimed to measure the consistency of AI responses and identify areas for potential improvement.

Methods: Two 10-question, patient-level, gender-specific quizzes (one for men and one for women) were compiled from the highest-quality questions generated by five generative AI models (GPT-o1, Claude3.5 Sonnet, Grok2, Gemini1.5, and DeepSeek). Questions were selected based on their relevance, clarity, and clinical appropriateness, with each quiz covering key domains: (1) sex hormones and cancer, (2) type and timing of symptoms, (3) assessment of symptoms, (4) treatment of hot flashes, (5) treatment of gynecomastia (men only), and (6) treatment of urogenital problems (women only). Each quiz was administered to all five AI models in five independent rounds (50 total attempts across both quizzes).

Results: Overall, the models demonstrated high accuracy on both quizzes, with the majority achieving 100% across all rounds. On the men's quiz, GPT-o1, Grok2, and DeepSeek consistently scored 100% across all five rounds. However, Claude3.5 scored 9/10 in Round 1 of 5, and Gemini1.5 scored 9/10 in Round 5 of 5. Both models missed the same question, each in its respective round: "Which treatments are available for enlarged male breast tissue (gynecomastia)?" The correct answer was "d) All of the above," but Claude3.5 answered "both b) surgery and c) tamoxifen," and Gemini1.5 answered "b) surgery in most cases," omitting the other correct options. On the women's quiz, all five models achieved perfect scores (100%) across all rounds. The aggregate accuracy was 99.6% (498/500 questions).

Conclusions: Generative AI models exhibited high reliability and accuracy when tested on patient education quizzes derived from national clinical guidelines, particularly on questions addressing female-specific hormone-related symptoms. The errors on the men's quiz highlight the need for refinement in multi-option clinical reasoning. These findings support the potential for generative AI models to serve as reliable and consistent tools for reinforcing patient education, with additional real-world patient testing and iteration needed to optimize their clinical utility.
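The aggregate-accuracy figure follows from 2 quizzes × 10 questions × 5 models × 5 rounds = 500 scored question attempts, with two single-question misses. The short sketch below is only an illustration of that arithmetic; the per-round scores are reconstructed from the results reported in the abstract (two 9/10 rounds, all others 10/10) and are not the study's underlying data.

```python
# Illustrative tally of the aggregate accuracy described in the abstract.
# Per-round scores are assumed from the reported results, not source data.

MODELS = ["GPT-o1", "Claude3.5 Sonnet", "Grok2", "Gemini1.5", "DeepSeek"]
QUIZZES = ["men", "women"]
ROUNDS = 5
QUESTIONS_PER_QUIZ = 10

# scores[quiz][model] = number of correct answers in each of the five rounds
scores = {
    quiz: {model: [QUESTIONS_PER_QUIZ] * ROUNDS for model in MODELS}
    for quiz in QUIZZES
}
# The two reported misses, both on the men's quiz:
scores["men"]["Claude3.5 Sonnet"][0] = 9   # Round 1
scores["men"]["Gemini1.5"][4] = 9          # Round 5

total_questions = len(QUIZZES) * len(MODELS) * ROUNDS * QUESTIONS_PER_QUIZ
total_correct = sum(
    s for per_model in scores.values() for rounds in per_model.values() for s in rounds
)

print(f"{total_correct}/{total_questions} = {100 * total_correct / total_questions:.1f}%")
# -> 498/500 = 99.6%
```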

Topics

Artificial Intelligence in Healthcare and Education