OpenAlex · Updated hourly · Last update: 28.03.2026, 01:21

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessment of Large Language Model Performance on Virtual Patient Scenarios: Mixed Methods Study (Preprint)

2026 · 0 citations · Open Access
Open full text at publisher

0 citations · 5 authors · Year: 2026

Abstract

<sec> <title>BACKGROUND</title> Generative AI is increasingly being explored in medical education as a tool to enhance clinical reasoning and support interactive learning. However, few studies to date have evaluated how such models perform in suitable educational settings. This study examines the effectiveness of ChatGPT in simulating clinical decision-making through virtual patient (VP) interactions. </sec> <sec> <title>OBJECTIVE</title> This study aims to evaluate and compare the accuracy of ChatGPT-3.5 and ChatGPT-4 in solving VP scenarios across medical specialties, and to explore their strengths, limitations, and educational implications. </sec> <sec> <title>METHODS</title> A total of 64 VP scenarios covering paediatric, adult, disease-management, and oncology cases were tested with ChatGPT-3.5 and ChatGPT-4 on the MobiViP mobile platform. Responses were classified as correct, incorrect, or inadequate. Success rates were compared using descriptive and inferential statistics, and inadequate responses were analyzed thematically. </sec> <sec> <title>RESULTS</title> ChatGPT-4 significantly outperformed ChatGPT-3.5 across all categories (median success rate: 92.55% vs 78.68%, p&lt;.001). GPT-4 was particularly more reliable in complex scenarios such as oncology and disease management, whereas GPT-3.5 produced a greater number and variety of inadequate responses, including navigation errors and irrelevant outputs. </sec> <sec> <title>CONCLUSIONS</title> This study highlights the potential of generative AI to complement traditional medical education, particularly in fostering clinical reasoning through case-based learning. While both models show promise in supporting clinical reasoning education, GPT-4 provides significantly more accurate and contextually appropriate outputs. Nonetheless, both versions can produce erroneous or confusing responses, underscoring the importance of guided implementation. Furthermore, while generative AI holds potential for scenario creation, caution is warranted given the risks of bias, inaccuracy, and pedagogical misalignment. Educators should emphasize AI literacy and integrate these tools thoughtfully to support, rather than replace, clinical training. </sec>
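The abstract reports per-category success rates, a median comparison, and a significance test, but does not publish its analysis code or name the inferential test used. As an illustration only, the following Python sketch shows how per-scenario success rates might be computed from the paper's three-way response classification (correct / incorrect / inadequate) and compared with a rank-based statistic. All scenario data below is hypothetical, and Mann-Whitney U is an assumed (not confirmed) choice of test.

```python
from collections import defaultdict
from statistics import median

def success_rate(responses):
    """Percent of responses classified 'correct', using the paper's
    three-way scheme: correct / incorrect / inadequate."""
    return 100.0 * sum(r == "correct" for r in responses) / len(responses)

def mann_whitney_u(a, b):
    """U statistic for group a vs group b, using average ranks for ties.
    The paper reports p<.001 but does not name its test; a rank-based
    test is one common choice for comparing success-rate distributions."""
    positions = defaultdict(list)
    for idx, value in enumerate(sorted(a + b), start=1):
        positions[value].append(idx)
    avg_rank = {v: sum(p) / len(p) for v, p in positions.items()}
    rank_sum_a = sum(avg_rank[v] for v in a)
    return rank_sum_a - len(a) * (len(a) + 1) / 2

# Hypothetical per-scenario outcomes (the study's 64 VP scenarios
# are not reproduced on this page).
gpt4_rates = [success_rate(s) for s in [
    ["correct"] * 9 + ["incorrect"],        # 90% correct
    ["correct"] * 19 + ["inadequate"],      # 95% correct
]]
gpt35_rates = [success_rate(s) for s in [
    ["correct"] * 8 + ["incorrect"] * 2,    # 80% correct
    ["correct"] * 15 + ["inadequate"] * 5,  # 75% correct
]]

print(median(gpt4_rates), median(gpt35_rates))  # medians per model
print(mann_whitney_u(gpt4_rates, gpt35_rates))  # U statistic
```

In practice one would compute a p-value from U (e.g. via `scipy.stats.mannwhitneyu`); the hand-rolled statistic above is kept dependency-free for illustration.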



Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Machine Learning in Healthcare