
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Feasibility of a Specialized Large Language Model for Postgraduate Medical Examination Preparation: Single-Center Proof-of-Concept Study (Preprint)

2025 · 0 citations · Open Access
Open full text at the publisher

Citations: 0 · Authors: 5 · Year: 2025

Abstract

BACKGROUND

Large language models (LLMs) are increasingly used in medical education for feedback and grading, yet their role in postgraduate examination preparation remains uncertain because of concerns about inconsistent grading, hallucinations, and user acceptance.

OBJECTIVE

This study evaluates the Personalized Anesthesia Study Support (PASS), a specialized GPT-4 model developed to assist candidates preparing for Singapore's postgraduate specialist anesthesiology examination. We assessed user acceptance, grading interrater reliability, and hallucination detection rates to determine the feasibility of integrating specialized LLMs into high-stakes examination preparation.

METHODS

PASS was built on OpenAI's GPT-4 and adapted with domain-specific prompts and references. Twenty-one senior anesthesiology residents completed a mock short answer question examination, which was independently graded by 3 human examiners and 3 PASS iterations. Participants reviewed feedback from both PASS and standard GPT-4 and completed a technology acceptance model (TAM) survey. Grading reliability was evaluated using Cohen and Fleiss κ. Hallucination rates were assessed by participants and examiners.

RESULTS

Of the 21 participants, 17 (81%) completed the TAM survey, generating 136 responses. PASS scored significantly higher than standard GPT-4 in usefulness (mean 4.25, SD 0.50 vs mean 3.44, SD 0.82; P<.001), efficiency (mean 4.12, SD 0.61 vs mean 3.41, SD 0.74; P<.001), and likelihood of future use (mean 4.13, SD 0.75 vs mean 3.59, SD 0.90; P<.001), with no significant difference in ease of use (mean 4.56, SD 0.63 vs mean 4.50, SD 0.61; P=.35). Internal grading reliability was moderate for PASS (κ=0.522) and fair for human examiners (κ=0.275). Across 316 PASS-generated responses, 67 hallucinations and 189 deviations were labeled. Hallucination labeling rates were comparable between candidates (10/67, 15%) and examiners (57/249, 22.9%; P=.21), while examiners labeled significantly more deviations (168/249, 67.5% vs 21/67, 31%; P<.001).

CONCLUSIONS

PASS demonstrated strong user acceptance and more consistent internal grading than human examiners, suggesting that specialized LLMs are feasible for high-stakes examination preparation. Experienced learners identified major hallucinations at rates comparable to examiners, suggesting potential for self-directed learning, though continued caution is needed. Further research should refine grading accuracy and explore multicenter evaluation of specialized LLMs for postgraduate medical education.
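The κ statistics above quantify agreement beyond chance among the 3 PASS iterations and among the 3 human examiners. As a minimal, hypothetical sketch of how such interrater reliability could be computed (the grade bands, grader counts, and synthetic data below are illustrative assumptions, not the study's actual data or code):

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Hypothetical data: 60 mock-exam answers, each graded on a 0-4 band by
# 3 graders (e.g., 3 PASS iterations or 3 human examiners). Each grader's
# score is the underlying answer quality plus small random disagreement.
true_quality = rng.integers(0, 5, size=60)
grades = np.column_stack([
    np.clip(true_quality + rng.integers(-1, 2, size=60), 0, 4)
    for _ in range(3)
])

# Pairwise agreement beyond chance: Cohen's kappa for each grader pair.
for i, j in combinations(range(3), 2):
    print(f"Cohen kappa, grader {i} vs {j}: "
          f"{cohen_kappa_score(grades[:, i], grades[:, j]):.3f}")

# Overall agreement across all 3 graders: Fleiss' kappa.
# aggregate_raters turns the (answers x graders) matrix into per-answer
# counts of how many graders chose each grade band.
counts, _ = aggregate_raters(grades)
print(f"Fleiss kappa (all 3 graders): {fleiss_kappa(counts):.3f}")
```

Fleiss' κ generalizes the pairwise Cohen's κ to 3 or more raters, which matches the study's design of 3 graders per grading arm.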

Topics

Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Simulation-Based Education in Healthcare