OpenAlex · Updated hourly · Last updated: 31.03.2026, 14:43

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessment of Artificial Intelligence Chatbot Performance on the Canadian Otolaryngology and Head and Neck Surgery In-Training Exam: Insights from a Comparative Analysis (Preprint)

2024 · 0 citations · Open Access
Open full text at the publisher

Citations: 0 · Authors: 4 · Year: 2024

Abstract

BACKGROUND: The introduction of large language models (LLMs) has rapidly transformed the field of healthcare. Their performance, often compared with that of physicians, has been heavily scrutinized. ChatGPT-4, a model refined with supervised fine-tuning, offers improved reasoning capabilities and can analyze visual input.

OBJECTIVE: The purpose of this study was to evaluate the performance of ChatGPT-4 in otolaryngology and head and neck surgery (OTOHNS) residency training.

METHODS: A total of 351 questions from the 2022 and 2023 OTOHNS National In-Training Exams (NITE) were submitted to ChatGPT-4 between April 22 and May 12, 2024, using a new account. A new session was started for each question, except for follow-up questions. Answers were graded independently by two reviewers using the official grading rubric, and the average score was used; Cohen's kappa coefficient was used for inter-rater reliability. Anonymized mean exam results from residents who had previously taken these exams were obtained from the lead faculty of the NITE. The sample size was calculated from the total number of enrolled residents, as listed on each university's program website. Z-tests were used to compare ChatGPT-4's performance with that of residents by sub-specialty and training level. Questions were categorized by type (image or text), task (diagnosis, additional examinations, treatment, or guidelines), sub-specialty, taxonomic level, and prompt length. One-way ANOVA, independent t-tests, and two-tailed Pearson correlations were used to examine variation between question categories. IBM SPSS 29 was used.

RESULTS: ChatGPT-4 scored 66.19% and 64.84% on the 2022 and 2023 exams, respectively. Inter-rater reliability between the two raters was 89.8% (standard error 0.018, P<.001). ChatGPT-4 outperformed the residents on both exams, across all training levels and within all sub-specialties except the general/pediatrics section of the 2023 exam (Z = -2.37). The performance gap narrowed with increasing residency training, as shown by the following Z-scores: PGY-2 16.08, PGY-3 9.31, and PGY-4 3.49 in 2022; PGY-2 15.57, PGY-3 8.60, and PGY-4 3.21 in 2023. On the 2022 exam, ChatGPT-4 would rank in the 99th percentile among PGY-2, the 95th percentile among PGY-3, and the 73rd percentile among PGY-4 classmates. On the 2023 exam, it would rank in the 99th, 94th, and 71st percentiles, respectively. ChatGPT-4 performed best on text-based questions (74.3%, P<.001), level-one taxonomic questions (75.1%, P<.001), and guideline-based questions (70%, P=.048). Performance did not differ significantly by sub-specialty (P=.364) or prompt length (P=.385).

CONCLUSIONS: ChatGPT-4 not only achieved passing grades on two versions of the Canadian OTOHNS NITE but also outperformed residents by a wide margin, underscoring a critical need to redesign residency assessment methods.

TRIAL REGISTRATION: N/A
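The percentile ranks the abstract reports follow from standardizing a single score against the resident score distribution and mapping the resulting Z-score through the standard normal CDF. A minimal sketch in Python of that computation, using made-up group statistics (the abstract does not report the residents' means or standard deviations, so the numbers below are purely illustrative):

```python
import math

def z_score(score: float, group_mean: float, group_sd: float) -> float:
    """Standardized distance of a single score from a group mean."""
    return (score - group_mean) / group_sd

def percentile_from_z(z: float) -> float:
    """Percentile rank under a standard normal, via the error function:
    Phi(z) = 0.5 * (1 + erf(z / sqrt(2))), scaled to 0-100."""
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical resident mean (55.0) and SD (8.0) -- NOT from the study.
z = z_score(66.19, 55.0, 8.0)
print(f"Z = {z:.2f}, percentile = {percentile_from_z(z):.1f}")
```

Under this model, the large Z-scores reported for PGY-2 cohorts (e.g., 16.08) saturate the normal CDF, which is why ChatGPT-4 lands at the 99th percentile there while the much smaller PGY-4 gaps translate to the 71st–73rd percentiles.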

Topics

Artificial Intelligence in Healthcare and Education · Tracheal and airway disorders