OpenAlex · Updated hourly · Last updated: 18 March 2026, 16:40

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT

2025 · 4 citations · The American Surgeon

Citations: 4 · Authors: 3 · Year: 2025

Abstract

Objective: This study aimed to evaluate the performance of large language models (LLMs) in answering questions from the American Board of Surgery In-Training Examination (ABSITE).

Methods: Multiple-choice ABSITE Quiz questions were entered as prompts into the most popular LLMs: ChatGPT-4 (OpenAI), Copilot (Microsoft), and Gemini (Google). The study comprised 170 questions from 2017 to 2022, divided into four subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. All questions were submitted to the LLMs between October 1, 2024, and October 5, 2024, and the correct answer rate of each LLM was evaluated.

Results: The correct response rates for all questions were 79.4% for ChatGPT, 77.6% for Copilot, and 52.9% for Gemini, with Gemini significantly lower than both other LLMs (P < 0.001). In the Definitions category, the correct response rates were 93.5% for ChatGPT, 90.3% for Copilot, and 64.5% for Gemini, with Gemini significantly lower than ChatGPT and Copilot (P = 0.005 and P = 0.015, respectively). In the Biochemistry/Pharmaceutical category, the correct response rate was the same for all three LLMs (83.3%). In the Case Scenario category, the correct response rates were 76.3% for ChatGPT, 72.8% for Copilot, and 46.5% for Gemini, with Gemini significantly lower (P < 0.001). In the Treatment & Surgical Procedures category, the correct response rates were 69.2% for ChatGPT, 84.6% for Copilot, and 53.8% for Gemini. Although Gemini had the lowest accuracy in this category, the difference was not statistically significant (P = 0.236).

Conclusion: On the ABSITE Quiz, ChatGPT and Copilot performed similarly, whereas Gemini lagged significantly behind.
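As a rough illustration of how the overall comparison can be checked, the sketch below rebuilds whole-number correct counts from the rounded percentages and the 170-question total, then runs a Pearson chi-square test on each pair of models. The abstract does not name the statistical test the authors used, so the choice of test, the absence of a continuity correction, and the reconstructed counts are all assumptions. The helper names `correct_count` and `chi_square_2x2` are hypothetical.

```python
import math

TOTAL_QUESTIONS = 170  # stated in the abstract

# Overall correct-response rates reported in the abstract (percent).
reported_rates = {"ChatGPT": 79.4, "Copilot": 77.6, "Gemini": 52.9}


def correct_count(rate_percent, n=TOTAL_QUESTIONS):
    """Rebuild a whole-number count from a rounded percentage (an assumption)."""
    return round(rate_percent / 100 * n)


def chi_square_2x2(correct_a, correct_b, n=TOTAL_QUESTIONS):
    """Pearson chi-square test, without continuity correction, on a 2x2 table.

    Rows are the two models, columns are correct / incorrect answers.
    With 1 degree of freedom the chi-square survival function is
    erfc(sqrt(x / 2)), so no external statistics library is needed.
    """
    observed = [[correct_a, n - correct_a], [correct_b, n - correct_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


counts = {name: correct_count(rate) for name, rate in reported_rates.items()}

for a, b in [("ChatGPT", "Gemini"), ("Copilot", "Gemini"), ("ChatGPT", "Copilot")]:
    statistic, p_value = chi_square_2x2(counts[a], counts[b])
    print(f"{a} vs {b}: chi2 = {statistic:.2f}, p = {p_value:.4g}")
```

Under these assumptions, both comparisons against Gemini give p below 0.001 and the ChatGPT versus Copilot comparison does not reach significance. This is consistent with the reported overall results but is not a reproduction of the authors' analysis.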

Topics

Artificial Intelligence in Healthcare and Education · Cardiac, Anesthesia and Surgical Outcomes · Radiomics and Machine Learning in Medical Imaging