OpenAlex · Updated hourly · Last updated: 24.03.2026, 11:09

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Can artificial intelligence pass the Japanese urology board examinations?

2024 · 1 citation · International Journal of Urology · Open Access

Citations: 1 · Authors: 6 · Year: 2024

Abstract

The application of artificial intelligence (AI) is transforming diagnosis, treatment, and education in medicine. Large language models (LLMs), such as ChatGPT by OpenAI (San Francisco, USA) and Claude3 by Anthropic (San Francisco, USA), have shown promising results in passing medical licensing examinations in various countries.1 However, limited research has examined their performance on specialty board examinations, which require a higher level of domain-specific knowledge.2 Mesnard et al. assessed the performance of three LLMs on the European Board of Urology (EBU) in-service assessment using a standardized prompt: "I will pose questions to you concerning urological surgery; you will need to select the correct answer each time." Although both patients and medical professionals have recently gained access to specific medical information through LLMs, the accuracy of each model remains unclear. This study investigated the accuracy of several LLMs on the 2022 Japanese Specialty Board Examination in Urology to determine whether AI can reach a passing level and to what extent the information LLMs provide on urological knowledge can be trusted.

We selected 79 of the 98 questions from the 2022 Japanese Urology Specialty Board Examination, excluding image-based questions. Four prompt conditions were tested: zero-shot, prompt 1,3, 4 prompt 2,5 and prompt 3.3-7 The LLMs evaluated were ChatGPT (GPT-3.5 and GPT-4.0) and Claude3 (Sonnet). Each model's performance was assessed as the percentage of correct answers under each prompt condition.

The zero-shot accuracies of GPT-3.5, GPT-4.0, and Claude3 were 31.6%, 62.1%, and 51.8%, respectively, with statistically significant differences among the groups (p < 0.001). With prompts 1–3, the accuracies of GPT-3.5, GPT-4.0, and Claude3 ranged from 29.1% to 32.9%, 58.2% to 64.5%, and 53.2% to 55.7%, respectively.
Overall, GPT-4.0 scored highest among the three LLMs, attaining a passing score (more than 60%) under three of the four prompt conditions (Table 1). Regarding forbidden options, GPT-3.5 with prompt 2 selected the most (3 of 79 answers, 3.8%), whereas GPT-4.0 and Claude3 rarely selected forbidden options (0.32% and 0.63%, respectively), regardless of the prompt. Although infrequent, hallucinations, in which the LLMs generated impossible answers, were observed in all models. Fundamental errors, such as ignoring instructions to select multiple answers or inconsistencies between an explanation and its answer, occurred in eight answers (2.5%) for GPT-3.5, seven answers (4.1%) for GPT-4.0, and eight answers (2.9%) for Claude3. Chi-square tests showed no significant differences in accuracy among the three prompts within each model (GPT-3.5: p = 0.939; GPT-4.0: p = 0.875; Claude3: p = 0.968). Based on these results, we could not identify any specific compatibility between prompt variations and particular LLMs. Our findings suggest that GPT-4.0 is the most accurate of the tested models, although further improvement and optimization are necessary given the moderate accuracy of every LLM. This study found no superiority of any specific prompt. Differences in the knowledge cutoff dates of the LLMs may contribute to performance variation; however, of the 79 questions, only one concerned content approved for insurance coverage after September 2021 (nivolumab as adjuvant therapy for urothelial carcinoma, approved in Japan in March 2022), and GPT-3.5 answered this question correctly regardless of the prompt. It also remains imperative to assess which factors most influence each LLM's performance on professional medical knowledge. The study has several limitations.
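As a rough illustration of the between-model comparison, the reported zero-shot accuracies can be converted into approximate correct/incorrect counts out of 79 questions (about 25, 49, and 41 correct for GPT-3.5, GPT-4.0, and Claude3; these counts are inferred from the rounded percentages, not taken from the paper) and compared with a Pearson chi-square test of independence. A minimal sketch in plain Python, without assuming SciPy:

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for a contingency table,
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Approximate zero-shot counts out of 79 questions, inferred from the
# reported accuracies (31.6%, 62.1%, 51.8%): [correct, incorrect] per model.
table = [
    [25, 54],  # GPT-3.5
    [49, 30],  # GPT-4.0
    [41, 38],  # Claude3
]

stat = chi_square_stat(table)
print(round(stat, 2))  # ~15.14
```

With 2 degrees of freedom ((3 - 1) rows x (2 - 1) columns), a statistic above the critical value 13.82 corresponds to p < 0.001, consistent with the significance the letter reports for the zero-shot comparison; the per-question counts here are approximations only.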
First, image-based questions were excluded because they are not supported by GPT-3.5 and because of GPT-4.0's reported poor performance on them. Second, the prompt engineering techniques were selected at the authors' discretion, which may introduce bias; however, no significant difference was observed between prompts within the same model. Third, the impact of every updated version of each LLM on performance could not be investigated. Fourth, a comparison between doctors and LLMs was not conducted owing to the difficulty of recruiting doctors. Additionally, future studies should incorporate guideline information when obtaining answers from LLMs; for example, GPT-4o can perform such tasks but was not tested in this study. Further studies including image-based questions, novel LLMs, and engineered prompts are warranted to enhance the accuracy and applicability of these tools in medicine.

The authors would like to thank Enago (www.enago.jp) for the English language review.

Shuhei Okada: Conceptualization; data curation; formal analysis; investigation; methodology; writing – original draft. Shintaro Narita: Conceptualization; project administration; supervision; writing – original draft. Ryohei Yamamoto: Supervision; validation. Kazuyuki Numakura: Supervision; validation. Mitsuru Saito: Supervision; validation. Tomonori Habuchi: Supervision; validation.

Shintaro Narita received honoraria and/or consultant fees from Janssen. Tomonori Habuchi received honoraria from Janssen Pharmaceutical K.K., Takeda Pharmaceutical Company Ltd., Astellas Pharma Inc., AstraZeneca K.K., Sanofi S.A., and Bayer A.G. Tomonori Habuchi also received research funding from Mochida Pharmaceutical Co. and Sysmex Co. This study was supported in part by a JSPS KAKENHI grant.

Similar works