This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Comparative Analysis of LLMs’ Performance On a Practice Radiography Certification Exam
Citations: 2
Authors: 1
Year: 2025
Abstract
PURPOSE: To compare the performance of multiple large language models (LLMs) on a practice radiography certification exam.
METHOD: Using an exploratory, nonexperimental approach, 200 multiple-choice question stems and options (correct answers and distractors) from a practice radiography certification exam were entered into 5 LLMs: ChatGPT (OpenAI), Claude (Anthropic), Copilot (Microsoft), Gemini (Google), and Perplexity (Perplexity AI). Responses were recorded as correct or incorrect, and overall accuracy rates were calculated for each LLM. McNemar tests determined if there were significant differences between accuracy rates. Performance also was evaluated and aggregated by content categories and subcategories.
RESULTS: ChatGPT had the highest overall accuracy of 83.5%, followed by Perplexity (78.9%), Copilot (78.0%), Gemini (75.0%), and Claude (71.0%). ChatGPT had a significantly higher accuracy rate than did Claude (P < .001) and Gemini (P = .02). Regarding content categories, ChatGPT was the only LLM to correctly answer all 38 patient care questions. In addition, ChatGPT had the highest number of correct responses in the areas of safety (38/48, 79.2%) and procedures (50/59, 84.7%). Copilot had the highest number of correct responses in the area of image production (43/55, 78.2%). ChatGPT also achieved superior accuracy in 4 of the 8 subcategories.
DISCUSSION: Findings from this study provide valuable insights into the performance of multiple LLMs in answering practice radiography certification exam questions. Although ChatGPT emerged as the most accurate LLM for this practice exam, caution should be exercised when using generative artificial intelligence (AI) models. Because LLMs can generate false and incorrect information, responses must be checked for accuracy, and the models should be corrected when inaccurate responses are given.
CONCLUSION: Among the 5 LLMs compared in this study, ChatGPT was the most accurate model. As interest in generative AI continues to increase and new language applications become readily available, users should understand the limitations of LLMs and check responses for accuracy. Future research could include additional practice exams in other primary pathways, including magnetic resonance imaging, nuclear medicine technology, radiation therapy, and sonography.
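The scoring described in the abstract (correct/incorrect per question, an accuracy rate per model, and a McNemar test on paired responses) can be sketched as follows. This is a minimal illustration, not the authors' analysis: the abstract does not say which McNemar variant was used, so the exact binomial form is an assumption, and the function names `accuracy_pct` and `mcnemar_exact` as well as the ten-question response vectors are hypothetical. Only the three category counts (38/48, 50/59, 43/55) are taken from the abstract.

```python
from math import comb


def accuracy_pct(n_correct, n_total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * n_correct / n_total, 1)


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    Assumption: the exact binomial variant; the abstract does not specify one.
    Returns (b, c, p), where b counts questions only model A answered
    correctly and c counts questions only model B answered correctly.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must answer the same questions")
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Category counts reported in the abstract
safety = accuracy_pct(38, 48)            # ChatGPT, safety
procedures = accuracy_pct(50, 59)        # ChatGPT, procedures
image_production = accuracy_pct(43, 55)  # Copilot, image production

# Hypothetical toy responses (1 = correct), NOT data from the study
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
model_b = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
b, c, p_value = mcnemar_exact(model_a, model_b)
```

The test uses only the discordant pairs (questions where exactly one model was correct), which is why two models with different accuracy rates on a small question set may still not differ significantly.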
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,758 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,666 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,220 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,896 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations