This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Cross-Sectional Descriptive Study of Comparative Accuracy of ChatGPT, Google Gemini, and Microsoft Copilot in Solving NEET PG Medical Entrance Test
Citations: 0 · Authors: 4 · Year: 2025
Abstract
Background: Artificial Intelligence (AI) is increasingly applied in healthcare and medical education, with tools capable of assisting in diagnosis, treatment planning, and exam preparation. The NEET-PG is India's national entrance examination for postgraduate medical training, with case vignettes forming a major component of assessment. AI chatbots therefore hold potential as aids in exam preparation. Previous studies have reported variable accuracy of AI tools in medical licensing exams, but head-to-head comparisons across question types, subjects, and platforms are scarce. Given their rapidly growing use by students and educators, establishing the reliability of these tools is critical. This study directly compares three leading AI chatbots. The objective was to assess and compare the accuracy of ChatGPT-4, Google Gemini, and Microsoft Copilot in solving the NEET-PG 2023 examination and to evaluate their performance across different question types and medical subjects.

Methods: This cross-sectional descriptive study evaluated the performance of three AI chatbots using a validated set of 200 NEET-PG 2023 questions sourced from PrepLadder and verified against standard textbooks. These questions were presented verbatim to ChatGPT-4, Google Gemini, and Microsoft Copilot. Each chatbot received the questions independently in separate sessions to minimize memory bias. Responses were recorded as correct or incorrect using the validated answer key, and accuracy was expressed as the percentage of correct responses. Comparative analysis was performed for overall accuracy, subject distribution, and question type (recall, analytical, image-based, case-based). Differences were assessed using the chi-square test, with p < 0.05 considered statistically significant.

Results: Microsoft Copilot achieved the highest overall accuracy with 165/200 correct responses (82.5%), followed by ChatGPT-4 with 161/200 (80.5%) and Google Gemini with 155/200 (77.5%). The difference in overall performance was not statistically significant (χ² = 1.6, p = 0.4). All three chatbots achieved 100% accuracy in Microbiology, Anesthesia, and Psychiatry, whereas lower accuracy occurred in Community Medicine, Forensic Medicine, Internal Medicine, and Radiology. No significant variation was found across subjects (χ² = 2.7, p = 0.9). By question type, recall-based items showed the highest accuracy (85.5%), followed by case-based (82.4%) and analytical (77.3%), while image-based questions were the most challenging (mean accuracy 71.0%). Although Copilot performed slightly better on recall and image-based items, the differences across the three chatbots by question type were not statistically significant (χ² = 0.35, p = 0.9). These findings highlight variability by subject and question format but no significant difference among the three tools.

Conclusion: All three AI chatbots demonstrated good accuracy in solving NEET-PG questions, performing better on recall-based subjects and less well on image-based items, reflecting current limitations in multimodal applications. They can complement exam preparation by serving as an accessible and interactive platform, offering an affordable alternative to expensive coaching. In healthcare, AI chatbots hold potential for assisting with diagnosis, treatment planning, triage, and referral, particularly in resource-limited settings. However, concerns regarding data privacy, patient confidentiality, lack of empathy, and erosion of clinical decision-making skills limit their broader adoption. Future research should evaluate evolving versions of these models, larger exam datasets, and integration into structured educational frameworks.
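The overall chi-square result reported in the abstract (χ² = 1.6) can be checked against the stated counts. A minimal sketch, assuming the comparison was run on a 3×2 contingency table of correct versus incorrect responses per chatbot (the table layout is an assumption; only the counts come from the abstract):

```python
# Reproduce the overall chi-square statistic from the reported counts.
# Assumption: a 3x2 contingency table (correct vs. incorrect per chatbot);
# the per-chatbot correct counts are taken directly from the abstract.

counts = {
    "Microsoft Copilot": 165,
    "ChatGPT-4": 161,
    "Google Gemini": 155,
}
n_questions = 200  # questions posed to each chatbot

# Observed table: [correct, incorrect] for each chatbot.
observed = [[c, n_questions - c] for c in counts.values()]

# Expected counts under the null hypothesis of equal accuracy:
# E[row, col] = row_total * col_total / grand_total.
grand_total = sum(sum(row) for row in observed)
col_totals = [sum(row[j] for row in observed) for j in range(2)]

chi2 = 0.0
for row in observed:
    for j, obs in enumerate(row):
        exp = n_questions * col_totals[j] / grand_total
        chi2 += (obs - exp) ** 2 / exp

print(f"chi-square = {chi2:.1f}")  # -> chi-square = 1.6, matching the abstract (df = 2)
```

With two degrees of freedom, χ² ≈ 1.6 corresponds to the non-significant p ≈ 0.4 reported in the abstract.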
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,231 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,084 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,444 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,423 citations