This is an overview page with metadata for this scientific article. The full article is available from the publisher.
AI Decision-Making Performance in Maternal–Fetal Medicine: Comparison of ChatGPT-4, Gemini, and Human Specialists in a Cross-Sectional Case-Based Study
0
Citations
10
Authors
2025
Year
Abstract
<b>Background/Objectives:</b> Large Language Models (LLMs), including ChatGPT-4 and Gemini, are increasingly incorporated into clinical care; however, their reliability within maternal-fetal medicine (MFM), a high-risk field in which diagnostic and management errors may affect both the pregnant patient and the fetus, remains uncertain. The objective was to evaluate the alignment of AI-generated case-management recommendations with those of MFM specialists, emphasizing accuracy, agreement, and clinical relevance. <b>Study Design and Setting:</b> Cross-sectional study with blinded online evaluation (November-December 2024); evaluators were blinded to responder identity (AI vs. human), and case order and response labels were randomized for each evaluator using a computer-generated sequence to reduce order and identification bias. <b>Methods:</b> Twenty hypothetical MFM cases were constructed, allowing standardized presentation of complex scenarios without patient-identifiable data and enabling consistent comparison of AI-generated and human specialist recommendations. Responses were generated by ChatGPT-4, Gemini, and three MFM specialists, then assessed by 22 blinded board-certified MFM evaluators using a 10-point Likert scale. Agreement was measured with Spearman's rho (ρ) and Cohen's kappa (κ); accuracy differences were tested with Wilcoxon signed-rank tests. <b>Results:</b> ChatGPT-4 exhibited moderate alignment (mean 6.6 ± 2.95; ρ = 0.408; κ = 0.232, <i>p</i> < 0.001), performing well in routine, guideline-driven scenarios (e.g., term oligohydramnios, well-controlled gestational hypertension, GDMA1). Gemini scored 7.0 ± 2.64 but showed effectively no consistent inter-rater agreement (κ = -0.024, <i>p</i> = 0.352), indicating that although its mean scores were slightly higher, evaluators varied widely in how they judged individual Gemini responses.
No significant difference was found between ChatGPT-4 and clinicians in median accuracy scores (Wilcoxon <i>p</i> = 0.18), while Gemini showed significantly lower accuracy (<i>p</i> < 0.01). Model performance varied primarily by case complexity: agreement was higher in straightforward, guideline-based scenarios and more variable in complex cases, whereas no consistent pattern was observed by gestational age or specific clinical domain across the 20 cases. <b>Conclusions:</b> AI shows promise in routine MFM decision-making but remains constrained in complex cases, where models sometimes under-prioritize maternal-fetal risk trade-offs or incompletely address alternative management pathways, warranting cautious integration into clinical practice. Generalizability is limited by the small number of simulated cases and the use of hypothetical vignettes rather than real-world clinical encounters.
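The agreement statistics named in the abstract can be illustrated in a minimal sketch. The data below are hypothetical two-rater scores on a 10-point scale (not the study's data); the example also shows why a high Spearman's ρ can coexist with a near-zero Cohen's κ, as reported for Gemini: ρ rewards consistent rank ordering, while κ rewards chance-corrected exact agreement.

```python
# Agreement statistics on hypothetical 10-point ratings from two raters.
# Illustrative data only; not taken from the study.

def ranks(xs):
    # Average (1-based) ranks; ties receive the mean of their positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(a, b):
    # Pearson correlation of the rank vectors.
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def cohens_kappa(a, b):
    # Chance-corrected exact-match agreement on discrete scores.
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    cats = set(a) | set(b)
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

rater_a = [6, 8, 5, 9, 7, 4, 10, 3, 2, 1]
rater_b = [7, 9, 4, 10, 6, 5, 8, 2, 3, 1]
print(round(spearman_rho(rater_a, rater_b), 3))  # → 0.927 (ranks align)
print(round(cohens_kappa(rater_a, rater_b), 3))  # → 0.0 (exact agreement at chance level)
```

The same contrast drives the abstract's interpretation: evaluators may rank responses similarly overall yet disagree widely on the exact score assigned to any individual response.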
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,102 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,468 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations