This is an overview page with metadata for this scientific article. The full article is available from the publisher.
AI Decision-Making Performance in Maternal–Fetal Medicine: Comparison of ChatGPT-4, Gemini, and Human Specialists in a Cross-Sectional Case-Based Study
0
Citations
10
Authors
2025
Year
Abstract
<b>Background/Objectives:</b> Large Language Models (LLMs), including ChatGPT-4 and Gemini, are increasingly incorporated into clinical care; however, their reliability within maternal-fetal medicine (MFM), a high-risk field in which diagnostic and management errors may affect both the pregnant patient and the fetus, remains uncertain. The objective was to evaluate the alignment of AI-generated case-management recommendations with those of MFM specialists, emphasizing accuracy, agreement, and clinical relevance. <b>Study Design and Setting:</b> Cross-sectional study with blinded online evaluation (November-December 2024); evaluators were blinded to responder identity (AI vs. human), and case order and response labels were randomized for each evaluator using a computer-generated sequence to reduce order and identification bias. <b>Methods:</b> Twenty hypothetical MFM cases were constructed, allowing standardized presentation of complex scenarios without patient-identifiable data and enabling consistent comparison of AI-generated and human specialist recommendations. Responses were generated by ChatGPT-4, Gemini, and three MFM specialists, then assessed by 22 blinded board-certified MFM evaluators using a 10-point Likert scale. Agreement was measured with Spearman's rho (ρ) and Cohen's kappa (κ); accuracy differences were tested with Wilcoxon signed-rank tests. <b>Results:</b> ChatGPT-4 exhibited moderate alignment (mean 6.6 ± 2.95; ρ = 0.408; κ = 0.232, <i>p</i> < 0.001), performing well in routine, guideline-driven scenarios (e.g., term oligohydramnios, well-controlled gestational hypertension, GDMA1). Gemini scored 7.0 ± 2.64 but showed effectively no consistent inter-rater agreement (κ = -0.024, <i>p</i> = 0.352), indicating that although its mean scores were slightly higher, evaluators varied widely in how they judged individual Gemini responses.
No significant difference was found between ChatGPT-4 and clinicians in median accuracy scores (Wilcoxon <i>p</i> = 0.18), while Gemini showed significantly lower accuracy (<i>p</i> < 0.01). Model performance varied primarily by case complexity: agreement was higher in straightforward, guideline-based scenarios and more variable in complex cases, whereas no consistent pattern was observed by gestational age or specific clinical domain across the 20 cases. <b>Conclusions:</b> AI shows promise in routine MFM decision-making but remains constrained in complex cases, where models sometimes under-prioritize maternal-fetal risk trade-offs or incompletely address alternative management pathways, warranting cautious integration into clinical practice. Generalizability is limited by the small number of simulated cases and the use of hypothetical vignettes rather than real-world clinical encounters.
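The agreement statistics named in the abstract can be illustrated in a minimal sketch. The data below are hypothetical two-rater scores on a 10-point scale (not the study's data); the example also shows why a high Spearman's ρ can coexist with a near-zero Cohen's κ, as reported for Gemini: ρ rewards consistent rank ordering, while κ rewards chance-corrected exact agreement.

```python
# Agreement statistics on hypothetical 10-point ratings from two raters.
# Illustrative data only; not taken from the study.

def ranks(xs):
    # Average (1-based) ranks; ties receive the mean of their positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(a, b):
    # Pearson correlation of the rank vectors.
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def cohens_kappa(a, b):
    # Chance-corrected exact-match agreement on discrete scores.
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    cats = set(a) | set(b)
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

rater_a = [6, 8, 5, 9, 7, 4, 10, 3, 2, 1]
rater_b = [7, 9, 4, 10, 6, 5, 8, 2, 3, 1]
print(round(spearman_rho(rater_a, rater_b), 3))  # → 0.927 (ranks align)
print(round(cohens_kappa(rater_a, rater_b), 3))  # → 0.0 (exact agreement at chance level)
```

The same contrast drives the abstract's interpretation: evaluators may rank responses similarly overall yet disagree widely on the exact score assigned to any individual response.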
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,245 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,102 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,468 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,429 citations