This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Human medical documentation significantly outperforms ChatGPT‐4o in critical clinical dimensions: A blinded comparative assessment in paediatric orthopaedics
Citations: 0
Authors: 6
Year: 2026
Abstract
PURPOSE: This study evaluated the quality of ChatGPT-generated medical history summaries compared to human-created documentation in a paediatric orthopaedic practice setting.

METHODS: A prospective, randomised, blinded comparative study was conducted involving 20 consecutive paediatric patients (mean age 14.2 ± 2.3 years; 11 males, 9 females) presenting with knee problems. Audio recordings of medical consultations were transcribed and processed by ChatGPT-4o (OpenAI) using standardised prompts. Three independent orthopaedic specialists evaluated both human-generated and AI-generated summaries using eight quality criteria: temporal consistency, spatial consistency, accident description, symptom accuracy, symptom specificity, previous interventions, writing style and overall impression. Each criterion was scored on a 6-point Likert scale.

RESULTS: Human-created summaries received significantly higher overall ratings (5.2 ± 0.8) compared to ChatGPT-generated summaries (4.5 ± 0.8, p < 0.001, Cohen's d = 0.80). After Bonferroni correction for multiple comparisons, statistically significant differences favouring human documentation were confirmed in four of eight criteria: temporal consistency (p < 0.001), spatial consistency (p < 0.001), accident description (p < 0.001) and overall impression (p < 0.001). No significant differences were observed for writing style and documentation of previous interventions. Inter-rater reliability was moderate (ICC = 0.64). ChatGPT demonstrated frequent temporal inconsistencies (14 of 60 evaluations, 23%) and omission of relevant accident details (21 of 60 evaluations, 35%).

CONCLUSION: While AI-generated summaries showed acceptable stylistic quality, human documentation significantly outperformed ChatGPT in critical clinical dimensions, including temporal consistency and accuracy of complex orthopaedic presentations. Current large language models are not ready to replace human medical documentation in paediatric orthopaedic practice without careful oversight. The findings support the implementation of hybrid workflows where AI assists but does not replace human clinical judgement.

LEVEL OF EVIDENCE: Level I.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,593 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,483 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,003 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,824 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations