This is an overview page with metadata for this scientific publication. The full article is available from the publisher.
Towards Safe and Trustworthy Healthcare AI: Risk Assessment of Medical Dialogue Using LLMs
Citations: 0 · Authors: 1 · Year: 2025
Abstract
Large Language Models (LLMs) are increasingly used in healthcare settings, yet concerns remain regarding their ability to handle medical dialogues safely and reliably. To address this issue, this study introduces a quantitative framework for evaluating the safety and trustworthiness of LLMs in multilingual medical dialogues. Using the German subtask of the NTCIR-18 MedNLP-CHAT dataset, we examined how 13 LLMs (encompassing general-purpose, open-source, and biomedical variants) identify medical, ethical, and legal risks. ROC-AUC-based statistical validation (one-sample t-tests and sign tests) was applied to ensure robust and reproducible evaluation. Results show that gpt-5, gpt-4o, gpt-3.5-Turbo, gpt-oss:120b, gpt-oss:20b, and gemma-3:27b consistently achieved reliable performance, while smaller and domain-specific models often failed to generalize across languages and risk types. These findings suggest that model scale and multi-domain safety alignment are key to achieving trustworthy risk reasoning in clinical dialogues, and they provide guidance for conservative deployment of LLMs in healthcare.
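The validation scheme the abstract describes can be illustrated with a short sketch. The snippet below is a minimal, illustrative reconstruction, not the authors' actual code: it assumes per-risk-type binary labels and model scores, computes a ROC-AUC for each risk type with scikit-learn, and then tests whether the AUCs exceed the 0.5 chance level with a one-sample t-test and a sign test from SciPy. All function names, variable names, and data shapes are assumptions.

```python
# Minimal sketch (assumed, not the authors' code): ROC-AUC-based
# validation with a one-sample t-test and a sign test against the
# chance level of AUC = 0.5.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

def validate_model(y_true_per_risk, y_score_per_risk, chance=0.5):
    """Test whether a model's per-risk-type AUCs exceed chance level.

    y_true_per_risk / y_score_per_risk: lists of arrays, one pair per
    risk type (e.g. medical, ethical, legal) -- an illustrative layout.
    """
    aucs = np.array([
        roc_auc_score(y_true, y_score)
        for y_true, y_score in zip(y_true_per_risk, y_score_per_risk)
    ])
    # One-sample t-test: is the mean AUC significantly above chance?
    _, t_p = stats.ttest_1samp(aucs, chance, alternative="greater")
    # Sign test: number of AUCs above chance, under a binomial null.
    wins = int((aucs > chance).sum())
    sign_p = stats.binomtest(wins, n=len(aucs), p=0.5,
                             alternative="greater").pvalue
    return aucs, t_p, sign_p

# Hypothetical usage with random data for three risk types.
rng = np.random.default_rng(0)
labels = [rng.integers(0, 2, 100) for _ in range(3)]
scores = [rng.random(100) for _ in range(3)]
print(validate_model(labels, scores))
```

Under this reading, a model would count as reliably above chance only if both tests reject the null, which is one plausible interpretation of the abstract's conservative evaluation criterion.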
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations