OpenAlex · Updated hourly · Last updated: 16.05.2026, 12:53

This is an overview page with metadata for this scientific work. The full article is available from the publisher.

The detectability paradox: bilingual medical report generation with open-weight models and the limits of human oversight

2026 · 0 citations · Journal of the American Medical Informatics Association · Open Access

Citations: 0 · Authors: 14 · Year: 2026

Abstract

OBJECTIVES: The automation of medical report generation using large language models (LLMs) could significantly reduce physicians' documentation burden while enhancing healthcare efficiency. However, the misuse of generative artificial intelligence in medical reporting can lead to important safety risks for patients. We addressed two questions: (1) What is the quality of medical reports generated by LLMs in English and French? and (2) Can we distinguish between human-written and LLM-generated medical reports?

MATERIALS AND METHODS: We evaluated the quality of reports generated by several multilingual, open-weight LLMs using text similarity metrics on 4212 medical reports in English and French across multiple specialties. A bilingual expert panel of certified physicians (n = 4) and medical residents (n = 5) scored accuracy, fluency, and completeness of generated reports using a 1-5 Likert scale. Experts also completed a Turing-like test, blindly identifying reports as human- or machine-generated.

RESULTS: Phi-4 achieved the best overall performance (ROUGE-1: 0.70, BERTScore: 0.83). Expert evaluation confirmed high-quality reports in both languages (overall 4.6/5.0). Medical experts performed better than chance but struggled to differentiate human versus machine reports (accuracy: 0.60). Automatic classifiers showed strong performance (accuracy: 0.98).

DISCUSSION: The high quality of LLM-generated reports supports their potential to enhance healthcare efficiency in multilingual settings. However, the discrepancy between human detection difficulty and automated detection success reveals inherent limitations in relying solely on human oversight for quality assurance and misuse prevention.

CONCLUSIONS: Deployment of LLMs for medical reporting requires combining automated detection tools with human expertise to ensure patient safety. Dataset and code: https://github.com/ds4dh/medical_report_generation.
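The abstract reports ROUGE-1 as one of the text similarity metrics used to compare generated reports against reference reports. As a rough illustration of what a ROUGE-1 F1 score measures (unigram overlap between a candidate and a reference), here is a minimal sketch; it is not the authors' implementation, and real evaluations typically use an established library with proper tokenization and stemming.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Compute a simple ROUGE-1 F1: unigram overlap between two texts.

    Tokenization here is naive whitespace splitting on lowercased text,
    purely for illustration.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Multiset intersection: overlapping unigram occurrences.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Identical texts score 1.0; partial overlap scores in between.
print(rouge1_f1("the patient remains stable", "the patient remains stable"))
```

A BERTScore evaluation, by contrast, compares contextual token embeddings rather than surface unigrams, which is why the paper reports both.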
