This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Performance Comparison of Human Doctors and Large Language Models in Tuberculosis Triage, Diagnosis, and Management: An Experimental Study (Preprint)
0
Citations
14
Authors
2025
Year
Abstract
<sec> <title>BACKGROUND</title> Tuberculosis (TB) remains a major global health challenge, particularly in low- and middle-income countries, where effective triage, diagnosis, and management are often limited. Existing decision-support tools focus on imaging and cannot integrate multi-modal clinical information, constraining their utility in complex clinical scenarios. Large Language Models (LLMs) have shown promise in assisting diagnosis and clinical decision-making in other medical fields, but evidence for their application in TB care is scarce. Evaluating LLMs for TB decision support is crucial to explore their potential to improve clinical accuracy, efficiency, and quality of care in high-burden, resource-limited settings. </sec> <sec> <title>OBJECTIVE</title> To evaluate whether large language models (LLMs) can assist tuberculosis (TB) physicians in clinical decision-making across triage, differential diagnosis, and management recommendation tasks, addressing potential delays and inequities in TB care. </sec> <sec> <title>METHODS</title> In this experimental comparative study conducted in 2025 under STARD guidelines, 17 standardized TB cases (7 simulated, 10 real) were assessed. Responses were generated by two advanced LLMs (ChatGPT-4o and DeepSeek-R1) and two TB physicians. Reference standards were established by three TB specialists. Objective performance was measured using precision, recall, and F1 scores. Subjective evaluation assessed suitability, information quality, and, for management tasks, safety, conciseness, understandability, and operability using 5-point Likert scales. Readability was measured by a Chinese R-value; group differences were analyzed using Mann-Whitney U tests. 
</sec> <sec> <title>RESULTS</title> LLMs achieved precision similar to physicians across all tasks (median 0.67 vs 0.50; U = 8695.5; P = .35) but higher recall (0.53 vs 0.33; U = 6848.5; P < .001) and F1 scores (0.58 vs 0.33; U = 7085.5; P < .001) in management recommendation tasks. In management tasks, LLMs outperformed physicians in recall (0.50 vs 0.20; U = 185.0; P < .001) and F1 (0.50 vs 0.30; U = 104.0; P < .001), with no difference in precision. Subjectively, LLMs scored higher in suitability (3.67 vs 3.00; U = 1122.0; P < .001), information quality (3.33 vs 2.67; U = 155.0; P < .001), understandability (3.67 vs 3.00; U = 4281.5; P = .022), and operability (3.67 vs 3.00; U = 4305.0; P = .025). No differences were observed in conciseness (P = .54) or safety (P = .06). Physicians’ responses were more readable (1.88 vs 2.17; U = 11427.5; P < .001). </sec> <sec> <title>CONCLUSIONS</title> LLMs can serve as adjuncts to support TB clinical decision-making, enhancing management recommendations without replacing physicians. Their use may improve decision efficiency and help reduce disparities in TB care. </sec> <sec> <title>CLINICALTRIAL</title> This experimental comparative study evaluating large language models versus tuberculosis physicians did not involve patient interventions or randomization, and therefore was not registered as a clinical trial. </sec>
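The objective metrics described in the Methods (precision, recall, F1) can be computed per case by comparing a responder's set of recommendations against the specialist reference standard. A minimal sketch, assuming set-valued answers; this is not the authors' code, and all item names are hypothetical:

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 for one case."""
    tp = len(predicted & reference)  # recommendations matching the reference standard
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical single-case example (items invented for illustration):
predicted = {"start first-line anti-TB regimen", "order sputum culture",
             "respiratory isolation"}
reference = {"start first-line anti-TB regimen", "order sputum culture",
             "baseline liver function tests", "contact tracing"}

p, r, f = precision_recall_f1(predicted, reference)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.67 0.5 0.57
```

Per-case scores like these, pooled across responders, are what a rank-based test such as Mann-Whitney U compares between the LLM and physician groups.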
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,418 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,288 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,726 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,516 citations