This is an overview page with metadata for this scientific work. The full article is available from the publisher.
P95 Evaluation of ChatGPT and Gemini in generating differential diagnoses in diffuse interstitial lung diseases (DILDs) from transcribed radiological reports
0
Citations
2
Authors
2025
Year
Abstract
<h3>Background</h3> Diagnosis of diffuse interstitial lung diseases (DILDs) is a complex task requiring observer experience. This study analyses the accuracy of Large Language Models (LLMs) in making DILD diagnoses from transcribed radiological reports and compares their accuracy with that of thoracic radiologists. <h3>Methods</h3> The LLMs analysed were ChatGPT (GPT-4o, o1 and o3-mini) and Gemini (Gemini 2.0 Flash and Gemini 2.0 Flash Thinking). Radiologists of differing seniority (specialist registrars, senior research fellows and consultants; n=11) from a tertiary referral centre were invited to participate. No patient data were used. Instead, 40 ‘textbook’ descriptors based on HRCT patterns/signs of common DILDs were created by a consultant radiologist. To mimic varying radiologists’ reporting styles, four descriptors were generated for each DILD: narrative or telegraphic, each with and without relevant negative findings. LLMs and radiologists were presented with the same HRCT descriptors in random order. Three prompting methods were used: ‘unrestricted’, ‘restricted’ and ‘engineered’. For the ‘unrestricted’ prompt, the groups were asked to provide their differential diagnoses in decreasing order of likelihood. For the ‘restricted’ prompt (LLMs only), the LLMs were limited to three differential diagnoses. Few-shot prompting techniques were used to create an ‘engineered’ prompt, with the aim of improving the diagnostic accuracy of the LLMs. <h3>Results</h3> Overall accuracy of the LLMs (unrestricted and restricted prompts): GPT-4o: 90%, 93.3%; o1: 84.2%, 74.2%; o3-mini: 96.7%, 100%; Gemini 2.0 Flash: 73.3%, 73.3%; Gemini 2.0 Flash Thinking: 90%, 90%. Prompt engineering did not improve the diagnostic accuracy of the LLMs. Overall accuracy of the radiologists (unrestricted prompt): specialist registrars: 81.5%; senior research fellows: 96.5%; consultants: 92.1%. <h3>Conclusion</h3> The LLMs demonstrated excellent diagnostic capabilities, comparable to those of senior research fellows and consultants.
LLMs could be used as adjunctive tools to improve the diagnostic accuracy of less-experienced radiologists. The lack of benefit from prompt engineering is potentially attributable to imperfect prompting methods. Future work should aim to establish the diagnostic accuracy of LLMs designed for medical use across a wider variety of DILDs, whilst ensuring safe diagnosis of patients. <h3>Reference</h3> Sun SH, Huynh K, Cortes G, Hill R, Tran J, Yeh L, <i>et al</i>. Testing the ability and limitations of ChatGPT to generate differential diagnoses from transcribed radiologic findings. <i>Radiology</i> 2024 Oct;<b>313</b>(1):e232346.
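The three prompting strategies described in the Methods can be sketched as follows. This is a minimal illustrative sketch only: the function name, descriptor text and few-shot example are assumptions for demonstration, not the study's actual prompts.

```python
def build_prompt(descriptor: str, mode: str = "unrestricted") -> str:
    """Assemble a prompt for a transcribed HRCT descriptor (illustrative)."""
    if mode == "unrestricted":
        # Differential diagnoses in decreasing order of likelihood.
        task = ("List your differential diagnoses in decreasing "
                "order of likelihood.")
    elif mode == "restricted":
        # LLMs limited to three differential diagnoses.
        task = ("List at most three differential diagnoses in "
                "decreasing order of likelihood.")
    elif mode == "engineered":
        # Few-shot prompting: prepend a worked example before the task
        # (example contents invented here for illustration).
        example = ("Example report: 'Basal-predominant subpleural "
                   "reticulation with honeycombing.'\n"
                   "Example answer: 1. usual interstitial pneumonia ...\n\n")
        task = example + ("List your differential diagnoses in "
                          "decreasing order of likelihood.")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"HRCT report:\n{descriptor}\n\n{task}"

prompt = build_prompt("Diffuse ground-glass opacity, no honeycombing.",
                      mode="restricted")
```

The same descriptor would then be sent to each model under each prompting mode, so that accuracy differences reflect the prompt rather than the report text.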
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,324 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,189 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,588 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,470 citations