OpenAlex · Updated hourly · Last updated: 28 Mar 2026, 04:06

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

P95 Evaluation of ChatGPT and Gemini in generating differential diagnoses in diffuse interstitial lung diseases (DILDs) from transcribed radiological reports

Open full text at publisher

Citations: 0 · Authors: 2 · Year: 2025

Abstract

Background
Diagnosis of diffuse interstitial lung diseases (DILDs) is a complex task requiring observer experience. This study analyses the accuracy of Large Language Models (LLMs) in making DILD diagnoses from transcribed radiological reports and compares their accuracy to that of thoracic radiologists.

Methods
LLMs analysed included ChatGPT (GPT-4o, o1 and o3-mini) and Gemini (Gemini 2.0 Flash and Gemini 2.0 Flash Thinking). Radiologists of differing seniority (specialist registrars, senior research fellows, consultants [n=11]) from a tertiary referral centre were invited to participate. No patient data were used. Instead, 40 ‘textbook’ descriptors based on HRCT patterns/signs of common DILDs were created by a consultant radiologist. To mimic varying radiologists’ reporting styles, four descriptors were generated for each DILD: narrative or telegraphic style, each with and without relevant negative findings. LLMs and radiologists were presented with the same HRCT descriptors in random order. Three prompting methods were used: ‘unrestricted’, ‘restricted’ and ‘engineered’. For the ‘unrestricted’ prompt, the groups were asked to provide their differential diagnoses in decreasing order of likelihood. For the ‘restricted’ prompt (LLMs only), LLMs were limited to three differential diagnoses. Few-shot prompting techniques were used to construct an ‘engineered’ prompt, with the aim of improving the diagnostic accuracy of the LLMs.

Results
Overall accuracy of the LLMs (unrestricted and restricted prompts): GPT-4o: 90%, 93.3%; o1: 84.2%, 74.2%; o3-mini: 96.7%, 100%; Gemini 2.0 Flash: 73.3%, 73.3%; Gemini 2.0 Flash Thinking: 90%, 90%. Prompt engineering did not improve the diagnostic accuracy of the LLMs. Overall accuracy of the radiologists (unrestricted prompt): specialist registrars: 81.5%; senior research fellows: 96.5%; consultants: 92.1%.

Conclusion
LLMs demonstrated excellent diagnostic capabilities, comparable to those of senior research fellows and consultants. LLMs could be used as adjunctive tools to improve the diagnostic accuracy of less-experienced radiologists. The lack of improvement from prompt engineering is potentially attributable to imperfect prompting methods. Future work should be targeted at establishing the diagnostic accuracy of LLMs designed for medical use, across a wider variety of DILDs, whilst ensuring safe diagnosis of patients.

Reference
Sun SH, Huynh K, Cortes G, Hill R, Tran J, Yeh L, et al. Testing the ability and limitations of ChatGPT to generate differential diagnoses from transcribed radiologic findings. Radiology 2024 Oct;313(1):e232346.
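The three prompting conditions described in Methods can be illustrated with a minimal sketch. The study does not publish its exact prompt wording or access method, so everything here is an assumption: the OpenAI Python client as the access route, the model name, the example HRCT descriptor, and the few-shot exemplar are all hypothetical placeholders for illustration only.

```python
# Minimal sketch of the three prompting conditions (assumed wording; the
# study does not publish its exact prompts). Requires: pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

descriptor = (  # hypothetical HRCT descriptor, not taken from the study
    "HRCT: subpleural, basal-predominant reticulation with honeycombing "
    "and traction bronchiectasis; no ground-glass predominance."
)

# 'Unrestricted' prompt: differentials in decreasing order of likelihood.
unrestricted = (
    "Based on the following HRCT report, list your differential diagnoses "
    f"in decreasing order of likelihood.\n\n{descriptor}"
)

# 'Restricted' prompt: the model is limited to three differentials (LLMs only).
restricted = (
    "Based on the following HRCT report, list your three most likely "
    f"differential diagnoses in decreasing order of likelihood.\n\n{descriptor}"
)

# 'Engineered' prompt: few-shot prompting with a worked exemplar.
few_shot_example = (  # hypothetical exemplar, for illustration only
    "Example report: diffuse ground-glass opacity with interlobular septal "
    "thickening ('crazy paving').\n"
    "Example answer: 1. Pulmonary alveolar proteinosis 2. ... 3. ..."
)
engineered = f"{few_shot_example}\n\nNow do the same for:\n\n{descriptor}"

for name, prompt in [("unrestricted", unrestricted),
                     ("restricted", restricted),
                     ("engineered", engineered)]:
    response = client.chat.completions.create(
        model="gpt-4o",  # one of the models evaluated in the study
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {name} ---\n{response.choices[0].message.content}\n")
```

In the study, each descriptor would be presented under these conditions in random order and the ranked differentials scored against the intended DILD; this sketch only shows the prompt structure, not the scoring.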


Topics

Artificial Intelligence in Healthcare and Education · Radiomics and Machine Learning in Medical Imaging · Radiology practices and education