
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Advancing radiology reporting with large language models: Is GPT-4 the LI-RADS game changer or just a wild card?

2024 · 1 citation · Liver International · Open Access

Citations: 1 · Authors: 3 · Year: 2024

Abstract

The field of radiology has long grappled with the variability and complexity of free-text reporting. Inconsistencies in reporting style and structure can lead to the loss of critical clinical information and hinder the extraction of structured data for research and clinical purposes. This information is relevant to decision-making for many tumours, but it is even more relevant in hepatocellular carcinoma (HCC), which differs substantially from other tumours in one respect: in a high-risk context, it can be diagnosed unequivocally by radiological means. However, in real-life clinical practice, we usually face two limitations: (1) not all lesions exhibit the same features, nor do all these features confer the same probability of being an HCC, and (2) as mentioned above, the interpretation and description of radiological findings vary widely among centres and radiology specialists. This issue is further compounded in multilingual settings, where the diversity of languages adds another layer of complexity to the already challenging task of standardizing radiology reports. To standardize the descriptive language for lesions and to categorize the probability of an HCC diagnosis appropriately, the Liver Imaging Reporting and Data System (LI-RADS) was designed and has been refined up to its latest update in 2018 [1]. Nonetheless, despite its utility, standardization of radiological reports is far from universal, and adherence to LI-RADS is suboptimal or inadequate in up to 48.4% of reports [2]. Universal access to artificial intelligence (AI)-based large language models (LLMs) is a paradigm shift at all levels, including, of course, in medicine.
The opportunity presented by these LLMs, which are capable of understanding natural and unstructured language [3, 4], has been magnificently seized by Gu et al., who used GPT-4 for LI-RADS feature extraction and categorization from multilingual free-text reports. Their work represents a significant step forward in the application of large language models to radiology. LI-RADS provides a standardized framework for imaging hepatocellular carcinoma, but its effectiveness is often hindered by variability in reporting styles. By demonstrating that GPT-4 can accurately extract key LI-RADS features from free-text reports in both Korean and English, the study highlights the potential of large language models to bridge the gap between free-text reporting and structured data extraction. The study's methodology, which used fictitious reports for prompt engineering and genuine reports for external validation, provides a robust framework for evaluating the performance of GPT-4 in a real-world setting. The high accuracies achieved by the model (93% for the fictitious reports and 85% for the real ones), particularly in the extraction of major LI-RADS features such as size, arterial phase hyperenhancement and washout, underscore the potential of GPT-4 to significantly enhance the efficiency and accuracy of radiology reporting. All of this was achieved in the remarkably short time span of 32 seconds. However, what sets this study apart is not just its impressive results but also the reproducibility and extensibility of its methodology. By providing detailed examples of their code and inference process, Gu et al. have not only ensured the reproducibility of their findings but have also paved the way for future studies to build upon their work.
This level of transparency and sharing of resources is crucial for the advancement of the field, as it allows researchers to validate and extend existing models, ultimately leading to more robust and generalizable solutions. Moreover, the potential implications of this study extend far beyond LI-RADS feature extraction. By demonstrating the effectiveness of GPT-4 in handling multilingual free-text reports, the authors have opened up new avenues for research on the application of large language models to various aspects of radiology reporting, from the automated generation of structured reports to the extraction of key clinical information from unstructured data. It is therefore critical to push the limits by refining prompting strategies, optimizing model architectures and validating performance on larger and more diverse datasets. Nevertheless, all that glitters is not gold. We are only witnessing the birth of these tools, and there is still a long way to go. In the work itself, the authors describe how, in the cohort of real reports, the accuracy for ancillary features of malignancy was high but significantly lower than in the fictitious cohort. In addition, they report an error rate of 4.5%, which is not negligible. Put in context, inappropriate classification of malignancy data can be clinically relevant and have deleterious consequences. However, it should be pointed out that the accuracy results are superior to those published in previous series [5, 6]. There is also a real risk of hallucinations from LLMs. While the authors report that no obvious hallucinations were identified, there may be some that are less evident. Moreover, the risk lies not in how evident hallucinations are in a research study but in what happens when the tool is used in a real-life setting (Figure 1).
This raises the question of what level of perfection we should demand from a system before it is used in healthcare and, even more, what minimum threshold of accuracy must be met in such a critical setting. This question, however, has a philosophical dimension that extends beyond the purview of this editorial. To conclude, as we continue to push the boundaries of what is possible with these models, we can look forward to a future where the challenges of free-text reporting are a thing of the past and where structured data extraction is the norm rather than the exception. The present study represents a significant milestone on this journey, and the authors' commitment to open science and reproducibility should serve as a model for researchers in the field.

The authors do not have any disclosures to report.

Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
