This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Natural Language Processing for Chest X‐Ray Reports in the Transformer Era: BERT‐Like Encoders for Comprehension and GPT‐Like Decoders for Generation
7
Citations
1
Author
2025
Year
Abstract
We conducted a comprehensive literature search in PubMed to illustrate the current landscape of transformer-based tools from the perspective of the transformer's two integral components: the encoder, exemplified by BERT, and the decoder, characterized by GPT. We also discuss adoption barriers and potential solutions in terms of computational burdens, interpretability concerns, ethical issues, hallucination problems, malpractice, and legal liabilities. We hope that this commentary will serve as a foundational introduction for radiologists seeking to explore the evolving technical landscape of chest X-ray report analysis in the transformer era.

Natural language processing (NLP) has gained widespread use in computer-assisted chest X-ray (CXR) report analysis, particularly since the renaissance of deep learning (DL) in the 2012 ImageNet challenge. While early endeavors predominantly employed recurrent neural networks (RNN) and convolutional neural networks (CNN) [1], the revolution was brought about by the transformer [2], whose success can be attributed to three key factors [3]. First, its self-attention mechanism enables simultaneous processing of multiple parts of an input sequence, offering significantly greater efficiency than earlier models such as RNN [4]. Second, its architecture exhibits exceptional scalability, supporting models with over 100 billion parameters that capture intricate linguistic relationships in human language [5]. Third, the availability of vast internet-based corpora and advances in computational power have made the pre-training and fine-tuning of large-scale transformer-based models feasible [6]. The development of the transformer enables the resolution of previously intractable problems and achieves expert-level performance across a broad range of CXR report analytical tasks, such as named entity recognition, question answering, and extractive summarization [7].
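The self-attention mechanism described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with random weights, not a full multi-head transformer layer: every output position attends to all input positions in one matrix product, which is the parallelism advantage over step-by-step RNNs.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ V                                # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Because the whole sequence is processed in a single pass, longer inputs scale with matrix-multiplication throughput rather than with sequential time steps.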
In this commentary, we conducted a comprehensive literature search in PubMed (Figure 1) to illustrate the current landscape, adoption barriers, and potential solutions for transformer-based tools from the perspective of the transformer's two integral components: the encoder handling comprehension and the decoder managing generation. As our primary focus is NLP, the classification criterion for encoder or decoder was based on text modules, and we excluded research purely focusing on vision transformers (ViT).

Figure 1. Literature search pipeline in PubMed to identify relevant articles published from June 12, 2017, when the transformer model was first introduced, to October 4, 2024. We followed previous systematic reviews [3, 8, 9] to design the groups of keywords: (1) "transformer"; (2) "clinical notes", "clinical reports", "clinical narratives", "clinical text", "medical notes", "medical reports", "medical narratives", or "medical text"; (3) "natural language processing", "medical language processing", "text mining", or "information extraction"; (4) "radiography", "chest film", "chest radiograph", "radiograph", or "X-rays".

As the primary means of communication between radiologists and referring physicians, CXR reports contain high-density information on patients' conditions [10]. Much like physicians interpreting CXR reports, the first step of NLP analysis is understanding the content, and an important application of the transformer encoder is explicitly converting that content into a format suitable for subsequent tasks. One notable encoder is BERT [11], which stands for bidirectional encoder representations from transformers.
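The four keyword groups from the search pipeline above can be combined into one boolean query string, with OR within each group and AND across groups. A minimal sketch (the exact query syntax submitted to PubMed in the study is an assumption; PubMed also supports field tags not shown here):

```python
# Keyword groups as listed in the Figure 1 caption.
groups = [
    ["transformer"],
    ["clinical notes", "clinical reports", "clinical narratives", "clinical text",
     "medical notes", "medical reports", "medical narratives", "medical text"],
    ["natural language processing", "medical language processing",
     "text mining", "information extraction"],
    ["radiography", "chest film", "chest radiograph", "radiograph", "X-rays"],
]

def build_query(groups):
    """OR the synonyms inside each group, then AND the groups together."""
    clauses = ["(" + " OR ".join(f'"{kw}"' for kw in group) + ")" for group in groups]
    return " AND ".join(clauses)

query = build_query(groups)
print(query)
```

Keeping the groups as data makes it easy to rerun the search with updated synonym lists or date filters.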
In contrast to predecessors that rely on large amounts of expert annotations for supervised learning [12], BERT undergoes self-supervised training on large-scale unlabeled datasets to learn language patterns and is subsequently fine-tuned with a small set of annotations on the target task [12, 13], yielding superior performance in text classification [14], named entity recognition [15], extractive summarization [16], and semantics optimization [17]. In the context of healthcare, Olthof et al. [18] built a comprehensive pipeline to evaluate BERT across datasets of varying complexity, disease prevalence, and sample size, demonstrating that BERT statistically outperformed conventional DL models such as RNN and CNN in terms of area under the curve and F1-score, with t-test p-values below 0.05.

Beyond the superior performance of BERT compared to conventional DL models, adapting it to a domain-specific corpus can further enhance its effectiveness across various tasks. Yan et al. [19] adapted four BERT-like encoders using millions of radiology reports to tackle three key tasks: identifying sentences that describe abnormal findings, assigning diagnostic codes, and extracting key sentences that summarize the reports. Their results demonstrated that domain-specific adaptation yielded statistically significant improvements in accuracy, F1-score, and ROUGE metrics across all three tasks.

Most BERT-relevant studies predominantly focus on sentence-, paragraph-, or report-level predictions, but BERT-like encoders are also well-suited for word-level pattern recognition. Chambon et al. [20] leveraged PubMed BERT [21], a biomedical-specific adaptation of BERT, to evaluate the probability of individual tokens containing protected health information, and replaced identified sensitive tokens with synthetic surrogates to ensure privacy preservation. Similarly, Weng et al.
[22] developed a system utilizing ALBERT [23], a lite BERT with reduced parameters, to identify diagnostic keywords unrelated to abnormal findings, thereby reducing false-positive alarms and outperforming regular expression-, syntactic grammar-, and conventional DL-based baselines.

BERT-derived labels can also be applied to develop DL models targeting other modalities [12, 13]. Nowak et al. [24] systematically explored the utility of BERT-generated silver labels for CXR reports and subsequently linked them to the corresponding radiographs to develop image classifiers. Compared to models trained exclusively on radiologist-annotated gold labels, integrating silver and gold labels led to improved discriminability. In a further macro-averaged analysis, synchronous training on silver and gold labels proved effective in settings with limited gold labels, whereas training first on silver labels and then on gold labels was better in cases with abundant gold labels. Zhang et al. [25] introduced a novel approach to extracting more generalizable labels from CXR reports for image classifiers, rather than relying on predefined categories: first, they used BERT to extract linguistic entities and relationships; second, they constructed a knowledge graph based on these extractions; third, radiologists refined the graph using their domain expertise. Unlike traditional multiclass labels, the established knowledge graph not only categorized each sample but also revealed interpretable relationships between categories, such as those linking anatomical regions with abnormal signs.

In addition to deriving classification labels, BERT and its advanced comprehension capabilities introduced an unprecedented innovation: the direct supervision of pixel-level segmentation models using medical text [26]. Li et al.
[26] proposed a text-augmented lesion segmentation paradigm that integrated BERT-based textual information to compensate for deficiencies in chest radiograph quality and to refine pseudo annotations for semi-supervision. These studies highlight the strength of BERT-like encoders in comprehending healthcare-related content and their potential to enhance annotation systems for modalities beyond text.

Meanwhile, researchers have identified failures of BERT models in handling complex clinical tasks. Sushil et al. [27] demonstrated that BERT implementations for clinical language inference achieved a test accuracy of 0.778. While domain-specific adaptations using medical textbooks or PubMed articles improved accuracy to 0.833, this performance still fell short of that achieved by medical experts. Potential limitations of BERT-like encoders lie in their relatively modest parameter size, although larger than earlier DL models, and their reliance on limited training corpora, such as books, Wikipedia, and selected text databases [28]. Consequently, their ability to learn human knowledge remains constrained. These shortcomings are being alleviated by GPT-like decoders, which incorporate hundreds of billions of parameters and are trained on internet-scale corpora [29].

Following the advent of BERT-like encoders, the generative pre-trained transformer (GPT) [30], the next groundbreaking leap, breaks technical barriers by enabling non-experts to perform NLP tasks through a freely conversational format without any coding. CvT2DistilGPT2 [31], a prominent report generator in the transformer era, utilizes a convolutional ViT as the image encoder and GPT-2 as the text decoder. The authors' comprehensive experiments indicated that ViT outperformed CNN and GPT surpassed BERT in encoder–decoder architectures for CXR report generation. In specific generation applications, state-of-the-art methods integrate BERT-like encoders with GPT-like decoders.
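GPT-like decoders generate a report one token at a time, each step conditioning on the tokens produced so far. A toy greedy-decoding loop makes the mechanism concrete; the "model" here is a hypothetical deterministic stand-in (real decoders score a full vocabulary and, for report generation, also condition on image features):

```python
import numpy as np

# Tiny illustrative vocabulary; a real model uses tens of thousands of tokens.
vocab = ["<bos>", "no", "acute", "findings", ".", "<eos>"]

def next_token_logits(prefix_ids):
    # Hypothetical stand-in model: deterministically favors the token whose
    # index follows the last generated one.
    logits = np.full(len(vocab), -1e9)
    logits[min(prefix_ids[-1] + 1, len(vocab) - 1)] = 0.0
    return logits

def greedy_decode(max_len=10):
    ids = [0]                                   # start from <bos>
    for _ in range(max_len):
        ids.append(int(np.argmax(next_token_logits(ids))))  # pick best token
        if vocab[ids[-1]] == "<eos>":
            break
    return " ".join(vocab[i] for i in ids[1:-1])  # strip <bos>/<eos>

print(greedy_decode())  # "no acute findings ."
```

Sampling or beam search replaces the `argmax` step in practice, trading determinism for diversity or global sequence quality.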
TranSQ [32] is one such advanced framework. Compared with earlier models, it emulates the diagnostic reasoning process of radiologists when generating reports: (1) formulating diagnostic hypothesis embeddings that represent implicit clinical intentions, (2) querying relevant visual features extracted by a ViT and synthesizing semantic embeddings through cross-modality fusion, and (3) transforming the semantic embeddings into candidate sentences based on DistilGPT [33]. TranSQ attained a BLEU-4 score of 0.205 and a ROUGE score of 0.409, whereas the best-performing baseline among 17 retrieval and generation models achieved a BLEU-4 score of 0.188 and a ROUGE score of 0.383, highlighting the superior capability of the unified transformer architecture in multi-modality.

Though GPT-like decoders have dominated text generation in the general domain, the RNN family, such as long short-term memory (LSTM) [34], still achieves good performance in generating medical reports, partially because of the highly templated nature of clinical text [32]. Kaur and Mittal [35] employed classical encoder–decoder architectures, utilizing CNN for visual feature extraction and LSTM for textual token generation. They also integrated transformer modules, not GPT-like decoders but BERT-like encoders, to generate numerical representations as LSTM inputs prior to report generation and to shortlist disease-relevant sentences afterward. Their results showed a BLEU-4 score of 0.767 and a ROUGE score of 0.897, suggesting that conventional approaches remain a viable candidate backbone for CXR report generation in specific scenarios.

In addition to quantitative metrics comparing model outputs with ground-truth reports, model-generated reports should be supplemented with evaluation by medical experts. Boag et al.
[36] conducted a broad study on automated CXR report generation, highlighting a divergence between quantitative metrics and clinical accuracy. A discrepancy between quantitative metrics and report readability has also been reported [37]. Accordingly, we emphasize the involvement of human rating in the evaluation of CXR report generation to ensure clinical correctness and readability.

In the previous sections, we reviewed the current applications of the transformer for various CXR report analytical tasks. Although the remarkable performance of BERT-like encoders and GPT-like decoders has been well established, these applications still face domain-specific problems. Some of these can be alleviated through the integration of advanced technical methods and specialized medical expertise [31, 38], while others necessitate further research for resolution.

First, the computational demands in the transformer era are substantial. For example, the large version of BERT contains 334 million parameters and GPT-3 has 175 billion. In contrast, traditional machine learning models, such as support vector machines [39] and random forests [40], require only a few hundred to a few thousand parameters. As a result, many healthcare providers cannot afford the computational costs of tailoring models from scratch. To address this, we offer several recommendations. For model development, we suggest researchers leverage pre-trained open-access models and focus on fine-tuning rather than building models from scratch. For fine-tuning, considering the varying parameter scales, we recommend parameter-efficient fine-tuning for BERT-like encoders, a technique that updates only a small subset of the model's parameters while leaving the majority of pre-trained weights unchanged [41]. An exemplary study by Taylor et al. [42] empirically validated the effectiveness of various parameter-efficient fine-tuning techniques on BERT-like encoders within the healthcare domain.
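The parameter-efficient fine-tuning idea can be sketched with LoRA-style low-rank adaptation, one widely used member of the family (the specific technique is our choice for illustration, not one prescribed by the cited study). The pre-trained weight stays frozen; only two small low-rank factors are trained, shrinking the trainable share from d*d to 2*d*r parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 8                          # hidden size, low-rank bottleneck

W = rng.normal(size=(d, d))            # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                   # trainable up-projection, zero-initialized
                                       # so the adapter starts as a no-op

def adapted_forward(x):
    # Base output plus the low-rank update; only A and B receive gradients.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
print(np.allclose(adapted_forward(x), W @ x))   # True before any training
print(f"trainable share: {2 * d * r / (d * d):.2%}")
```

With r = 8 the adapter trains roughly 2% of the parameters of the full layer, which is what makes fine-tuning affordable on modest hardware.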
For GPT-like decoders, we advocate prompt engineering techniques, such as retrieval-augmented generation, which emphasize crafting informative and instructive inputs to guide the decoders' output without changing model parameters [43]. For example, Ranjit et al. [44] proposed a method to retrieve the most relevant sentences from prior CXR reports as contextual prompts for GPT-like decoders, enabling the generation of concise and accurate reports that retain critical clinical entities. Finally, obtaining approval from ethics committees to share anonymized data can facilitate collaboration with external technical partners, helping to alleviate resource burdens.

Second, the interpretability of transformer models, including both BERT-like encoders and GPT-like decoders, is critical in healthcare applications, where decisions directly impact patients' lives. While traditional DL approaches have often been regarded as black-box models, their relatively few parameters and simple architectures render them more explainable than modern transformers with over 100 billion parameters. For example, individual layers and neurons in CNN can be dissected and visualized, providing insights into their functionality [45-48]. In contrast, understanding the behavior of neurons in transformer models remains a significant challenge due to the computational complexity associated with the scaling of neuron numbers [49]. For BERT-like encoders, though the internal neuron activations remain challenging to interpret, preliminary experiments focusing on identifying key tokens and analyzing their influence on the model's outputs have demonstrated a high degree of alignment with medical expert assessments [50, 51]. For GPT-like decoders, a key strength lies in their flexibility to generate content and align with human instructions.
This capability allows users not only to obtain expected outputs for predefined tasks but also to request explanations for these outputs, fostering enhanced interpretability and usability [52, 53]. For readers seeking a more comprehensive overview of techniques or detailed insights, we recommend these surveys [54-56].

Third, ethical considerations are paramount in the era of transformers, given their powerful ability to extract nuanced patterns from training datasets. These concerns are particularly pressing when datasets contain sensitive private information or are not representative of the target population. To address patient privacy, we recommend anonymizing input data during both the model development and deployment stages to ensure that sensitive information is neither learned by the model [57] nor inadvertently disclosed under certain prompts [58]. Dataset representativeness is also a critical issue, as underrepresentation of minority groups in training data can exacerbate performance disparities and perpetuate inequities [59]. To mitigate this risk, developers should prioritize inclusivity during data collection, and maintainers should continuously monitor model performance to ensure equitable outcomes [60].

Fourth, although GPT-like decoders have demonstrated remarkable capability in generating coherent responses to diverse user prompts and solving a wide range of tasks in a conversational format [61], they are built on the predictive probability of tokens from internet corpora rather than contextual radiological language and well-defined logic [62]. Therefore, they continue to suffer from hallucinations, a phenomenon in which model-generated content appears coherent and plausible but is factually incorrect, nonsensical, or unrelated to users' inputs [63]. Current efforts to reduce hallucination can be broadly categorized into methods applied during the training and post-training stages.
During training, key strategies include supervised fine-tuning on in-house CXR reports and reinforcement learning guided by radiologists' feedback [31, 64]. Post-training methods encompass hallucination detection, integration of external knowledge, multi-agent collaboration, and radiologist-in-the-loop frameworks [62, 65]. Due to space constraints, we encourage readers to consult these reviews [62, 66-68] for a fuller treatment of these strategies.

Lastly, even after these technical refinements, transformers may still present risks of malpractice, potentially leading to medical errors and legal liabilities [69]. Errors can arise from various sources, including inaccurate transformer outputs, clinician nonadherence to correct transformer recommendations, and poor integration of the transformer into clinical workflows [70]. Consequently, determining legal responsibility in cases of adverse outcomes remains a critical issue for various stakeholders, including software developers, maintenance teams, radiology departments, and radiologists [71]. A report by the European Commission on the safety and liability implications of artificial intelligence applies medical device laws to DL models and demonstrates that liability generally falls into two categories: civil and product liability [71]. Civil liability typically pertains to radiologists and radiology departments, while product liability applies to software developers. However, the report stops short of providing a strict and definitive framework for liability due to the inherent complexity and ambiguity of DL algorithms [71]. As a result, legal questions surrounding liability will likely continue to be addressed through courts and case law. Under existing legal frameworks, we recommend that radiologists follow the standard of care, utilizing DL models as supplementary confirmatory tools rather than substitutes for standard medical practice, to ensure beneficial outcomes for all stakeholders [69].
Additionally, for radiology departments seeking to implement transformer-based NLP tools, we suggest involving radiologists, the most important stakeholders, throughout the entire development cycle [72], and preparing in-depth training programs to familiarize radiologists with transformer-based tools, which differ significantly from routine statistical tests and are often black boxes that resist full interpretation [73]. Moreover, managing radiologists' expectations is important: both unrealistic optimism, where the transformer is seen as a replacement for expert judgment, and undue pessimism, where the transformer is perceived as offering no utility, should be avoided [74-77].

Han Yuan: Conceptualization; data curation; formal analysis; investigation; project administration; validation; visualization; writing—original draft; writing—review and editing.

None.

The author declares that he has no conflicts of interest.

This study is exempt from review by the ethics committee because it does not involve human participants, animal subjects, or sensitive data. Data sharing does not apply to this study as no datasets were generated or analyzed.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,316 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,177 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,575 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,468 citations