This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating Multimodal LLMs for Information Extraction from Oncology Reports Requires a Clinically Curated Ground Truth—Two-Phase Evaluation of GPT-4.1 versus GPT-4.0
Citations: 0 · Authors: 5 · Year: 2026
Abstract
To critically evaluate the performance of Generative Pre-trained Transformer (GPT)-4-based large language models (LLMs) for extracting imaging findings from oncology records, with a primary focus on quantifying the impact of reference data quality on measured performance. A two-phase study was conducted on 40 oncology medical records. In Phase 1, model outputs were compared against existing, uncurated reference summaries. In Phase 2, outputs for a 20-record subset were re-evaluated against a new “gold standard” of expert-curated, standardized summaries created by a board-certified radiologist. We systematically tested two model versions (text-only GPT-4.0 vs. multimodal GPT-4.1), two prompt designs, two input modalities (text vs. image), and two document scopes. Performance was assessed using lexical metrics (BLEU, ROUGE, METEOR) and a semantic alignment metric (Kullback–Leibler [KL] Divergence). A profound performance disparity was observed between phases. Phase 1 evaluation against uncurated references yielded modest scores (e.g., max ROUGE-1 ≈ 0.45, BLEU ≈ 0.15) and high semantic divergence (KL > 7.7). In contrast, Phase 2 evaluation against the gold-standard references resulted in substantial improvements across all configurations. The top-performing configuration—multimodal GPT-4.1 using image-based input on the full document—achieved a ROUGE-1 of 0.57, BLEU of 0.25, and a significantly lower KL Divergence of 5.96, closely approaching the expert standard. The quality and consistency of the reference standard are the most critical drivers of measured LLM performance in clinical information extraction tasks. Standard NLP metrics can be misleading when applied to uncurated “ground truth.” With a clinically validated reference, advanced multimodal models like GPT-4.1 demonstrate a powerful capability to accurately summarize complex oncology reports, highlighting the necessity of codeveloping AI models and their evaluation frameworks.
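The abstract names lexical overlap metrics (ROUGE-1, BLEU) and a semantic alignment metric (KL Divergence) for scoring model summaries against references. As a rough illustration of what such metrics measure, the following is a minimal sketch of ROUGE-1 recall and a smoothed unigram KL divergence; the paper's actual tokenization, smoothing, and metric implementations are not specified here, so treat this as an assumption-laden toy version, not the study's code.

```python
# Toy versions of two metric families named in the abstract:
# ROUGE-1 recall (lexical overlap) and KL divergence over
# smoothed unigram distributions (semantic/distributional gap).
from collections import Counter
import math

def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenization; real evaluations use proper tokenizers.
    return text.lower().split()

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams covered by the candidate summary."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(ref.values()), 1)

def kl_divergence(candidate: str, reference: str, eps: float = 1e-9) -> float:
    """KL(P_reference || Q_candidate) over eps-smoothed unigram distributions."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    vocab = set(cand) | set(ref)
    n_c, n_r = sum(cand.values()), sum(ref.values())
    kl = 0.0
    for word in vocab:
        p = (ref[word] + eps) / (n_r + eps * len(vocab))
        q = (cand[word] + eps) / (n_c + eps * len(vocab))
        kl += p * math.log(p / q)
    return kl
```

An identical candidate and reference yield ROUGE-1 recall of 1.0 and KL divergence near 0; the more the candidate's wording diverges from the curated reference, the lower the recall and the higher the divergence, which is the pattern the study's Phase 1 vs. Phase 2 comparison exploits.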
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,250 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,109 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,482 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,434 citations