This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating Multimodal LLMs for Information Extraction from Oncology Reports Requires a Clinically Curated Ground Truth—Two-Phase Evaluation of GPT-4.1 versus GPT-4.0
Citations: 0 · Authors: 5 · Year: 2026
Abstract
To critically evaluate the performance of Generative Pre-trained Transformer (GPT)-4-based large language models (LLMs) for extracting imaging findings from oncology records, with a primary focus on quantifying the impact of reference data quality on measured performance. A two-phase study was conducted on 40 oncology medical records. In Phase 1, model outputs were compared against existing, uncurated reference summaries. In Phase 2, outputs for a 20-record subset were re-evaluated against a new “gold standard” of expert-curated, standardized summaries created by a board-certified radiologist. We systematically tested two model versions (text-only GPT-4.0 vs. multimodal GPT-4.1), two prompt designs, two input modalities (text vs. image), and two document scopes. Performance was assessed using lexical metrics (BLEU, ROUGE, METEOR) and a semantic alignment metric (Kullback–Leibler [KL] Divergence). A profound performance disparity was observed between phases. Phase 1 evaluation against uncurated references yielded modest scores (e.g., max ROUGE-1 ≈ 0.45, BLEU ≈ 0.15) and high semantic divergence (KL > 7.7). In contrast, Phase 2 evaluation against the gold-standard references resulted in substantial improvements across all configurations. The top-performing configuration—multimodal GPT-4.1 using image-based input on the full document—achieved a ROUGE-1 of 0.57, BLEU of 0.25, and a significantly lower KL Divergence of 5.96, closely approaching the expert standard. The quality and consistency of the reference standard are the most critical drivers of measured LLM performance in clinical information extraction tasks. Standard NLP metrics can be misleading when applied to uncurated “ground truth.” With a clinically validated reference, advanced multimodal models like GPT-4.1 demonstrate a powerful capability to accurately summarize complex oncology reports, highlighting the necessity of codeveloping AI models and their evaluation frameworks.
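The abstract names lexical overlap metrics (ROUGE-1, BLEU) and a semantic alignment metric (KL Divergence) for scoring model summaries against references. As a rough illustration of what such metrics measure, the following is a minimal sketch of ROUGE-1 recall and a smoothed unigram KL divergence; the paper's actual tokenization, smoothing, and metric implementations are not specified here, so treat this as an assumption-laden toy version, not the study's code.

```python
# Toy versions of two metric families named in the abstract:
# ROUGE-1 recall (lexical overlap) and KL divergence over
# smoothed unigram distributions (semantic/distributional gap).
from collections import Counter
import math

def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenization; real evaluations use proper tokenizers.
    return text.lower().split()

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams covered by the candidate summary."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(ref.values()), 1)

def kl_divergence(candidate: str, reference: str, eps: float = 1e-9) -> float:
    """KL(P_reference || Q_candidate) over eps-smoothed unigram distributions."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    vocab = set(cand) | set(ref)
    n_c, n_r = sum(cand.values()), sum(ref.values())
    kl = 0.0
    for word in vocab:
        p = (ref[word] + eps) / (n_r + eps * len(vocab))
        q = (cand[word] + eps) / (n_c + eps * len(vocab))
        kl += p * math.log(p / q)
    return kl
```

An identical candidate and reference yield ROUGE-1 recall of 1.0 and KL divergence near 0; the more the candidate's wording diverges from the curated reference, the lower the recall and the higher the divergence, which is the pattern the study's Phase 1 vs. Phase 2 comparison exploits.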
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,250 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,109 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,482 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,434 citations