This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Conceptual proposal for LLM-generated FDG PET/CT follow-up reports in melanoma: a pilot study on model stability and blinded expert evaluation
Citations: 0
Authors: 14
Year: 2026
Abstract
Purpose: Oncological patients regularly undergo PET/CT restaging, which requires a report that outlines the current disease status and highlights relevant changes since the previous PET/CT. Large language models (LLMs) may help with this documentation in the future. This pilot study assesses LLM performance, focusing on test–retest stability and reproducibility.

Methods: Three textbook melanoma follow-up cases of increasing complexity (involving one to eight organs) were selected. From standardized text-only prompts (no imaging data), follow-up reports were written by GPT-4o and Claude Sonnet 4 (each producing three independent revisions) and by three nuclear medicine residents. This yielded nine reports per case (27 in total). Six blinded nuclear medicine experts (three internal, three external) performed test–retest evaluations of report quality and authorship identification.

Results: The cosine similarity analysis revealed high intra-case coherence (mean: 0.599–0.727) regardless of authorship. The external human readers consistently rated reports higher than the internal human readers. The LLM-generated reports received comparable or superior ratings to human reports, with Claude achieving the highest external reader scores (mean 0.926, standard deviation 0.263, on a 0–1 scale). Human performance declined with case complexity, while Claude, in particular, improved. The external readers significantly preferred the LLM impressions (Fisher's exact test, p = 0.005). Neither the human nor the LLM readers reliably identified authorship (balanced accuracy 0.343–0.500).

Conclusion: In this pilot, blinded expert evaluation demonstrated that current LLMs can generate melanoma [18F]fluorodeoxyglucose PET/CT reports from text prompts that are comparable in quality to human-authored reports. High test–retest stability was obtained. Larger studies will be required to confirm these findings.
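The abstract reports cosine similarity between reports and balanced accuracy for authorship identification without defining either metric. The following is a minimal sketch of both, not the authors' pipeline: the abstract does not say how report texts were vectorized, so the plain word-count vectors, the function names, and the two-class "human"/"llm" labels are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts using raw word-count vectors.

    Assumption: whitespace tokenization and term counts; the study may
    have used a different representation (e.g. embeddings).
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the shared vocabulary only.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def balanced_accuracy(y_true: list, y_pred: list) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: a reader who always guesses "llm" gets every LLM
# report right and every human report wrong.
truth = ["human", "human", "llm", "llm"]
guess = ["llm", "llm", "llm", "llm"]
score = balanced_accuracy(truth, guess)
```

Under this two-class assumption, a reader who always guesses the same author scores 0.5, so values at or below 0.5 are consistent with readers not identifying authorship reliably.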