
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

ConTEXTual Net 3D: Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings

2026 · 0 citations · 10 authors · Journal of Imaging Informatics in Medicine · Open Access

Abstract

Vision-language models can connect the text description of an object to its specific location in an image through visual grounding. This has potential applications in enhanced radiology reporting. However, these models require large annotated image-text datasets, which are lacking for PET/CT. We developed an automated pipeline to generate weak image-text labels and used it to train a 3D visual grounding model. Our weak-labeling pipeline identified sentences describing positive findings in PET/CT reports by searching for mentions of standardized uptake values (SUVmax) and axial slice numbers. These were used to automatically generate lesion masks, which were paired with the corresponding text descriptions. From 25,578 PET/CT exams, we extracted 11,356 sentence-label pairs. Using this data, we trained ConTEXTual Net 3D, which takes as input a description of a lesion and generates a corresponding segmentation mask. The model's performance was evaluated on 251 radiologist-reviewed cases and compared against LLMSeg, a 2.5D version of ConTEXTual Net, and two radiologists. We evaluated detection performance using F1 score. The weak-labeling pipeline accurately identified lesion locations in 98% of cases (246/251). ConTEXTual Net 3D achieved an F1 score of 0.80, outperforming LLMSeg (F1 = 0.22) and the 2.5D model (F1 = 0.53), though it underperformed both radiologists (F1 = 0.94 and 0.91). The model achieved better performance on ¹⁸F-fluorodeoxyglucose (F1 = 0.78) and DCFPyL (F1 = 0.75) exams than on DOTATATE (F1 = 0.58) and ¹⁸F-fluciclovine (F1 = 0.66) exams. In conclusion, our novel weak labeling pipeline accurately produced an annotated dataset of PET/CT image-text pairs. ConTEXTual Net 3D significantly outperformed other models but fell short of the performance of nuclear medicine physicians. Our study suggests that even larger datasets may be needed to close this performance gap.
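
The weak-labeling step described in the abstract hinges on finding report sentences that mention both an SUVmax value and an axial slice number. The authors' code is not reproduced here, so the following Python sketch is only an illustration of that idea: the regular expressions, the report format, and the helper name `extract_candidate_sentences` are assumptions for demonstration, not the paper's implementation.

```python
import re

# Illustrative sketch only: regexes and report format are assumptions,
# not the authors' published pipeline.
SUV_PATTERN = re.compile(r"\bSUV\s*max\b[^.]*?(\d+(?:\.\d+)?)", re.IGNORECASE)
SLICE_PATTERN = re.compile(r"\b(?:axial\s+)?(?:slice|image)\s*#?\s*(\d{1,4})", re.IGNORECASE)


def extract_candidate_sentences(report_text: str):
    """Return sentences that mention both an SUVmax value and an axial slice number.

    Such sentences are candidates for weak lesion labels: the slice number
    localizes the finding along the z-axis, and the SUVmax value can seed a
    threshold-based lesion mask in the PET volume.
    """
    # Naive sentence split on periods; a clinical-NLP sentence splitter
    # would be more robust in practice.
    sentences = re.split(r"(?<=[.])\s+", report_text)
    candidates = []
    for sentence in sentences:
        suv_match = SUV_PATTERN.search(sentence)
        slice_match = SLICE_PATTERN.search(sentence)
        if suv_match and slice_match:
            candidates.append(
                {
                    "sentence": sentence.strip(),
                    "suv_max": float(suv_match.group(1)),
                    "axial_slice": int(slice_match.group(1)),
                }
            )
    return candidates


if __name__ == "__main__":
    example = (
        "There is a hypermetabolic right hilar node with SUVmax 7.2 on axial slice 213. "
        "Physiologic uptake is seen in the myocardium."
    )
    print(extract_candidate_sentences(example))
```

The subsequent step reported in the abstract, converting each sentence hit into a lesion segmentation mask paired with its text, is not shown here; how the masks are generated from the reported slice and uptake values is specific to the authors' pipeline.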
