This is an overview page with metadata for this scientific work. The full article is available from the publisher.

S1382 AI Meets Endoscopy: Assessing ChatGPT-4o’s Accuracy in Gastroduodenal Ulcer Classification and Localization

2025 · 0 citations · The American Journal of Gastroenterology

Abstract

Introduction: Artificial intelligence, particularly large language models such as ChatGPT-4o, is transforming healthcare by enhancing medical education and supporting diagnostics. ChatGPT-4o, capable of interpreting both text and images, offers precise diagnostic suggestions. This study aimed to assess its accuracy in Forrest classification of gastroduodenal ulcers causing upper gastrointestinal bleeding, using gastroenterologist consensus as the reference standard.

Methods: We conducted a retrospective cohort study using 109 high-quality endoscopic images of gastroduodenal ulcers obtained during esophagogastroduodenoscopy between January 1 and May 15, 2025, in patients with upper gastrointestinal bleeding. Ulcers were classified by consensus of two gastroenterologists (gold standard) and evaluated by ChatGPT-4o using standardized prompts. We calculated the overall accuracy of ChatGPT-4o in classifying ulcers by Forrest class and by ulcer location, compared with expert endoscopists, using Python version 3.11 (Pandas, SciPy, Scikit-learn). To assess the reliability of ChatGPT-4o's performance, Cohen's kappa coefficient and the intraclass correlation were used. Statistical significance was set at a p-value of less than 0.05. Performance metrics, including F1-score and accuracy, were presented in the context of a confusion matrix comprising true positives, true negatives, false positives, and false negatives.

Results: ChatGPT-4o correctly identified the anatomical location (e.g., gastric vs duodenal) in 77.9% of cases, showing fairly good performance in spatial identification. However, its accuracy in identifying the correct Forrest class was 39.4%. It performed best on low-risk ulcers (Forrest class 3 and 2c) and poorly or not at all on high-risk bleeding lesions (Forrest class 1a and 1b) (binary accuracy: 77.98%, p-value: 0.00046). ChatGPT-4o's agreement with human experts was limited, especially for urgent bleeding cases (Cohen's kappa 0.209). It performed similarly for gastric and duodenal ulcers, with slightly better accuracy for duodenal ulcers (40% vs 38.2%).

Conclusion: ChatGPT-4o performs much better at identifying low-risk ulcers than high-risk ones. It struggles with sensitivity (recall) for high-risk bleeding ulcers, which is critical for clinical triage. This is likely due to underrepresentation or image ambiguity. It matched the expert gastroenterologists' classification in only ∼4 out of 10 cases, which suggests moderate to low reliability in grading ulcers by Forrest criteria.
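The agreement statistics named in the Methods (overall accuracy, Cohen's kappa, and per-class recall) can be illustrated with a minimal sketch. This is not the authors' code: the abstract states that Scikit-learn was used, whereas the sketch below relies only on the Python standard library so that it stays self-contained (the library counterpart for kappa would be `sklearn.metrics.cohen_kappa_score`). The function names and the ten label pairs are hypothetical and chosen purely for illustration; they are not study data.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of images where the model label equals the expert label."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohen_kappa(reference, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(reference)
    p_observed = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement: sum over classes of the product of marginal rates.
    p_expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def recall(reference, predicted, positive):
    """Sensitivity for one class: true positives / (true positives + false negatives)."""
    tp = sum(r == positive and p == positive for r, p in zip(reference, predicted))
    fn = sum(r == positive and p != positive for r, p in zip(reference, predicted))
    return tp / (tp + fn) if (tp + fn) else float("nan")


# Hypothetical Forrest labels for ten images (illustrative only, not study data).
expert = ["3", "3", "2c", "2c", "1b", "1a", "2a", "2b", "3", "2c"]
model = ["3", "3", "2c", "3", "2c", "2c", "2a", "2c", "3", "2c"]

print("accuracy:", accuracy(expert, model))
print("kappa:", round(cohen_kappa(expert, model), 3))
print("recall for class 1b:", recall(expert, model, "1b"))
```

In this toy example the model never assigns a high-risk class, so recall for class 1b is zero even though overall accuracy looks moderate. That is the same pattern the abstract describes: a headline accuracy figure can mask poor sensitivity for the lesions that matter most for triage, which is why kappa and per-class recall are reported alongside it.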

Topics

Artificial Intelligence in Healthcare and Education
Gastrointestinal Bleeding Diagnosis and Treatment
COVID-19 diagnosis using AI
