Comment on “Diagnostic Performance of Multimodal Large Language Models in the Analysis of Oral Pathology”
Citations: 0 · Authors: 2 · Year: 2026
Abstract
We read with interest the article by Suárez et al. (2025), which offers a valuable and timely investigation into the capabilities of ChatGPT-4o for interpreting clinical photographs of oral mucosal lesions. The study's meticulous methodology, including the analysis of 30 responses per image to evaluate repeatability, provides significant insight into the model's potential as a diagnostic support tool. Nonetheless, we would like to highlight a few methodological issues that are essential for interpreting the findings and directing further research in this rapidly developing area.

First, although the use of high-quality, professionally taken photographs ensures clarity, it may limit how broadly the results generalize to real-world situations. In clinical or teledentistry settings, images are frequently captured by patients or general practitioners with smartphones, often under varying lighting, angles, and resolutions. Such variability can substantially affect AI performance (Talwar et al. 2023). To better gauge the model's practical applicability, future research should test its robustness on a wider range of images that reflect these real-world conditions.

Second, the study's highly structured and directive prompt successfully guided the model to generate structured outputs for location, diagnosis, tests, and treatment. While this aids standardization, it may simplify the diagnostic task and overestimate the model's independent reasoning ability. Large language models are strongly influenced by prompt engineering (Hassanein et al. 2025). This is consistent with the results of our recent study (Hassanein et al. 2025), which compared ChatGPT-4o and DeepSeek-3 and showed that a fixed, standardized prompt can be a major limitation, potentially constraining the model's reasoning and failing to capture the iterative nature of clinical diagnosis.
It would be highly informative to examine how different prompt styles, such as less directive questions or iterative, conversational prompts that resemble a clinical history-taking dialog, affect diagnostic accuracy and reasoning depth. This could show whether the model truly integrates information or mainly follows a structured template.

Third, the remarkable precision in suggesting diagnostic tests (90.7%) and treatments (95.8%) when the diagnosis was accurate is encouraging. Nonetheless, the overall diagnostic accuracy was 58.2%. This discrepancy highlights a crucial point: the model's utility is entirely contingent on the accuracy of its initial diagnosis. An inaccurate diagnosis followed by a confidently stated but inappropriate treatment recommendation could have negative consequences in a clinical setting. The model's performance therefore needs to be assessed end to end, and future research should give more attention to this line of reasoning and to the potential dangers of confident errors.

Finally, the authors correctly conclude that ChatGPT-4o should be used in conjunction with clinical judgment rather than as a substitute for it. We strongly agree. Investigating the dynamics of human-AI collaboration is a crucial next step. How do AI suggestions affect a clinician's diagnostic accuracy? Research could examine whether the model serves as a helpful "second opinion" that reduces cognitive bias, or whether it leads to automation bias, in which the clinician relies too heavily on the AI's output.

Asmaa Abou-Bakr: writing – original draft, writing – review and editing. Fatma E. A. Hassanein: conceptualization, writing – review and editing.

During the preparation of this work, the authors used ChatGPT to improve language and readability. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication. The authors have nothing to report.
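To make the end-to-end point concrete, the reported conditional rates can be combined with the overall diagnostic accuracy. The following is a minimal sketch, assuming the test and treatment precision figures are conditional on a correct initial diagnosis and that the proportions combine multiplicatively; neither assumption is stated in the original study:

```python
# Sketch of an end-to-end correct-management rate, assuming the reported
# test/treatment precision is conditional on a correct diagnosis and the
# proportions combine multiplicatively (assumptions, not study claims).
diagnostic_accuracy = 0.582   # overall correct-diagnosis rate
test_precision = 0.907        # correct test suggestion, given correct diagnosis
treatment_precision = 0.958   # correct treatment, given correct diagnosis

correct_dx_and_treatment = diagnostic_accuracy * treatment_precision
correct_dx_and_test = diagnostic_accuracy * test_precision

print(f"Correct diagnosis AND treatment: {correct_dx_and_treatment:.1%}")  # 55.8%
print(f"Correct diagnosis AND test:      {correct_dx_and_test:.1%}")       # 52.8%
```

Under these assumptions, the model would lead to both a correct diagnosis and a correct treatment in roughly 56% of cases, which is closer to the diagnostic accuracy than to the headline treatment precision and illustrates why the pipeline should be evaluated as a whole.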
The authors declare no conflicts of interest.