
Exploring Large Language Models Integration in the Histopathologic Diagnosis of Skin Diseases: A Comparative Study

2025 · 0 citations · 8 authors · Barw Medical Journal · Open Access

Abstract

Introduction: The exact manner in which large language models (LLMs) will be integrated into pathology is not yet fully understood. This study examines the accuracy, benefits, biases, and limitations of LLMs in diagnosing dermatologic conditions within pathology.

Methods: A pathologist compiled 60 real histopathology case scenarios of skin conditions from a hospital database. Two other pathologists reviewed each patient's demographics, clinical details, histopathology findings, and original diagnosis. These cases were presented to ChatGPT-3.5, Gemini, and an external pathologist. Each response was classified as complete agreement, partial agreement, or no agreement with the original pathologist's diagnosis.

Results: ChatGPT-3.5 had 29 (48.4%) complete agreements, 14 (23.3%) partial agreements, and 17 (28.3%) no agreements. Gemini showed 20 (33%) complete agreement, 9 (15%) partial agreement, and 31 (52%) no agreement responses. The external pathologist had 36 (60%) complete agreements, 17 (28%) partial agreements, and 7 (12%) no agreements relative to the original pathologists' diagnoses. Significant differences in diagnostic agreement were found between the LLMs and the pathologist (P < 0.001).

Conclusion: In certain instances, ChatGPT-3.5 and Gemini may provide an accurate diagnosis of skin pathologies when presented with relevant patient history and descriptions of histopathological reports. However, their overall performance is insufficient for reliable use in real-life clinical settings.

Introduction

The healthcare sector is undergoing significant transformation with the emergence of large language models (LLMs), which have the potential to revolutionize patient care and outcomes. In November 2022, OpenAI introduced a natural language model called Chat Generative Pre-Trained Transformer (ChatGPT).
It is renowned for its ability to generate responses that approximate human interaction across a variety of tasks. Gemini, developed by Google, is a text-based conversational AI tool that uses machine learning and natural language understanding to address complex inquiries. These models generate new data by identifying structures and patterns in existing data, demonstrating their versatility in producing content across different domains. Generative LLMs rely on sophisticated deep learning methods and neural network architectures to analyze, comprehend, and produce content that closely resembles human-created output. Both ChatGPT and Gemini have gained global recognition for their unprecedented ability to emulate human conversation and cognition [1-3]. ChatGPT offers a notable advantage in medical decision-making due to its proficiency in analyzing complex medical data. It is a valuable resource for healthcare professionals, providing quick insights derived from patient records, medical research, and clinical guidelines [1,4]. Moreover, ChatGPT can play a crucial role in the differential diagnostic process by synthesizing information from symptoms, medical history, and risk factors and processing this data comprehensively to present a range of potential diagnoses, thereby assisting medical practitioners in their assessments. This has the potential to improve diagnostic accuracy and reduce instances of misdiagnosis or delay [4]. The integration of ChatGPT and Gemini into medical decision-making has generated interest from various medical specialties, and multiple disciplines have published articles highlighting their significance and potential applications [2,5]. Despite the growing use of these models in diagnostics, patient management, preventive medicine, and genomic analysis across medicine, the integration of LLMs in dermatology remains limited.
This study emphasizes the exploration of large language models, highlighting their less common yet promising role in advancing dermatologic diagnostics and patient care [6]. This study aims to explore the role of LLMs and their decision-making capabilities in the field of pathology, specifically in dermatologic conditions. It focuses on ChatGPT-3.5 and Gemini and compares their accuracy and concordance with the diagnoses of human pathologists. The study also investigates the potential advantages, biases, and constraints of integrating LLM tools into pathology decision-making processes.

Methods

Case Selection

A pathologist selected 60 real case scenarios, half neoplastic and half non-neoplastic, from a hospital's medical database. The cases involved patients who had undergone biopsy and histopathological examination for skin conditions. The records included information on age, sex, and the chief complaint of the patients, in addition to a detailed description of the histopathology reports (clinical and microscopic description without the diagnosis).

Consensus Diagnosis

Two additional board-certified pathologists reviewed each case, reaching a collaborative consensus diagnosis through a meticulous review of clinical and microscopic descriptions. This process ensured diagnostic accuracy and reliability while minimizing individual biases.

Eligibility Criteria

The study included cases that had complete and relevant histopathological reports and comprehensive patient demographic information. Specifically, cases were included if they provided a definitive diagnosis in the histopathological report and contained detailed patient data such as age, gender, and clinical history. Cases were excluded if the histopathological report was incomplete, lacked critical patient information, or if the diagnosis could not be made definitively from the textual description alone.
Sampling Method

The selection process involved a systematic review of available cases from the hospital's medical database to ensure a representative sample of different dermatologic diagnoses. A random sampling method was employed to minimize selection bias and to ensure the sample was representative of the broader population of dermatologic conditions within the database. The selected cases span a range of common and less common dermatologic conditions, enhancing the generalizability of the study's findings.

Evaluation by AI Systems and External Pathologist

In March 2023, these cases were evaluated using two LLM systems, ChatGPT-3.5 and Gemini. In addition, an external board-certified pathologist was tested in the same way as the AI systems, receiving only the necessary histopathology report descriptions (without histopathological images) to ensure a fair comparison between the LLM systems and the external pathologist.

Pathologists' Experience

The pathologists involved in the study had a minimum of eight years of experience in their respective specialties, handling an average of 30 cases per month. This level of experience ensured deep familiarity with a wide range of case scenarios. Crucially, the pathologists conducting the assessments were fully informed of the study design, including the comparative analysis with AI systems. Their expertise and understanding were vital in upholding the integrity and reliability of the diagnostic evaluations throughout the study.

AI Prompting Strategy

The LLM systems were initially greeted with a prompt saying "Hello," followed by a standardized inquiry: "Please provide the most accurate diagnoses from the texts that will be given below." Each case was presented individually by copy-pasting it from a Word document and requesting each system to provide a diagnosis of the case scenario based on the information presented. The first response of each system to the inquiry was documented.
If no diagnosis was given, the prompt was repeated as follows: "Please, based on the histopathological report information given above, provide the most likely disease that causes it," until a diagnosis was obtained. In some cases, after a diagnosis was provided, an additional question was asked to specify the histologic subtype of the condition (e.g., if the diagnosis was "seborrheic keratosis", the system was asked to specify the histologic subtype). The board-certified external pathologist was tested with the same questions and asked to provide the most likely diagnosis.

Response Categorization

The responses from both systems and the external pathologist were categorized into three subtypes: complete agreement with the original diagnosis by the human pathologists, partial agreement, or no agreement. The criteria for these levels are based on the distinction between general and specific diagnostic classifications. For instance, when the original diagnosis provides a detailed type and subtype (e.g., "Seborrheic keratosis, irritated type"), a response from an AI tool or the external pathologist was classified as "complete agreement" if it accurately identified both the general diagnosis ("Seborrheic keratosis") and the specific subtype ("irritated type"). This classification acknowledges that accurate identification of both components reflects a thorough understanding and alignment with the original diagnosis. Conversely, a response was categorized as "partial agreement" if it correctly identified the general diagnosis but inaccurately specified the subtype. Finally, a response was classified as "no agreement" when both the general diagnosis and the subtype provided by the AI tool or external pathologist were incorrect.
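The three-level scheme amounts to a simple rule over general-diagnosis/subtype pairs. A minimal sketch, assuming diagnoses have been normalized to such pairs (the function name and tuple representation are illustrative, not part of the study's protocol):

```python
def classify_agreement(original, response):
    """Classify a diagnostic response against the original diagnosis.

    Both arguments are (general_diagnosis, subtype) tuples; subtype may be
    None when the original diagnosis does not specify one.
    """
    orig_general, orig_subtype = original
    resp_general, resp_subtype = response
    if orig_general != resp_general:
        return "no agreement"       # the general diagnosis itself is wrong
    if orig_subtype is not None and orig_subtype != resp_subtype:
        return "partial agreement"  # right diagnosis, wrong subtype
    return "complete agreement"     # diagnosis (and subtype, if given) match
```

For example, a response naming the correct lesion but the wrong subtype, such as `classify_agreement(("seborrheic keratosis", "irritated"), ("seborrheic keratosis", "acanthotic"))`, falls into "partial agreement".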
These classification criteria draw upon established methodologies in diagnostic agreement studies, emphasizing the importance of distinguishing between different levels of agreement based on the precision and correctness of diagnostic outputs [7].

Data Processing and Statistical Analysis

The initial processing of the acquired data involved several steps before statist
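The significance comparison reported in the Results (P < 0.001) can be checked against the published agreement counts with a standard chi-square test of independence on the 3×3 table. A pure-Python sketch (the counts come from the abstract; the code is illustrative and not the study's own analysis):

```python
# Rows: ChatGPT-3.5, Gemini, external pathologist.
# Columns: complete, partial, no agreement (60 cases per rater).
observed = [
    [29, 14, 17],  # ChatGPT-3.5
    [20,  9, 31],  # Gemini
    [36, 17,  7],  # external pathologist
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row_total * col_total / N
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)  # 2 * 2 = 4
print(f"chi2 = {chi2:.2f} on {df} df")  # prints: chi2 = 22.85 on 4 df
```

The statistic (about 22.85 on 4 degrees of freedom) exceeds 18.47, the chi-square critical value for P = 0.001, consistent with the reported P < 0.001.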
