This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating and enhancing the performance of large language models in thyroid eye disease through customization and Chain-of-Thought strategies
Citations: 0
Authors: 16
Year: 2026
Abstract
The paucity of public awareness regarding thyroid eye disease (TED) often leads to delayed medical care. While large language models (LLMs) hold great potential for augmenting patient education, their ability to answer TED-related questions has yet to be comprehensively evaluated. This study aims to assess the capability of LLMs to address TED-related questions and to explore the practicability of customizing LLMs for disease-specific domains. Given the diverse LLM candidates, we deployed a cascade pipeline to search for the best model for TED. We first evaluated the performance of several prevailing LLMs on multiple-choice questions. The best-performing models, GPT-4 and Claude 3.5, were selected and customized to create TED-GPT and TED-Claude. Chain-of-Thought (CoT) prompting was then applied, yielding CoT-GPT and CoT-Claude. We also evaluated newer LLMs with native CoT capabilities (GPT-4-o1, GPT-4-o3, Gemini-2.0-Flash, Gemini-2.5-Pro, Claude 3.7). These models, along with their original versions, were then assessed and compared on multiple-choice questions. The better-performing TED-GPT and TED-Claude were further evaluated on short-answer and case questions, with comparisons to their original versions using the QUEST framework (Quality, Understanding/Reasoning, Expression, Safety/Harm, Trust). On multiple-choice questions, GPT-4 and Claude 3.5 achieved competitive accuracies (76.2% and 83.2%, respectively). Adding CoT and customizing GPT-4 and Claude 3.5 improved accuracy (CoT-GPT 86.1%, CoT-Claude 87.1%, TED-GPT 86.1%, TED-Claude 89.1%), outperforming all the newer LLMs. On case and short-answer questions, the customized TED-GPT and TED-Claude also outperformed their original versions. TED-Claude showed the best performance in accuracy, readability, comprehensiveness, likelihood of harm, and reasoning.
Thus, LLMs, particularly TED-Claude, achieved relatively satisfactory performance in answering TED-related questions. Moreover, using LLMs' built-in customization modules, together with CoT, effectively enhanced model performance. This indicates that clinicians can use these simple, universal methods to construct LLMs suited to specific medical domains.
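The CoT strategy described in the abstract amounts to reformulating each multiple-choice prompt so the model reasons before answering. A minimal sketch of the two prompting styles is below; the example question, option set, and instruction wording are illustrative assumptions, not the study's actual materials.

```python
# Hypothetical sketch of the two prompting strategies compared in the study:
# a plain multiple-choice prompt versus a Chain-of-Thought (CoT) prompt.
# Question text and instruction wording are illustrative, not from the paper.

def build_plain_prompt(question: str, options: list[str]) -> str:
    """Plain prompt: ask for the answer letter directly."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"{question}\n{opts}\nAnswer with the single best option letter."

def build_cot_prompt(question: str, options: list[str]) -> str:
    """CoT prompt: ask the model to reason step by step before answering."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        f"{question}\n{opts}\n"
        "Let's think step by step: first recall the relevant clinical "
        "knowledge, then eliminate implausible options, and finally state "
        "the single best option letter."
    )

if __name__ == "__main__":
    q = ("Which imaging finding is characteristic of extraocular muscle "
         "involvement in TED?")
    opts = ["Tendon-sparing muscle belly enlargement",
            "Optic disc drusen",
            "Lens subluxation",
            "Retinal detachment"]
    print(build_plain_prompt(q, opts))
    print(build_cot_prompt(q, opts))
```

Either prompt string would then be sent to the model under test; the CoT variant simply prepends the same question with an explicit instruction to reason before committing to an answer.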
Similar works
Trimmomatic: a flexible trimmer for Illumina sequence data
2014 · 68,538 citations
Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology
2015 · 31,513 citations
BEDTools: a flexible suite of utilities for comparing genomic features
2010 · 30,027 citations
HTSeq—a Python framework to work with high-throughput sequencing data
2014 · 22,482 citations
A global reference for human genetic variation
2015 · 19,701 citations
Authors
Institutions
- Shanghai Ninth People's Hospital (CN)
- Shanghai Jiao Tong University (CN)
- Shanghai Artificial Intelligence Laboratory
- East China Normal University (CN)
- China National Nuclear Corporation (CN)
- Chengdu Medical College (CN)
- Chengdu Second People's Hospital (CN)
- University of Macau (MO)
- Guangdong Academy of Medical Sciences (CN)
- Shanghai Children's Medical Center (CN)