OpenAlex · Updated hourly · Last updated: May 1, 2026, 21:42

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Evaluating and enhancing the performance of large language models in thyroid eye disease through customization and Chain-of-Thought strategies

2026 · 0 citations · Scientific Reports · Open Access

0 citations · 16 authors · 2026

Abstract

The paucity of public awareness regarding thyroid eye disease (TED) often leads to delayed medical care. While large language models (LLMs) hold great potential for augmenting patient education, their ability to answer TED-related questions has yet to be comprehensively evaluated. This study aims to assess the capability of LLMs to address TED-related questions and to explore the practicability of customizing LLMs for disease-specific domains. Given the diverse LLM candidates, we deployed a cascade pipeline to search for the best model for TED. We first evaluated the performance of several prevailing LLMs on multiple-choice questions. The best-performing models, GPT-4 and Claude 3.5, were selected and customized to create TED-GPT and TED-Claude. Chain-of-Thought (CoT) prompting was then applied, resulting in CoT-GPT and CoT-Claude. We also evaluated newer LLMs with native CoT capabilities (GPT-4-o1, GPT-4-o3, Gemini-2.0-Flash, Gemini-2.5-Pro, Claude 3.7). These models, along with their original versions, were then assessed and compared on multiple-choice questions. The better-performing TED-GPT and TED-Claude were further evaluated on short-answer and case questions, with comparisons to their original versions using the QUEST framework (Quality, Understanding/Reasoning, Expression, Safety/Harm, Trust). For multiple-choice questions, GPT-4 and Claude 3.5 achieved competitive accuracies (76.2% and 83.2%, respectively). Adding CoT, as well as customizing GPT-4 and Claude 3.5, improved accuracy (CoT-GPT 86.1%, CoT-Claude 87.1%, TED-GPT 86.1%, TED-Claude 89.1%), outperforming all of the newer LLMs. For case and short-answer questions, the customized TED-GPT and TED-Claude also performed better than their original versions. TED-Claude showed the best performance in accuracy, readability, comprehensiveness, likelihood of harm, and reasoning.
Thus, LLMs, particularly TED-Claude, achieved relatively satisfactory performance in answering TED-related questions. Moreover, using the LLMs' customization modules, together with CoT, effectively enhanced model performance. This indicates that clinicians can use these simple and universal methods to construct LLMs suitable for specific medical domains.
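The two evaluation steps described in the abstract, wrapping each multiple-choice question in a Chain-of-Thought prompt and scoring the models' letter answers against a key, can be sketched as follows. This is a minimal illustration only: the prompt wording and the helper names are assumptions, not the study's actual materials, and the model call itself is omitted.

```python
def cot_prompt(question: str, options: dict[str, str]) -> str:
    """Build a Chain-of-Thought prompt for one multiple-choice question.

    The closing instruction is a generic CoT cue (an assumption here),
    asking the model to reason first and then commit to one letter.
    """
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    lines.append("Let's think step by step, then answer with a single letter.")
    return "\n".join(lines)


def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of predicted answer letters that match the key."""
    correct = sum(p.strip().upper() == a.upper()
                  for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)
```

For example, `accuracy(["A", "b", "C"], ["A", "B", "D"])` scores two of three items correct; per-model scores computed this way would correspond to the percentages reported above.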
