This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating reasoning large language models with human-like thinking in ophthalmic question answering
Citations: 0
Authors: 10
Year: 2026
Abstract
OBJECTIVES: To evaluate the performance of reasoning large language models (LLMs) with human-like thinking in ophthalmic question answering.
METHODS: We evaluated two state-of-the-art open-source reasoning LLMs (DeepSeek-R1 and QwQ-32B) and one conventional non-reasoning LLM (LLaMA-3.3-70B-Instruct) on ophthalmology questions, assessing not only answer accuracy (ACC) but also the quality of their reasoning processes. First, we curated MedQA-Eye, a dataset of 967 ophthalmology questions spanning 10 subspecialties, 3 scenarios, 5 medical entities and 3 languages. Second, we proposed a novel framework based on human thinking patterns essential to medical practice to evaluate the thinking performance of reasoning LLMs on MedQA-Eye.
RESULTS: DeepSeek-R1 demonstrated superior overall ACC (90.59%, 95% CI 88.59% to 92.27%) compared with LLaMA-3.3-70B-Instruct (87.90%, 95% CI 85.69% to 89.81%, p=0.015) and QwQ-32B (84.28%, 95% CI 81.85% to 86.44%, p<0.001), with performance varying across subspecialties. Analysis of the reasoning LLMs revealed incorrect logical inference as the primary point of failure, accounting for 93.41%-94.74% of incorrectly answered questions. We further quantified semantic uncertainty in reasoning LLM thinking as a predictor of answer reliability. DeepSeek-R1 exhibited lower semantic uncertainty (1.04±3.63) compared with QwQ-32B (4.31±40.70), p<0.001.
CONCLUSION: Reasoning LLMs demonstrated superior performance in ophthalmology question answering, with DeepSeek-R1 achieving the highest ACC. Our findings demonstrate that reasoning LLMs can better simulate human-like thinking processes than conventional non-reasoning LLMs, suggesting their potential for more trustworthy LLM systems in ophthalmology.
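The reported accuracies come with 95% confidence intervals. As a minimal sketch of how such an interval can be computed for a binomial proportion, the snippet below uses the Wilson score interval; the abstract does not state which CI method the authors used, and the count of 876 correct answers (90.59% of 967) is an assumption inferred from the reported figures.

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96):
    """Wilson score 95% confidence interval for a binomial proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - margin, center + margin

# Assumed: DeepSeek-R1 answered ~876 of the 967 MedQA-Eye questions correctly
lo, hi = wilson_ci(876, 967)
print(f"ACC 95% CI: {lo:.2%} to {hi:.2%}")
```

For these counts the Wilson interval lands very close to the interval reported in the abstract, though small rounding or methodological differences may remain.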
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,557 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,447 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,944 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,797 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations