Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology

2025·11 Zitationen·CureusOpen Access

Volltext beim Verlag öffnen

Zitationen

Autoren

2025

Jahr

Abstract

Background Previous research has highlighted the potential of large language models (LLMs) in answering multiple-choice questions (MCQs) in medical physiology. However, their accuracy and reliability in specialized fields, such as blood physiology, remain underexplored. This study evaluates the performance of six free-to-use LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in solving item-analyzed MCQs on blood physiology. The findings aim to assess their suitability as educational aids. Methods This cross-sectional study at the All India Institute of Medical Sciences, Raebareli, India, involved administering a 40-item MCQ test on blood physiology to 75 first-year medical students. Item analysis utilized the Difficulty Index (DIF I), Discrimination Index (DI), and Distractor Effectiveness (DE). Internal consistency was assessed with the Kuder-Richardson 20 (KR-20) coefficient. These 40 item-analyzed MCQs were presented to six selected LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Le Chat) available as standalone Android applications on March 19, 2025. Three independent users accessed each LLM simultaneously, uploading the compiled MCQs in a Portable Document Format (PDF) file. Accuracy was determined as the percentage of correct responses averaged across all three users. Reliability was measured as the percentage of MCQs consistently answered correctly by LLM to all three users. Descriptive statistics were presented as mean ± standard deviation and percentages. Pearson's correlation coefficient or Spearman's rho was used to evaluate the associations between variables, with p < 0.05 considered significant. Results Item analysis confirmed the validity and reliability of the assessment tool, with a DIF I of 63.2 ± 20.4, a DI of 0.38 ± 0.20, a DE of 66.7 ± 33.3, and a KR-20 of 0.804. Among LLMs, Claude 3.7 demonstrated the highest reliable accuracy (95%), followed by DeepSeek (93%), Grok 3 beta (93%), ChatGPT (90%), Gemini 2.0 (88%), and Mistral Le Chat (70%). No significant correlations were found between LLM performance and MCQ difficulty, discrimination power, or distractor effectiveness. Conclusions The MCQ assessment tool exhibited an appropriate difficulty level, strong discriminatory power, and adequately constructed distractors. LLMs, particularly Claude, DeepSeek, and Grok, demonstrated high accuracy and reliability in solving blood physiology MCQs, supporting their role as supplementary educational tools. LLMs handled questions of varying difficulty, discrimination power, and distractor effectiveness with similar competence. However, given occasional errors, they should be used alongside traditional teaching methods and expert supervision.

Autoren

Institutionen

Themen

Artificial Intelligence in Healthcare and EducationTopic ModelingClinical Reasoning and Diagnostic Skills

Volltext beim Verlag öffnen

Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology

Abstract

Ähnliche Arbeiten

Autoren

Institutionen

Themen