OpenAlex · Updated hourly · Last updated: 14.03.2026, 22:48

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

LLM-Assisted Assessment in Biology Education: A Comparative Study of ChatGPT-4o, Gemini 2.5 Flash and Teacher Grading in High-School Exams


0 citations · 4 authors · 2025

Abstract

Large language model (LLM) graders are emerging as low-cost, rapid alternatives for evaluating open-ended student responses, but empirical evidence on their validity and pedagogical value remains limited. This study compares two state-of-the-art LLMs (ChatGPT-4o and Gemini 2.5 Flash) against benchmark scores assigned by an experienced biology teacher to 13 fourth-year students from the Integrated Technical High School at IFRN (Natal-Central Campus). Students answered an open-ended question on high-altitude physiology, scored from 0 to 10 using a four-level analytic rubric. We applied descriptive statistics, Pearson correlation, intraclass correlation coefficients, Bland-Altman plots, weighted Cohen's $\kappa$, and thematic content analysis of LLM-generated textual feedback. Both LLMs demonstrated convergent validity with teacher scores: ChatGPT-4o showed strong convergent validity ($r=0.73$), while Gemini 2.5 Flash showed moderate convergent validity ($r=0.57$). Additionally, the models exhibited a strong inter-model correlation ($r=0.75$) and moderate inter-rater reliability (ICC $=0.68$). However, systematic leniency emerged: ChatGPT overestimated scores by $+0.73 \pm 1.13$ points and Gemini by $+1.42 \pm 0.78$ points ($p<0.01$). Gemini's scores showed lower variance but omitted the "Insufficient" category entirely, reducing diagnostic sensitivity. Qualitative analysis revealed that feedback from both LLMs consistently referenced atmospheric pressure and gas exchange but failed to mention acclimatization mechanisms in 38% (ChatGPT) and 31% (Gemini) of cases. Longer explanations correlated with greater absolute grading errors ($\beta=0.46$, $p=0.04$), suggesting that verbosity does not imply accuracy. Our findings suggest that LLMs can approximate human grading while offering timely formative insights, yet require calibration to address inflation and conceptual gaps.
We propose a moderation protocol to support reliable LLM-assisted assessment in secondary biology education. As this is a pilot study with a small sample (n=13), findings should be interpreted as preliminary.
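The abstract's core agreement statistics (Pearson correlation between graders and the Bland-Altman-style mean bias with its standard deviation) are straightforward to compute. The sketch below illustrates the calculation on invented example scores; the numbers are hypothetical and are not the study's data.

```python
from statistics import mean, stdev

# Hypothetical teacher and LLM scores on a 0-10 rubric
# (invented for illustration, NOT the study's data).
teacher = [6.0, 7.5, 5.0, 8.0, 9.0, 4.5, 7.0]
llm     = [7.0, 8.0, 6.5, 8.5, 9.5, 6.0, 7.5]

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / ((len(x) - 1) * stdev(x) * stdev(y))

# Bland-Altman-style systematic bias: mean and SD of the
# per-student differences (LLM minus teacher).
diffs = [b - a for a, b in zip(teacher, llm)]
bias, spread = mean(diffs), stdev(diffs)

print(f"r = {pearson_r(teacher, llm):.2f}")
print(f"bias = {bias:+.2f} ± {spread:.2f} points")
```

A positive mean difference, as reported for both models in the study, indicates that the LLM grades more leniently than the teacher on average.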
