This is an overview page with metadata for this scientific article. The full article is available from the publisher.
LLM-Assisted Assessment in Biology Education: A Comparative Study of ChatGPT-4o, Gemini 2.5 Flash and Teacher Grading in High-School Exams
Citations: 0
Authors: 4
Year: 2025
Abstract
Large language model (LLM) graders are emerging as low-cost, rapid alternatives for evaluating open-ended student responses, but empirical evidence on their validity and pedagogical value remains limited. This study compares two state-of-the-art LLMs, ChatGPT-4o and Gemini 2.5 Flash, against benchmark scores assigned by an experienced biology teacher to 13 fourth-year students from the Integrated Technical High School at IFRN (Natal-Central Campus). Students answered an open-ended question on high-altitude physiology, scored from 0 to 10 using a four-level analytic rubric. We applied descriptive statistics, Pearson correlation, intraclass correlation coefficients, Bland-Altman plots, weighted Cohen's $\kappa$, and thematic content analysis of LLM-generated textual feedback. Both LLMs demonstrated convergent validity with teacher scores: ChatGPT-4o showed strong convergent validity ($r = 0.73$), while Gemini 2.5 Flash showed moderate convergent validity ($r = 0.57$). Additionally, the models exhibited a strong inter-model correlation ($r = 0.75$) and moderate inter-rater reliability (ICC = 0.68). However, systematic leniency emerged: ChatGPT-4o overestimated scores by $+0.73 \pm 1.13$ points and Gemini by $+1.42 \pm 0.78$ points ($p < 0.01$). Gemini's scores showed lower variance but omitted the "Insufficient" rubric category entirely, reducing diagnostic sensitivity. Qualitative analysis revealed that feedback from both LLMs consistently referenced atmospheric pressure and gas exchange but failed to mention acclimatization mechanisms in 38% (ChatGPT-4o) and 31% (Gemini) of cases. Longer explanations correlated with greater absolute grading errors ($\beta = 0.46$, $p = 0.04$), suggesting that verbosity does not imply accuracy. Our findings suggest that LLMs can approximate human grading while offering timely formative insights, yet they require calibration to address score inflation and conceptual gaps.
We propose a moderation protocol to support reliable LLM-assisted assessment in secondary biology education. As this is a pilot study with a small sample (n=13), findings should be interpreted as preliminary.
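The agreement statistics named in the abstract (Pearson correlation, Bland-Altman style bias, weighted Cohen's $\kappa$) can be sketched as follows. The score arrays below are hypothetical placeholders, not the study's data, and `quadratic_weighted_kappa` is an illustrative implementation rather than the authors' code.

```python
import numpy as np

# Hypothetical teacher and LLM scores on the 0-10 scale (n = 13, as in the study).
teacher = np.array([7, 5, 8, 6, 9, 4, 7, 8, 5, 6, 7, 9, 3])
llm     = np.array([8, 6, 8, 7, 9, 6, 8, 9, 6, 7, 8, 9, 5])

# Pearson correlation (convergent validity).
r = np.corrcoef(teacher, llm)[0, 1]

# Bland-Altman style bias: mean and SD of the paired score differences.
diff = llm - teacher
bias, sd = diff.mean(), diff.std(ddof=1)

def quadratic_weighted_kappa(a, b, n_levels=11):
    """Weighted Cohen's kappa with quadratic weights, so larger
    disagreements between raters are penalised more heavily."""
    observed = np.zeros((n_levels, n_levels))
    for x, y in zip(a, b):
        observed[x, y] += 1
    # Weight matrix: squared distance between rating levels, normalised.
    levels = np.arange(n_levels)
    weights = np.subtract.outer(levels, levels) ** 2 / (n_levels - 1) ** 2
    # Expected agreement from the marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1 - (weights * observed).sum() / (weights * expected).sum()

kappa = quadratic_weighted_kappa(teacher, llm)
print(f"r = {r:.2f}, bias = {bias:+.2f} ± {sd:.2f}, kappa = {kappa:.2f}")
```

A positive `bias` with a small SD would reproduce the systematic-leniency pattern the study reports; the moderation protocol proposed above would then recalibrate LLM scores against a teacher-graded anchor set.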
Related Work
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,231 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,084 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,444 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,423 citations