This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Evaluating Generative AI and Human Performance in Question-Answer Validation Tasks
Citations: 0
Authors: 3
Year: 2025
Abstract
The development of educational applications, particularly within virtual reality (VR) environments, depends on the creation of high-quality question–answer pairs to support effective learning and assessment. Ensuring the accuracy and relevance of these items is essential, making systematic validation a necessary part of the design process. This study examines the effectiveness of generative artificial intelligence (AI) models, specifically ChatGPT and DeepSeek, in validating such content by comparing their performance with that of human evaluators. The goal is to identify the most consistent and reliable approach to supporting question generation in educational contexts. To measure inter-rater reliability, Krippendorff's Alpha was used to assess consistency among multiple evaluators while accounting for agreement occurring by chance. Cohen's Kappa was also applied to analyze pairwise agreement and to evaluate the extent to which human and AI assessments aligned. These metrics were calculated across several dimensions, including correctness, clarity, relevance, and educational value, to facilitate a comparative analysis of the evaluators' performance.
Similar Works
A spreading-activation theory of semantic processing.
1975 · 8,019 cit.
Cognitive Load During Problem Solving: Effects on Learning
1988 · 7,673 cit.
International Conference on Learning Representations (ICLR 2013)
2013 · 6,255 cit.
Learning from delayed rewards
1989 · 5,452 cit.
Comprehension: A Paradigm for Cognition
1998 · 4,771 cit.