This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Evaluating generative AI chatbots for large-scale assessment data: comparing LLM-as-a-judge and human ratings
Citations: 0
Authors: 5
Year: 2026
Abstract
This study focuses on developing and evaluating a customized Generative AI chatbot designed to enhance access to large-scale educational data. The chatbot aims to assist researchers and policymakers in exploring complex datasets, such as NAEP, through natural language queries. The chatbot was built using a Retrieval-Augmented Generation (RAG) framework that integrates multiple specialized agents to retrieve, interpret, and synthesize educational data. One agent was selected as a case study for performance evaluation. The study compared an automated Large Language Model (LLM)-based evaluation ("LLM-as-a-judge") with human expert ratings to examine validity and consistency across three criteria: correctness, completeness, and communication quality. A total of 141 expert-generated questions reflecting typical user queries were used, each accompanied by a reference answer and source documentation. The chatbot's responses were evaluated with a three-dimensional framework covering correctness, completeness, and communication. In addition to human evaluation, an LLM-based evaluation was implemented, in which the model was provided with the rubric, human-written reference answers, and retrieved RAG content to generate automated quality assessments. Interrater reliability among human raters and the LLM-as-a-judge was computed with quadratic weighted kappa (QWK). Findings showed that the LLM-as-a-judge approach achieved agreement levels comparable to human raters and demonstrated reliability across all evaluation dimensions. Interrater reliability analyses revealed no significant differences between inter-human and human-to-LLM agreement, except in the communication dimension, where human-to-LLM consistency was higher. These results indicate that the LLM-as-a-judge method can serve as a viable and consistent alternative to human evaluation for customized RAG-based chatbot assessment. Integrating LLM-based evaluation into the assessment of Generative AI chatbots provides a scalable, reliable, and cost-effective complement to traditional human review. With human oversight for calibration and validation, this approach enables more efficient and consistent evaluation practices, advancing the use of AI tools that facilitate broader access to large-scale educational data.
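The abstract reports agreement between raters using quadratic weighted kappa (QWK). The sketch below is a minimal illustration of that analysis, not the authors' code: it compares inter-human agreement with human-to-LLM agreement on one rating dimension using scikit-learn's cohen_kappa_score with quadratic weights. The rater names, rating scale, and example scores are hypothetical.

```python
# Illustrative sketch: QWK for inter-human vs. human-to-LLM agreement.
# All ratings below are made-up placeholders on a hypothetical 1-3 scale
# for a single evaluation dimension (e.g., correctness).
from sklearn.metrics import cohen_kappa_score

human_a = [3, 2, 3, 1, 2, 3]    # hypothetical ratings from human rater A
human_b = [3, 2, 2, 1, 3, 3]    # hypothetical ratings from human rater B
llm_judge = [3, 2, 3, 1, 2, 2]  # hypothetical ratings from the LLM-as-a-judge

# Quadratic weighted kappa penalizes larger rating disagreements more heavily.
qwk_human_human = cohen_kappa_score(human_a, human_b, weights="quadratic")
qwk_human_llm = cohen_kappa_score(human_a, llm_judge, weights="quadratic")

print(f"Inter-human QWK:  {qwk_human_human:.2f}")
print(f"Human-to-LLM QWK: {qwk_human_llm:.2f}")
```

In the study, such QWK values would be computed for each of the three dimensions (correctness, completeness, communication) and compared between rater pairs; the specific statistical test for differences between agreement levels is not detailed in the abstract.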
Similar works
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller
1999 · 5,632 citations
An experiment in linguistic synthesis with a fuzzy logic controller
1975 · 5,552 citations
A Framework for Representing Knowledge
1988 · 4,548 citations
Opinion Paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy
2023 · 3,313 citations