This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Evaluating generative AI chatbots for large-scale assessment data: comparing LLM-as-a-judge and human ratings
Citations: 0
Authors: 5
Year: 2026
Abstract
This study focuses on developing and evaluating a customized Generative AI chatbot designed to enhance access to large-scale educational data. The chatbot aims to assist researchers and policymakers in exploring complex datasets, such as NAEP, through natural language queries. The chatbot was built using a Retrieval-Augmented Generation (RAG) framework that integrates multiple specialized agents to retrieve, interpret, and synthesize educational data. One agent was selected as a case study for performance evaluation. The study compared an automated Large Language Model (LLM)-based evaluation ("LLM-as-a-judge") with human expert ratings to examine validity and consistency across three criteria: correctness, completeness, and communication quality. A total of 141 expert-generated questions reflecting typical user queries were used, each accompanied by a reference answer and source documentation. The chatbot's responses were evaluated with a three-dimensional framework covering correctness, completeness, and communication. In addition to human evaluation, an LLM-based evaluation was implemented, in which the model was provided with the rubric, human-written reference answers, and retrieved RAG content to generate automated quality assessments. Interrater reliability among human raters and the LLM-as-a-judge was computed with quadratic weighted kappa (QWK). Findings showed that the LLM-as-a-judge approach achieved agreement levels comparable to human raters and demonstrated reliability across all evaluation dimensions. Interrater reliability analyses revealed no significant differences between inter-human and human-to-LLM agreement, except in the communication dimension, where human-to-LLM consistency was higher. These results indicate that the LLM-as-a-judge method can serve as a viable and consistent alternative to human evaluation for customized RAG-based chatbot assessment. Integrating LLM-based evaluation into the assessment of Generative AI chatbots provides a scalable, reliable, and cost-effective complement to traditional human review. With human oversight for calibration and validation, this approach enables more efficient and consistent evaluation practices, advancing the use of AI tools that facilitate broader access to large-scale educational data.
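The abstract reports agreement between raters using quadratic weighted kappa (QWK). The sketch below is a minimal illustration of that analysis, not the authors' code: it compares inter-human agreement with human-to-LLM agreement on one rating dimension using scikit-learn's cohen_kappa_score with quadratic weights. The rater names, rating scale, and example scores are hypothetical.

```python
# Illustrative sketch: QWK for inter-human vs. human-to-LLM agreement.
# All ratings below are made-up placeholders on a hypothetical 1-3 scale
# for a single evaluation dimension (e.g., correctness).
from sklearn.metrics import cohen_kappa_score

human_a = [3, 2, 3, 1, 2, 3]    # hypothetical ratings from human rater A
human_b = [3, 2, 2, 1, 3, 3]    # hypothetical ratings from human rater B
llm_judge = [3, 2, 3, 1, 2, 2]  # hypothetical ratings from the LLM-as-a-judge

# Quadratic weighted kappa penalizes larger rating disagreements more heavily.
qwk_human_human = cohen_kappa_score(human_a, human_b, weights="quadratic")
qwk_human_llm = cohen_kappa_score(human_a, llm_judge, weights="quadratic")

print(f"Inter-human QWK:  {qwk_human_human:.2f}")
print(f"Human-to-LLM QWK: {qwk_human_llm:.2f}")
```

In the study, such QWK values would be computed for each of the three dimensions (correctness, completeness, communication) and compared between rater pairs; the specific statistical test for differences between agreement levels is not detailed in the abstract.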
Similar works
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller
1999 · 5,632 citations
An experiment in linguistic synthesis with a fuzzy logic controller
1975 · 5,552 citations
A Framework for Representing Knowledge
1988 · 4,548 citations
Opinion Paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy
2023 · 3,313 citations