OpenAlex · Updated hourly · Last updated: 12 Mar 2026, 03:51

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Feasibility of AI-powered assessment scoring: Can large language models replace human raters?

2025 · 2 citations · The Clinical Neuropsychologist
Open full text at the publisher

2 citations · 4 authors · 2025

Abstract

<b>Objective:</b> To assess the feasibility, accuracy, and reliability of using ChatGPT-4.5 (early-access), a large language model (LLM), for automated scoring of Brief International Cognitive Assessment for Multiple Sclerosis (BICAMS) protocols. Performance of ChatGPT-4.5 was compared against human raters on scoring record forms (i.e. word lists, numeric tables, and drawing responses). <b>Method:</b> Thirty-five deidentified BICAMS protocols, including the Symbol Digit Modalities Test (SDMT), California Verbal Learning Test-II (CVLT-II), and Brief Visuospatial Memory Test-Revised (BVMT-R), were independently scored by two trained human raters and ChatGPT-4.5. Scoring with ChatGPT-4.5 involved uploading protocol scans and structured prompts. Scoring discrepancies were resolved by a blinded third rater. Intraclass correlation coefficients (ICCs), paired samples <i>t</i>-tests, and descriptive statistics evaluated interrater reliability, accuracy, and speed. <b>Results:</b> Before public release of ChatGPT-4.5, strong interrater reliability was found between ChatGPT-4.5 and human raters on all total scores (e.g. CVLT-II ICC = 0.992; SDMT ICC = 1.000; BVMT-R ICC = 0.822-0.853), with minimal scoring discrepancies per test (CVLT-II = 1.05, SDMT = 0.05, BVMT-R = 1.05-1.19). ChatGPT-4.5 identified scoring errors overlooked by two human raters and completed scoring of each BICAMS protocol in under 9 min. After ChatGPT-4.5 was publicly released, reliability decreased notably (e.g. ICC = -0.046 for BVMT-R Trial 3), and average scoring discrepancies per test increased (e.g. SDMT = 6.79). <b>Conclusions:</b> ChatGPT-4.5 demonstrated accuracy comparable to human raters, though performance variability emerged after public release. With adequate computational resources and prompt/model optimization, LLMs may streamline neuropsychological assessment, enhancing clinical efficiency and reducing human error.
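The reliability analysis above rests on intraclass correlation coefficients. As a rough illustration only (not the authors' code, and the specific ICC model they used is an assumption here), a two-way random-effects, absolute-agreement, single-rater ICC(2,1) for an n-subjects × k-raters score matrix can be sketched as:

```python
import numpy as np

def icc2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: (n_subjects, n_raters) matrix of ratings.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-rater means

    # Two-way ANOVA sums of squares
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((scores - grand) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subjects mean square
    msc = ss_cols / (k - 1)               # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Perfect agreement between two raters yields ICC = 1.0
perfect = np.array([[10, 10], [12, 12], [15, 15], [9, 9]], dtype=float)
print(round(icc2_1(perfect), 3))  # 1.0

# A constant +3 offset by one rater lowers absolute-agreement ICC
offset = np.array([[10, 13], [12, 15], [15, 18], [9, 12]], dtype=float)
print(round(icc2_1(offset), 3))  # 0.609
```

Absolute-agreement ICC penalizes systematic rater offsets (as in the second example), which is why it is a common choice when comparing an automated scorer against human raters.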

Related works

Authors

Institutions

Topics

Artificial Intelligence in Healthcare and Education · Medical Imaging and Analysis · Explainable Artificial Intelligence (XAI)