This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Feasibility of AI-powered assessment scoring: Can large language models replace human raters?
Citations: 2 · Authors: 4 · Year: 2025
Abstract
<b>Objective:</b> To assess the feasibility, accuracy, and reliability of using ChatGPT-4.5 (early-access), a large language model (LLM), for automated scoring of Brief International Cognitive Assessment for Multiple Sclerosis (BICAMS) protocols. Performance of ChatGPT-4.5 was compared against human raters on scoring record forms (i.e., word lists, numeric tables, and drawing responses). <b>Method:</b> Thirty-five deidentified BICAMS protocols, including the Symbol Digit Modalities Test (SDMT), California Verbal Learning Test-II (CVLT-II), and Brief Visuospatial Memory Test-Revised (BVMT-R), were independently scored by two trained human raters and by ChatGPT-4.5. Scoring with ChatGPT-4.5 involved uploading protocol scans and structured prompts. Scoring discrepancies were resolved by a blinded third rater. Intraclass correlation coefficients (ICCs), paired-samples <i>t</i>-tests, and descriptive statistics were used to evaluate interrater reliability, accuracy, and speed. <b>Results:</b> Before the public release of ChatGPT-4.5, strong interrater reliability was found between ChatGPT-4.5 and human raters on all total scores (e.g., CVLT-II ICC = 0.992; SDMT ICC = 1.000; BVMT-R ICC = 0.822-0.853), with minimal average scoring discrepancies per test (CVLT-II = 1.05, SDMT = 0.05, BVMT-R = 1.05-1.19). ChatGPT-4.5 identified scoring errors overlooked by both human raters and completed scoring of each BICAMS protocol in under 9 minutes. After ChatGPT-4.5 was publicly released, reliability decreased notably (e.g., ICC = -0.046 for BVMT-R Trial 3), and average scoring discrepancies per test increased (e.g., SDMT = 6.79). <b>Conclusions:</b> ChatGPT-4.5 demonstrated accuracy comparable to human raters, though performance variability emerged after public release. With adequate computational resources and prompt/model optimization, LLMs may streamline neuropsychological assessment, enhancing clinical efficiency and reducing human error.
Related works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,200 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,051 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,416 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,410 citations