OpenAlex · Aktualisierung stündlich · Letzte Aktualisierung: 30.03.2026, 20:23

Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?

2024·0 Zitationen·arXiv (Cornell University)Open Access
Volltext beim Verlag öffnen

0

Zitationen

11

Autoren

2024

Jahr

Abstract

Large Language Models (LLMs) are commonly evaluated using human-crafted\nbenchmarks, under the premise that higher scores implicitly reflect stronger\nhuman-like performance. However, there is growing concern that LLMs may ``game"\nthese benchmarks due to data leakage, achieving high scores while struggling\nwith tasks simple for humans. To substantively address the problem, we create\nGAOKAO-Eval, a comprehensive benchmark based on China's National College\nEntrance Examination (Gaokao), and conduct ``closed-book" evaluations for\nrepresentative models released prior to Gaokao. Contrary to prevailing\nconsensus, even after addressing data leakage and comprehensiveness,\nGAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned\ncapabilities. To better understand this mismatch, We introduce the Rasch model\nfrom cognitive psychology to analyze LLM scoring patterns and identify two key\ndiscrepancies: 1) anomalous consistent performance across various question\ndifficulties, and 2) high variance in performance on questions of similar\ndifficulty. In addition, We identified inconsistent grading of LLM-generated\nanswers among teachers and recurring mistake patterns. we find that the\nphenomenons are well-grounded in the motivations behind OpenAI o1, and o1's\nreasoning-as-difficulties can mitigate the mismatch. These results show that\nGAOKAO-Eval can reveal limitations in LLM capabilities not captured by current\nbenchmarks and highlight the need for more LLM-aligned difficulty analysis.\n

Ähnliche Arbeiten

Autoren

Themen

Biomedical and Engineering EducationArtificial Intelligence in Healthcare and Education
Volltext beim Verlag öffnen