Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?
0
Zitationen
11
Autoren
2024
Jahr
Abstract
Large Language Models (LLMs) are commonly evaluated using human-crafted\nbenchmarks, under the premise that higher scores implicitly reflect stronger\nhuman-like performance. However, there is growing concern that LLMs may ``game"\nthese benchmarks due to data leakage, achieving high scores while struggling\nwith tasks simple for humans. To substantively address the problem, we create\nGAOKAO-Eval, a comprehensive benchmark based on China's National College\nEntrance Examination (Gaokao), and conduct ``closed-book" evaluations for\nrepresentative models released prior to Gaokao. Contrary to prevailing\nconsensus, even after addressing data leakage and comprehensiveness,\nGAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned\ncapabilities. To better understand this mismatch, We introduce the Rasch model\nfrom cognitive psychology to analyze LLM scoring patterns and identify two key\ndiscrepancies: 1) anomalous consistent performance across various question\ndifficulties, and 2) high variance in performance on questions of similar\ndifficulty. In addition, We identified inconsistent grading of LLM-generated\nanswers among teachers and recurring mistake patterns. we find that the\nphenomenons are well-grounded in the motivations behind OpenAI o1, and o1's\nreasoning-as-difficulties can mitigate the mismatch. These results show that\nGAOKAO-Eval can reveal limitations in LLM capabilities not captured by current\nbenchmarks and highlight the need for more LLM-aligned difficulty analysis.\n
Ähnliche Arbeiten
Quasi-Experimentation: Design & Analysis Issues for Field Settings
1979 · 11.375 Zit.
Science in Action: How to Follow Scientists and Engineers through Society
1988 · 10.857 Zit.
Does Active Learning Work? A Review of the Research
2004 · 6.819 Zit.
Science in Action
1987 · 4.125 Zit.
In Search of Excellence
1984 · 4.060 Zit.