This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
General scales unlock AI evaluation with explanatory and predictive power
Citations: 0
Authors: 26
Year: 2026
Abstract
Ensuring safe and effective use of artificial intelligence (AI) requires understanding and anticipating its performance on new tasks, from advanced scientific challenges to transformed workplace activities<sup>1-3</sup>. So far, benchmarking has guided progress in AI but has offered limited explanatory and predictive power for general-purpose AI systems<sup>4-8</sup>, attributed to limited transferability across specific tasks<sup>9-11</sup>. Here we introduce general scales for AI evaluation that elicit demand profiles explaining what capabilities common AI benchmarks truly measure, extract ability profiles quantifying the general strengths and limits of AI systems and robustly predict AI performance for new task instances. Our fully automated methodology builds on 18 rubrics, capturing a broad range of cognitive and intellectual demands, which place different task instances on the same general scales, illustrated on 15 large language models (LLMs) and 63 tasks. Both the demand and the ability profiles on these scales bring new insights such as construct validity through benchmark sensitivity and specificity and explain conflicting claims about whether AI has reasoning capabilities. Ultimately, high predictive power at the instance level becomes possible using the general scales, providing superior estimates over strong black-box baseline predictors, especially in out-of-distribution settings (new tasks and benchmarks). The scales, rubrics, battery, techniques and results presented here constitute a solid foundation for a science of AI evaluation, underpinning the reliable deployment of AI in the years ahead.
Related works
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
2017 · 20,615 citations
Generative Adversarial Nets
2023 · 19,894 citations
Visualizing and Understanding Convolutional Networks
2014 · 15,306 citations
"Why Should I Trust You?"
2016 · 14,446 citations
On a Method to Measure Supervised Multiclass Model’s Interpretability: Application to Degradation Diagnosis (Short Paper)
2024 · 13,171 citations
Authors
- Lexin Zhou
- Lorenzo Pacchiardi
- Fernando Martínez-Plumed
- Katherine M. Collins
- Yael Moros-Daval
- Seraphina Zhang
- Qinlin Zhao
- Yitian Huang
- Luning Sun
- Jonathan E. Prunty
- Zongqian Li
- Pablo Sánchez-García
- Kexin Jiang-Chen
- Pablo A. M. Casares
- Jiyun Zu
- John Burden
- Behzad Mehrbakhsh
- David Stillwell
- Marta Cebrián
- Jianjun Wang
- Peter Henderson
- Sherry Tongshuang Wu
- Patrick C. Kyllonen
- Lucy G. Cheke
- Xuejun Xie
- José Hernández-Orallo
Institutions
- Princeton University (US)
- Leverhulme Trust (GB)
- Microsoft Research Asia (China) (CN)
- Universitat Politècnica de València (ES)
- University of Cambridge (GB)
- Artificial Intelligence Research Institute (ES)
- KU Leuven (BE)
- Educational Testing Service (US)
- Centre for Automation and Robotics (ES)
- William & Mary (US)
- Williams (United States) (US)
- Carnegie Mellon University (US)