This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
The AI Productivity Index (APEX)
Citations: 0
Authors: 14
Year: 2026
Abstract
We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX addresses one of the largest inefficiencies in AI research: outside of coding, benchmarks often fail to test economically relevant capabilities. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. It was built in three steps. First, we sourced experts with top-tier experience, e.g., investment bankers from Goldman Sachs. Second, experts created prompts that reflect high-value tasks in their day-to-day work. Third, experts created rubrics for evaluating model responses. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%). Qwen 3 235B is the best-performing open-source model and seventh best overall. There is a large gap between the performance of even the best models and human experts, highlighting the need for better measurement of models’ ability to produce economically valuable work.
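The abstract reports a mean score per model from rubric-based grading but does not say how rubric verdicts are combined. The sketch below is one plausible reading, not the authors' method: it assumes each rubric criterion receives a pass/fail verdict from the judge, that a test case's score is the percentage of criteria passed, and that a model's score is the unweighted mean over test cases. The function names and the example verdicts are hypothetical.

```python
from statistics import mean


def case_score(verdicts: list[bool]) -> float:
    """Percentage of rubric criteria judged as met for one response.

    Assumption: criteria are unweighted and graded pass/fail.
    """
    if not verdicts:
        raise ValueError("a rubric needs at least one criterion")
    return 100.0 * sum(verdicts) / len(verdicts)


def mean_score(all_verdicts: list[list[bool]]) -> float:
    """Unweighted mean of per-case scores across all test cases."""
    return mean(case_score(v) for v in all_verdicts)


# Hypothetical judge verdicts for three test cases (not real APEX data).
example = [
    [True, True, False, True],  # 3 of 4 criteria met -> 75.0
    [True, False],              # 1 of 2 criteria met -> 50.0
    [True, True, True],         # 3 of 3 criteria met -> 100.0
]

print(f"{mean_score(example):.1f}%")  # prints "75.0%"
```

Averaging per-case percentages, rather than pooling all criteria, keeps a test case with a long rubric from outweighing one with a short rubric. Whether APEX weights cases or criteria this way is not stated in the abstract.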