OpenAlex · Updated hourly · Last updated: 16.03.2026, 05:59

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

The AI Productivity Index (APEX)

2026 · 0 citations · SuperIntelligence - Robotics - Safety & Alignment · Open Access

Citations: 0

Authors: 14

Year: 2026

Abstract

We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX addresses one of the largest inefficiencies in AI research: outside of coding, benchmarks often fail to test economically relevant capabilities. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. It was built in three steps. First, we sourced experts with top-tier experience, e.g., investment bankers from Goldman Sachs. Second, experts created prompts that reflect high-value tasks in their day-to-day work. Third, experts created rubrics for evaluating model responses. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%). Qwen 3 235B is the best-performing open-source model and seventh best overall. There is a large gap between the performance of even the best models and human experts, highlighting the need for better measurement of models’ ability to produce economically valuable work.

Similar works