This is an overview page with metadata for this scholarly article. The full article is available from the publisher.

The inadequacy of offline large language model evaluations: A need to account for personalization in model behavior

2025 · 0 citations · Patterns · Open Access

Abstract

Standard offline evaluations for language models fail to capture how these models actually behave in practice, where personalization fundamentally alters model behavior. In this work, we provide empirical evidence showcasing this phenomenon by comparing offline evaluations to field evaluations conducted by having 800 real users of ChatGPT and Gemini pose benchmark and other questions to their chat interfaces.

Topics

Topic Modeling · Artificial Intelligence in Healthcare and Education · AI in Service Interactions
