
This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

The AI interviewer: multi-faceted evaluation of adaptive questioning by large language models

2026 · 0 citations · Scientific Reports · Open Access

Citations: 0
Authors: 12
Year: 2026

Abstract

Large language models are increasingly deployed as adaptive interviewers in qualitative research and human-computer interaction, yet systematic evaluation of their interviewing behavior remains limited. We introduce a modular LLM agent for conducting semi-structured psychological interviews and present a controlled, multi-faceted evaluation protocol to assess interviewer quality across six state-of-the-art models: Claude Sonnet 4, Gemini 2.5 Pro, GPT-5 Chat, Grok 4, Qwen3-235B A22B, and DeepSeek Chat V3.1. The agent conducts adaptive interviews over 54 main questions spanning biography, family, interests, challenges, values, work, and health, deciding for each response whether a follow-up is warranted and generating tailored follow-up questions. To enable fair comparison, we standardize interview context using transcripts from ten baseline human interviews, execute all models under identical orchestration and prompts, and use a single LLM interviewee to eliminate human response variability. Expert psycholinguists evaluate interviewer behavior on five binary criteria: benevolence (empathic tone), necessity, context-awareness, openness, and justified skip (when follow-ups are unnecessary), annotating over 2,900 items with high inter-rater reliability (Fleiss' κ = 0.67–0.93). We complement human judgment with efficiency metrics (latency, questioning intensity) and linguistic profiling via morpho-syntactic and psycholinguistic features of the interview text. Results reveal systematic trade-offs: Gemini 2.5 Pro exhibits the most empathic tone, GPT-5 Chat optimizes for speed and selective precision, Grok 4 achieves exhaustive coverage at the cost of latency and occasional over-contextualization, and Claude Sonnet 4 offers balanced versatility. Linguistic markers such as personal pronouns, tense, intensifiers, and syntactic complexity align meaningfully with human judgments, suggesting that stylistic choices track perceived interview quality. DeepSeek's format instability underscores the operational importance of schema compliance. Our reusable toolkit (prompts, orchestration code, annotation rubric) provides a foundation for principled deployment of LLM interviewers in psychological experiments, enabling researchers to match model capabilities to study goals and to audit agent behavior for empathy, appropriateness, and effectiveness.
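The per-response follow-up decision and the emphasis on schema compliance suggest a simple orchestration pattern. Below is a minimal sketch of such a loop, assuming a JSON decision schema; the `ask_llm` helper, the prompt wording, and the retry policy are hypothetical illustrations, not the authors' released toolkit:

```python
"""Sketch of an adaptive follow-up loop with schema validation.

Hypothetical reconstruction from the abstract: the helper, prompt,
and schema below are assumptions for illustration only.
"""
import json

DECISION_PROMPT = (
    "You are a semi-structured interviewer. Given the main question and the "
    "interviewee's answer, decide whether a follow-up is warranted. Reply "
    'with JSON only: {"follow_up": true|false, "question": "..." or null}'
)

def ask_llm(model: str, system: str, user: str) -> str:
    """Placeholder for a chat-completion call to the model under test."""
    raise NotImplementedError

def decide_follow_up(model: str, main_q: str, answer: str,
                     retries: int = 2) -> dict:
    """Query the interviewer model and enforce the decision schema.

    Retrying on malformed output is the kind of guard motivated by the
    abstract's note on DeepSeek's format instability.
    """
    user = f"Main question: {main_q}\nAnswer: {answer}"
    for _ in range(retries + 1):
        raw = ask_llm(model, DECISION_PROMPT, user)
        try:
            decision = json.loads(raw)
            assert isinstance(decision.get("follow_up"), bool)
            if decision["follow_up"]:
                assert isinstance(decision.get("question"), str)
            return decision
        except (json.JSONDecodeError, AssertionError):
            continue  # schema violation: re-ask the model
    # After repeated violations, default to skipping the follow-up
    # so the run stays comparable across models.
    return {"follow_up": False, "question": None}
```

Validating every decision against the schema and falling back to a skip keeps all six models under identical orchestration, which is the comparability constraint the evaluation protocol relies on.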

Similar works