This is an overview page with metadata for this scientific work. The full article is available from the publisher.
The AI interviewer: multi-faceted evaluation of adaptive questioning by large language models
Citations: 0
Authors: 12
Year: 2026
Abstract
Large language models are increasingly deployed as adaptive interviewers in qualitative research and human-computer interaction, yet systematic evaluation of their interviewing behavior remains limited. We introduce a modular LLM agent for conducting semi-structured psychological interviews and present a controlled, multi-faceted evaluation protocol to assess interviewer quality across six state-of-the-art models: Claude Sonnet 4, Gemini 2.5 Pro, GPT-5 Chat, Grok 4, Qwen3-235B A22B, and DeepSeek Chat V3.1. The agent conducts adaptive interviews over 54 main questions spanning biography, family, interests, challenges, values, work, and health, deciding for each response whether a follow-up is warranted and generating tailored follow-up questions. To enable fair comparison, we standardize interview context using transcripts from ten baseline human interviews, execute all models under identical orchestration and prompts, and use a single LLM interviewee to eliminate human response variability. Expert psycholinguists evaluate interviewer behavior on five binary criteria: benevolence (empathic tone), necessity, context-awareness, openness, and justified skip (when follow-ups are unnecessary), annotating over 2,900 items with high inter-rater reliability (Fleiss κ = 0.67–0.93). We complement human judgment with efficiency metrics (latency, questioning intensity) and linguistic profiling via morpho-syntactic and psycholinguistic features of the interview text. Results reveal systematic trade-offs: Gemini 2.5 Pro exhibits the most empathic tone, GPT-5 Chat optimizes for speed and selective precision, Grok 4 achieves exhaustive coverage at the cost of latency and occasional over-contextualization, while Claude Sonnet 4 offers balanced versatility. Linguistic markers such as personal pronouns, tense, intensifiers, and syntactic complexity align meaningfully with human judgments, indicating that stylistic choices track perceived interview quality.
DeepSeek's format instability underscores the operational importance of schema compliance. Our reusable toolkit (prompts, orchestration code, annotation rubric) provides a foundation for principled deployment of LLM interviewers in psychological experiments, enabling researchers to match model capabilities to study goals and to audit agent behavior for empathy, appropriateness, and effectiveness.
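The inter-rater reliability reported above uses Fleiss κ, a standard chance-corrected agreement statistic for multiple raters. The sketch below is illustrative only (not the authors' code); the example ratings are hypothetical and assume each item is rated by the same number of annotators on a binary criterion such as benevolence.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts: list of rows, one per item; row[j] = number of raters who
    assigned category j to that item. Every item must have the same
    total number of raters.
    """
    N = len(counts)          # number of items
    n = sum(counts[0])       # raters per item
    k = len(counts[0])       # number of categories
    # Per-item observed agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # Marginal category proportions and chance agreement
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 4 items, 3 raters, binary criterion (yes / no)
ratings = [[3, 0], [2, 1], [3, 0], [0, 3]]
print(round(fleiss_kappa(ratings), 3))  # prints 0.625
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the range the annotation study reports.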
Similar works
The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods
2009 · 5,711 citations
The Stress Process
1981 · 4,480 citations
Mental health problems and social media exposure during COVID-19 outbreak
2020 · 2,793 citations
Cross-national prevalence and risk factors for suicidal ideation, plans and attempts
2008 · 2,633 citations
Psychological Aspects of Natural Language Use: Our Words, Our Selves
2002 · 2,556 citations