OpenAlex · Updated hourly · Last updated: 06.04.2026, 03:41

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Large language models for thematic analysis in healthcare research: A blinded mixed-methods comparison with human analysts

2026 · 0 citations · PLOS Digital Health · Open Access
Open full text at the publisher

Citations: 0 · Authors: 7 · Year: 2026

Abstract

Large language models (LLMs) are increasingly used for qualitative thematic analysis, yet evidence on their performance in analysing focus-group data, where polyvocality and context complicate coding, remains limited. Given this growing role, methodological frameworks are needed that enable systematic, metric-based comparisons between human and model-based analyses. We conducted a blinded mixed-methods comparison of two general-purpose LLMs (ChatGPT-5 and Claude 4 Sonnet), an LLM-based qualitative coding application (QualiGPT), and blinded human analysts on an in-person focus-group transcript informing an AI-enabled digital health proposal. We evaluated deductive coding using a 10-code, 6-theme codebook against an expert consensus adjudication; inductive coding with a structured Likert-scale comparison to a reference-standard set of inductive themes generated by expert consensus; and manual quote verification of LLM segments to define LLM hallucination (evidence absent or non-supportive) and error rate (including partial matches and speaker-coded segments). During deductive coding against the expert consensus adjudication, large language models yielded a mean agreement of 93.5% (95% CI 92.5-94.5) with κ = 0.34 (95% CI 0.26-0.40); blinded human coders achieved 92.7% (95% CI 91.6-93.9) agreement with κ = 0.34 (95% CI 0.26-0.41). Mean Gwet's AC1 was 0.92 (95% CI 0.90-0.93) for the blinded human analysis and 0.93 (95% CI 0.92-0.94) for the LLM-assisted deductive analysis, reflecting high agreement despite the low overall code prevalence (7.8%, SD = 3.2%). Only one model achieved non-inferiority in inductive analysis of the transcript (p = 0.043). The strict hallucination rate in inductive analysis was 1.2% (SD = 2.1%). LLMs were non-inferior to human analysts for deductive coding of the focus-group data, with variable performance in inductive analysis. Low hallucination rates but substantial comprehensive error rates indicate that LLMs can augment qualitative analysis but require human verification.
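The gap between the low κ (0.34) and the high Gwet's AC1 (0.92-0.93) at roughly 8% code prevalence reflects the well-known kappa paradox: when a code is rare, Cohen's chance-agreement estimate is dominated by the majority category and depresses κ even when raw agreement is high, whereas Gwet's AC1 is robust to skewed prevalence. A minimal Python sketch illustrates the effect for two raters assigning one binary code across 100 transcript segments; the rater labels are hypothetical and chosen only to show the pattern, not to reproduce the paper's exact values, and the functions are ad hoc rather than the authors' code.

    # Hypothetical illustration of the kappa paradox: high raw agreement,
    # low Cohen's kappa, high Gwet's AC1 when a code is rare.

    def percent_agreement(a, b):
        # Proportion of segments where both raters assign the same label.
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def cohens_kappa(a, b):
        n = len(a)
        p_o = percent_agreement(a, b)
        # Chance agreement: product of each rater's marginal probabilities.
        p_e = sum((a.count(k) / n) * (b.count(k) / n) for k in set(a) | set(b))
        return (p_o - p_e) / (1 - p_e)

    def gwets_ac1(a, b):
        n = len(a)
        cats = set(a) | set(b)
        p_o = percent_agreement(a, b)
        # Gwet's chance agreement uses the mean marginal pi_k per category:
        # p_e = (1 / (K - 1)) * sum_k pi_k * (1 - pi_k)
        p_e = sum(
            pi * (1 - pi)
            for pi in ((a.count(k) + b.count(k)) / (2 * n) for k in cats)
        ) / (len(cats) - 1)
        return (p_o - p_e) / (1 - p_e)

    # 100 segments, the code present in ~8%, with 10 disagreements.
    rater1 = [1] * 8 + [0] * 92
    rater2 = [1] * 3 + [0] * 5 + [1] * 5 + [0] * 87

    print(f"raw agreement: {percent_agreement(rater1, rater2):.2f}")  # 0.90
    print(f"Cohen's kappa: {cohens_kappa(rater1, rater2):.2f}")       # ~0.32
    print(f"Gwet's AC1:    {gwets_ac1(rater1, rater2):.2f}")          # ~0.88

On this toy data, raw agreement is 0.90 while κ lands near 0.3 and AC1 near 0.9: the same divergence pattern the abstract reports, driven by the chance-agreement correction rather than by coder quality.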

Topics

Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods · Machine Learning in Healthcare