This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Evaluating the Quality of AI-generated outputs for textual analysis with GRACE (Grounded Review and Assessment of Computational Evidence): A Comparative Evaluation of Ten Methods including Topic Modelling, Deep Learning, and Large Language Models (Preprint)

2026 · 0 citations · Open Access

Abstract

<sec> <title>BACKGROUND</title> The rapid growth of digital technologies has generated large volumes of free-text data across healthcare, public health, and social research. These data contain contextualised accounts of lived experience that are often absent from quantitative measures. Despite their value, they remain underused because manual qualitative analysis is traditionally designed for in-depth work on smaller numbers of longer transcripts and is difficult to scale. Computational methods, including topic modelling and large language models, are increasingly promoted as efficient solutions. However, concerns persist regarding interpretability, bias, hallucinations, and loss of contextual depth. Critically, there is no established human-centred framework for evaluating the quality of machine-generated outputs, despite qualitative research’s longstanding emphasis on reflexivity, nuance, and meaning-making. </sec>
<sec> <title>OBJECTIVE</title> 1) To develop an AI evaluation framework for assessing machine-generated outputs; 2) to evaluate different machine learning approaches, including classic natural language processing (latent Dirichlet allocation, LDA), a deep learning method (BERTopic), and more recent generative AI (LLaMA-3, Copilot, DeepSeek). </sec>
<sec> <title>METHODS</title> We developed and applied a human-centred evaluation framework, GRACE (Grounded Review and Assessment of Computational Evidence), to assess the quality of outputs that machine learning approaches produce from free-text data. GRACE was derived from established qualitative appraisal tools and operationalised four core indicators (interpretability, actionability, nuance, and redundancy) using structured scoring and reflexive consensus. We compared classic probabilistic topic modelling (LDA); a deep learning, embedding-based approach (BERTopic); and three large language models (LLMs: LLaMA-3, Copilot, DeepSeek), used alone or in combination with prior structural topic modelling (STM). All methods were applied to the same corpus (n = 1,044 free-text responses). LLM prompting was iteratively refined, and a single-shot, STM-based configuration was selected for final evaluation because it produced fewer hallucinations. All outputs were analysed within a Machine-Assisted Topic Analysis (MATA) workflow. A rapid manual thematic analysis of a 15% subsample (n = 152) served as a pragmatic comparator. </sec>
<sec> <title>RESULTS</title> Model outputs were variable, with different natural language processing (NLP) methods producing different results from the same dataset. GRACE evaluation indicated that LDA achieved the highest overall mean score (2.6/5), followed by BERTopic and topic modelling plus Copilot (both 2.5), topic modelling plus LLaMA-3 (2.2), and topic modelling plus DeepSeek (1.9). LDA generated broader conceptual patterns requiring interpretive refinement, while BERTopic produced narrower, more descriptive clusters with thematic overlap. LLM-only outputs were very poor. Combining topic modelling with LLMs performed better: the combined outputs were well structured but often superficial and repetitive. </sec>
<sec> <title>CONCLUSIONS</title> Computational models produced different interpretations of the same dataset, and performance did not align with technical complexity. Large language models were not suitable for thematic analysis, especially when applied to raw data, where they generated generalised and sometimes inaccurate outputs. Classical probabilistic modelling, particularly STM within a Machine-Assisted Topic Analysis (MATA) workflow, provided the most reliable foundation, but still required human interpretation. We argue that the key issue is not whether a model “works,” but what insights it produces and whether these support meaningful, contextually grounded conclusions. GRACE offers a simple, human-centred framework to support this assessment. We recommend the use of a structured MATA approach. </sec>
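
The abstract reports one overall mean GRACE score per method on a 5-point scale, but does not spell out how the four indicators are combined. The following minimal sketch shows one plausible aggregation. It assumes that each indicator is rated from 1 to 5, that ratings are already oriented so that higher is better (including redundancy), and that the overall score is an unweighted mean. The function names and the placeholder ratings are hypothetical and are not taken from the study.

```python
# Sketch of one possible GRACE aggregation; NOT the authors' scoring procedure.
# Assumptions: 1-5 ratings, higher is better, unweighted mean of four indicators.

INDICATORS = ("interpretability", "actionability", "nuance", "redundancy")


def overall_score(ratings: dict[str, int]) -> float:
    """Return the unweighted mean of the four indicator ratings, to one decimal."""
    for name in INDICATORS:
        if name not in ratings:
            raise ValueError(f"missing indicator: {name}")
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    total = sum(ratings[name] for name in INDICATORS)
    return round(total / len(INDICATORS), 1)


def rank_methods(all_ratings: dict[str, dict[str, int]]) -> list[tuple[str, float]]:
    """Score every method and sort from highest to lowest overall score."""
    scores = {method: overall_score(r) for method, r in all_ratings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder consensus ratings for two unnamed methods (illustrative only).
example_ratings = {
    "method_a": {"interpretability": 3, "actionability": 2, "nuance": 3, "redundancy": 2},
    "method_b": {"interpretability": 2, "actionability": 2, "nuance": 1, "redundancy": 3},
}

if __name__ == "__main__":
    for method, score in rank_methods(example_ratings):
        print(f"{method}: {score}/5")
```

A single mean is easy to compare across methods, but it hides which indicator drives a low score, so the per-indicator ratings are worth reporting alongside it.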

Topics

Artificial Intelligence in Healthcare and Education; Computational and Text Analysis Methods; Meta-analysis and systematic reviews
