OpenAlex · Updated hourly · Last updated: 21.04.2026, 12:17

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

AI-Assisted Statistical Review: Could It Have Averted Retractions? A Case-Based Perspective From Immunology

2025 · 0 citations · Allergy
Open full text at the publisher

0 citations · 1 author · 2025

Abstract

According to a report published in Nature [1], the number of retracted biomedical publications has increased dramatically over the past two decades, with over 10,000 articles withdrawn in 2023 alone, the highest figure ever recorded. While retractions are an important part of scientific self-correction [2], many are driven by methodological flaws, particularly statistical misinterpretation, which can erode trust in science and compromise clinical practice, especially in immunology. Immunology research plays a central role in vaccine development, immunotherapies, and public health policy. Despite these stakes, statistical peer review remains rare: most biomedical journals apply it to only 1%–10% of submissions, and formal statistical reporting guidelines such as the Statistical Analyses and Methods in the Published Literature (SAMPL) guidelines are adopted by only a small minority [3, 4].

Recently, AI tools have been explored as a possible means of assisting in the detection of statistical issues in scientific manuscripts. They have been proposed as a way to support efforts to improve the accuracy, reproducibility, and transparency of research reporting. Integrating such tools into the peer review process could complement expert judgment and enhance the quality of statistical evaluation, although their use remains limited and subject to important methodological and practical constraints [5-8]. While competent human reviewers should identify common statistical errors such as the lack of a power analysis, failure to adjust for multiple comparisons, small sample sizes, and inadequate justification of test choices, statistical peer review remains infrequent in many biomedical journals. AI tools may therefore serve as valuable complementary aids to support reviewers and help maintain scientific rigor.

To explore whether artificial intelligence could assist in the detection of statistical flaws, a focused analysis was conducted using large language models (LLMs). Three biomedical articles addressing immunology-related topics were selected, each of which had been retracted due to statistical concerns [9-11]. These articles focused on chemotherapy-induced immunosuppression, inflammation-related metabolic regulation, and tumor–immune system interactions. The articles were selected non-randomly on the basis of publicly available retraction notices citing statistical errors as a primary reason for retraction, with additional consideration given to their relevance to immunology. The full text of each publication was submitted to two publicly available AI models, ChatGPT (GPT-4o, OpenAI) and Gemini Advanced 2.0 (Google DeepMind), and each was asked to simulate a statistical peer review by answering the same set of structured queries concerning statistical errors, test assumptions, and reporting clarity. These tools were selected for their public availability, broad language capabilities, and current prominence among general-purpose AI models capable of processing extended text inputs; both have also been widely discussed in the context of scientific and academic applications.
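As a purely illustrative sketch of this kind of setup, the snippet below shows how the same structured review queries could be posed programmatically to a general-purpose LLM. The query texts are those reported in the study; everything else, including the use of the OpenAI Python client, the "gpt-4o" model identifier, and the review_manuscript helper, is an assumption introduced here for illustration, since the study itself used the publicly available chat versions of the models.

```python
# Minimal sketch of a simulated statistical peer review via an LLM API.
# Assumption: the OpenAI Python client and the "gpt-4o" model identifier are
# used for illustration only; the study itself used the public chat interfaces.
from openai import OpenAI

# Structured queries as reported in the study.
REVIEW_QUERIES = [
    "Are there any statistical errors or weaknesses present in this study?",
    "Are the statistical methods appropriate for the data and study design?",
    "Have the assumptions of the applied statistical tests been addressed?",
    "Is the statistical reporting sufficiently clear and complete?",
]

def review_manuscript(full_text: str, model: str = "gpt-4o") -> dict:
    """Pose each structured query about one manuscript and collect the answers."""
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    answers = {}
    for query in REVIEW_QUERIES:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "You are simulating a statistical peer review of a biomedical manuscript."},
                {"role": "user",
                 "content": f"{query}\n\nManuscript text:\n{full_text}"},
            ],
        )
        answers[query] = response.choices[0].message.content
    return answers

if __name__ == "__main__":
    # Hypothetical input file holding the full text of one retracted article.
    with open("retracted_article.txt", encoding="utf-8") as fh:
        manuscript = fh.read()
    for question, answer in review_manuscript(manuscript).items():
        print(f"Q: {question}\nA: {answer}\n")
```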
To standardize the evaluation, both models were prompted with the same set of structured queries, including: “Are there any statistical errors or weaknesses present in this study?”, “Are the statistical methods appropriate for the data and study design?”, “Have the assumptions of the applied statistical tests been addressed?”, and “Is the statistical reporting sufficiently clear and complete?”

Both AI models identified multiple statistical issues in all three retracted papers. Recurring concerns included the absence of a power analysis, failure to correct for multiple comparisons, small sample sizes, and insufficient justification for the choice of statistical tests. Vague p-value reporting and unaddressed test assumptions were also flagged. For example, in the study on flaxseed supplementation and inflammation [9], the tools noted the unjustified use of parametric tests on a small sample. In the pilot trial involving oral bismuth [10], the AI outputs questioned subgroup comparisons and pointed out missing effect sizes. The evaluation of the everolimus trial [11] highlighted a lack of verification of the assumptions underlying the survival analysis and the unexplained use of a one-sided log-rank test.
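Two of the recurring issues flagged by the models, the absence of a power analysis and the failure to correct for multiple comparisons, are simple to illustrate. The sketch below shows what such checks look like in practice using statsmodels; the effect size, significance level, and p-values are placeholder values chosen for illustration, not data from the retracted articles or from this analysis.

```python
# Illustrative only: a generic power analysis and multiple-comparison correction,
# two of the checks the AI reviews flagged as missing. Effect size, alpha, and
# the example p-values are placeholders, not data from the retracted articles.
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.multitest import multipletests

# 1) Prospective power analysis: sample size per group for a two-sample t-test
#    with a medium effect (Cohen's d = 0.5), alpha = 0.05, and 80% power.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Required sample size per group: {n_per_group:.0f}")

# 2) Multiple-comparison correction: adjusting a family of p-values with Holm's
#    method instead of testing each raw p-value against 0.05 on its own.
raw_pvalues = [0.012, 0.034, 0.049, 0.21, 0.003]
reject, adjusted, _, _ = multipletests(raw_pvalues, alpha=0.05, method="holm")
for raw, adj, sig in zip(raw_pvalues, adjusted, reject):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f}, significant: {sig}")
```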
Despite these observations, several critical issues were not detected by either tool, including inconsistent formatting of results, missing information on the statistical software used, oversimplified abstracts, and missing or unclear descriptions of test allocation. In some cases, AI-generated feedback was not contextually appropriate, for instance recommending formal interaction testing in pilot studies with limited sample sizes. Conversely, some flagged issues, such as small sample size, may be contextually justified, for example in early-phase or exploratory studies. This highlights the importance of human expertise in interpreting methodological decisions within the appropriate research context. These limitations are consistent with earlier studies showing that, while AI models can perform basic statistical tasks, they frequently struggle with inferential reasoning and contextual judgment [5-8].

The AI tools detected several recurring statistical problems across the retracted articles, as summarized in Table 1, but important omissions remained, particularly in areas requiring nuanced methodological judgment. These findings indicate that although LLMs can contribute to identifying statistical shortcomings in biomedical manuscripts, they currently lack the reliability and precision needed to replace expert statistical review. Particularly in immunology, where study designs are often complex and findings may influence clinical or public health decisions, the role of trained human reviewers remains essential.

This study has several limitations. The analysis was based on a small, non-random sample of retracted articles, which limits the generalizability of the findings. In addition, the AI tools used, although publicly accessible and widely discussed, have not been specifically validated for statistical peer review, and their outputs may vary with prompt phrasing, model updates, or system limitations. Furthermore, because these models were trained on large corpora of published literature, they may have had prior exposure to the retracted articles included in this study, which could have influenced their ability to detect known statistical flaws. This potential bias highlights the need for future evaluations using blinded or novel datasets to assess the true performance and generalizability of AI tools in this context more accurately.

In conclusion, AI tools such as ChatGPT and Gemini Advanced 2.0 show promise in detecting statistical flaws in retracted articles, but they should be regarded as complementary aids rather than substitutes for expert oversight. With further refinement, they could support efforts to improve the transparency and quality of scientific publishing, especially in fields where methodological rigor is critical.

The author approved the final version of the manuscript as submitted and agreed to be accountable for all aspects of the work. The author declares no conflicts of interest. Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Related works

Authors

Institutions

Topics

Radiomics and Machine Learning in Medical Imaging · Artificial Intelligence in Healthcare and Education