OpenAlex · Updated hourly · Last updated: 12.03.2026, 14:52

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessing the reliability of large language models for reviewing AI research in cardiac electrophysiology using the EHRA AI in EP checklist

2026 · 0 citations · European Heart Journal - Digital Health · Open Access
Open full text at the publisher

Citations: 0
Authors: 4
Year: 2026

Abstract

Background: The growing potential of Artificial Intelligence (AI) in cardiac electrophysiology (EP) has led to an increasing number of publications exploring possible areas of AI application in clinical settings. To improve reader understanding and to standardize reporting of AI-related research, the EHRA AI checklist was recently published. For its validation, an expert panel applied studies from three EP domains to the checklist. The aim of this study was to explore whether AI could assist in analysing and evaluating scientific manuscripts.

Methods: The study was conducted in May 2025. We used the Large Language Model (LLM) ChatGPT (GPT-4o) by OpenAI to apply all studies from the EHRA AI manuscript to the checklist and compared the responses generated by the LLM with those provided by the expert panel. First, after ensuring that ChatGPT could accurately extract each item from the checklist (PDF file), we asked the LLM to create its own prompt, which we then refined further through prompt engineering. Second, each study from the EHRA AI manuscript, together with the checklist, was successively uploaded to ChatGPT. For each analysis, the prompt remained the same: (1) asking the LLM to act as a reviewer in clinical EP, (2) a short explanation of the two uploaded files, (3) a description of the task of deciding for each checklist item whether it is "applicable" or "not applicable", (4) a description of the desired output format, and (5) additional rules, such as not to change or remove any item from the list.

Results: All 29 items from the checklist were analysed using one prompt for each of the 55 manuscripts. Results of the initial LLM assessment compared to the initial expert assessment are shown in the Figure. Per manuscript, none of ChatGPT's results matched the human results completely, and the mean number of inconsistencies between human and LLM ratings was 7 ± 2.8. Per checklist item, only item 11, which refers to the description of the data type used in the study, was rated identically to the human reviewers for every manuscript. Per EHRA AI checklist item, the mean number of inconsistencies across all papers was 13 ± 9.0.

Conclusion: Standards for reporting and understanding AI in EP are important. Although LLMs can help with the evaluation and analysis of scientific manuscripts, there are currently still limitations in terms of reliability, meaning that human intelligence is still required to ensure compliance with these standards.

Figure: Comparison of human and LLM ratings.
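The per-manuscript inconsistency metric reported in the Results can be sketched as follows. This is a minimal illustrative reconstruction, not the study's actual analysis code; the function names and the toy "a"/"n" labels (for "applicable"/"not applicable") are assumptions.

```python
def count_inconsistencies(human, llm):
    """Count checklist items where the LLM rating differs from the expert rating."""
    assert len(human) == len(llm)
    return sum(h != l for h, l in zip(human, llm))


def mean_inconsistencies_per_manuscript(human_ratings, llm_ratings):
    """Mean number of mismatched items across manuscripts.

    Each entry is one manuscript's list of per-item labels
    (in the study: 29 EHRA AI checklist items per manuscript).
    """
    counts = [count_inconsistencies(h, l) for h, l in zip(human_ratings, llm_ratings)]
    return sum(counts) / len(counts)


# Toy example with 3 manuscripts and 4 checklist items ("a" = applicable, "n" = not):
human = [["a", "a", "n", "a"], ["n", "n", "a", "a"], ["a", "n", "n", "a"]]
llm = [["a", "n", "n", "a"], ["n", "n", "a", "n"], ["a", "n", "n", "a"]]
print(mean_inconsistencies_per_manuscript(human, llm))
```

The same helper applied column-wise (per checklist item across all 55 manuscripts) would yield the per-item inconsistency counts summarized as 13 ± 9.0 in the Results.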


Topics

Artificial Intelligence in Healthcare and Education · Cardiac electrophysiology and arrhythmias · ECG Monitoring and Analysis