This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Evaluation of the performance of fully artificial intelligence generated vs human-authored abstracts at a large national cardiological congress
Citations: 0
Authors: 5
Year: 2025
Abstract
Background: The increasing use of artificial intelligence (AI)-generated text in scientific writing raises concerns about the ability of peer reviewers to distinguish between authentic research and fabricated content.

Purpose: We aimed to investigate how fully AI-generated abstracts created by a large language model (ChatGPT) would perform in a blinded real-world review process.

Methods: We targeted approximately 10% of all submitted abstracts to be AI-generated. The number of abstracts historically submitted to the German Cardiac Society Congress in each subcategory was analyzed to ensure proportional representation. ChatGPT-4o was provided with simple prompts to generate fabricated abstracts, specifying word count, structure, and suitable categories. The AI-generated content was not altered by the authors in wording or data. Additionally, we used ChatGPT's integrated Python tool to fabricate figures for 50% of the AI-generated abstracts. These abstracts were submitted alongside genuine ones. All reviewers, except for the study authors and the German Cardiac Society board, were blinded to the AI-generated submissions. After evaluation, all AI-generated abstracts were retracted immediately to prevent any influence on the congress proceedings. The primary outcome was the abstract rating on a scale from 1 (lowest) to 5 (highest).

Results: Across 19 categories, a total of 1,348 abstracts were submitted, of which 136 (10%) were AI-generated; these were created within eight working hours. Overall, there was no significant difference in rating between human-authored and AI-generated abstracts (human: 3.3±0.5, AI: 3.3±0.5, p=0.85, Figure 1A). AI-generated abstracts with fabricated figures did not perform significantly differently from those without figures (Figure 1B). Performance varied by category: AI-generated abstracts were rated lower than human-authored ones only in rhythmology (p<0.001, Figure 1D), whereas they were rated significantly higher in cardiovascular imaging, intensive care medicine, and psychocardiology (p=0.011, p=0.039, and p<0.001, respectively). In six categories, AI-generated abstracts were the highest rated, in three categories the second highest rated, and in four the third highest rated (Figure 1, Table). Two AI-generated abstracts were flagged by reviewers as suspicious due to alleged prior publication, but no evidence of comparable studies was found.

Conclusion: AI-generated abstracts performed comparably to human-authored abstracts, with no overall inferiority but also no superiority. AI-generated abstracts ranked as the highest-rated submissions in six out of 19 categories. While they did not surpass human abstracts quantitatively, their high ratings in multiple categories highlight their potential to influence scientific discourse. AI-generated content presents a challenge to research integrity, emphasizing the need for tools to identify AI-fabricated abstracts in the future.
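The abstract reports the primary comparison as means ± SD with a p-value (human 3.3±0.5 vs AI 3.3±0.5, p=0.85) but does not state which test was used. As an illustration only, a Welch t-statistic can be computed from such summary statistics; the group sizes below (1,212 human = 1,348 − 136 AI) and the assumption of a two-sample t-test are ours, not the paper's, and the reported means are rounded, so this sketch cannot reproduce the exact p=0.85.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t-statistic for two independent samples,
    computed from summary statistics (mean, SD, n)."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (m1 - m2) / se

# Rounded summary statistics from the abstract (group sizes inferred):
# human-authored: 3.3 +/- 0.5, n = 1348 - 136 = 1212
# AI-generated:   3.3 +/- 0.5, n = 136
t = welch_t(3.3, 0.5, 1212, 3.3, 0.5, 136)
print(t)  # rounded means are identical, so t = 0.0 here
```

With identical (rounded) means the statistic is exactly zero; the unrounded data behind the paper's p=0.85 would yield a small nonzero t.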
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,250 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,109 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,482 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,434 citations