This is an overview page with metadata for this scientific article. The full article is available from the publisher.
When AI Meets the FDA: An Evaluation of Large Language Models Performance in Regulatory and Clinical Trial Data Extraction, Synthesis, and Analysis
Citations: 0 · Authors: 8 · Year: 2025
Abstract
Introduction: Clinical and population decision-making relies on the systematic evaluation of extensive regulatory evidence. FDA drug reviews provide detailed information on clinical trial design, enrollment criteria, sample size, randomization, comparators, endpoints, and indications; however, extracting these data is resource-intensive and time-consuming. Generative artificial intelligence large language models (LLMs) may accelerate the extraction and synthesis of such information. This study compares the performance of three LLMs, ChatGPT-4o, Gemini 2.5 Pro, and DeepSeek R1, in extracting and synthesizing regulatory and clinical information, using antibiotics approved for complicated urinary tract infections (cUTIs) between 2010 and 2025 as a case study.

Methods: The LLMs were evaluated using general (short, direct) and detailed (structured, guidance-referencing) prompts across five domains: accuracy (precision and recall), explanation quality, error type (hallucination rate, misclassification, and omission), efficiency (response time, correct answers per second, and seconds per correct answer), and consistency across duplicate runs. Two investigators independently reviewed outputs against FDA guidance, resolving discrepancies by consensus. Statistical analyses included χ², Wilcoxon, and Kruskal–Wallis tests with false discovery rate correction.

Results: Among 324 responses, accuracy differed significantly across models (χ², p < 0.001), with Gemini 2.5 Pro achieving the highest accuracy (66.7%), followed by ChatGPT-4o (51.9%) and DeepSeek R1 (37.0%). General prompts outperformed detailed prompts (59.3% vs. 44.4%; p = 0.011). Gemini 2.5 Pro produced the highest-quality explanations and the most consistent outputs, while ChatGPT-4o had the shortest response times and the highest efficiency. Hallucination was the most frequent error type across all models.
Conclusion: LLMs showed variable capability in extracting regulatory and clinical trial information. Gemini 2.5 Pro showed the strongest overall performance, ChatGPT-4o was faster but less accurate, and DeepSeek R1 underperformed across most domains. These findings highlight both the promise and the limitations of LLMs in regulatory science and support their complementary use alongside human review to streamline evidence synthesis.

Author Summary: Our research addresses a critical question in artificial intelligence for healthcare: how well do Generative Artificial Intelligence (GenAI) tools extract and synthesize regulatory and clinical information to inform decision-making? We assessed the performance of ChatGPT-4o, Gemini 2.5 Pro, and DeepSeek R1 in extracting and synthesizing information from regulatory documents and clinical trial data for all FDA-approved antibiotics for the treatment of complicated urinary tract infections, comparing LLM outputs directly with the original data sources. We assessed the models' performance using both broad and detailed prompts across several areas: accuracy (precision and recall), explanation quality, error type (hallucination, misclassification, and omission), efficiency and speed (response time, correct answers per second, and seconds per correct answer), and consistency of responses across repeated runs. The results suggest that while the models were generally fast and efficient at extracting large volumes of information, they also produced errors and omissions that could limit their reliability. These findings highlight both the promise and the current limitations of GenAI, underscoring its potential value as a human-supervised tool for safely supporting regulatory science and clinical decision-making.
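The efficiency metrics reported in the abstract (accuracy, correct answers per second, seconds per correct answer) are simple ratios over the evaluated responses. The sketch below illustrates how they combine, using hypothetical per-model counts and response times chosen only to echo the reported accuracies (324 responses, i.e. 108 per model); the timing figures are invented for illustration and do not come from the study.

```python
from dataclasses import dataclass

@dataclass
class ModelRun:
    """One model's aggregate results over the prompt set (illustrative numbers only)."""
    name: str
    correct: int          # responses judged correct against FDA guidance
    total: int            # total responses evaluated
    total_seconds: float  # summed response time (hypothetical)

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def correct_per_second(self) -> float:
        return self.correct / self.total_seconds

    @property
    def seconds_per_correct(self) -> float:
        return self.total_seconds / self.correct

# Hypothetical counts loosely matching the reported accuracies (108 responses per model):
# 72/108 ≈ 66.7%, 56/108 ≈ 51.9%, 40/108 ≈ 37.0%. Timings are made up.
runs = [
    ModelRun("Gemini 2.5 Pro", 72, 108, 1500.0),
    ModelRun("ChatGPT-4o", 56, 108, 600.0),
    ModelRun("DeepSeek R1", 40, 108, 2000.0),
]

for r in runs:
    print(f"{r.name}: accuracy={r.accuracy:.1%}, "
          f"correct/s={r.correct_per_second:.3f}, s/correct={r.seconds_per_correct:.1f}")
```

Note how the two efficiency measures can diverge from accuracy: a faster but less accurate model (here, the hypothetical ChatGPT-4o row) can still yield the most correct answers per second, matching the abstract's observation.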