This is an overview page with metadata for this scientific article. The full article is available from the publisher.
When AI Meets the FDA: An Evaluation of Large Language Models Performance in Regulatory and Clinical Trial Data Extraction, Synthesis, and Analysis
Citations: 0 · Authors: 8 · Year: 2025
Abstract
Introduction: Clinical and population decision-making relies on the systematic evaluation of extensive regulatory evidence. FDA drug reviews provide detailed information on clinical trial design, enrollment criteria, sample size, randomization, comparators, endpoints, and indications; however, extracting these data is resource-intensive and time-consuming. Generative artificial intelligence large language models (LLMs) may accelerate the extraction and synthesis of such information. This study compares the performance of three LLMs, ChatGPT-4o, Gemini 2.5 Pro, and DeepSeek R1, in extracting and synthesizing regulatory and clinical information, using antibiotics approved for complicated urinary tract infections (cUTIs) between 2010 and 2025 as a case study.

Methods: The LLMs were evaluated using general (short, direct) and detailed (structured, guidance-referencing) prompts across five domains: accuracy (precision and recall), explanation quality, error type (hallucination rate, misclassification, and omission), efficiency (response time, correct answers per second, and seconds per correct answer), and consistency across duplicate runs. Two investigators independently reviewed outputs against FDA guidance, resolving discrepancies by consensus. Statistical analyses included χ², Wilcoxon, and Kruskal–Wallis tests with false discovery rate correction.

Results: Among 324 responses, accuracy differed significantly across models (χ², p < 0.001), with Gemini 2.5 Pro achieving the highest accuracy (66.7%), followed by ChatGPT-4o (51.9%) and DeepSeek R1 (37.0%). General prompts outperformed detailed prompts (59.3% vs. 44.4%; p = 0.011). Gemini 2.5 Pro produced the highest-quality explanations and the most consistent outputs, while ChatGPT-4o had the shortest response times and the highest efficiency. Hallucination was the most frequent error type across all models.
Conclusion: LLMs showed variable capability in extracting regulatory and clinical trial information. Gemini 2.5 Pro showed the strongest overall performance, ChatGPT-4o was faster but less accurate, and DeepSeek R1 underperformed across most domains. These findings highlight both the promise and the limitations of LLMs in regulatory science and support their complementary use alongside human review to streamline evidence synthesis.

Author Summary: Our research addresses a critical question in artificial intelligence for healthcare: how well do Generative Artificial Intelligence (GenAI) tools extract and synthesize regulatory and clinical information to inform decision-making? We assessed the performance of ChatGPT-4o, Gemini 2.5 Pro, and DeepSeek R1 in extracting and synthesizing information from regulatory documents and clinical trial data for all FDA-approved antibiotics for the treatment of complicated urinary tract infections, comparing LLM outputs directly with the original data sources. We assessed the models' performance using both broad and detailed prompts across several areas: accuracy (precision and recall), explanation quality, error type (hallucination, misclassification, and omission), efficiency and speed (response time, correct answers per second, and seconds per correct answer), and consistency of responses across repeated runs. The results suggest that while the models were generally fast and efficient at extracting large volumes of information, they also produced errors and omissions that could limit their reliability. These findings highlight both the promise and the current limitations of GenAI, underscoring its potential value as a human-supervised tool for safely supporting regulatory science and clinical decision-making.
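The efficiency metrics reported in the abstract (accuracy, correct answers per second, seconds per correct answer) are simple ratios over the evaluated responses. The sketch below illustrates how they combine, using hypothetical per-model counts and response times chosen only to echo the reported accuracies (324 responses, i.e. 108 per model); the timing figures are invented for illustration and do not come from the study.

```python
from dataclasses import dataclass

@dataclass
class ModelRun:
    """One model's aggregate results over the prompt set (illustrative numbers only)."""
    name: str
    correct: int          # responses judged correct against FDA guidance
    total: int            # total responses evaluated
    total_seconds: float  # summed response time (hypothetical)

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def correct_per_second(self) -> float:
        return self.correct / self.total_seconds

    @property
    def seconds_per_correct(self) -> float:
        return self.total_seconds / self.correct

# Hypothetical counts loosely matching the reported accuracies (108 responses per model):
# 72/108 ≈ 66.7%, 56/108 ≈ 51.9%, 40/108 ≈ 37.0%. Timings are made up.
runs = [
    ModelRun("Gemini 2.5 Pro", 72, 108, 1500.0),
    ModelRun("ChatGPT-4o", 56, 108, 600.0),
    ModelRun("DeepSeek R1", 40, 108, 2000.0),
]

for r in runs:
    print(f"{r.name}: accuracy={r.accuracy:.1%}, "
          f"correct/s={r.correct_per_second:.3f}, s/correct={r.seconds_per_correct:.1f}")
```

Note how the two efficiency measures can diverge from accuracy: a faster but less accurate model (here, the hypothetical ChatGPT-4o row) can still yield the most correct answers per second, matching the abstract's observation.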